I. Introduction
In the era of information explosion, massive amounts of multimedia data, including images, video, audio, and text, are being generated and are growing exponentially. This information is heterogeneous in modality, diverse in form, and scattered in distribution. Data of different modalities can carry the same underlying semantics: a text passage, a video clip, an audio recording, and an image may all describe the same piece of information, yet each modality presents it through a different kind of information source. Retrieving the same information across these different forms, i.e., cross-modal retrieval, has therefore become an unavoidable development trend. However, traditional cross-modal retrieval faces several undeniable problems, mainly manifested in the following aspects.
Firstly, feature extraction is difficult. In cross-modal retrieval tasks, researchers usually first represent images with visual features obtained from manually designed descriptors, such as color histograms or texture statistics, while text is represented with features based on word-frequency statistics. Such hand-crafted features rarely capture deep-level image information, and text is expressed through linguistic semantics. The difference between the two kinds of representation is therefore large, i.e., there is a semantic gap, and the accuracy of cross-modal retrieval is low as a result.
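To make the semantic gap concrete, the following is a minimal sketch of such hand-crafted features, assuming NumPy and OpenCV are available; the function names and parameters are illustrative rather than taken from any specific retrieval system.

```python
# Minimal sketch of hand-crafted cross-modal features (illustrative only).
import cv2
import numpy as np
from collections import Counter

def color_histogram(image_path, bins=8):
    """Represent an image by a flattened, normalized RGB color histogram."""
    img = cv2.imread(image_path)                      # BGR image, shape (H, W, 3)
    hist = cv2.calcHist([img], [0, 1, 2], None,
                        [bins, bins, bins], [0, 256] * 3)
    return cv2.normalize(hist, hist).flatten()        # 512-dim vector for bins=8

def word_frequency(text, vocabulary):
    """Represent a sentence by raw word-frequency counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return np.array([counts[w] for w in vocabulary], dtype=float)

# The two vectors live in unrelated spaces (color statistics vs. word counts),
# so they cannot be compared directly -- the semantic gap in miniature.
```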
Secondly, learning cross-modal associations is difficult. Early cross-modal retrieval methods often relied on simple linear mappings or distance-based metrics, which cannot establish complex mapping relationships between modalities. Models built on these methods have limited expressive power and cannot effectively align the semantics of different modalities, making it difficult to capture the complex semantic relationships between them.
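The sketch below illustrates this kind of early approach under stated assumptions: a single linear projection per modality into a shared space, followed by a cosine similarity. The projection matrices are random placeholders standing in for ones learned by, e.g., canonical correlation analysis; dimensions and names are hypothetical.

```python
# Minimal sketch of an early linear-mapping approach (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d_img, d_txt, d_common = 512, 300, 64

W_img = rng.standard_normal((d_img, d_common))    # would be learned in a real system
W_txt = rng.standard_normal((d_txt, d_common))

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

img_feat = rng.standard_normal(d_img)             # e.g., a color-histogram feature
txt_feat = rng.standard_normal(d_txt)             # e.g., a word-frequency feature

# Each modality is mapped by one fixed linear transform, so only linear
# correlations between the modalities can be captured by the similarity score.
score = cosine_similarity(img_feat @ W_img, txt_feat @ W_txt)
```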
Thirdly, the cost of data annotation is high. Early cross-modal retrieval usually relies on supervised learning and therefore on annotated datasets. Annotating data for large-scale training inevitably incurs a high cost, and insufficient annotation degrades model performance. These issues are particularly pronounced in the cross-modal setting, because jointly annotating images and text is more difficult and more complex than annotating a single modality.
Fourthly, the efficiency of cross-modal retrieval is low. Early methods usually represent data of different modalities with high-dimensional feature vectors, and computing with such vectors on large-scale datasets is extremely time-consuming. At the same time, the indexing techniques of traditional methods are inefficient, so the time and memory required by each query further increase the overall cost.
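As a rough illustration of why this is costly, the sketch below performs a brute-force linear scan over real-valued features, assuming NumPy; the gallery size and dimensionality are illustrative.

```python
# Minimal sketch of brute-force retrieval over high-dimensional features.
import numpy as np

rng = np.random.default_rng(0)
n_items, dim = 50_000, 1024                       # illustrative gallery size / dimension

gallery = rng.standard_normal((n_items, dim)).astype(np.float32)
query = rng.standard_normal(dim).astype(np.float32)

# Without an index, every query must touch all n_items * dim floats,
# so time and memory grow linearly with both the gallery size and the dimension.
scores = gallery @ query                          # one dot product per gallery item
top10 = np.argsort(-scores)[:10]
```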
Finally, datasets and evaluation criteria are lacking. Early cross-modal retrieval research was mainly based on small datasets, and there was no unified standard for evaluating retrieval performance; different studies used different datasets and evaluation methods. This hinders the practical application and fair comparison of methods, limiting the improvement and development of cross-modal retrieval techniques.