
Cross-Modal Hash Retrieval Based on Deep Learning


Abstract:


The exponential growth of multimedia data increasingly demands retrieval technology that works across data modalities, such as searching for videos with images or for sounds with text. Cross-modal retrieval aims to uncover the intrinsic semantic connections between modalities, so that information in one modality can be retrieved from data in another. However, because feature extraction is difficult and data annotation is costly, cross-modal learning is hard and retrieval efficiency is low. Deep learning methods have demonstrated powerful capabilities in data and image processing, bringing new possibilities for solving these problems. Hashing methods, meanwhile, are widely used in multimodal retrieval for their low storage cost and fast retrieval speed. This article summarizes the main requirements of cross-modal retrieval tasks, reviews mainstream deep-learning-based methods for cross-modal hash retrieval, and finally lists open problems and possible future research directions in the field, providing a meaningful reference for future researchers.
Date of Conference: 22-23 December 2024
Date Added to IEEE Xplore: 27 January 2025
Conference Location: Indore, India

I. Introduction

In the era of information explosion, massive amounts of multimedia data are generated and grow exponentially, spanning images, videos, audio, and text. This information is heterogeneous in modality, diverse in form, and scattered in distribution. Text, video, audio, and images belong to different modalities, yet data of different modalities can carry the same underlying semantics. It is therefore necessary to retrieve the same information across its different modal forms, and the trend toward cross-modal retrieval cannot be ignored. However, traditional cross-modal retrieval faces several undeniable problems, described below.

Firstly, feature extraction is difficult. In multimodal retrieval tasks, researchers traditionally represented images with hand-crafted visual features, such as color histograms or texture descriptors, and text with features based on word-frequency statistics. Such features cannot capture deep-level image information, while text is represented through linguistic semantics; the difference between the two representations is huge, producing a semantic gap that keeps cross-modal retrieval accuracy low.
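To make the two hand-crafted representations concrete, the following is a minimal sketch (not from the paper) of the kinds of features described above: a per-channel color histogram for an image and a bag-of-words count vector for text. The bin count, the toy image, and the small vocabulary are all illustrative assumptions.

```python
import numpy as np
from collections import Counter

def color_histogram(image, bins=8):
    """Quantize each RGB channel into `bins` buckets and count pixels.
    `image` is an (H, W, 3) uint8 array; returns a length 3*bins vector."""
    hist = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
            for c in range(3)]
    feat = np.concatenate(hist).astype(float)
    return feat / feat.sum()  # normalize so images of any size are comparable

def term_frequency(text, vocabulary):
    """Represent text by the count of each vocabulary word (bag of words)."""
    counts = Counter(text.lower().split())
    return np.array([counts[w] for w in vocabulary], dtype=float)

image = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
img_feat = color_histogram(image)  # 24-dimensional visual feature
txt_feat = term_frequency("a dog runs on the beach", ["dog", "cat", "beach"])
```

The two vectors live in unrelated spaces of different dimensionality, which is exactly the semantic gap the paragraph describes: nothing in a pixel histogram aligns with a word count by construction.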

Secondly, learning cross-modal associations is difficult. Early cross-modal retrieval methods often relied on simple linear mappings or distance-based metrics, which cannot establish the complex mapping relationships between modalities. Models built on these methods have limited expressive power and cannot effectively establish semantic associations between different modalities, making it difficult to capture complex cross-modal semantic relationships.
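A minimal sketch of such a linear approach, under assumed toy data: fit a single matrix by least squares that projects image features into the text feature space, then rank texts by Euclidean distance to the projected query. The feature dimensions and sample count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy paired training data: 100 image features (24-d) and text features (10-d)
X_img = rng.normal(size=(100, 24))
X_txt = rng.normal(size=(100, 10))

# One linear map W minimizing ||X_img @ W - X_txt||^2 (ordinary least squares)
W, *_ = np.linalg.lstsq(X_img, X_txt, rcond=None)

def retrieve(query_img, gallery_txt, k=5):
    """Project an image feature into the text space; rank texts by distance."""
    q = query_img @ W
    dists = np.linalg.norm(gallery_txt - q, axis=1)
    return np.argsort(dists)[:k]

top = retrieve(X_img[0], X_txt)
```

The limitation the paragraph points to is visible in the model class itself: a single matrix `W` can only express a linear relationship between the modalities, so any nonlinear semantic correspondence is unrepresentable regardless of how `W` is fit.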

Thirdly, the cost of data annotation is high. Early cross-modal retrieval usually relies on supervised learning over annotated datasets, but annotating data at large scale is unavoidably expensive, and insufficient annotation degrades model performance. These issues are especially evident because annotating images and text across modal domains is harder and more complex.

Fourthly, the efficiency of cross-modal retrieval is low. Early methods usually represented data of different modalities with high-dimensional feature vectors; computing with such vectors over large-scale datasets is extremely time-consuming. Meanwhile, the indexing techniques of traditional methods are inefficient, and the time and memory required by queries further increase the overall cost.
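This efficiency problem motivates the hashing methods the abstract mentions: once items are encoded as short binary codes, a query reduces to XOR plus a popcount per item instead of a high-dimensional float comparison. The sketch below assumes 64-bit codes and a randomly generated gallery for illustration; how the codes are learned is the subject of the surveyed methods.

```python
import numpy as np

rng = np.random.default_rng(1)

# Gallery of 1000 items, each a 64-bit binary hash packed into one uint64
gallery = rng.integers(0, 2**63, size=1000, dtype=np.uint64)

# A near-duplicate of item 42: its code with two bits flipped
query = gallery[42] ^ np.uint64(0b101)

def hamming_search(query, gallery, k=3):
    """Rank gallery codes by Hamming distance using XOR + popcount."""
    xor = np.bitwise_xor(gallery, query)
    dists = np.array([bin(int(v)).count("1") for v in xor])
    return np.argsort(dists)[:k], dists

top, dists = hamming_search(query, gallery)
```

Each 64-bit code occupies 8 bytes versus kilobytes for a high-dimensional float vector, which is the low-storage, fast-retrieval advantage cited above.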

Finally, datasets and evaluation criteria are lacking. Early cross-modal retrieval research was based on relatively small datasets, with no unified standard for evaluating retrieval performance, so researchers used different datasets and evaluation methods. This has hindered the practical application and comparison of methods, limiting the improvement and development of cross-modal retrieval techniques.

