A Dual-Mode Similarity Search Accelerator based on Embedding Compression for Online Cross-Modal Image-Text Retrieval

Abstract:

Image-text retrieval (ITR), which identifies the relevant images for a given text query or vice versa, is a fundamental task in emerging vision-and-language machine learning applications. Recently, the cross-modal approach, which extracts image and text features in separate reasoning pipelines but performs the similarity search on the same embedding representation, has been proposed for real-time ITR systems. However, the similarity search, which finds the most relevant data among huge data embeddings for a given query, becomes the bottleneck of the ITR system. In this paper, we propose a dual-mode similarity search accelerator that resolves this computational hurdle for online image-to-text and text-to-image retrieval services. We propose an embedding compression scheme that removes the sparsity in the text embeddings, which in turn eliminates the time-consuming masking operations in the later processing pipeline. Combined with data quantization from 32-bit floating-point to 8-bit integer, this reduces the target dataset size by 95.1% with less than 0.1% accuracy loss for 1024-dimensional embedding features. In addition, we propose a streamlined similarity search data flow for both query types, which minimizes the required memory bandwidth through maximal data reuse. With the optimized data flow, the query and data embeddings are guaranteed to be fetched only once from the external memory. Based on the proposed data representation and flow, we design a scalable similarity search accelerator that includes multiple ITR kernels. Each ITR kernel has a modular design, composed of a separate memory access module and a computing module. The computing module supports pipelined operation of the four similarity search tasks: dot product calculation, data reordering, partial score aggregation, and ranking. We double the number of processing operations in the computing module with the DSP packing technique.
Finally, we implement the proposed accelerator with six ITR kernels on the Xilinx Alveo U2...
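
As a rough illustration of two steps named in the abstract (quantization to 8-bit integers, then dot-product scoring and ranking), the following is a minimal software sketch and not the paper's hardware design. The symmetric per-dataset scaling scheme, the function names `scale_for`, `quantize_int8`, and `top_k`, and the toy 3-dimensional vectors are all assumptions made for this example; the embedding compression, data reordering, and partial score aggregation steps are not modeled.

```python
import heapq

def scale_for(vectors):
    """Pick one scale so the largest magnitude maps to 127 (assumed scheme)."""
    peak = max(abs(x) for v in vectors for x in v)
    return peak / 127 if peak > 0 else 1.0

def quantize_int8(vec, scale):
    """Symmetric quantization: map each float to an integer in [-127, 127]."""
    return [max(-127, min(127, round(x / scale))) for x in vec]

def top_k(query_q, data_q, k):
    """Score every row by integer dot product, then return the k best indices."""
    scores = [sum(a * b for a, b in zip(query_q, row)) for row in data_q]
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

# Toy embeddings (hypothetical values; the paper uses 1024-dimensional features).
data = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.5, 0.5, 0.5]]
query = [1.0, 0.0, 0.1]

s = scale_for(data + [query])
data_q = [quantize_int8(v, s) for v in data]
query_q = quantize_int8(query, s)
best = top_k(query_q, data_q, k=2)  # indices of the two highest-scoring rows
```

Storing each value in 8 bits instead of 32 accounts for a 75% size reduction on its own; the 95.1% figure in the abstract is reported for quantization combined with the embedding compression scheme.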

Published in: 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

Date of Conference: 15-18 May 2022

Date Added to IEEE Xplore: 03 June 2022

DOI: 10.1109/FCCM53951.2022.9786159

Conference Location: New York City, NY, USA

I. Introduction

Computer vision (CV) and natural language processing (NLP) have advanced rapidly thanks to unprecedented progress in deep learning technology. In addition to the prominent achievements made in each field individually, there has been significant progress on the challenging task of combining vision and language by exploring their contextual relations. Over the last decade, such Vision-and-Language (VL) research has attracted considerable interest in applications such as visual question answering [1], [2], image captioning [3]–[5], referring expression comprehension [6], [7], and image-text retrieval [8]–[11].
