1. Introduction
Visual grounding is a fundamental task at the intersection of computer vision and natural language processing, where the goal is to find correspondences between image content and textual descriptions. It has broad applications in image captioning [18], [53], visual question answering [43], [66], vision-language navigation [11], and beyond. Collecting detailed grounding annotations to train specialist models, however, is cumbersome. Zero-shot visual grounding [30], [38], [54] is therefore an attractive alternative.
Figure: Illustration of how we disambiguate visual entities based on their interactions with other entities. The same entity or relationship in the image and the caption is shown in the same color.