1. Introduction
Visual Question Answering (VQA) is the task of automatically answering free-form natural language questions about an image. For VQA systems to work reliably when deployed in the wild, for applications such as assisting visually impaired users, they need to be robust to the different ways a user might ask the same question. For example, a VQA model should produce the same answer for two paraphrased questions, such as "What is in the basket?" and "What is contained in the basket?", since their semantic meaning is the same. While significant progress has been made towards building more accurate VQA systems, these models remain brittle to minor linguistic variations in the input question.