1. Introduction
Sequence-to-sequence autoregressive Transformers [12], [34], [42] are deep neural network architectures that map a sequence of tokens, each representing a segment of text as a vector, onto another sequence, typically the input sequence shifted forward by one position. Such models can handle a variety of tasks [24], [33], [34], in which the input (query) text could be a sentence in natural language, and the output (target) the same sentence in a different language (translation), the answer to a question posed in the input (question answering, QA), the name of an entity or class, and so on. The Transformer architecture's versatile and unified design has led to the development of all-in-one (AIO) models, in which multiple tasks are approached as a single sequence-to-sequence translation problem.
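To make the shifted-by-one mapping and the text-to-text framing concrete, the minimal sketch below constructs autoregressive decoder input/target pairs from a toy example; the whitespace tokenizer, vocabulary, and task-prefixed query are illustrative assumptions, not the setup of any cited model.

```python
# Minimal sketch, assuming a toy whitespace tokenizer and vocabulary
# (illustrative only; not the tokenizer of any cited model).

def tokenize(text, vocab):
    """Map each whitespace-separated word to an integer token id."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]

vocab = {"<bos>": 0, "<eos>": 1}

# In an all-in-one (text-to-text) setting, the task is named in the
# query itself, so translation, QA, etc. share one interface.
query = "translate English to German: the cat sat"
target = "die Katze sass"

# Autoregressive training pairs for the decoder: it reads the target
# tokens and learns to predict the same sequence shifted forward by one.
tokens = [vocab["<bos>"]] + tokenize(target, vocab) + [vocab["<eos>"]]
decoder_inputs = tokens[:-1]   # <bos>, die, Katze, sass
decoder_targets = tokens[1:]   # die, Katze, sass, <eos>

print(decoder_inputs, decoder_targets)
```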