
Non-autoregressive Sequence-to-Sequence Vision-Language Models


Abstract:

Sequence-to-sequence vision-language models are showing promise, but their applicability is limited by their inference latency due to their autoregressive way of generating predictions. We propose a parallel decoding sequence-to-sequence vision-language model, trained with a Query-CTC loss, that marginalizes over multiple inference paths in the decoder. This allows us to model the joint distribution of tokens, rather than restricting to the conditional distribution as in an autoregressive model. The resulting model, NARVL, achieves performance on par with its state-of-the-art autoregressive counterpart, but is faster at inference time, reducing the linear complexity of sequential token generation to constant-time joint inference.
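The page gives no implementation details for the Query-CTC loss, so the snippet below is only a rough, hypothetical sketch of the idea: it applies a standard CTC loss (Graves et al. [15]) to the per-position distributions emitted by a parallel decoder, so that all alignments ("inference paths") that collapse to the target sequence are marginalized. The function name, tensor shapes, and the blank_id convention are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def query_ctc_loss(decoder_logits, targets, target_lengths, blank_id=0):
    """CTC-style loss over the outputs of a parallel (non-autoregressive) decoder.

    decoder_logits: (batch, num_queries, vocab) -- one token distribution per
        learned query position, all produced in a single decoder pass.
    targets: (batch, max_target_len) padded target token ids (no blanks).
    target_lengths: (batch,) true length of each target sequence.
    """
    batch, num_queries, _ = decoder_logits.shape
    # torch's CTC loss expects log-probabilities of shape (T, N, C).
    log_probs = F.log_softmax(decoder_logits, dim=-1).transpose(0, 1)
    input_lengths = torch.full((batch,), num_queries, dtype=torch.long)
    # The CTC forward algorithm sums (marginalizes) over every alignment of
    # query positions to target tokens, i.e. every path that collapses to the
    # target after removing blanks and repeats.
    return F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                      blank=blank_id, zero_infinity=True)
```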
Date of Conference: 16-22 June 2024
Date Added to IEEE Xplore: 16 September 2024
Conference Location: Seattle, WA, USA

1. Introduction

Sequence-to-sequence autoregressive Transformers [12], [34], [42] are deep neural network architectures that map a sequence of tokens, each representing a segment of text as a vector, onto another sequence, typically the same sequence shifted forward by one. Such models can handle a variety of tasks [24], [33], [34], where the input (query) text could be a sentence in natural language, and the output (target) the same sentence in a different language (translation), or the answer to a question expressed in the input (question-answering, QA), the name of an entity or class, etc. The Transformer architecture's versatile and unified design has led to the development of all-in-one (AIO) models, in which multiple tasks are approached as a sequence-to-sequence translation problem.
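To make the latency contrast concrete, here is a hypothetical sketch (not from the paper) of the two decoding regimes: an autoregressive decoder must be invoked once per generated token, while a parallel decoder over a fixed set of learned queries is invoked once. The decoder interface and the memory and query_embeds names are assumptions for illustration.

```python
import torch

@torch.no_grad()
def autoregressive_decode(decoder, memory, bos_id, eos_id, max_len=64):
    """Sequential generation: one decoder pass per emitted token (O(n) passes)."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = decoder(torch.tensor([tokens]), memory)  # (1, t, vocab)
        next_id = int(logits[0, -1].argmax())
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens[1:]

@torch.no_grad()
def parallel_decode(decoder, memory, query_embeds):
    """Joint generation: all output positions predicted in one decoder pass (O(1) passes)."""
    logits = decoder(query_embeds, memory)  # (1, num_queries, vocab)
    return logits.argmax(dim=-1)  # post-process (e.g. CTC-style collapse) to get final tokens
```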

References
1. Nader Akoury, Kalpesh Krishna and Mohit Iyyer, "Syntactically supervised transformers for faster neural machine translation", ACL, 2019.
2. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick and Devi Parikh, "VQA: Visual question answering", Proceedings of the IEEE International Conference on Computer Vision, pp. 2425-2433, 2015.
3. Yu Bao, Hao Zhou, Shujian Huang, Dongqi Wang, Lihua Qian, Xinyu Dai, et al., "latent-GLAT: Glancing at latent variables for parallel text generation", CoRR, 2022.
4. Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov and Sergey Zagoruyko, "End-to-end object detection with transformers", European Conference on Computer Vision, 2020.
5. Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet and Geoffrey Hinton, "Pix2seq: A language modeling framework for object detection", International Conference on Learning Representations, 2022.
6. Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J. Fleet and Geoffrey Hinton, "A unified sequence interface for vision tasks", arXiv preprint, 2022.
7. Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng and Jingjing Liu, "UNITER: Universal image-text representation learning", European Conference on Computer Vision, 2020.
8. Jaemin Cho, Jie Lei, Hao Tan and Mohit Bansal, "Unifying vision-and-language tasks via text generation", International Conference on Machine Learning, pp. 1931-1942, 2021.
9. Qi Dong, Zhuowen Tu, Haofu Liao, Yuting Zhang, Vijay Mahadevan and Stefano Soatto, "Visual relationship detection using part-and-sum transformers with composite queries", Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3550-3559, 2021.
10. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale", arXiv preprint.
11. Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng et al., "An empirical study of training end-to-end vision-and-language transformers", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
12. Luciano Floridi and Massimo Chiriatti, "GPT-3: Its nature, scope, limits and consequences", Minds and Machines, 2020.
13. Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng and Jingjing Liu, "Large-scale adversarial training for vision-and-language representation learning", Advances in Neural Information Processing Systems, 2020.
14. Marjan Ghazvininejad, Omer Levy, Yinhan Liu and Luke Zettlemoyer, "Mask-Predict: Parallel decoding of conditional masked language models", EMNLP-IJCNLP, 2019.
15. Alex Graves, Santiago Fernández, Faustino Gomez and Jürgen Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks", Proceedings of the 23rd International Conference on Machine Learning, pp. 369-376, 2006.
16. Jiatao Gu and Xiang Kong, "Fully non-autoregressive neural machine translation: Tricks of the trade", ACL, 2021.
17. Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li and Richard Socher, "Non-autoregressive neural machine translation", ICLR, 2018.
18. Jiatao Gu, Changhan Wang and Junbo Zhao, "Levenshtein transformer", Advances in Neural Information Processing Systems, 2019.
19. Geoffrey Hinton, Oriol Vinyals, Jeff Dean et al., "Distilling the knowledge in a neural network", arXiv, 2015.
20. Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, et al., "Scaling up vision-language pre-training for image captioning", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
21. Lukasz Kaiser, Samy Bengio, Aurko Roy, Ashish Vaswani, Niki Parmar, Jakob Uszkoreit, et al., "Fast decoding in sequence models using discrete latent variables", International Conference on Machine Learning, 2018.
22. Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra and Nicolas Carion, "MDETR - modulated detection for end-to-end multi-modal understanding", Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
23. Justin Lazarow, Kwonjoon Lee, Kunyu Shi and Zhuowen Tu, "Learning instance occlusion for panoptic segmentation", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10720-10729, 2020.
24. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov and Luke Zettlemoyer, "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation and comprehension", arXiv preprint.
25. Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong and Steven Chu Hong Hoi, "Align before fuse: Vision and language representation learning with momentum distillation", Advances in Neural Information Processing Systems, 2021.
26. Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, et al., "UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning", ACL/IJCNLP, 2021.
27. Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei et al., "Oscar: Object-semantics aligned pre-training for vision-language tasks", European Conference on Computer Vision, 2020.
28. Jinglin Liu, Yi Ren, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, et al., "Task-level curriculum learning for non-autoregressive neural machine translation", IJCAI, 2021.
29. Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi and Aniruddha Kembhavi, "Unified-IO: A unified model for vision, language, and multi-modal tasks", arXiv preprint, 2022.
30. Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang and Alan Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)", arXiv preprint, 2014.