1. Introduction
Sequence-to-sequence autoregressive Transformers [12], [34], [42] are deep neural network architectures that map a sequence of tokens, each representing a segment of text as a vector, onto another sequence, typically the input sequence shifted forward by one position. Such models can handle a variety of tasks [24], [33], [34], in which the input (query) text could be a sentence in natural language, and the output (target) the same sentence in a different language (translation), the answer to a question posed in the input (question answering, QA), the name of an entity or class, and so on. The Transformer architecture's versatile and unified design has led to the development of all-in-one (AIO) models, in which multiple tasks are approached as a single sequence-to-sequence translation problem.
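To make the shifted-by-one mapping and the text-to-text framing concrete, the minimal sketch below constructs autoregressive decoder input/target pairs from a toy example; the whitespace tokenizer, vocabulary, and task-prefixed query are illustrative assumptions, not the setup of any cited model.

```python
# Minimal sketch, assuming a toy whitespace tokenizer and vocabulary
# (illustrative only; not the tokenizer of any cited model).

def tokenize(text, vocab):
    """Map each whitespace-separated word to an integer token id."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]

vocab = {"<bos>": 0, "<eos>": 1}

# In an all-in-one (text-to-text) setting, the task is named in the
# query itself, so translation, QA, etc. share one interface.
query = "translate English to German: the cat sat"
target = "die Katze sass"

# Autoregressive training pairs for the decoder: it reads the target
# tokens and learns to predict the same sequence shifted forward by one.
tokens = [vocab["<bos>"]] + tokenize(target, vocab) + [vocab["<eos>"]]
decoder_inputs = tokens[:-1]   # <bos>, die, Katze, sass
decoder_targets = tokens[1:]   # die, Katze, sass, <eos>

print(decoder_inputs, decoder_targets)
```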