I. Introduction
With the advance of modality representation learning in language [1]–[4], audio [5]–[7], and vision [8]–[11], multimodal sequence learning that aims to improve overall performance by fusing multiple sensory data, has drawn much attention recently. Multimodal sequence learning bridges the gaps between different modalities and is expected to provide a more reliable solution with high generalization ability when involving more modalities.