Abstract:
The demand for edge-device models equipped with multilingual visual capabilities is rapidly increasing in complex IoT application scenarios. While many studies have endowed models with strong visual perception and language analysis capabilities, these models are often large and require substantial amounts of training data; moreover, multilingual parallel corpora are extremely scarce. Although large parameter counts can enhance a model's visual-language processing capabilities, the high training and inference costs make such models unsuitable for edge devices and lead to suboptimal performance in multilingual contexts. To address these challenges, this paper proposes a generative visual-language model that is cross-lingual, lightweight, data-efficient, and easy to train and deploy for inference. We map both English and non-English features into a shared space and align them with a visually distilled model, while leveraging the inherent similarity between languages to increase the supervision coverage of the dataset. Through extensive experiments, we demonstrate that our model achieves state-of-the-art performance across three downstream tasks: Image Captioning, Machine Translation, and Visual Question Answering, surpassing existing methods.
Published in: IEEE Internet of Things Journal (Early Access)
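
The abstract describes mapping English and non-English text features into a shared space and aligning both with features from a visually distilled model. The paper's exact objective is not given in the abstract; the following is a minimal sketch of one common way such alignment is done, assuming a symmetric InfoNCE-style contrastive loss over paired (English, non-English, image) features. All names, dimensions, and the choice of loss here are illustrative assumptions, not the authors' method.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossLingualAligner(nn.Module):
        """Hypothetical module: projects English and non-English sentence
        embeddings into a shared space, alongside features from a frozen,
        distilled visual teacher (dimensions are assumptions)."""

        def __init__(self, text_dim=512, visual_dim=768, shared_dim=256):
            super().__init__()
            # Separate projection heads map each language's features
            # into the same shared embedding space.
            self.en_proj = nn.Linear(text_dim, shared_dim)
            self.xx_proj = nn.Linear(text_dim, shared_dim)
            # Visual-teacher features are projected into the same space.
            self.vis_proj = nn.Linear(visual_dim, shared_dim)

        def forward(self, en_feat, xx_feat, vis_feat):
            # L2-normalize so cosine similarity reduces to a dot product.
            en = F.normalize(self.en_proj(en_feat), dim=-1)
            xx = F.normalize(self.xx_proj(xx_feat), dim=-1)
            vis = F.normalize(self.vis_proj(vis_feat), dim=-1)
            return en, xx, vis

    def alignment_loss(en, xx, vis, temperature=0.07):
        """Symmetric contrastive loss: pulls paired English / non-English
        sentences toward each other and toward the matching visual feature,
        pushing away in-batch negatives."""
        targets = torch.arange(en.size(0), device=en.device)
        loss = (
            F.cross_entropy(en @ xx.t() / temperature, targets)
            + F.cross_entropy(en @ vis.t() / temperature, targets)
            + F.cross_entropy(xx @ vis.t() / temperature, targets)
        ) / 3
        return loss

    if __name__ == "__main__":
        model = CrossLingualAligner()
        en_feat = torch.randn(8, 512)   # English sentence embeddings
        xx_feat = torch.randn(8, 512)   # parallel non-English embeddings
        vis_feat = torch.randn(8, 768)  # distilled visual-teacher features
        en, xx, vis = model(en_feat, xx_feat, vis_feat)
        print(alignment_loss(en, xx, vis).item())

Under this reading, the cross-lingual pairs supply supervision even where image-text annotations exist only in English, which is consistent with the abstract's claim of increasing the supervision coverage of the dataset.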