I. Introduction
The Transformer architecture has profoundly transformed the landscape of deep learning, giving rise to large models whose applications span from data centers [1] to edge devices. Large models are generally categorized into Natural Language Processing (NLP), Computer Vision (CV), and Multimodal models. NLP models are widely deployed on mobile devices [2], [3], powering intelligent personal assistants such as Google Assistant and Apple Siri as well as real-time language translation [4]. CV models play a pivotal role in autonomous driving [5], [6], where they are utilized for tasks such as real-time object detection [7], [8], lane recognition [9], and traffic signal detection [10]. Multimodal large models are revolutionizing the field of robotics [14], [15] by enriching robots' perception and decision-making capabilities through the integration of diverse data types [11], such as visual, auditory [12], and tactile [13] information.