1. Introduction
As multimodal pre-trained models have grown more powerful, they have been widely applied to multimodal tasks such as visual question answering, image-text generation, and image-text retrieval. In parallel, the adapter, a parameter-efficient fine-tuning method, has been extended from unimodal to multimodal settings. VL-Adapter [1] evaluated adapter-based methods on multiple image-text and video-text tasks, demonstrating that adapters can efficiently learn fused information from visual and linguistic inputs. CLIP-Adapter [2] demonstrated the strong few-shot performance of adapters on data-scarce downstream tasks. Tip-Adapter [3] achieved training-free adaptation to downstream tasks by constructing the adapter from a key-value cache. Multiway-Adapter [4] proposed a framework that enhances cross-modal alignment, improving the transferability of large-scale multimodal models on specific tasks while reducing fine-tuning time and parameter count.
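To make the cache-based construction behind Tip-Adapter [3] concrete, the sketch below builds an adapter from few-shot features and labels and combines its output with zero-shot logits. It is a minimal illustration, not the authors' released implementation: the PyTorch framing, function names, and the default values of alpha and beta are our assumptions.

import torch
import torch.nn.functional as F

def build_cache(train_features, train_labels, num_classes):
    # Keys: L2-normalized image features from a frozen encoder (e.g., CLIP).
    # Values: one-hot labels. No gradient updates are required to build them.
    keys = F.normalize(train_features, dim=-1)              # (N, D)
    values = F.one_hot(train_labels, num_classes).float()   # (N, C)
    return keys, values

def tip_adapter_logits(test_features, zero_shot_logits, keys, values,
                       alpha=1.0, beta=5.5):
    # Affinity between test features and cached keys (cosine similarity),
    # sharpened by beta; alpha balances cache logits against zero-shot logits.
    feats = F.normalize(test_features, dim=-1)               # (B, D)
    affinity = feats @ keys.t()                              # (B, N)
    cache_logits = torch.exp(-beta * (1.0 - affinity)) @ values  # (B, C)
    return zero_shot_logits + alpha * cache_logits

Because the keys and values are fixed at construction time, this adapter requires no training; fine-tuned variants of this idea instead treat the cached keys as learnable parameters.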