I. Introduction
The rapid expansion of ultra-large-scale image and video transmission applications in camera phones and extended reality devices continues to drive the demand for efficient transmission of large media data under limited bandwidth. Traditional communication systems, based on the source-channel separation paradigm, rely on rate-distortion theory for source coding and channel coding theory for transmission. These systems aim to minimize the size of the source data under a distortion constraint while ensuring reliable transmission over noisy channels. However, they can waste significant bandwidth because they focus on global bit information rather than on critical semantic information. To address this inefficiency, recent advances in deep learning have inspired data-driven solutions that extract semantic feature information [1], [2], [3], [4] and implement joint source-channel coding (JSCC) [5], [6], [7], [8], [9], [10] for end-to-end communications.