I. Introduction
Remote sensing image change interpretation technology has emerged as a pivotal tool across various domains, including environmental monitoring, urban planning, and disaster management [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]. Remote sensing image change captioning (RSICC) constitutes a critical aspect of this technology, focusing on describing the changes between temporally disparate remote sensing images through natural language [11], [12], [13]. The advent of specialized datasets and the refinement of vision-language models (VLMs) have catalyzed substantial advancements in deep-learning-driven RSICC methods [13], [14].