I. Introduction
Vision-and-Language Navigation (VLN) tasks [5], [6] require an agent to navigate through an unseen environment by following a natural language instruction, mirroring the way humans would communicate with domestic robots. As vision and language technologies continue to develop rapidly, VLN has been studied extensively from several perspectives, including model structure [4], [5], [30], [33], representation learning [23], [24], [29], [31], and data augmentation [8], [32]. This body of work has clarified which approaches are effective for VLN and has steadily improved the accuracy and efficiency of navigation.