I. Introduction
Since instructions built on referring expressions, such as "Please go to my office and bring me the document on the table," are common in everyday social conversation, it is essential for embodied AI agents to accurately follow such guidance and identify the goal object in perceptually rich environments. Goal-oriented Vision-Language Navigation (GVLN) setups in real environments [1], [2] take a step toward this goal. Compared with the Vision-and-Language Navigation (VLN) task [3], which focuses on building models that follow fine-grained instructions, GVLN is more practical because fine-grained instructions are rarely available in real life; consequently, it has attracted increasing interest across research fields. Although numerous methods have explored visual and linguistic cues to assist goal navigation [1], [2], [4], [5], [6], [7], severe overfitting caused by small datasets with monotonous environments remains a challenge. This problem becomes more pronounced as the network scale increases, resulting in weak generalization.