I. Introduction
The ability to explore and navigate unknown environments using semantic information is a fundamental skill for embodied agents, i.e., robotic or virtual entities that interact with their surroundings. Embodied agents have applications in fields such as robotics, video games, and virtual assistants, where they must operate autonomously in complex and dynamic environments. As shown in Fig. 1, an important task is to explore the environment and locate a target object based on the reasoning context, such as the visual observation and the name of the target object. Prior knowledge about the environment plays a crucial role in making this exploration effective. For instance, it could include the typical locations where a butter knife is found. The agent can then concentrate its search on areas where a butter knife is more likely to be, which reduces the overall search space. Most existing methods for obtaining prior knowledge rely primarily on collected data [1], [2], [3], [4] or pre-defined rules [5], [6], [7]. However, these approaches are limited in scalability and in applicability to new, unseen environments: collecting such data can be expensive and time-consuming, while defining rules for each new task or environment is impractical.