I. Introduction
Large language models (LLMs) have emerged as a promising avenue for acquiring broad knowledge of the world by representing it in linguistic form. Nonetheless, leveraging such broad knowledge to enable embodied agents to execute precise, specific physical actions in the real world remains a challenge. To address this challenge, some studies encapsulate the agent's actions as action primitives that an LLM can invoke [1], [2]. Others integrate reinforcement learning with LLMs [3], [4], or even let LLMs define reward functions that guide agents in mastering complex low-level skills, such as grasping and navigation [5]–[7].