I. Introduction
The recent development of large language models (LLMs), such as ChatGPT and ChatGLM [1], [2], has marked a significant leap forward in language understanding, knowledge representation, and basic reasoning, closely approximating intelligent human behavior. This progress has spurred numerous researchers to incorporate LLMs into robotic manipulation, leading to systems such as RobotGPT and ROSGPT [3], [4]. A key research direction now lies in combining the scene and task understanding capabilities of rich vision-language models with the complex planning required for manipulation tasks.