I. Introduction
Object rearrangement is an essential but challenging task in robot-environment interaction and a crucial capability in embodied AI [1]. Automating it fully requires combining visual perception [2], [3], [4], [5], textual information [6], [7], [8], and motion planning [9], [10]. Together, these elements enable a sophisticated physical embodiment for robots.