Introduction
Language-conditioned robot manipulation [1] is an emerging field of research at the intersection of robotics, natural language processing, and computer vision. This domain seeks to develop robots capable of understanding their surrounding environments and executing complex manipulation tasks based on natural language commands provided by humans. Substantial progress has been made in recent years. Some studies focus on deep reinforcement learning techniques that shape reward functions for language instructions, enabling agents to solve language-conditioned tasks through trial and error [2], [3], [4], [5], [6]; these methods are carefully designed to mitigate low sample efficiency and enable effective learning. Other researchers leverage language-conditioned imitation learning approaches, which train agents on demonstration datasets. For instance, some studies utilize imitation learning with expert demonstrations accompanied by labeled language instructions to solve such language-conditioned tasks [7], [8]. While these methods have demonstrated high success rates in completing tasks, two main shortcomings remain. First, the process is limited by the substantial effort required to collect expert demonstrations. As a result, the dataset available for exploring various scenarios in the environment is restricted, ultimately hindering the agent's potential for better performance. Second, the trained agent has limited capacity for generalization, which impedes its ability to carry out tasks in unseen environments.
To address the first problem, some researchers employ unstructured data (play data) [9], [10], [11], [12], [13], which consists of human demonstrations driven by curiosity or other intrinsic motivations rather than by specific tasks, to reduce the effort required to collect expert data for training. All the play data is obtained through interactions with simulation environments by participants using virtual reality (VR) equipment, with only 1% of the data labeled with language instructions. By employing play data, the labor-intensive task of data labeling is significantly reduced, facilitating the creation of larger training datasets for imitation learning. The trained agent demonstrates remarkable performance, exhibiting a high success rate across various tasks. Building upon the ideas presented in [10], HULC [14] was developed to enhance the performance of language-conditioned imitation learning by integrating transformer structures and contrastive learning. HULC++ [15] further improves the performance by incorporating a self-supervised visuo-lingual affordance model.
Regarding the second problem, current approaches still struggle to generalize to unfamiliar and complex environments. The policy learned through imitation learning exhibits outstanding evaluation performance primarily in training domains, suggesting that its effectiveness is restricted to scenarios where training and evaluation environments are identical. When sim2real experiments and zero-shot evaluations are conducted in novel environments, the discrepancy between the evaluation and training environments results in a substantial decline in success rates.
Within the imitation learning framework, agents typically predict the short-term next action at each time step based on the current observation and goal, without learning a high-level long-term procedure. This diverges from the more natural approach employed by humans, which typically involves breaking down complex tasks into simpler, basic steps. Skill-based learning [16], [17] is a promising approach that utilizes pre-defined skills, typically derived from human expertise, to expedite the learning process by leveraging the prior knowledge encoded within these skills. A primary factor contributing to the suboptimal performance of current language-conditioned imitation learning methods is the absence of prior knowledge during training. Excessive dependence on training data can lead to overfitting and impede generalization to unfamiliar scenarios. By incorporating prior skills, the agent avoids the need to start from scratch and reduces its dependency on training data.
In this letter, we introduce a base Skill Prior based Imitation Learning (SPIL) framework designed to enhance the generalization ability of an agent in adapting to unfamiliar environments by integrating base skill priors: translation, rotation, and grasping. Specifically, SPIL learns both a low-level policy that executes skill instances based on observations and an intermediate-level policy that determines which base skill (translation, rotation, or grasping) should be performed under the current observation. Fig. 1 compares our approach with conventional approaches. The intermediate-level policy functions as a manager, interpreting language instructions and appropriately combining these base skills to solve manipulation tasks. For instance, when the intermediate-level policy receives the language instruction “lift the block”, it decomposes the task into several steps involving base skills, such as approaching the block (translation), grasping the block (grasping), and lifting the block (translation). We call it an intermediate-level policy to distinguish it from the more complex high-level policies for tasks such as “tidying up the room”, which can be decomposed into several subtasks (usually done by LLMs [18]). We evaluate our algorithm on the CALVIN benchmark [19] and achieve outstanding performance in the challenging zero-shot multi-environment setting. Furthermore, we conduct sim-to-real experiments to assess the performance of our approach in real-world environments, yielding outstanding results. We summarize the key contributions as follows:
We incorporate skill priors into imitation learning and design a skill-prior-based imitation learning mechanism that enables learning of an intermediate-level procedure and enhances the generalization ability of the learned policy.
Our proposed method exhibits superior performance compared to previous baselines, particularly in its ability to generalize and perform well in previously unseen environments. Our evaluation shows that our approach outperforms the current method HULC by a significant margin, achieving 2.5 times its performance. We also conduct a series of sim-to-real experiments to further investigate our model's generalization ability in unseen environments and its potential for real-world applications.
Comparison of common approaches and our skill-prior-based approach.
Related Works
In the field of language-conditioned robot manipulation, some studies establish connections between visual perception and linguistic comprehension in the vision-and-language field, facilitating the agent's ability to tackle multimodal problems [20], [21], [22]. Other research focuses on grounding language instructions in the agent's behaviors, empowering the agent to comprehend instructions and effectively interact with the environment [23], [24], [25], [26]. However, these approaches employ two-stream architectural models to process multimodal data. Such models require distinct feature representations for each data modality, such as semantic and spatial representations [26], potentially compromising learning efficiency. As an alternative, end-to-end models learn feature representations and decision-making directly from raw input data, with language instructions serving as a conditioning factor for training the agent. This approach eliminates the need for manual feature engineering [14], [27], thereby offering a more efficient and robust solution for complex tasks and emerging as a trend in language-conditioned robot manipulation.
For instance, imitation learning with end-to-end models has been applied to solve language-conditioned manipulation tasks using expert demonstrations accompanied by a large number of labeled language instructions [7], [8]. These approaches necessitate a substantial amount of labeled and structured demonstration data. Extending the idea of [9], Lynch et al. proposed MCIL [28], which grounds the agent's behavior in language instructions using unlabeled and unstructured demonstration data, reducing data acquisition effort and achieving more robust performance. HULC [14], an enhanced version of MCIL, improves its performance even further and has achieved impressive results on the CALVIN benchmark [19] in the single-environment setting. However, when tested in the more challenging zero-shot multi-environment setting, where the evaluation environment differs from the training environments, HULC's performance drops significantly. These suboptimal results suggest that current language-conditioned imitation learning approaches lack the ability to adapt to unfamiliar environments. More recently, some approaches [29], [30], [31], [32] leverage the rich knowledge in pre-trained foundation models to enhance generalization in unseen environments.
The concept of skill-based mechanisms in deep reinforcement learning provides valuable insights for enhancing the generalizability of algorithms. Specifically, skill-based reinforcement learning leverages task-agnostic experiences in the form of large datasets to accelerate the learning process [33], [34], [35], [36]. To extract skills from a large task-agnostic dataset, several approaches [37], [38] first learn an embedding space of skills and skill priors from the dataset. Inspired by this, we have developed an imitation learning approach that utilizes certain base skill priors. By employing this method, the agent learns intermediate-level processes (composing these base skills) that aid in task completion, thereby enhancing its ability to generalize across different scenarios.
Methodology
A. Overview
The key idea of our approach is to integrate skills into imitation learning by replacing the original action space, the Cartesian end-effector action space, with a skill space $\mathcal{A}_{\text{skill}}$ in which each element corresponds to a short action sequence (a skill) rather than a single action. Specifically:
We define three base skills (translation, rotation, grasping) for a robotic arm agent and introduce the method to stochastically label action sequences with base skills.
We introduce our approach to learning a continuous skill embedding space and integrating base skill priors into this space.
By utilizing a continuous skill space and base skills, we implement an imitation learning algorithm to train the agent to acquire the ability to 1) learn an intermediate-level base skill composition to accomplish the desired task and 2) develop a policy that can determine which specific skill instance to perform based on each observation, as opposed to a single action.
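To make this two-level structure concrete, the following minimal sketch illustrates the control loop we have in mind at execution time; the environment interface, the select_skill and decode methods, and the fixed horizon are illustrative assumptions rather than the exact interfaces of our implementation.

# Minimal sketch (hypothetical interfaces) of the two-level control loop:
# the intermediate-level policy selects a skill embedding from the current
# observation and the language goal, and the low-level skill generator decodes
# it into a short, fixed-horizon action sequence that is executed step by step.
def rollout(env, intermediate_policy, skill_generator, language_goal, max_skills=64):
    obs = env.reset()
    info = {}
    for _ in range(max_skills):
        # Intermediate level: pick a skill embedding z in the latent skill space.
        z = intermediate_policy.select_skill(obs, language_goal)
        # Low level: decode z into an action sequence of shape (N_h, action_dim).
        action_sequence = skill_generator.decode(z)
        for action in action_sequence:
            obs, reward, done, info = env.step(action)
            if done:
                return info
    return info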
This architecture comprises two encoders, the action sequence encoder and the base skill locator (encoder), and a decoder for reconstructing the skill embeddings into action sequences. The base skill locator takes one-hot embeddings of the three base skills as input and outputs the distributions of the base skill priors in the skill latent space. The action sequence encoder encodes action sequences with a fixed horizon of $N_{h}$ into the skill latent space.
The Overall Architecture. Following the encoding process, the static observation, gripper observation, and language instruction are encoded into embeddings for the plan, the language goal, the static observation, and the gripper observation. The skill selector module subsequently decodes a sequence of skill embeddings using the plan, observation, and language goal embeddings. The skill labeler labels the skill embeddings with the base skills: translation, rotation, and grasping. The base skill regularization loss is calculated based on the base skill prior distributions (from the base skill locator $f_{\boldsymbol{\kappa}}$) and the selected skill embeddings.
B. Base Skill Labeling
This section formally defines the three base skills: translation, rotation, and grasping. Since each action sequence can contain multiple base skills, deterministically classifying an action sequence into one of the three base skills is not reasonable. Instead, we stochastically label each given action sequence $x$ with a base skill $y$ according to the probability
\begin{align*}
p(y|x) = \frac{w_{y} \cdot \sum _{i=0}^{N_{h}-1} |a_{i}^{y}| }{\sum _{k\in \lbrace \text {trans., rot., grasp.}\rbrace } w_{k} \cdot \sum _{i=0}^{N_{h}-1} |a_{i}^{k}|} \text{,} \tag{1}
\end{align*}
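Reading (1) with $a_{i}^{k}$ as the components of action $a_{i}$ associated with base skill $k$ and $w_{k}$ as a skill-specific weight, the labeling rule can be sketched as follows. The 7-DoF relative action layout (three translation deltas, three rotation deltas, one gripper command) and the function names are assumptions made for illustration only.

import numpy as np

# Hedged sketch of the stochastic base skill labeling in Eq. (1), assuming a
# CALVIN-style relative action layout: indices 0-2 = translation deltas,
# indices 3-5 = rotation deltas, index 6 = gripper command.
SKILL_DIMS = {"translation": [0, 1, 2], "rotation": [3, 4, 5], "grasping": [6]}

def base_skill_probabilities(actions, weights):
    """actions: (N_h, 7) array of relative actions; weights: dict of w_k from Eq. (1)."""
    scores = {
        skill: weights[skill] * np.abs(actions[:, dims]).sum()
        for skill, dims in SKILL_DIMS.items()
    }
    total = sum(scores.values())
    return {skill: score / total for skill, score in scores.items()}

def sample_base_skill(actions, weights, rng=None):
    """Stochastically label the action sequence with one base skill y ~ p(y|x)."""
    rng = rng or np.random.default_rng()
    probs = base_skill_probabilities(actions, weights)
    skills, p = zip(*probs.items())
    return rng.choice(skills, p=np.array(p))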
C. Continuous Skill Embeddings With Base Skill Priors
In this section, we introduce a skill space $\mathcal{A}_{\text{skill}}$ and a continuous skill embedding space $\mathcal{A}_{z}$ into which the base skill priors are integrated. The skill embeddings are learned with an encoder-decoder model trained to maximize the following evidence lower bound:
\begin{align*}
\mathcal {L}_{\text {ELBO}} =& \overbrace{\mathbb {E}_{z \sim q_{\boldsymbol{\phi }}(z|x)}[\log p_{\boldsymbol{\theta }}(x|z)]}^{\text {reconstruction loss}} - \beta _{1} \overbrace{D_{KL} (q_{\boldsymbol{\phi }}(z|x)||p(z))}^{\text {regularizer} (\mathcal {L}_{\text {reg.}})}\\
& - \beta _{2} \sum _{k} q(y=k|x) \underbrace{D_{KL}(q_{\boldsymbol{\phi }}(z|x)||p_{\boldsymbol{\kappa }}(z|y=k))}_{\text {base-skill regularizer} (\mathcal {L}_\text {skill})} \text{,} \tag{2}
\end{align*}
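A minimal PyTorch sketch of the objective in (2) is given below, assuming that both the posterior $q_{\boldsymbol{\phi}}(z|x)$ and the base skill priors $p_{\boldsymbol{\kappa}}(z|y)$ are diagonal Gaussians and that the expected log-likelihood is approximated by a mean-squared reconstruction term; all names are illustrative rather than the exact implementation.

import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

def skill_elbo_loss(x, x_recon, q_mu, q_std, prior_mus, prior_stds, q_y, beta1, beta2):
    """x, x_recon: (B, N_h, act_dim); q_mu, q_std: posterior q_phi(z|x), shape (B, D);
    prior_mus, prior_stds: base skill priors p_kappa(z|y=k), shape (K, D);
    q_y: (B, K) stochastic base skill labels from Eq. (1)."""
    q_z = Normal(q_mu, q_std)
    # Reconstruction term (stand-in for E_q[log p_theta(x|z)]).
    recon = -F.mse_loss(x_recon, x)
    # Standard regularizer: KL(q_phi(z|x) || N(0, I)).
    kl_unit = kl_divergence(q_z, Normal(torch.zeros_like(q_mu),
                                        torch.ones_like(q_std))).sum(-1)
    # Base-skill regularizer: sum_k q(y=k|x) * KL(q_phi(z|x) || p_kappa(z|y=k)).
    kl_skill = torch.stack(
        [kl_divergence(q_z, Normal(prior_mus[k], prior_stds[k])).sum(-1)
         for k in range(prior_mus.shape[0])], dim=-1)          # shape (B, K)
    skill_reg = (q_y * kl_skill).sum(-1)
    elbo = recon - beta1 * kl_unit.mean() - beta2 * skill_reg.mean()
    return -elbo                                               # minimize the negative ELBO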
The learning procedure is illustrated in Fig. 2, and the overall algorithm can be found in Algorithm 1. After training, we obtain a skill generator $f_{\boldsymbol{\theta}} = p_{\boldsymbol{\theta}}(x|z)$ that decodes skill embeddings $z \in \mathcal{A}_{z}$ into action sequences, together with the base skill prior distributions $p_{\boldsymbol{\kappa}}(z|y)$ in the latent space.
An illustration of the skill latent space obtained with the t-SNE algorithm is shown in Fig. 3. Three clusters are labeled with different colors, indicating the three base skills we define. Each point indicates a skill embedding $z \in \mathcal{A}_{z}$.
Algorithm 1: Learning Continuous Skill Embeddings with Base Skill Priors.
Given: a dataset $\mathcal{D}$ of action sequences with horizon $N_{h}$, loss weights $\beta_{1}$, $\beta_{2}$
Randomly initialize model parameters $\boldsymbol{\phi}$, $\boldsymbol{\theta}$, $\boldsymbol{\kappa}$
while not done do
Sample an action sequence $x$ from $\mathcal{D}$
Encode this sequence with the encoder $q_{\boldsymbol{\phi}}(z|x)$
Compute the base skill distributions $p_{\boldsymbol{\kappa}}(z|y)$ and the skill labels $q(y|x)$ using (1)
Sample one latent embedding $z \sim q_{\boldsymbol{\phi}}(z|x)$
Feed the sampled $z$ into the decoder $p_{\boldsymbol{\theta}}(x|z)$ to reconstruct the action sequence
Compute the loss based on Equation (2)
Update parameters $\boldsymbol{\phi}$, $\boldsymbol{\theta}$, $\boldsymbol{\kappa}$
end while
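Putting Algorithm 1 together, a compact training loop might look like the sketch below; encoder, decoder, and skill_locator are hypothetical modules standing in for $q_{\boldsymbol{\phi}}(z|x)$, $p_{\boldsymbol{\theta}}(x|z)$, and $p_{\boldsymbol{\kappa}}(z|y)$, base_skill_labels is a batched version of the labeling rule in (1), and skill_elbo_loss refers to the loss sketch above.

import torch

def train_skill_embeddings(encoder, decoder, skill_locator, loader,
                           label_weights, beta1, beta2, lr=1e-4, epochs=1):
    # Jointly optimize q_phi(z|x), p_theta(x|z), and p_kappa(z|y).
    params = (list(encoder.parameters()) + list(decoder.parameters())
              + list(skill_locator.parameters()))
    opt = torch.optim.Adam(params, lr=lr)
    one_hot = torch.eye(3)                               # one-hot inputs of the 3 base skills
    for _ in range(epochs):
        for x in loader:                                 # action sequences, shape (B, N_h, dim)
            q_mu, q_std = encoder(x)                     # q_phi(z|x)
            prior_mus, prior_stds = skill_locator(one_hot)   # p_kappa(z|y=k)
            z = q_mu + q_std * torch.randn_like(q_std)   # reparameterized sample
            x_recon = decoder(z)                         # p_theta(x|z)
            q_y = base_skill_labels(x, label_weights)    # batched Eq. (1), cf. earlier sketch
            loss = skill_elbo_loss(x, x_recon, q_mu, q_std,
                                   prior_mus, prior_stds, q_y, beta1, beta2)
            opt.zero_grad()
            loss.backward()
            opt.step()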
D. Imitation Learning With Base Skill Priors
Algorithm 2: Imitation Learning with Skill Priors.
Given: play dataset $\mathcal{D}$, pre-trained skill generator $p_{\boldsymbol{\theta}}(x|z)$ and base skill locator $f_{\boldsymbol{\kappa}}$ from Algorithm 1
Randomly initialize model parameters $\boldsymbol{\Phi}$, $\boldsymbol{\lambda}$, $\boldsymbol{\omega}$
Initialize parameters $\boldsymbol{\theta}$ and $\boldsymbol{\kappa}$ from the model trained with Algorithm 1
Freeze the parameters $\boldsymbol{\theta}$ and $\boldsymbol{\kappa}$
while not done do
for each sample in the batch do
Sample a (demonstration, context) pair from $\mathcal{D}$
Encode the observation, goal, and plan embeddings using the encoder network (parameters $\boldsymbol{\Phi}$)
Select skill embeddings with the Skill Embedding Selector $f_{\boldsymbol{\lambda}}$
Determine a sequence of base skill probabilities with the Base Skill Selector $f_{\boldsymbol{\omega}}$
Determine the base skill locations in the latent space with the Base Skill Locator $f_{\boldsymbol{\kappa}}$
Decode the skill embeddings into action sequences with the Skill Generator $f_{\boldsymbol{\theta}}$
Calculate the loss function based on Equation (4)
Accumulate imitation loss
end for
Update parameters $\boldsymbol{\Phi}$, $\boldsymbol{\lambda}$, $\boldsymbol{\omega}$
end while
After acquiring the skill embedding space $\mathcal{A}_{z}$ together with the skill generator and the base skill priors, we integrate them into the language-conditioned imitation learning process.
We extend the idea of MCIL [10] and HULC [14] by employing the skill space $\mathcal{A}_{\text{skill}}$ as the action space instead of single-step Cartesian end-effector actions. The model consists of the following components:
Skill Embedding Selector: The skill embedding selector, denoted as $f_{\boldsymbol{\lambda}}$, selects skill embeddings in the pre-trained latent space. A bidirectional LSTM network is employed for this skill embedding selector.
Base Skill Selector: The base skill selector $f_{\boldsymbol{\omega}}$, also a bidirectional LSTM network, determines the base skill to which a given skill belongs.
Base Skill Locator: The base skill locator shares the same parameters as the base skill locator $f_{\boldsymbol{\kappa}}$ in Fig. 2. Its task is to locate the base skills in the latent space. The input to this network is a $3 \times 3$ identity matrix, i.e., the one-hot representations of the three base skills. These locations are used to calculate the regularization loss.
Skill Generator: The skill generator, denoted as $f_{\boldsymbol{\theta}} = p_{\boldsymbol{\theta}}(x|z) : \mathcal{A}_{z} \rightarrow \mathcal{A}_{\text{skill}}$, shares the same parameters with the decoder component in Fig. 2, and its parameters are frozen during the imitation learning process. Its function is to map embeddings from the skill embedding space $\mathcal{A}_{z}$ to the skill space $\mathcal{A}_{\text{skill}}$. These skills (action sequences) are combined chronologically into a longer action sequence.
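A single forward pass of the imitation learner can then be sketched as follows; perception stands in for the encoder network with parameters $\boldsymbol{\Phi}$, skill_selector for $f_{\boldsymbol{\lambda}}$, base_skill_selector for $f_{\boldsymbol{\omega}}$, and skill_locator and skill_generator for the frozen $f_{\boldsymbol{\kappa}}$ and $f_{\boldsymbol{\theta}}$. The batch keys and module interfaces are illustrative assumptions, not the exact implementation.

import torch

def imitation_forward(batch, perception, skill_selector, base_skill_selector,
                      skill_locator, skill_generator):
    # Encode static/gripper observations and the language goal into context embeddings.
    context = perception(batch["static_rgb"], batch["gripper_rgb"], batch["language"])
    # Select a sequence of skill embeddings z in the pre-trained latent space A_z.
    z_mu, z_std = skill_selector(context)
    z = z_mu + z_std * torch.randn_like(z_std)
    # Predict which base skill (translation / rotation / grasping) each z belongs to.
    base_skill_logits = base_skill_selector(context)            # shape (B, T, 3)
    # Locate the (frozen) base skill priors p_kappa(z|y) in the latent space.
    prior_mus, prior_stds = skill_locator(torch.eye(3))
    # Decode skill embeddings into fixed-horizon action sequences (frozen generator).
    actions = skill_generator(z)
    return actions, (z_mu, z_std), base_skill_logits, (prior_mus, prior_stds)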
The objective of our model is to learn a policy that generates action sequences $x$ conditioned on the context $c$, i.e., the current observations and the language goal, by maximizing $\log p(x|c)$.
We take inspiration from the evidence lower bound of the conditional variational autoencoder (CVAE):
\begin{equation*}
\log p(x|c) \geq \mathbb {E}_{q(z|x,c)}[\log p(x|z,c)] - D_{KL}(q(z|x,c) || p(z|c)) \tag{3}
\end{equation*}
Adapting this bound to our skill-based architecture yields the training objective
\begin{align*}
\mathcal {L} = & \overbrace{\mathbb {E}_{z \sim q_{\boldsymbol{\Phi }, \boldsymbol{\lambda }}(z|x,c)}[\log p_{\boldsymbol{\theta }}(x|z)]}^{\text {Reconstruction loss} (\mathcal {L}_{\text {huber}})} \\
& - \gamma _{1} \sum _{k} q_{\boldsymbol{\omega }}(y=k|c) \overbrace{D_{KL}(q_{\boldsymbol{\Phi }, \boldsymbol{\lambda }}(z|x,c)||p_{\boldsymbol{\kappa }}(z|y))}^{\text {Base skill regularizer} (\mathcal {L}_{\text {skill}})}\\
& - \gamma _{2} \overbrace{D_{KL}(q_{\boldsymbol{\omega }}(y|c)||p(y))}^{\text {Categorical regularizer} (\mathcal {L}_{\text {cat.}})} \tag{4}
\end{align*}
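The three terms of (4) can be computed as in the following sketch, assuming diagonal Gaussian distributions and a uniform categorical prior $p(y)$; function and tensor names are again illustrative.

import torch
import torch.nn.functional as F
from torch.distributions import Normal, Categorical, kl_divergence

def imitation_loss(pred_actions, target_actions, z_mu, z_std,
                   prior_mus, prior_stds, base_skill_logits, gamma1, gamma2):
    q_z = Normal(z_mu, z_std)                                  # q_{Phi,lambda}(z|x,c)
    q_y = torch.softmax(base_skill_logits, dim=-1)             # q_omega(y|c), shape (B, K)
    # Huber reconstruction loss between decoded and demonstrated action sequences.
    recon = F.huber_loss(pred_actions, target_actions)
    # Base-skill regularizer: sum_k q_omega(y=k|c) * KL(q(z|x,c) || p_kappa(z|y=k)).
    kl_skill = torch.stack(
        [kl_divergence(q_z, Normal(prior_mus[k], prior_stds[k])).sum(-1)
         for k in range(prior_mus.shape[0])], dim=-1)          # shape (B, K)
    skill_reg = (q_y * kl_skill).sum(-1).mean()
    # Categorical regularizer: KL(q_omega(y|c) || p(y)), with p(y) assumed uniform.
    prior_y = torch.full_like(q_y, 1.0 / q_y.shape[-1])
    cat_reg = kl_divergence(Categorical(probs=q_y), Categorical(probs=prior_y)).mean()
    return recon + gamma1 * skill_reg + gamma2 * cat_reg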
Experiments
In this section, we present the experiments conducted to investigate the generalization ability of our model in comparison to other baselines. We choose the CALVIN benchmark [19] to evaluate our model. The CALVIN benchmark facilitates learning language-conditioned tasks across four manipulation environments: A, B, C, and D. Each environment features a Franka Emika Panda robot arm equipped with a gripper and a desk that includes a sliding door and a drawer. Additionally, the desk has a button that toggles a green light and a switch that controls a light bulb. Note that each environment has a different desk with various textures, and the positions of static elements such as the sliding door, drawer, light, switch, and button differ across environments. Experiments are conducted in two settings: (1) a single environment, where the training and testing environments are the same, and (2) zero-shot multi-environment, where training occurs in the first three environments and testing takes place in the fourth, previously unseen environment.
We choose long-horizon multi-task language control (LH-MTLC) to evaluate the effectiveness of the learned multi-task language-conditioned policy in accomplishing several language instructions in a row under the zero-shot multi-environment setting. We also compare against other skill-based reinforcement learning approaches to show the advantages of our approach.
We analyze the results of our model by comparing it to other baselines (shown in Table I). We evaluate the models with 1000 five-task chains. The columns labeled one to five report the success rate of continuously completing that number of tasks in a row. The average length indicates the average number of tasks the agent can complete consecutively when given five tasks in a row (the remaining tasks are not attempted if one task fails in the middle). Subsequently, we present ablation studies on the model's hyperparameters.
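For clarity, the sketch below shows how the per-column success rates and the average length could be computed from the 1000 evaluated chains; the data structure chain_results is a hypothetical representation of the evaluation log.

import numpy as np

# Hedged sketch of the LH-MTLC metrics described above: chain_results holds, for
# each evaluation chain, a list of five booleans indicating whether each
# consecutive task succeeded before the first failure.
def evaluate_chains(chain_results):
    lengths = []
    for successes in chain_results:
        n = 0
        for ok in successes:
            if not ok:
                break            # remaining tasks are not attempted after a failure
            n += 1
        lengths.append(n)
    lengths = np.array(lengths)
    success_rates = {k: float((lengths >= k).mean()) for k in range(1, 6)}
    avg_length = float(lengths.mean())
    return success_rates, avg_length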
A. Environment Result
As evidenced in Table I, our model substantially improves over our baselines HULC and MCIL in the zero-shot multi-environment setting. Compared to the current SOTA model HULC, the success rates of completing one to five tasks in a row increase by 32.4%, 29.8%, 21.9%, 12.8%, and 6.9%, respectively. The overall average length increases from 0.67 to 1.71. Note that the zero-shot multi-environment setting is challenging, as the agent must solve tasks in an unfamiliar environment. Performance in this setting reflects the agent's ability to truly understand and connect the concepts in language instructions with real objects and actions. The significant improvement of our model confirms our hypothesis that using skill priors to learn intermediate-level task composition improves generalization capabilities. The other SOTA models, SuSIE [29], RoboFlamingo [30], GR-1 [31], and 3D Diffuser Actor [32], which leverage pre-trained foundation models, are listed in Table I for reference. It is worth mentioning that our SPIL model also outperforms the baselines in the single-environment setting.
B. Real-World Experiments
To investigate the viability of transferring a policy trained in a simulated environment to real-world scenarios, we conduct a sim2real experiment without any additional adaptation (zero-shot), as shown in Fig. 5.
Real-world experiments. We employ the multi-task language control (MTLC) setting in the CALVIN benchmark, encompassing a total of 10 tasks as listed above. The agent is trained in the simulated CALVIN environment D and directly applied to the real-world setting.
We designed the real-world environment to closely resemble the simulated CALVIN environment D. The rightmost part of Fig. 4 illustrates that the real-world environment comprises one switch, one cabinet with a slider, one button, one drawer, and three blocks in red, pink, and blue. Additionally, two RGB cameras are employed to capture the static and gripper observations.
Table II lists the tasks performed and the corresponding success rates. The agent is trained in four CALVIN environments (A, B, C, D), and the trained policy is directly applied to a real-world environment. To mitigate the influence of the robot's initial position on the policies, we execute 10 roll-outs for each task, maintaining identical starting positions. The results in the table demonstrate our model's effectiveness in handling the challenging zero-shot sim2real experiments. Despite the substantial differences between the simulation and real-world contexts, our model still achieves an average success rate of 33% in accomplishing the tasks. In contrast, the agent trained with the HULC model struggles with these tasks, with a 3% average success rate, underscoring the difficulty of solving real-world challenges. The results from the real-world experiments further substantiate our claim that our proposed method exhibits superior generalization capabilities, enabling successful task completion even in unfamiliar environments.
Conclusion
In this letter, we introduced a novel imitation learning paradigm that integrates base skills into imitation learning. Our proposed SPIL model effectively improves generalization ability compared to current baselines and substantially surpasses the SOTA models on the language-conditioned robotic manipulation benchmark CALVIN, especially under the challenging zero-shot multi-environment setting. This work also aims to contribute towards the development of general-purpose robots that can effectively integrate human language with their perception and actions.