
Language-Conditioned Imitation Learning With Base Skill Priors Under Unstructured Data



Abstract:

The growing interest in language-conditioned robot manipulation aims to develop robots capable of understanding and executing complex tasks, with the objective of enabling robots to interpret language commands and manipulate objects accordingly. While language-conditioned approaches demonstrate impressive capabilities for addressing tasks in familiar environments, they encounter limitations in adapting to unfamiliar environment settings. In this letter, we propose a general-purpose, language-conditioned approach that combines base skill priors and imitation learning under unstructured data to enhance the algorithm's generalization in adapting to unfamiliar environments. We assess our model's performance in both simulated and real-world environments using a zero-shot setting. The average completed task length, indicating the average number of tasks the agent can continuously complete, improves more than 2.5 times compared to the baseline method HULC. In terms of the zero-shot evaluation of our policy in a real-world setting, we set up ten tasks and achieved an average 30% improvement in our approach compared to the current state-of-the-art approach, demonstrating a high generalization capability in both simulated environments and the real world.
Published in: IEEE Robotics and Automation Letters ( Volume: 9, Issue: 11, November 2024)
Page(s): 9805 - 9812
Date of Publication: 20 September 2024



SECTION I.

Introduction

Language-conditioned robot manipulation [1] is an emerging field of research at the intersection of robotics, natural language processing, and computer vision. This domain seeks to develop robots capable of understanding their surrounding environments and executing complex manipulation tasks based on natural language commands provided by humans. Substantial progress has been made in recent years, with some studies focusing on deep reinforcement learning techniques that shape reward functions for language instructions, enabling agents to solve tasks through trial and error by following language instructions [2], [3], [4], [5], [6]. These methods are carefully designed to mitigate low sample efficiency and enable effective learning. Other researchers leverage language-conditioned imitation learning approaches, which train agents using demonstration datasets. For instance, some studies utilize imitation learning with expert demonstrations accompanied by labeled language instructions to solve such language-conditioned tasks [7], [8]. While these methods have demonstrated high success rates in completing tasks, two main shortcomings remain. First, the process is limited by the substantial effort required to collect expert demonstrations. As a result, the dataset available for exploring various scenarios in the environment is restricted, ultimately hindering the agent's potential for better performance. Second, the trained agent lacks the capacity for generalization, which impedes its ability to carry out tasks in unseen environments.

To address the first problem, some researchers employ unstructured data (play data) [9], [10], [11], [12], [13], which consists of human demonstrations driven by curiosity or other intrinsic motivations rather than by specific tasks, to reduce the effort required to collect expert data for training. All the play data is obtained through interactions with simulation environments by participants using virtual reality (VR) equipment, with only 1% of the data labeled with language instructions. By employing play data, the labor-intensive task of data labeling is significantly reduced, facilitating the creation of larger training datasets for imitation learning. The trained agent demonstrates remarkable performance, exhibiting a high success rate across various tasks. Building upon the ideas presented in [10], HULC [14] was developed to enhance the performance of language-conditioned imitation learning by integrating transformer structures and contrastive learning. HULC++ [15] further improves performance by incorporating a self-supervised visuo-lingual affordance model.

Regarding the second problem, current approaches still struggle to generalize to tasks in unfamiliar and complex environments. The policy learned through imitation learning exhibits outstanding evaluation performance primarily in the training domains, suggesting that its effectiveness is restricted to scenarios where the training and evaluation environments are identical. In sim2real experiments and zero-shot evaluations in novel environments, the discrepancy between the evaluation and training environments leads to a substantial decline in success rates.

Within the imitation learning framework, agents typically rely on predicting the short-term next action at each time step based on the current observation and goal, without learning a high-level, long-term procedure. This diverges from the more natural approach employed by humans, which typically involves breaking down complex tasks into simpler, basic steps. Skill-based learning [16], [17] is a promising approach that utilizes pre-defined skills to expedite the learning process, leveraging the prior knowledge encoded within these skills, which is typically derived from human expertise. A primary factor contributing to the suboptimal performance of current language-conditioned imitation learning methods is the absence of prior knowledge during training. The excessive dependence on training data can lead to overfitting and impede generalization to unfamiliar scenarios. By incorporating prior skills, the agent avoids starting from scratch and reduces its dependency on the training data.

In this letter, we introduce a base Skill Prior based Imitation Learning (SPIL) framework designed to enhance an agent's ability to generalize to unfamiliar environments by integrating base skill priors: translation, rotation, and grasping. Specifically, SPIL learns both a low-level policy for executing skill instances based on observations and an intermediate-level policy that determines which base skill (translation, rotation, or grasping) should be performed under the current observation. Fig. 1 compares our approach with common approaches. The intermediate-level policy functions as a manager, interpreting language instructions and appropriately combining these base skills to solve manipulation tasks. For instance, when the intermediate-level policy receives the language instruction “lift the block”, it decomposes the task into several steps involving base skills, such as approaching the block (translation), grasping the block (grasping), and lifting the block (translation). We call this an intermediate-level policy to distinguish it from a more complex high-level policy for tasks like “tidying up the room”, which can be decomposed into several subtasks (usually by LLMs [18]). We evaluate our algorithm using the CALVIN benchmark [19] and achieve outstanding performance in the challenging zero-shot multi-environment setting. Furthermore, we conduct sim-to-real experiments to assess the performance of our approach in real-world environments, yielding promising results. We summarize the key contributions as follows:

  • In this letter, we incorporate the skill priors into imitation learning and design a skill-prior-based imitation learning mechanism to enable learning of an intermediate-level procedure and enhance the generalization ability of the learned policy.

  • Our proposed method exhibits superior performance compared to previous baselines, particularly in terms of its ability to generalize and perform well in previously unseen environments. Our evaluation shows that our approach outperforms the current method HULC by a significant margin, achieving more than 2.5 times its performance. We also conducted a series of sim-to-real experiments to further investigate our model's generalization ability in unseen environments and its potential for real-world applications.

Fig. 1. Comparison of common approaches (dashed red) and our approach (green). Common approaches usually learn actions directly from the current observation and instruction. Our approach additionally learns an intermediate-level policy that chooses which base skill to apply, based on the current observation and instruction.

SECTION II.

Related Works

In the field of language-conditioned robot manipulation, some studies establish connections between visual perception and linguistic comprehension in the vision-and-language field, facilitating the agent's ability to tackle multimodal problems [20], [21], [22]. Other research focuses on grounding language instructions in the agent's behaviors, empowering the agent to comprehend instructions and effectively interact with the environment [23], [24], [25], [26]. However, these approaches employ two-stream architectural models to process multimodal data. Such models require distinct feature representations for each data modality, such as semantic and spatial representations [26], potentially compromising learning efficiency. As an alternative, end-to-end models learn feature representations and decision-making directly from raw input data, with language instructions serving as a conditioning factor for training the agent. This approach eliminates the need for manual feature engineering [14], [27], thereby offering a more efficient and robust solution for complex tasks and emerging as a trend in language-conditioned robot manipulation.

For instance, imitation learning with end-to-end models has been applied to solve language-conditioned manipulation tasks using expert demonstrations accompanied by a large number of labeled language instructions [7], [8]. These approaches necessitate a substantial amount of labeled and structured demonstration data. By extending the idea of [9], Lynch et al. proposed MCIL [28], which grounds the agent's behavior in language instructions using unlabeled and unstructured demonstration data, reducing data acquisition effort and achieving more robust performance. HULC [14], as an enhanced version of MCIL, is designed to improve the performance of MCIL even further. It has achieved impressive results on the CALVIN benchmark [19] in the single-environment setting. However, when tested in the more challenging zero-shot multi-environment setting, where the evaluation environment differs from the training environments, HULC's performance drops significantly. These suboptimal results suggest that current language-conditioned imitation learning approaches lack the ability to adapt to unfamiliar environments. More recently, some approaches [29], [30], [31], [32] leverage the rich knowledge in pre-trained foundation models to enhance generalization in unseen environments.

The concept of skill-based mechanisms in deep reinforcement learning provides valuable insights for enhancing the generalizability of algorithms. Specifically, skill-based reinforcement learning leverages task-agnostic experiences in the form of large datasets to accelerate the learning process [33], [34], [35], [36]. To extract skills from a large task-agnostic dataset, several approaches [37], [38] first learn an embedding space of skills and skill priors from the dataset. Inspired by this, we have developed an imitation learning approach that utilizes certain base skill priors. By employing this method, the agent learns intermediate-level processes (composing these base skills) that aid in task completion, thereby enhancing its ability to generalize across different scenarios.

SECTION III.

Methodology

A. Overview

The key idea of our approach is to integrate skills into imitation learning by changing the original action space, the Cartesian end-effector space \mathcal {A} \in \mathbb {R}^{7}, into a skill space \mathcal {A}_\text {skill} \in \mathbb {R}^{N_{h} \times 7}, where N_{h} denotes the horizon of a skill. Note that each skill represents a fixed-length (N_{h}) action sequence in our setting. We also integrate the concept of base skills (translation, rotation, grasping) into the learning procedure so that the agent can learn an extra intermediate-level policy that decomposes tasks into several base skills. Unlike reinforcement learning, the optimization strategy employed in imitation learning minimizes the discrepancy between the predicted actions and the corresponding actions observed in the demonstration data. For this reason, a primary challenge in integrating skill priors into imitation learning is the continuous nature of the actions in the demonstration data, which requires modeling the skills as a continuous action space that aligns with the demonstration actions, rather than representing the skills by a finite, discrete set of pre-defined action sequences. To address these challenges, the rest of this section is organized as follows:

  1. We define three base skills (translation, rotation, grasping) for a robotic arm agent and introduce the method to stochastically label action sequences with base skills.

  2. We introduce our approach to learning continuous skill embedding space, integrating base skill priors into such skill space.

  3. By utilizing a continuous skill space and base skills, we implement an imitation learning algorithm to train the agent to acquire the ability to 1) learn an intermediate-level base skill composition to accomplish the desired task and 2) develop a policy that can determine which specific skill instance to perform based on each observation, as opposed to a single action.

The architecture of our proposed method is illustrated in Fig. 4.
Fig. 2. This architecture comprises two encoders, the action sequence encoder and the base skill locator (encoder), and a decoder for reconstructing skill embeddings into action sequences. The base skill locator takes one-hot embeddings of the three base skills as input and outputs the distribution of each base skill prior in the skill latent space. The action sequence encoder encodes action sequences with a fixed horizon of N_{h} into skill distributions in the latent space. The decoder then reconstructs the skill embedding into an action sequence.

Fig. 3. t-SNE visualization of the skill latent space.

Fig. 4. The overall architecture. The static observation, gripper observation, and language instruction are encoded into plan, language-goal, language, static-observation, and gripper-observation embeddings. The skill selector module then decodes a sequence of skill embeddings from the plan, observation, and language-goal embeddings. The skill labeler labels the skill embeddings with the base skills: translation, rotation, and grasping. The base skill regularization loss is calculated from the base skill prior distributions (from the base skill locator f_{\boldsymbol{\kappa}}), the selected skill instance, and the labeled probability indicating its membership in specific base skills. This labeled probability is also used to compute the categorical regularization loss. Finally, the pre-trained and frozen skill generator f_{\boldsymbol{\theta}} decodes all skill embeddings into action sequences, which are then used to calculate the reconstruction loss (Huber loss).


B. Base Skill Labeling

This section formally defines three base skills: translation, rotation, and grasping. Since each action sequence can contain multiple base skills, deterministically assigning an action sequence to a single base skill is not reasonable. Instead, we stochastically label each action sequence x=(a_{0}, a_{1},{\ldots }, a_{N_{h}-1}) of length N_{h} with probabilities (p(\text{trans.}|x), p(\text{rot.}|x), p(\text{grasp.}|x)), which indicate how strongly x belongs to each of the three base skills. For example, the probabilities (0.7, 0.2, 0.1) suggest a dominance of the translation skill within the given action sequence, a minor presence of the rotation skill, and minimal grasping. We design a non-learning-based approach to label each action sequence. Since actions are defined in the Cartesian EE space, this can be accomplished by assessing the accumulated magnitude of the seven degrees of freedom over the temporal dimension of a given horizon N_{h}. The probability of the sequence belonging to the translation, rotation, and grasping skills is defined as

\begin{align*} p(y|x) = \frac{w_{y} \cdot \sum _{i=0}^{N_{h}-1} |a_{i}^{y}| }{\sum _{k\in \lbrace \text {trans., rot., grasp.}\rbrace } w_{k} \cdot \sum _{i=0}^{N_{h}-1} |a_{i}^{k}|}, \tag{1} \end{align*}

where y \in \lbrace \text {trans., rot., grasp.}\rbrace refers to the base skills and a = [t_{x},t_{y},t_{z},r_\alpha,r_\beta, r_\gamma, g] with a^{\text {trans.}}=[t_{x},t_{y},t_{z}], a^{\text {rot.}}=[r_\alpha,r_\beta, r_\gamma ], and a^{\text {grasp.}}=[g], indicating the end effector's displacement, rotation, and gripper control. The “magic weights” w_{y} are introduced to address inconsistencies in scale across different units, such as meters and degrees. These values act as balancing factors and are determined based on our understanding of the inherent relationships between translation, rotation, and grasping. They reflect the subjective nature of defining these base skills: since such classifications can be nuanced and depend on human experience, we choose weights w_{k} that reflect a common understanding of how these motions are typically characterized.
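As a concrete illustration of (1), the sketch below labels a fixed-horizon action sequence with base-skill probabilities. It is a minimal Python example; the weight values are placeholders standing in for the "magic weights", whose exact values are not given in this section.

import numpy as np

# Illustrative placeholder weights; the paper's actual "magic weights" are not stated here.
MAGIC_WEIGHTS = {"trans": 1.0, "rot": 0.1, "grasp": 0.5}

def label_base_skills(actions: np.ndarray) -> dict:
    """Stochastically label an action sequence with base-skill probabilities (Eq. (1)).

    `actions` has shape (N_h, 7): [t_x, t_y, t_z, r_alpha, r_beta, r_gamma, g] per
    timestep, i.e. relative EE displacement, rotation, and gripper command.
    """
    magnitudes = np.abs(actions).sum(axis=0)          # accumulate |a_i| over the horizon
    scores = {
        "trans": MAGIC_WEIGHTS["trans"] * magnitudes[0:3].sum(),
        "rot":   MAGIC_WEIGHTS["rot"]   * magnitudes[3:6].sum(),
        "grasp": MAGIC_WEIGHTS["grasp"] * magnitudes[6],
    }
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}

# Example: a 5-step sequence dominated by translation, with one gripper toggle.
seq = np.zeros((5, 7))
seq[:, 0] = 0.02   # move along x at every step
seq[0, 6] = 1.0    # gripper command at the first step
print(label_base_skills(seq))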

C. Continuous Skill Embeddings With Base Skill Priors

In this section, we introduce a skill space \mathcal {A}_\text {skill} \in \mathbb {R}^{N_{h} \times 7} as the action space for the agent. To better represent this skill space, we compress action sequences sampled from play data into skill embeddings, following the idea of Variational Autoencoders (VAEs). After training, we obtain a latent space populated with skill embeddings and three clusters indicating the base skill priors for translation, rotation, and grasping. To achieve this, we define y as the base skill variable, so that the base skill distribution in the latent space can be written as z \sim p(z|y). For a given action sequence x, we employ the approximate variational posteriors q(z|x) and q(y,z|x) to estimate the intractable true posterior. Following the VAE procedure, we measure the Kullback-Leibler (KL) divergence between the true posterior and the posterior approximation to derive the ELBO (see Appendix, Theoretical Motivation, for details):

\begin{align*} \mathcal {L}_{\text {ELBO}} =& \overbrace{\mathbb {E}_{z \sim q_{\boldsymbol{\phi }}(z|x)}[\log p_{\boldsymbol{\theta }}(x|z)]}^{\text {reconstruction loss}} - \beta _{1} \overbrace{D_{KL} (q_{\boldsymbol{\phi }}(z|x)\,||\,p(z))}^{\text {regularizer}\ (\mathcal {L}_{\text {reg.}})}\\ & - \beta _{2} \sum _{k} q(y=k|x) \underbrace{D_{KL}(q_{\boldsymbol{\phi }}(z|x)\,||\,p_{\boldsymbol{\kappa }}(z|y=k))}_{\text {base-skill regularizer}\ (\mathcal {L}_\text {skill})}, \tag{2} \end{align*}

where p_{\boldsymbol{\theta }}(x|z) and q_{\boldsymbol{\phi }}(z|x) are the decoder and encoder networks with parameters \boldsymbol{\theta } and \boldsymbol{\phi }, respectively. We also define a network p_{\boldsymbol{\kappa }}(z|y) with parameters \boldsymbol{\kappa } for locating the base skills in the latent skill space, and q(y=k|x) is calculated by (1). The hyperparameters \beta _{1} and \beta _{2} weigh the regularizer terms. \mathcal {L}_{\text {ELBO}} can be interpreted as follows. On the one hand, we aim for higher reconstruction accuracy; as the reconstruction improves, our approximated posterior also becomes more accurate. On the other hand, the two regularizers contribute to a more structured latent skill space. The first regularizer, D_{KL} (q_{\boldsymbol{\phi }}(z|x)\,||\,p(z)), constrains the encoded distribution to be close to the prior distribution p(z). Likewise, the second regularizer, D_{KL}(q_{\boldsymbol{\phi }}(z|x)\,||\,p_{\boldsymbol{\kappa }}(z|y)), draws the encoded distribution nearer to its corresponding base skill prior distribution.
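The following PyTorch sketch assembles the negative of the objective in (2) from an action-sequence encoder, a skill generator (decoder), and a base skill locator. Network sizes, the MLP architectures, and the beta values are illustrative assumptions, not the paper's actual configuration.

import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

N_H, ACT_DIM, Z_DIM, N_SKILLS = 5, 7, 16, 3   # illustrative dimensions

class SeqEncoder(nn.Module):                  # q_phi(z | x)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(N_H * ACT_DIM, 128), nn.ReLU())
        self.mu, self.log_std = nn.Linear(128, Z_DIM), nn.Linear(128, Z_DIM)
    def forward(self, x):
        h = self.net(x)
        return Normal(self.mu(h), self.log_std(h).exp())

class SkillGenerator(nn.Module):              # p_theta(x | z)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(Z_DIM, 128), nn.ReLU(), nn.Linear(128, N_H * ACT_DIM))
    def forward(self, z):
        return self.net(z).view(-1, N_H, ACT_DIM)

class BaseSkillLocator(nn.Module):            # p_kappa(z | y), y given as one-hot
    def __init__(self):
        super().__init__()
        self.mu, self.log_std = nn.Linear(N_SKILLS, Z_DIM), nn.Linear(N_SKILLS, Z_DIM)
    def forward(self, y_onehot):
        return Normal(self.mu(y_onehot), self.log_std(y_onehot).exp())

def elbo_loss(x, q_y_given_x, enc, dec, loc, beta1=1e-3, beta2=1e-3):
    """Negative of Eq. (2). q_y_given_x: (B, 3) base-skill probabilities from Eq. (1)."""
    q_z = enc(x)
    z = q_z.rsample()                                          # reparameterized sample
    recon = ((dec(z) - x) ** 2).mean()                         # reconstruction term (MSE proxy)
    kl_prior = kl_divergence(q_z, Normal(0.0, 1.0)).sum(-1).mean()
    eye = torch.eye(N_SKILLS)
    kl_skill = 0.0
    for k in range(N_SKILLS):                                  # weighted KL to each base-skill prior
        p_k = loc(eye[k].expand(x.shape[0], -1))
        kl_skill = kl_skill + (q_y_given_x[:, k] * kl_divergence(q_z, p_k).sum(-1)).mean()
    return recon + beta1 * kl_prior + beta2 * kl_skill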

The learning procedure is illustrated in Fig. 2, and the overall algorithm can be found in Algorithm 1. After training, we obtain a skill generator f_{\boldsymbol{\theta }} = p_{\boldsymbol{\theta }}(x|z), which maps a skill embedding to the corresponding action sequence. Since there exists such a one-to-one mapping, the action space \mathcal {A}_\text {skill} is equivalent to \mathcal {A}_{z} \in \mathbb {R}^{N_{z}}, where N_{z} is the skill embedding dimension. The agent thus selects one skill embedding in the latent space at each decision step rather than a single action, as is typically done. Additionally, we obtain the base skill locator f_{\boldsymbol{\kappa }} = p_{\boldsymbol{\kappa }}(z|y), which identifies the positions of the base skill distributions within the skill latent space. Both sets of parameters remain frozen during the later imitation learning.

An illustration of the skill latent space, obtained with the t-SNE algorithm, is shown in Fig. 3. Three clusters are labeled with different colors, corresponding to the three base skills we define. Each point indicates a skill embedding z \in \mathcal {A}_{z} that corresponds to an action sequence of length N_{h}. A single skill embedding can encompass features of several base skills, given that the latent space is continuous; consequently, a skill embedding between two base skill clusters captures features from both.
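A visualization like Fig. 3 can be produced with an off-the-shelf t-SNE implementation. The arrays below are random placeholders standing in for the encoded skill embeddings and their dominant base-skill labels from (1).

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder data: skill embeddings (N, N_z) and the dominant base skill per sequence.
embeddings = np.random.randn(3000, 16)
dominant_skill = np.random.randint(0, 3, size=3000)

coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)
for k, name in enumerate(["translation", "rotation", "grasping"]):
    mask = dominant_skill == k
    plt.scatter(coords[mask, 0], coords[mask, 1], s=3, label=name)
plt.legend()
plt.title("t-SNE of the skill latent space")
plt.show()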

Algorithm 1: Learning Continuous Skill Embeddings with Base Skill Priors.

1: Given:
   \mathcal {D}: \lbrace (a_{0},a_{1},{\ldots },a_{H-1})\rbrace: a play dataset of action sequences with horizon H.
   \mathcal {F} = \lbrace f_{\boldsymbol{\phi }}, f_{\boldsymbol{\theta }}, f_{\boldsymbol{\kappa }}\rbrace: the encoder network with parameters \boldsymbol{\phi }, the decoder network (also denoted as the skill generator) with parameters \boldsymbol{\theta }, and the base skill locator network with parameters \boldsymbol{\kappa }.
2: Randomly initialize model parameters \lbrace \boldsymbol{\theta }, \boldsymbol{\phi }, \boldsymbol{\kappa } \rbrace
3: while not done do
4:   Sample an action sequence x \sim \mathcal {D}
5:   Encode the sequence with f_{\boldsymbol{\phi }} = q_{\boldsymbol{\phi }}(z|x)
6:   Compute the base skill distributions f_{\boldsymbol{\kappa }} = p_{\boldsymbol{\kappa }}(z|y)
7:   Sample one latent embedding z \sim q_{\boldsymbol{\phi }}(z|x)
8:   Feed the sampled z into the decoder f_{\boldsymbol{\theta }} = p_{\boldsymbol{\theta }}(x|z) to obtain the reconstructed action sequence \hat{x}
9:   Compute the loss based on (2)
10:  Update parameters \boldsymbol{\theta }, \boldsymbol{\phi }, \boldsymbol{\kappa } to minimize \mathcal {L}
11: end while

D. Imitation Learning With Base Skill Priors

Algorithm 2: Imitation Learning with Skill Priors.

1: Given:
   \mathcal {D}: \lbrace (D^{\text {play}}, D^{\text {lang}})\rbrace: play dataset and language dataset.
   \mathcal {F} = \lbrace f_{\boldsymbol{\Phi }}, f_{\boldsymbol{\lambda }}, f_{\boldsymbol{\kappa }}, f_{\boldsymbol{\omega }}, f_{\boldsymbol{\theta }}\rbrace: the encoder f_{\boldsymbol{\Phi }}, the skill embedding selector f_{\boldsymbol{\lambda }}, the base skill locator f_{\boldsymbol{\kappa }}, the base skill selector f_{\boldsymbol{\omega }}, and the skill generator f_{\boldsymbol{\theta }}, with parameters \boldsymbol{\Phi }, \boldsymbol{\lambda }, \boldsymbol{\kappa }, \boldsymbol{\omega }, and \boldsymbol{\theta }, respectively.
2: Randomly initialize model parameters \lbrace \boldsymbol{\Phi }, \boldsymbol{\lambda }, \boldsymbol{\omega } \rbrace
3: Initialize parameters \boldsymbol{\theta } and \boldsymbol{\kappa } with the pre-trained skill generator and base skill locator
4: Freeze the parameters \boldsymbol{\theta } and \boldsymbol{\kappa }
5: while not done do
6:   \mathcal {L} \leftarrow 0
7:   for l in {play, lang} do
8:     Sample a (demonstration, context) pair (x^{l}, c^{l}) \sim D^{l}
9:     Encode the observation, goal, and plan embeddings using the encoder network f_{\boldsymbol{\Phi }}
10:    The skill embedding selector f_{\boldsymbol{\lambda }} selects the skill embedding sequence
11:    Determine a sequence of base skill probabilities with the base skill selector f_{\boldsymbol{\omega }}
12:    Determine the base skill locations in the latent space with the base skill locator f_{\boldsymbol{\kappa }}
13:    The skill generator f_{\boldsymbol{\theta }} maps the skill embeddings to action sequences
14:    Calculate the loss \mathcal {L}_{l} according to (4)
15:    Accumulate the imitation loss \mathcal {L} \mathrel {+}= \mathcal {L}_{l}
16:  end for
17:  Update parameters \lbrace \boldsymbol{\Phi }, \boldsymbol{\lambda }, \boldsymbol{\omega } \rbrace w.r.t. \mathcal {L}
18: end while

After acquiring the skill embedding space \mathcal {A}_{z} and the distributions of the base skill priors within it, we can train a policy with imitation learning on top of them. This results in a policy with enhanced generalization capabilities, as incorporating prior knowledge prevents the model from overfitting. The base skill priors we have defined encapsulate human proficiency in task completion, and we leverage this prior knowledge to reduce the agent's reliance on training data alone. In our approach, instead of determining an action at every step, the agent learns to choose a skill that embodies motion-related human knowledge. Meanwhile, it also selects the appropriate base skill for the current state, mirroring the habitual approach of humans to accomplishing tasks.

We extend the ideas of MCIL [10] and HULC [14] by employing an action space \mathcal {A}_{z} of skill embeddings instead of the Cartesian action space \mathcal {A}. In this framework, the action performed by the agent is no longer a single 7-DoF movement in one time step, but a skill (action sequence) over a horizon N_{h}. Consequently, the agent learns to select a skill based on the current observation. After the skill is executed, the agent selects the next skill based on the subsequent observation, and the process continues iteratively until the agent completes the task or time runs out. Fig. 4 depicts the overall structure of our approach. Given the superior performance of the HULC model, we employ its encoder, denoted as f_{\boldsymbol{\Phi }}, to transform the static observation, gripper observation, and language instruction into their corresponding embeddings. All these embeddings align with the definitions provided in the HULC model. Additionally, to extract the overall process information from the language instruction, we introduce an extra language embedding. This process information is crucial for inferring the intermediate-level compositions of base skills required for successful task completion. We further analyze four key parts of our structure below (a sketch of how they compose at inference time follows the list):

  • Skill Embedding Selector: The skill embedding selector, denoted as f_{\boldsymbol{\lambda }}, selects skill embeddings in the pre-trained latent space. A bidirectional LSTM network is employed for this skill embedding selector.

  • Base Skill Selector: The base skill selector f_{\boldsymbol{\omega }}, also a bidirectional LSTM network, determines the base skill to which a given skill belongs.

  • Base Skill Locator: The base skill locator shares its parameters with the base skill locator f_{\boldsymbol{\kappa }} in Fig. 2. Its task is to locate the base skill positions in the latent space. The input to this network is a 3 \times 3 identity matrix, i.e., the one-hot representations of the three base skills. These locations are used to calculate the regularization loss.

  • Skill Generator: The skill generator, denoted as f_{\boldsymbol{\theta }} = p_{\boldsymbol{\theta }}(x|z) : \mathcal {A}_{z} \rightarrow \mathcal {A}_{\text {skill}}, shares its parameters with the decoder component in Fig. 2, and these parameters are frozen during the imitation learning process. Its function is to map the skill embedding space \mathcal {A}_{z} to the skill space \mathcal {A}_{\text {skill}}. The resulting skills (action sequences) are concatenated chronologically into a longer action sequence.
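The sketch below shows how these components might compose at inference time: the encoder and skill embedding selector propose a skill embedding, and the frozen skill generator decodes it into an N_{h}-step action sequence that is executed before re-observing. The module handles and the environment interface are hypothetical stand-ins, not the released implementation; the base skill selector and locator only enter the training loss.

import torch

@torch.no_grad()
def rollout(env, instruction, f_Phi, f_lambda, f_theta, max_skills=20):
    """Zero-shot rollout sketch: pick one skill embedding per decision step, decode it
    with the frozen skill generator into an N_h-step action sequence, and execute it."""
    info = {}
    obs = env.reset()
    for _ in range(max_skills):
        # Hypothetical observation keys; encoder returns observation/language-goal embeddings.
        ctx = f_Phi(obs["rgb_static"], obs["rgb_gripper"], instruction)
        z = f_lambda(ctx)                   # skill embedding in the latent space A_z
        actions = f_theta(z)                # (N_h, 7) relative EE actions
        for a in actions:
            obs, reward, done, info = env.step(a.cpu().numpy())
            if done:
                return info
    return info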

The objective of our model is to learn a policy \pi (x|s_{c},s_{g}) that is conditioned on the current state s_{c} and the goal state s_{g} and outputs x, a sequence of actions, namely a skill. Since we introduce the base skill concept into our model, the policy should also find the best base skill y for the current observation. We therefore have \pi (x,y|s_{c},s_{g}), where y is the base skill the agent chooses based on the current state and goal state.

Inspired by the conditional variational autoencoder (CVAE),

\begin{equation*} \log p(x|c) \geq \mathbb {E}_{q(z|x,c)}[\log p(x|z,c)] - D_{KL}(q(z|x,c)\,||\,p(z|c)), \tag{3} \end{equation*}

where c denotes a general condition, we extend this bound by integrating y, which indicates the base skill. The evidence we want to maximize then becomes p(x,y|c). We employ the approximate variational posterior q(y,z|x,c) to approximate the intractable true posterior p(y,z|x,c), where z indicates the skill embedding in the skill latent space. We derive the ELBO by measuring the KL divergence between the true posterior and the posterior approximation (detailed theoretical motivation in the Appendix):

\begin{align*} \mathcal {L} = & \overbrace{\mathbb {E}_{z \sim q_{\boldsymbol{\Phi },\boldsymbol{\lambda }}(z|x,c)}\log p_{\boldsymbol{\theta }}(x|z)}^{\text {Reconstruction loss}\ (\mathcal {L}_{\text {huber}})} \\ & - \gamma _{1} \sum _{k} q_{\boldsymbol{\omega }}(y=k|c) \overbrace{D_{KL}(q_{\boldsymbol{\Phi }, \boldsymbol{\lambda }}(z|x,c)\,||\,p_{\boldsymbol{\kappa }}(z|y=k))}^{\text {Base skill regularizer}\ (\mathcal {L}_{\text {skill}})}\\ & - \gamma _{2} \overbrace{D_{KL}(q_{\boldsymbol{\omega }}(y|c)\,||\,p(y))}^{\text {Categorical regularizer}\ (\mathcal {L}_{\text {cat.}})} \tag{4} \end{align*}

where c represents the combination of the current state and the goal state (s_{c},s_{g}), and z is a skill embedding in the latent skill space. p_{\boldsymbol{\theta }}(x|z) is the skill generator network f_{\boldsymbol{\theta }} with parameters \boldsymbol{\theta }; it is trained with the VAE described in the previous section and frozen during imitation learning. f_{\boldsymbol{\omega }} = q_{\boldsymbol{\omega }}(y|c) corresponds to the base skill selector (the skill labeler in Fig. 4) with parameters \boldsymbol{\omega }. q_{\boldsymbol{\Phi }, \boldsymbol{\lambda }}(z|x,c) refers to the encoder network f_{\boldsymbol{\Phi }} followed by the skill embedding selector network f_{\boldsymbol{\lambda }}. Furthermore, p_{\boldsymbol{\kappa }}(z|y) constitutes the base skill locator f_{\boldsymbol{\kappa }} with parameters \boldsymbol{\kappa }; it is also trained with the VAE discussed in the previous section and frozen during training. We use the Huber loss as the reconstruction loss. Intuitively, the base skill regularizer regularizes a skill embedding according to its base skill category, while the categorical regularizer regularizes the base skill classification toward the prior categorical distribution of y. The overall algorithm is shown in Algorithm 2.
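Below is a minimal sketch of the minimized form of (4), assuming Gaussian posteriors over skill embeddings and a uniform categorical prior p(y) (the uniform prior is our assumption); tensor shapes, the gamma values, and the module interfaces are illustrative.

import torch
import torch.nn.functional as F
from torch.distributions import Normal, Categorical, kl_divergence

def imitation_loss(x_demo, q_z, q_y_logits, skill_locator, skill_generator,
                   gamma1=1e-2, gamma2=1e-2):
    """Sketch of the objective in Eq. (4), written as a loss to minimize.

    x_demo:         (B, N_h, 7) demonstration action sequences
    q_z:            Normal posterior over skill embeddings from f_Phi + f_lambda
    q_y_logits:     (B, 3) base-skill logits from the base skill selector f_omega
    skill_locator:  frozen f_kappa, maps a one-hot base-skill vector to a Normal over z
    skill_generator: frozen f_theta, maps z to (B, N_h, 7) action sequences
    """
    z = q_z.rsample()
    recon = F.huber_loss(skill_generator(z), x_demo)             # reconstruction (Huber) loss

    q_y = torch.softmax(q_y_logits, dim=-1)
    eye = torch.eye(3)
    skill_reg = 0.0
    for k in range(3):                                           # base-skill regularizer
        p_k = skill_locator(eye[k].expand(x_demo.shape[0], -1))
        skill_reg = skill_reg + (q_y[:, k] * kl_divergence(q_z, p_k).sum(-1)).mean()

    prior_y = Categorical(probs=torch.full_like(q_y, 1.0 / 3))   # uniform p(y), an assumption
    cat_reg = kl_divergence(Categorical(probs=q_y), prior_y).mean()

    return recon + gamma1 * skill_reg + gamma2 * cat_reg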

SECTION IV.

Experiments

In this section, we present the experiments conducted to investigate the generalization ability of our model in comparison to other baselines. We choose the CALVIN benchmark [19] to evaluate our model. The CALVIN benchmark was introduced to facilitate learning language-conditioned tasks across four manipulation environments: A, B, C, and D. Each environment features a Franka Emika Panda robot arm equipped with a gripper and a desk that includes a sliding door and a drawer. Additionally, the desk has a button that toggles a green light and a switch that controls a light bulb. Note that each environment has a different desk with various textures, and the positions of static elements such as the sliding door, drawer, light, switch, and button differ across environments. Experiments are conducted in two settings: (1) a single environment, where the training and testing environments are the same, and (2) zero-shot multi-environment, where training occurs in the first three environments and testing takes place in the fourth, previously unseen environment.

We choose the long-horizon multi-task language control (LH-MTLC) setting to evaluate the effectiveness of the learned multi-task language-conditioned policy in accomplishing several language instructions in a row under the zero-shot multi-environment setting. We also compare against other skill-based reinforcement learning approaches to show the advantages of our approach.

We analyze the results of our model by comparing it to other baselines (Table I). We evaluate the models with 1000 five-task chains. The columns labeled one to five report the success rate of completing that number of tasks in a row. The average length indicates the average number of tasks the agent can complete consecutively when given five tasks in a row (the remaining tasks are not attempted if one task fails in the middle). Subsequently, ablation studies on the hyperparameters \gamma _{1},\gamma _{2} in (4) and the skill length N_{h} (default value 5) are performed in the zero-shot multi-environment setting. Each model is evaluated three times using three random seeds.
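For reference, a small sketch of how the average completed task length can be computed from per-chain success flags; the data layout is an assumption for illustration.

def average_completed_length(chain_results):
    """chain_results: list of per-chain outcomes, each a list of five booleans in task
    order. A chain stops at the first failure, so a chain's score is the number of
    leading successes; the metric is the mean over all evaluation chains."""
    def completed(chain):
        n = 0
        for success in chain:
            if not success:
                break
            n += 1
        return n
    return sum(completed(c) for c in chain_results) / len(chain_results)

# Example with three chains: scores 2, 5, and 0, giving an average length of 2.33.
print(average_completed_length([[True, True, False, True, True],
                                [True, True, True, True, True],
                                [False, True, True, True, True]]))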

TABLE I. CALVIN Benchmark Results

A. Environment Result

As evidenced in Table I, our model improves substantially over our baselines HULC and MCIL in the zero-shot multi-environment setting. Compared to the current SOTA model HULC, the success rates of completing one to five tasks in a row increase by 32.4%, 29.8%, 21.9%, 12.8%, and 6.9%, respectively. The overall average length increases from 0.67 to 1.71. Note that the zero-shot multi-environment setting is challenging, as the agent must solve tasks in an unfamiliar environment; performance in this setting reflects the agent's ability to truly understand and connect the concepts in language instructions with real objects and actions. The significant improvement of our model confirms our hypothesis that using skill priors to learn intermediate-level task composition improves generalization. Other SOTA models that leverage pre-trained foundation models, SuSIE [29], RoboFlamingo [30], GR-1 [31], and 3D Diffuser Actor [32], are listed in Table I for reference. It is worth mentioning that our SPIL model also outperforms the baselines in the single-environment setting.

B. Real-World Experiments

To investigate whether a policy trained in a simulated environment transfers to real-world scenarios, we conduct a sim2real experiment without any additional adaptation (zero-shot), as shown in Fig. 5.

Fig. 5. Real-world experiments. We employ the multi-task language control (MTLC) setting of the CALVIN benchmark, encompassing a total of 10 tasks as listed above. The agent is trained in the simulated CALVIN environment D and directly applied to the real-world setting.

We designed the real-world environment to closely resemble the simulated CALVIN environment D. As illustrated in the rightmost part of Fig. 4, the real-world environment comprises one switch, one cabinet with a slider, one button, one drawer, and three blocks in red, pink, and blue. Additionally, two RGB cameras capture the static and gripper observations.

Table II lists the tasks performed and the corresponding success rate. The agent is trained in four CALVIN environments (A, B, C, D), and the trained policy is directly applied to a real-world environment. To mitigate the influence of the robot's initial position on the policies, we execute 10 roll-outs for each task, maintaining identical starting positions. The table results demonstrate our model's effectiveness in handling the challenging zero-shot sim2real experiments. Despite the substantial differences between the simulation and real-world contexts, our model still achieves an average success rate of 33% in accomplishing the tasks. Conversely, the HULC model-trained agent struggles with these tasks, with a 3% average success rate, underscoring the difficulty of solving real-world challenges. The results from real-world experiments further substantiate our claim that our proposed method exhibits superior generalization capabilities, enabling successful task completion even in unfamiliar environments.

TABLE II. Real-World Experiment Results
SECTION V.

Conclusion

In this letter, we introduced a novel imitation learning paradigm that integrates base skills into imitation learning. Our proposed SPIL model effectively improves generalization compared to current baselines and substantially surpasses the SOTA models on the language-conditioned robotic manipulation benchmark CALVIN, especially under the challenging zero-shot multi-environment setting. This work also aims to contribute to the development of general-purpose robots that can effectively integrate human language with their perception and actions.
