Behavior Cloning with RLBench

Dataset generation process for change point detection and behavior cloning with RLBench simulator

Divyanshu Raj
9 min read · Oct 12, 2024
Franka Emika Robot hand used for data generation with expert demonstrations (Image by Author)

In my recent paper¹ “Learning Temporally Composable Task Segmentations with Language,” published at the IROS 2024 conference, I presented a method for detecting change-points in long-horizon robotics trajectories. To achieve this, I needed to create a specialized dataset, which was done using the RLBench² simulator. RLBench provides both low- and high-dimensional data, and I leveraged its capabilities to mark change-points that split longer tasks into smaller subtasks, like picking and placing a block. I generated 10,000 demonstrations across 20 manually curated long-horizon tasks. I’ll walk you through the data creation process in detail in this article.

RLBench²

RLBench is a learning environment widely used to benchmark robot learning, with over 100 tasks at varying levels of difficulty. The simulator captures both proprioceptive observations and visual observations, including RGB, depth, and segmentation masks from cameras at multiple angles. Each task includes varied initial object configurations to guarantee the robustness of the benchmark, with demonstrations provided by the built-in planner and the waypoints defined in the environment.

RLBench simulator in edit mode for the task of “Place the blue block in the drawer” (Image by Author)

For installation, follow the instructions in the RLBench GitHub repository: https://github.com/stepjam/RLBench

Visual representation of the dataset, for the wrist camera, front camera, overhead camera, and proprioceptive data (Image by Author)

The RLBench simulator's dataset includes low- and high-dimensional data. The low-dimensional data, also referred to as proprioceptive data, captures the robot's internal state, such as the forces and velocities of the seven joints and the gripper status. The high-dimensional data consists of visual inputs from multiple camera angles. For this dataset, we utilized three cameras: a wrist camera, a front-facing camera, and an overhead camera, providing comprehensive visual coverage of the task environment.
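For reference, this is roughly how such an observation configuration is set up with RLBench's public API; exact import paths can vary between RLBench versions:

```python
from rlbench.environment import Environment
from rlbench.action_modes.action_mode import MoveArmThenGripper
from rlbench.action_modes.arm_action_modes import JointVelocity
from rlbench.action_modes.gripper_action_modes import Discrete
from rlbench.observation_config import ObservationConfig

# Enable only the three cameras used for this dataset plus proprioception.
obs_config = ObservationConfig()
obs_config.set_all(False)
obs_config.wrist_camera.rgb = True
obs_config.front_camera.rgb = True
obs_config.overhead_camera.rgb = True
obs_config.joint_velocities = True
obs_config.joint_forces = True
obs_config.gripper_open = True

env = Environment(
    action_mode=MoveArmThenGripper(
        arm_action_mode=JointVelocity(),
        gripper_action_mode=Discrete()),
    obs_config=obs_config,
    headless=True)
env.launch()
# Each observation then exposes e.g. obs.wrist_rgb, obs.front_rgb,
# obs.overhead_rgb, obs.joint_velocities, obs.joint_forces, obs.gripper_open.
```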

Success Condition

In long-horizon tasks, a sub-task like “open the drawer” is marked complete when specific success conditions are met, allowing for a change-point to be defined in the trajectory. In simulations, these success conditions need to be clearly defined. For instance, a virtual wall can be used to mark the completion of the sub-task — when the drawer touches this virtual boundary, it signals that the sub-task is complete and a change-point can be recorded. Similarly, for a task like “pick up the block,” a change-point can be defined when the block crosses a predefined virtual boundary.
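In RLBench, such virtual boundaries are typically expressed as proximity sensors wired into success conditions. The sketch below illustrates the pattern for an "open the drawer" sub-task; the scene object names are hypothetical:

```python
from pyrep.objects.proximity import ProximitySensor
from pyrep.objects.shape import Shape
from rlbench.backend.conditions import DetectedCondition
from rlbench.backend.task import Task

class OpenDrawerSubTask(Task):
    # Abridged sketch: init_episode / variation_count omitted.
    def init_task(self) -> None:
        drawer = Shape('drawer_front')             # placeholder scene name
        virtual_wall = ProximitySensor('success')  # the "virtual wall"
        # The sub-task is complete (a change-point) once the drawer front
        # enters the sensor volume.
        self.register_success_conditions(
            [DetectedCondition(drawer, virtual_wall)])
```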

Visual representation of change-points. Left: "Open the middle drawer". Right: "Pick up the blue block". (Image by Author)

In the simulator, object positions are randomized, and the virtual walls defining success or change points are set relative to the objects. This flexibility allows us to generate data in a randomized fashion, varying elements such as object position, color, and background. For example, in the "open drawer" task, I can generate 250 demonstrations where the starting and end positions, as well as the positions of nearby objects, differ for each demonstration. This approach enhances the dataset's diversity and robustness, ensuring that models can generalize better across different environments and conditions.
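Generating those randomized demonstrations is a one-liner once the task is loaded; this assumes the `env` from the earlier snippet:

```python
from rlbench.tasks import OpenDrawer

task = env.get_task(OpenDrawer)
# Each episode re-samples the scene, so these 250 demonstrations all start
# from different object configurations.
demos = task.get_demos(amount=250, live_demos=True)
```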

Complete example with change-points/success conditions for the task of “Put the ball in the basket” (Image by Author)

This diagram visually represents a trajectory for the task “Put the ball in the basket,” which can be broken down into two distinct subtasks, marked by change-points: “Pick up the ball” and “Place the ball in the basket.” In the accompanying GIF, the first change-point is marked when the ball exits the virtual red box, signifying the completion of the picking action. The second change-point is triggered when the ball enters a second virtual red box, indicating it is now positioned correctly for placing. Green markers labeled 1 and 2 visually indicate these change-points, providing clear boundaries between the subtasks.

Code Changes

The diff below shows the changes made to each file in the RLBench codebase and how "Put the ball in the basket" looks in code. For the full code, clone this Git repository⁴.

Code changes made to RLBench codebase to define change-points and success conditions for dataset creation (Image by Author)
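As a rough illustration of what those changes amount to, here is a hypothetical sketch of how the two boundaries from the "Put the ball in the basket" example could be expressed as conditions. Object and sensor names are placeholders; see the repository⁴ for the actual implementation:

```python
from pyrep.objects.proximity import ProximitySensor
from pyrep.objects.shape import Shape
from rlbench.backend.conditions import DetectedCondition
from rlbench.backend.task import Task

class PutBallInBasket(Task):
    # Abridged sketch: init_episode / variation_count omitted.
    def init_task(self) -> None:
        ball = Shape('ball')                                # placeholder names
        pick_boundary = ProximitySensor('pick_boundary')    # first red box
        basket_sensor = ProximitySensor('basket_success')   # second red box
        # Change-point 1: "pick up the ball" ends once the ball has left
        # the region around its start pose (negated detection).
        self.pick_done = DetectedCondition(ball, pick_boundary, negated=True)
        # Change-point 2 / overall success: the ball is inside the basket.
        self.register_success_conditions(
            [DetectedCondition(ball, basket_sensor)])
```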

Examples

Let’s see a few examples of how change-points are defined in the 20 tasks.

Build Pyramid

Visual representation of decomposition for the task "Build pyramid" into sub-tasks of pick and place. (Image by Author)

The process of building a pyramid with three levels can be divided into clear "pick" and "place" actions, where each block is distinguished by color:

  1. Constructing the Base Layer (Level 1)
    - Pick up the Orange Block
    - Place the Orange Block in the center
    - Pick up the Pink Block
    - Place the Pink Block to the right of the center block
    - Pick up the Green Block
    - Place the Green Block to the left of the center block
  2. Building the Middle Layer (Level 2)
    - Pick up the Gray Block
    - Place the Gray Block on the right of the middle layer
    - Pick up the Purple Block
    - Place the Purple Block on the left of the middle layer
  3. Completing the Top Layer (Level 3)
    - Pick up the Pink Block
    - Place the Pink Block on the top of the pyramid

This sequence of actions ensures a stable pyramid assembly, with each colored block marking a clear stage in the construction. The robot trajectory is decomposed into sub-tasks of picking and placing blocks in a structured order from the base to the top. Further, we can collect all the pick sub-tasks and train a behavior cloning model conditioned on language instructions to perform a "pick" action for a specific colored block (e.g., "Pick the red block"). Similarly, we can train another behavior cloning model for the place action, also conditioned on language (e.g., "Place the red block on the base"). This method enables the robot to effectively learn both picking and placing actions based on color and task context.
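As a concrete sketch of what such a language-conditioned behavior cloning model might look like, here is a minimal PyTorch version. The architecture and dimensions are placeholders, not the paper's actual model:

```python
import torch
import torch.nn as nn

class LanguageConditionedBC(nn.Module):
    """Minimal sketch of a language-conditioned behavior cloning policy."""
    def __init__(self, obs_dim=512, lang_dim=768, act_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + lang_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim))  # e.g., 7 joint targets + gripper

    def forward(self, obs_feat, lang_emb):
        # Condition on language by concatenating the instruction embedding.
        return self.net(torch.cat([obs_feat, lang_emb], dim=-1))

# One behavior cloning step: regress expert actions on "pick" segments only.
policy = LanguageConditionedBC()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
obs_feat = torch.randn(32, 512)       # e.g., trajectory features from a clip
lang_emb = torch.randn(32, 768)       # e.g., embedding of "Pick the red block"
expert_action = torch.randn(32, 8)
opt.zero_grad()
loss = nn.functional.mse_loss(policy(obs_feat, lang_emb), expert_action)
loss.backward()
opt.step()
```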

Change Channel

Visual representation of decomposition for the task "Change channel" into sub-tasks of pick, rotate, and press. (Image by Author)

The process of changing a channel with remote control can be broken down into clear “pick,” “rotate,” and “press” actions, each defined by specific sub-tasks:

  1. Pick Up the Remote
  2. Change Remote Orientation
  3. Press the Button on the Remote

The robot's trajectory is decomposed into these structured sub-tasks of picking, orienting, and pressing. Furthermore, we can collect all the pick sub-tasks and train a behavior cloning model conditioned on language instructions to perform the "pick" action for the remote (e.g., "Pick up the remote"). Similarly, we can develop another behavior cloning model for the orientation adjustment, conditioned on commands like "Change the remote's angle," and a separate model for pressing the button, based on instructions such as "Press the channel up button." This method enables the robot to effectively learn the complete task of changing channels, adapting its actions based on context and specific language cues.

Empty Dishwasher

Visual representation of decomposition for the task "Empty dishwasher" into sub-tasks of open, pull, pick, and place. (Image by Author)

The process of emptying a dishwasher can be broken down into clear “open,” “pull,” “pick,” and “place” actions, structured as follows:

  1. Open the Dishwasher
  2. Pull the Tray Out
  3. Pick Up the Plate from the Tray
  4. Place the Plate on Top of the Dishwasher

The robot’s trajectory is decomposed into these structured sub-tasks of opening, pulling, picking, and placing. Additionally, we can collect all the pick sub-tasks and train a behavior cloning model conditioned on language instructions to perform the “pick” action for the plates (e.g., “Pick up the plate”). Similarly, we can create models for the other actions, such as “Open the dishwasher” and “Place the plate on top of the dishwasher,” enabling the robot to learn the complete task of emptying the dishwasher effectively, following natural language instructions throughout the process.

Play Hockey

Visual representation of decomposition for the task "Play hockey" into sub-tasks of pick and slide. (Image by Author)

The process of playing hockey can be broken down into clear “pick” and “slide” actions, structured as follows:

  1. Pick Up the Hockey Stick
  2. Slide the Pink Ball to the Goal
  3. Slide the Gray Ball to the Goal

The robot’s trajectory is decomposed into these structured sub-tasks of picking and sliding. Moreover, we can collect all the pick and slide sub-tasks and train separate behavior cloning models conditioned on language instructions to perform specific actions, such as “Pick up the hockey stick” and “Slide the pink ball to the goal.”

More Examples

Visual representation of decomposition of several tasks into sub-tasks marked by fulfilling success conditions for change-point detection dataset creation. (Image by Author)

Feature Generation

We extract compact features from the trajectory observations by first converting visual observations into 10 FPS videos. These videos are then segmented into 2-second clips, each labeled with the associated language instruction. We use the HERO³ feature extractor to convert these clips into features and concatenate the features from all the cameras. At the end of every 2-second clip, we capture the robot's proprioceptive data as part of the observation. These observations, along with the visual features extracted from the clip, constitute the trajectory features.
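A minimal sketch of this pipeline, with the HERO extractor abstracted behind a placeholder function, could look like this:

```python
import numpy as np

def build_trajectory_features(frames_by_cam, proprio, extract_clip_features,
                              fps=10, clip_seconds=2):
    """Sketch of the feature pipeline described above.

    frames_by_cam: dict of camera name -> uint8 array [T, H, W, 3] at `fps`
    proprio:       float array [T, D] of proprioceptive states
    extract_clip_features: stand-in for the HERO extractor; assumed to map
                           a clip of frames to a 1-D feature vector
    """
    clip_len = fps * clip_seconds  # 20 frames per 2-second clip
    T = proprio.shape[0]
    features = []
    for start in range(0, T - clip_len + 1, clip_len):
        end = start + clip_len
        # Extract features per camera and concatenate across cameras.
        visual = np.concatenate([
            extract_clip_features(frames[start:end])
            for frames in frames_by_cam.values()])
        # Append the proprioceptive state captured at the end of the clip.
        features.append(np.concatenate([visual, proprio[end - 1]]))
    return np.stack(features)
```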

Visual representation of capturing visual (high-dimensional) and proprioceptive (low-dimensional) data from a robotics trajectory, used for extracting trajectory features (Image by Author)

Results

This dataset is then used for a language-conditioned task segmentation approach for high-dimensional robot trajectories. It helped demonstrate significant improvements in trajectory segmentation accuracy on high-dimensional data (a 34% improvement) while remaining sample efficient (mAP of 61.41 with only 100 training samples). Moreover, this method of data generation makes it possible to change sub-task granularity and offers a pathway toward sim-to-real transfer of the change-point detection models. Further, behavior cloning models trained on the segmented trajectories from this dataset outperform a single model trained on the whole trajectory by up to 20%. Detailed results can be found in the paper¹ and the accompanying repository⁴.

Demonstrations

I am presenting two demonstrations that illustrate the improvements achieved by utilizing this dataset to separate sub-tasks such as “pick” and “place.” By training distinct behavior cloning (BC) models that specialize in individual actions, we can effectively stitch these models together to complete more complex tasks, rather than relying on a single unified policy.
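A minimal sketch of such stitching, with hypothetical policy and success-check interfaces, might look like this:

```python
def run_stitched_policies(task, policies, max_steps=200):
    """Sketch: execute specialized BC policies back to back.

    `policies` is an ordered list of (instruction, policy, is_done) tuples;
    `policy.act` and `is_done` are hypothetical interfaces, and `task` is an
    RLBench TaskEnvironment as in the earlier snippets.
    """
    _, obs = task.reset()
    for instruction, policy, is_done in policies:
        for _ in range(max_steps):
            action = policy.act(obs, instruction)
            obs, reward, terminate = task.step(action)
            if is_done(obs):
                # Change-point reached: hand control to the next policy.
                break
```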

The results shown below use the same amount of data but employ different approaches to demonstrate the effectiveness of this method. For a deeper understanding of the techniques and findings, please refer to my paper¹.

Comparison of single behavior cloning policy vs stitched specialized multiple behavior cloning policies to perform a task (Image by Author)

In the GIF, you will see two demonstrations, each divided into upper and lower segments, highlighting the effectiveness of our approach to task segmentation using the dataset.

Upper Segment: This segment showcases a traditional method where a single policy is trained to perform the entire long-horizon task. While this method uses the same amount of data, it struggles with accuracy and efficiency in executing sub-tasks, leading to inconsistent performance.

Lower Segment: In contrast, this segment illustrates the advantages of training separate behavior cloning models for sub-tasks such as the "pick" and "place" actions. Each model is fine-tuned to specialize in its respective sub-task, resulting in smoother and more precise movements. These models are then stitched together, demonstrating a more coherent and efficient execution of the complete task.

Overall, the video clearly demonstrates how separating sub-tasks and training specialized models can lead to significant improvements in robotic performance.

Comparison of single behavior cloning policy vs stitched specialized multiple behavior cloning policies to perform a task (Image by Author)

References:

  1. D. Raj et al., “Learning Temporally Composable Task Segmentations with Language,” IEEE/RSJ International Conference on Intelligent Robots and Systems, 2024.
  2. S. James et al., "RLBench: The Robot Learning Benchmark & Learning Environment," IEEE Robotics and Automation Letters, 2020.
  3. L. Li et al., "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training," EMNLP, 2020.
  4. GitHub: https://github.com/divraz/TrajectorySegmentation.git

Written by Divyanshu Raj

SDE@Amazon | divraz.github.io | Connect with me on LinkedIn @divyanshu-raj.
