Behavior cloning (BC) currently stands as a dominant paradigm for learning real-world visual manipulation. However, in tasks that require locally corrective behaviors like multi-part assembly, learning robust policies purely from human demonstrations remains challenging. Reinforcement learning (RL) can mitigate these limitations by allowing policies to acquire locally corrective behaviors through task reward supervision and exploration.
This paper explores the use of RL fine-tuning to improve upon BC-trained policies in precise manipulation tasks. We analyze and overcome technical challenges associated with using RL to directly train policy networks that incorporate modern architectural components like diffusion models and action chunking. We propose training residual policies on top of frozen BC-trained diffusion models using standard policy gradient methods and sparse rewards.
Our experimental results demonstrate that this residual learning framework can significantly improve success rates beyond the base BC-trained models in high-precision assembly tasks by learning corrective actions. We also show that by combining our residual learning approach with teacher-student distillation and visual domain randomization, our method can enable learning real-world policies for robotic assembly directly from RGB images.
We present a pipeline for teaching robots to perform complex assembly tasks from RGB observations. The approach combines behavior cloning (BC) and reinforcement learning (RL) to build robotic systems capable of precise manipulation.
The process involves training an initial policy with BC in simulation, improving it with RL, and then distilling the improved policy into a vision-based policy that operates from RGB images. By combining synthetic data with real-world demonstrations, we create assembly policies that can be deployed effectively in the real world.
This approach addresses challenges in robotic learning, including the need for high precision, adaptability to various initial conditions, and the ability to operate directly from visual data.
Fig. 1: (1) Beginning with a policy trained with BC in simulation, (2) we train residual policies to improve task success rates with RL and sparse rewards. (3) We then distill the resulting behaviors to a policy that operates on RGB images. (4) By combining synthetic data with a small set of real demonstrations, (5) we deploy assembly policies that operate from RGB images in the real world.
For long-horizon, complex tasks, modern policy architecture choices like action chunking and diffusion are needed to achieve non-zero task success rates with BC, which downstream RL in turn requires as a starting point. However, these components complicate RL fine-tuning. We opt for residual policies as a solution.
We train per-timestep residual policies with PPO to locally correct the action chunks predicted by a frozen BC-trained diffusion policy. This combination of behavior cloning and reinforcement learning allows flexible optimization without modifying the base model.
Fig. 2: Per-timestep residual policies trained with PPO to locally correct action chunks predicted by a BC-trained diffusion policy.
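To make the interface concrete, here is a minimal PyTorch sketch of the per-timestep residual idea, assuming the base policy is any frozen callable that maps an observation to a chunk of actions. The network sizes, the residual scale, and the stand-in base model are illustrative assumptions rather than the paper's exact implementation, and the PPO update itself (clipped surrogate on the stored log-probabilities, driven by a sparse success reward) is omitted.

```python
# Minimal sketch of the per-timestep residual-over-chunk idea (Fig. 2).
# Assumption: the base policy is any frozen model mapping an observation to a
# chunk of H actions; dimensions and the residual scale are illustrative.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, CHUNK = 32, 8, 16  # illustrative sizes


class ResidualGaussianPolicy(nn.Module):
    """Small MLP that outputs a corrective Gaussian residual for ONE timestep,
    conditioned on the current observation and the base policy's action."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.full((act_dim,), -2.0))

    def dist(self, obs: torch.Tensor, base_action: torch.Tensor):
        mean = self.trunk(torch.cat([obs, base_action], dim=-1))
        return torch.distributions.Normal(mean, self.log_std.exp())


def act(base_policy, residual, obs, residual_scale: float = 0.1):
    """Roll out one chunk: the frozen base predicts H actions at once, and the
    residual corrects each action as it is executed, one timestep at a time."""
    with torch.no_grad():
        chunk = base_policy(obs)                 # (H, ACT_DIM), frozen
    actions, log_probs = [], []
    for t in range(chunk.shape[0]):
        dist = residual.dist(obs, chunk[t])
        delta = dist.sample()                    # PPO is trained on these samples
        actions.append(chunk[t] + residual_scale * delta)
        log_probs.append(dist.log_prob(delta).sum(-1))
        # In a real rollout, obs would be refreshed from the environment here.
    return torch.stack(actions), torch.stack(log_probs)


if __name__ == "__main__":
    frozen_base = lambda o: torch.zeros(CHUNK, ACT_DIM)  # stand-in diffusion policy
    res = ResidualGaussianPolicy(OBS_DIM, ACT_DIM)
    a, lp = act(frozen_base, res, torch.zeros(OBS_DIM))
    print(a.shape, lp.shape)  # torch.Size([16, 8]) torch.Size([16])
```

Because the base model is only queried under torch.no_grad(), the diffusion policy stays frozen and the policy-gradient update only touches the small residual network, which is what keeps the optimization tractable.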
In the videos below, we visualize the residuals' predictions and how the corrections change the resulting net action: the red line is the base action, the blue line is the residual prediction, and the green line is the resulting net action.
The peg comes down a bit to the right of the hole, and the base policy tries pushing the peg down, while the residual corrects the position so the insertion is successful.
The base policy tries to descend with the peg sitting too deep in the gripper, which would likely have caused a collision between the right finger and the other peg. The residual pushes the gripper back so that it can grasp the peg between the fingers without collision.
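For readers who cannot play the videos, the snippet below re-creates the same color convention on synthetic one-dimensional signals; only the relationship net action = base action + residual is meaningful here, the curves themselves are made up.

```python
# Toy re-creation of the overlay convention used in the videos: red for the
# base action, blue for the residual prediction, green for the net action.
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(50)
base = 0.4 * np.sin(t / 8.0)              # one action dimension from the base policy
residual = -0.1 * np.tanh((t - 25) / 5)   # learned per-timestep correction
net = base + residual                     # action actually sent to the robot

plt.plot(t, base, "r", label="base action")
plt.plot(t, residual, "b", label="residual prediction")
plt.plot(t, net, "g", label="net action")
plt.xlabel("timestep")
plt.ylabel("action (one dimension)")
plt.legend()
plt.show()
```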
Table 1 compares the effectiveness of different policy architectures and fine-tuning approaches on our robotic assembly tasks.
Tab. 1: (Top) BC-trained MLPs without chunking (MLP-S) cannot perform any of the tasks, and Diffusion Policies (DP) generally outperform MLPs with chunking (MLP-C). (Bottom) Training our proposed residual policies with RL on top of frozen diffusion policies performs best among all evaluated fine-tuning techniques.
These experiments investigate the effectiveness of distilling a reinforcement learning (RL) policy trained in simulation into a vision-based policy for real-world deployment. The study focuses on how the quantity and quality of synthetic RL data impact the performance of the distilled policies. The experiments compare policies trained directly on human demonstrations with those distilled from RL agents, and explore the effects of dataset size and modality (state-based vs. image-based) on distillation performance.
Key Findings:
Fig. 4: Comparison of distilled performance from BC and RL-based teacher.
Fig. 5: BC distillation scaling with dataset size.
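The distillation step itself amounts to behavior cloning on teacher rollouts: run the RL-finetuned, state-based teacher in a visually randomized simulator, render RGB observations, and fit a vision-based student to the teacher's actions. The sketch below is a simplified, hypothetical version of that step; it regresses single actions with an MSE loss and a small ConvNet for brevity, and the random tensors stand in for rendered teacher trajectories rather than real data.

```python
# Simplified teacher-student distillation sketch: train an RGB student on
# (image, action) pairs collected from the RL-finetuned teacher in simulation.
# Architecture, loss, and data sizes are illustrative stand-ins.
import torch
import torch.nn as nn

IMG_SHAPE, ACT_DIM = (3, 96, 96), 8


class RGBStudent(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, ACT_DIM)

    def forward(self, img):
        return self.head(self.encoder(img))


def distill(student, rollouts, epochs=10, lr=1e-4):
    """rollouts: iterable of (rgb_image_batch, teacher_action_batch) pairs
    collected by rolling out the RL teacher under domain randomization."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for img, act in rollouts:
            loss = nn.functional.mse_loss(student(img), act)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student


if __name__ == "__main__":
    # Fake "rollouts" standing in for teacher trajectories rendered to RGB.
    fake = [(torch.randn(4, *IMG_SHAPE), torch.randn(4, ACT_DIM)) for _ in range(8)]
    distill(RGBStudent(), fake, epochs=1)
```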
These experiments evaluate the performance of sim-to-real policies on the physical robot. The study compares policies trained on a mixture of real-world demonstrations and simulation data against those trained solely on real-world demonstrations. The quantitative experiments focus on the "one leg" task.
Key Findings:
These results demonstrate the effectiveness of combining real-world demonstrations with simulation data for improving the performance and robustness of robotic assembly policies in real-world settings.
Tab. 2: We compare the impact of combining real-world demonstrations with simulation trajectories obtained by rolling out our RL-trained residual policies. We find that co-training with both real and synthetic data leads to improved motion quality and success rate on the one_leg task.
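One way to set up such a co-training mix with standard PyTorch data utilities is sketched below; the 50/50 sampling ratio, dataset sizes, and flat state/action tensors are illustrative assumptions rather than the paper's exact recipe (the deployed policy consumes RGB images).

```python
# Sketch of a co-training data mix in the spirit of Tab. 2: each batch draws
# from both the small real-demo set and the larger synthetic RL-rollout set.
import torch
from torch.utils.data import (ConcatDataset, DataLoader, TensorDataset,
                              WeightedRandomSampler)

real = TensorDataset(torch.randn(40, 32), torch.randn(40, 8))     # few real demos
sim = TensorDataset(torch.randn(4000, 32), torch.randn(4000, 8))  # RL rollouts

dataset = ConcatDataset([real, sim])
# Weight samples so real and synthetic data each contribute roughly half of
# every batch despite the size imbalance (50/50 split is an assumption).
weights = torch.cat([
    torch.full((len(real),), 0.5 / len(real)),
    torch.full((len(sim),), 0.5 / len(sim)),
])
loader = DataLoader(dataset, batch_size=64,
                    sampler=WeightedRandomSampler(weights, num_samples=len(dataset)))

for obs, act in loader:  # feed into the same BC objective used for the student
    pass
```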
There's a lot of excellent work related to ours in the space of manipulation and assembly, reinforcement learning, and diffusion models. Here are some notable examples:
Recent work has explored combining diffusion models with reinforcement learning:
Learning corrective residual components has seen widespread success in robotics:
There's been an increasing amount of theoretical analysis of imitation learning, with recent works focusing on the properties of noise injection and corrective actions:
These works aim to enhance the robustness and sample efficiency of imitation learning algorithms.
@misc{ankile2024imitationrefinementresidual,
title={From Imitation to Refinement -- Residual RL for Precise Visual Assembly},
author={Lars Ankile and Anthony Simeonov and Idan Shenfeld and Marcel Torne and Pulkit Agrawal},
year={2024},
eprint={2407.16677},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2407.16677},
}