From Imitation to Refinement
Residual RL for Precise Visual Assembly

¹Massachusetts Institute of Technology  ²Improbable AI Lab  ³Harvard University

Perform complex assembly tasks in the real world from RGB observations by learning from demonstrations and refining with reinforcement learning.

Abstract

Behavior cloning (BC) currently stands as a dominant paradigm for learning real-world visual manipulation. However, in tasks that require locally corrective behaviors like multi-part assembly, learning robust policies purely from human demonstrations remains challenging. Reinforcement learning (RL) can mitigate these limitations by allowing policies to acquire locally corrective behaviors through task reward supervision and exploration.

This paper explores the use of RL fine-tuning to improve upon BC-trained policies in precise manipulation tasks. We analyze and overcome technical challenges associated with using RL to directly train policy networks that incorporate modern architectural components like diffusion models and action chunking. We propose training residual policies on top of frozen BC-trained diffusion models using standard policy gradient methods and sparse rewards.

Our experimental results demonstrate that this residual learning framework can significantly improve success rates beyond the base BC-trained models in high-precision assembly tasks by learning corrective actions. We also show that by combining our residual learning approach with teacher-student distillation and visual domain randomization, our method can enable learning real-world policies for robotic assembly directly from RGB images.

Real-World Demos

Improved assembly on in-distribution task

40 Real Demos

40 Real Demos + 350 RL Sim Demos


Pipeline is task-agnostic

40 Real Demos + 400 RL Sim Demos

40 Real Demos + 400 RL Sim Demos


Simulation allows for assembly of unseen parts

40 Real Demos

40 Real Demos + 400 RL Sim Demos

Methods

System Overview

We present a pipeline for teaching robots to perform complex assembly tasks from RGB observations. The approach combines behavioral cloning (BC) and reinforcement learning (RL) techniques to develop robotic systems capable of precise manipulation.

The process involves training an initial policy through BC in simulation, enhancing it with RL, and then distilling the improved policy into a vision-based policy operating from RGB images. By integrating synthetic data with real-world demonstrations, we can create assembly policies that can be effectively deployed in the real world.

This approach addresses challenges in robotic learning, including the need for high precision, adaptability to various initial conditions, and the ability to operate directly from visual data.

Fig. 1: (1) Beginning with a policy trained with BC in simulation, (2) we train residual policies to improve task success rates with RL and sparse rewards. (3) We then distill the resulting behaviors to a policy that operates on RGB images. (4) By combining synthetic data with a small set of real demonstrations, (5) we deploy assembly policies that operate from RGB images in the real world.
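
The fine-tuning stage in step (2) relies only on a sparse task reward. As a rough illustration (not the exact reward used in the paper), a sparse assembly reward can be written as a success indicator over part poses; the pose and rotation tolerances below are hypothetical.

```python
import numpy as np

def sparse_assembly_reward(part_poses, goal_poses, pos_tol=0.005, rot_tol=0.05):
    """Return 1.0 only when every part is within a position/rotation
    tolerance of its assembled (goal) pose, else 0.0.

    part_poses, goal_poses: (N, 7) arrays of [x, y, z, qw, qx, qy, qz].
    The tolerances here are illustrative, not the paper's values.
    """
    pos_err = np.linalg.norm(part_poses[:, :3] - goal_poses[:, :3], axis=1)
    # Quaternion distance: angle between current and goal orientations.
    dots = np.abs(np.sum(part_poses[:, 3:] * goal_poses[:, 3:], axis=1))
    rot_err = 2.0 * np.arccos(np.clip(dots, -1.0, 1.0))
    success = np.all(pos_err < pos_tol) and np.all(rot_err < rot_tol)
    return 1.0 if success else 0.0
```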


Residual RL over Action Chunks

For long-horizon and complex tasks, modern policy architectural choices, such as action chunking and diffusion, are necessary to achieve a non-zero task success rate from BC, which a downstream RL stage requires as a starting point. However, these choices complicate RL fine-tuning. We opt for residual policies as a solution.

We train per-timestep residual policies with PPO to locally correct the action chunks predicted by a BC-trained diffusion policy:

  • Base Model (Diffusion Policy): Predicts action chunks, trained via behavioral cloning.
  • Residual Model: Makes per-timestep corrections, trained with reinforcement learning.

This approach combines behavioral cloning and reinforcement learning, allowing flexible optimization without modifying the base model.

Fig. 2: Per-timestep residual policies trained with PPO to locally correct action chunks predicted by a BC-trained diffusion policy.
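
Below is a minimal sketch of this action composition, assuming a frozen base policy that emits an H-step action chunk and a small Gaussian MLP residual head. The network sizes, residual scale, and method names are illustrative, and the PPO update itself is omitted.

```python
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    """Gaussian MLP that outputs a per-timestep correction to the base
    action (illustrative architecture, not the paper's exact one)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256, scale: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        self.scale = scale  # keeps the correction local around the base action

    def forward(self, obs, base_action):
        mean = self.net(torch.cat([obs, base_action], dim=-1))
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        delta = dist.rsample()
        # Net action = base action + scaled residual correction.
        return base_action + self.scale * delta, dist.log_prob(delta).sum(-1)


# Rollout pattern (hypothetical interfaces): the frozen diffusion policy
# predicts a chunk of H actions; the residual corrects each one at execution
# time. The log-probs and sparse rewards would then feed a standard PPO
# update, which is omitted here.
#
#   chunk = base_policy.predict_chunk(obs)       # (H, act_dim), frozen BC model
#   for a_base in chunk:
#       a, logp = residual(encode(obs), a_base)
#       obs, reward, done, info = env.step(a)
```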



The residual model learns local corrections

In the videos below, we visualize the residual predictions and how the corrections change the resulting net action. The red line is the base action, the blue line is the residual prediction, and the green line is the resulting net action.


Corrections during insertion

The peg comes down slightly to the right of the hole; the base policy keeps pushing the peg down, while the residual corrects the position so that the insertion succeeds.

Corrections during grasping

The base policy tries to go down with the peg too deep in the gripper, which would likely have caused a collision between the right finger and the other peg. The residual pushes the gripper back so that it can grasp the peg between the fingers without a collision.
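
For a sense of how such an overlay can be rendered, the snippet below plots entirely made-up 2D trajectories in the same color scheme (red base, blue residual, green net); it is illustrative only and uses none of the paper's data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up trajectories: the base action drifts slightly right of the hole
# while descending; the residual nudges it back toward the hole center.
t = np.linspace(0.0, 1.0, 15)
base = np.stack([0.010 * np.ones_like(t), 0.10 * (1.0 - t)], axis=1)  # (x, z) in meters
residual = np.stack([-0.010 * t, np.zeros_like(t)], axis=1)
net = base + residual

plt.plot(base[:, 0], base[:, 1], "r.-", label="base action")
plt.plot(net[:, 0], net[:, 1], "g.-", label="net action")
plt.quiver(base[:, 0], base[:, 1], residual[:, 0], residual[:, 1],
           angles="xy", scale_units="xy", scale=1.0,
           color="b", width=0.004, label="residual")
plt.axvline(0.0, linestyle="--", color="gray", label="hole center")
plt.xlabel("x (m)")
plt.ylabel("z (m)")
plt.legend()
plt.show()
```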

Results

Comparing RL to BC and baselines

Table 1 demonstrates the effectiveness of different approaches in robotic assembly tasks. Key findings include:

  • Basic MLPs without action chunking (MLP-S) fail completely across all tasks.
  • Diffusion Policies (DP) generally outperform MLPs with chunking (MLP-C) in imitation learning.
  • ResiP shows significant improvements over both imitation learning baselines and alternative RL fine-tuning methods, particularly in tasks with lower initial randomization.

Tab. 1: (Top) BC-trained MLPs without chunking (MLP-S) cannot perform any of the tasks, and Diffusion Policies (DP) generally outperform MLPs with chunking (MLP-C). (Bottom) Training our proposed residual policies with RL on top of frozen diffusion policies performs best among all evaluated fine-tuning techniques.



Distilling from an RL expert improves performance

These experiments investigate the effectiveness of distilling a reinforcement learning (RL) policy trained in simulation into a vision-based policy for real-world deployment. The study focuses on how the quantity and quality of synthetic RL data impact the performance of the distilled policies. The experiments compare policies trained directly on human demonstrations with those distilled from RL agents, and explore the effects of dataset size and modality (state-based vs. image-based) on distillation performance.

Key Findings:

  • Distilling trajectories from the RL agent (73% success rate) outperforms training directly on human demonstrations (50% success rate).
  • A performance gap exists between the RL-trained teacher (95%) and the distilled student policy (73%).
  • The change in modality (state-based to image-based) is not the primary cause of this performance gap.
  • Increasing the distillation dataset size improves performance, but a gap persists even with large datasets (77% success rate with 10k trajectories vs. 95% for the teacher).
  • The ability to generate large-scale synthetic datasets in simulation provides a significant advantage for improving distilled policy performance.

Fig. 4: Comparison of distilled performance from BC and RL-based teacher.

Fig. 5: BC distillation scaling with dataset size.
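
A rough sketch of the distillation data pipeline follows, under stated assumptions: the gym-style `env`, the teacher's `act` method, and the `student` vision network are placeholders, and a plain MSE regression stands in for the diffusion-policy student used in the paper.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def collect_teacher_rollouts(env, teacher, n_episodes: int):
    """Roll out the state-based RL teacher in simulation and pair each
    rendered RGB frame with the executed action. Assumes a gym-style env
    whose render() returns an HxWx3 uint8 array; in practice one would
    likely keep only successful episodes."""
    images, actions = [], []
    for _ in range(n_episodes):
        obs, done = env.reset(), False
        while not done:
            action = teacher.act(obs)  # hypothetical teacher interface
            frame = torch.from_numpy(env.render()).permute(2, 0, 1).float() / 255.0
            images.append(frame)
            actions.append(torch.as_tensor(action, dtype=torch.float32))
            obs, reward, done, info = env.step(action)
    return torch.stack(images), torch.stack(actions)


def distill(student: nn.Module, images, actions, epochs: int = 10, lr: float = 1e-4):
    """Behavior-clone the student vision policy on the synthetic rollouts
    (MSE regression here; the paper's student is a diffusion policy)."""
    loader = DataLoader(TensorDataset(images, actions), batch_size=64, shuffle=True)
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for img, act in loader:
            loss = nn.functional.mse_loss(student(img), act)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```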



Simulation data improves performance in the real world

These experiments evaluate the performance of sim-to-real policies on the physical robot. The study compares policies trained on a mixture of real-world demonstrations and simulation data against those trained solely on real-world demonstrations. The quantitative experiments focus on the "one leg" task.

Key Findings:

  • Incorporating simulation data significantly improves real-world performance, increasing task completion rates from 20-30% to 50-60%.
  • Policies co-trained with simulation data exhibit smoother behavior and make fewer erratic movements.
  • Performance improvements are observed across various subtasks (corner alignment, grasping, insertion, and screwing).
  • The combination of 40 real demonstrations and 350 simulated trajectories yields the best results, with up to 60% task completion rate.
  • Co-trained policies show improved generalization to both part pose and obstacle pose randomizations.

These results demonstrate the effectiveness of combining real-world demonstrations with simulation data for improving the performance and robustness of robotic assembly policies in real-world settings.

Tab. 2: We compare the impact of combining real-world demonstrations with simulation trajectories obtained by rolling out our RL-trained residual policies. We find that co-training with both real and synthetic data leads to improved motion quality and success rate on the one_leg task.
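
As a sketch of this co-training setup, real demonstrations and simulated rollouts can be merged into one dataset with a weighted sampler controlling the real/sim mix; the placeholder tensors and the 50/50 sampling ratio below are assumptions, not the paper's exact recipe.

```python
import torch
from torch.utils.data import (ConcatDataset, DataLoader, TensorDataset,
                              WeightedRandomSampler)

# Placeholder (observation, action) tensors standing in for the real demos
# and the simulated RL rollouts; shapes and sizes are made up.
real = TensorDataset(torch.randn(40 * 50, 32), torch.randn(40 * 50, 10))    # ~40 real demos
sim = TensorDataset(torch.randn(350 * 50, 32), torch.randn(350 * 50, 10))   # ~350 sim rollouts

mixed = ConcatDataset([real, sim])

# Weight samples so roughly half of each batch comes from real data even
# though the sim set is much larger (the actual mixing ratio may differ).
weights = torch.cat([
    torch.full((len(real),), 0.5 / len(real)),
    torch.full((len(sim),), 0.5 / len(sim)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
loader = DataLoader(mixed, batch_size=256, sampler=sampler)

# `loader` now yields mixed real/sim batches for standard BC training of the
# RGB policy.
```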

Related Links

There's a lot of excellent work related to ours in the space of manipulation and assembly, reinforcement learning, and diffusion models. Here are some notable examples:

Manipulation and Assembly

  • FurnitureBench introduces a real-world furniture assembly benchmark, providing a reproducible and easy-to-use platform for long-horizon complex robotic manipulation that we use in our work.
  • ASAP is a physics-based planning approach for automatically generating sequences for general-shaped assemblies, accounting for gravity to design a sequence where each sub-assembly is physically stable.
  • InsertionNet 1.0 and InsertionNet 2.0 address the problem of insertion specifically and propose regression-based methods that combine visual and force inputs to solve various insertion tasks efficiently and robustly.
  • Grasping with Chopsticks develops an autonomous chopsticks-equipped robotic manipulator for picking up small objects, using approaches to reduce covariate shift and improve generalization.

Diffusion Models and Reinforcement Learning

Recent work has explored combining diffusion models with reinforcement learning:

  • Black et al. and Fan et al. studied how to cast diffusion de-noising as a Markov Decision Process, enabling preference-aligned image generation with policy gradient RL.
  • IDQL uses a Q-function to select the best among multiple diffusion model outputs.
  • Goo et al. explored advantage weighted regression for diffusion models.
  • Decision Diffuser and related works change the objective into a supervised learning problem with return conditioning.
  • Wang et al. explored augmenting the de-noising training objective with a Q-function maximization objective.

Residual Learning in Robotics

Learning corrective residual components has seen widespread success in robotics:

  • Works like Silver et al., Davchev et al., and others have explored learning residual policies that correct for errors made by a nominal behavior policy.
  • Ajay et al. and Kloss et al. combined learned components to correct for inaccuracies in analytical models for physical dynamics.
  • Schoettler et al. applied residual policies to insertion tasks.
  • TRANSIC by Jiang et al. applied residual policy learning to the FurnitureBench task suite, using the residual component to model online human-provided corrections.

Theoretical Analysis of Imitation Learning

There's been an increasing amount of theoretical analysis of imitation learning, with recent works focusing on the properties of noise injection and corrective actions:

  • Provable Guarantees for Generative Behavior Cloning proposes a framework for generative behavior cloning, ensuring continuity through data augmentation and noise injection.
  • CCIL generates corrective data using local continuity in environment dynamics.
  • TaSIL penalizes deviations in higher-order Taylor series terms between learned and expert policies.

These works aim to enhance the robustness and sample efficiency of imitation learning algorithms.

BibTeX

@misc{ankile2024imitationrefinementresidual,
      title={From Imitation to Refinement -- Residual RL for Precise Visual Assembly},
      author={Lars Ankile and Anthony Simeonov and Idan Shenfeld and Marcel Torne and Pulkit Agrawal},
      year={2024},
      eprint={2407.16677},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2407.16677},
}