From Imitation to Refinement
Residual RL for Precise Assembly

1Massachusetts Institute of Technology 2Improbable AI Lab 3Harvard University

We show how combining behavior cloning with residual RL enables precise robot manipulation, improving success rates from ~5-50% to >95% on challenging assembly tasks.

Abstract

Recent advances in behavior cloning (BC), like action-chunking and diffusion, have led to impressive progress, but imitation alone remains insufficient for tasks requiring reliable and precise movements, such as aligning and inserting objects. Our central insight is that chunked BC policies function as trajectory planners, enabling long-horizon tasks, but because they execute action chunks in an open loop, they lack the fine-grained reactivity necessary for reliable execution. Further, we find that the performance of BC policies saturates despite increasing data. We present a simple yet effective method, ResiP (Residual for Precise Manipulation), that sidesteps these challenges by augmenting a frozen, chunked BC model with a fully closed-loop residual policy trained with reinforcement learning (RL), which also avoids the complications of fine-tuning action-chunked diffusion policies with RL directly. The residual policy is trained via on-policy RL, addressing distribution shift and reactivity without altering the BC trajectory planner. Evaluation on high-precision manipulation tasks shows that ResiP outperforms both BC methods and direct RL fine-tuning.

The key insight

Modern chunked BC policies work well for planning trajectory segments but lack the corrective behavior and fine-grained reactivity needed for reliable execution.

Even massively scaling the dataset does not resolve this: pure BC performance saturates even with 100K demos. Adding residual RL dramatically improves success rates with just 50 demos plus RL fine-tuning.




Introducing ResiP: Residual for Precise Manipulation

Our method, ResiP, adds closed-loop corrections via residual RL while keeping the base model frozen. As such, we effectively sidestep the complications of applying RL to action-chunked and diffusion policies and retain all modes from pre-training.

The reward signal is only a sparse task-completion reward inferred directly from the demos, so no reward shaping is required.
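To make this concrete, here is a minimal, hypothetical sketch of such a sparse completion check (not the paper's implementation): the reward is 1 only when the part reaches a goal pose extracted from successful demos, and 0 otherwise. All function names, arguments, and tolerances below are our own assumptions.

# Hypothetical sketch of a sparse task-completion reward: 1.0 only when the
# part reaches a goal pose (taken from successful demos), otherwise 0.0.
import numpy as np

def sparse_completion_reward(part_pos, part_quat, goal_pos, goal_quat,
                             pos_tol=0.005, ang_tol=0.1):
    """Return 1.0 iff the part is within tolerance of the demo goal pose."""
    pos_err = np.linalg.norm(part_pos - goal_pos)
    # Angular distance between unit quaternions (radians).
    ang_err = 2.0 * np.arccos(np.clip(abs(np.dot(part_quat, goal_quat)), 0.0, 1.0))
    return 1.0 if (pos_err < pos_tol and ang_err < ang_tol) else 0.0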




Closed-loop corrections lead to large performance gains

By fine-tuning action-chunked diffusion policies with a per-step residual correction learned with online RL, we can drastically improve success rates for a range of multi-step tasks requiring precise alignments.




Remarkable performance for tight-tolerance tasks

ResiP really shines on high-precision tasks. For a 0.2mm clearance peg-in-hole task, we improve from 5% → 99% success! The local nature of residual corrections is perfect for precise alignment.




What does the residual learn?

How does it work? The residual policy learns to make small corrective adjustments to avoid common failure modes:

  • Prevents premature pushing during insertion
  • Maintains better grasp stability
  • Enables recovery from small perturbations
  • Learns new grasping and retrying behaviors (surprisingly!)




Real-World Policies with Sim-to-Real

Finally, we demonstrate successful sim-to-real transfer by:

  • Distilling to vision-based policies
  • Co-training with real demos
  • Using domain randomization

Improved assembly on in-distribution task

Real demos only

Real demos + ResiP distillation


Pipeline is task-agnostic

Real demos + ResiP distillation

Real demos + ResiP distillation


Simulation allows for assembly of unseen parts

Real demos only

Real demos + ResiP distillation

Sim demos

To fully appreciate the difference between imitation alone and imitation augmented with the reactive controller learned with RL, please enjoy the ~3-7 hours of rollouts per task at 4x speed below. These are 1,000 consecutive rollouts for each policy without any editing or cherry-picking.

1,000 uncut rollouts for one_leg with low randomness

BC from 50 human demonstrations

BC fine-tuned with residual

1,000 uncut rollouts for round_table with med randomness

BC from 50 human demonstrations

BC fine-tuned with residual

Methods

System Overview

We present a pipeline for teaching robots to perform complex assembly tasks from RGB observations. The approach combines behavioral cloning (BC) and reinforcement learning (RL) techniques to develop robotic systems capable of precise manipulation.

The process involves training an initial policy through BC in simulation, enhancing it with RL, and then distilling the improved policy into a vision-based policy operating from RGB images. By integrating synthetic data with real-world demonstrations, we can create assembly policies that can be effectively deployed in the real world.

This approach addresses challenges in robotic learning, including the need for high precision, adaptability to various initial conditions, and the ability to operate directly from visual data.

Fig. 1: (1) Beginning with a policy trained with BC in simulation, (2) we train residual policies to improve task success rates with RL and sparse rewards. (3) We then distill the resulting behaviors to a policy that operates on RGB images. (4) By combining synthetic data with a small set of real demonstrations, (5) we deploy assembly policies that operate from RGB images in the real world.
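For orientation, here is a rough sketch of the pipeline stages in Fig. 1, with the stage implementations passed in as callables. The function and argument names are our own shorthand for the steps described above, not the released code.

# Rough, hypothetical sketch of the pipeline in Fig. 1; each stage is a
# callable supplied by the user, so this only fixes the order of the steps.
def build_assembly_policy(train_bc, train_residual, collect_rollouts,
                          distill, cotrain, sim_demos, real_demos):
    base = train_bc(sim_demos)                      # (1) BC policy in simulation
    residual = train_residual(base)                 # (2) residual RL, sparse reward
    synthetic = collect_rollouts(base, residual)    # roll out the improved policy
    student = distill(synthetic)                    # (3) distill to an RGB policy
    return cotrain(student, synthetic, real_demos)  # (4) co-train, (5) deploy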


Residual RL over Action Chunks

For long-horizon and complex tasks, modern policy architecture choices, like action chunking and diffusion, are needed for BC to achieve non-zero task success, which in turn is a prerequisite for downstream RL. However, these same choices complicate RL fine-tuning. We opt for residual policies as a solution.

We train per-timestep residual policies with PPO to locally correct the action chunks predicted by a BC-trained diffusion policy:

  • Base Model (Diffusion Policy): Predicts action chunks, trained via behavioral cloning.
  • Residual Model: Makes per-timestep corrections, trained with reinforcement learning.

This approach combines behavioral cloning and reinforcement learning, allowing flexible optimization without modifying the base model.
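The composition itself is simple. Below is a minimal sketch, assuming hypothetical base_policy, residual_policy, and env interfaces: the frozen BC model is re-queried for a chunk every few steps, while the residual adds a small correction to every executed action.

# Minimal sketch (not the authors' code) of composing a frozen, chunked base
# policy with a per-timestep residual correction; interfaces are assumed.
import numpy as np

CHUNK = 8     # action-chunk length predicted by the frozen BC model (assumed)
ALPHA = 0.1   # scale applied to the residual correction (assumed)

def rollout_with_residual(env, base_policy, residual_policy, max_steps=500):
    """Roll out the frozen chunked planner with a per-timestep residual on top."""
    obs = env.reset()
    chunk, idx = None, CHUNK
    for _ in range(max_steps):
        if idx >= CHUNK:                            # re-query the frozen planner
            chunk = base_policy.predict_chunk(obs)  # shape: (CHUNK, act_dim)
            idx = 0
        base_action = chunk[idx]
        # The residual conditions on the current observation and the base action
        # and outputs a small correction; only this network is trained with PPO.
        delta = residual_policy.act(np.concatenate([obs, base_action]))
        obs, reward, done, info = env.step(base_action + ALPHA * delta)
        idx += 1
        if done:
            break

Because only the residual network receives gradient updates, the base planner, and the modes it learned from the demonstrations, remains untouched.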

Fig. 2: Per-timestep residual policies trained with PPO to locally correct action chunks predicted by a BC-trained diffusion policy.



The residual model learns local corrections

In the videos below, we visualize the residual predictions and how these corrections change the resulting net action: the red line is the base action, the blue line is the residual prediction, and the green line is the resulting net action.


Corrections during insertion

The peg comes down a bit to the right of the hole, and the base policy tries pushing the peg down, while the residual corrects the position so the insertion is successful.

Corrections during grasping

The base policy tries to go down with the peg too deep in the gripper, which would probably have caused a collision between the right finger and the other peg. The residual pushes the gripper back so that it can get the peg between the fingers without collision.

Results

Comparing RL to BC and baselines

Table 1 demonstrates the effectiveness of different approaches in robotic assembly tasks. Key findings include:

  • Basic MLPs without action chunking (MLP-S) fail completely across all tasks.
  • Diffusion Policies (DP) generally outperform MLPs with chunking (MLP-C) in imitation learning.
  • ResiP shows significant improvements over both imitation learning baselines and alternative RL fine-tuning methods, particularly in tasks with lower initial randomization.

Tab. 1: Top: BC-trained MLPs without chunking (MLP-S) cannot perform any of the tasks, and Diffusion Policies (DP) generally outperform MLPs with chunking (MLP-C). Bottom: Training our proposed residual policies with RL on top of frozen diffusion policies performs best among all evaluated fine-tuning techniques.



Distilling from an RL expert improves performance

These experiments investigate the effectiveness of distilling a reinforcement learning (RL) policy trained in simulation into a vision-based policy for real-world deployment. The study focuses on how the quantity and quality of synthetic RL data impact the performance of the distilled policies. The experiments compare policies trained directly on human demonstrations with those distilled from RL agents, and explore the effects of dataset size and modality (state-based vs. image-based) on distillation performance.

Key Findings:

  • Distilling trajectories from the RL agent (73% success rate) outperforms training directly on human demonstrations (50% success rate).
  • A performance gap exists between the RL-trained teacher (95%) and the distilled student policy (73%).
  • The change in modality (state-based to image-based) is not the primary cause of this performance gap.
  • Increasing the distillation dataset size improves performance, but a gap persists even with large datasets (77% success rate with 10k trajectories vs. 95% for the teacher).
  • The ability to generate large-scale synthetic datasets in simulation provides a significant advantage for improving distilled policy performance.

Fig. 4: Comparison of distilled performance from BC and RL-based teacher.

Fig. 5: BC distillation scaling with dataset size.
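As a rough illustration of the distillation step described above, the sketch below trains a hypothetical vision-based student by supervised learning on (image, action) pairs collected from teacher rollouts. A plain regression loss stands in for whatever BC objective the student actually uses (the paper distills into a diffusion policy), and distill_student and rollout_dataset are assumed names.

# Hypothetical sketch of distillation: behavior cloning of a vision student on
# (image, action) pairs collected by rolling out the RL teacher in simulation.
import torch
import torch.nn as nn

def distill_student(student: nn.Module, rollout_dataset, epochs=10, lr=1e-4):
    """Supervised distillation of teacher rollouts into a vision-based student."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    loader = torch.utils.data.DataLoader(rollout_dataset, batch_size=256,
                                         shuffle=True)
    for _ in range(epochs):
        for images, actions in loader:
            pred = student(images)                        # student maps RGB -> action
            loss = nn.functional.mse_loss(pred, actions)  # simple regression BC loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student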



Simulation data improves performance in the real-world

These experiments evaluate the performance of sim-to-real policies on the physical robot. The study compares policies trained on a mixture of real-world demonstrations and simulation data against those trained solely on real-world demonstrations. The quantitative experiments focus on the "one leg" task.

Key Findings:

  • Incorporating simulation data significantly improves real-world performance, increasing task completion rates from 20-30% to 50-60%.
  • Policies co-trained with simulation data exhibit smoother behavior and make fewer erratic movements.
  • Performance improvements are observed across various subtasks (corner alignment, grasping, insertion, and screwing).
  • The combination of 40 real demonstrations and 350 simulated trajectories yields the best results, with up to 60% task completion rate.
  • Co-trained policies show improved generalization to both part pose and obstacle pose randomizations.

These results demonstrate the effectiveness of combining real-world demonstrations with simulation data for improving the performance and robustness of robotic assembly policies in real-world settings.

Tab. 2: We compare the impact of combining real-world demonstrations with simulation trajectories obtained by rolling out our RL-trained residual policies. We find that co-training with both real and synthetic data leads to improved motion quality and success rate on the one_leg task.
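One simple way to realize this kind of co-training is to oversample the small real-demo set when forming each training batch. The sketch below is a hypothetical illustration of that data mixing, not the authors' implementation; the 50/50 real fraction is an assumption, not a reported setting.

# Hypothetical sketch of co-training data mixing: the small set of real demos
# is oversampled so each batch contains a fixed fraction of real data.
import random

def sample_cotraining_batch(real_demos, sim_trajs, batch_size=64, real_frac=0.5):
    """Draw a batch with a fixed fraction of (oversampled) real demonstrations."""
    n_real = int(batch_size * real_frac)
    batch = [random.choice(real_demos) for _ in range(n_real)]
    batch += [random.choice(sim_trajs) for _ in range(batch_size - n_real)]
    random.shuffle(batch)
    return batch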

Related Links

There's a lot of excellent work related to ours in the space of manipulation and assembly, reinforcement learning, and diffusion models. Here are some notable examples:

Manipulation and Assembly

  • FurnitureBench introduces a real-world furniture assembly benchmark, providing a reproducible and easy-to-use platform for long-horizon complex robotic manipulation that we use in our work.
  • ASAP is a physics-based planning approach for automatically generating sequences for general-shaped assemblies, accounting for gravity to design a sequence where each sub-assembly is physically stable.
  • InsertionNet 1.0 and InsertionNet 2.0 address the problem of insertion specifically and propose regression-based methods that combine visual and force inputs to solve various insertion tasks efficiently and robustly.
  • Grasping with Chopsticks develops an autonomous chopsticks-equipped robotic manipulator for picking up small objects, using approaches to reduce covariate shift and improve generalization.

Diffusion Models and Reinforcement Learning

Recent work has explored combining diffusion models with reinforcement learning:

  • Black et al. and Fan et al. studied how to cast diffusion de-noising as a Markov Decision Process, enabling preference-aligned image generation with policy gradient RL.
  • IDQL uses a Q-function to select the best among multiple diffusion model outputs.
  • Goo et al. explored advantage weighted regression for diffusion models.
  • Decision Diffuser and related works change the objective into a supervised learning problem with return conditioning.
  • Wang et al. explored augmenting the de-noising training objective with a Q-function maximization objective.

Residual Learning in Robotics

Learning corrective residual components has seen widespread success in robotics:

  • Works like Silver et al., Davchev et al., and others have explored learning residual policies that correct for errors made by a nominal behavior policy.
  • Ajay et al. and Kloss et al. combined learned components to correct for inaccuracies in analytical models for physical dynamics.
  • Schoettler et al. applied residual policies to insertion tasks.
  • TRANSIC by Jiang et al. applied residual policy learning to the FurnitureBench task suite, using the residual component to model online human-provided corrections.

Theoretical Analysis of Imitation Learning

There's been an increasing amount of theoretical analysis of imitation learning, with recent works focusing on the properties of noise injection and corrective actions:

  • Provable Guarantees for Generative Behavior Cloning proposes a framework for generative behavior cloning, ensuring continuity through data augmentation and noise injection.
  • CCIL generates corrective data using local continuity in environment dynamics.
  • TaSIL penalizes deviations in higher-order Taylor series terms between learned and expert policies.

These works aim to enhance the robustness and sample efficiency of imitation learning algorithms.

Future Directions

Exciting future directions: our method is base-model agnostic; we show it works with diffusion models, ACT, and MLPs!

This means it could potentially scale to fine-tuning large multi-task behavior models (like Octo, OpenVLA, etc.) while fully preserving their pre-training capabilities.

BibTeX

@misc{ankile2024imitationrefinementresidual,
  title={From Imitation to Refinement -- Residual RL for Precise Assembly},
  author={Lars Ankile and Anthony Simeonov and Idan Shenfeld and Marcel Torne and Pulkit Agrawal},
  year={2024},
  eprint={2407.16677},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2407.16677},
}