SLAC: Simulation-Pretrained Latent Action Space for Whole-Body Real-World RL



Learning tasks in the real world with reinforcement learning (RL) is very challenging, especially for complex robots such as bimanual mobile manipulators. The high-dimensional, unconstrained action spaces worsen the already poor RL sample efficiency and raise significant safety concerns because naive exploration can easily damage the robot or its environment.

We address this problem with a powerful insight: both sample efficiency and safety can be dramatically improved by operating in the right action space, one that can be pre-learned entirely in simulation.






Summary Video


Two-Step SLAC Procedure

We introduce SLAC, a framework that makes real-world RL feasible for high-DoF robots through a learned latent action space. SLAC consists of two steps: first, it leverages a low-fidelity simulator to pretrain a task-agnostic latent action space; then, it uses this space to drive real-world policy learning.

Step 1: Latent Action Space Learning in Simulation

How can we learn a good latent action space that lets us efficiently handle different downstream tasks? Unsupervised Skill Discovery (USD) provides an elegant solution by maximizing the empowerment of the latent action space. However, USD often requires a large number of samples to converge, which makes it impractical to run on a real robot.

But we do not necessarily have to run USD on a real robot! Instead, SLAC leverages a low-fidelity simulator for USD. Even if there is a reality gap, the learned latent action space can still be used to drive real-world policy learning as long as it has enough coverage. Specifically, SLAC designs a novel USD objective that encourages temporally extended, disentangled, and safe latent actions, through the following pipeline:
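To make the skill-discovery idea concrete, here is a minimal sketch of a DIAYN-style USD objective in PyTorch: an intrinsic reward that encourages different latent actions to reach distinguishable states, i.e., maximizing the mutual information between the latent action and the visited state. This only illustrates the general empowerment objective; the actual SLAC objective, including its temporal-extension, disentanglement, and safety terms, is not reproduced here, and all network sizes and hyperparameters below are assumptions.

```python
# Illustrative DIAYN-style skill-discovery sketch (not the exact SLAC objective).
# Shapes and hyperparameters are assumed for the example.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, NUM_LATENTS = 32, 16  # assumed dimensions

# Discriminator q(z | s): predicts which latent action produced a state.
discriminator = nn.Sequential(
    nn.Linear(STATE_DIM, 128), nn.ReLU(),
    nn.Linear(128, NUM_LATENTS),
)
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=3e-4)

def intrinsic_reward(next_state: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Empowerment-style reward: log q(z | s') - log p(z), with p(z) uniform."""
    with torch.no_grad():
        log_q = F.log_softmax(discriminator(next_state), dim=-1)
        log_p = -torch.log(torch.tensor(float(NUM_LATENTS)))
        return log_q.gather(-1, z.unsqueeze(-1)).squeeze(-1) - log_p

def update_discriminator(next_state: torch.Tensor, z: torch.Tensor) -> float:
    """Train q(z | s') to recover the latent action from the visited state."""
    loss = F.cross_entropy(discriminator(next_state), z)
    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()
    return loss.item()

# Example update step on a batch from the (low-fidelity) simulator:
batch_states = torch.randn(8, STATE_DIM)        # placeholder sim states
batch_z = torch.randint(0, NUM_LATENTS, (8,))   # sampled latent actions
r_int = intrinsic_reward(batch_states, batch_z) # feeds the low-level RL update
update_discriminator(batch_states, batch_z)
```

The intrinsic reward is maximized by a low-level policy in simulation, so that each latent action reliably produces a distinct, reusable behavior.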

Step 2: Real-World RL in the Learned Action Space

Once the latent action space is learned, SLAC trains a downstream policy in this space. Specifically, we derive a novel off-policy RL algorithm on top of Soft Actor-Critic. Our algorithm not only handles discrete action spaces, but also leverages the dependencies between reward terms and latent action dimensions to decompose RL training into parallel, easier sub-problems, greatly reducing sample complexity. We show the pipeline for downstream policy learning below:
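As a rough illustration of this decomposition, the sketch below assumes the total reward splits into per-dimension terms, each depending mainly on one discrete latent action dimension, so that each dimension gets its own discrete soft Q-function and policy head trained in parallel. This is a simplified, hypothetical rendering of the factored-critic idea, not the exact SLAC update; all names, shapes, and hyperparameters are assumptions.

```python
# Illustrative factored discrete soft-critic update (not the exact SLAC algorithm).
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, NUM_CHOICES, NUM_FACTORS = 64, 8, 3   # assumed sizes
GAMMA, ALPHA = 0.99, 0.05                      # discount and entropy temperature

class FactorHead(nn.Module):
    """Discrete critic and policy over one latent action dimension."""
    def __init__(self):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(),
                               nn.Linear(128, NUM_CHOICES))
        self.pi = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(),
                                nn.Linear(128, NUM_CHOICES))

heads = [FactorHead() for _ in range(NUM_FACTORS)]
opts = [torch.optim.Adam(h.parameters(), lr=3e-4) for h in heads]

def soft_q_target(head, reward_k, next_obs, done):
    """Discrete soft target: r_k + gamma * E_{z~pi}[Q(s', z) - alpha * log pi(z|s')]."""
    with torch.no_grad():
        probs = F.softmax(head.pi(next_obs), dim=-1)
        log_probs = torch.log(probs + 1e-8)
        v_next = (probs * (head.q(next_obs) - ALPHA * log_probs)).sum(-1)
        return reward_k + GAMMA * (1.0 - done) * v_next

# Placeholder batch; in practice this comes from the real-world replay buffer.
obs, next_obs = torch.randn(16, OBS_DIM), torch.randn(16, OBS_DIM)
done = torch.zeros(16)
z = torch.randint(0, NUM_CHOICES, (16, NUM_FACTORS))   # executed latent actions
reward_terms = torch.randn(16, NUM_FACTORS)            # one reward term per factor

# Each factor's critic is updated in parallel on its own reward term.
for k, (head, opt) in enumerate(zip(heads, opts)):
    target = soft_q_target(head, reward_terms[:, k], next_obs, done)
    q_pred = head.q(obs).gather(-1, z[:, k:k + 1]).squeeze(-1)
    loss = F.mse_loss(q_pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # The corresponding per-head policy update is analogous and omitted for brevity.
```

Because each head only has to fit one reward term over one small discrete choice, every sub-problem is easier to learn than a single monolithic critic over the full latent action.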

SLAC Real-World Training

In this video, we show one full training run for the board-wiping task. The learned SLAC latent action space enables the robot to explore the environment effectively. Within 40 minutes of autonomous interaction, the robot learns to reliably wipe the whiteboard.

Real-World Results

Policy Visualizations

The SLAC framework does not require any demonstrations or hand-crafted behavior priors, and can be easily applied to a variety of tasks. We test SLAC on a suite of contact-rich, visuomotor bimanual mobile manipulation tasks, and show learned policies below.

Wipe a Whiteboard

The SLAC policy continuously wipes a whiteboard by simultaneously moving its arms and base, while using its active camera to keep track of the mark.

Wipe over Obstacle

The SLAC policy learns to position its base so that it does not collide with the obstacle, while successfully wiping the mark on the whiteboard.

Push to Tray

The SLAC policy learns to use the broom in its hand to push trash on the table into a tray.

Sweep into Bag

SLAC learns to sweep trash off the table, while using a bag in the other hand to catch the trash.

Quantitative Results

Quantitatively, SLAC achieves significantly higher performance compared to state-of-the-art real-world RL and sim-to-real RL methods.

More importantly, it stays safe during the entire training process, with minimal collisions and no excessive force!

Learned Latent Actions Visualization

To better understand what SLAC learns during the simulation pretraining phase, we visualize the learned latent actions below. Each video shows the robot executing a sequence of different latent actions sampled from our learned latent action space. Notice how the actions are temporally extended, diverse, and safe - the robot never collides with itself or the environment. These properties are crucial for efficient downstream learning.

SLAC Beyond Robotics

While SLAC is primarily designed for robotics, its core principles can be applied to other domains with complex action spaces. Here we demonstrate SLAC's effectiveness on a centralized multi-agent particle domain, where a centralized controller must simultaneously control 10 agents toward coordinated interactions with the landmarks. SLAC enables efficient downstream task learning in this challenging domain, outperforming previous state-of-the-art methods.