🦾 Robotics Institute
Educational Resources

Robot Learning

Teaching robots through experience, demonstration, and data. From reinforcement learning and imitation to foundation models and sim-to-real transfer: the frontier of intelligent robotics.

Why Robot Learning?

Classical robotics relies on explicit models: kinematic equations, dynamic parameters, geometric maps. These work well in structured environments (factories, warehouses) where everything is known in advance. But the real world is messy, unpredictable, and infinitely varied. A robot in a kitchen encounters objects of unknown shapes, weights, and friction properties. A robot on a construction site faces terrain that changes daily. For these scenarios, learning from experience can succeed where hand-engineering fails.

What Can Be Learned?

The Data Challenge

Unlike computer vision or NLP, robotics data is expensive and slow to collect, because each data point requires physical interaction with the real world. A single robot arm manages on the order of 1,000 grasps per day, so training a policy that needs millions of interactions is impractical on real hardware. This bottleneck drives two major research directions: simulation-based training (sim-to-real) and data-efficient learning (few-shot, meta-learning).

🧠 Robot learning is like teaching a puppy tricks. Instead of writing exact instructions for every situation, you let the robot try things, make mistakes, and get better over time. Just like how you learned to ride a bike: you fell down a lot, but eventually your brain figured out how to balance!

Reinforcement Learning Fundamentals

Reinforcement learning (RL) is a framework for learning optimal behavior through trial and error. An agent interacts with an environment, takes actions, receives rewards, and learns a policy that maximizes cumulative reward.

The RL Framework (MDP)

Formalized as a Markov Decision Process (MDP):

[Figure: agent-environment loop. The agent (policy π(a|s)) sends action a to the environment (robot + world); the environment returns reward r and next state s'.]

The Objective

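Maximize over π: $J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]$

The agent learns a policy π(a|s), a mapping from states to actions (or a distribution over actions), that maximizes the expected discounted sum of rewards. The discount factor γ, with 0 ≤ γ < 1, weights near-term rewards more heavily and keeps the infinite sum finite when rewards are bounded.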
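As a small illustration of the objective above, the discounted return of one finite episode can be computed with a backward recursion. This is a minimal sketch: the function name and the example rewards are made up for illustration.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for one finite episode."""
    g = 0.0
    # Accumulate backwards using G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


episode_rewards = [1.0, 0.0, 2.0]  # illustrative rewards
print(discounted_return(episode_rewards, gamma=0.5))  # prints 1.5
```

An RL algorithm estimates the expectation of this quantity over many episodes and adjusts the policy to increase it.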

Key Concepts

RL Algorithms for Robotics

PPO (Proximal Policy Optimization)

Schulman et al. (OpenAI, 2017). One of the most widely used RL algorithms for robotics. A policy gradient method that keeps each update close to the previous policy by clipping the surrogate objective, a simpler stand-in for an explicit trust-region constraint that prevents destructively large updates. Simple to implement, stable, and parallelizable. Used for locomotion (quadrupeds, humanoids), dexterous manipulation, and drone control. A common default when starting a new robotics RL project.
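The clipped surrogate objective can be written in a few lines. The sketch below assumes per-sample log-probabilities and advantage estimates are already available; the function name and the clip range of 0.2 are illustrative choices, not taken from this page.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity to maximize)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum removes any incentive to push the ratio
        # far outside [1 - clip_eps, 1 + clip_eps].
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# A ratio of 2 with a positive advantage is capped at 1 + clip_eps.
print(ppo_clip_objective([math.log(2.0)], [0.0], [1.0]))
```

In practice this term is combined with a value-function loss and an entropy bonus and optimized with minibatch gradient ascent.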

SAC (Soft Actor-Critic)

Haarnoja et al. (UC Berkeley, 2018). An off-policy algorithm that maximizes both expected reward and policy entropy (exploration bonus). Learns a stochastic policy, which is beneficial for multimodal tasks (multiple ways to solve the problem). More sample-efficient than PPO due to off-policy learning with replay buffers. Widely used in manipulation tasks.
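The entropy term shows up directly in the critic's bootstrap target. The sketch below assumes the twin-critic variant used in common SAC implementations; the function name, the temperature alpha, and the numbers are illustrative assumptions.

```python
def sac_target(reward, done, q1_next, q2_next, logp_next, gamma=0.99, alpha=0.2):
    """Soft Bellman target: r + gamma * (min Q(s', a') - alpha * log pi(a'|s'))."""
    # Subtracting alpha * log pi rewards actions the policy is uncertain
    # about, which is the entropy (exploration) bonus.
    soft_value = min(q1_next, q2_next) - alpha * logp_next
    return reward + gamma * (1.0 - done) * soft_value


target = sac_target(reward=1.0, done=0.0, q1_next=10.0, q2_next=12.0,
                    logp_next=-2.0, gamma=0.9, alpha=0.5)
print(target)
```

Here a' is sampled from the current policy at the next state, and the transition itself comes from the replay buffer.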

TD3 (Twin Delayed DDPG)

Fujimoto et al. (2018). An improvement on DDPG (Deep Deterministic Policy Gradient) that addresses overestimation bias with twin critics and delayed policy updates. The policy is deterministic: it outputs a single action rather than a distribution. Good for continuous control with low-dimensional action spaces.
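The twin-critic idea can be sketched as a target computation for a scalar action. The callables `target_actor`, `target_q1`, and `target_q2` are hypothetical stand-ins for the target networks, the noise settings are illustrative defaults, and the delayed actor update is not shown.

```python
import random


def td3_target(reward, done, next_state, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0,
               rng=random):
    """Clipped double-Q target with target policy smoothing."""
    # Smooth the target action with clipped Gaussian noise.
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
    next_action = target_actor(next_state) + noise
    next_action = max(-max_action, min(max_action, next_action))
    # Use the smaller of the two critics to counter overestimation.
    q_next = min(target_q1(next_state, next_action),
                 target_q2(next_state, next_action))
    return reward + gamma * (1.0 - done) * q_next


# Toy stand-ins for the target networks (assumptions for illustration).
y = td3_target(1.0, 0.0, 1.0,
               target_actor=lambda s: 0.5,
               target_q1=lambda s, a: s + a,
               target_q2=lambda s, a: s + 2.0 * a,
               gamma=0.9, noise_std=0.0)
print(y)
```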

Model-Based RL

Instead of learning a policy directly (model-free), learn a dynamics model and use it for planning:
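A minimal sketch of that recipe, assuming a one-dimensional state and action: fit a linear dynamics model to logged transitions, then plan with random shooting and execute only the first action (as in model predictive control). The toy dynamics, function names, and hyperparameters are all assumptions made for illustration.

```python
import random


def fit_linear_model(transitions):
    """Least-squares fit of s' = a*s + b*u from (s, u, s') tuples."""
    sss = sum(s * s for s, u, n in transitions)
    ssu = sum(s * u for s, u, n in transitions)
    suu = sum(u * u for s, u, n in transitions)
    ssn = sum(s * n for s, u, n in transitions)
    sun = sum(u * n for s, u, n in transitions)
    det = sss * suu - ssu * ssu  # determinant of the 2x2 normal equations
    return (ssn * suu - ssu * sun) / det, (sss * sun - ssu * ssn) / det


def plan_random_shooting(state, model, goal=0.0, horizon=3,
                         n_candidates=500, rng=None):
    """Sample action sequences, score them under the model, return the best first action."""
    rng = rng or random.Random(0)
    a, b = model
    best_cost, best_first = float("inf"), 0.0
    for _ in range(n_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, cost = state, 0.0
        for u in actions:
            s = a * s + b * u          # imagined rollout using the learned model
            cost += (s - goal) ** 2
        if cost < best_cost:
            best_cost, best_first = cost, actions[0]
    return best_first


# Collect transitions from toy "real" dynamics s' = 0.9*s + 0.5*u.
data_rng = random.Random(1)
transitions = []
for _ in range(20):
    s, u = data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)
    transitions.append((s, u, 0.9 * s + 0.5 * u))

model = fit_linear_model(transitions)
first_action = plan_random_shooting(1.0, model, rng=random.Random(0))
print(model, first_action)
```

Real systems replace the linear fit with a neural network or an ensemble, but the loop of collecting data, fitting a model, and planning in imagination is the same.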

Algorithm Comparison for Robotics

Algorithm | Type | Sample Efficiency | Stability | Best For
PPO | On-policy, model-free | Low | High | Sim-to-real, locomotion
SAC | Off-policy, model-free | Medium | High | Manipulation, real robot
TD3 | Off-policy, model-free | Medium | Medium | Continuous control
MBPO | Off-policy, model-based | High | Medium | Data-limited settings
DreamerV3 | Model-based | High | High | Visual observations

RL Success Stories in Robotics

Robotic Grasping at Scale (Google Brain, 2018)

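Levine et al. trained grasping policies on roughly 800,000 grasp attempts collected over two months with between 6 and 14 real robots. The policy learns an end-to-end mapping from camera images to motor commands. The follow-up QT-Opt system (Kalashnikov et al., 2018) reported 96% grasp success on unseen objects using about 580,000 grasp attempts from 7 robots.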

Sergey Levine | UC Berkeley / Google

Dexterous In-Hand Manipulation (OpenAI, 2019)

Trained a Shadow Dexterous Hand to solve a Rubik's cube using RL in simulation with massive domain randomization. Transferred to real hardware. Demonstrated the power (and difficulty) of sim-to-real for dexterous manipulation.

OpenAI | roughly 13,000 years of simulated experience

Quadruped Locomotion (ETH Zurich, 2022)

Miki et al. trained ANYmal to traverse challenging terrain (stairs, gaps, rubble) using RL with a teacher-student framework. The teacher uses privileged simulation information; the student learns to imitate the teacher using only onboard sensors. The controller was demonstrated on an alpine hike and in the underground courses of the DARPA Subterranean Challenge.

Robotic Systems Lab, ETH Zurich

Agile Drone Racing (UZH / RPG, 2023)

Kaufmann et al. trained a neural network to fly a drone through a racing course, beating human world champions for the first time. Deep RL policy trained entirely in simulation, deployed on real hardware at speeds exceeding 20 m/s.

University of Zurich | Nature 2023

Learning to Walk in Minutes (NVIDIA, 2022)

Rudin et al. (ETH Zurich and NVIDIA) used Isaac Gym, a GPU-parallelized simulator, to train locomotion policies on thousands of simulated robots at once. A quadruped policy for rough terrain trains in about 20 minutes on a single GPU, and a flat-terrain policy in under 4 minutes.

NVIDIA Isaac Gym | Legged Gym

Humanoid Locomotion (UC Berkeley, 2024)

Radosavovic et al. trained a humanoid robot (Digit) to walk in the real world using sim-to-real RL. The policy observes proprioception only (no vision) and achieves robust walking on various terrains. Simple reward, no motion capture reference.

Ilija Radosavovic | UC Berkeley

Imitation Learning

Instead of learning from a reward signal (RL), imitation learning trains a policy by observing expert demonstrations. The expert provides examples of correct behavior; the learner mimics them. Also called learning from demonstration (LfD) or apprenticeship learning.

Behavioral Cloning (BC)

The simplest form: treat imitation as supervised learning. Given a dataset of (observation, action) pairs from an expert, train a neural network to predict actions from observations. Fast to train, no simulator needed. The fundamental problem: distributional shift. At test time, small prediction errors compound: the policy drifts to states not seen during training, which can lead to catastrophic failure. DAgger (Ross, Gordon & Bagnell, 2011) addresses this by iteratively collecting data from the learner's own execution with expert corrections.
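The difference between plain behavioral cloning and DAgger can be seen in a toy one-dimensional problem. Everything here is an assumption made for illustration: the saturating expert, the linear policy class, and the dynamics where the next observation equals the observation plus the action.

```python
def expert(obs):
    """Hypothetical expert: proportional controller with saturated output."""
    return max(-1.0, min(1.0, -0.8 * obs))


def fit_policy(dataset):
    """Behavioral cloning: least-squares fit of a linear policy a = w * obs."""
    num = sum(o * a for o, a in dataset)
    den = sum(o * o for o, a in dataset)
    return num / den


def rollout(w, start=1.0, steps=10):
    """States visited when the learner's own policy drives the toy system."""
    obs, visited = start, []
    for _ in range(steps):
        visited.append(obs)
        obs = obs + w * obs  # toy dynamics: next = obs + action
    return visited


def dagger(n_iters=3):
    # Initial expert demonstrations (illustrative observations).
    dataset = [(o, expert(o)) for o in (0.5, 1.0, 2.0)]
    w_bc = fit_policy(dataset)          # plain behavioral cloning
    w = w_bc
    for _ in range(n_iters):
        states = rollout(w)             # states the learner actually reaches
        dataset += [(s, expert(s)) for s in states]  # expert relabels them
        w = fit_policy(dataset)
    return w_bc, w


w_bc, w_dagger = dagger()
print(w_bc, w_dagger)
```

In this toy the DAgger gain moves toward the expert's gain of -0.8 on the states the learner visits, because the training distribution shifts toward the learner's own state distribution.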

Inverse Reinforcement Learning (IRL)

Instead of imitating actions directly, IRL infers the reward function that the expert is optimizing, then uses standard RL to find a policy for that reward. This can be more robust than BC because a learned reward can generalize to situations where copied actions do not. Key algorithms:

Data Collection Methods

Learning from Demonstration: Practical Approaches

Action Chunking with Transformers (ACT)

Zhao et al. (2023, Stanford) introduced ACT for the ALOHA bimanual manipulation system. Instead of predicting one action at a time, ACT predicts a "chunk" of future actions (e.g., the next 100 timesteps) using a CVAE (Conditional Variational Autoencoder) with a transformer backbone. This handles multi-modal demonstrations (multiple ways to perform a task) and temporal consistency. Demonstrated bimanual tasks like inserting a battery, picking up a cup from a saucer, and threading a zip tie, from only 50 demonstrations.
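Chunked prediction can be sketched without the network. The version below also averages overlapping chunk predictions with exponential weights (temporal ensembling, as described in the ACT paper, with the oldest prediction weighted most); the `policy` callable, the scalar actions, and the parameter values are assumptions for illustration.

```python
import math


def temporal_ensemble(predictions, m=0.1):
    """Weighted average of predictions for one timestep, oldest first.

    Weights follow w_i = exp(-m * i), so index 0 (oldest) gets the largest weight.
    """
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)


def run_chunked_policy(policy, obs_sequence, chunk_size=4, m=0.1):
    """Query the policy every step for a chunk and ensemble overlapping predictions."""
    pending = {}   # timestep -> predictions made for that timestep, oldest first
    executed = []
    for t, obs in enumerate(obs_sequence):
        chunk = policy(obs, chunk_size)          # actions for t, t+1, ..., t+k-1
        for k, action in enumerate(chunk):
            pending.setdefault(t + k, []).append(action)
        executed.append(temporal_ensemble(pending.pop(t), m))
    return executed


# Toy policy (assumption): predicts a constant chunk equal to the observation.
executed = run_chunked_policy(lambda obs, k: [obs] * k, [0.0, 1.0])
print(executed)
```

Executing a whole chunk open-loop and re-querying only every k steps is the simpler alternative; ensembling trades extra inference calls for smoother motion.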

Movement Primitives

Encode demonstrations as parameterized motion trajectories that can be adapted to new situations:

One-Shot and Few-Shot Imitation

Learning a new task from a single demonstration. Approaches:

Sim-to-Real Transfer

Training in simulation is fast, safe, parallelizable, and cheap. But simulations are imperfect: physics engines approximate contact dynamics, rendering engines approximate visual appearance, and sensor models approximate noise characteristics. Policies trained in simulation often fail when deployed on real robots, a problem known as the "reality gap." Sim-to-real transfer is the art of bridging this gap.

[Figure: sim-to-real pipeline. Simulation (MuJoCo / Isaac) → domain randomization (friction, mass, visuals) → train policy (PPO / SAC) → real robot (deploy + fine-tune).]

Domain Randomization

One of the most widely used sim-to-real techniques. During training, randomize simulation parameters so the policy experiences a wide distribution of environments. If the real world falls within this distribution, the policy is likely to transfer. Parameters to randomize:

Tobin et al. (2017, OpenAI) first demonstrated that domain randomization alone (without any real data) enables sim-to-real transfer for object localization. OpenAI's Rubik's cube work (2019) pushed this to an extreme with automatic domain randomization: the policy stayed robust to perturbations it never saw in training, such as fingers tied together or a blanket draped over the hand, because training covered an enormous distribution of conditions.
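In code, domain randomization amounts to resampling simulator parameters at the start of every episode. The parameter names and ranges below are hypothetical, and the simulator reset is left as a comment because it depends on the simulator in use.

```python
import random

# Hypothetical parameter ranges; real ranges are chosen per robot and task.
RANDOMIZATION_RANGES = {
    "friction": (0.5, 1.5),
    "mass_scale": (0.8, 1.2),
    "motor_latency_s": (0.0, 0.04),
}


def sample_sim_params(rng):
    """Draw one set of simulator parameters uniformly from the ranges."""
    return {name: rng.uniform(lo, hi)
            for name, (lo, hi) in RANDOMIZATION_RANGES.items()}


rng = random.Random(0)
for episode in range(3):
    params = sample_sim_params(rng)
    # A real pipeline would reset the simulator with `params` here,
    # then collect one training episode with the current policy.
    print(episode, params)
```

Wider ranges make transfer more likely but the task harder to learn, so the ranges are usually tuned or grown automatically during training.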

System Identification

Measure the real robot's physical parameters (masses, friction, motor models) and configure the simulator to match. This reduces the reality gap directly but requires careful measurement and doesn't account for unmodeled phenomena. Often combined with domain randomization: identify what you can, randomize what you can't.
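A minimal sketch of system identification as parameter fitting: choose the simulator friction value whose rollout best matches a recorded trajectory. The toy simulator (a sliding block with viscous friction), the candidate grid, and the "real" data generated from the same model are assumptions for illustration.

```python
def simulate(v0, friction, dt=0.01, steps=50):
    """Toy simulator: velocity of a sliding block with dv/dt = -friction * v."""
    v, traj = v0, []
    for _ in range(steps):
        v += -friction * v * dt
        traj.append(v)
    return traj


def identify_friction(real_traj, v0, candidates):
    """Return the candidate friction with the smallest squared trajectory error."""
    def error(friction):
        sim_traj = simulate(v0, friction, steps=len(real_traj))
        return sum((s - r) ** 2 for s, r in zip(sim_traj, real_traj))
    return min(candidates, key=error)


# Stand-in for a logged real trajectory (generated here with friction 0.8).
real_traj = simulate(1.0, 0.8)
candidates = [i / 10 for i in range(1, 21)]
print(identify_friction(real_traj, 1.0, candidates))  # prints 0.8
```

With real logs the best candidate will not match exactly; the residual error is a rough indication of how much still needs to be covered by randomization.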

Domain Adaptation

Learn to map between simulated and real observations so that a policy trained on simulated observations works on real observations:

Sim-to-Real Frameworks

Simulator | Strength | Used By
Isaac Sim / Isaac Lab | GPU-parallelized physics, photorealistic rendering, massive scale | NVIDIA, many research labs
MuJoCo | Fast, accurate contact physics, lightweight | DeepMind, UC Berkeley, Stanford
dm_control | MuJoCo-based benchmark suite for continuous control | DeepMind
PyBullet | Free, OpenAI Gym integration, URDF support | Google Brain, educational
robosuite | MuJoCo-based manipulation benchmark, standardized tasks | Stanford SVL, UT Austin RPL, many

🎮 Sim-to-real is like learning to play basketball in a video game first, then playing for real. The video game isn't perfect: the ball bounces a little differently in real life. So scientists make the video game slightly random each time (heavier ball, slippery floor) so the robot gets used to surprises. Then when it plays for real, it's ready!

Foundation Models for Robotics

The success of large language models (LLMs) and vision-language models (VLMs) has inspired a wave of research on "foundation models" for robotics: large, general-purpose models that can be applied to many robot tasks with minimal task-specific training.

RT-2 (Robotics Transformer 2)

Brohan et al. (Google DeepMind, 2023). A vision-language-action (VLA) model that directly outputs robot actions as text tokens. Built on the PaLI-X (up to 55B parameters) and PaLM-E (12B parameters) vision-language models and co-fine-tuned on web data together with robot manipulation data from Google's fleet of everyday robots. RT-2 can follow natural language instructions ("move the banana to the plate"), reason about spatial relationships, and generalize to novel objects and instructions not seen during training. The key insight: pre-trained VLMs already understand the visual world; fine-tuning them to output actions adds robot embodiment.
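Emitting actions as tokens requires discretizing each continuous action dimension. The sketch below uses 256 uniform bins per dimension, the discretization described for the RT-1 and RT-2 models; the normalized range of [-1, 1] and the function names are assumptions for illustration.

```python
def action_to_tokens(action, low=-1.0, high=1.0, n_bins=256):
    """Map each continuous action dimension to an integer bin index."""
    tokens = []
    for x in action:
        x = min(max(x, low), high)              # clamp to the valid range
        frac = (x - low) / (high - low)
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def tokens_to_action(tokens, low=-1.0, high=1.0, n_bins=256):
    """Map bin indices back to the center of each bin."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


tokens = action_to_tokens([-1.0, 0.0, 1.0])
print(tokens)                    # prints [0, 128, 255]
print(tokens_to_action(tokens))
```

The language model is then trained to emit these indices as part of its output vocabulary, and the robot controller decodes them back into continuous commands.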

Octo

Ghosh et al. (UC Berkeley, 2024). An open-source generalist robot policy trained on 800K robot trajectories from the Open X-Embodiment dataset (data from 22 different robot types). Octo is a transformer-based model that takes language instructions and images as input and outputs actions. It can be fine-tuned on a new robot with on the order of 100 demonstrations. One of the first fully open foundation models for robot manipulation.

OpenVLA

Kim et al. (Stanford / UC Berkeley, 2024). A 7B parameter open-source vision-language-action model built on Llama 2 with fused DINOv2 and SigLIP visual encoders. Trained on about 970K robot episodes from the Open X-Embodiment dataset. Achieves strong performance on real robot manipulation tasks with language conditioning. Open weights and code enable community research.

SayCan

Ahn et al. (Google, 2022). Uses an LLM to propose actions (what the robot should do) and a learned affordance model to filter for feasible actions (what the robot can do). The LLM breaks down high-level instructions ("I spilled something, can you help?") into primitive skills; the affordance model grounds these skills in the robot's physical capabilities.
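The combination of the two models can be sketched as a product of scores. The skill names and probability values below are invented for illustration; in the real system the first score comes from the LLM's likelihood of a skill description and the second from a learned value function.

```python
def saycan_select(llm_scores, affordances):
    """Pick the skill maximizing p_LLM(skill | instruction) * p_afford(skill | state)."""
    combined = {skill: llm_scores[skill] * affordances.get(skill, 0.0)
                for skill in llm_scores}
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical scores for "I spilled something, can you help?"
llm_scores = {"find a sponge": 0.5, "pick up the sponge": 0.4, "go to the table": 0.1}
affordances = {"find a sponge": 0.9, "pick up the sponge": 0.1, "go to the table": 0.8}

best, combined = saycan_select(llm_scores, affordances)
print(best)  # prints find a sponge
```

Picking up the sponge is useful according to the LLM but not yet feasible (no sponge is in view), so the product ranks it below finding one. The selected skill is executed, appended to the prompt, and the scoring repeats for the next step.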

Code as Policies

Liang et al. (Google, 2023). Instead of outputting low-level motor commands, an LLM generates Python code that calls robot API functions. The code can include loops, conditionals, and function composition, enabling complex, compositional robot behaviors. Example: "sort the fruits by color" generates code that detects fruits, classifies colors, and commands pick-and-place for each fruit.
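To make the fruit-sorting example concrete, the sketch below shows the kind of program an LLM might produce. The functions `detect_objects` and `pick_and_place` are hypothetical stubs standing in for a robot's perception and control API, not the API used in the paper.

```python
# Hypothetical scene and API stubs (assumptions for illustration).
SCENE = [
    {"name": "apple", "color": "red"},
    {"name": "banana", "color": "yellow"},
    {"name": "strawberry", "color": "red"},
]
executed = []  # log of commands the stub "robot" received


def detect_objects():
    """Stub perception call: returns detected fruits with color labels."""
    return list(SCENE)


def pick_and_place(obj_name, target):
    """Stub control call: records the command instead of moving a robot."""
    executed.append((obj_name, target))


# The kind of code an LLM might generate for "sort the fruits by color".
def sort_fruits_by_color():
    for fruit in detect_objects():
        pick_and_place(fruit["name"], f"{fruit['color']} bin")


sort_fruits_by_color()
print(executed)
```

Because the output is ordinary code, loops and conditionals come for free, and the generated program can be inspected before it runs on hardware.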

The Trajectory So Far

Year | Model | Key Contribution
2022 | SayCan | LLMs for high-level robot planning with affordance grounding
2022 | RT-1 | Transformer policy trained on 130K real robot episodes
2023 | PaLM-E | Embodied multimodal LLM (562B params), integrates vision + language + robotics
2023 | RT-2 | VLA model: VLM fine-tuned to output robot actions as tokens
2023 | Code as Policies | LLMs generate executable robot programs
2024 | Octo | Open-source generalist policy, 800K trajectories, 22 robot types
2024 | OpenVLA | Open-source 7B VLA, Llama 2 backbone, strong real-robot performance
2024 | pi0 (Physical Intelligence) | Flow-matching VLA for dexterous manipulation, pre-trained on diverse data

Diffusion Policies

One of the most impactful recent advances in robot learning. Chi et al. (Columbia / Toyota Research, 2023) proposed using denoising diffusion models to represent robot policies. Instead of predicting a single action, the policy generates action trajectories by iteratively denoising random noise, the same process used in image generation (Stable Diffusion, DALL-E 2).
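The iterative denoising loop can be sketched with standard DDPM sampling over a short action sequence. The noise predictor here is an oracle that already knows the target actions, a stand-in for the trained, observation-conditioned network; the schedule values, names, and target actions are assumptions for illustration.

```python
import math
import random


def make_schedule(n_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule with cumulative products of alpha = 1 - beta."""
    betas = [beta_start + (beta_end - beta_start) * i / (n_steps - 1)
             for i in range(n_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    return betas, alphas, alpha_bars


def sample_actions(noise_predictor, obs, horizon, schedule, rng):
    """DDPM reverse process: start from noise, denoise step by step."""
    betas, alphas, alpha_bars = schedule
    x = [rng.gauss(0.0, 1.0) for _ in range(horizon)]  # pure noise
    for t in reversed(range(len(betas))):
        eps = noise_predictor(x, t, obs)               # predicted noise
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        x = [(xi - coef * ei) / math.sqrt(alphas[t]) for xi, ei in zip(x, eps)]
        if t > 0:                                      # no noise on the last step
            sigma = math.sqrt(betas[t])
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x


def make_oracle_predictor(target_actions, schedule):
    """Stand-in for a trained network: returns the exact noise for a known target."""
    alpha_bars = schedule[2]

    def predict(x, t, obs):
        ab = alpha_bars[t]
        return [(xi - math.sqrt(ab) * ai) / math.sqrt(1.0 - ab)
                for xi, ai in zip(x, target_actions)]
    return predict


schedule = make_schedule()
target_actions = [0.1, 0.2, 0.3, 0.4]  # illustrative action sequence
predictor = make_oracle_predictor(target_actions, schedule)
actions = sample_actions(predictor, obs=None, horizon=4,
                         schedule=schedule, rng=random.Random(0))
print(actions)
```

With a learned predictor, different noise seeds can land on different valid action sequences, which is how the method represents multi-modal demonstrations.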

Why Diffusion for Robotics?

How It Works

Results

Chi et al. report that Diffusion Policy consistently outperforms prior methods, including behavioral cloning baselines, IBC (Implicit Behavioral Cloning), and BeT (Behavior Transformer), across 12 tasks from 4 manipulation benchmarks, with an average improvement of 46.9%. On real robot tasks (pushing a T-shape, sauce pouring), it achieves 80-95% success from 100-200 demonstrations. The approach has been rapidly adopted: 3D Diffusion Policy (DP3; Ze et al., 2024) extends it to point-cloud observations for dexterous manipulation.

Open Challenges

1. Sample Efficiency

Even the most sample-efficient RL algorithms require thousands of real-world interactions. Sim-to-real reduces this but introduces its own challenges (reality gap). The holy grail: a robot that learns a new manipulation skill from a single human demonstration, like a human apprentice would. Current best: ACT/Diffusion Policy with 50-200 demos for simple tasks. Complex tasks still require thousands.

2. Generalization

Policies trained on specific objects, environments, and tasks often fail when anything changes. A grasping policy trained on mugs may fail on bowls. Foundation models (RT-2, Octo) show promise but still struggle with truly novel scenarios. The open question: how much data and what architectures are needed for general-purpose robot intelligence?

3. Long-Horizon Tasks

Most robot learning successes are on short-horizon tasks (pick up an object, push a button). Real-world tasks involve long sequences of actions: cook a meal, clean a room, assemble furniture. These require task decomposition, error recovery, and planning over hundreds of steps. Hierarchical RL and LLM-based planners are promising but far from solved.

4. Safety and Robustness

Learned policies are black boxes. They can fail catastrophically in novel situations without warning. For deployment in homes, hospitals, and public spaces, we need: formal safety guarantees, graceful degradation, uncertainty estimation (knowing when you don't know), and safe exploration (learning without breaking things or hurting people). Constrained RL, safety filters (e.g., control barrier functions), and runtime monitoring are active research areas.
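A safety filter based on a control barrier function can be illustrated on a one-dimensional system. The sketch assumes single-integrator dynamics (velocity is commanded directly) and the safe set x ≤ x_max; the names, gain, and step size are illustrative.

```python
def cbf_filter(u_nominal, x, x_max=1.0, alpha=2.0):
    """Minimally modify u so that h(x) = x_max - x stays non-negative.

    For dx/dt = u, the barrier condition dh/dt >= -alpha * h reduces to
    u <= alpha * (x_max - x).
    """
    return min(u_nominal, alpha * (x_max - x))


def simulate(use_filter, steps=1000, dt=0.01, x_max=1.0):
    """Run a nominal controller that always pushes toward the boundary."""
    x, peak = 0.0, 0.0
    for _ in range(steps):
        u = 1.0  # stand-in for an unsafe learned policy output
        if use_filter:
            u = cbf_filter(u, x, x_max)
        x += u * dt
        peak = max(peak, x)
    return peak


print(simulate(use_filter=False))  # leaves the safe set
print(simulate(use_filter=True))   # approaches x_max without crossing it
```

The filter leaves the learned policy untouched far from the boundary and only intervenes as the state approaches it, which is why such filters are attractive as a wrapper around black-box policies.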

5. Contact-Rich Manipulation

Tasks involving complex contact (inserting a USB cable, tying a knot, folding clothes) remain extremely difficult. Contact physics is discontinuous and hard to simulate accurately. Tactile sensing helps but adds complexity. Deformable objects (cloth, rope, food) lack the rigid-body assumptions that most methods rely on.

6. Real-World Deployment

The gap between research demos and real-world deployment is enormous. Research papers report best-case results in controlled settings. Real deployment requires: robustness to lighting/weather changes, handling of edge cases, recovery from failures, integration with existing systems, and meeting regulatory requirements. Companies like Agility (Digit), Boston Dynamics (Spot), and Figure (01) are pushing the frontier but reliable autonomous operation in unstructured environments remains years away.

Tools & Platforms

Stable Baselines3

Reliable PyTorch implementations of RL algorithms: PPO, SAC, TD3, A2C, DQN. Clean API, good documentation, well-tested. The go-to RL library for robotics researchers.

Python | PyTorch | MIT License | 9k+ stars

MuJoCo

The standard physics engine for robot learning research. Fast, accurate contact dynamics, C API with Python bindings. Now free and open-source (Google DeepMind).

C / Python | Apache 2.0 | 8k+ stars

Isaac Lab

NVIDIA's GPU-accelerated robot learning framework. Massively parallel simulation (thousands of environments on one GPU). Includes legged locomotion, manipulation, and drone tasks.

Python | NVIDIA | 2k+ stars

Gymnasium

Successor to OpenAI Gym. Standard API for RL environments. Robotics environments available via gymnasium-robotics (Fetch, Shadow Hand, Adroit tasks).

Python | Farama Foundation | MIT License

Diffusion Policy

Official implementation of Diffusion Policy (Chi et al., 2023). Includes training code, pretrained models, and real robot deployment scripts. The fastest way to get started with diffusion-based robot learning.

Python | PyTorch | MIT License

Octo

Open-source generalist robot policy. Pre-trained on Open X-Embodiment (800K trajectories, 22 robots). Fine-tune on your robot with 100+ demos. JAX/Flax implementation.

Python | JAX | MIT License | UC Berkeley

References & Further Reading

Sutton & Barto: Reinforcement Learning: An Introduction (2018)

THE textbook for RL. Covers bandits, MDPs, dynamic programming, Monte Carlo methods, TD learning, policy gradients, and function approximation. Free online. 2nd edition.

MIT Press | Free PDF | 50k+ citations

Brohan et al: RT-2: Vision-Language-Action Models (2023)

Transfers knowledge from web-scale vision-language pre-training to robot control. Shows that VLMs can output robot actions as text tokens, enabling semantic reasoning about manipulation.

Google DeepMind | Robotics Transformer 2

Chi et al: Diffusion Policy (2023)

Denoising diffusion models for visuomotor policy learning. Handles multi-modal demonstrations, predicts action chunks, and achieves SOTA on manipulation benchmarks. Rapidly adopted by the community.

Columbia / Toyota Research Institute

Open X-Embodiment (2024)

A collaboration of 21 institutions pooling robot demonstration data from 22 robot embodiments. Over 1 million trajectories. Enables cross-embodiment generalization research. Powers Octo and RT-X.

Google DeepMind + 20 institutions

Zhao et al: Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (2023)

ALOHA paper. Low-cost bimanual teleoperation system ($20K vs $100K+ alternatives) + ACT policy learning from 50 demonstrations. Demonstrated precise tasks: battery insertion, cup transfer.

Stanford | ALOHA + ACT

OpenAI Spinning Up

The best practical introduction to deep RL. Clean implementations of VPG, TRPO, PPO, DDPG, TD3, SAC. Explanations of key concepts, math, and code in one place.

OpenAI | Free | Educational resource