Robot Learning

Teaching robots through experience, demonstration, and data. From reinforcement learning and imitation to foundation models and sim-to-real transfer — the frontier of intelligent robotics.

Why Robot Learning?

Classical robotics relies on explicit models: kinematic equations, dynamic parameters, geometric maps. These work well in structured environments (factories, warehouses) where everything is known in advance. But the real world is messy, unpredictable, and infinitely varied. A robot in a kitchen encounters objects of unknown shapes, weights, and friction properties. A robot on a construction site faces terrain that changes daily. For these scenarios, learning from experience can succeed where hand-engineering fails.

What Can Be Learned?

Perception — object recognition, scene understanding, depth estimation from images. The most successful application of ML in robotics. Deep learning has revolutionized robot perception.
Control policies — mapping sensor observations to motor commands. Instead of designing a controller, learn it from data. Examples: locomotion gaits, manipulation strategies, drone acrobatics.
Dynamics models — learn the physics of the robot and environment from interaction data. Used in model-based RL and adaptive control.
Reward functions — learn what "good behavior" looks like from human preferences or demonstrations, rather than manually specifying a reward function.
Task planning — learn to decompose high-level goals into sequences of primitive actions.

The Data Challenge

Unlike computer vision or NLP, robotics data is expensive and slow to collect. Each data point requires physical interaction with the real world. A robot arm can perform at most ~1000 grasps per day. Training a policy that requires millions of interactions is impractical on real hardware. This fundamental bottleneck drives two major research directions: simulation-based training (sim-to-real) and data-efficient learning (few-shot, meta-learning).
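A back-of-the-envelope calculation makes the bottleneck concrete. The sketch below uses the ~1000 grasps per day figure from above; the one-million-interaction budget and the ten-arm fleet are illustrative assumptions, not measurements.

```python
def days_to_collect(interactions, per_robot_per_day=1000, robots=1):
    """Wall-clock days needed to gather `interactions` samples."""
    return interactions / (per_robot_per_day * robots)

single_arm = days_to_collect(1_000_000)        # one arm
fleet = days_to_collect(1_000_000, robots=10)  # hypothetical 10-arm fleet
print(f"one arm: {single_arm:.0f} days (~{single_arm / 365:.1f} years)")
print(f"ten arms: {fleet:.0f} days")
```

Even a ten-arm fleet needs months of continuous operation under these assumptions, which is why simulation and data-efficient methods matter.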

🧠 Robot learning is like teaching a puppy tricks. Instead of writing exact instructions for every situation, you let the robot try things, make mistakes, and get better over time. Just like how you learned to ride a bike — you fell down a lot, but eventually your brain figured out how to balance!

Reinforcement Learning Fundamentals

Reinforcement learning (RL) is a framework for learning optimal behavior through trial and error. An agent interacts with an environment, takes actions, receives rewards, and learns a policy that maximizes cumulative reward.

The RL Framework (MDP)

Formalized as a Markov Decision Process (MDP):

State s — the current configuration of the robot and environment (joint angles, velocities, object positions, sensor readings).
Action a — the control command (joint torques, motor velocities, gripper open/close).
Transition p(s'|s,a) — the probability of moving to state s' after taking action a in state s. Determined by physics (unknown to the agent in model-free RL).
Reward r(s,a) — a scalar signal indicating how good the action was. Designing good reward functions is a major challenge in robotics RL.
Discount factor gamma — 0 <= gamma <= 1. Trades off immediate vs future rewards; gamma < 1 keeps the infinite-horizon sum finite when rewards are bounded. Gamma = 0.99 is typical for robotics.

Agent (policy π(a|s)) → action a → Environment (robot + world) → reward r, next state s' → Agent

The Objective

Maximize: E[sum_{t=0}^{infinity} gamma^t * r(s_t, a_t)]

The agent learns a policy pi(a|s) — a mapping from states to actions (or a distribution over actions) — that maximizes the expected discounted sum of rewards.
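For a finite episode, the discounted sum in the objective can be computed directly. This is a minimal sketch; the reward list and the gamma value in the example are made up for illustration.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for a finite list of rewards."""
    total = 0.0
    # Accumulate backwards using G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# A reward of 1 arriving two steps in the future is worth gamma^2 today.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))  # 0.25
```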

Key Concepts

Value function V(s) — the expected return starting from state s and following policy pi. V(s) = E[sum gamma^t r | s_0 = s].
Action-value function Q(s,a) — the expected return starting from state s, taking action a, then following pi. Q(s,a) = r(s,a) + gamma * E[V(s')].
Exploration vs exploitation — the agent must try new actions (explore) to discover better strategies, while also using its current best knowledge (exploit) to accumulate reward. Critical in robotics: exploration means trying potentially dangerous actions on real hardware.
On-policy vs off-policy — on-policy methods (PPO, TRPO) learn from data collected by the current policy. Off-policy methods (SAC, TD3) can learn from data collected by any policy (replay buffer). Off-policy is more sample-efficient but less stable.
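The relationship between V and Q can be checked on a toy problem. The sketch below runs iterative policy evaluation on a hypothetical two-state MDP (the states, actions, rewards, and policy are invented for illustration) by repeatedly applying V(s) <- Q(s, pi(s)).

```python
GAMMA = 0.9
# Hypothetical MDP: P[s][a] = list of (probability, next_state); R[s][a] = reward.
P = {0: {"stay": [(1.0, 0)], "go": [(1.0, 1)]},
     1: {"stay": [(1.0, 1)], "go": [(1.0, 0)]}}
R = {0: {"stay": 0.0, "go": 0.0},
     1: {"stay": 1.0, "go": 0.0}}
policy = {0: "go", 1: "stay"}  # fixed deterministic policy pi

def q_value(V, s, a):
    """Q(s,a) = r(s,a) + gamma * E[V(s')]."""
    return R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])

def evaluate(policy, sweeps=500):
    """Iterative policy evaluation: apply V(s) <- Q(s, pi(s)) until converged."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        V = {s: q_value(V, s, policy[s]) for s in P}
    return V

V = evaluate(policy)
print(V)                       # V[1] approaches 1 / (1 - 0.9) = 10, V[0] approaches 9
print(q_value(V, 0, "stay"))   # lower than V[0], so "go" is the better action in state 0
```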

RL Algorithms for Robotics

PPO (Proximal Policy Optimization)

Schulman et al. (OpenAI, 2017). One of the most widely used RL algorithms for robotics. A policy gradient method that approximates a trust-region constraint with a clipped surrogate objective, preventing destructively large policy updates. Simple to implement, stable, and parallelizable. Used for locomotion (quadrupeds, humanoids), dexterous manipulation, and drone control. A common default choice when starting a new robotics RL project.
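The clipped surrogate objective is easiest to see for a single sample. In this sketch, `ratio` stands for pi_new(a|s) / pi_old(a|s) and `advantage` for an advantage estimate; the clip range eps = 0.2 is a commonly used value, taken here as an assumption.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action, policy already moved far toward it: the gain is capped at (1 + eps) * A.
print(ppo_clip_objective(ratio=1.5, advantage=1.0))
# Bad action whose probability increased: no clipping, the full penalty applies.
print(ppo_clip_objective(ratio=1.5, advantage=-1.0))
```

Because the objective stops improving once the ratio leaves the clip range in the favorable direction, the gradient gives no incentive to push the policy further in one update.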

SAC (Soft Actor-Critic)

Haarnoja et al. (UC Berkeley, 2018). An off-policy algorithm that maximizes both expected reward and policy entropy (exploration bonus). Learns a stochastic policy, which is beneficial for multimodal tasks (multiple ways to solve the problem). More sample-efficient than PPO due to off-policy learning with replay buffers. Widely used in manipulation tasks.
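The entropy bonus can be illustrated with a discrete-action simplification (SAC itself targets continuous actions). The sketch below evaluates the entropy-regularized value E[Q(s,a) - alpha * log pi(a|s)] for a greedy policy and for a Boltzmann policy; the Q-values and the temperature alpha are made-up numbers.

```python
import math

def soft_value(q_values, probs, alpha):
    """E_{a~pi}[Q(s,a) - alpha * log pi(a|s)] over a discrete action set."""
    return sum(p * (q - alpha * math.log(p))
               for q, p in zip(q_values, probs) if p > 0)

def boltzmann_policy(q_values, alpha):
    """pi(a|s) proportional to exp(Q(s,a) / alpha)."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]

q = [1.0, 0.9, -2.0]   # two nearly equivalent good actions, one bad action
alpha = 0.5
pi_soft = boltzmann_policy(q, alpha)
greedy = [1.0, 0.0, 0.0]
print(pi_soft)                          # spreads mass over both good actions
print(soft_value(q, greedy, alpha))     # 1.0
print(soft_value(q, pi_soft, alpha))    # higher, thanks to the entropy term
```

The stochastic policy keeps both good actions alive, which is the behavior the text describes as useful for multimodal tasks.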

TD3 (Twin Delayed DDPG)

Fujimoto et al. (2018). An improvement on DDPG (Deep Deterministic Policy Gradient) that addresses overestimation bias with twin critics and delayed policy updates. Deterministic policy — outputs a single action rather than a distribution. Good for continuous control with low-dimensional action spaces.
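The twin-critic target can be sketched for a scalar action as follows. The noise scale, noise clip, and action bound are illustrative defaults, and the actor and critics are passed in as plain functions rather than neural networks. Delayed policy updates (updating the actor less often than the critics) are not shown.

```python
import random

def td3_target(reward, next_state, done, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0,
               rng=random):
    """TD target using clipped double-Q learning and target policy smoothing."""
    # Target policy smoothing: perturb the target action with clipped noise.
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
    next_action = target_actor(next_state) + noise
    next_action = max(-max_action, min(max_action, next_action))
    # Clipped double-Q: take the smaller of the two critic estimates.
    q_min = min(target_q1(next_state, next_action),
                target_q2(next_state, next_action))
    return reward + gamma * (1.0 - done) * q_min

# Toy stand-ins for the target networks (hypothetical).
actor = lambda s: 0.3
q1 = lambda s, a: 1.0 + a
q2 = lambda s, a: 2.0
y = td3_target(0.5, None, 0.0, actor, q1, q2, gamma=0.9, noise_std=0.0)
print(y)  # uses q1 because it is the smaller estimate
```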

Model-Based RL

Instead of learning a policy directly (model-free), learn a dynamics model and use it for planning:

MBPO (Model-Based Policy Optimization) — Janner et al. (2019). Learn an ensemble of neural network dynamics models, generate synthetic rollouts, and train a policy on the synthetic data. 10-100x more sample-efficient than model-free methods.
Dreamer — Hafner et al. (2020, Google DeepMind). Learn a world model in a latent space, then train a policy entirely in imagination. DreamerV3 (2023) achieves strong results across diverse tasks without task-specific tuning.
PETS — Chua et al. (2018). Probabilistic Ensembles with Trajectory Sampling. Learn an ensemble of probabilistic dynamics models, use CEM (Cross-Entropy Method) for planning. Demonstrated on simulated continuous-control benchmarks, where it approached model-free asymptotic performance with far fewer samples.
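Planning with a dynamics model can be sketched with the Cross-Entropy Method on a toy problem. Here a hand-written one-dimensional `model` stands in for a learned (ensemble) dynamics model, and the population size, elite count, and horizon are arbitrary illustrative choices.

```python
import random

def model(x, a):
    """Stand-in for a learned dynamics model: a 1-D point that moves by a."""
    return x + a

def rollout_cost(x0, actions, goal):
    """Sum of squared distances to the goal along the predicted trajectory."""
    x, cost = x0, 0.0
    for a in actions:
        x = model(x, a)
        cost += (x - goal) ** 2
    return cost

def cem_plan(x0, goal, horizon=5, pop=64, n_elite=8, iters=10, seed=0):
    rng = random.Random(seed)
    mean, std = [0.0] * horizon, [1.0] * horizon
    for _ in range(iters):
        # Sample candidate action sequences, clipped to the range [-1, 1].
        samples = [[max(-1.0, min(1.0, rng.gauss(m, s)))
                    for m, s in zip(mean, std)] for _ in range(pop)]
        samples.sort(key=lambda seq: rollout_cost(x0, seq, goal))
        elites = samples[:n_elite]
        # Refit the sampling distribution to the lowest-cost sequences.
        mean = [sum(e[t] for e in elites) / n_elite for t in range(horizon)]
        std = [max(1e-3, (sum((e[t] - mean[t]) ** 2 for e in elites)
                          / n_elite) ** 0.5) for t in range(horizon)]
    return mean

plan = cem_plan(x0=0.0, goal=3.0)
print(plan)  # early actions push toward the goal at close to full speed
```

In a model-predictive control loop, only the first action of the plan would be executed before re-planning from the next observed state.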

Algorithm Comparison for Robotics

Algorithm	Type	Sample Efficiency	Stability	Best For
PPO	On-policy, model-free	Low	High	Sim-to-real, locomotion
SAC	Off-policy, model-free	Medium	High	Manipulation, real robot
TD3	Off-policy, model-free	Medium	Medium	Continuous control
MBPO	Off-policy, model-based	High	Medium	Data-limited settings
DreamerV3	Model-based	High	High	Visual observations

RL Success Stories in Robotics

Robotic Grasping at Scale (Google Brain, 2018)

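Kalashnikov et al. (QT-Opt) trained a vision-based grasping policy on over 580,000 real-world grasp attempts collected across 7 robots. The policy learns an end-to-end mapping from camera images to motor commands and reaches 96% grasp success on unseen objects. It built on earlier work by Levine et al., who collected about 800,000 grasp attempts over two months.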

Sergey Levine | UC Berkeley / Google

Dexterous In-Hand Manipulation (OpenAI, 2019)

Trained a Shadow Dexterous Hand to solve a Rubik's cube using RL in simulation with massive domain randomization. Transferred to real hardware. Demonstrated the power (and difficulty) of sim-to-real for dexterous manipulation.

OpenAI | 16,000 CPU years of simulation

Quadruped Locomotion (ETH Zurich, 2022)

Miki et al. trained ANYmal to traverse extreme terrain (stairs, gaps, rubble) using RL with a teacher-student framework. The teacher uses privileged simulation information; the student learns from the teacher using only onboard sensors. Deployed in real-world disaster response scenarios.

Robotic Systems Lab, ETH Zurich

Agile Drone Racing (UZH / RPG, 2023)

Kaufmann et al. trained a neural network to fly a drone through a racing course, beating human world champions for the first time. Deep RL policy trained entirely in simulation, deployed on real hardware at speeds exceeding 20 m/s.

University of Zurich | Nature 2023

Learning to Walk in Minutes (NVIDIA, 2022)

Rudin et al. used Isaac Gym (GPU-parallelized simulation) to train locomotion policies for thousands of robots simultaneously. A quadruped policy can be trained in 20 minutes on a single GPU. Enabled by massive parallelism.

NVIDIA Isaac Gym | Legged Gym

Humanoid Locomotion (UC Berkeley, 2024)

Radosavovic et al. trained a humanoid robot (Digit) to walk in the real world using sim-to-real RL. The policy observes proprioception only (no vision) and achieves robust walking on various terrains. Simple reward, no motion capture reference.

Ilija Radosavovic | UC Berkeley

Imitation Learning

Instead of learning from a reward signal (RL), imitation learning trains a policy by observing expert demonstrations. The expert provides examples of correct behavior; the learner mimics them. Also called learning from demonstration (LfD) or apprenticeship learning.

Behavioral Cloning (BC)

The simplest form: treat imitation as supervised learning. Given a dataset of (observation, action) pairs from an expert, train a neural network to predict actions from observations. Fast to train, no simulator needed. The fundamental problem: distributional shift. At test time, small prediction errors compound — the policy drifts to states not seen during training, leading to catastrophic failure. DAgger (Ross, Gordon & Bagnell, 2011) addresses this by iteratively collecting data from the learner's own execution with expert corrections.
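Behavioral cloning as supervised learning can be shown in a few lines. This sketch swaps the neural network for a one-dimensional linear policy fitted by least squares, and the "expert" is a hypothetical proportional controller, so it illustrates the recipe rather than a realistic setup.

```python
def fit_linear_policy(observations, actions):
    """Least-squares fit of a = w * o + b to expert (observation, action) pairs."""
    n = len(observations)
    mean_o = sum(observations) / n
    mean_a = sum(actions) / n
    cov = sum((o - mean_o) * (a - mean_a) for o, a in zip(observations, actions))
    var = sum((o - mean_o) ** 2 for o in observations)
    w = cov / var
    b = mean_a - w * mean_o
    return w, b

# Hypothetical expert: a proportional controller a = -2 * o.
obs = [-1.0, -0.5, 0.0, 0.5, 1.0]
acts = [-2.0 * o for o in obs]
w, b = fit_linear_policy(obs, acts)
print(w, b)  # recovers the expert gain within the demonstrated range
```

The fit is only supported by observations in [-1, 1]; once the learner drifts outside that range, nothing in the data constrains its behavior, which is the distributional shift problem described above.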

Inverse Reinforcement Learning (IRL)

Instead of imitating actions directly, IRL infers the reward function that the expert is optimizing, then uses standard RL to find a policy for that reward. This is more robust than BC because the learned reward generalizes to new situations. Key algorithms:

Maximum Entropy IRL — Ziebart et al. (2008). Assumes the expert acts according to a Boltzmann distribution over trajectories, with higher probability for lower-cost trajectories. The inferred reward explains the demonstrations while being maximally uncertain (maximum entropy) about undemonstrated behavior.
GAIL (Generative Adversarial Imitation Learning) — Ho & Ermon (2016). Frames imitation as a GAN problem: a discriminator distinguishes expert trajectories from learner trajectories, and the learner's policy is trained to fool the discriminator. Avoids explicitly recovering the reward function.
AIRL (Adversarial IRL) — Fu et al. (2018). Recovers a transferable reward function using adversarial training. The recovered reward can be used to train new policies in different environments.
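The Boltzmann assumption behind Maximum Entropy IRL can be made concrete over a small, finite set of candidate trajectories. The costs below are invented, and a real implementation would have to handle a much larger trajectory space.

```python
import math

def trajectory_probs(costs):
    """P(trajectory) proportional to exp(-cost), normalized over a finite set."""
    m = min(costs)  # shift for numerical stability; does not change the result
    weights = [math.exp(-(c - m)) for c in costs]
    z = sum(weights)
    return [w / z for w in weights]

probs = trajectory_probs([1.0, 2.0, 4.0])
print(probs)  # lower-cost trajectories are exponentially more likely
```

Each unit of extra cost divides the probability by e, so suboptimal demonstrations are treated as unlikely rather than impossible.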

Data Collection Methods

Teleoperation — a human controls the robot directly using a joystick, VR controller, or leader-follower setup. Produces high-quality demonstrations but is slow and requires operator skill. The ALOHA system (Zhao et al., 2023) uses low-cost leader-follower arms, with joint positions mirrored from leader to follower, to collect bimanual manipulation demonstrations.
Kinesthetic teaching — the human physically guides the robot through the desired motion while the robot records joint positions/torques. Intuitive but limited to compliant robots (cobots).
Video demonstrations — learn from watching human videos without any robot data. Extremely challenging due to the embodiment gap (human hands are not robot grippers). Active research area: R3M (Nair et al., 2022) and VIP (Ma et al., 2023) learn visual representations from human videos for downstream robot learning.

Learning from Demonstration: Practical Approaches

Action Chunking with Transformers (ACT)

Zhao et al. (2023, Stanford) introduced ACT for the ALOHA bimanual manipulation system. Instead of predicting one action at a time, ACT predicts a "chunk" of future actions (e.g., the next 100 timesteps) using a CVAE (Conditional Variational Autoencoder) with a transformer backbone. This handles multi-modal demonstrations (multiple ways to perform a task) and temporal consistency. Demonstrated bimanual tasks like inserting a battery, picking up a cup from a saucer, and threading a zip tie — from only 50 demonstrations.

Movement Primitives

Encode demonstrations as parameterized motion trajectories that can be adapted to new situations:

DMPs (Dynamic Movement Primitives) — Ijspeert et al. (2013). Represent a motion as a dynamical system with a learnable forcing function. The forcing function is learned from a single demonstration. DMPs can be scaled in time and space, and are robust to perturbations. Widely used in industrial skill transfer.
ProMP (Probabilistic Movement Primitives) — Paraschos et al. (2013). Represent demonstrations as distributions over trajectories (mean + covariance). Can condition on start/end points, blend multiple primitives, and handle via-points. Used for human-robot handover and collaborative assembly.
KMP (Kernelized Movement Primitives) — Huang et al. (2019). Kernel-based regression for trajectory learning. Handles high-dimensional input spaces and orientation trajectories.
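A one-dimensional DMP can be integrated in a few lines. This sketch follows the common formulation with a critically damped spring-damper system and a canonical phase variable x that decays from 1 to 0. The gains, time step, and the constant `forcing` function are illustrative assumptions; a real DMP would learn the forcing function from a demonstration using weighted basis functions.

```python
def integrate_dmp(y0, goal, forcing=lambda x: 0.0, tau=1.0,
                  alpha_z=25.0, alpha_x=4.0, dt=0.001, duration=2.0):
    """Euler-integrate a 1-D DMP and return the position trajectory."""
    beta_z = alpha_z / 4.0          # critical damping
    y, z, x = y0, 0.0, 1.0          # position, scaled velocity, phase
    path = [y]
    for _ in range(int(duration / dt)):
        f = forcing(x) * x * (goal - y0)   # forcing term fades as x -> 0
        dz = (alpha_z * (beta_z * (goal - y) - z) + f) / tau
        dy = z / tau
        dx = -alpha_x * x / tau
        z, y, x = z + dz * dt, y + dy * dt, x + dx * dt
        path.append(y)
    return path

path = integrate_dmp(y0=0.0, goal=1.0)
# Same primitive, new goal, plus a placeholder forcing term that shapes the motion.
path2 = integrate_dmp(y0=0.0, goal=2.0, forcing=lambda x: 50.0)
print(path[-1], path2[-1])  # both end close to their goals
```

Because the forcing term is gated by the decaying phase variable, the trajectory converges to the goal regardless of the learned shape, which is the source of the robustness noted above.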

One-Shot and Few-Shot Imitation

Learning a new task from one or a handful of demonstrations. Approaches:

Meta-imitation learning — train a meta-learner on many tasks so it can quickly adapt to new tasks from one demo. Yu et al. (2018, Stanford/UC Berkeley).
Task-conditioned policies — condition the policy on a video or image of the task being performed. The policy learns to extract task intent from the demonstration and execute it.

Sim-to-Real Transfer

Training in simulation is fast, safe, parallelizable, and cheap. But simulations are imperfect: physics engines approximate contact dynamics, rendering engines approximate visual appearance, and sensor models approximate noise characteristics. Policies trained in simulation often fail when deployed on real robots — the "reality gap." Sim-to-real transfer is the art of bridging this gap.

Simulation (MuJoCo / Isaac) → Domain randomization (friction, mass, visuals) → Train policy (PPO / SAC) → Real robot (deploy + fine-tune)

Domain Randomization

One of the most widely used sim-to-real techniques. During training, randomize simulation parameters so the policy experiences a wide distribution of environments. If the real world falls within this distribution, the policy is much more likely to transfer. Parameters to randomize:

Physics: friction coefficients, mass, inertia, damping, restitution, actuator delays, joint backlash.
Visual: lighting direction/intensity/color, textures, camera position/orientation, object colors, distractors.
Dynamics: gravity, motor strength, sensor noise, communication delays.

Tobin et al. (2017, OpenAI) first demonstrated that domain randomization alone (without any real data) enables sim-to-real transfer for object localization. OpenAI's Rubik's cube work (2019) pushed this to an extreme: the policy was robust to physical perturbations, broken fingers, and novel objects because training covered an enormous distribution of conditions.
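In code, domain randomization often amounts to resampling simulator parameters at the start of every episode. The parameter names and ranges below are hypothetical; real projects tune them per robot and pass them to whatever simulator API they use.

```python
import random

# Hypothetical ranges: (low, high). Integer bounds are sampled as integers.
PARAM_RANGES = {
    "friction": (0.5, 1.5),
    "mass_scale": (0.8, 1.2),
    "motor_strength": (0.9, 1.1),
    "action_delay_steps": (0, 3),
}

def sample_sim_params(rng):
    """Draw one set of simulator parameters for the next training episode."""
    params = {}
    for name, (low, high) in PARAM_RANGES.items():
        if isinstance(low, int) and isinstance(high, int):
            params[name] = rng.randint(low, high)
        else:
            params[name] = rng.uniform(low, high)
    return params

rng = random.Random(0)
episodes = [sample_sim_params(rng) for _ in range(3)]
for params in episodes:
    print(params)
```

Wider ranges make the policy more robust but also make the learning problem harder, so the ranges themselves are a tuning decision.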

System Identification

Measure the real robot's physical parameters (masses, friction, motor models) and configure the simulator to match. This reduces the reality gap directly but requires careful measurement and doesn't account for unmodeled phenomena. Often combined with domain randomization: identify what you can, randomize what you can't.
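A simple instance of system identification is fitting a mass and a viscous damping coefficient from logged motion data. The sketch assumes a one-dimensional model F = m * a + b * v and noise-free synthetic measurements; real identification would involve noisy data and richer models.

```python
def fit_mass_and_damping(accels, vels, forces):
    """Least-squares fit of F = m * a + b * v via the 2x2 normal equations."""
    saa = sum(a * a for a in accels)
    svv = sum(v * v for v in vels)
    sav = sum(a * v for a, v in zip(accels, vels))
    saf = sum(a * f for a, f in zip(accels, forces))
    svf = sum(v * f for v, f in zip(vels, forces))
    det = saa * svv - sav * sav
    m = (saf * svv - svf * sav) / det
    b = (svf * saa - saf * sav) / det
    return m, b

# Synthetic log generated with m = 2.0 and b = 0.5 (hypothetical values).
accels = [1.0, 0.0, 2.0, -1.0]
vels = [0.0, 1.0, 1.0, 2.0]
forces = [2.0 * a + 0.5 * v for a, v in zip(accels, vels)]
m, b = fit_mass_and_damping(accels, vels, forces)
print(m, b)  # recovers the parameters used to generate the data
```

The fitted values would then be written into the simulator's robot model, with any remaining uncertainty handled by randomizing around them.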

Domain Adaptation

Learn to map between simulated and real observations so that a policy trained on simulated observations works on real observations:

Pixel-level adaptation — use GANs (CycleGAN, etc.) to transform simulated images to look realistic. The policy sees adapted images that look like real camera data.
Feature-level adaptation — learn a shared representation that is invariant to the sim/real domain. The policy operates on domain-invariant features.
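RCAN (Randomized-to-Canonical Adaptation Networks) — James et al. (2019), "Sim-to-Real via Sim-to-Sim". Train an image-translation network in simulation to map heavily randomized renderings to a canonical simulated view. At test time the same network maps real camera images to the canonical view, so the policy only ever sees canonical-style images.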

Sim-to-Real Frameworks

Simulator	Strength	Used By
Isaac Sim / Isaac Lab	GPU-parallelized physics, photorealistic rendering, massive scale	NVIDIA, many research labs
MuJoCo	Fast, accurate contact physics, lightweight	DeepMind, UC Berkeley, Stanford
dm_control	MuJoCo-based benchmark suite for continuous control	DeepMind
PyBullet	Free, OpenAI Gym integration, URDF support	Google Brain, educational
robosuite	MuJoCo-based manipulation benchmark, standardized tasks	Stanford SVL, UT Austin RPL, many

🎮 Sim-to-real is like learning to play basketball in a video game first, then playing for real. The video game isn't perfect — the ball bounces a little differently in real life. So scientists make the video game slightly random each time (heavier ball, slippery floor) so the robot gets used to surprises. Then when it plays for real, it's ready!

Foundation Models for Robotics

The success of large language models (LLMs) and vision-language models (VLMs) has inspired a wave of research on "foundation models" for robotics — large, general-purpose models that can be applied to many robot tasks with minimal task-specific training.

RT-2 (Robotics Transformer 2)

Brohan et al. (Google DeepMind, 2023). A vision-language-action (VLA) model that directly outputs robot actions as text tokens. Built on pre-trained VLM backbones (PaLI-X at up to 55B parameters and PaLM-E at 12B), co-fine-tuned on web-scale vision-language data and robot manipulation data from Google's fleet of everyday robots. RT-2 can follow natural language instructions ("move the banana to the plate"), reason about spatial relationships, and generalize to novel objects and instructions not seen during training. The key insight: pre-trained VLMs already understand the visual world; fine-tuning them to output actions adds robot embodiment.

Octo

Ghosh et al. (UC Berkeley, 2024). An open-source generalist robot policy trained on 800K robot trajectories from the Open X-Embodiment dataset (data from 22 different robot types). Octo is a transformer-based model that takes language instructions and images as input and outputs actions. It can be fine-tuned on a new robot with as few as 100 demonstrations. Among the first fully open-source generalist policies for robot manipulation.

OpenVLA

Kim et al. (Stanford / UC Berkeley, 2024). A 7B parameter open-source vision-language-action model built on a Llama 2 language backbone with fused DINOv2 and SigLIP vision encoders. Trained on about 970K robot trajectories from the Open X-Embodiment dataset. Achieves strong performance on real robot manipulation tasks with language conditioning. Open weights and code enable community research.

SayCan

Ahn et al. (Google, 2022). Uses an LLM to propose actions (what the robot should do) and a learned affordance model to filter for feasible actions (what the robot can do). The LLM breaks down high-level instructions ("I spilled something, can you help?") into primitive skills; the affordance model grounds these skills in the robot's physical capabilities.
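The way the two signals combine can be sketched as a product of scores. The skill names and probabilities below are invented; in the real system the first score would come from an LLM and the second from a learned value function.

```python
def select_skill(llm_scores, affordances):
    """Pick the skill maximizing p_LLM(skill | instruction) * p(success | state)."""
    combined = {skill: llm_scores[skill] * affordances.get(skill, 0.0)
                for skill in llm_scores}
    return max(combined, key=combined.get), combined

# Instruction: "I spilled something, can you help?" (scores are hypothetical).
llm_scores = {"pick up the sponge": 0.5, "find a sponge": 0.4, "go to the table": 0.1}
# No sponge is currently visible, so picking one up is unlikely to succeed.
affordances = {"pick up the sponge": 0.1, "find a sponge": 0.9, "go to the table": 0.8}

best, combined = select_skill(llm_scores, affordances)
print(best)  # the affordance score overrides the LLM's first preference
```

The LLM alone would try to pick up a sponge it cannot see; grounding in affordances steers the plan toward a step the robot can actually execute.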

Code as Policies

Liang et al. (Google, 2023). Instead of outputting low-level motor commands, an LLM generates Python code that calls robot API functions. The code can include loops, conditionals, and function composition, enabling complex, compositional robot behaviors. Example: "sort the fruits by color" generates code that detects fruits, classifies colors, and commands pick-and-place for each fruit.
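The fruit-sorting example might look like the following. Everything here is a stand-in: `detect_objects`, `get_color`, and `pick_and_place` are hypothetical API functions backed by a fake scene, and the final loop is the kind of program an LLM could plausibly emit, not output from the actual system.

```python
# Hypothetical perception and control API exposed to the LLM (stubbed for illustration).
SCENE = {"apple": "red", "banana": "yellow", "cherry": "red"}
log = []

def detect_objects():
    """Return the names of objects currently visible."""
    return sorted(SCENE)

def get_color(name):
    """Return the detected color of an object."""
    return SCENE[name]

def pick_and_place(name, target):
    """Record a pick-and-place command instead of moving a real robot."""
    log.append((name, target))

# The kind of program an LLM might generate for "sort the fruits by color":
for fruit in detect_objects():
    pick_and_place(fruit, f"{get_color(fruit)} bin")

print(log)
```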

The Trajectory So Far

Year	Model	Key Contribution
2022	SayCan	LLMs for high-level robot planning with affordance grounding
2022	RT-1	Transformer policy trained on 130K real robot episodes
2023	PaLM-E	Embodied multimodal LLM (562B params), integrates vision + language + robotics
2023	RT-2	VLA model: VLM fine-tuned to output robot actions as tokens
2023	Code as Policies	LLMs generate executable robot programs
2024	Octo	Open-source generalist policy, 800K trajectories, 22 robot types
2024	OpenVLA	Open-source 7B VLA, Llama 2 backbone, strong real-robot performance
2024	pi0 (Physical Intelligence)	Flow-matching VLA for dexterous manipulation, pre-trained on diverse data

Diffusion Policies

One of the most impactful recent advances in robot learning. Chi et al. (Columbia / Toyota Research, 2023) proposed using denoising diffusion models to represent robot policies. Instead of predicting a single action, the policy generates action trajectories by iteratively denoising random noise, the same process used in image generators such as Stable Diffusion and DALL-E 2.

Why Diffusion for Robotics?

Multi-modality — demonstrations often show multiple valid ways to perform a task (reach from the left or the right). Standard behavioral cloning (MSE loss) averages these modes, producing invalid actions. Diffusion models naturally represent multi-modal distributions.
Action chunk prediction — diffusion policies predict a sequence of future actions (a trajectory chunk), providing temporal consistency.
Training stability — the denoising objective is well-behaved and trains reliably compared to adversarial methods (GANs) or energy-based models such as IBC, which need negative sampling.

How It Works

Training: add Gaussian noise to expert action trajectories at increasing noise levels. Train a neural network (U-Net or transformer) to predict the noise at each level, conditioned on the observation.
Inference: start with random noise, iteratively denoise using the trained network, conditioned on the current observation. After K denoising steps, the result is a clean action trajectory.
Execute the first few actions, then re-plan (receding horizon).
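The training step above can be sketched with the standard DDPM forward process. The linear beta schedule, the 100 noise levels, and the four-step action chunk are illustrative assumptions, and the trained network is replaced by an oracle that returns the true noise, so the example only demonstrates the bookkeeping.

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def add_noise(actions, alpha_bar, rng):
    """Forward process: x_t = sqrt(abar) * x_0 + sqrt(1 - abar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in actions]
    noisy = [math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * e
             for a, e in zip(actions, eps)]
    return noisy, eps  # eps is the regression target for the network

def predict_clean(noisy, eps_pred, alpha_bar):
    """Invert the forward process given an estimate of the noise."""
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(noisy, eps_pred)]

rng = random.Random(0)
alpha_bars = alpha_bar_schedule(100)
expert_chunk = [0.1, 0.2, 0.3, 0.4]   # hypothetical 4-step action chunk
noisy, eps = add_noise(expert_chunk, alpha_bars[50], rng)
recovered = predict_clean(noisy, eps, alpha_bars[50])  # oracle "network"
print(recovered)
```

A real policy would train a network to predict `eps` from `noisy`, the noise level, and the observation, then apply its predictions over K reverse steps at inference time.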

Results

Chi et al. benchmark Diffusion Policy on 12 tasks from 4 robot manipulation benchmarks (including Robomimic and Push-T) and report that it consistently outperforms prior methods such as LSTM-GMM behavioral cloning, IBC (Implicit Behavioral Cloning), and BeT (Behavior Transformer), with an average improvement of 46.9%. On real robot tasks (pushing a T-shape, sauce pouring), it achieves 80-95% success from 100-200 demonstrations. The approach has been rapidly adopted: 3D Diffusion Policy (DP3; Ze et al., 2024) extends it to 3D point cloud observations, including dexterous manipulation tasks.

Open Challenges

1. Sample Efficiency

Even the most sample-efficient RL algorithms require thousands of real-world interactions. Sim-to-real reduces this but introduces its own challenges (reality gap). The holy grail: a robot that learns a new manipulation skill from a single human demonstration, like a human apprentice would. Current best: ACT/Diffusion Policy with 50-200 demos for simple tasks. Complex tasks still require thousands.

2. Generalization

Policies trained on specific objects, environments, and tasks often fail when anything changes. A grasping policy trained on mugs may fail on bowls. Foundation models (RT-2, Octo) show promise but still struggle with truly novel scenarios. The open question: how much data and what architectures are needed for general-purpose robot intelligence?

3. Long-Horizon Tasks

Most robot learning successes are on short-horizon tasks (pick up an object, push a button). Real-world tasks involve long sequences of actions: cook a meal, clean a room, assemble furniture. These require task decomposition, error recovery, and planning over hundreds of steps. Hierarchical RL and LLM-based planners are promising but far from solved.

4. Safety and Robustness

Learned policies are black boxes. They can fail catastrophically in novel situations without warning. For deployment in homes, hospitals, and public spaces, we need: formal safety guarantees, graceful degradation, uncertainty estimation (knowing when you don't know), and safe exploration (learning without breaking things or hurting people). Constrained RL, safety filters (e.g., control barrier functions), and runtime monitoring are active research areas.

5. Contact-Rich Manipulation

Tasks involving complex contact (inserting a USB cable, tying a knot, folding clothes) remain extremely difficult. Contact physics is discontinuous and hard to simulate accurately. Tactile sensing helps but adds complexity. Deformable objects (cloth, rope, food) lack the rigid-body assumptions that most methods rely on.

6. Real-World Deployment

The gap between research demos and real-world deployment is enormous. Research papers report best-case results in controlled settings. Real deployment requires: robustness to lighting/weather changes, handling of edge cases, recovery from failures, integration with existing systems, and meeting regulatory requirements. Companies like Agility (Digit), Boston Dynamics (Spot), and Figure (01) are pushing the frontier but reliable autonomous operation in unstructured environments remains years away.

Tools & Platforms

Stable Baselines3

Reliable PyTorch implementations of RL algorithms: PPO, SAC, TD3, A2C, DQN. Clean API, good documentation, well-tested. The go-to RL library for robotics researchers.

Python | PyTorch | MIT License | 9k+ stars

MuJoCo

The standard physics engine for robot learning research. Fast, accurate contact dynamics, C API with Python bindings. Now free and open-source (Google DeepMind).

C / Python | Apache 2.0 | 8k+ stars

Isaac Lab

NVIDIA's GPU-accelerated robot learning framework. Massively parallel simulation (thousands of environments on one GPU). Includes legged locomotion, manipulation, and drone tasks.

Python | NVIDIA | 2k+ stars

Gymnasium

Successor to OpenAI Gym. Standard API for RL environments. Robotics environments available via gymnasium-robotics (Fetch, Shadow Hand, Adroit tasks).

Python | Farama Foundation | MIT License

Diffusion Policy

Official implementation of Diffusion Policy (Chi et al., 2023). Includes training code, pretrained models, and real robot deployment scripts. The fastest way to get started with diffusion-based robot learning.

Python | PyTorch | MIT License

Octo

Open-source generalist robot policy. Pre-trained on Open X-Embodiment (800K trajectories, 22 robots). Fine-tune on your robot with 100+ demos. JAX/Flax implementation.

Python | JAX | MIT License | UC Berkeley
