🦾 Robotics Institute
Educational Resources

Robot Perception

How robots see, sense, and understand the world. From cameras and LiDAR to neural networks and SLAM: the sensory systems that enable autonomy.

What is Robot Perception?

Perception is the process of converting raw sensor data into actionable information about the environment. It answers fundamental questions: Where am I? What is around me? Where are the objects I need to interact with? Is it safe to move? Perception is the bridge between the physical world and the robot's decision-making systems.

[Figure: The Perception Pipeline. Camera, LiDAR, IMU → Sensor Fusion (EKF / Factor Graph) → State Estimate (Pose + Map)]

Why Perception is Hard

👀 Robot perception is giving robots eyes, ears, and touch. Cameras are robot eyes. LiDAR is like echolocation (what bats use): it shoots laser beams and listens for them to bounce back. Together, they help robots understand "what's around me?" and "where am I?"

Cameras & Image Sensors

Cameras are the most information-dense sensor in robotics. A single RGB frame contains millions of pixels of color information, enabling object detection, tracking, visual servoing, and scene understanding.

Camera Types

Type | Output | Use Case | Example
Monocular RGB | Color images | Object detection, classification, tracking | FLIR Blackfly, Basler ace
Stereo | Rectified image pairs + disparity | Depth estimation, 3D reconstruction | Intel RealSense D435, ZED 2
RGB-D | Color + depth per pixel | Object detection with depth, manipulation | Intel RealSense D455, Azure Kinect
Event camera | Asynchronous brightness changes | High-speed motion, HDR scenes | iniVation DAVIS346, Prophesee EVK4
Thermal | Infrared temperature map | Night vision, human detection, industrial inspection | FLIR Lepton, Seek Thermal

Camera Model

The pinhole camera model maps 3D world points to 2D image pixels via projection:

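[u, v, 1]^T = (1/Z_c) * K * [R | t] * [X, Y, Z, 1]^T
K = [fx 0 cx; 0 fy cy; 0 0 1] (intrinsic matrix)

where (fx, fy) are focal lengths in pixels, (cx, cy) is the principal point, [R|t] is the extrinsic matrix (the world-to-camera transform, i.e. the inverse of the camera pose), (X, Y, Z) is the 3D point in world coordinates, and Z_c is that point's depth in the camera frame (the third component of [R | t] * [X, Y, Z, 1]^T). Lens distortion (radial and tangential) is modeled by additional polynomial coefficients.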
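
As a minimal sketch of the projection above, the following function first moves a world point into the camera frame and then applies the intrinsics, dividing by the point's depth along the optical axis. The function name `project_point` and the camera values are hypothetical, and lens distortion is ignored.

```python
def project_point(point_world, R, t, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with the pinhole model."""
    # World frame -> camera frame: p_cam = R * p_world + t
    x_c, y_c, z_c = (
        sum(R[i][j] * point_world[j] for j in range(3)) + t[i]
        for i in range(3)
    )
    if z_c <= 0:
        raise ValueError("point is behind the camera")
    # Apply the intrinsic matrix K and divide by the camera-frame depth
    u = fx * x_c / z_c + cx
    v = fy * y_c / z_c + cy
    return u, v


# Hypothetical camera: identity rotation, zero translation, 640x480 image
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_zero = [0.0, 0.0, 0.0]
u, v = project_point([0.5, -0.25, 2.0], R_identity, t_zero,
                     fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(u, v)  # 470.0 165.0
```

Points with non-positive camera-frame depth are rejected because the pinhole model only describes points in front of the camera.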

Camera Calibration

Calibration determines intrinsic parameters (focal length, principal point, distortion) and extrinsic parameters (position/orientation relative to the robot). Zhang's method (2000) uses images of a planar checkerboard pattern and is the standard approach implemented in OpenCV's calibrateCamera(). For stereo cameras, calibration also determines the baseline (distance between cameras) and rectification transforms.

Event Cameras

Unlike conventional cameras that capture full frames at a fixed rate, event cameras (Dynamic Vision Sensors) output asynchronous events whenever a pixel's brightness changes. Each event is a tuple (x, y, t, polarity). Advantages: microsecond temporal resolution, 120+ dB dynamic range, no motion blur, low power. Challenges: the completely different data format requires new algorithms. Research from the University of Zurich (Scaramuzza group) and KAIST has pioneered event-based visual odometry, SLAM, and object detection.
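
To make the (x, y, t, polarity) format concrete, here is a small sketch that accumulates events from a time window into a frame-like grid, one simple way to feed event data to frame-based algorithms. The `Event` type, the function name, and the microsecond timestamps are assumptions for illustration, not the output format of any particular sensor driver.

```python
from collections import namedtuple

# Hypothetical event record: pixel position, timestamp (microseconds), polarity
Event = namedtuple("Event", ["x", "y", "t", "polarity"])


def accumulate_events(events, width, height, t_start, t_end):
    """Sum event polarities per pixel over the window [t_start, t_end)."""
    frame = [[0 for _ in range(width)] for _ in range(height)]
    for ev in events:
        if t_start <= ev.t < t_end:
            # +1 for a brightness increase, -1 for a decrease
            frame[ev.y][ev.x] += 1 if ev.polarity else -1
    return frame


events = [
    Event(x=1, y=0, t=10, polarity=True),
    Event(x=1, y=0, t=25, polarity=True),
    Event(x=2, y=1, t=40, polarity=False),
    Event(x=0, y=1, t=120, polarity=True),  # falls outside the window below
]
frame = accumulate_events(events, width=3, height=2, t_start=0, t_end=100)
print(frame)  # [[0, 2, 0], [0, 0, -1]]
```

Accumulating into frames discards the fine timing that makes event cameras attractive, which is one reason dedicated event-based algorithms exist.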

Depth Sensing

Depth sensors measure the distance from the sensor to surfaces in the scene, producing a depth map (each pixel = distance value). Critical for obstacle avoidance, manipulation, and 3D mapping.

Structured Light

Projects a known pattern (dots, stripes, or speckle) onto the scene and observes the deformation with a camera. The original Microsoft Kinect v1 (PrimeSense technology) used this approach. Works well indoors but fails in sunlight (the projected pattern is washed out). Range: 0.5-5m, accuracy: ~1-3mm at 1m.

Stereo Vision

Two cameras separated by a known baseline observe the same scene. Disparity (pixel shift between left and right images) is inversely proportional to depth: Z = f * B / d, where f is focal length, B is baseline, d is disparity. Dense stereo matching algorithms (SGM, RAFT-Stereo, CREStereo) compute disparity for every pixel. Works outdoors but struggles with textureless surfaces (white walls, glass).
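
The relation Z = f * B / d can be sketched in a few lines. The rig parameters below (700 px focal length, 0.12 m baseline) are hypothetical, and returning infinity for zero disparity is a convention chosen for this example.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in meters from disparity in pixels, using Z = f * B / d."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity (or a failed match)
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Hypothetical rig: 700 px focal length, 0.12 m baseline
for disparity in (84.0, 42.0, 8.4):
    depth = depth_from_disparity(disparity, focal_px=700.0, baseline_m=0.12)
    print(f"disparity {disparity:5.1f} px -> depth {depth:5.2f} m")
```

Because depth is inversely proportional to disparity, a fixed matching error of a fraction of a pixel costs far more accuracy at long range than at short range.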

Time-of-Flight (ToF)

Emits modulated infrared light and measures the phase shift of the returned signal to compute distance. Each pixel independently measures depth. The Microsoft Kinect v2 and Azure Kinect DK use ToF. Advantages: works on textureless surfaces, produces clean depth maps. Disadvantages: limited resolution (typically 512x512 or less), multipath interference in corners, limited range (0.5-5m).
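
For a continuous-wave ToF sensor, the phase-to-distance conversion is commonly written d = c * phi / (4 * pi * f_mod), and the phase wraps after the distance c / (2 * f_mod). The sketch below assumes a single modulation frequency of 30 MHz, a hypothetical value chosen for illustration; real sensors often combine several frequencies.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # meters per second


def tof_distance(phase_rad, mod_freq_hz):
    """Distance from the measured phase shift of the returned signal."""
    # The light travels to the surface and back, hence 4*pi rather than 2*pi
    return SPEED_OF_LIGHT * phase_rad / (4.0 * math.pi * mod_freq_hz)


def unambiguous_range(mod_freq_hz):
    """Largest distance measurable before the phase wraps around."""
    return SPEED_OF_LIGHT / (2.0 * mod_freq_hz)


mod_freq = 30e6  # hypothetical modulation frequency in Hz
print(f"max range: {unambiguous_range(mod_freq):.3f} m")
print(f"phase pi/2 -> {tof_distance(math.pi / 2, mod_freq):.3f} m")
```

At 30 MHz the unambiguous range comes out just under 5 m, which is consistent with the short ranges quoted for this sensor class.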

Active Stereo

Combines structured light projection with stereo cameras. The Intel RealSense D400 series uses active stereo: an IR projector creates texture on the scene, and two IR cameras perform stereo matching. This overcomes the textureless-surface limitation of passive stereo while working in moderate outdoor lighting.

Comparison

Technology | Range | Resolution | Outdoor | Cost
Structured Light | 0.5-5 m | 640x480+ | Poor | $100-500
Passive Stereo | 0.5-100 m | 1280x720+ | Good | $200-2000
Time-of-Flight | 0.5-5 m | 512x512 | Moderate | $300-1000
Active Stereo | 0.3-10 m | 1280x720 | Moderate | $200-500
LiDAR | 1-200 m | N/A (points) | Excellent | $500-75,000

LiDAR

LiDAR (Light Detection and Ranging) measures distance by emitting laser pulses and timing the return. It produces sparse but highly accurate 3D point clouds, making it the primary sensor for autonomous vehicles, surveying, and outdoor mobile robotics.
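
The pulse-timing principle and the conversion from one beam return to a 3D point can be sketched as follows. The function names and the angle convention (azimuth about the vertical axis, elevation measured up from the horizontal plane) are assumptions for this example; real sensors document their own conventions.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # meters per second


def range_from_round_trip(time_s):
    """Range in meters from the pulse's round-trip time."""
    # The pulse covers the distance twice: out and back
    return SPEED_OF_LIGHT * time_s / 2.0


def beam_to_point(range_m, azimuth_rad, elevation_rad):
    """Convert one range/angle return to Cartesian (x, y, z) in the sensor frame."""
    horizontal = range_m * math.cos(elevation_rad)
    x = horizontal * math.cos(azimuth_rad)
    y = horizontal * math.sin(azimuth_rad)
    z = range_m * math.sin(elevation_rad)
    return x, y, z


r = range_from_round_trip(200e-9)  # a 200 ns round trip
print(f"range: {r:.3f} m")
print(beam_to_point(r, math.radians(90.0), 0.0))
```

Repeating the conversion for every beam in a sweep yields the point cloud described above.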

Types of LiDAR

LiDAR Data Characteristics

LiDAR Processing

Object Detection

Object detection identifies what objects are present in an image and where they are (bounding boxes). It is the foundation of robot perception for manipulation, navigation, and interaction.

YOLO (You Only Look Once)

The YOLO family treats detection as a single regression problem: divide the image into a grid, predict bounding boxes and class probabilities for each cell in one forward pass. Key milestones:

YOLO is the go-to choice for robotics when real-time performance matters: 30+ FPS on edge devices (Jetson Orin), 100+ FPS on desktop GPUs.

DETR (Detection Transformer)

Carion et al. (Facebook AI, 2020) introduced DETR: an end-to-end transformer-based detector that eliminates hand-crafted components (anchors, NMS). A CNN backbone extracts features, a transformer encoder/decoder processes them, and a set prediction loss matches predictions to ground truth. Follow-ups include Deformable DETR (faster convergence), DAB-DETR, and DINO (state-of-the-art accuracy). Transformers dominate detection benchmarks but are slower than YOLO for real-time robotics.
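
For context, the NMS (non-maximum suppression) step that DETR eliminates is, in many conventional detectors, a greedy filter over overlapping boxes. The sketch below is a generic illustration with an assumed IoU threshold of 0.5, not the post-processing code of any specific detector.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82, so it is suppressed. The threshold is exactly the kind of hand-tuned parameter that set-prediction training avoids.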

Grounding DINO

Liu et al. (2023) combined DINO with text grounding: given an image and a text prompt ("red cup on the table"), it detects and localizes the described objects. This open-set detection capability is transformative for robotics: the robot can be told what to look for in natural language without task-specific training.

Detection Metrics

Segmentation

While detection gives bounding boxes, segmentation provides pixel-level understanding of the scene.

Types of Segmentation

Segment Anything Model (SAM)

Kirillov et al. (Meta AI, 2023) introduced SAM, a foundation model for segmentation. Trained on 11 million images and 1.1 billion masks, SAM can segment any object in any image given a point, box, or mask prompt (the paper also explores text prompts). SAM 2 (Ravi et al., 2024) extends this to video with temporal consistency. For robotics, SAM enables zero-shot object segmentation: it can segment objects the robot has never seen during training.

Robotics Applications

Pose Estimation

6-DOF pose estimation determines the position (x, y, z) and orientation (roll, pitch, yaw) of objects in the scene. Essential for robotic manipulation: the robot needs to know not just where an object is, but how it is oriented to plan a grasp.
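
A 6-DOF pose is often packed into a 4x4 homogeneous transform so that points on the object can be mapped into the robot or camera frame. The sketch below assumes the common Z-Y-X convention, R = Rz(yaw) * Ry(pitch) * Rx(roll). Other conventions exist, so this choice is an assumption of the example, and the function names are hypothetical.

```python
import math


def pose_to_matrix(x, y, z, roll, pitch, yaw):
    """Build a 4x4 transform from position and roll/pitch/yaw (radians)."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    # Rotation block is Rz(yaw) * Ry(pitch) * Rx(roll)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr, x],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr, y],
        [-sp, cp * sr, cp * cr, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


def transform_point(T, p):
    """Map a 3D point from the object frame into the parent frame."""
    return tuple(
        T[i][0] * p[0] + T[i][1] * p[1] + T[i][2] * p[2] + T[i][3]
        for i in range(3)
    )


# Hypothetical object pose: 0.4 m ahead, 0.1 m up, rotated 90 degrees in yaw
T_object = pose_to_matrix(0.4, 0.0, 0.1, 0.0, 0.0, math.pi / 2)
grasp_point = transform_point(T_object, (0.05, 0.0, 0.0))
print([round(c, 3) for c in grasp_point])
```

A grasp point defined once in the object's own frame can then be re-expressed in the robot frame for any estimated pose.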

Approaches

BOP Benchmark

The BOP (Benchmark for 6D Object Pose Estimation) challenge, held regularly since 2017 alongside ECCV/ICCV workshops, is the standard evaluation for pose estimation methods. It includes datasets like YCB-Video (21 household objects), T-LESS (texture-less industrial parts), and LM-O (occluded objects). The BOP leaderboard tracks the state of the art: bop.felk.cvut.cz.

SLAM: Simultaneous Localization and Mapping

SLAM is the process of building a map of an unknown environment while simultaneously tracking the robot's position within that map. It is one of the most important problems in mobile robotics, studied intensively since the 1986 paper by Smith, Self, and Cheeseman.

[Figure: the SLAM loop. Sense → Extract Features → Update Map → Localize → Move, repeated continuously]

The SLAM Problem

The robot moves through an unknown environment, making noisy observations of landmarks. It must estimate both its own trajectory and the positions of all landmarks. The key insight: landmark observations are correlated through the robot's trajectory, so observing the same landmark from two locations constrains the robot's motion between those observations.

Visual SLAM (vSLAM)

Uses cameras as the primary sensor. Tracks visual features across frames to estimate motion and build a 3D map of feature points.

LiDAR SLAM

Backend Optimization

Modern SLAM systems use factor graph optimization (pose graph SLAM) rather than filtering. The graph has nodes (robot poses, landmark positions) and edges (odometry constraints, observation constraints). Nonlinear least-squares solvers minimize the total error:
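
One common way to write this objective, given here as a general form rather than the formula of any particular system, is x* = argmin_x sum_k w_k * r_k(x)^2, where r_k is the residual of edge k and w_k is its weight (the inverse measurement variance). The sketch below solves a toy 1D pose graph with two odometry edges and one loop-closure edge. Because the 1D residuals are linear, a single solve of the normal equations is enough; real systems iterate (for example with Gauss-Newton) on nonlinear residuals. All function names are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def optimize_pose_graph_1d(num_poses, edges):
    """edges: (i, j, measured x_j - x_i, weight). Pose 0 is fixed at 0."""
    n = num_poses - 1  # unknowns are x_1 .. x_{num_poses - 1}
    H = [[0.0] * n for _ in range(n)]
    g = [0.0] * n
    for i, j, meas, w in edges:
        # Residual r = (x_j - x_i) - meas; Jacobian is +1 for x_j, -1 for x_i
        jac = {}
        if j > 0:
            jac[j - 1] = 1.0
        if i > 0:
            jac[i - 1] = -1.0
        for a, ja in jac.items():
            g[a] += w * ja * meas
            for c, jc in jac.items():
                H[a][c] += w * ja * jc
    return [0.0] + solve_linear(H, g)


edges = [
    (0, 1, 1.0, 1.0),  # odometry: moved 1.0 m
    (1, 2, 1.0, 1.0),  # odometry: moved 1.0 m
    (0, 2, 1.8, 1.0),  # loop closure: pose 2 is 1.8 m from pose 0
]
poses = optimize_pose_graph_1d(3, edges)
print([round(p, 3) for p in poses])  # [0.0, 0.933, 1.867]
```

The optimizer spreads the 0.2 m disagreement between odometry and the loop closure across all three edges instead of trusting any single measurement.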

Loop Closure

When the robot revisits a previously mapped location, loop closure detects this and corrects the accumulated drift. Without loop closure, SLAM drift grows without bound. Methods: bag-of-words image retrieval (DBoW2 in ORB-SLAM), scan context for LiDAR (Kim & Kim, 2018), and learned place recognition (NetVLAD, Patch-NetVLAD).

🗺️ SLAM is like being dropped in a dark cave with a flashlight. You don't have a map, and you don't know where you are. As you walk around shining your flashlight, you build a map AND figure out where you are at the same time. That's exactly what robots do with their cameras and lasers!

Sensor Fusion

No single sensor is sufficient for robust robot perception. Sensor fusion combines data from multiple sensors to produce estimates that are more accurate, more complete, and more robust than any individual sensor.

Kalman Filter

The workhorse of sensor fusion since Rudolf Kalman's 1960 paper. For linear systems with Gaussian noise, the Kalman filter provides the optimal (minimum variance) state estimate. It operates in two steps:

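Predict: x_hat = A*x + B*u, P_hat = A*P*A^T + Q
Update: K = P_hat*H^T*(H*P_hat*H^T + R)^{-1}, x = x_hat + K*(z - H*x_hat), P = (I - K*H)*P_hat

where x is the state, u is the control input, z is the measurement, A is the state-transition matrix, B is the control matrix, P is the covariance, Q is the process noise covariance, R is the measurement noise covariance, H is the observation matrix, and K is the Kalman gain. x_hat and P_hat are the predicted state and covariance before the measurement is applied.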
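
The two steps can be sketched for a scalar state, where every matrix reduces to a number. The noise values and the measurement sequence below are made up for illustration; `P_hat` in the code denotes the predicted covariance before the update.

```python
def kalman_step(x, P, u, z, A=1.0, B=1.0, H=1.0, Q=0.01, R=0.25):
    """One predict/update cycle of a scalar Kalman filter."""
    # Predict: push the state and its uncertainty through the motion model
    x_hat = A * x + B * u
    P_hat = A * P * A + Q
    # Update: blend prediction and measurement according to the gain K
    K = P_hat * H / (H * P_hat * H + R)
    x_new = x_hat + K * (z - H * x_hat)
    P_new = (1.0 - K * H) * P_hat
    return x_new, P_new, K


# Hypothetical robot moving 1 m per step with noisy position measurements
x, P = 0.0, 1.0
for u, z in [(1.0, 1.2), (1.0, 1.9), (1.0, 3.1)]:
    x, P, K = kalman_step(x, P, u, z)
    print(f"x = {x:.3f}, P = {P:.3f}, K = {K:.3f}")
```

The gain starts high because the initial uncertainty is large, then falls as the filter becomes more confident in its own prediction.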

Extended Kalman Filter (EKF)

For nonlinear systems (which includes virtually all robot dynamics), the EKF linearizes the prediction and observation models around the current estimate using Jacobians. Widely used in IMU integration, GPS/INS fusion, and EKF-SLAM. Limitation: the linearization can be inaccurate for highly nonlinear systems.

Unscented Kalman Filter (UKF)

Instead of linearizing, the UKF uses sigma points: a deterministic set of sample points that capture the mean and covariance of the state distribution. Each sigma point is propagated through the actual nonlinear function. More accurate than EKF for highly nonlinear systems, with similar computational cost. Introduced by Julier and Uhlmann (1997).

Particle Filter (Sequential Monte Carlo)

Represents the probability distribution as a set of weighted samples (particles). Each particle is a hypothesis for the state. At each step: propagate particles through the motion model, weight them by the observation likelihood, and resample. Handles multi-modal distributions (multiple hypotheses) and arbitrary nonlinearities. Used in Monte Carlo Localization (MCL) for mobile robots, the standard localization algorithm in ROS (AMCL package). Computational cost scales with the number of particles.
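
The propagate, weight, and resample cycle can be sketched for a robot moving along a line. This is a toy illustration with made-up noise levels and a direct position measurement, not the AMCL implementation; the function name and parameters are hypothetical.

```python
import math
import random


def particle_filter_step(particles, control, measurement, rng,
                         motion_sigma=0.2, meas_sigma=0.5):
    """One propagate / weight / resample cycle for a 1D position."""
    # 1. Propagate each particle through the noisy motion model
    moved = [p + control + rng.gauss(0.0, motion_sigma) for p in particles]
    # 2. Weight by the observation likelihood (Gaussian around the measurement)
    weights = [math.exp(-0.5 * ((measurement - p) / meas_sigma) ** 2)
               for p in moved]
    total = sum(weights)
    if total == 0.0:
        # No particle explains the measurement: fall back to uniform weights
        weights = [1.0 / len(moved)] * len(moved)
    else:
        weights = [w / total for w in weights]
    # 3. The state estimate is the weighted mean of the particles
    estimate = sum(w * p for w, p in zip(weights, moved))
    # 4. Resample in proportion to the weights
    resampled = rng.choices(moved, weights=weights, k=len(moved))
    return resampled, estimate


rng = random.Random(0)  # fixed seed so the run is repeatable
particles = [rng.uniform(0.0, 10.0) for _ in range(500)]  # unknown start
true_position = 2.0
for _ in range(5):
    true_position += 1.0
    particles, estimate = particle_filter_step(
        particles, control=1.0, measurement=true_position, rng=rng)
print(f"estimate {estimate:.2f}, truth {true_position:.2f}")
```

The particles start spread over the whole corridor and collapse onto the measured position within a few steps, which is the global-localization behavior that makes this family of filters attractive.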

Common Fusion Architectures

Sensors | Application | Method
IMU + GPS | Outdoor localization | EKF or error-state KF
Camera + IMU (VIO) | Visual-inertial odometry | MSCKF, OKVIS, VINS-Mono
LiDAR + IMU | LiDAR-inertial SLAM | LIO-SAM, FAST-LIO2
Camera + LiDAR | Dense 3D perception | Projection + late fusion
Camera + LiDAR + Radar | Autonomous driving | BEVFusion, TransFusion
Wheel encoders + IMU + LiDAR | Indoor mobile robot | EKF + AMCL or Cartographer

Visual-Inertial Odometry (VIO)

Combines camera and IMU data for 6-DOF pose estimation. The IMU provides high-rate (100-400 Hz) angular velocity and acceleration measurements, which drift quickly when integrated on their own; the camera provides lower-rate (15-60 Hz) relative pose constraints that keep that drift in check. Key systems include MSCKF, OKVIS, and VINS-Mono.

Point Clouds & 3D Reconstruction

A point cloud is a set of 3D points representing the surfaces in a scene. Point clouds are produced by LiDAR, depth cameras, and multi-view stereo, and are the primary data structure for 3D perception.
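
As an illustration of how a depth camera's output becomes a point cloud, the sketch below back-projects each valid depth pixel through the pinhole intrinsics (fx, fy, cx, cy). The function name, the tiny 2x2 depth map, and the intrinsics are hypothetical, and a depth of zero is assumed to mark an invalid pixel.

```python
def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (meters) into camera-frame 3D points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip pixels with no valid depth
            # Invert the pinhole projection u = fx * x / z + cx (same for v)
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


depth = [
    [0.0, 2.0],
    [1.0, 0.0],
]
cloud = depth_to_point_cloud(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(cloud)  # [(1.0, -1.0, 2.0), (-0.5, 0.5, 1.0)]
```

The resulting points are expressed in the camera frame; placing them in a world frame requires the camera pose as well.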

Point Cloud Processing

Deep Learning on Point Clouds

3D Reconstruction Methods

Tools & Libraries

OpenCV

The standard computer vision library. Image processing, feature detection, camera calibration, stereo matching, object tracking. C++ and Python bindings.

C++ / Python | 78k+ GitHub stars | Apache 2.0

Open3D

Modern 3D data processing library. Point clouds, meshes, RGBD integration, TSDF, ICP, visualization. Excellent Python API.

C++ / Python | Intel ISL | MIT License

PCL (Point Cloud Library)

Comprehensive C++ library for 3D point cloud processing. Filtering, segmentation, registration, surface reconstruction, feature extraction.

C++ | Open Perception | BSD

Ultralytics (YOLOv8/v11)

Production-ready object detection, segmentation, and pose estimation. Easy training, export to ONNX/TensorRT/CoreML for edge deployment.

Python | AGPL-3.0 | 35k+ stars

GTSAM

Factor graph optimization library for SLAM, SfM, and state estimation. Efficient incremental inference using Bayes trees.

C++ / Python | Georgia Tech | BSD

COLMAP

Structure-from-Motion and Multi-View Stereo pipeline. The standard tool for 3D reconstruction from images.

C++ | ETH Zurich | BSD

References & Further Reading

Thrun, Burgard & Fox: Probabilistic Robotics (2005)

THE textbook for robot perception. Covers Kalman filters, particle filters, SLAM, localization, occupancy grids, and Bayesian decision-making in depth.

MIT Press | 14k+ citations

Corke: Robotics, Vision and Control (2017)

Practical treatment of robot vision with MATLAB toolbox. Camera models, image features, visual servoing, and navigation.

Springer | 2nd Edition

Kirillov et al: Segment Anything (2023)

The SAM paper from Meta AI. Trained on SA-1B (1.1 billion masks), SAM is a foundation model for promptable segmentation. Transformative for robotics perception.

Meta AI | ICCV 2023

Campos et al: ORB-SLAM3 (2021)

Visual, visual-inertial, and multi-map SLAM. The most complete open-source SLAM system. Handles monocular, stereo, RGB-D, and fisheye cameras.

IEEE T-RO | University of Zaragoza

Mildenhall et al: NeRF (2020)

Neural Radiance Fields for view synthesis. Encodes a scene as a continuous 5D function (position + direction -> color + density). Sparked a revolution in 3D reconstruction.

ECCV 2020 | UC Berkeley

Cadena et al: Past, Present, and Future of SLAM (2016)

Comprehensive survey of SLAM covering 30 years of progress. Identifies open problems including long-term autonomy, semantic understanding, and resource-constrained operation.

IEEE T-RO | 2500+ citations