How robots see, sense, and understand the world. From cameras and LiDAR to neural networks and SLAM: the sensory systems that enable autonomy.
Perception is the process of converting raw sensor data into actionable information about the environment. It answers fundamental questions: Where am I? What is around me? Where are the objects I need to interact with? Is it safe to move? Perception is the bridge between the physical world and the robot's decision-making systems.
Cameras are the most information-dense sensor in robotics. A single RGB frame contains millions of pixels of color information, enabling object detection, tracking, visual servoing, and scene understanding.
| Type | Output | Use Case | Example |
|---|---|---|---|
| Monocular RGB | Color images | Object detection, classification, tracking | FLIR Blackfly, Basler ace |
| Stereo | Rectified image pairs + disparity | Depth estimation, 3D reconstruction | Intel RealSense D435, ZED 2 |
| RGB-D | Color + depth per pixel | Object detection with depth, manipulation | Intel RealSense D455, Azure Kinect |
| Event camera | Asynchronous brightness changes | High-speed motion, HDR scenes | iniVation DAVIS346, Prophesee EVK4 |
| Thermal | Infrared temperature map | Night vision, human detection, industrial inspection | FLIR Lepton, Seek Thermal |
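The pinhole camera model maps 3D world points to 2D image pixels via projection:

$$
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} =
\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} R \mid t \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
$$

where (u, v) is the pixel coordinate, s is a scale factor equal to the point's depth in the camera frame, (fx, fy) are focal lengths in pixels, (cx, cy) is the principal point, [R|t] is the extrinsic matrix (camera pose), and (X, Y, Z) is the 3D point. Lens distortion (radial and tangential) is modeled by additional polynomial coefficients.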
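The projection can be sketched in a few lines of NumPy. This is a minimal illustration of the ideal (distortion-free) pinhole model; the intrinsic values and the helper name `project_points` are assumptions chosen for the example, not values from any real calibration.

```python
import numpy as np

def project_points(points_world, K, R, t):
    """Project Nx3 world points to Nx2 pixel coordinates with an ideal pinhole model."""
    points_world = np.asarray(points_world, dtype=float)
    # Transform into the camera frame: X_cam = R @ X_world + t
    points_cam = points_world @ R.T + t
    # Apply the intrinsics, then divide by depth (the scale factor s)
    uvw = points_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

# Illustrative intrinsics (assumed values, not from a real calibration)
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)      # camera axes aligned with the world frame
t = np.zeros(3)    # camera at the world origin

pixels = project_points([[0.0, 0.0, 2.0], [0.5, -0.25, 2.0]], K, R, t)
print(pixels)
```

A point on the optical axis lands on the principal point (320, 240), and the second point lands at (470, 165). Points with zero or negative depth are not handled in this sketch.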
Calibration determines intrinsic parameters (focal length, principal point, distortion) and extrinsic parameters (position/orientation relative to the robot). Zhang's method (2000), which uses images of a planar checkerboard pattern, is the standard approach and is implemented in OpenCV's calibrateCamera(). For stereo cameras, calibration also determines the baseline (distance between cameras) and rectification transforms.
Unlike conventional cameras that capture full frames at a fixed rate, event cameras (Dynamic Vision Sensors) output asynchronous events whenever a pixel's brightness changes. Each event is a tuple (x, y, t, polarity). Advantages: microsecond temporal resolution, 120+ dB dynamic range, no motion blur, low power. Challenge: the asynchronous event stream does not fit frame-based algorithms, so new ones are required. Research from the University of Zurich (Scaramuzza group) and KAIST has pioneered event-based visual odometry, SLAM, and object detection.
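One simple way to feed events into frame-based code is to sum polarities per pixel over a short time window. The sketch below assumes integer timestamps in microseconds and a boolean polarity; the `Event` type and `accumulate_events` helper are hypothetical names for illustration, not part of any vendor SDK.

```python
from collections import namedtuple

# One event: pixel location, timestamp, and sign of the brightness change
Event = namedtuple("Event", ["x", "y", "t", "polarity"])

def accumulate_events(events, width, height, t_start, t_end):
    """Sum event polarities per pixel over the time window [t_start, t_end)."""
    frame = [[0 for _ in range(width)] for _ in range(height)]
    for ev in events:
        if t_start <= ev.t < t_end:
            frame[ev.y][ev.x] += 1 if ev.polarity else -1
    return frame

events = [
    Event(x=1, y=0, t=10, polarity=True),
    Event(x=1, y=0, t=25, polarity=True),
    Event(x=2, y=1, t=40, polarity=False),
    Event(x=0, y=1, t=990, polarity=True),  # outside the window used below
]
frame = accumulate_events(events, width=3, height=2, t_start=0, t_end=100)
print(frame)
```

Accumulating into frames discards the fine timing that makes event cameras attractive, which is one reason dedicated event-based algorithms exist.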
Depth sensors measure the distance from the sensor to surfaces in the scene, producing a depth map in which each pixel stores a distance value. Depth is critical for obstacle avoidance, manipulation, and 3D mapping.
Structured light projects a known pattern (dots, stripes, or speckle) onto the scene and observes the deformation with a camera. The original Microsoft Kinect v1 (PrimeSense technology) used this approach. It works well indoors but fails in sunlight, where the projected pattern is washed out. Range: 0.5-5 m; accuracy: roughly 1-3 mm at 1 m.
Passive stereo uses two cameras separated by a known baseline to observe the same scene. Disparity (the pixel shift between left and right images) is inversely proportional to depth: Z = f * B / d, where f is the focal length in pixels, B is the baseline, and d is the disparity in pixels. Dense stereo matching algorithms (SGM, RAFT-Stereo, CREStereo) compute disparity for every pixel. Stereo works outdoors but struggles with textureless surfaces (white walls, glass).
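The disparity relation above is easy to check numerically. The focal length and baseline below are assumed illustrative values, and `disparity_to_depth` is a hypothetical helper name.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Return depth in meters from Z = f * B / d, or None for an invalid disparity."""
    if disparity_px <= 0:
        return None  # no valid match, or a point effectively at infinity
    return focal_px * baseline_m / disparity_px

focal_px = 700.0     # assumed focal length in pixels
baseline_m = 0.12    # assumed baseline in meters

for d in (84.0, 42.0, 8.4):
    print(d, disparity_to_depth(d, focal_px, baseline_m))
```

Halving the disparity doubles the depth. At long range a one-pixel matching error therefore shifts the depth estimate far more than it does up close, which is why stereo accuracy degrades with distance.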
Time-of-flight (ToF) sensors emit modulated infrared light and measure the phase shift of the returned signal to compute distance. Each pixel independently measures depth. The Microsoft Kinect v2 and Azure Kinect DK use ToF. Advantages: works on textureless surfaces, produces clean depth maps. Disadvantages: limited resolution (typically 512x512 or less), multipath interference in corners, limited range (0.5-5 m).
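The phase-to-distance step can be sketched with the standard continuous-wave relation d = c * phase / (4 * pi * f_mod). The 30 MHz modulation frequency is an assumed illustrative value, not the specification of any sensor named above.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # meters per second

def phase_to_distance(phase_rad, mod_freq_hz):
    """Distance implied by a measured phase shift of the modulated signal."""
    return SPEED_OF_LIGHT * phase_rad / (4.0 * math.pi * mod_freq_hz)

def unambiguous_range(mod_freq_hz):
    """Largest distance before the phase wraps past 2*pi."""
    return SPEED_OF_LIGHT / (2.0 * mod_freq_hz)

mod_freq_hz = 30e6  # assumed modulation frequency
print(unambiguous_range(mod_freq_hz))           # about 5 m
print(phase_to_distance(math.pi, mod_freq_hz))  # half of the unambiguous range
```

Because phase wraps every 2*pi, a single modulation frequency cannot tell a target apart from one that is a whole unambiguous range farther away. That wrap-around is one contributor to the short working range quoted above.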
Active stereo combines structured-light projection with stereo cameras. The Intel RealSense D400 series uses this approach: an IR projector creates texture on the scene, and two IR cameras perform stereo matching. This overcomes the textureless-surface limitation of passive stereo while still working in moderate outdoor lighting.
| Technology | Range | Resolution | Outdoor | Cost |
|---|---|---|---|---|
| Structured Light | 0.5-5m | 640x480+ | Poor | $100-500 |
| Passive Stereo | 0.5-100m | 1280x720+ | Good | $200-2000 |
| Time-of-Flight | 0.5-5m | 512x512 | Moderate | $300-1000 |
| Active Stereo | 0.3-10m | 1280x720 | Moderate | $200-500 |
| LiDAR | 1-200m | N/A (points) | Excellent | $500-75,000 |
LiDAR (Light Detection and Ranging) measures distance by emitting laser pulses and timing the return. It produces sparse but highly accurate 3D point clouds, making it the primary sensor for autonomous vehicles, surveying, and outdoor mobile robotics.
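As a rough sketch of the pulse-timing principle, the range is half the round-trip time multiplied by the speed of light, and each return becomes a 3D point from its range and beam angles. The angle convention (azimuth about the vertical axis, elevation from the horizontal plane) and the function names are assumptions for illustration; real sensors document their own conventions.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # meters per second

def range_from_round_trip(round_trip_s):
    """Range in meters: the pulse travels out and back, so halve the path."""
    return SPEED_OF_LIGHT * round_trip_s / 2.0

def polar_to_point(range_m, azimuth_rad, elevation_rad):
    """Convert one return (range plus beam angles) into an (x, y, z) point."""
    horizontal = range_m * math.cos(elevation_rad)
    return (horizontal * math.cos(azimuth_rad),
            horizontal * math.sin(azimuth_rad),
            range_m * math.sin(elevation_rad))

r = range_from_round_trip(200e-9)                # a 200 ns round trip
point = polar_to_point(r, math.radians(90.0), 0.0)
print(r, point)
```

A 200 ns round trip corresponds to roughly 30 m, so centimeter-level ranging needs timing resolution well below a nanosecond.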
Object detection identifies what objects are present in an image and where they are (bounding boxes). It is the foundation of robot perception for manipulation, navigation, and interaction.
The YOLO family treats detection as a single regression problem: divide the image into a grid, predict bounding boxes and class probabilities for each cell in one forward pass. Key milestones:
YOLO is the go-to choice for robotics when real-time performance matters: 30+ FPS on edge devices (Jetson Orin), 100+ FPS on desktop GPUs.
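Grid-based detectors typically emit several overlapping boxes per object, and many YOLO versions prune them with non-maximum suppression (NMS) based on intersection over union (IoU). The following is a minimal pure-Python sketch of greedy NMS with made-up boxes and scores; production pipelines use vectorized or GPU implementations.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop others that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate detections of one object plus one separate object
boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 260, 260)]
scores = [0.90, 0.75, 0.60]
kept = nms(boxes, scores)
print(kept)
```

The first two boxes overlap with an IoU of about 0.92, so only the higher-scoring one survives alongside the separate third box.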
Carion et al. (Facebook AI, 2020) introduced DETR: an end-to-end transformer-based detector that eliminates hand-crafted components (anchors, NMS). A CNN backbone extracts features, a transformer encoder/decoder processes them, and a set prediction loss matches predictions to ground truth. Follow-ups include Deformable DETR (faster convergence), DAB-DETR, and DINO (state-of-the-art accuracy). Transformers dominate detection benchmarks but are slower than YOLO for real-time robotics.
Grounding DINO (Liu et al., 2023) combines DINO with text grounding: given an image and a text prompt ("red cup on the table"), it detects and localizes the described objects. This open-set detection capability is transformative for robotics: the robot can be told what to look for in natural language without task-specific training.
While detection gives bounding boxes, segmentation provides pixel-level understanding of the scene.
Kirillov et al. (Meta AI, 2023) introduced the Segment Anything Model (SAM), a foundation model for segmentation. Trained on 11 million images and 1.1 billion masks, SAM can segment any object in any image given a point, box, or text prompt. SAM 2 (Ravi et al., 2024) extends this to video with temporal consistency. For robotics, SAM enables zero-shot object segmentation: it can segment objects the robot has never seen during training.
6-DOF pose estimation determines the position (x, y, z) and orientation (roll, pitch, yaw) of objects in the scene. Essential for robotic manipulation: the robot needs to know not just where an object is, but how it is oriented to plan a grasp.
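A 6-DOF pose is commonly packed into a 4x4 homogeneous transform so that points can be moved between frames with one matrix product. The sketch below assumes the Z-Y-X (yaw, pitch, roll) rotation order, which is only one of several conventions in use. The object pose and grasp point are hypothetical values.

```python
import numpy as np

def pose_to_matrix(x, y, z, roll, pitch, yaw):
    """Build a 4x4 homogeneous transform, assuming R = Rz(yaw) @ Ry(pitch) @ Rx(roll)."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = [x, y, z]
    return T

# Hypothetical object pose in the camera frame: 0.5 m ahead, yawed by 90 degrees
T_cam_obj = pose_to_matrix(0.0, 0.0, 0.5, 0.0, 0.0, np.pi / 2)
# A grasp point 10 cm along the object's own x-axis (homogeneous coordinates)
grasp_in_obj = np.array([0.1, 0.0, 0.0, 1.0])
grasp_in_cam = T_cam_obj @ grasp_in_obj
print(grasp_in_cam)
```

The same grasp point ends up along the camera's y-axis because of the object's yaw. This is why orientation, and not just position, matters for grasp planning.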
The BOP (Benchmark for 6D Object Pose Estimation) challenge, run annually since 2017 at ECCV/ICCV, is the standard evaluation for pose estimation methods. It includes datasets like YCB-Video (21 household objects), T-LESS (texture-less industrial parts), and LM-O (occluded objects). The BOP leaderboard tracks the state of the art: bop.felk.cvut.cz.
Simultaneous localization and mapping (SLAM) is the process of building a map of an unknown environment while simultaneously tracking the robot's position within that map. It is one of the most important problems in mobile robotics, studied intensively since the 1986 paper by Smith, Self, and Cheeseman.
The robot moves through an unknown environment, making noisy observations of landmarks. It must estimate both its own trajectory and the positions of all landmarks. The key insight: landmark observations are correlated through the robot's trajectory, so observing the same landmark from two locations constrains the robot's motion between those observations.
Visual SLAM uses cameras as the primary sensor. It tracks visual features across frames to estimate motion and build a 3D map of feature points.
Modern SLAM systems use factor graph optimization (pose graph SLAM) rather than filtering. The graph has nodes (robot poses, landmark positions) and edges (odometry constraints, observation constraints). Nonlinear least-squares solvers minimize the total error:
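A toy one-dimensional pose graph makes the idea concrete: every edge contributes one residual, and the solver picks the poses that minimize the sum of squared residuals. The measurements below are invented, all edges are weighted equally, and the problem is linear so a single `numpy.linalg.lstsq` call suffices. Real pose graphs involve rotations and need iterative solvers such as Gauss-Newton or Levenberg-Marquardt.

```python
import numpy as np

# Each row of A encodes one constraint on the poses [x0, x1, x2, x3];
# b holds the measured value for that constraint.
A = np.array([
    [ 1.0,  0.0,  0.0, 0.0],   # prior: x0 = 0 (anchors the graph)
    [-1.0,  1.0,  0.0, 0.0],   # odometry: x1 - x0 = 1.0
    [ 0.0, -1.0,  1.0, 0.0],   # odometry: x2 - x1 = 1.0
    [ 0.0,  0.0, -1.0, 1.0],   # odometry: x3 - x2 = -1.8 (drifted)
    [-1.0,  0.0,  0.0, 1.0],   # loop closure: x3 - x0 = 0.0
])
b = np.array([0.0, 1.0, 1.0, -1.8, 0.0])

# Dead reckoning: chain the odometry edges only
dead_reckoning = np.concatenate([[0.0], np.cumsum(b[1:4])])

# Least squares over all constraints, equally weighted
poses, *_ = np.linalg.lstsq(A, b, rcond=None)

print(dead_reckoning)  # final pose is 0.2 away from the start
print(poses)           # loop closure spreads the 0.2 error over all edges
```

Dead reckoning ends 0.2 away from the start even though the loop closure says the robot returned. The optimized solution distributes that discrepancy evenly, 0.05 per edge, because every edge was given the same weight.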
When the robot revisits a previously mapped location, loop closure detects this and corrects the accumulated drift. Without loop closure, SLAM drift grows without bound. Methods: bag-of-words image retrieval (DBoW2 in ORB-SLAM), scan context for LiDAR (Kim & Kim, 2018), and learned place recognition (NetVLAD, Patch-NetVLAD).
No single sensor is sufficient for robust robot perception. Sensor fusion combines data from multiple sensors to produce estimates that are more accurate, more complete, and more robust than any individual sensor.
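The Kalman filter has been the workhorse of sensor fusion since Rudolf Kalman's 1960 paper. For linear systems with Gaussian noise, it provides the optimal (minimum-variance) state estimate. It operates in two steps:

Predict:

$$
\hat{x}_{k|k-1} = F \hat{x}_{k-1|k-1}, \qquad P_{k|k-1} = F P_{k-1|k-1} F^\top + Q
$$

Update:

$$
K_k = P_{k|k-1} H^\top \left( H P_{k|k-1} H^\top + R \right)^{-1}
$$

$$
\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \left( z_k - H \hat{x}_{k|k-1} \right), \qquad P_{k|k} = (I - K_k H) P_{k|k-1}
$$

where x is the state, P is the covariance, Q is the process noise covariance, R is the measurement noise covariance, H is the observation matrix, and K is the Kalman gain. F denotes the state-transition matrix and z the measurement; the control-input term is omitted here.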
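The two steps can be sketched for a scalar state, where every matrix reduces to a number. The names F (state transition) and z (measurement) are assumptions introduced for this example, as are the numeric values: a static quantity with a vague prior, observed three times.

```python
def kalman_predict(x, P, F, Q):
    """Prediction step: push the estimate and its variance through the model."""
    x_pred = F * x
    P_pred = F * P * F + Q
    return x_pred, P_pred

def kalman_update(x_pred, P_pred, z, H, R):
    """Update step: blend the prediction with measurement z using gain K."""
    K = P_pred * H / (H * P_pred * H + R)
    x_new = x_pred + K * (z - H * x_pred)
    P_new = (1.0 - K * H) * P_pred
    return x_new, P_new, K

# Assumed setup: static state (F = 1, Q = 0), direct measurement (H = 1), R = 1
x, P = 0.0, 1.0
for z in (1.2, 0.8, 1.0):
    x, P = kalman_predict(x, P, F=1.0, Q=0.0)
    x, P, K = kalman_update(x, P, z, H=1.0, R=1.0)
    print(f"z={z:.1f}  K={K:.3f}  x={x:.3f}  P={P:.3f}")
```

The gain shrinks from 0.5 to 0.25 as the variance falls, so later measurements move the estimate less. With these particular values the final estimate equals the plain average of the prior mean and the three measurements.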
For nonlinear systems (which include virtually all robot dynamics), the extended Kalman filter (EKF) linearizes the prediction and observation models around the current estimate using Jacobians. It is widely used in IMU integration, GPS/INS fusion, and EKF-SLAM. Limitation: the linearization can be inaccurate for highly nonlinear systems.
Instead of linearizing, the unscented Kalman filter (UKF) uses sigma points: a deterministic set of sample points that capture the mean and covariance of the state distribution. Each sigma point is propagated through the actual nonlinear function. The UKF is more accurate than the EKF for highly nonlinear systems, with similar computational cost. Introduced by Julier and Uhlmann (1997).
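The sigma-point idea can be sketched as a standalone unscented transform. This follows the basic symmetric sigma-point set with a single scaling parameter `kappa`; other weightings exist, and the example covariance values are assumptions. A full UKF wraps this transform in predict and update steps, which are not shown.

```python
import numpy as np

def unscented_transform(mean, cov, f, kappa=1.0):
    """Propagate a Gaussian (mean, cov) through f using 2n + 1 sigma points."""
    n = mean.shape[0]
    # Matrix square root: S @ S.T == (n + kappa) * cov
    S = np.linalg.cholesky((n + kappa) * cov)
    sigma_points = [mean]
    sigma_points += [mean + S[:, i] for i in range(n)]
    sigma_points += [mean - S[:, i] for i in range(n)]
    # Weights sum to one; the center point gets kappa / (n + kappa)
    weights = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    weights[0] = kappa / (n + kappa)
    # Push every sigma point through the actual (possibly nonlinear) function
    transformed = np.array([f(p) for p in sigma_points])
    new_mean = weights @ transformed
    diffs = transformed - new_mean
    new_cov = (weights[:, None] * diffs).T @ diffs
    return new_mean, new_cov

def polar_to_cartesian(state):
    """Nonlinear map from (range, bearing) to (x, y)."""
    r, theta = state
    return np.array([r * np.cos(theta), r * np.sin(theta)])

mean = np.array([10.0, np.pi / 4])        # 10 m range, 45 degree bearing
cov = np.diag([0.1 ** 2, 0.05 ** 2])      # assumed sensor uncertainties
xy_mean, xy_cov = unscented_transform(mean, cov, polar_to_cartesian)
print(xy_mean)
print(xy_cov)

# Sanity check: a linear map must give exactly A @ mean and A @ cov @ A.T
A = np.array([[2.0, 0.0], [1.0, 3.0]])
lin_mean, lin_cov = unscented_transform(mean, cov, lambda s: A @ s)
```

No Jacobian is computed anywhere. The nonlinearity is handled entirely by evaluating the function at the sigma points.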
A particle filter represents the probability distribution as a set of weighted samples (particles). Each particle is a hypothesis for the state. At each step: propagate particles through the motion model, weight them by the observation likelihood, and resample. It handles multi-modal distributions (multiple hypotheses) and arbitrary nonlinearities. Particle filters are used in Monte Carlo Localization (MCL) for mobile robots, the standard localization algorithm in ROS (AMCL package). Computational cost scales with the number of particles.
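The propagate, weight, and resample loop can be sketched for a robot on a one-dimensional track. Everything here is an assumption made for illustration: the noise levels, the direct position measurement, the noise-free measurements (used to keep the demo repeatable), and plain multinomial resampling via `random.choices`. AMCL itself is considerably more elaborate.

```python
import math
import random

def gaussian_likelihood(error, sigma):
    """Unnormalized Gaussian likelihood of a measurement error."""
    return math.exp(-0.5 * (error / sigma) ** 2)

def particle_filter_step(particles, control, measurement, rng,
                         motion_sigma=0.1, meas_sigma=0.5):
    # 1. Propagate each hypothesis through the noisy motion model
    moved = [p + control + rng.gauss(0.0, motion_sigma) for p in particles]
    # 2. Weight each hypothesis by the observation likelihood
    weights = [gaussian_likelihood(measurement - p, meas_sigma) for p in moved]
    if sum(weights) == 0.0:
        weights = [1.0] * len(moved)  # fall back to uniform weights
    # 3. Resample with replacement in proportion to weight
    return rng.choices(moved, weights=weights, k=len(moved))

rng = random.Random(0)
# Start with no idea where the robot is on a 10 m track
particles = [rng.uniform(0.0, 10.0) for _ in range(500)]
true_position = 2.0
for _ in range(5):
    true_position += 1.0            # the robot drives 1 m forward
    measurement = true_position     # noise-free to keep the demo repeatable
    particles = particle_filter_step(particles, 1.0, measurement, rng)

estimate = sum(particles) / len(particles)
print(true_position, estimate)
```

After resampling all particles carry equal weight, so the plain mean serves as the estimate. With a multi-modal belief the mean can fall between modes, which is why practical systems often report the strongest cluster instead.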
| Sensors | Application | Method |
|---|---|---|
| IMU + GPS | Outdoor localization | EKF or error-state KF |
| Camera + IMU (VIO) | Visual-inertial odometry | MSCKF, OKVIS, VINS-Mono |
| LiDAR + IMU | LiDAR-inertial SLAM | LIO-SAM, FAST-LIO2 |
| Camera + LiDAR | Dense 3D perception | Projection + late fusion |
| Camera + LiDAR + Radar | Autonomous driving | BEVFusion, TransFusion |
| Wheel encoders + IMU + LiDAR | Indoor mobile robot | EKF + AMCL or Cartographer |
Visual-inertial odometry (VIO) combines camera and IMU data for 6-DOF pose estimation. The IMU provides high-rate (100-400 Hz) angular velocity and acceleration measurements; the camera provides lower-rate (15-60 Hz) relative pose constraints that correct the drift of the integrated IMU measurements. Key systems include MSCKF, OKVIS, and VINS-Mono.
A point cloud is a set of 3D points representing the surfaces in a scene. Point clouds are produced by LiDAR, depth cameras, and multi-view stereo, and are the primary data structure for 3D perception.
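For a depth camera, a point cloud is obtained by back-projecting each valid depth pixel through the pinhole intrinsics. The sketch below uses a tiny made-up depth image and assumed intrinsics, and treats a depth of zero as "no measurement" (a common but not universal convention).

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project an HxW depth image (meters) into an Nx3 camera-frame point cloud."""
    h, w = depth.shape
    # u holds the column index and v the row index of every pixel
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0          # zero marks missing depth in this sketch
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.column_stack([x, y, z])

# A 2x3 depth image with one missing pixel; intrinsics are illustrative only
depth = np.array([[1.0, 1.0, 0.0],
                  [2.0, 2.0, 2.0]])
points = depth_to_points(depth, fx=2.0, fy=2.0, cx=1.0, cy=0.5)
print(points)
```

The result is an unordered N x 3 array, the form most point cloud libraries accept as input for filtering, registration, and segmentation.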
OpenCV: the standard computer vision library. Image processing, feature detection, camera calibration, stereo matching, object tracking. C++ and Python bindings.
Open3D: a modern 3D data processing library. Point clouds, meshes, RGBD integration, TSDF, ICP, visualization. Excellent Python API.
Point Cloud Library (PCL): a comprehensive C++ library for 3D point cloud processing. Filtering, segmentation, registration, surface reconstruction, feature extraction.
Ultralytics YOLO: production-ready object detection, segmentation, and pose estimation. Easy training, export to ONNX/TensorRT/CoreML for edge deployment.
GTSAM: a factor graph optimization library for SLAM, SfM, and state estimation. Efficient incremental inference using Bayes trees.
COLMAP: a Structure-from-Motion and Multi-View Stereo pipeline. The standard tool for 3D reconstruction from images.
Probabilistic Robotics (Thrun, Burgard, and Fox): THE textbook for robot perception. Covers Kalman filters, particle filters, SLAM, localization, occupancy grids, and Bayesian decision-making in depth.
Robotics, Vision and Control (Corke): a practical treatment of robot vision with a MATLAB toolbox. Camera models, image features, visual servoing, and navigation.
Segment Anything (Kirillov et al., 2023): the SAM paper from Meta AI. Trained on SA-1B (1.1 billion masks), SAM is a foundation model for promptable segmentation. Transformative for robotics perception.
ORB-SLAM3: visual, visual-inertial, and multi-map SLAM. The most complete open-source SLAM system. Handles monocular, stereo, RGB-D, and fisheye cameras.
NeRF: Neural Radiance Fields for view synthesis. Encodes a scene as a continuous 5D function (position + direction -> color + density). Sparked a revolution in 3D reconstruction.
Past, Present, and Future of Simultaneous Localization and Mapping (Cadena et al., 2016): a comprehensive survey of SLAM covering 30 years of progress. Identifies open problems including long-term autonomy, semantic understanding, and resource-constrained operation.