Robot Perception

How robots see, sense, and understand the world. From cameras and LiDAR to neural networks and SLAM — the sensory systems that enable autonomy.

What is Robot Perception?

Perception is the process of converting raw sensor data into actionable information about the environment. It answers fundamental questions: Where am I? What is around me? Where are the objects I need to interact with? Is it safe to move? Perception is the bridge between the physical world and the robot's decision-making systems.

[Diagram: Camera, LiDAR, and IMU → Sensor Fusion (EKF / Factor Graph) → State Estimate (Pose + Map)]

The Perception Pipeline

Sensing — raw data acquisition from cameras, LiDAR, IMUs, encoders, force/torque sensors, microphones, tactile sensors.
Preprocessing — noise filtering, calibration correction, time synchronization, data alignment.
Feature extraction — identifying meaningful patterns: edges, corners, planes, objects, surfaces.
Interpretation — semantic understanding: object classification, scene labeling, activity recognition.
State estimation — fusing noisy, uncertain measurements into a coherent estimate of the world state.

Why Perception is Hard

Noise and uncertainty — every sensor measurement is corrupted by noise. Depth cameras have systematic errors; LiDAR returns have range-dependent accuracy; cameras are affected by lighting changes.
Partial observability — robots can only see what is in front of them. Occluded objects, transparent surfaces, and reflections create ambiguity.
Real-time constraints — perception must keep up with the robot's control loop. A self-driving car at 60 mph covers about 27 meters every second, or roughly 2.7 m during a 100 ms processing delay, so perception must deliver results within tens of milliseconds.
Open-world problem — unlike controlled industrial settings, real-world environments contain an unbounded variety of objects, materials, and conditions.
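To make the real-time constraint concrete, the sketch below converts a speed in mph to meters per second and computes how far a vehicle moves while one perception cycle is still running. The function name `distance_during_latency` and the latency values are illustrative assumptions, not part of any library.

```python
MPH_TO_MPS = 0.44704  # exact conversion: 1 mile = 1609.344 m, 1 hour = 3600 s


def distance_during_latency(speed_mph: float, latency_ms: float) -> float:
    """Distance in meters travelled while perception is still processing."""
    speed_mps = speed_mph * MPH_TO_MPS
    return speed_mps * (latency_ms / 1000.0)


if __name__ == "__main__":
    # Illustrative latencies; real pipelines vary widely.
    for latency_ms in (10, 50, 100):
        d = distance_during_latency(60.0, latency_ms)
        print(f"{latency_ms:>4} ms latency -> {d:.2f} m travelled")
```

At 60 mph, every 100 ms of pipeline delay corresponds to roughly 2.7 m of travel before the robot can react.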

👀 Robot perception is giving robots eyes, ears, and touch. Cameras are robot eyes. LiDAR is like echolocation (what bats use) — it shoots laser beams and listens for them to bounce back. Together, they help robots understand "what's around me?" and "where am I?"

Cameras & Image Sensors

Cameras are the most information-dense sensor in robotics. A single RGB frame contains millions of pixels of color information, enabling object detection, tracking, visual servoing, and scene understanding.

Camera Types

Type	Output	Use Case	Example
Monocular RGB	Color images	Object detection, classification, tracking	FLIR Blackfly, Basler ace
Stereo	Rectified image pairs + disparity	Depth estimation, 3D reconstruction	Intel RealSense D435, ZED 2
RGB-D	Color + depth per pixel	Object detection with depth, manipulation	Intel RealSense D455, Azure Kinect
Event camera	Asynchronous brightness changes	High-speed motion, HDR scenes	iniVation DAVIS346, Prophesee EVK4
Thermal	Infrared temperature map	Night vision, human detection, industrial inspection	FLIR Lepton, Seek Thermal

Camera Model

The pinhole camera model maps 3D world points to 2D image pixels via projection:

[u, v, 1]^T = (1/Z_c) * K * [R | t] * [X, Y, Z, 1]^T
K = [fx 0 cx; 0 fy cy; 0 0 1] (intrinsic matrix)

where (fx, fy) are focal lengths in pixels, (cx, cy) is the principal point, [R|t] is the extrinsic matrix (the world-to-camera transform), (X, Y, Z) is the 3D point in world coordinates, and Z_c is that point's depth in the camera frame, i.e. the third component of [R | t] * [X, Y, Z, 1]^T. Lens distortion (radial and tangential) is modeled by additional polynomial coefficients.
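The projection can be sketched in a few lines of plain Python. This is a minimal illustration of the ideal pinhole model only: it ignores lens distortion, and the function name `project_point` and the example intrinsics are assumptions chosen for the demonstration.

```python
from typing import Sequence, Tuple


def project_point(
    point_world: Sequence[float],
    R: Sequence[Sequence[float]],
    t: Sequence[float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Tuple[float, float]:
    """Project a 3D world point to pixel coordinates (ideal pinhole, no distortion)."""
    # World frame -> camera frame: p_cam = R * p_world + t
    x_c, y_c, z_c = (
        sum(R[i][j] * point_world[j] for j in range(3)) + t[i] for i in range(3)
    )
    if z_c <= 0:
        raise ValueError("point is behind the camera")
    # Apply the intrinsics and divide by the camera-frame depth
    u = fx * x_c / z_c + cx
    v = fy * y_c / z_c + cy
    return u, v


# Example: camera at the world origin, looking along +Z (identity extrinsics).
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v = project_point((0.5, -0.25, 2.0), IDENTITY, (0.0, 0.0, 0.0),
                     fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(u, v)
```

Note that the division uses the depth after the extrinsic transform has been applied, which only equals the world Z coordinate when the camera frame and world frame coincide.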

Camera Calibration

Calibration determines intrinsic parameters (focal length, principal point, distortion) and extrinsic parameters (position/orientation relative to the robot). Zhang's method (2000) uses images of a planar checkerboard pattern — the standard approach implemented in OpenCV's calibrateCamera(). For stereo cameras, calibration also determines the baseline (distance between cameras) and rectification transforms.

Event Cameras

Unlike conventional cameras that capture full frames at a fixed rate, event cameras (Dynamic Vision Sensors) output asynchronous events whenever a pixel's brightness changes. Each event is a tuple (x, y, t, polarity). Advantages: microsecond temporal resolution, 120+ dB dynamic range, no motion blur, low power. Challenges: the event stream is a completely different data format, so frame-based algorithms do not apply directly and new algorithms are required. Research from the University of Zurich (Scaramuzza group) and KAIST has pioneered event-based visual odometry, SLAM, and object detection.
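One simple way to feed events into frame-based code is to accumulate them over a short time window. The sketch below assumes polarity is encoded as +1 / -1 (some sensors and drivers use 1 / 0 instead); `Event` and `accumulate_events` are hypothetical names used only for this illustration.

```python
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    x: int         # pixel column
    y: int         # pixel row
    t: float       # timestamp in seconds
    polarity: int  # assumed encoding: +1 brightness increase, -1 decrease


def accumulate_events(
    events: Iterable[Event], width: int, height: int, t_start: float, t_end: float
) -> List[List[int]]:
    """Sum event polarities per pixel over the window [t_start, t_end)."""
    frame = [[0] * width for _ in range(height)]
    for e in events:
        if t_start <= e.t < t_end:
            frame[e.y][e.x] += e.polarity
    return frame


events = [
    Event(1, 0, 0.001, +1),
    Event(1, 0, 0.002, +1),
    Event(0, 1, 0.003, -1),
    Event(2, 1, 0.020, +1),  # falls outside the 10 ms window below
]
print(accumulate_events(events, width=3, height=2, t_start=0.0, t_end=0.010))
```

Accumulation discards the fine timing that makes event cameras attractive, so it is a convenience for prototyping rather than a substitute for event-native algorithms.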

Depth Sensing

Depth sensors measure the distance from the sensor to surfaces in the scene, producing a depth map (each pixel = distance value). Critical for obstacle avoidance, manipulation, and 3D mapping.

Structured Light

Projects a known pattern (dots, stripes, or speckle) onto the scene and observes the deformation with a camera. The original Microsoft Kinect v1 (PrimeSense technology) used this approach. Works well indoors but fails in sunlight (the projected pattern is washed out). Range: 0.5-5m, accuracy: ~1-3mm at 1m.

Stereo Vision

Two cameras separated by a known baseline observe the same scene. Disparity (pixel shift between left and right images) is inversely proportional to depth: Z = f * B / d, where f is focal length, B is baseline, d is disparity. Dense stereo matching algorithms (SGM, RAFT-Stereo, CREStereo) compute disparity for every pixel. Works outdoors but struggles with textureless surfaces (white walls, glass).
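The relation Z = f * B / d is easy to check numerically. The sketch below also includes the first-order error estimate obtained by differentiating that relation, |dZ| ≈ Z^2 / (f * B) * |dd|, which shows why stereo depth error grows quadratically with distance. Function names and the example camera parameters are illustrative assumptions.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth in meters from Z = f * B / d (f in pixels, B in meters, d in pixels)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error(
    focal_px: float, baseline_m: float, depth_m: float, disparity_error_px: float
) -> float:
    """First-order depth uncertainty: |dZ| ~= Z^2 / (f * B) * |dd|."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


# Hypothetical stereo rig: 700 px focal length, 12 cm baseline.
f_px, b_m = 700.0, 0.12
print(depth_from_disparity(f_px, b_m, disparity_px=42.0))
for z in (2.0, 4.0, 8.0):
    print(z, depth_error(f_px, b_m, z, disparity_error_px=0.25))
```

Doubling the distance quadruples the depth error for the same disparity error, which is why a longer baseline is preferred for long-range stereo.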

Time-of-Flight (ToF)

Emits modulated infrared light and measures the phase shift of the returned signal to compute distance. Each pixel independently measures depth. The Microsoft Kinect v2 and Azure Kinect DK use ToF. Advantages: works on textureless surfaces, produces clean depth maps. Disadvantages: limited resolution (typically 512x512 or less), multipath interference in corners, limited range (0.5-5m).
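For a continuous-wave ToF sensor, the measured phase shift maps to distance as d = c * Δφ / (4π * f_mod), and the phase wraps once the round trip exceeds one modulation period. The sketch below illustrates both relations; the 30 MHz modulation frequency is an assumed example value, not a specification of any particular sensor.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def tof_distance(phase_rad: float, mod_freq_hz: float) -> float:
    """Distance from the phase shift of a continuous-wave ToF signal."""
    return SPEED_OF_LIGHT * phase_rad / (4.0 * math.pi * mod_freq_hz)


def unambiguous_range(mod_freq_hz: float) -> float:
    """Largest distance measurable before the phase wraps past 2*pi."""
    return SPEED_OF_LIGHT / (2.0 * mod_freq_hz)


f_mod = 30e6  # assumed modulation frequency for illustration
print(f"unambiguous range: {unambiguous_range(f_mod):.2f} m")
print(f"distance at phase pi: {tof_distance(math.pi, f_mod):.2f} m")
```

With a modulation frequency of a few tens of MHz the unambiguous range is a few meters, which is consistent with the short working range quoted above.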

Active Stereo

Combines structured light projection with stereo cameras. The Intel RealSense D400 series uses active stereo: an IR projector creates texture on the scene, and two IR cameras perform stereo matching. This overcomes the textureless-surface limitation of passive stereo while working in moderate outdoor lighting.

Comparison

Technology	Range	Resolution	Outdoor	Cost
Structured Light	0.5-5m	640x480+	Poor	$100-500
Passive Stereo	0.5-100m	1280x720+	Good	$200-2000
Time-of-Flight	0.5-5m	512x512	Moderate	$300-1000
Active Stereo	0.3-10m	1280x720	Moderate	$200-500
LiDAR	1-200m	N/A (points)	Excellent	$500-75,000

LiDAR

LiDAR (Light Detection and Ranging) measures distance by emitting laser pulses and timing the return. It produces sparse but highly accurate 3D point clouds, making it the primary sensor for autonomous vehicles, surveying, and outdoor mobile robotics.

Types of LiDAR

Mechanical spinning — a rotating head with multiple laser emitters/receivers. The Velodyne VLP-16 (16 channels, 300k points/sec) and Ouster OS1 (32/64/128 channels) are industry standards for autonomous vehicles. 360-degree coverage, 100m+ range.
Solid-state — no mechanically spinning head. Uses MEMS mirrors, optical phased arrays, or flash illumination. Smaller, cheaper, and more reliable. Most units have a limited field of view (typically 70-120 degrees); the Livox Mid-360, widely used in robotics, is a hybrid design that covers 360 degrees horizontally.
Flash LiDAR — illuminates the entire field of view simultaneously with a single pulse. Very fast (no scanning), but limited range. Used in automotive short-range sensing.
FMCW LiDAR — Frequency-Modulated Continuous Wave. Measures both distance AND velocity simultaneously (Doppler shift). Aeva and Aurora are developing FMCW LiDAR for autonomous driving. Immune to interference from other LiDAR sensors.

LiDAR Data Characteristics

Each point has (x, y, z) coordinates, intensity (reflectivity), and sometimes return number (multiple returns from vegetation).
Typical density: 100k-2M points per second for automotive LiDAR.
Accuracy: 1-3cm at 100m range for high-end units (Velodyne Alpha Prime, Hesai AT128).
Sparse compared to cameras: a 128-channel LiDAR produces ~300k points per scan vs ~2M pixels per camera frame.
No color information (unless fused with cameras).
Works in all lighting conditions (day, night, direct sunlight).

LiDAR Processing

Ground segmentation — separate ground plane from objects. Methods: RANSAC plane fitting, height-based filtering, or learned models.
Clustering — group non-ground points into objects. DBSCAN and Euclidean clustering are common.
Registration — align point clouds from different scans. ICP (Iterative Closest Point) and NDT (Normal Distributions Transform) are classical methods; modern approaches use learned features.
3D object detection — detect and classify objects (cars, pedestrians, cyclists) directly from point clouds. PointPillars (Lang et al., 2019), CenterPoint (Yin et al., 2021), and VoxelNet are key architectures.
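As an illustration of the ground segmentation step, here is a minimal RANSAC plane fit in plain Python. It is a sketch under simplifying assumptions (a single dominant ground plane, a fixed inlier threshold, a seeded random generator for repeatability); the function names and default parameters are hypothetical, and production code would normally use a library such as PCL or Open3D.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Plane = Tuple[float, float, float, float]  # (a, b, c, d) with a*x + b*y + c*z + d = 0


def plane_from_points(p1: Point, p2: Point, p3: Point) -> Optional[Plane]:
    """Plane through three points with a unit normal, or None if they are collinear."""
    ux, uy, uz = (p2[i] - p1[i] for i in range(3))
    vx, vy, vz = (p3[i] - p1[i] for i in range(3))
    nx, ny, nz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    norm = math.sqrt(nx * nx + ny * ny + nz * nz)
    if norm < 1e-9:
        return None
    a, b, c = nx / norm, ny / norm, nz / norm
    return a, b, c, -(a * p1[0] + b * p1[1] + c * p1[2])


def ransac_ground(
    points: Sequence[Point], threshold: float = 0.05, iterations: int = 100, seed: int = 0
) -> Tuple[List[Point], List[Point]]:
    """Split points into (ground, non_ground) using the plane with the most inliers."""
    rng = random.Random(seed)
    best: List[int] = []
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(list(points), 3))
        if plane is None:
            continue
        a, b, c, d = plane
        # Point-to-plane distance is |a*x + b*y + c*z + d| because the normal is unit length
        inliers = [i for i, p in enumerate(points)
                   if abs(a * p[0] + b * p[1] + c * p[2] + d) <= threshold]
        if len(inliers) > len(best):
            best = inliers
    ground_idx = set(best)
    ground = [points[i] for i in best]
    non_ground = [p for i, p in enumerate(points) if i not in ground_idx]
    return ground, non_ground
```

The non-ground points returned here would then be passed to the clustering step (for example DBSCAN) to form object candidates.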

Object Detection

Object detection identifies what objects are present in an image and where they are (bounding boxes). It is the foundation of robot perception for manipulation, navigation, and interaction.

YOLO (You Only Look Once)

The YOLO family treats detection as a single regression problem: divide the image into a grid, predict bounding boxes and class probabilities for each cell in one forward pass. Key milestones:

YOLOv1 (Redmon et al., 2016) — first real-time detector with reasonable accuracy. 45 FPS on a GPU. Published at CVPR 2016.
YOLOv3 (Redmon, 2018) — multi-scale detection, Darknet-53 backbone. Practical sweet spot of speed vs accuracy for several years.
YOLOv5 (Ultralytics, 2020) — PyTorch implementation, extensive training pipeline. Not a paper but widely adopted in industry.
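YOLOv8 (Ultralytics, 2023) — anchor-free detection, improved accuracy. Among the strongest real-time detectors at its release. Supports detection, segmentation, and pose estimation.
YOLO11 (Ultralytics, 2024) / YOLO-World (Cheng et al., 2024) — YOLO11 continues the Ultralytics line with further accuracy and efficiency refinements; YOLO-World adds open-vocabulary detection (detect any object described in text without retraining).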

YOLO is the go-to choice for robotics when real-time performance matters: 30+ FPS on edge devices (Jetson Orin), 100+ FPS on desktop GPUs.

DETR (Detection Transformer)

Carion et al. (Facebook AI, 2020) introduced DETR: an end-to-end transformer-based detector that eliminates hand-crafted components (anchors, NMS). A CNN backbone extracts features, a transformer encoder/decoder processes them, and a set prediction loss matches predictions to ground truth. Follow-ups include Deformable DETR (faster convergence), DAB-DETR, and DINO (state-of-the-art accuracy). Transformers dominate detection benchmarks but are slower than YOLO for real-time robotics.

Grounding DINO

Liu et al. (2023) combined DINO with text grounding: given an image and a text prompt ("red cup on the table"), it detects and localizes the described objects. This open-set detection capability is transformative for robotics — the robot can be told what to look for in natural language without task-specific training.

Detection Metrics

mAP (mean Average Precision) — the standard metric. Computed as the area under the precision-recall curve, averaged over all classes. COCO mAP uses IoU thresholds from 0.5 to 0.95 in steps of 0.05.
IoU (Intersection over Union) — measures overlap between predicted and ground-truth bounding boxes. IoU > 0.5 is a "correct" detection in the PASCAL VOC metric.
FPS (Frames Per Second) — throughput. For robotics, 10+ FPS is minimum for reactive behavior; 30+ FPS is preferred.
Latency — time from image capture to detection result. Critical for time-sensitive tasks (obstacle avoidance). Different from FPS because of pipelining.
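IoU is simple enough to compute by hand, and doing so makes the COCO threshold sweep easier to picture. The sketch below assumes axis-aligned boxes in (x_min, y_min, x_max, y_max) format; the function name and box layout are illustrative choices, not tied to a particular library.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

# COCO evaluates at IoU thresholds 0.50, 0.55, ..., 0.95
COCO_IOU_THRESHOLDS: List[float] = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


prediction, ground_truth = (0.0, 0.0, 10.0, 7.7), (0.0, 0.0, 10.0, 10.0)
score = iou(prediction, ground_truth)
passed = [t for t in COCO_IOU_THRESHOLDS if score >= t]
print(f"IoU = {score:.2f}, counts as correct at thresholds {passed}")
```

A detection that is comfortably correct under the PASCAL VOC criterion (IoU > 0.5) can still fail the stricter COCO thresholds, which is why COCO mAP values are typically lower than VOC mAP for the same model.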

Segmentation

While detection gives bounding boxes, segmentation provides pixel-level understanding of the scene.

Types of Segmentation

Semantic segmentation — assign a class label to every pixel (road, sidewalk, car, tree). Does not distinguish individual instances of the same class.
Instance segmentation — detect individual objects AND delineate their boundaries at pixel level. Mask R-CNN (He et al., 2017) is the foundational architecture.
Panoptic segmentation — combines semantic and instance segmentation: classify every pixel and distinguish individual instances for "things" (countable objects like cars, people) while treating "stuff" (uncountable regions like sky, road) with semantic labels only.

Segment Anything Model (SAM)

Kirillov et al. (Meta AI, 2023) introduced SAM, a foundation model for segmentation. Trained on 11 million images and 1.1 billion masks, SAM can segment any object in any image given a point, box, or mask prompt (text prompts were explored in the paper but are not part of the released model). SAM 2 (Ravi et al., 2024) extends this to video with temporal consistency. For robotics, SAM enables zero-shot object segmentation: segmenting objects the robot has never seen during training.

Robotics Applications

Bin picking — instance segmentation identifies individual parts in a cluttered bin for grasp planning.
Navigation — semantic segmentation labels traversable ground, obstacles, and lane markings.
Manipulation — segmenting the target object from the background enables precise grasp point estimation.
Scene understanding — panoptic segmentation gives a complete understanding of what is where.

Pose Estimation

6-DOF pose estimation determines the position (x, y, z) and orientation (roll, pitch, yaw) of objects in the scene. Essential for robotic manipulation: the robot needs to know not just where an object is, but how it is oriented to plan a grasp.

Approaches

Correspondence-based — find 2D-3D correspondences between image features and a known 3D model, then solve PnP (Perspective-n-Point). Classical approach using feature detectors (SIFT, ORB) and RANSAC for outlier rejection.
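Direct regression — neural networks predict the 6-DOF pose directly from the image, as in PoseCNN (Xiang et al., 2018). Fast but less accurate for symmetric objects. Hybrid methods such as PVNet (Peng et al., 2019) instead predict 2D keypoints by pixel-wise voting and then recover the pose with PnP.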
Render-and-compare — render the 3D model at candidate poses and compare with the observed image. CosyPose (Labbe et al., 2020) iteratively refines the pose. Accurate but slower.
FoundationPose — Wen et al. (NVIDIA, 2024) proposed a foundation model for 6-DOF pose estimation that works with novel objects using either a 3D model or a small set of reference images. No object-specific training needed.

BOP Benchmark

The BOP (Benchmark for 6D Object Pose Estimation) challenge, held regularly since 2017 in conjunction with ECCV/ICCV, is the standard evaluation for pose estimation methods. It includes datasets like YCB-Video (21 household objects), T-LESS (texture-less industrial parts), and LM-O (occluded objects). The BOP leaderboard tracks the state of the art: bop.felk.cvut.cz.

SLAM — Simultaneous Localization and Mapping

SLAM is the process of building a map of an unknown environment while simultaneously tracking the robot's position within that map. It is one of the most important problems in mobile robotics, studied intensively since the 1986 paper by Smith, Self, and Cheeseman.

[Diagram: the SLAM loop. Sense → Extract Features → Update Map → Localize → Move, repeated continuously]

The SLAM Problem

The robot moves through an unknown environment, making noisy observations of landmarks. It must estimate both its own trajectory and the positions of all landmarks. The key insight: landmark observations are correlated through the robot's trajectory — observing the same landmark from two locations constrains the robot's motion between those observations.

Visual SLAM (vSLAM)

Uses cameras as the primary sensor. Tracks visual features across frames to estimate motion and build a 3D map of feature points.

ORB-SLAM3 (Campos et al., 2021) — the most complete visual SLAM system. Handles monocular, stereo, and RGB-D cameras plus IMU. Includes loop closure, relocalization, and multi-map management. Open-source: github.com/UZ-SLAMLab/ORB_SLAM3.
LSD-SLAM (Engel et al., 2014) — direct method (no feature extraction); uses pixel intensities directly. Produces semi-dense depth maps.
DSO (Engel et al., 2018) — Direct Sparse Odometry. Combines direct photometric error with sparse point selection for efficiency.
DROID-SLAM (Teed & Deng, 2021) / DPVO (Teed et al., 2023) — learned visual SLAM and visual odometry using differentiable optimization layers. Strong accuracy on standard benchmarks.

LiDAR SLAM

LOAM (Zhang & Singh, 2014) — LiDAR Odometry and Mapping. Extracts edge and planar features from scans, performs scan-to-scan matching at high frequency and scan-to-map matching at low frequency. The gold standard for years.
LeGO-LOAM (Shan & Englot, 2018) — lightweight ground-optimized LOAM. Faster and works well on ground vehicles.
KISS-ICP (Vizzo et al., 2023) — simple, robust LiDAR odometry. "Keep It Small and Simple" — point-to-point ICP with adaptive thresholds. Surprisingly competitive with complex methods.
LIO-SAM (Shan et al., 2020) — tightly-coupled LiDAR-IMU SLAM using factor graphs. GTSAM backend. Robust in aggressive motion.

Backend Optimization

Modern SLAM systems typically use factor graph optimization (of which pose graph SLAM is a special case) rather than filtering. The graph has nodes (robot poses, landmark positions) and edges (odometry constraints, observation constraints). Nonlinear least-squares solvers minimize the total error over all constraints. Widely used backends:

GTSAM (Georgia Tech Smoothing and Mapping) — C++ library using factor graphs and Bayes trees. Created by Frank Dellaert. Used in LIO-SAM, Kimera, and many research systems.
g2o (General Graph Optimization) — another widely-used graph optimization framework. Used in ORB-SLAM.
Ceres Solver (Google) — general nonlinear least-squares solver. Used in Cartographer (Google's 2D/3D SLAM for backpack mapping).
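The idea behind graph optimization can be shown on a toy one-dimensional pose graph, where the problem is linear and a single solve of the normal equations gives the optimum. This is only a sketch: real backends such as those listed above handle 3D poses, nonlinear residuals, and sparse incremental solvers. All names here (`optimize_pose_graph_1d`, the edge tuple layout, the prior weight) are assumptions made for the illustration.

```python
from typing import List, Sequence, Tuple

# Edge (i, j, z, w): a measurement z of x_j - x_i with weight w (inverse variance).
Edge = Tuple[int, int, float, float]


def solve_linear_system(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def optimize_pose_graph_1d(
    num_poses: int, edges: Sequence[Edge], prior_weight: float = 1e6
) -> List[float]:
    """Minimize sum_k w_k * ((x_j - x_i) - z_k)^2 with x_0 anchored at 0."""
    H = [[0.0] * num_poses for _ in range(num_poses)]
    g = [0.0] * num_poses
    H[0][0] += prior_weight  # strong prior fixes the gauge freedom (x_0 = 0)
    for i, j, z, w in edges:
        H[i][i] += w
        H[j][j] += w
        H[i][j] -= w
        H[j][i] -= w
        g[i] -= w * z
        g[j] += w * z
    return solve_linear_system(H, g)


# Robot drives out and back; odometry sums to 0.2 m instead of 0.
odometry = [(0, 1, 1.1, 1.0), (1, 2, 1.0, 1.0), (2, 3, -0.9, 1.0), (3, 4, -1.0, 1.0)]
loop_closure = [(0, 4, 0.0, 1.0)]  # "pose 4 is back at pose 0"
print(optimize_pose_graph_1d(5, odometry + loop_closure))
```

In this example the 0.2 m of accumulated odometry error is spread evenly over the five equally weighted constraints, so the final pose ends up at about 0.04 m instead of the dead-reckoned 0.2 m. Raising the weight of the loop-closure edge pulls it closer to zero.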

Loop Closure

When the robot revisits a previously mapped location, loop closure detects this and corrects the accumulated drift. Without loop closure, SLAM drift grows without bound. Methods: bag-of-words image retrieval (DBoW2 in ORB-SLAM), scan context for LiDAR (Kim & Kim, 2018), and learned place recognition (NetVLAD, Patch-NetVLAD).

🗺️ SLAM is like being dropped in a dark cave with a flashlight. You don't have a map, and you don't know where you are. As you walk around shining your flashlight, you build a map AND figure out where you are at the same time. That's exactly what robots do with their cameras and lasers!

Sensor Fusion

No single sensor is sufficient for robust robot perception. Sensor fusion combines data from multiple sensors to produce estimates that are more accurate, more complete, and more robust than any individual sensor.

Kalman Filter

The workhorse of sensor fusion since Rudolf Kalman's 1960 paper. For linear systems with Gaussian noise, the Kalman filter provides the optimal (minimum variance) state estimate. It operates in two steps:

Predict: x_hat = A*x + B*u, P = A*P*A^T + Q
Update: K = P*H^T*(H*P*H^T + R)^{-1}, x = x_hat + K*(z - H*x_hat), P = (I-K*H)*P

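where x is the state, x_hat is the predicted state, u is the control input, z is the measurement, A is the state-transition matrix, B is the control-input matrix, P is the state covariance, Q is the process noise covariance, R is the measurement noise covariance, H is the observation matrix, and K is the Kalman gain.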
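The two steps are easiest to follow in the scalar case, where every matrix reduces to a number. The sketch below is a minimal one-dimensional illustration of the predict and update equations; the function names and the example values are assumptions for demonstration.

```python
from typing import Tuple


def kf_predict(x: float, P: float, A: float, B: float, u: float, Q: float) -> Tuple[float, float]:
    """Predict step: propagate the state and inflate the covariance."""
    x_pred = A * x + B * u
    P_pred = A * P * A + Q
    return x_pred, P_pred


def kf_update(x_pred: float, P_pred: float, z: float, H: float, R: float) -> Tuple[float, float]:
    """Update step: blend prediction and measurement using the Kalman gain."""
    K = P_pred * H / (H * P_pred * H + R)
    x = x_pred + K * (z - H * x_pred)
    P = (1.0 - K * H) * P_pred
    return x, P


# Track a constant value from noisy readings (illustrative numbers).
x, P = 0.0, 1.0
for z in (1.0, 0.9, 1.1, 1.0):
    x, P = kf_predict(x, P, A=1.0, B=0.0, u=0.0, Q=0.01)
    x, P = kf_update(x, P, z, H=1.0, R=1.0)
    print(f"z={z:.1f}  estimate={x:.3f}  variance={P:.3f}")
```

When the prior variance equals the measurement variance, the gain is 0.5 and the estimate lands halfway between prediction and measurement; as P shrinks relative to R, new measurements are trusted less.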

Extended Kalman Filter (EKF)

For nonlinear systems (which includes virtually all robot dynamics), the EKF linearizes the prediction and observation models around the current estimate using Jacobians. Widely used in IMU integration, GPS/INS fusion, and EKF-SLAM. Limitation: the linearization can be inaccurate for highly nonlinear systems.

Unscented Kalman Filter (UKF)

Instead of linearizing, the UKF uses sigma points — a deterministic set of sample points that capture the mean and covariance of the state distribution. Each sigma point is propagated through the actual nonlinear function. More accurate than EKF for highly nonlinear systems, with similar computational cost. Introduced by Julier and Uhlmann (1997).
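The difference between sigma-point propagation and linearization shows up even for a scalar state. The sketch below uses the basic unscented transform with a single scaling parameter kappa (one common parameterization; other weightings exist) and compares it to first-order linearization for f(x) = x^2. Function names and the choice kappa = 2 are assumptions for the illustration.

```python
import math
from typing import Callable, Tuple


def unscented_transform_1d(
    mean: float, var: float, f: Callable[[float], float], kappa: float = 2.0
) -> Tuple[float, float]:
    """Propagate a scalar Gaussian through f using three sigma points."""
    n = 1  # state dimension
    spread = math.sqrt((n + kappa) * var)
    sigma_points = [mean, mean + spread, mean - spread]
    weights = [kappa / (n + kappa), 0.5 / (n + kappa), 0.5 / (n + kappa)]
    ys = [f(s) for s in sigma_points]
    y_mean = sum(w * y for w, y in zip(weights, ys))
    y_var = sum(w * (y - y_mean) ** 2 for w, y in zip(weights, ys))
    return y_mean, y_var


def linearized_transform_1d(
    mean: float, var: float, f: Callable[[float], float], df: Callable[[float], float]
) -> Tuple[float, float]:
    """EKF-style propagation: evaluate f at the mean, scale variance by the squared slope."""
    return f(mean), df(mean) ** 2 * var


square = lambda x: x * x
print("unscented :", unscented_transform_1d(1.0, 0.25, square))
print("linearized:", linearized_transform_1d(1.0, 0.25, square, lambda x: 2.0 * x))
```

For a Gaussian with mean 1 and variance 0.25, the true mean of x^2 is mean^2 + variance = 1.25. The sigma-point estimate recovers this, while linearization reports 1.0 because it ignores the curvature of f.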

Particle Filter (Sequential Monte Carlo)

Represents the probability distribution as a set of weighted samples (particles). Each particle is a hypothesis for the state. At each step: propagate particles through the motion model, weight them by the observation likelihood, and resample. Handles multi-modal distributions (multiple hypotheses) and arbitrary nonlinearities. Used in Monte Carlo Localization (MCL) for mobile robots — the standard localization algorithm in ROS (AMCL package). Computational cost scales with the number of particles.
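The propagate, weight, resample cycle described above can be sketched for a robot moving along a line. This is a minimal illustration with Gaussian motion and sensor noise and plain multinomial resampling; the noise levels, particle count, and function name are assumptions, and practical implementations such as AMCL add refinements like adaptive particle counts.

```python
import math
import random
from typing import List


def particle_filter_step(
    particles: List[float],
    control: float,
    measurement: float,
    rng: random.Random,
    motion_sigma: float = 0.1,
    sensor_sigma: float = 0.2,
) -> List[float]:
    """One cycle of a 1D particle filter: propagate, weight, resample."""
    # 1. Propagate each hypothesis through the noisy motion model
    moved = [p + control + rng.gauss(0.0, motion_sigma) for p in particles]
    # 2. Weight by the Gaussian likelihood of the position measurement
    weights = [math.exp(-0.5 * ((measurement - p) / sensor_sigma) ** 2) for p in moved]
    if sum(weights) == 0.0:
        weights = [1.0] * len(moved)  # all hypotheses implausible: fall back to uniform
    # 3. Resample in proportion to weight (multinomial resampling)
    return rng.choices(moved, weights=weights, k=len(moved))


if __name__ == "__main__":
    rng = random.Random(0)
    particles = [rng.uniform(0.0, 20.0) for _ in range(500)]  # unknown start position
    true_x = 2.0
    for _ in range(10):
        true_x += 1.0
        z = true_x + rng.gauss(0.0, 0.2)
        particles = particle_filter_step(particles, control=1.0, measurement=z, rng=rng)
    estimate = sum(particles) / len(particles)
    print(f"true position {true_x:.2f}, estimate {estimate:.2f}")
```

Because the initial particles are spread uniformly, the filter starts with no idea where the robot is and concentrates around the true position as measurements arrive, which is the global localization behavior that makes particle filters attractive.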

Common Fusion Architectures

Sensors	Application	Method
IMU + GPS	Outdoor localization	EKF or error-state KF
Camera + IMU (VIO)	Visual-inertial odometry	MSCKF, OKVIS, VINS-Mono
LiDAR + IMU	LiDAR-inertial SLAM	LIO-SAM, FAST-LIO2
Camera + LiDAR	Dense 3D perception	Projection + late fusion
Camera + LiDAR + Radar	Autonomous driving	BEVFusion, TransFusion
Wheel encoders + IMU + LiDAR	Indoor mobile robot	EKF + AMCL or Cartographer

Visual-Inertial Odometry (VIO)

Combines camera and IMU data for 6-DOF pose estimation. The IMU provides high-rate (100-400 Hz) angular velocity and acceleration measurements, which drift quickly when integrated on their own; the camera provides lower-rate (15-60 Hz) relative pose constraints that keep this drift in check. Key systems:

VINS-Mono (Qin et al., 2018) — monocular VIO with loop closure. Open-source, widely used on drones.
OKVIS (Leutenegger et al., 2015) — keyframe-based visual-inertial SLAM. From ETH Zurich.
Basalt (Usenko et al., 2020) — from TUM, uses non-linear optimization with visual and inertial factors.
Apple ARKit / Google ARCore — commercial VIO implementations on smartphones, enabling AR applications.

Point Clouds & 3D Reconstruction

A point cloud is a set of 3D points representing the surfaces in a scene. Point clouds are produced by LiDAR, depth cameras, and multi-view stereo, and are the primary data structure for 3D perception.

Point Cloud Processing

Downsampling — reduce point density for efficiency. Voxel grid filtering (average points within each voxel) is standard. Random sampling and farthest point sampling (FPS) are alternatives.
Normal estimation — compute surface normals at each point by fitting a plane to local neighbors. Essential for surface reconstruction and registration.
Registration — align two point clouds. ICP (Besl & McKay, 1992) iteratively minimizes the distance between corresponding points. Variants: point-to-plane ICP, generalized ICP (GICP), colored ICP.
Segmentation — separate ground plane (RANSAC), cluster objects (DBSCAN, Euclidean clustering), or segment semantically (PointNet++).
Surface reconstruction — create a mesh from points. Poisson surface reconstruction, ball-pivoting algorithm, or Delaunay triangulation.
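Voxel grid downsampling, the first step in the list above, fits in a few lines. The sketch below averages the points that fall into each cubic voxel; it is a plain-Python illustration with a hypothetical function name, whereas libraries such as Open3D and PCL provide optimized versions.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def voxel_downsample(points: Sequence[Point], voxel_size: float) -> List[Point]:
    """Replace all points inside each voxel by their centroid."""
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")
    buckets: Dict[Tuple[int, int, int], List[Point]] = defaultdict(list)
    for p in points:
        # Integer voxel index along each axis (floor handles negative coordinates)
        key = tuple(math.floor(c / voxel_size) for c in p)
        buckets[key].append(p)
    return [
        tuple(sum(axis) / len(pts) for axis in zip(*pts))
        for pts in buckets.values()
    ]


cloud = [(0.1, 0.1, 0.1), (0.3, 0.3, 0.3), (1.2, 0.0, 0.0), (-0.2, 0.0, 0.0)]
print(voxel_downsample(cloud, voxel_size=1.0))
```

The voxel size trades detail for speed: a larger voxel gives fewer points and faster registration or clustering, at the cost of smoothing away small structures.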

Deep Learning on Point Clouds

PointNet (Qi et al., 2017) — the breakthrough. Processes raw point clouds directly (no voxelization or projection). Uses shared MLPs and a symmetric function (max-pooling) to achieve permutation invariance. 89.2% accuracy on ModelNet40.
PointNet++ (Qi et al., 2017) — adds hierarchical feature learning with set abstraction layers. Captures local geometry at multiple scales.
Point Transformer (Zhao et al., 2021) — applies self-attention to 3D point clouds. State-of-the-art on indoor scene segmentation (S3DIS dataset).
MinkowskiNet (Choy et al., 2019) — sparse convolutions on voxelized point clouds. Efficient and powerful for 3D segmentation.
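The permutation invariance that PointNet relies on can be demonstrated without any deep learning framework. In the sketch below a fixed toy function stands in for the learned shared MLP (the weights are made up for illustration), and a channel-wise max over all points plays the role of the symmetric pooling function.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def shared_mlp(point: Point) -> List[float]:
    """Toy per-point feature extractor with made-up weights and ReLU activations."""
    x, y, z = point
    return [
        max(0.0, 0.5 * x - 0.2 * y + 0.1 * z),
        max(0.0, -0.3 * x + 0.8 * y + 0.4 * z),
        max(0.0, 0.7 * z - 0.1),
    ]


def global_feature(points: Sequence[Point]) -> List[float]:
    """Apply the same function to every point, then max-pool each channel."""
    per_point = [shared_mlp(p) for p in points]
    return [max(channel) for channel in zip(*per_point)]


cloud = [(0.0, 1.0, 2.0), (1.5, -0.5, 0.3), (-1.0, 0.2, 0.9), (0.4, 0.4, 0.4)]
shuffled = cloud[:]
random.Random(1).shuffle(shuffled)

print(global_feature(cloud))
print(global_feature(shuffled))  # identical: the max does not depend on point order
```

Because the same function is applied to every point and the maximum ignores ordering, the global feature is the same for any permutation of the input, which is what lets the network consume unordered point sets directly.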

3D Reconstruction Methods

TSDF Fusion — Truncated Signed Distance Function. Integrate multiple depth frames into a volumetric representation. KinectFusion (Newcombe et al., 2011) demonstrated real-time dense reconstruction from an RGB-D camera. Open3D implements this.
NeRF (Mildenhall et al., 2020) — Neural Radiance Fields. A neural network encodes the scene as a continuous volumetric function. Produces photorealistic novel views from a set of posed images. Follow-ups: Instant-NGP (training in seconds to minutes), Nerfstudio, and 3D Gaussian Splatting.
3D Gaussian Splatting (Kerbl et al., 2023) — represents the scene as a set of 3D Gaussians. Renders in real-time (100+ FPS). State-of-the-art for novel view synthesis and 3D reconstruction. Increasingly used in robotics for scene representation.
Multi-view stereo (MVS) — reconstruct dense 3D geometry from multiple calibrated images. COLMAP (Schonberger & Frahm, 2016) is the standard pipeline: SfM (structure from motion) for camera poses, then dense MVS for point clouds.

Tools & Libraries

OpenCV

The standard computer vision library. Image processing, feature detection, camera calibration, stereo matching, object tracking. C++ and Python bindings.

C++ / Python | 78k+ GitHub stars | Apache 2.0

Open3D

Modern 3D data processing library. Point clouds, meshes, RGBD integration, TSDF, ICP, visualization. Excellent Python API.

C++ / Python | Intel ISL | MIT License

PCL (Point Cloud Library)

Comprehensive C++ library for 3D point cloud processing. Filtering, segmentation, registration, surface reconstruction, feature extraction.

C++ | Open Perception | BSD

Ultralytics (YOLOv8/v11)

Production-ready object detection, segmentation, and pose estimation. Easy training, export to ONNX/TensorRT/CoreML for edge deployment.

Python | AGPL-3.0 | 35k+ stars

GTSAM

Factor graph optimization library for SLAM, SfM, and state estimation. Efficient incremental inference using Bayes trees.

C++ / Python | Georgia Tech | BSD

COLMAP

Structure-from-Motion and Multi-View Stereo pipeline. The standard tool for 3D reconstruction from images.

C++ | ETH Zurich | BSD
