3D Vision Basics: Point Cloud Processing, Depth Estimation, and Stereo Vision Explained

Introduction

Want self-driving cars to see pedestrians clearly, phone cameras to measure furniture accurately, and VR to reproduce real scenes at 1:1 scale? All of this "magic" rests on one technology: 3D vision. It lets machines understand the three-dimensional world much as humans do, recovering the depth, shape, and spatial position of objects from one or more flat images.


1. Basic concepts of 3D vision

1.1 What problem is 3D vision solving?

Put simply, the core task of 3D vision is to use algorithms to reconstruct a three-dimensional model of the world that downstream systems can interact with. It has to answer three core questions:

  • **Where does depth come from?** How can we infer the distance of each pixel from the camera using one or more 2D images?
  • **How are multiple views related?** How do we find which pixels in photos taken from different positions correspond to the same object? (This involves epipolar geometry, camera calibration, and so on.)
  • **How is 3D data represented?** Which data structure should store points, surfaces, volumes, and similar information so that later analysis is convenient?

Once these questions are answered, 3D vision can be applied directly in high-demand fields such as autonomous driving (lidar + camera fusion), industrial inspection (3D defect localization), and medical imaging (CT reconstruction).

1.2 Five mainstream representations of 3D data

3D data is not a neat pixel grid like an image. It comes in many forms, and each representation has its own advantages and disadvantages:

| Representation | Intuition and characteristics | Typical use cases |
| --- | --- | --- |
| Point cloud | Like scattering countless points over the surface of an object. Each point has x, y, z coordinates and may also carry a color and a normal vector. Unordered, with a variable number of points. | Lidar scans, fast 3D snapshots |
| Mesh | A 3D model built from triangular faces, like the character models in games. Rich in detail and efficient to render. | Film visual effects, CAD industrial design |
| Voxel | Cuts 3D space into small cubes, like the blocky world of "Minecraft". The structure is regular, but the amount of data explodes at high resolution. | Medical CT/MRI reconstruction, 3D voxel games |
| Depth map | A grayscale image the same size as an ordinary photo, where each pixel value is the distance to that point. Simple to produce, but carries information from a single viewpoint only. | Depth camera output, monocular/binocular depth estimation |
| Implicit field | A neural network "memorizes" the color and opacity of the whole scene, and any point in space can be queried for a result. No explicit geometry needs to be stored, but it is computationally intensive. | NeRF, high-fidelity novel view synthesis |
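
To make the relationship between these representations more concrete, here is a minimal NumPy sketch that turns a depth map into a point cloud and then into a voxel occupancy grid. It assumes a simple pinhole camera; the intrinsics `fx`, `fy`, `cx`, `cy`, the function names, and the toy depth map are all illustrative assumptions rather than values from a real sensor.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (N, 3) point cloud (pinhole model)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]      # v = row index, u = column index
    valid = depth > 0              # zero depth is treated as "no measurement"
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def points_to_voxels(points, voxel_size):
    """Return a boolean occupancy grid covering the point cloud's bounding box."""
    origin = points.min(axis=0)
    idx = np.floor((points - origin) / voxel_size).astype(int)
    grid = np.zeros(tuple(idx.max(axis=0) + 1), dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

# Toy depth map: a flat wall 2 m away, with one missing pixel
depth = np.full((4, 6), 2.0)
depth[0, 0] = 0.0

points = depth_to_points(depth, fx=100.0, fy=100.0, cx=3.0, cy=2.0)
grid = points_to_voxels(points, voxel_size=0.05)
print("points:", points.shape, "occupied voxels:", int(grid.sum()))
```

The depth map is the most compact of the three but only describes one viewpoint, the point cloud drops the pixel grid and keeps only measured points, and the voxel grid restores a regular structure at the cost of memory that grows with the cube of the resolution.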

2. Point cloud processing technology

2.1 Working with point clouds from scratch: Open3D in practice

Point clouds are the rawest output of 3D sensors such as lidar, but they often contain noise, uneven density, and outliers. Preprocessing is therefore the first step, and it largely determines how well later analysis works.

The following code uses Open3D and NumPy to build a simple point cloud processor. You can run it directly to try it out:

import numpy as np
import open3d as o3d

class PointCloudProcessor:
    """Wraps a common point cloud processing pipeline for convenient reuse"""
    def create_from_numpy(self, points: np.ndarray) -> o3d.geometry.PointCloud:
        """Build an Open3D point cloud from an (N, 3) numpy array"""
        pcd = o3d.geometry.PointCloud()
        pcd.points = o3d.utility.Vector3dVector(points)
        return pcd

    def preprocess(self, pcd: o3d.geometry.PointCloud,
                   voxel_size=0.05, nb_neighbors=20, std_ratio=2.0) -> o3d.geometry.PointCloud:
        """
        One-stop preprocessing: downsampling → normal estimation → outlier removal
        - voxel_size: edge length of the downsampling voxels; larger means fewer points
        - nb_neighbors & std_ratio: statistical filter parameters; keep normal points, drop distant noise
        """
        # Voxel downsampling: reduce density while keeping the overall geometry
        down_pcd = pcd.voxel_down_sample(voxel_size=voxel_size)
        # Estimate normals: preparation for later segmentation and reconstruction
        down_pcd.estimate_normals()
        # Statistical filtering: remove isolated outliers (scanning noise)
        filtered_pcd, _ = down_pcd.remove_statistical_outlier(
            nb_neighbors=nb_neighbors, std_ratio=std_ratio
        )
        return filtered_pcd

# Quick test
if __name__ == "__main__":
    # Generate 10000 Gaussian-distributed points to simulate noisy scan data
    noisy_points = np.random.randn(10000, 3) * 0.5
    processor = PointCloudProcessor()
    pcd_raw = processor.create_from_numpy(noisy_points)
    pcd_filtered = processor.preprocess(pcd_raw)
    # Visual comparison (uncomment when running locally)
    # o3d.visualization.draw_geometries([pcd_raw, pcd_filtered], window_name="Raw vs Filtered")
    print(f"Raw points: {len(pcd_raw.points)}, points after preprocessing: {len(pcd_filtered.points)}")

Key steps explained:

  • Voxel downsampling: divides space into small cubes and keeps a single representative point (the average of the points inside) for each occupied cube. This greatly reduces the amount of data while preserving the overall shape.
  • Normal estimation: computes the orientation of the local surface around each point, which is the basis for point cloud segmentation and registration.
  • Statistical filtering: detects isolated points that lie too far from their neighbors. These are usually caused by sensor errors.
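
To show what two of these steps compute, here is a minimal NumPy sketch of voxel downsampling and statistical outlier removal. It mirrors the idea only and is not Open3D's implementation: the function names are hypothetical, the voxel grid is anchored at the world origin for simplicity, and the neighbor search is brute force, so it only suits small demos.

```python
import numpy as np

def voxel_downsample(points, voxel_size):
    """Replace all points that fall in the same voxel with their centroid."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse, counts = np.unique(keys, axis=0, return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)
    sums = np.zeros((counts.size, 3))
    np.add.at(sums, inverse, points)           # accumulate points per voxel
    return sums / counts[:, None]

def remove_statistical_outliers(points, nb_neighbors=20, std_ratio=2.0):
    """Drop points whose mean distance to their k nearest neighbors is unusually large."""
    # Brute-force pairwise distances: O(N^2) memory, acceptable for a small demo only
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point itself (distance 0), so skip it
    knn = np.sort(dist, axis=1)[:, 1:nb_neighbors + 1]
    mean_dist = knn.mean(axis=1)
    threshold = mean_dist.mean() + std_ratio * mean_dist.std()
    return points[mean_dist <= threshold]

# Downsampling demo: the first two points share a voxel and are merged
toy = np.array([[0.01, 0.01, 0.01], [0.03, 0.03, 0.03], [0.55, 0.55, 0.55]])
down = voxel_downsample(toy, voxel_size=0.1)

# Filtering demo: a tight cluster plus three far-away "orphan" points
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 0.1, size=(500, 3))
orphans = np.array([[5.0, 5.0, 5.0], [-5.0, 5.0, -5.0], [5.0, -5.0, 5.0]])
filtered = remove_statistical_outliers(np.vstack([cluster, orphans]), nb_neighbors=10)
print(len(down), len(filtered))
```

Raising `std_ratio` makes the filter more tolerant, while lowering it removes more points, including some legitimate ones at the sparse edges of the cloud.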

2.2 Feeding point clouds directly to a neural network: a lightweight PointNet

Point clouds have two properties that give conventional convolutional networks trouble: the order of the points carries no meaning (however the points are arranged, they represent the same object), and the number of points is not fixed. PointNet addresses both with two ingenious designs:

  1. Shared MLP (multi-layer perceptron): the same network extracts features from each point independently, so a point's features do not depend on where it appears in the list.
  2. Symmetric max pooling: taking the element-wise maximum over all point features yields a fixed-dimensional global feature that is unaffected by point order or by the number of points.

Next, we use PyTorch to implement a simplified PointNet classifier (without the T-Net spatial transformation module) that directly processes input of shape (batch, num_points, 3):

import torch
import torch.nn as nn

class SimplePointNet(nn.Module):
    """Simplified PointNet classifier for understanding the core idea"""
    def __init__(self, num_classes=10):
        super().__init__()
        # Shared MLP: a 1x1 convolution applied to each point independently
        # (in effect a per-point fully connected layer)
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, kernel_size=1),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=1),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Conv1d(128, 1024, kernel_size=1),
            nn.BatchNorm1d(1024)
        )
        # Global feature aggregation: max over all points, giving (batch, 1024).
        # Adaptive pooling works for any number of points, not only 1024.
        self.global_pool = nn.AdaptiveMaxPool1d(1)
        # Classification head
        self.classifier = nn.Sequential(
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: (batch_size, num_points, 3)
        Returns:
            logits: (batch_size, num_classes)
        """
        # Reorder dimensions (B,N,3) → (B,3,N) to match Conv1d
        x = x.transpose(2, 1)
        # Per-point feature extraction
        x = self.point_mlp(x)
        # Max pooling gives the global feature (B,1024,N) → (B,1024)
        x = self.global_pool(x).squeeze(-1)
        # Classification
        return self.classifier(x)

# Quick shape test
if __name__ == "__main__":
    test_input = torch.randn(8, 1024, 3)  # batch=8, 1024 points per object
    model = SimplePointNet(num_classes=40)  # assuming the 40 classes of ModelNet40
    output = model(test_input)
    print(f"Input shape: {test_input.shape}, output shape: {output.shape}")

Core advantage: PointNet relies on a symmetric function (max pooling) to handle the unordered nature of point clouds, which makes its output invariant to permutations of the points. This idea is the cornerstone of follow-up work such as PointNet++.
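
The permutation-invariance claim can be checked numerically. The sketch below is a toy stand-in for PointNet and uses only NumPy: `W` and `b` are hypothetical random weights for a single shared layer, not trained parameters, and `global_feature` is an illustrative name.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 16))   # hypothetical shared weights, applied to every point
b = rng.normal(size=16)

def global_feature(points):
    """Shared per-point layer + ReLU, then max over the point axis."""
    per_point = np.maximum(points @ W + b, 0.0)   # (N, 16): each row depends on one point only
    return per_point.max(axis=0)                  # (16,): size does not depend on N

cloud = rng.normal(size=(100, 3))
shuffled = cloud[rng.permutation(100)]            # same points, different order

feat = global_feature(cloud)
feat_shuffled = global_feature(shuffled)
feat_small = global_feature(cloud[:37])           # a cloud with fewer points

print("order-invariant:", np.allclose(feat, feat_shuffled))
print("feature sizes:", feat.shape, feat_small.shape)
```

Replacing `max` with an order-dependent operation, such as flattening all per-point features into one long vector, would break both properties, which is why the symmetric function matters.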


3. Depth estimation and stereo vision

3.1 How to “thicken” photos: categories of depth estimation techniques

Depth estimation assigns a distance value to each pixel of an ordinary photo, producing a depth map. The mainstream methods, compared by cost and accuracy, are as follows:

| Method | Input | Principle in plain terms | Cost | Accuracy |
| --- | --- | --- | --- | --- |
| Monocular depth estimation | A single color image | A model learns scene regularities, such as near/far cues, texture gradients, and occlusion relationships, and directly "guesses" the depth | Low | Medium |
| Binocular stereo vision | Two color images | Mimics human eyes: the same object appears at slightly offset positions (disparity) in the left and right images, and a geometric formula converts that offset into depth | Low | Medium-high |
| Hardware sensor | Lidar/ToF | Physically measures the time of flight of light and directly returns an accurate distance | High | High |

Thanks to its combination of low cost and good accuracy, binocular stereo vision has become a common choice for robots and driver-assistance systems.
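
To illustrate how the left/right offset is found in the first place, here is a deliberately naive block-matching sketch in NumPy. It is far simpler than production stereo algorithms: it compares small patches with a sum of absolute differences (SAD) and picks the best horizontal shift per pixel. The function name `block_match`, the parameters `max_disp` and `half`, and the synthetic image pair are assumptions made for this example.

```python
import numpy as np

def block_match(left, right, max_disp=8, half=2):
    """Naive SAD block matching: for each left pixel, find the best horizontal shift."""
    h, w = left.shape
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(half, h - half):
        for x in range(half + max_disp, w - half):
            patch = left[y - half:y + half + 1, x - half:x + half + 1]
            costs = []
            for d in range(max_disp + 1):
                # A point at column x in the left image appears at column x - d in the right image
                cand = right[y - half:y + half + 1, x - d - half:x - d + half + 1]
                costs.append(np.abs(patch - cand).sum())
            disp[y, x] = int(np.argmin(costs))
    return disp

# Synthetic rectified pair: a random texture shifted by a known disparity
rng = np.random.default_rng(0)
left = rng.random((20, 64))
true_disp = 4
right = np.roll(left, -true_disp, axis=1)   # every feature sits 4 px further left in the right view

disp = block_match(left, right)
inner = disp[2:-2, 10:-2]                   # region where the full search window fits
print("recovered disparity values:", np.unique(inner))
```

On real images this local approach struggles with textureless regions and repeated patterns, which is the motivation for methods that add smoothness constraints across neighboring pixels.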

3.2 Hands-on implementation: OpenCV binocular stereo matching

Among traditional methods, the SGBM (Semi-Global Block Matching) algorithm strikes a good balance between quality and speed. Below we use OpenCV to compute the disparity map and depth map:

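import cv2
import numpy as np
import matplotlib.pyplot as plt

def sgbm_stereo_depth(left_path: str, right_path: str,
                       baseline=0.2, focal_length=720):
    """
    Compute the disparity map and depth map of a stereo pair with SGBM.
    The two images are assumed to be already rectified.
    baseline: camera baseline distance (unit: meters)
    focal_length: camera focal length (unit: pixels)
    """
    # 1. Read the left and right images as grayscale (less computation)
    left = cv2.imread(left_path, cv2.IMREAD_GRAYSCALE)
    right = cv2.imread(right_path, cv2.IMREAD_GRAYSCALE)
    if left is None or right is None:
        raise FileNotFoundError("Could not read the left or right image")

    # 2. Create the SGBM matcher (parameters are tunable)
    stereo = cv2.StereoSGBM_create(
        minDisparity=0,          # minimum disparity
        numDisparities=16*5,     # disparity search range (must be a multiple of 16)
        blockSize=5,             # matching window size
        P1=8*3*5**2,             # penalty for small disparity changes
        P2=32*3*5**2,            # penalty for large disparity changes (preserves edges)
        disp12MaxDiff=1,         # maximum difference allowed in the left-right consistency check
        uniquenessRatio=15,      # uniqueness threshold
        preFilterCap=63          # truncation value for prefiltering
    )

    # 3. Compute the disparity map (compute() returns values scaled by 16)
    disparity = stereo.compute(left, right).astype(np.float32) / 16.0
    # Pixels without a reliable match come back with a non-positive disparity
    valid = disparity > 0
    # Normalize for display
    disparity_normalized = (disparity - disparity.min()) / (disparity.max() - disparity.min() + 1e-5)

    # 4. Disparity to depth (Z = f * B / disparity), only where the disparity is valid;
    #    invalid pixels keep a depth of 0 instead of a negative or huge value
    depth = np.zeros_like(disparity)
    depth[valid] = (baseline * focal_length) / disparity[valid]
    depth_normalized = (depth - depth.min()) / (depth.max() - depth.min() + 1e-5)

    # 5. Visual comparison
    plt.figure(figsize=(15, 5))
    plt.subplot(131), plt.imshow(left, cmap='gray'), plt.title('Left Image')
    plt.subplot(132), plt.imshow(disparity_normalized, cmap='jet'), plt.title('Disparity Map')
    plt.subplot(133), plt.imshow(depth_normalized, cmap='jet'), plt.title('Depth Map')
    plt.tight_layout()
    plt.show()

    return disparity, depth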

An intuitive way to see the relationship between disparity and depth: the closer an object is, the larger the difference in its position between the left and right images (the disparity) and the smaller its depth; the farther away it is, the smaller the disparity and the larger the depth. The formula depth = (baseline × focal length) / disparity quantifies this inverse relationship.
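
A few worked numbers make the inverse relationship tangible. The sketch below reuses the default baseline (0.2 m) and focal length (720 px) from the function above. The helper name `disparity_to_depth` and the choice to mark invalid disparities as NaN are assumptions made for this example.

```python
import numpy as np

def disparity_to_depth(disparity_px, baseline_m=0.2, focal_px=720.0):
    """Z = f * B / d, with non-positive disparities mapped to NaN (no valid match)."""
    d = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(d.shape, np.nan)
    valid = d > 0
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth

disparities = [72.0, 36.0, 8.0, 0.0]
depths = disparity_to_depth(disparities)
for d, z in zip(disparities, depths):
    print(f"disparity {d:5.1f} px -> depth {z:.2f} m")
```

With these values, disparities of 72, 36, and 8 pixels correspond to depths of 2 m, 4 m, and 18 m. The same relationship also explains why stereo accuracy degrades with distance: a one-pixel matching error at 72 px changes the depth by about 3 cm, while the same error at 8 px changes it by about 2.6 m.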


4. Quick Start with Neural Radiance Fields (NeRF)

4.1 Use neural networks to “remember” the entire 3D scene

NeRF (Neural Radiance Fields) was the breakout technique of 2020 and changed how scene reconstruction is done. Traditional 3D representations must store large numbers of points, faces, or voxels, whereas NeRF encodes the entire scene implicitly in a single multi-layer perceptron (MLP).

The working principle can be understood as follows:

  • You give the network a point in space (x, y, z) and a viewing direction, and it tells you how transparent or solid that point is (the volume density σ) and what color it appears from that direction (r, g, b).
  • Then you sample many points along each camera ray and use volume rendering to accumulate their colors and densities into the image seen from that viewpoint.
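
The accumulation step along a single ray can be sketched in a few lines of NumPy. In a real NeRF the densities and colors come from the trained network; here they are hand-written toy values, and the function name `render_ray` and the uniform sample spacing are assumptions made for illustration.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """
    Composite samples along one ray.
    sigmas: (S,) volume densities, colors: (S, 3) RGB, deltas: (S,) spacing between samples.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)          # opacity of each ray segment
    # Transmittance: fraction of light that survives all earlier segments
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas                         # contribution of each sample
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights

# Toy ray: empty space, then a dense red surface, then a green one hidden behind it
sigmas = np.array([0.0, 0.0, 50.0, 50.0])
colors = np.array([[0, 0, 1], [0, 0, 1], [1, 0, 0], [0, 1, 0]], dtype=float)
deltas = np.full(4, 0.5)

rgb, weights = render_ray(sigmas, colors, deltas)
print("pixel color:", np.round(rgb, 3))
print("sample weights:", np.round(weights, 3))
```

The empty samples contribute nothing, the red surface absorbs almost all of the light, and the green sample behind it is effectively invisible, so the rendered pixel comes out red. Because every operation here is differentiable, the rendered color can be compared with the real photo and the error propagated back into the network.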

Because the network is a continuous function, we can query it at arbitrarily fine positions and render high-resolution images from any new viewpoint. This has great potential in VR property tours, film visual effects, digital twins, and other fields.

📌 NeRF produces impressive results, but training takes a long time and rendering is slow. Researchers are actively developing accelerated variants (such as Instant-NGP).


3D vision is an advanced branch of computer vision, so it pays to build a solid foundation first: master **image processing, CNNs, and PyTorch**, then start with two quick hands-on tasks, **Open3D point cloud preprocessing** and **OpenCV binocular stereo matching**, and gradually move on to cutting-edge models such as PointNet and NeRF.

Summary

Looking back over this article, we can see the three major technical pillars of 3D vision:

  1. Point cloud processing: classic Open3D preprocessing plus deep learning models such as PointNet, which solve classification and segmentation problems on unordered data.
  2. Depth estimation: monocular CNNs or binocular SGBM recover the "thickness" of a scene from images.
  3. High-fidelity reconstruction: neural rendering techniques such as NeRF use implicit fields to achieve striking novel view synthesis.

The future trend is multi-sensor fusion (lidar + camera) and real-time, lightweight reconstruction, which are also the most pressing needs in fields such as autonomous driving, robotics, and the metaverse. Why not run the code now and take your first step into 3D vision!