3D Vision Basics: Point Cloud Processing, Depth Estimation, and Stereo Vision Explained
Introduction
Want self-driving cars to see pedestrians clearly, phone cameras to measure furniture accurately, or VR to reproduce real scenes at 1:1 scale? All of this "magic" rests on one technology: 3D vision. It lets machines understand the three-dimensional world much as humans do, by recovering the depth, shape, and spatial position of objects from one or more flat photos.
1. Basic concepts of 3D vision
1.1 What problem is 3D vision solving?
Simply put, the core task of 3D vision is to use algorithms to reconstruct a three-dimensional model of the world that can be interacted with. It has to answer three core questions:
- **Where does depth come from?** How can we infer each pixel's distance from the camera from one or more 2D images?
- **How are multiple views related?** How do we find the same object in photos taken from different positions? (This involves epipolar geometry, camera calibration, and so on.)
- **How is 3D data represented?** Which data structure should store points, surfaces, and volumes so that later analysis is easy?
Once these questions are answered, 3D vision applies directly to high-demand fields such as autonomous driving (lidar + camera fusion), industrial inspection (3D defect localization), and medical imaging (CT reconstruction).
1.2 Five mainstream representations of 3D data
Unlike an image, 3D data is not a neat pixel grid. It comes in many forms, and each representation has its own advantages and disadvantages:
2. Point cloud processing technology
2.1 Playing with point cloud from scratch: Open3D practice
A point cloud is the rawest output of 3D sensors such as lidar, but it often contains noise, uneven density, and outliers. Preprocessing is therefore the first step, and it largely determines the quality of all subsequent analysis.
The following code uses Open3D and NumPy to build a simple point cloud processor that you can run directly:
Key steps explained:
- Voxel downsampling: Divide space into small cubes (voxels) and keep a single representative point per cube. This reduces the amount of data substantially while preserving the overall shape.
- Normal estimation: Compute the orientation of the local surface around each point, which is the basis for point cloud segmentation and registration.
- Statistical filtering: Detect and remove isolated points that lie too far from their neighbors; they are usually caused by sensor errors.
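The key steps above can be sketched without any 3D library. What follows is a minimal NumPy-only sketch of voxel downsampling and statistical filtering, not the Open3D processor this section refers to. The function names `voxel_downsample` and `remove_statistical_outliers`, the parameter defaults, and the synthetic test cloud are all assumptions made for illustration. Open3D exposes comparable operations as `voxel_down_sample` and `remove_statistical_outlier` on its `PointCloud` class. Normal estimation is left out to keep the sketch short.

```python
import numpy as np


def voxel_downsample(points: np.ndarray, voxel_size: float) -> np.ndarray:
    """Return one point per occupied voxel: the centroid of the points inside it."""
    # Integer voxel coordinates of every point.
    voxel_idx = np.floor(points / voxel_size).astype(np.int64)
    # Group points that fall into the same voxel.
    _, inverse, counts = np.unique(
        voxel_idx, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)  # guard against NumPy versions returning a 2D inverse
    sums = np.zeros((counts.shape[0], 3))
    np.add.at(sums, inverse, points)  # accumulate point coordinates per voxel
    return sums / counts[:, None]


def remove_statistical_outliers(points, nb_neighbors=8, std_ratio=2.0):
    """Drop points whose mean distance to their nearest neighbors is unusually large."""
    # Brute-force pairwise distances: O(N^2) memory, acceptable only for small demos.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    # After sorting each row, column 0 is the point itself (distance 0), so skip it.
    knn = np.sort(dist, axis=1)[:, 1:nb_neighbors + 1]
    mean_knn = knn.mean(axis=1)
    threshold = mean_knn.mean() + std_ratio * mean_knn.std()
    keep = mean_knn <= threshold
    return points[keep], keep


# Synthetic cloud (assumption): a flat patch on z = 0 plus two far-away stray points.
rng = np.random.default_rng(0)
surface = rng.uniform(0.0, 1.0, size=(500, 3))
surface[:, 2] = 0.0
strays = np.array([[0.5, 0.5, 5.0], [0.2, 0.8, -4.0]])
cloud = np.vstack([surface, strays])

filtered, keep_mask = remove_statistical_outliers(cloud)
downsampled = voxel_downsample(filtered, voxel_size=0.1)
print(cloud.shape, filtered.shape, downsampled.shape)
```

Filtering before downsampling keeps stray points from surviving as the sole occupant of an otherwise empty voxel. For real scans, a KD-tree neighbor search would replace the brute-force distance matrix.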
2.2 Let the neural network "eat" point clouds directly: lightweight PointNet
Point clouds have two properties that trouble traditional convolutional networks: the points are unordered (any ordering of the points represents the same object), and the number of points is not fixed. PointNet addresses both with two ingenious designs:
- Shared MLP (multi-layer perceptron): The same MLP extracts features from each point independently, so reordering the points only reorders the per-point features.
- Symmetric max pooling: Take the element-wise maximum over all point features to obtain a fixed-dimensional global feature that does not depend on the order or number of points.
Next, we use PyTorch to implement a simplified PointNet classifier (without the T-Net spatial transform module) that directly processes input of shape (batch, num_points, 3):
Core advantage: PointNet relies on a symmetric function (max pooling) to be naturally invariant to the ordering of points, which is the cornerstone of follow-up work such as PointNet++.
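The two design ideas, a shared MLP followed by max pooling, can be checked numerically. The sketch below uses NumPy rather than PyTorch and random, untrained weights; it is not the classifier described in this section, only an illustration of why the global feature ignores point order and point count. The layer sizes (3 → 16 → 32) and the name `global_feature` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical layer sizes for illustration: 3 -> 16 -> 32, random untrained weights.
W1, b1 = rng.normal(size=(3, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 32)), np.zeros(32)


def global_feature(points: np.ndarray) -> np.ndarray:
    """Map an (N, 3) point set to a fixed 32-dimensional vector."""
    # Shared MLP: the same weights are applied to every point (row) independently.
    h = np.maximum(points @ W1 + b1, 0.0)  # (N, 16), ReLU
    h = np.maximum(h @ W2 + b2, 0.0)       # (N, 32), ReLU
    # Symmetric function: the max over the point axis ignores order and count.
    return h.max(axis=0)                   # (32,)


cloud = rng.normal(size=(128, 3))
shuffled = cloud[rng.permutation(len(cloud))]  # same points, different order
smaller = rng.normal(size=(50, 3))             # a different number of points

feat_a = global_feature(cloud)
feat_b = global_feature(shuffled)
feat_c = global_feature(smaller)
print(np.allclose(feat_a, feat_b), feat_a.shape, feat_c.shape)
```

Shuffling the rows leaves the feature unchanged, and a cloud with 50 points still yields a 32-dimensional vector, so a fixed-size classification head can follow.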
3. Depth estimation and stereo vision
3.1 How to “thicken” photos: Classification of depth estimation technology
Depth estimation assigns a distance value to each pixel of an ordinary photo, producing a depth map. Ranked by cost and accuracy, the mainstream methods are as follows:
Binocular stereo vision has become a common choice for robots and driver assistance because it combines low cost with good accuracy.
3.2 Hands-on implementation: OpenCV binocular stereo matching
Among traditional methods, the SGBM (Semi-Global Block Matching) algorithm strikes a good balance between quality and speed. Below we use OpenCV to quickly compute the disparity map and depth map:
An intuitive view of the relationship between disparity and depth: the closer an object is, the larger the position difference (disparity) between the left and right images, and the smaller its depth; the farther away it is, the smaller the disparity and the larger the depth. The formula depth = (baseline × focal length) / disparity quantifies this inverse relationship.
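The inverse relationship is easy to see with numbers. The following is a small NumPy sketch of the conversion step only, assuming a disparity map is already available. The function name `disparity_to_depth`, the 12 cm baseline, and the 700 px focal length are hypothetical values chosen for illustration.

```python
import numpy as np


def disparity_to_depth(disparity, baseline_m, focal_px, min_disparity=1e-6):
    """Convert a disparity map (pixels) to depth (meters): depth = B * f / d."""
    disparity = np.asarray(disparity, dtype=np.float64)
    # Pixels with zero or invalid disparity have no finite depth estimate.
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > min_disparity
    depth[valid] = baseline_m * focal_px / disparity[valid]
    return depth


# Hypothetical rig: 0.12 m baseline, 700 px focal length.
disparity_px = np.array([[84.0, 42.0],
                         [21.0, 0.0]])
depth_m = disparity_to_depth(disparity_px, baseline_m=0.12, focal_px=700.0)
print(depth_m)
```

Each halving of the disparity doubles the depth, and a disparity of zero maps to infinity. When the disparity comes from OpenCV's `StereoSGBM`, note that `compute` returns fixed-point values scaled by 16, so they need to be divided by 16.0 before applying the formula.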
4. Quick Start with Neural Radiance Fields (NeRF)
4.1 Use neural networks to “remember” the entire 3D scene
NeRF (Neural Radiance Fields), introduced in 2020, changed how scene reconstruction is done. Traditional 3D representations store large numbers of points, surfaces, or voxels, whereas NeRF implicitly encodes the entire scene in a single multi-layer perceptron (MLP).
The working principle can be understood as follows:
- Give the network a point in space (x, y, z) and a viewing direction, and it returns whether that point is transparent or solid (the volume density σ) and what color it appears to be (r, g, b).
- Then sample many points along each camera ray and use volume rendering to accumulate their colors and densities into the image seen from that viewpoint.
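The accumulation step in volume rendering can be sketched for a single ray. In the NumPy example below, hand-picked densities and colors stand in for the outputs of a trained network, and the function name `composite_ray` is an assumption. It uses the standard discrete compositing rule: each sample's opacity is 1 − exp(−σ·δ), weighted by how much light reaches it through the samples in front.

```python
import numpy as np


def composite_ray(sigmas, colors, deltas):
    """Blend the samples along one ray into a single RGB color.

    sigmas: (S,)   volume density at each sample
    colors: (S, 3) RGB color at each sample
    deltas: (S,)   length of the ray segment each sample covers
    """
    # Opacity per segment: denser or longer segments block more light.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: fraction of light reaching sample i past all earlier samples.
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = transmittance * alphas
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights


# Hand-picked stand-ins for network outputs (assumption): two empty samples,
# then a dense red surface, then a dense blue surface hidden behind it.
sigmas = np.array([0.0, 0.0, 50.0, 50.0])
colors = np.array([[0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
deltas = np.full(4, 0.5)

rgb, weights = composite_ray(sigmas, colors, deltas)
print(np.round(rgb, 4), np.round(weights, 4))
```

The ray comes out almost purely red: empty samples get zero weight, and the blue surface contributes next to nothing because the red one in front has already absorbed nearly all of the light.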
Because the network is a continuous function, we can query it at arbitrarily fine locations and render high-resolution images from any new viewpoint. This has great potential in VR property tours, film visual effects, digital twins, and other fields.
📌 NeRF produces impressive results, but training takes a long time and rendering is slow. Researchers are actively working on accelerated variants (such as Instant NGP).
Summary
Looking back over this article, 3D vision rests on three technical pillars:
- Point cloud processing: Classic preprocessing with Open3D plus deep learning models such as PointNet, which handle classification and segmentation of unordered data.
- Depth estimation: Monocular CNNs or binocular SGBM recover the "thickness" of a scene from images.
- High-fidelity reconstruction: Neural rendering techniques such as NeRF use implicit fields to achieve striking novel view synthesis.
The trend is toward multi-sensor fusion (lidar + camera) and real-time, lightweight reconstruction, which are also the most pressing needs in autonomous driving, robotics, and the metaverse. Try running the code now and take your first step into 3D vision!

