Key point detection: Detailed explanation of 68-point facial positioning and human posture estimation
Introduction
Key point detection is a core localization task in computer vision: it outputs the precise coordinates of specific key locations (such as eye corners, mouth corners, shoulders, or fingertips) in images or video. It underpins many popular applications, including face alignment, virtual makeup try-on, skeletal animation, fitness posture analysis, and gesture control.
After reading this article, you will:
- Understand the task definition and mainstream application scenarios of key point detection;
- Run three common practical examples yourself: 68-point face landmarks, human pose estimation, and hand tracking;
- Understand the algorithm ideas and loss function selection of deep learning key point detection;
- Get practical advice from getting started to deployment.
1. Quick Start: Task Definition and Core Types
1.1 What exactly does the task do?
To put it simply, given a picture (or video frame), let the algorithm tell you where the key positions of interest are. These key positions can be the corners of the eyes, the tip of the nose, and the corners of the mouth on a person's face, or they can be the shoulders, elbows, wrists and other joints of the human body.
A standard keypoint detection output usually contains:
- Coordinates of N key points, for example (x1, y1), (x2, y2), ..., where each coordinate corresponds to a specific body part.
- An optional confidence score or visibility flag, used to indicate whether the point is occluded and how reliable it is.
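The output format described above can be sketched as a small data structure. This is only an illustration: the class name Keypoint, the helper reliable_points, the point names, and the 0.5 threshold are assumptions made for the example, not a format defined by any particular library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Keypoint:
    """One detected key point in pixel coordinates."""
    name: str                           # illustrative label, e.g. "nose_tip"
    x: float
    y: float
    confidence: Optional[float] = None  # None if the model reports no score


def reliable_points(points: List[Keypoint], threshold: float = 0.5) -> List[Keypoint]:
    """Keep points whose confidence is missing or at least `threshold`."""
    return [p for p in points if p.confidence is None or p.confidence >= threshold]


# Hypothetical detections for one person in one frame.
detections = [
    Keypoint("nose_tip", 212.0, 148.5, 0.98),
    Keypoint("left_shoulder", 150.0, 300.0, 0.91),
    Keypoint("left_wrist", 95.0, 410.0, 0.22),  # low score: probably occluded
]
print([p.name for p in reliable_points(detections)])  # ['nose_tip', 'left_shoulder']
```

Filtering on the confidence value before drawing or measuring is a common way to avoid acting on occluded points.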
1.2 What scenarios are often encountered in daily projects?
According to the application field, key point detection can be divided into the following four categories, each of which has its own commonly used pre-training models and tools.
Let’s start with the three most common scenarios and run each of them in code.
2. Get started in practice: Code implementation of three major high-frequency scenarios
2.1 68-point positioning of human face: dlib + OpenCV
The 68-point facial landmark model provided by dlib is the first solution many people try. It marks the eyebrows, eyes, nose, mouth, and facial contour, which makes it well suited to tasks such as face alignment, blink detection, and expression analysis.
Preparation
First install the required libraries and download the pre-trained model files.
Complete code: detect and draw 68 points
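A minimal sketch of this step, assuming dlib and opencv-python are installed and the pre-trained file shape_predictor_68_face_landmarks.dat has been unpacked next to the script. The names annotate_faces and shape_to_points, the LANDMARK_GROUPS table, and the output file name are illustrative choices, not part of either library.

```python
# Setup (assumed): pip install dlib opencv-python
# Model (assumed location): shape_predictor_68_face_landmarks.dat in the working directory.

# Index ranges of the 68-point layout (left/right are from the subject's point of view).
LANDMARK_GROUPS = {
    "jaw": range(0, 17),
    "right_eyebrow": range(17, 22),
    "left_eyebrow": range(22, 27),
    "nose": range(27, 36),
    "right_eye": range(36, 42),
    "left_eye": range(42, 48),
    "mouth": range(48, 68),
}


def shape_to_points(shape):
    """Convert a dlib full_object_detection into a list of (x, y) tuples."""
    return [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]


def annotate_faces(image_path,
                   model_path="shape_predictor_68_face_landmarks.dat",
                   output_path="landmarks.jpg"):
    """Detect faces, draw their 68 landmarks, save the image, return the points."""
    import cv2   # imported lazily so the helpers above work without OpenCV/dlib
    import dlib

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor(model_path)

    all_points = []
    faces = detector(gray, 1)  # second argument: number of upsampling passes
    for rect in faces:
        cv2.rectangle(image, (rect.left(), rect.top()),
                      (rect.right(), rect.bottom()), (255, 0, 0), 2)
        points = shape_to_points(predictor(gray, rect))
        for x, y in points:
            cv2.circle(image, (x, y), 2, (0, 255, 0), -1)
        all_points.append(points)

    cv2.imwrite(output_path, image)
    return all_points
```

Calling annotate_faces("face.jpg"), where face.jpg is a placeholder for your own image, writes the annotated picture to landmarks.jpg and returns one list of 68 points per detected face. The LANDMARK_GROUPS table can then be used to slice out a region, for example the eye points needed for blink detection.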
Running tips: If no face is detected, try changing the second argument of detector(gray, 1) from 1 to 2 to increase the number of upsampling passes (at the cost of speed). If there are many false detections, add filter conditions afterwards, such as a minimum face size or an expected position.
2.2 Human pose estimation: MediaPipe Pose
MediaPipe is a lightweight framework open-sourced by Google. Its pose module is based on the BlazePose model and outputs 33 body landmarks (shoulders, elbows, wrists, hips, knees, ankles, and so on), a superset of the 17 key points defined by the COCO data set. It makes real-time whole-body skeleton tracking easy to set up.
Preparation
Complete code: Camera real-time attitude detection
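A minimal sketch of this step, assuming mediapipe and opencv-python are installed and that the installed MediaPipe version still exposes the legacy mp.solutions API. The function names run_pose_camera and to_pixels are illustrative, and the camera index 0 is an assumption about your machine.

```python
# Setup (assumed): pip install mediapipe opencv-python


def to_pixels(x_norm, y_norm, width, height):
    """Map normalized landmark coordinates to pixel coordinates, clamped to the frame."""
    x = min(max(int(x_norm * width), 0), width - 1)
    y = min(max(int(y_norm * height), 0), height - 1)
    return x, y


def run_pose_camera(camera_index=0, model_complexity=1, min_detection_confidence=0.5):
    """Show the webcam feed with the detected skeleton; press q to quit."""
    import cv2              # imported lazily so to_pixels works without these packages
    import mediapipe as mp

    mp_pose = mp.solutions.pose
    mp_drawing = mp.solutions.drawing_utils

    cap = cv2.VideoCapture(camera_index)
    with mp_pose.Pose(model_complexity=model_complexity,
                      min_detection_confidence=min_detection_confidence,
                      min_tracking_confidence=0.5) as pose:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV delivers BGR frames.
            results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.pose_landmarks:
                mp_drawing.draw_landmarks(frame, results.pose_landmarks,
                                          mp_pose.POSE_CONNECTIONS)
                # Landmarks are normalized to the frame size; convert one to pixels.
                nose = results.pose_landmarks.landmark[mp_pose.PoseLandmark.NOSE]
                height, width = frame.shape[:2]
                cv2.circle(frame, to_pixels(nose.x, nose.y, width, height),
                           6, (0, 0, 255), -1)
            cv2.imshow("MediaPipe Pose", frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
    cap.release()
    cv2.destroyAllWindows()
```

Calling run_pose_camera() opens the default camera. The conversion helper matters because MediaPipe reports x and y as fractions of the frame width and height, and points slightly outside the frame can fall outside the 0 to 1 range.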
When you run this code, you can see your body's skeleton drawn on the screen, with smooth movement and low latency. For higher accuracy, set model_complexity=2 (at the cost of speed); to suppress unreliable detections, raise min_detection_confidence.
2.3 21-point hand tracking and simple gesture recognition
Hand key point detection is the cornerstone of gesture interaction. MediaPipe Hands detects 21 key points per hand, covering the wrist and the finger joints, and can track one or both hands at the same time.
Complete code: real-time gesture judgment (rock-paper-scissors style)
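A minimal sketch of this step, assuming mediapipe and opencv-python are installed and the legacy mp.solutions API is available. The gesture rule is a deliberately simple assumption: a finger counts as extended when its tip lies above its PIP joint in the image, which only holds for a roughly upright hand, and the thumb is ignored. The names FINGERS, extended_fingers, classify_gesture, and run_hand_camera are illustrative.

```python
# MediaPipe Hands landmark indices: (fingertip, PIP joint) for the four long fingers.
FINGERS = {"index": (8, 6), "middle": (12, 10), "ring": (16, 14), "pinky": (20, 18)}


def extended_fingers(points):
    """Return the fingers whose tip is above its PIP joint.

    `points` is a list of 21 (x, y) pairs in image coordinates (y grows downward).
    """
    return [name for name, (tip, pip) in FINGERS.items()
            if points[tip][1] < points[pip][1]]


def classify_gesture(points):
    """Map the set of extended fingers to rock, scissors, paper, or unknown."""
    up = set(extended_fingers(points))
    if not up:
        return "rock"
    if up == {"index", "middle"}:
        return "scissors"
    if len(up) == 4:
        return "paper"
    return "unknown"


def run_hand_camera(camera_index=0):
    """Show the webcam feed with hand skeletons and gesture labels; press q to quit."""
    import cv2              # imported lazily so the helpers above work without them
    import mediapipe as mp

    mp_hands = mp.solutions.hands
    mp_drawing = mp.solutions.drawing_utils

    cap = cv2.VideoCapture(camera_index)
    with mp_hands.Hands(max_num_hands=2, min_detection_confidence=0.5,
                        min_tracking_confidence=0.5) as hands:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            for hand in results.multi_hand_landmarks or []:
                mp_drawing.draw_landmarks(frame, hand, mp_hands.HAND_CONNECTIONS)
                points = [(lm.x, lm.y) for lm in hand.landmark]
                # Put the label at the wrist (landmark 0), converted to pixels.
                wrist = (int(points[0][0] * frame.shape[1]),
                         int(points[0][1] * frame.shape[0]))
                cv2.putText(frame, classify_gesture(points), wrist,
                            cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
            cv2.imshow("MediaPipe Hands", frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
    cap.release()
    cv2.destroyAllWindows()
```

Keeping the classification in a pure function of the 21 points makes the rule easy to test without a camera and easy to replace with a trained classifier later.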
As this shows, simple geometric rules on the joint coordinates are enough for basic gesture recognition; recognizing a larger set of gestures calls for a more capable classifier trained on the key points.
3. Technology overview: core ideas of deep learning methods
If you are an algorithm engineer or want to deeply understand the principles behind key point detection, here is a clear technical map for you.
3.1 Two mainstream algorithm frameworks
3.2 Commonly used loss functions
For heatmap-based models, the loss is designed so that the predicted heatmap has its highest peak at the true key point position; coordinate-regression models are supervised on the coordinates directly.
- Mean Squared Error (MSE) / Mean Absolute Error (MAE): The most naive regression loss, directly supervising heatmaps or coordinate values.
- Wing Loss: Designed for landmark regression. It is logarithmic for small errors and linear for large ones, so small and medium localization errors receive more weight than under MSE. It suits face and hand tasks that require pixel-level alignment.
- Smooth L1 Loss: Quadratic near zero (like MSE) and linear for large errors (like MAE), which makes training more stable; it is often used for coordinate regression.
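The losses listed above, together with the heatmap target they supervise, can be sketched in plain Python. This is an illustration rather than any framework's implementation: real training code would use a tensor library, and the defaults chosen here (beta=1.0 for Smooth L1, w=10 and eps=2 for Wing Loss, sigma=1.5 for the Gaussian) are assumed example values.

```python
import math


def mse(pred, target):
    """Mean squared error over two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mae(pred, target):
    """Mean absolute error over two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def smooth_l1(pred, target, beta=1.0):
    """Quadratic for |error| < beta, linear beyond it."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def wing(pred, target, w=10.0, eps=2.0):
    """Logarithmic for |error| < w, linear beyond it; c joins the two pieces."""
    c = w - w * math.log(1.0 + w / eps)
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += w * math.log(1.0 + d / eps) if d < w else d - c
    return total / len(pred)


def gaussian_heatmap(width, height, cx, cy, sigma=1.5):
    """Target heatmap (list of rows) with a Gaussian peak at (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)] for y in range(height)]


def heatmap_argmax(heatmap):
    """Decode a heatmap to the (x, y) position of its highest value."""
    _, x, y = max((v, x, y) for y, row in enumerate(heatmap) for x, v in enumerate(row))
    return x, y


def flatten(heatmap):
    """Flatten a 2-D heatmap into a single list of values."""
    return [v for row in heatmap for v in row]


target = gaussian_heatmap(8, 8, cx=5, cy=2)     # ground truth at (5, 2)
predicted = gaussian_heatmap(8, 8, cx=4, cy=2)  # prediction off by one pixel
print(heatmap_argmax(target))                   # (5, 2)
print(mse(flatten(predicted), flatten(target))) # heatmap supervision with MSE
print(wing([4.0, 2.0], [5.0, 2.0]), smooth_l1([4.0, 2.0], [5.0, 2.0]))
```

The first two prints show heatmap supervision and decoding; the last line applies Wing Loss and Smooth L1 to the same one-pixel error expressed as coordinates, which is the coordinate-regression setting.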
In practice, frameworks such as MediaPipe and OpenPose ship pre-trained models, so the loss has already been chosen and tuned for you; you only need to pick the model variant that fits the accuracy and speed requirements of your task.
4. Learning and deployment suggestions

