Key point detection: Detailed explanation of 68-point facial positioning and human posture estimation

Introduction

Key point detection is a core localization task in computer vision: it directly outputs the precise coordinates of specific key locations (such as eye corners, mouth corners, shoulders, and fingertips) in images or videos. It is the basic capability behind many popular applications, including face alignment, virtual makeup try-on, skeletal animation, fitness posture analysis, and gesture control.

After reading this article, you will:

  • Understand the task definition and mainstream application scenarios of key point detection;
  • Run three common practical cases yourself: 68-point face landmarks, human pose estimation, and hand tracking;
  • Understand the algorithm ideas and loss function selection of deep learning key point detection;
  • Get practical advice from getting started to deployment.

1. Quick Start: Task Definition and Core Types

1.1 What exactly does the task do?

To put it simply, given a picture (or video frame), let the algorithm tell you where the key positions of interest are. These key positions can be the corners of the eyes, the tip of the nose, and the corners of the mouth on a person's face, or they can be the shoulders, elbows, wrists and other joints of the human body.

A standard keypoint detection output usually contains:

  • The coordinates of N key points, for example (x1, y1), (x2, y2), ..., where each coordinate corresponds to a specific part of the face or body.
  • An optional confidence score or visibility flag, indicating whether the point is occluded and how reliable it is.
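The output described above can be held in a very small data structure. The sketch below is only an illustration: the `Keypoint` class, the `reliable` helper, the part names, and the 0.5 threshold are all hypothetical and do not come from any particular library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Keypoint:
    name: str            # hypothetical label for the body/face part
    x: float             # pixel column
    y: float             # pixel row
    score: float = 1.0   # confidence / visibility in [0, 1]


def reliable(keypoints: List[Keypoint], min_score: float = 0.5) -> List[Keypoint]:
    """Keep only the points whose confidence reaches the threshold."""
    return [kp for kp in keypoints if kp.score >= min_score]


face = [
    Keypoint("left_eye_corner", 120.0, 85.5, 0.98),
    Keypoint("nose_tip", 150.2, 110.0, 0.95),
    Keypoint("right_mouth_corner", 170.4, 140.1, 0.31),  # likely occluded
]
print([kp.name for kp in reliable(face)])
```

Downstream logic (drawing, angle computation, gesture rules) can then work on the filtered list and ignore points the model was unsure about.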

1.2 What scenarios are often encountered in daily projects?

By application field, key point detection can be divided into the following four categories, each with its own commonly used pre-trained models and tools.

| Scenario | Common landmark sets / model examples | Recommended frameworks / tools |
| --- | --- | --- |
| Face alignment / expression analysis | dlib 68 points, MediaPipe 468 points | dlib, MediaPipe Face Mesh |
| Human pose / action analysis | COCO 17 points, MPII 16 points, MediaPipe Pose 33 points | MediaPipe Pose, OpenPose |
| Gesture interaction | MediaPipe 21 points (per hand) | MediaPipe Hands |
| Object / foot key points | Custom landmark sets, dedicated foot models | Custom CNN / Transformer |

Let’s start with the three most classic high-frequency scenarios and use code to directly run through the effects.


2. Get started in practice: Code implementation of three major high-frequency scenarios

2.1 68-point positioning of human face: dlib + OpenCV

The 68-point facial landmark model shipped with dlib is where many people start. It marks key positions on the eyebrows, eyes, nose, mouth, and facial contour, and is well suited to tasks such as face alignment, blink detection, and expression analysis.
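For the blink detection use case mentioned above, a widely used heuristic is the eye aspect ratio (EAR): the eye's vertical opening divided by its horizontal width. The sketch below assumes the landmarks are already available as a list of 68 `(x, y)` tuples in dlib's ordering, where indices 36 to 41 and 42 to 47 outline the two eyes. The function name and the sample coordinates are illustrative, and the threshold that separates "open" from "closed" must be tuned on your own data.

```python
import math

EYE_A = list(range(36, 42))  # first eye: corner, two upper-lid points, corner, two lower-lid points
EYE_B = list(range(42, 48))  # second eye, same ordering


def eye_aspect_ratio(landmarks, eye_indices):
    """EAR = (|p2-p6| + |p3-p5|) / (2 * |p1-p4|) for the six eye points."""
    p = [landmarks[i] for i in eye_indices]
    vertical = math.dist(p[1], p[5]) + math.dist(p[2], p[4])
    horizontal = math.dist(p[0], p[3])
    return vertical / (2.0 * horizontal)


# Synthetic landmarks: only the first eye is filled in for the demo.
open_face = [(0.0, 0.0)] * 68
for idx, pt in zip(EYE_A, [(0, 0), (1, -1), (2, -1), (3, 0), (2, 1), (1, 1)]):
    open_face[idx] = pt

closed_face = [(0.0, 0.0)] * 68
for idx, pt in zip(EYE_A, [(0, 0), (1, -0.1), (2, -0.1), (3, 0), (2, 0.1), (1, 0.1)]):
    closed_face[idx] = pt

ear_open = eye_aspect_ratio(open_face, EYE_A)
ear_closed = eye_aspect_ratio(closed_face, EYE_A)
print(round(ear_open, 3), round(ear_closed, 3))
```

With real dlib output, the input list can be built as `[(shape.part(i).x, shape.part(i).y) for i in range(68)]`. A blink then shows up as the EAR dropping sharply for a few consecutive frames.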

Preparation

First install the required libraries and download the pre-trained model files.

pip install opencv-python dlib
# Download http://dlib.net/files/shape_predictor_68_face_landmarks.dat.bz2
# and extract shape_predictor_68_face_landmarks.dat into the current directory

Complete code: detect and draw 68 points

import cv2
import dlib

# 1. Initialize dlib's face detector and landmark predictor
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def draw_68_landmarks(img_path, save_path=None):
    # 2. Read the image and convert it to grayscale (speeds up detection)
    img = cv2.imread(img_path)
    if img is None:
        raise FileNotFoundError(f"Cannot read image: {img_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # 3. Detect all faces (the second argument, 1, upsamples the image once,
    #    which helps with small faces)
    faces = detector(gray, 1)
    for face in faces:
        # Draw the face bounding box
        x1, y1, x2, y2 = face.left(), face.top(), face.right(), face.bottom()
        cv2.rectangle(img, (x1, y1), (x2, y2), (255, 0, 0), 2)

        # Predict and draw the 68 landmarks
        shape = predictor(gray, face)
        for i in range(68):
            x, y = shape.part(i).x, shape.part(i).y
            cv2.circle(img, (x, y), 2, (0, 255, 0), -1)

    # 4. Display the result, then save it if a path was given
    cv2.imshow("dlib 68 Landmarks", img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
    if save_path:
        cv2.imwrite(save_path, img)

# Test on a single image
draw_68_landmarks("test_face.jpg", "result_face.jpg")

Running tips: If no face is detected, try changing the parameter 1 in detector(gray, 1) to 2 to increase the number of upsampling passes (at the cost of speed). If there are many false detections, add filter conditions such as size or position afterwards.
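The size and position filtering suggested in the tip can be sketched as follows. To keep the example runnable without dlib, it works on plain `(left, top, right, bottom)` tuples. The function name `filter_faces` and the thresholds `min_side` and `max_aspect` are hypothetical choices, not dlib parameters.

```python
def filter_faces(boxes, img_w, img_h, min_side=40, max_aspect=1.5):
    """Drop boxes that are too small, too elongated, or outside the image."""
    kept = []
    for left, top, right, bottom in boxes:
        w, h = right - left, bottom - top
        if w < min_side or h < min_side:
            continue  # too small to be a usable face
        if max(w, h) / min(w, h) > max_aspect:
            continue  # far from square, unlikely to be a frontal face
        if left < 0 or top < 0 or right > img_w or bottom > img_h:
            continue  # box extends past the image border
        kept.append((left, top, right, bottom))
    return kept


candidates = [
    (10, 10, 200, 210),    # plausible face
    (5, 5, 25, 25),        # too small
    (-20, 30, 150, 200),   # partly outside the image
    (50, 50, 350, 120),    # too elongated
]
print(filter_faces(candidates, img_w=640, img_h=480))
```

With dlib, the tuples can be built from each detection as `(face.left(), face.top(), face.right(), face.bottom())` before the landmark predictor is called.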


2.2 Human pose estimation: MediaPipe Pose

MediaPipe is a lightweight open-source framework from Google. Its Pose solution is based on the BlazePose model and outputs 33 body landmarks (a superset of the 17 COCO key points that adds extra points on the face, hands, and feet), which makes real-time full-body skeleton tracking easy to achieve.
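One detail worth knowing before drawing or measuring anything yourself: MediaPipe reports landmark `x` and `y` as values normalized by the image width and height, not as pixels. The helper below is a minimal sketch of the conversion. The name `to_pixel` is hypothetical, and the clamping is an assumption made here because landmarks near the frame border can fall slightly outside the [0, 1] range.

```python
def to_pixel(x_norm, y_norm, img_w, img_h):
    """Convert normalized landmark coordinates to integer pixel coordinates."""
    px = int(x_norm * img_w)
    py = int(y_norm * img_h)
    # Clamp so the result can be used directly as an image index.
    px = max(0, min(px, img_w - 1))
    py = max(0, min(py, img_h - 1))
    return px, py


print(to_pixel(0.5, 0.25, 640, 480))
print(to_pixel(1.0, 1.0, 640, 480))
```

In a real loop this would be called as `to_pixel(lm.x, lm.y, frame.shape[1], frame.shape[0])` for each landmark `lm` of a detection result.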

Preparation

pip install opencv-python mediapipe

Complete code: real-time pose detection from a webcam

import cv2
import mediapipe as mp

# 1. Initialize the MediaPipe modules
mp_pose = mp.solutions.pose
mp_draw = mp.solutions.drawing_utils
mp_styles = mp.solutions.drawing_styles

pose = mp_pose.Pose(
    static_image_mode=False,    # video mode: track landmarks across frames
    model_complexity=1,         # accuracy/speed trade-off (0 = fastest, 2 = most accurate)
    enable_segmentation=False,
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5
)

# 2. Open the webcam
cap = cv2.VideoCapture(0)

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # 3. Preprocess: convert BGR to RGB before passing the frame to the model
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = pose.process(frame_rgb)

    # 4. If pose landmarks were detected, draw the skeleton
    if results.pose_landmarks:
        mp_draw.draw_landmarks(
            frame,
            results.pose_landmarks,
            mp_pose.POSE_CONNECTIONS,
            mp_styles.get_default_pose_landmarks_style(),
            mp_styles.get_default_pose_connections_style()
        )

    # 5. Show the result
    cv2.imshow("MediaPipe Pose", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

When you run this code, you should see your skeleton drawn over the camera image, following your movements with very little latency. For higher accuracy, set model_complexity=2 (slower); to suppress unreliable detections, raise min_detection_confidence.
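For fitness posture analysis, the usual next step after obtaining the skeleton is to measure joint angles. The sketch below computes the angle at a middle joint from three 2D points. The function name `joint_angle` and the sample coordinates are illustrative. In MediaPipe Pose, the left shoulder, elbow, and wrist are landmarks 11, 13, and 15, and their normalized coordinates should be converted to pixels first so that a non-square frame does not distort the angle.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by the segments b->a and b->c."""
    ang = math.degrees(
        math.atan2(c[1] - b[1], c[0] - b[0]) - math.atan2(a[1] - b[1], a[0] - b[0])
    )
    ang = abs(ang) % 360
    return 360 - ang if ang > 180 else ang


# Illustrative pixel coordinates: shoulder, elbow, wrist
shoulder, elbow, wrist = (100, 100), (100, 200), (200, 200)
print(joint_angle(shoulder, elbow, wrist))   # arm bent at a right angle
print(joint_angle((0, 0), (50, 0), (100, 0)))  # arm fully extended
```

Counting repetitions of an exercise such as a bicep curl then reduces to watching this angle cross an upper and a lower threshold.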


2.3 21-point hand tracking and simple gesture recognition

Hand key point detection is the cornerstone of gesture interaction. MediaPipe Hands detects 21 key points per hand, for one or both hands at the same time, covering the wrist and the joints of every finger.
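The 21 points follow a fixed numbering: the wrist is 0, and each finger then takes four consecutive indices from base to tip. The sketch below rebuilds that index-to-name table in plain Python so the numbers used in gesture code are easier to read. The names mirror MediaPipe's `HandLandmark` enum, while the list `HAND_LANDMARK_NAMES` itself is just an illustration.

```python
# Joint suffixes from the palm outward; the thumb uses CMC/MCP/IP instead of MCP/PIP/DIP.
THUMB_JOINTS = ["CMC", "MCP", "IP", "TIP"]
FINGER_JOINTS = ["MCP", "PIP", "DIP", "TIP"]

HAND_LANDMARK_NAMES = ["WRIST"]
HAND_LANDMARK_NAMES += [f"THUMB_{j}" for j in THUMB_JOINTS]
for finger in ["INDEX_FINGER", "MIDDLE_FINGER", "RING_FINGER", "PINKY"]:
    HAND_LANDMARK_NAMES += [f"{finger}_{j}" for j in FINGER_JOINTS]

for idx in (0, 4, 8, 12, 16, 20):
    print(idx, HAND_LANDMARK_NAMES[idx])
```

Fingertips therefore sit at indices 4, 8, 12, 16, and 20, and subtracting 2 from a fingertip index of the four long fingers gives the PIP joint of the same finger.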

Complete code: real-time gesture judgment (rock-paper-scissors style)

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

hands = mp_hands.Hands(
    static_image_mode=False,
    max_num_hands=1,                  # detect one hand only (adjustable)
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5
)

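# Landmark indices: fingertips and base joints
FINGER_TIPS = [4, 8, 12, 16, 20]   # thumb, index, middle, ring, and pinky tips
FINGER_MCP = [2, 5, 9, 13, 17]     # thumb MCP joint, then the base (MCP) joint of each finger

def get_simple_gesture(hand_landmarks):
    lm = hand_landmarks.landmark
    fingers_up = [0] * 5

    # Thumb: it extends sideways, so compare x instead of y. Which side counts as
    # "out" depends on how the hand faces the camera; this is inferred from the
    # index base (5) relative to the pinky base (17). Assumes a roughly upright hand.
    thumb_tip, thumb_ip = lm[FINGER_TIPS[0]], lm[FINGER_TIPS[0] - 1]
    if lm[FINGER_MCP[1]].x < lm[FINGER_MCP[4]].x:
        # Thumb side is toward smaller x: extended if the tip lies beyond the IP joint
        if thumb_tip.x < thumb_ip.x:
            fingers_up[0] = 1
    else:
        # Thumb side is toward larger x
        if thumb_tip.x > thumb_ip.x:
            fingers_up[0] = 1

    # Other four fingers: extended if the tip is above its PIP joint (tip index - 2);
    # a smaller y is closer to the top of the image
    for i in range(1, 5):
        if lm[FINGER_TIPS[i]].y < lm[FINGER_TIPS[i] - 2].y:
            fingers_up[i] = 1

    # Simple gesture mapping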
    if fingers_up == [1,0,0,0,0]:
        return "Thumbs Up"
    elif fingers_up == [0,1,1,0,0]:
        return "Scissors"
    elif fingers_up == [0,0,0,0,0]:
        return "Rock"
    elif fingers_up == [1,1,1,1,1]:
        return "Paper"
    else:
        return "Unknown"

cap = cv2.VideoCapture(0)

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = hands.process(frame_rgb)

    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            mp_draw.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)
            gesture = get_simple_gesture(hand_landmarks)
            cv2.putText(frame, gesture, (50, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (0,255,0), 2)

    cv2.imshow("MediaPipe Hands + Simple Gesture", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

As you can see, simple geometric rules on the landmark coordinates are enough for basic gesture recognition; recognizing more gestures calls for a more complex classifier.
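Before a classifier can be trained on hand landmarks, the coordinates are usually normalized so that the hand's position and size in the frame do not matter. The sketch below shows one common way to do this, translating to the wrist and scaling by the largest wrist-to-point distance. The function name `landmarks_to_features` and this particular normalization are assumptions for illustration, not part of MediaPipe.

```python
import math


def landmarks_to_features(points):
    """Turn 21 (x, y) hand landmarks into a 42-value feature vector.

    The wrist (index 0) becomes the origin and all values are divided by the
    largest distance from the wrist, making the vector translation- and
    scale-invariant.
    """
    wx, wy = points[0]
    shifted = [(x - wx, y - wy) for x, y in points]
    scale = max(math.hypot(x, y) for x, y in shifted) or 1.0
    return [v / scale for x, y in shifted for v in (x, y)]


# Synthetic hand, plus the same hand three times larger and moved in the frame.
hand = [(0.30 + 0.01 * i, 0.60 - 0.02 * (i % 5)) for i in range(21)]
moved = [(3 * x + 0.2, 3 * y + 0.1) for x, y in hand]

features = landmarks_to_features(hand)
print(len(features))
```

The resulting vectors can be fed to any standard classifier (k-nearest neighbors, an SVM, or a small neural network) together with gesture labels you collect yourself.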


3. Technology overview: core ideas of deep learning methods

If you are an algorithm engineer or want to deeply understand the principles behind key point detection, here is a clear technical map for you.

3.1 Two mainstream algorithm frameworks

| Method | Idea | Advantages and disadvantages |
| --- | --- | --- |
| Top-Down | First detect the bounding box of each person/face/hand, then locate the key points inside each box. | High accuracy; well suited to scenes with one or a few people. With many people it slows down, because the key point model runs once per box, and the result depends on the quality of the detector. |
| Bottom-Up | First find all candidate key points in the image, then group them into individuals with an association strategy. | More efficient in multi-person scenes; OpenPose is a typical example. Tends to cope better with occlusion and dense crowds. |
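One practical detail of the Top-Down approach is that the key point model sees a resized crop, so its predictions have to be mapped back into the coordinates of the full image. The sketch below shows that mapping under simple assumptions: an axis-aligned box, a plain resize without padding, and a hypothetical function name `crop_to_image`.

```python
def crop_to_image(keypoints, box, crop_size):
    """Map (x, y) key points from a resized crop back to full-image coordinates.

    box:       (left, top, right, bottom) of the detection in the full image
    crop_size: (width, height) the crop was resized to before inference
    """
    left, top, right, bottom = box
    sx = (right - left) / crop_size[0]
    sy = (bottom - top) / crop_size[1]
    return [(left + x * sx, top + y * sy) for x, y in keypoints]


# A 200x200 person box resized to a 256x256 model input.
box = (100, 50, 300, 250)
predicted_in_crop = [(0, 0), (128, 128), (256, 256)]
print(crop_to_image(predicted_in_crop, box, (256, 256)))
```

Running this once per detected box is also what makes the cost of Top-Down methods grow with the number of people in the frame.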

3.2 Commonly used loss functions

For heatmap-based models, the core goal of the loss function is to make the predicted heatmap peak at the true key point location; for coordinate-regression models, it is to minimize the distance between the predicted and true coordinates.
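To make the heatmap idea concrete, the sketch below builds the kind of training target commonly used for one key point, a 2D Gaussian centered on the ground-truth location, and decodes a heatmap back into a coordinate by taking its maximum. It uses plain Python lists for clarity, whereas a real pipeline would use an array library. The function names and the value of `sigma` are illustrative assumptions.

```python
import math


def gaussian_heatmap(width, height, cx, cy, sigma=2.0):
    """Target heatmap (list of rows) with a Gaussian bump centered at (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2)) for x in range(width)]
        for y in range(height)
    ]


def decode_peak(heatmap):
    """Return (x, y, value) of the highest response in the heatmap."""
    best_val, best_x, best_y = -1.0, 0, 0
    for y, row in enumerate(heatmap):
        for x, val in enumerate(row):
            if val > best_val:
                best_val, best_x, best_y = val, x, y
    return best_x, best_y, best_val


target = gaussian_heatmap(width=16, height=12, cx=5, cy=7)
print(decode_peak(target))
```

A model with N key points predicts N such maps, and the peak value doubles as a confidence score. The losses listed next measure how far a prediction is from such a target or from the true coordinates.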

  • Mean Squared Error (MSE) / Mean Absolute Error (MAE): The simplest regression losses, supervising heatmaps or coordinate values directly.
  • Wing Loss: Designed for landmark regression. It is logarithmic for small errors and linear for large ones, so small and medium localization errors are penalized more strongly than with MSE or MAE. It is well suited to face/hand tasks that require pixel-level alignment.
  • Smooth L1 Loss: Quadratic for small errors (like MSE) and linear for large errors (like MAE), which makes training more stable; often used for coordinate regression.
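The difference between these losses is easiest to see on a single scalar error. The sketch below implements per-element versions in plain Python. The Wing Loss follows its usual definition, w·ln(1 + |x|/ε) for |x| < w and |x| − C otherwise, with C = w − w·ln(1 + w/ε) chosen to keep the function continuous. The defaults `w=10`, `eps=2`, and `beta=1` are illustrative assumptions, not recommendations.

```python
import math


def mse(err):
    return err * err


def smooth_l1(err, beta=1.0):
    a = abs(err)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta


def wing(err, w=10.0, eps=2.0):
    a = abs(err)
    c = w - w * math.log(1 + w / eps)  # offset that joins the two pieces
    return w * math.log(1 + a / eps) if a < w else a - c


for e in (0.5, 2.0, 20.0):
    print(f"err={e:5.1f}  mse={mse(e):7.2f}  smooth_l1={smooth_l1(e):6.2f}  wing={wing(e):6.2f}")
```

For a small error such as 0.5 pixels, the Wing Loss value is much larger than the MSE or Smooth L1 value, which is what keeps the gradient signal alive once the landmarks are already roughly in place.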

In practice, frameworks such as MediaPipe and OpenPose ship models that are already trained and optimized, so you rarely choose a loss yourself; you only need to pick the model variant that matches the accuracy and speed requirements of your task.


4. Learning and deployment suggestions

1. **Run it first**: Use the MediaPipe trio (Face/Pose/Hands) to build your own real-time demo and develop an intuitive feel for the task.
2. **Then understand it**: Dig into dlib's traditional machine learning approach, get familiar with ideas such as "cascaded regression", and understand how the pre-trained model works.
3. **Tackle the principles**: Read the OpenPose paper or source code and understand core designs such as the Part Affinity Fields used by Bottom-Up methods.

  • **Real-time application optimization**: Reduce the input resolution (for example from 640×480 to 256×256), run key point detection only every 3-5 frames, and in the intermediate frames update the result by tracking from the previous frame's coordinates.
  • **Mobile/edge devices**: Prefer MediaPipe's Lite models, or convert the model to TensorFlow Lite/ONNX for faster inference, and apply quantization (floating point to fixed point) for further compression if necessary.
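The "detect every few frames" advice can be sketched with a small wrapper around any detection function. This is a simplified illustration: the class name `IntervalDetector` and the `detect_fn` callback are hypothetical, and intermediate frames simply reuse the last result, whereas a real system would update the points with a lightweight tracker.

```python
class IntervalDetector:
    """Run an expensive detector every `interval` frames and reuse its result in between."""

    def __init__(self, detect_fn, interval=3):
        self.detect_fn = detect_fn
        self.interval = interval
        self._count = 0
        self._last = None

    def __call__(self, frame):
        # Re-detect on schedule, or immediately if there is no previous result.
        if self._count % self.interval == 0 or self._last is None:
            self._last = self.detect_fn(frame)
        self._count += 1
        return self._last


# Stand-in detector that records which frames it was actually run on.
calls = []

def fake_detect(frame):
    calls.append(frame)
    return [(frame, frame)]

tracker = IntervalDetector(fake_detect, interval=3)
outputs = [tracker(i) for i in range(10)]
print(calls)
```

With `interval=3`, the detector runs on only four of the ten frames, which is where the speed-up comes from. The trade-off is that fast motion between detections is reflected with a delay.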

🔗Extended resources