Practical project: Intelligent face attendance system

Introduction

An intelligent face attendance system is a typical application of computer vision in enterprise digital transformation: it is contactless, fast, and accurate, and it avoids the pain points of traditional fingerprint scanners and time card machines. This article walks through building a lightweight prototype based on MTCNN + ArcFace, covering the core technology, a complete implementation, performance optimization, and security considerations.



1. System architecture and technology stack

1.1 Lightweight prototype architecture

We adopt a modular design of "front-end capture + back-end inference + local storage", which small and medium-sized enterprises can deploy quickly:

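flowchart LR
    A[Camera / image] --> B[MTCNN face detection + alignment]
    B --> C[ArcFace feature extraction]
    C --> D[Cosine similarity matching]
    D --> E[Registered employee?]
    E -->|Yes| F[Record attendance in SQLite]
    E -->|No| G[Mark as unknown person]
    F --> H[Real-time display with OpenCV/Tkinter]
    G --> H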

The overall flow is simple: the camera captures a frame → MTCNN finds the face, then crops and aligns it → ArcFace converts the face into a 512-dimensional feature vector → the system searches the vector library of registered employees for the most similar entry → if the similarity exceeds the threshold, attendance is recorded; otherwise the face is marked as an unknown person.

1.2 Core technology stack

| Function module | Recommended technology | Description |
| --- | --- | --- |
| Deep learning framework | PyTorch | Mature ecosystem with many pre-trained models (such as `facenet-pytorch`) |
| Face detection/alignment | MTCNN (`facenet-pytorch` package) | Lightweight, friendly to small devices, and supports facial landmark localization |
| Feature extraction | ArcFace (IR-SE50 pre-trained) | Industrial-grade accuracy; the additive angular margin loss tightens intra-class distances and widens inter-class distances |
| Database | SQLite | Local storage with no extra service to deploy; suitable for small and medium-sized scenarios |
| Real-time interaction | OpenCV (basic) + Tkinter (optional UI) | OpenCV handles camera capture and real-time frame display; Tkinter can provide a friendlier management UI |

The reasoning behind these choices is simple: by calling ready-made pre-trained MTCNN and ArcFace models from PyTorch, a usable attendance system fits in a few hundred lines of code with no training from scratch, and SQLite keeps all data local without a dedicated database server.


2. Get started quickly with core technologies

2.1 MTCNN face detection and alignment

MTCNN stands for Multi-task Cascaded Convolutional Networks. It can be understood as a three-stage cascade of convolutional neural networks, where each stage works at a different granularity:

  • P-Net (Proposal Network): a small network that quickly scans the entire image and generates a large number of candidate boxes that may contain faces;
  • R-Net (Refine Network): filters the candidate boxes further and rejects most of the regions that are not faces;
  • O-Net (Output Network): refines the face box and simultaneously outputs 5 facial landmarks (two eyes, nose tip, two mouth corners).

This coarse-to-fine filtering keeps detection fast while still producing accurate face boxes and landmarks. The core structure of P-Net is given below to build intuition for MTCNN:

# Simplified P-Net showing only the core branches (in practice, use the packaged library directly)
import torch
import torch.nn as nn

class PNet(nn.Module):
    """Proposal Network: generates candidate windows and outputs classification, box regression, and landmarks."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 10, 3), nn.PReLU(),
            nn.MaxPool2d(2, 2, ceil_mode=True),
            nn.Conv2d(10, 16, 3), nn.PReLU(),
            nn.Conv2d(16, 32, 3), nn.PReLU()
        )
        # Three parallel branches
        self.cls = nn.Conv2d(32, 2, 1)       # face / non-face classification
        self.box = nn.Conv2d(32, 4, 1)       # bounding-box offsets
        self.landmark = nn.Conv2d(32, 10, 1) # coordinates of the 5 landmarks

    def forward(self, x):
        x = self.layers(x)
        return self.cls(x), self.box(x), self.landmark(x)
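
To see why this small network can "scan the entire image", it helps to trace the feature-map sizes through the layers above. The following is a minimal sketch using plain arithmetic only: a 12×12 input collapses to a single 1×1 output cell, so each output cell of a larger image corresponds to one 12×12 window. The helper names (`conv_out`, `pool_out`, `pnet_out`, `pyramid_scales`) are hypothetical, and the pyramid factor 0.709 is an assumption taken from common MTCNN implementations, where the image is repeatedly shrunk so that faces of different sizes fit the 12×12 window.

```python
import math

def conv_out(size, kernel, stride=1):
    """Output size of an unpadded convolution along one dimension."""
    return (size - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Output size of max pooling with ceil_mode=True along one dimension."""
    return math.ceil((size - kernel) / stride) + 1

def pnet_out(size):
    """Trace one spatial dimension through the simplified P-Net layers."""
    size = conv_out(size, 3)   # Conv2d(3, 10, 3)
    size = pool_out(size)      # MaxPool2d(2, 2, ceil_mode=True)
    size = conv_out(size, 3)   # Conv2d(10, 16, 3)
    size = conv_out(size, 3)   # Conv2d(16, 32, 3)
    return size

def pyramid_scales(height, width, min_face_size=20, factor=0.709):
    """Scales at which the image is resized before being fed to P-Net.

    The first scale maps a face of min_face_size pixels onto the 12-pixel
    window; each further scale shrinks the image by `factor` until the
    shorter side no longer fits a 12x12 window.
    """
    scale = 12.0 / min_face_size
    min_side = min(height, width) * scale
    scales = []
    while min_side >= 12:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales

if __name__ == "__main__":
    print("12x12 input ->", pnet_out(12), "output cell(s) per side")
    print("24x24 input ->", pnet_out(24), "output cell(s) per side")
    print("pyramid levels for 640x480:", len(pyramid_scales(480, 640)))
```

Lowering `min_face_size` adds larger scales to the pyramid, which is why detecting very small faces costs noticeably more computation.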

In actual development, we can use the MTCNN wrapper packaged in `facenet-pytorch` directly; face detection and landmark localization take only a few lines of code:

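import cv2
import torch  # needed for the CUDA availability check below
from facenet_pytorch import MTCNN

# Initialize a lightweight MTCNN detector
mtcnn = MTCNN(
    image_size=160,                # size of the aligned output face
    min_face_size=20,              # minimum detectable face size (pixels)
    thresholds=[0.6, 0.7, 0.7],    # confidence thresholds of the three stages
    device="cuda" if torch.cuda.is_available() else "cpu"
)

def detect_and_crop_face(img_path):
    """Detect faces and draw their boxes and landmarks on the image."""
    img = cv2.imread(img_path)
    if img is None:
        raise FileNotFoundError(f"Cannot read image: {img_path}")
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # Detect face boxes, confidences, and 5 landmarks per face
    boxes, probs, landmarks = mtcnn.detect(img_rgb, landmarks=True)

    # Keep only faces with confidence > 0.8
    if boxes is not None:
        for i, box in enumerate(boxes):
            if probs[i] > 0.8:
                # Convert to plain Python ints, which OpenCV drawing calls expect
                x1, y1, x2, y2 = (int(v) for v in box)
                # Draw the bounding box
                cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
                # Draw the landmarks
                for point in landmarks[i]:
                    center = (int(point[0]), int(point[1]))
                    cv2.circle(img, center, 2, (0, 0, 255), -1)
    return img

# Test
result = detect_and_crop_face("test_face.jpg")
cv2.imshow("MTCNN detection", result)
cv2.waitKey(0)
cv2.destroyAllWindows()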

After running, you can see that a green rectangular frame and red key points are drawn on the person's face. This is the first step for the attendance system to "find the person".
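
Drawing boxes is only a visual check; the feature extractor in the next section needs the face region itself. The following is a minimal sketch of cropping one detected box out of a frame, assuming the image is a numpy array as returned by `cv2.imread` and the box is an `(x1, y1, x2, y2)` tuple such as those returned by `mtcnn.detect`. The function name `crop_face` and the `margin` parameter are hypothetical, and the sketch does not perform landmark-based alignment.

```python
import numpy as np

def crop_face(img, box, margin=0.2):
    """Crop a face region from an H x W x C image.

    box    -- (x1, y1, x2, y2) in pixel coordinates (floats are allowed)
    margin -- fraction of the box width/height added on every side
    Returns a copy of the cropped region, or None if the box lies outside the image.
    """
    height, width = img.shape[:2]
    x1, y1, x2, y2 = (float(v) for v in box)
    dx = (x2 - x1) * margin
    dy = (y2 - y1) * margin
    # Expand the box by the margin, then clamp it to the image bounds
    left = max(int(round(x1 - dx)), 0)
    top = max(int(round(y1 - dy)), 0)
    right = min(int(round(x2 + dx)), width)
    bottom = min(int(round(y2 + dy)), height)
    if right <= left or bottom <= top:
        return None
    return img[top:bottom, left:right].copy()

if __name__ == "__main__":
    frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a camera frame
    face = crop_face(frame, (200.4, 120.7, 320.2, 280.9))
    print(face.shape)
```

A small margin keeps some context around the face (forehead, chin), which tends to match how recognition models were trained; the best value depends on the model and is worth checking on real images.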


2.2 ArcFace feature extraction

Once a face is detected, we need to encode it as a fixed-length vector for comparison. ArcFace is a widely used feature extraction model in industry. The vectors it learns are highly discriminative: vectors of the same person lie close together, and vectors of different people lie far apart.

The core improvement of ArcFace is the Additive Angular Margin Loss used during training. In simple terms:

  • During classifier training, the features of each class (each identity) are distributed on a hypersphere, and the angle between two features (equivalently, their cosine similarity) measures how similar they are;
  • To push different people further apart, ArcFace adds a fixed margin m (for example 0.5 radians) to the angle between a sample and its target class during training, which forces the network to learn a tighter intra-class distribution and larger inter-class differences;
  • At inference time the classification layer is dropped, and the normalized 512-dimensional feature vector from the layer before it is used directly to represent a face.
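
Written out, the training objective described in the list above takes the following form. This is a sketch in the usual ArcFace notation, where the symbols are assumptions introduced here: $\theta_j$ is the angle between the normalized feature of a sample and the normalized weight vector of class $j$, $y_i$ is the target class of sample $i$, $s$ is the scale factor, $m$ is the angular margin, $N$ is the batch size, and $n$ is the number of classes.

```latex
L = -\frac{1}{N} \sum_{i=1}^{N}
    \log \frac{e^{\, s \cos(\theta_{y_i} + m)}}
              {e^{\, s \cos(\theta_{y_i} + m)} + \sum_{j=1,\, j \neq y_i}^{n} e^{\, s \cos \theta_j}}
```

Setting $m = 0$ recovers an ordinary softmax cross-entropy over scaled cosine similarities, so the margin is the only term that makes the target class harder to satisfy.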

The following code example shows the core logic behind the ArcFace loss; it produces the margin-adjusted logits, which are then passed to a standard cross-entropy loss:

# Simplified core logic of the ArcFace loss
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcMarginProduct(nn.Module):
    def __init__(self, in_features=512, out_features=1000, s=30.0, m=0.50):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.s = s  # feature scale factor; makes the classifier output "sharper"
        self.m = m  # angular margin; pushes classes further apart
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)

        # Precompute cos(m) and sin(m)
        self.cos_m = math.cos(m)
        self.sin_m = math.sin(m)
        self.th = math.cos(math.pi - m)      # boundary: θ + m exceeds π when cosθ <= th
        self.mm = math.sin(math.pi - m) * m  # linear penalty used in place of cos(θ + m) past the boundary

    def forward(self, features, labels):
        # L2-normalize the features and the class weights
        features = F.normalize(features)
        weight = F.normalize(self.weight)

        # cosθ = features · weight^T
        cos_theta = F.linear(features, weight)
        # sinθ = sqrt(1 - cos²θ)
        sin_theta = torch.sqrt((1.0 - cos_theta.pow(2)).clamp(0, 1))
        # cos(θ + m) = cosθ·cos(m) - sinθ·sin(m)
        cos_theta_m = cos_theta * self.cos_m - sin_theta * self.sin_m

        # When θ + m would exceed π, fall back to cosθ - mm to keep gradients stable
        cos_theta_m = torch.where(cos_theta > self.th, cos_theta_m, cos_theta - self.mm)

        # Build one-hot labels so the margin is applied to the target class only
        one_hot = torch.zeros_like(cos_theta)
        one_hot.scatter_(1, labels.view(-1, 1).long(), 1)

        # Target class uses cos(θ + m); all other classes keep the plain cosθ
        output = (one_hot * cos_theta_m) + ((1.0 - one_hot) * cos_theta)
        # Scale the logits; feed the result to cross-entropy to obtain the loss
        return output * self.s
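
To get a feel for what the margin does numerically, the short sketch below compares the target-class logit with and without the margin for a few angles, using the same `s=30.0` and `m=0.5` defaults as the class above. It uses only the standard library; the helper name `target_logit` is hypothetical.

```python
import math

def target_logit(theta_deg, s=30.0, m=0.5, use_margin=True):
    """Scaled target-class logit for an angle theta (degrees) between feature and class weight."""
    theta = math.radians(theta_deg)
    angle = theta + m if use_margin else theta
    return s * math.cos(angle)

if __name__ == "__main__":
    for theta_deg in (20, 40, 60):
        plain = target_logit(theta_deg, use_margin=False)
        with_margin = target_logit(theta_deg)
        print(f"theta={theta_deg:>2} deg  plain={plain:6.2f}  with margin={with_margin:6.2f}")
```

For every angle the margin version is lower, so the network has to shrink the angle to its own class further than a plain softmax would require before the loss is satisfied.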

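In practice, we can load the pre-trained `InceptionResnetV1` model that `facenet-pytorch` provides (trained on VGGFace2) and extract features with a few lines of code. Note that this model is an Inception-ResNet backbone from the FaceNet family rather than an IR-SE50 ArcFace network; it also outputs 512-dimensional L2-normalized embeddings, so the rest of the pipeline stays the same: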

from facenet_pytorch import InceptionResnetV1
import torchvision.transforms as transforms
from PIL import Image
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pre-trained model (trained on VGGFace2)
model = InceptionResnetV1(pretrained='vggface2').eval().to(device)

# Preprocessing must be consistent with training: Resize → ToTensor → Normalize (maps pixels to [-1, 1])
transform = transforms.Compose([
    transforms.Resize((160, 160)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

def extract_feature(img_path):
    """Return the embedding of an already cropped and aligned face image."""
    img = Image.open(img_path).convert('RGB')
    img_tensor = transform(img).unsqueeze(0).to(device)
    with torch.no_grad():
        feature = model(img_tensor).squeeze().cpu().numpy()
    return feature  # 512-dimensional normalized feature vector

Now, any aligned face can be converted into a 512-dimensional array of floating point numbers, which is its "coordinate" in face space.


2.3 Cosine similarity matching

The feature vectors extracted by the model are normalized to unit length, so the most direct way to measure the similarity of two faces is cosine similarity, which for unit vectors is simply the dot product. The closer the result is to 1, the more likely the two faces belong to the same person; a result close to 0 or negative means they are almost certainly different people.

import numpy as np

def cosine_sim(feat1, feat2):
    """Cosine similarity of two normalized feature vectors."""
    return np.dot(feat1, feat2)  # already normalized, so the dot product is enough

def find_best_match(input_feat, db_feats, db_ids, db_names, threshold=0.6):
    """
    Find the best match in the database.
    - threshold: similarity threshold (typically 0.5-0.7; tune it on your own data)
    Returns ((id, name), similarity), or (None, 0.0) when nothing exceeds the threshold.
    """
    best_idx = -1
    best_sim = 0.0
    for i, db_feat in enumerate(db_feats):
        sim = cosine_sim(input_feat, db_feat)
        if sim > best_sim and sim > threshold:
            best_sim = sim
            best_idx = i
    if best_idx == -1:
        return None, 0.0
    return (db_ids[best_idx], db_names[best_idx]), best_sim

This method is very fast: matching against a vector library of a few thousand employees takes only a few milliseconds, which fully meets the needs of real-time attendance.
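
For larger libraries, the same comparison can be expressed as a single matrix-vector product instead of a Python loop. The following is a minimal sketch, assuming the registered features are stacked into one `(N, 512)` numpy array of unit-length rows; the function name `find_best_match_vectorized` is hypothetical, and it returns a row index rather than the `(id, name)` pair so that the caller can look up whatever metadata it keeps.

```python
import numpy as np

def find_best_match_vectorized(input_feat, db_matrix, threshold=0.6):
    """Return (row_index, similarity) of the best match, or (None, 0.0).

    input_feat -- unit-length feature vector of shape (D,)
    db_matrix  -- array of shape (N, D) whose rows are unit-length features
    """
    db_matrix = np.asarray(db_matrix, dtype=np.float32)
    if db_matrix.size == 0:
        return None, 0.0
    # One matrix-vector product yields all N cosine similarities at once
    sims = db_matrix @ np.asarray(input_feat, dtype=np.float32)
    best_idx = int(np.argmax(sims))
    best_sim = float(sims[best_idx])
    if best_sim <= threshold:
        return None, 0.0
    return best_idx, best_sim

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    db = rng.normal(size=(1000, 512)).astype(np.float32)
    db /= np.linalg.norm(db, axis=1, keepdims=True)  # normalize each row
    query = db[42]                                   # pretend employee 42 shows up
    print(find_best_match_vectorized(query, db))
```

Loading the feature matrix once at startup and refreshing it only when employees are added or removed avoids re-reading the database on every frame.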


3. Core implementation and optimization

3.1 Simplified database design

To get started quickly, we use SQLite to store employee information and attendance records; only two core tables are needed:

-- Users table (stores employee information and face features)
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    emp_id VARCHAR(20) UNIQUE NOT NULL,
    name VARCHAR(100) NOT NULL,
    face_feat BLOB NOT NULL,  -- 512-dimensional numpy array serialized with pickle
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Attendance records table
CREATE TABLE IF NOT EXISTS attendance (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id INTEGER NOT NULL,
    emp_id VARCHAR(20) NOT NULL,
    name VARCHAR(100) NOT NULL,
    check_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    sim REAL NOT NULL,  -- similarity at recognition time, kept for auditing
    FOREIGN KEY (user_id) REFERENCES users(id)
);

`face_feat` stores the numpy array serialized to binary with Python's `pickle`, which makes reading and writing convenient. Keep in mind that unpickling can execute arbitrary code, so only load features from a database file you trust. For each recognition, all user features are read from the database into memory, and once the best match is found, a record is written to the attendance table.
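
As an alternative to `pickle`, the feature can be stored as raw float32 bytes, which keeps the BLOB free of executable content. The following is a minimal sketch under that assumption, using the `users` table defined above; the helper names `register_user` and `load_users` are hypothetical, and the sketch assumes every feature is a 512-dimensional float32 vector.

```python
import sqlite3
import numpy as np

USERS_SCHEMA = """
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    emp_id VARCHAR(20) UNIQUE NOT NULL,
    name VARCHAR(100) NOT NULL,
    face_feat BLOB NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
"""

def register_user(conn, emp_id, name, feat):
    """Insert one employee; the feature is stored as raw float32 bytes."""
    blob = np.asarray(feat, dtype=np.float32).tobytes()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO users (emp_id, name, face_feat) VALUES (?, ?, ?)",
            (emp_id, name, blob),
        )

def load_users(conn):
    """Return (ids, names, feats) for all users, ordered by primary key."""
    rows = conn.execute(
        "SELECT id, emp_id, name, face_feat FROM users ORDER BY id"
    ).fetchall()
    ids = [(row[0], row[1]) for row in rows]
    names = [row[2] for row in rows]
    feats = [np.frombuffer(row[3], dtype=np.float32) for row in rows]
    return ids, names, feats

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use a file path for a real deployment
    conn.executescript(USERS_SCHEMA)
    register_user(conn, "E001", "Alice", np.ones(512) / np.sqrt(512))
    ids, names, feats = load_users(conn)
    print(ids, names, feats[0].shape)
```

The trade-off is that raw bytes carry no shape or dtype information, so the reader and writer must agree on float32 and on the vector length.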

3.2 Lightweight optimization suggestions

For small and medium-sized enterprises or lower-performance devices, the following four optimizations are almost essential:

| Optimization strategy | Specific practice | Effect |
| --- | --- | --- |
| Haar cascade pre-screening | Use OpenCV's built-in Haar classifier to quickly check whether a frame contains a face; if not, skip MTCNN entirely | Greatly reduces wasted computation and can roughly halve CPU usage |
| Reduce camera resolution | Set the camera input to 640x480 rather than blindly pursuing high resolution | Speed improves by 30%-50% with almost no loss in recognition accuracy |
| Frame-skipping recognition | Run a full detection + recognition pass every 5 frames; the remaining frames only display the last result | Noticeably better real-time performance and much lower CPU/GPU load |
| Model quantization | Use `torch.quantization` to convert the FP32 model to INT8 | Inference is 2-4 times faster and model size shrinks by 75% |

These optimizations require no architectural changes. Adding a few checks and configuration items to the code is enough for the system to run smoothly on an ordinary laptop or even a Raspberry Pi.
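
Frame skipping in particular is only a few lines of bookkeeping. The following is a minimal sketch, assuming the expensive detection + recognition step is wrapped in a single callable; the class name `FrameSkipper` and its parameters are hypothetical, and a stub recognizer stands in for the real MTCNN + feature-extraction pipeline.

```python
class FrameSkipper:
    """Run an expensive recognizer every `interval` frames and reuse the last result in between."""

    def __init__(self, recognize, interval=5):
        if interval < 1:
            raise ValueError("interval must be at least 1")
        self.recognize = recognize    # callable: frame -> recognition result
        self.interval = interval
        self.frame_count = 0
        self.last_result = None

    def process(self, frame):
        # Full recognition only on frames 0, interval, 2 * interval, ...
        if self.frame_count % self.interval == 0:
            self.last_result = self.recognize(frame)
        self.frame_count += 1
        return self.last_result

if __name__ == "__main__":
    def fake_recognize(frame):
        # Stand-in for detection + feature extraction + matching
        return f"result for frame {frame}"

    skipper = FrameSkipper(fake_recognize, interval=5)
    for frame_id in range(7):
        print(frame_id, skipper.process(frame_id))
```

In a real camera loop, `process` would be called once per captured frame, and the returned result would be drawn on every frame so the display stays smooth even though recognition runs less often.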


4. Security and Privacy Reminder

Facial recognition technology involves a large amount of personal biometric information and must be used in compliance with laws and regulations. Here are a few very key practical principles:

  • Data minimization: store only the necessary 512-dimensional feature vectors, and never save original face photos;
  • Local processing: run detection and recognition entirely on the local machine, and do not upload original images or facial features to the cloud;
  • Access control: strictly restrict read access to the database files and system code to prevent leakage of feature data;
  • User consent: obtain explicit written consent from employees in advance, and inform them of the purpose of the data and how long it will be stored;
  • Retention period: define a storage period for attendance records and features, and delete them automatically or manually once it expires.

If you don't want to write everything from scratch, you can build the prototype directly on `facenet-pytorch` + `OpenCV` + `SQLite`; its recognition accuracy and speed can meet the needs of small and medium-sized enterprises. By tuning the threshold and the frame-skipping strategy, recognition accuracy above 95% can be achieved in office scenes.
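
The retention principle from this section can be enforced with a small scheduled cleanup. The following is a minimal sketch against the `attendance` table defined earlier, assuming `check_time` holds SQLite's default `CURRENT_TIMESTAMP` text format; the function name `purge_old_attendance` and the 90-day default are hypothetical, and the actual period should follow your own policy and local regulations.

```python
import sqlite3

def purge_old_attendance(conn, retention_days=90):
    """Delete attendance rows older than `retention_days`; return how many were removed."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "DELETE FROM attendance WHERE check_time < datetime('now', ?)",
            (f"-{int(retention_days)} days",),
        )
    return cursor.rowcount

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use the real database path in practice
    conn.execute(
        "CREATE TABLE attendance ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, user_id INTEGER NOT NULL, "
        "emp_id VARCHAR(20) NOT NULL, name VARCHAR(100) NOT NULL, "
        "check_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP, sim REAL NOT NULL)"
    )
    conn.execute(
        "INSERT INTO attendance (user_id, emp_id, name, check_time, sim) "
        "VALUES (1, 'E001', 'Alice', datetime('now', '-200 days'), 0.81)"
    )
    print("rows deleted:", purge_old_attendance(conn))
```

Running the same kind of cleanup on the `users` table when an employee leaves covers the stored face features as well.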

Summary

This article sorts out the complete implementation path of the intelligent face attendance system. The core points can be summarized as follows:

  1. Face detection + alignment: use the lightweight MTCNN to locate faces quickly, then crop and align them to a uniform size;
  2. Feature extraction: use a pre-trained ArcFace model to extract robust 512-dimensional feature vectors;
  3. Similarity matching: compare features quickly with cosine similarity; a threshold between 0.5 and 0.7 is a reasonable starting point;
  4. Implementation optimization: add Haar pre-screening, reduced resolution, frame skipping, model quantization, and similar techniques to keep the system smooth;
  5. Security and compliance: adhere to the principles of data minimization, local processing, and user consent so that the technology is used responsibly.

Once you have mastered the basic MTCNN + ArcFace framework, you can use it not only for face attendance but also adapt it to visitor registration, stranger alerts, smart access control, and other scenarios.
