Practical project: Intelligent face attendance system
Introduction
An intelligent face attendance system is a typical application of computer vision in enterprise digital transformation. It is contactless, fast, and accurate, and it addresses the pain points of traditional fingerprint scanners and time card machines. This article shows how to build a lightweight prototype based on MTCNN + ArcFace, covering the core technology, the implementation, performance optimization, and security considerations.
1. System architecture and technology stack
1.1 Lightweight prototype architecture
We adopt a modular design of "front-end capture + back-end inference + local storage", which small and medium-sized enterprises can deploy quickly.
The overall flow is straightforward: the camera captures a frame → MTCNN finds each face in the frame, then crops and aligns it → ArcFace converts the face into a 512-dimensional feature vector → the system searches the vector library of registered employees for the most similar entry → if the similarity exceeds the threshold, attendance is recorded; otherwise the face is marked as a "stranger".
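The flow above can be sketched as one orchestration function. This is only an illustration under stated assumptions, not the article's implementation: `detect_and_align`, `embed`, and `match` are hypothetical callables standing in for MTCNN, ArcFace, and the vector search, and the 0.6 threshold is an assumed value.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Event = Tuple[str, Optional[str], float]  # (kind, employee_id, similarity)


def process_frame(
    frame: object,
    detect_and_align: Callable[[object], List[object]],       # hypothetical: MTCNN stage
    embed: Callable[[object], Sequence[float]],                # hypothetical: ArcFace stage
    match: Callable[[Sequence[float]], Tuple[str, float]],     # hypothetical: vector search
    threshold: float = 0.6,                                    # assumed value
) -> List[Event]:
    """Run one camera frame through detect -> embed -> match -> decide."""
    events: List[Event] = []
    for face in detect_and_align(frame):
        employee_id, similarity = match(embed(face))
        if similarity >= threshold:
            events.append(("attendance", employee_id, similarity))
        else:
            # Best match is too weak: treat the face as a stranger.
            events.append(("stranger", None, similarity))
    return events
```

Passing the stages in as callables keeps the decision logic testable without a camera or a model; the real stages can be plugged in later.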
1.2 Core technology stack
The rationale is simple: calling the ready-made MTCNN and ArcFace pre-trained models from PyTorch gives a usable attendance system in a few hundred lines of code, with no training from scratch, and SQLite keeps all data local without a dedicated database server.
2. Get started quickly with core technologies
2.1 MTCNN face detection and alignment
MTCNN stands for Multi-task Cascaded Convolutional Networks. It is a three-stage cascade of convolutional neural networks, with each stage working at a different granularity:
- P-Net (Proposal Network): a small network that quickly scans the entire image and generates a large number of candidate boxes that may contain faces;
- R-Net (Refine Network): filters the candidate boxes further and rejects most of the regions that are not faces;
- O-Net (Output Network): precisely locates the face box and also outputs 5 facial landmarks (two eyes, nose tip, two mouth corners).
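This coarse-to-fine filtering keeps detection fast while still producing accurate face boxes and landmarks. To give an intuitive impression of MTCNN: P-Net is a tiny fully convolutional network with a 12×12 receptive field, built from three 3×3 convolution layers followed by 1×1 convolution heads for the face score and the box offsets, so it can slide over an image of any size.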
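Because P-Net only sees 12×12 windows, MTCNN finds faces of different sizes by running it over an image pyramid. The sketch below computes the pyramid scales; the function name is hypothetical, and the defaults (`min_face_size=20`, `factor=0.709`) are values commonly used with MTCNN, assumed here rather than taken from the article.

```python
from typing import List


def pyramid_scales(height: int, width: int,
                   min_face_size: int = 20, factor: float = 0.709) -> List[float]:
    """Scales at which the image is resized before the 12x12 P-Net scans it."""
    pnet_input = 12                       # side length of the window P-Net classifies
    scale = pnet_input / min_face_size    # first scale maps the smallest face to 12 px
    min_side = min(height, width) * scale
    scales: List[float] = []
    # Keep shrinking until the resized image is smaller than one P-Net window.
    while min_side >= pnet_input:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales
```

A larger `min_face_size` shortens the pyramid, which is one of the cheapest ways to speed up detection when faces are always close to the camera.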
In practice, the MTCNN wrapper provided by facenet-pytorch completes face detection and landmark localization in just a few lines of code.
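A minimal sketch of that usage follows. It assumes `facenet-pytorch` and Pillow are installed; `to_pixel_box` and `detect_and_draw` are illustrative helper names, and the file paths are supplied by the caller.

```python
from typing import Iterable, Tuple


def to_pixel_box(box: Iterable[float]) -> Tuple[int, int, int, int]:
    """Round a float (x1, y1, x2, y2) box to integer pixel coordinates."""
    x1, y1, x2, y2 = (int(round(float(v))) for v in box)
    return x1, y1, x2, y2


def detect_and_draw(image_path: str, output_path: str) -> int:
    """Detect faces, draw green boxes and red landmarks, return the face count."""
    # Imported here so the helper above works without the heavy dependencies.
    from facenet_pytorch import MTCNN
    from PIL import Image, ImageDraw

    image = Image.open(image_path).convert("RGB")
    mtcnn = MTCNN(keep_all=True)  # keep every detected face
    boxes, probs, landmarks = mtcnn.detect(image, landmarks=True)
    if boxes is None:             # detect() returns None when no face is found
        return 0

    draw = ImageDraw.Draw(image)
    for box, points in zip(boxes, landmarks):
        draw.rectangle(to_pixel_box(box), outline=(0, 255, 0), width=2)
        for x, y in points:       # 5 landmarks: eyes, nose tip, mouth corners
            draw.ellipse((x - 2, y - 2, x + 2, y + 2), fill=(255, 0, 0))
    image.save(output_path)
    return len(boxes)
```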
When the results are drawn on the image, a green rectangle and red landmark points appear on each face. This is the first step for the attendance system: "finding the person".
2.2 ArcFace feature extraction
Once a face has been detected, it must be encoded as a fixed-length vector for comparison. ArcFace is a widely used feature extraction model in industry, and the vectors it learns are highly discriminative: vectors of the same person lie close together, while vectors of different people lie far apart.
The core improvement of ArcFace is the Additive Angular Margin Loss used during training. In simple terms:
- When the classifier is trained, features and class centers (one class per identity) are normalized onto a hypersphere, and similarity is measured by the angle between them (equivalently, their cosine);
- To push different people further apart, ArcFace adds a fixed margin m (for example, 0.5 radians) to the angle between a sample and its target class during training, which forces the network to learn a tighter intra-class distribution and larger inter-class differences;
- At inference time the classification layer is discarded, and the normalized 512-dimensional feature vector from the embedding layer is used directly to represent a face.
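The core logic of the ArcFace loss is compact: the target-class logit changes from s·cos(θ) to s·cos(θ + m), where θ is the angle to the target class center and s is a scale factor, while all other logits are left unchanged; the result is fed to an ordinary softmax cross-entropy.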
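The margin logic can be sketched in NumPy as follows. This is a simplified illustration, not a training-ready implementation: the function names are hypothetical, `s=64` and `m=0.5` are assumed defaults, and the special handling that production code applies when θ + m exceeds π is omitted.

```python
import numpy as np


def arcface_logits(features: np.ndarray, weights: np.ndarray, labels: np.ndarray,
                   s: float = 64.0, m: float = 0.5) -> np.ndarray:
    """Return scaled logits with the angular margin m added to each target class."""
    # L2-normalise features (N, D) and class centres (C, D) so dot products are cosines.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos_theta = np.clip(f @ w.T, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    rows = np.arange(len(labels))
    theta[rows, labels] += m              # the margin only affects the target class
    return s * np.cos(theta)


def cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean softmax cross-entropy, computed with the log-sum-exp trick."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())
```

For the same features, the loss with a margin is higher than without one, which is what pushes the network toward tighter clusters.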
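In practice there is no need to train anything. Note that facenet-pytorch ships InceptionResnetV1, a FaceNet-style backbone with weights pre-trained on VGGFace2 or CASIA-WebFace, rather than an IR-SE50 ArcFace checkpoint. It also outputs normalized 512-dimensional embeddings, so it can serve as the extractor in this prototype with very little code; a true ArcFace model obtained from another source (for example, InsightFace) can be swapped in later without changing the rest of the pipeline.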
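A minimal extraction sketch, assuming `facenet-pytorch`, PyTorch, and Pillow are installed. It uses `InceptionResnetV1(pretrained="vggface2")`, which is the 512-dimensional model that facenet-pytorch actually provides; treating it as a stand-in for ArcFace is an assumption of this sketch, and `l2_normalize` and `extract_embedding` are illustrative names.

```python
import numpy as np


def l2_normalize(vector: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    return vector / np.linalg.norm(vector)


def extract_embedding(image_path: str) -> np.ndarray:
    """Return a unit-length 512-d embedding for one face in the image."""
    # Imported lazily: these packages are only needed when a real image is processed.
    import torch
    from facenet_pytorch import MTCNN, InceptionResnetV1
    from PIL import Image

    mtcnn = MTCNN(image_size=160)                            # crops and resizes one face
    model = InceptionResnetV1(pretrained="vggface2").eval()  # stand-in for ArcFace

    face = mtcnn(Image.open(image_path).convert("RGB"))      # (3, 160, 160) tensor or None
    if face is None:
        raise ValueError("no face detected")
    with torch.no_grad():
        embedding = model(face.unsqueeze(0))[0].numpy()      # shape (512,)
    return l2_normalize(embedding)                           # defensive re-normalisation
```

In a long-running service the two models would be created once and reused rather than rebuilt on every call.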
Now, any aligned face can be converted into a 512-dimensional array of floating point numbers, which is its "coordinate" in face space.
2.3 Cosine similarity matching
The two normalized vectors extracted by ArcFace both have unit length, so the most direct similarity measure is cosine similarity, which for unit vectors is simply their dot product. The closer the result is to 1, the more likely the two faces belong to the same person; a result near 0 or below indicates that they almost certainly do not.
Matching is very fast: comparing one query against a vector library of a few thousand employees takes only a few milliseconds, which comfortably meets the needs of real-time attendance.
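With all registered vectors stacked into one matrix, the whole search is a single matrix-vector product. The sketch below assumes unit-length vectors; `best_match` is a hypothetical helper and the 0.6 threshold is an assumed value.

```python
from typing import List, Optional, Tuple

import numpy as np


def best_match(query: np.ndarray, gallery: np.ndarray, ids: List[int],
               threshold: float = 0.6) -> Tuple[Optional[int], float]:
    """Return (employee_id or None, similarity) for the closest registered vector.

    query:   (D,) unit vector of the face being checked
    gallery: (N, D) matrix of unit vectors, one row per registered employee
    """
    if len(ids) == 0:
        return None, 0.0
    sims = gallery @ query            # unit vectors: dot product == cosine similarity
    index = int(np.argmax(sims))
    score = float(sims[index])
    # Below the threshold the face is reported as unknown.
    return (ids[index] if score >= threshold else None), score
```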
3. Core implementation and optimization
3.1 Simplified database design
To get up and running quickly, we store employee information and attendance records in SQLite. Only two core tables are needed: one for employees (including the face feature) and one for attendance records.
The face_feat column is written with Python's pickle module, which serializes the numpy array into binary data for storage in the database and makes reading and writing very convenient. For each recognition, all user features are read from the database into memory, and once the best match is found, a row is written to the attendance table.
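One possible shape for the two tables and the read/write helpers is sketched below. Only the `face_feat` column name comes from the text; the table names, the other columns, and the helper functions are assumptions made for illustration.

```python
import pickle
import sqlite3
from typing import List, Tuple

import numpy as np

SCHEMA = """
CREATE TABLE IF NOT EXISTS users (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    name      TEXT NOT NULL,
    face_feat BLOB NOT NULL                    -- pickled 512-d float32 vector
);
CREATE TABLE IF NOT EXISTS attendance (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id    INTEGER NOT NULL REFERENCES users(id),
    similarity REAL NOT NULL,
    checked_at TEXT NOT NULL DEFAULT (datetime('now'))
);
"""


def register_user(conn: sqlite3.Connection, name: str, feature: np.ndarray) -> int:
    """Store one employee and return the new row id."""
    blob = pickle.dumps(feature.astype(np.float32))
    cur = conn.execute("INSERT INTO users (name, face_feat) VALUES (?, ?)", (name, blob))
    conn.commit()
    return cur.lastrowid


def load_gallery(conn: sqlite3.Connection) -> Tuple[List[int], np.ndarray]:
    """Read every registered feature into memory as an (N, 512) matrix."""
    rows = conn.execute("SELECT id, face_feat FROM users ORDER BY id").fetchall()
    if not rows:
        return [], np.empty((0, 512), dtype=np.float32)
    return [r[0] for r in rows], np.stack([pickle.loads(r[1]) for r in rows])


def record_attendance(conn: sqlite3.Connection, user_id: int, similarity: float) -> None:
    """Append one attendance row; the timestamp defaults to the current UTC time."""
    conn.execute("INSERT INTO attendance (user_id, similarity) VALUES (?, ?)",
                 (user_id, float(similarity)))
    conn.commit()
```

Since `pickle.loads` can execute arbitrary code, the database file should only ever be written by the application itself; storing `ndarray.tobytes()` and reading it back with `np.frombuffer` is a possible alternative.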
3.2 Lightweight optimization suggestions
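For small and medium-sized enterprises or lower-performance devices, the following four optimizations are close to mandatory: a Haar cascade pre-screen before MTCNN, reduced input resolution, frame skipping, and model quantization.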
None of these optimizations requires architectural changes. A few extra checks and configuration options in the code are enough for the system to run smoothly on an ordinary laptop or even a Raspberry Pi.
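Two of these checks, frame skipping and resolution reduction, can be expressed as a small configuration object plus two helper functions. The names and default values below are assumptions chosen for illustration, not settings from the article.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class OptimizationConfig:
    process_every_n_frames: int = 3   # assumed: run recognition on 1 frame in 3
    max_width: int = 640              # assumed: downscale wider frames to this width


def should_process(frame_index: int, config: OptimizationConfig) -> bool:
    """Frame skipping: only every n-th frame goes through detection."""
    return frame_index % config.process_every_n_frames == 0


def downscaled_size(width: int, height: int,
                    config: OptimizationConfig) -> Tuple[int, int]:
    """Resolution reduction: target size that keeps the aspect ratio."""
    if width <= config.max_width:
        return width, height
    scale = config.max_width / width
    return config.max_width, max(1, round(height * scale))
```

The returned size can be passed to whatever resize call the capture code already uses, so the rest of the pipeline stays unchanged.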
4. Security and Privacy Reminder
Facial recognition processes personal biometric information and must comply with applicable laws and regulations. The key practical principles are:
- ✅ Data Minimization: Only store necessary 512-dimensional feature vectors, never save original face photos;
- ✅ Local processing: Detection and recognition are all completed locally, and the original images or facial features are not uploaded to the cloud;
- ✅ Access Control: Strictly restrict read permissions on the database file and the system source code to prevent feature data from leaking;
- ✅ User Consent: Obtain explicit written consent from employees in advance, and inform them of the purpose of the data and how long it will be stored;
- ✅ Retention Period: Set a retention period for attendance records and feature vectors, and delete them automatically or manually once it expires.
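The retention rule in the last item can be automated with a single SQL statement. The sketch below assumes a hypothetical `attendance` table with a `checked_at` column holding UTC timestamps in SQLite's default text format; the retention length is a policy decision, not a value from the article.

```python
import sqlite3


def purge_expired(conn: sqlite3.Connection, retention_days: int) -> int:
    """Delete attendance rows older than retention_days and return how many were removed."""
    cur = conn.execute(
        "DELETE FROM attendance WHERE checked_at < datetime('now', ?)",
        (f"-{retention_days} days",),   # SQLite date modifier, e.g. '-90 days'
    )
    conn.commit()
    return cur.rowcount
```

Running it from a daily scheduled job keeps deletion automatic; the same pattern applies to feature vectors of employees who have left.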
Summary
This article walked through the complete implementation path of an intelligent face attendance system. The core points are:
- Face Detection + Alignment: Use the lightweight MTCNN to locate faces quickly, then crop and align them to a uniform size;
- Feature extraction: Use a pre-trained ArcFace model to extract robust 512-dimensional feature vectors;
- Similarity matching: Use cosine similarity for fast comparison; a threshold between 0.5 and 0.7 is a reasonable starting point;
- Implementation Optimization: Add Haar pre-screening, resolution reduction, frame skipping, model quantization, and similar techniques to keep the system responsive;
- Security and Compliance: Adhere to data minimization, local processing, and user consent so that the technology is used responsibly.
With the basic MTCNN + ArcFace framework in hand, you can go beyond face attendance and adapt it to visitor registration, stranger alerts, smart access control, and other scenarios.

