MAE (Masked Autoencoders): A Detailed Look at Self-Supervised Visual Pre-training
Introduction
Masked Autoencoders (MAE) is a self-supervised learning method proposed by Kaiming He and colleagues in 2021. It carries the masked language modeling idea behind BERT over from natural language processing to computer vision. The core idea is to randomly mask a large share of the image patches (75% by default) and train a model to reconstruct the masked parts. Because so little of the image remains visible, the model cannot succeed by copying nearby pixels and is pushed to learn the structure of the image, which yields strong visual representations. The method gave self-supervised learning in vision a significant boost and provides a simple, efficient recipe for pre-training Vision Transformers (ViT).
1. MAE core ideas and motivations
1.1 The rise of self-supervised learning
Self-supervised learning is an important direction in deep learning. Its goal is to let a model build its own supervision signal from large amounts of unlabeled data and pre-train on it. Several factors motivate this approach:
- Data efficiency: It avoids expensive, time-consuming manual annotation and can directly use the large volume of unlabeled images available on the Internet or inside a company.
- Cost effectiveness: No professional labeling team is needed, which significantly reduces the cost of developing models.
- Generalization ability: The model learns general-purpose visual features from unconstrained natural data instead of being tied to one labeling task.
- Scalability: The approach suits large datasets and large models. Performance tends to keep improving as data and parameter counts grow.
1.2 Innovations of MAE
MAE’s success mainly comes from three key technological innovations:
- Asymmetric encoder-decoder architecture: The encoder processes only the visible (unmasked) patches, so with a 75% mask ratio it sees about a quarter of the tokens and its compute drops by about 75%. The decoder processes the full set of patches, visible and masked, but is deliberately small and is used only to reconstruct the masked content.
- High-ratio random masking: Masking 75% of the patches at random leaves too little context for local texture interpolation, so the model has to capture the global semantic relationships between image regions.
- Simple pixel-level reconstruction target: The model directly predicts the original RGB pixel values of the masked patches. No auxiliary module such as a pre-trained VAE (variational autoencoder) or tokenizer is required, so the implementation is simple and training is stable.
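The savings from the asymmetric design follow from simple arithmetic. The sketch below assumes a 224×224 input split into 16×16 patches (an assumption consistent with the 196 patches and 16×16×3 values mentioned later); the function name `mae_token_budget` is illustrative only.

```python
def mae_token_budget(image_size=224, patch_size=16, mask_ratio=0.75):
    """Return (total, visible, masked) patch counts for one square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    total = per_side * per_side
    visible = int(total * (1 - mask_ratio))
    masked = total - visible
    return total, visible, masked


total, visible, masked = mae_token_budget()
print(total, visible, masked)  # 196 49 147

# Token-wise layers (e.g. the MLP) scale linearly with sequence length,
# while self-attention scales roughly with its square.
linear_fraction = visible / total            # 0.25
attention_fraction = (visible / total) ** 2  # 0.0625
```

The encoder therefore handles 49 tokens instead of 196, which is where the roughly 75% reduction comes from; the attention layers shrink even more because their cost grows quadratically with sequence length.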
2. Detailed explanation of MAE architecture
2.1 Asymmetric encoder-decoder design
The encoder is a standard Vision Transformer (ViT) that runs only on the unmasked patches and encodes them into high-dimensional feature representations. Its core modules are patch embedding, position encoding, and a stack of Transformer encoder layers.
The MAE-specific step is that masked patches are removed before the Transformer layers run, so with a 75% mask ratio the encoder processes only about a quarter of the usual sequence.
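As a minimal sketch of the encoder's input handling, the plain-Python code below splits an image into flattened patches and passes on only the visible ones. It deliberately avoids any deep learning framework; `patchify` and `encoder_inputs` are illustrative names, and the learned patch embedding, position encoding, and Transformer layers are omitted.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches.

    Patches are returned in row-major order; each patch is a flat list of
    patch_size * patch_size * C values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for y in range(top, top + patch_size):
                for x in range(left, left + patch_size):
                    patch.extend(image[y][x])
            patches.append(patch)
    return patches


def encoder_inputs(patches, keep_indices):
    """Return (position, patch) pairs for the visible patches only."""
    return [(i, patches[i]) for i in keep_indices]


# Toy 4 x 4 RGB image where every pixel stores (row, col, 0).
image = [[[r, c, 0] for c in range(4)] for r in range(4)]
patches = patchify(image, patch_size=2)              # 4 patches, 2*2*3 = 12 values each
visible = encoder_inputs(patches, keep_indices=[3])  # keep 1 of 4 patches (75% masked)
```

Keeping the original position next to each visible patch matters: the encoder needs it to add the right position encoding, since the visible patches no longer form a regular grid.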
2.2 MAE decoder design
The decoder is much lighter than the encoder, but it has to process every patch, visible and masked alike (196 patches for a 224×224 image split into 16×16 patches), and it exists solely for the reconstruction task. Its key components are:
- A linear projection that maps the encoder's output features to the smaller width used by the decoder.
- A shared, learnable mask token that stands in for every masked patch.
- Position embeddings for all positions, so the model knows where each patch sits in the original image whether or not it is masked.
- Several Transformer layers that combine information from the visible patches and infer the content of the masked regions.
- A linear prediction layer that maps each patch's features back to pixel values (e.g. 16×16×3 = 768 RGB values per patch).
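How the mask token and the position embeddings fit together can be sketched in plain Python. This is an illustration under simplifying assumptions, not a full decoder: tokens are short lists of floats, the visible tokens are assumed to be already projected to the decoder width, the class token is ignored, and `build_decoder_inputs` is a hypothetical helper name.

```python
def build_decoder_inputs(visible_tokens, keep_indices, num_patches,
                         mask_token, pos_embed):
    """Assemble the full-length decoder input sequence.

    visible_tokens: one vector per visible patch, aligned with keep_indices.
    keep_indices:   original positions of the visible patches.
    mask_token:     one shared vector standing in for every masked patch.
    pos_embed:      one position vector per patch position.
    """
    by_position = dict(zip(keep_indices, visible_tokens))
    sequence = []
    for position in range(num_patches):
        token = by_position.get(position, mask_token)
        # The position embedding is added to every token, visible or masked.
        sequence.append([t + p for t, p in zip(token, pos_embed[position])])
    return sequence


# Toy example: 4 patch positions, decoder width 2, patches 1 and 3 visible.
mask_token = [0.0, 0.0]
pos_embed = [[10.0 * i, 0.5] for i in range(4)]
decoder_in = build_decoder_inputs(
    visible_tokens=[[1.0, 1.0], [3.0, 3.0]],
    keep_indices=[1, 3],
    num_patches=4,
    mask_token=mask_token,
    pos_embed=pos_embed,
)
```

All masked positions start from the same mask token, so the position embedding is the only thing that tells the decoder which part of the image each one is supposed to reconstruct.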
3. Masking strategy and complete model
3.1 Random high proportion mask implementation
MAE's masking strategy looks simple, but it is crucial to training. For each sample, the order of all patches is shuffled at random; the first part of the shuffled list is kept as the visible patches and the rest are masked. So that the decoder can place every token back at its original position, a "restore index" (the inverse of the shuffle permutation) is saved and later used to return the sequence to its original order.
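The shuffle-and-restore bookkeeping can be sketched with the standard library alone. The names `random_masking`, `ids_keep`, `ids_restore`, and `unshuffle` are chosen for illustration; a real implementation would do the same thing with batched tensor operations.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, rng=random):
    """Pick visible patches by shuffling patch positions.

    Returns (ids_keep, ids_restore, mask):
      ids_keep    - original positions of the visible patches
      ids_restore - for each original position, its slot in the shuffled order
      mask        - 0 for visible, 1 for masked, in original order
    """
    num_keep = int(num_patches * (1 - mask_ratio))
    ids_shuffle = list(range(num_patches))
    rng.shuffle(ids_shuffle)           # random permutation of positions
    ids_keep = ids_shuffle[:num_keep]  # the first part stays visible

    ids_restore = [0] * num_patches
    for slot, position in enumerate(ids_shuffle):
        ids_restore[position] = slot   # inverse permutation

    mask = [0 if ids_restore[p] < num_keep else 1 for p in range(num_patches)]
    return ids_keep, ids_restore, mask


def unshuffle(shuffled_tokens, ids_restore):
    """Put tokens stored in shuffled order back into original order."""
    return [shuffled_tokens[slot] for slot in ids_restore]


# Seeded generator so the example is reproducible.
ids_keep, ids_restore, mask = random_masking(196, rng=random.Random(0))

# Visible tokens first, then one placeholder per masked patch, as a decoder
# would see them before restoring the original order.
shuffled = list(ids_keep) + ["MASK"] * (196 - len(ids_keep))
restored = unshuffle(shuffled, ids_restore)
```

After `unshuffle`, every visible token is back at its original position and every other position holds the placeholder, which is exactly the layout the decoder expects.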
3.2 MAE complete model and training process
The complete model wraps the encoder, the decoder, and the loss computation. During training the loss is computed only on the masked patches. A normalized pixel loss is typically used: the pixels of each target patch are standardized with that patch's own mean and variance before the MSE is computed, which further improves training stability and final performance.
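The loss described above can be written out in a few lines of plain Python. This is a sketch under stated assumptions: patches are flat lists of floats, the variance is the population variance (a framework's default estimator may divide by n-1 instead), and `eps` is a small constant added for numerical safety. The function names are illustrative.

```python
def normalize_patch(pixels, eps=1e-6):
    """Standardize one flattened patch to zero mean and unit variance."""
    mean = sum(pixels) / len(pixels)
    var = sum((v - mean) ** 2 for v in pixels) / len(pixels)
    return [(v - mean) / (var + eps) ** 0.5 for v in pixels]


def masked_mse(pred, target, mask, norm_pix=True):
    """Mean squared error averaged over masked patches only.

    pred, target: lists of flattened patches with matching shapes.
    mask:         1 for masked (counts toward the loss), 0 for visible.
    """
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if not m:
            continue  # visible patches do not contribute to the loss
        if norm_pix:
            t = normalize_patch(t)
        total += sum((a - b) ** 2 for a, b in zip(p, t)) / len(t)
        count += 1
    return total / count if count else 0.0


# Patch 0 is visible, so its (deliberately wrong) prediction is ignored.
target = [[5.0, 5.0], [1.0, 3.0]]
pred = [[0.0, 0.0], [1.0, 1.0]]
loss_raw = masked_mse(pred, target, mask=[0, 1], norm_pix=False)  # 2.0
```

Normalizing each target patch removes its local brightness and contrast, so the model is scored on reproducing structure within the patch rather than absolute intensity.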
4. Pre-training and downstream applications
4.1 Key points of pre-training
For pre-training, the AdamW optimizer combined with learning-rate warm-up and cosine annealing is recommended. Data augmentation can be limited to random resized cropping and horizontal flipping. Heavier augmentation is unnecessary because the high mask ratio already acts as a strong regularizer.
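A warm-up plus cosine annealing schedule is easy to express as a function of the epoch. The sketch below is one common formulation (linear warm-up, then a half-cycle cosine decay); the function name and the numeric values in the usage lines are placeholders for illustration, not settings taken from this article.

```python
import math


def lr_at_epoch(epoch, base_lr, warmup_epochs, total_epochs, min_lr=0.0):
    """Linear warm-up followed by half-cycle cosine decay.

    Assumes total_epochs > warmup_epochs.
    """
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + (base_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))


# Placeholder settings: 10 warm-up epochs out of 100, peak learning rate 1e-3.
schedule = [lr_at_epoch(e, base_lr=1e-3, warmup_epochs=10, total_epochs=100)
            for e in range(100)]
peak_epoch = schedule.index(max(schedule))  # the peak is reached right after warm-up
```

The value returned for each epoch would then be written into the optimizer's parameter groups before that epoch's updates; fractional epochs can be passed in to adjust the rate per iteration instead.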
4.2 Fine-tuning steps
After pre-training, applying the model to a downstream task (such as image classification or object detection) generally involves the following steps:
- Extract the encoder: Discard the decoder, which is only needed during pre-training, and keep the ViT encoder.
- Add a task head: For ImageNet classification, for example, the class token output by the encoder is fed to a linear layer that maps it to 1000 classes.
- Choose a fine-tuning strategy: Linear probing freezes the encoder and trains only the classification head. Unfreezing all parameters for end-to-end fine-tuning afterwards usually gives the best results.
4.3 Use the timm library to load the pre-trained model
With the timm library, you can load pre-trained MAE models and extract image features without implementing the model from scratch.
Summary
MAE successfully brings masked modeling from NLP to computer vision through a concise combination of high-ratio random masking, an asymmetric encoder-decoder, and pixel-level reconstruction. It significantly improves Vision Transformer performance on downstream tasks (e.g., ImageNet Top-1 accuracy rises from 82.2% for a purely supervised ViT-B to 83.6% for ViT-B with MAE pre-training). The method is simple to implement and data-efficient, and it has become a standard paradigm for pre-training Vision Transformers.