Vision-Language multi-modality: detailed explanation of CLIP model and image-text alignment
Introduction
In the field of artificial intelligence, visual perception and language understanding used to be two relatively independent tracks. Vision-Language multi-modal technology is a bridge between the two - it allows the machine to "understand" both images and text, and more importantly, it can align these two different types of information.
CLIP (Contrastive Language-Image Pre-training), proposed by OpenAI in 2021, is a milestone on this bridge. Using only a simple contrastive learning objective and a large number of image-text pairs, it maps images and text into the same semantic space, which gives it capabilities such as zero-shot classification and image-text retrieval out of the box. Later popular text-to-image models such as DALL·E and Midjourney also owe a debt to this kind of alignment idea.
1. Multimodal foundation and task positioning of CLIP
1.1 Core application scenarios of vision-language
Multimodal learning is not just a theoretical exercise; it has already made its way into a large number of real products. The following table gives a quick overview:
1.2 Understand CLIP in one sentence
CLIP is a general-purpose image-text semantic aligner. Its training goal is very simple: pull "matching" image-text pairs closer together in the feature space and push "mismatched" pairs further apart.
The idea looks simple, but because it is trained on large-scale weakly supervised data (image-text pairs that occur naturally on the Internet), the alignment it learns is surprisingly good.
2. Breaking down CLIP's core technology
2.1 Dual encoder architecture: one for images and one for text
The structure of CLIP can be summed up in two words: two towers. Images and text each have their own encoder, with no cross-modal attention between them; similarity is computed only at the very end, with a simple comparison of the two output vectors.
- Image Encoder: Either a ResNet or a Vision Transformer (ViT). The classic strong variant is ViT‑L/14, a Large-size ViT that splits the input image into patches of 14×14 pixels.
- Text Encoder: A Transformer stack with no separate decoder. It is often compared to BERT, but the original CLIP uses causally masked self-attention (closer to GPT-2) and takes the feature at the end-of-text token as the sentence representation.
- Projection layer: Maps image features and text features to the same dimension (such as 512 or 768) and then applies L2 normalization, so that all vectors lie on the unit sphere.
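To make the projection step concrete, here is a minimal PyTorch sketch of a projection head followed by L2 normalization. The class name `ProjectionHead` and the feature widths (768 and 512) are illustrative assumptions, not values taken from a specific CLIP checkpoint.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """Linear map from an encoder's feature width to the shared embedding size."""

    def __init__(self, in_dim: int, embed_dim: int = 512):
        super().__init__()
        # Assumption: a single bias-free linear layer is enough for the sketch.
        self.proj = nn.Linear(in_dim, embed_dim, bias=False)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # L2-normalize so every embedding has unit length.
        return F.normalize(self.proj(features), dim=-1)


# Hypothetical feature widths: 768 for the image tower, 512 for the text tower.
image_head = ProjectionHead(in_dim=768)
text_head = ProjectionHead(in_dim=512)

image_emb = image_head(torch.randn(4, 768))
text_emb = text_head(torch.randn(4, 512))

# On the unit sphere, a dot product is exactly the cosine similarity.
similarity = image_emb @ text_emb.t()  # shape (4, 4), values in [-1, 1]
```

Because both towers end in the same embedding size and the same normalization, the final comparison reduces to one matrix multiplication.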
This design has two obvious benefits:
- Fast: At inference time, image encoding and text encoding are completely independent, and text features can even be computed in advance;
- Deployment-friendly: The two encoders can be served separately, for example the image service on GPU and the text service on CPU, without interfering with each other.
The following is a streamlined but complete PyTorch implementation. In order to focus on the core logic, some initialization parameter details are omitted.
Image Encoder (ViT simplified version)
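A minimal sketch of what such a simplified ViT image tower could look like. All hyperparameters here (patch size 16, width 256, depth 4) are small illustrative assumptions rather than the settings of any released CLIP model, and the class name `SimpleImageEncoder` is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleImageEncoder(nn.Module):
    """Minimal ViT-style image tower: patchify, Transformer, project, normalize."""

    def __init__(self, image_size=224, patch_size=16, width=256,
                 depth=4, heads=4, embed_dim=512):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A strided convolution cuts the image into patches and embeds each one.
        self.patch_embed = nn.Conv2d(3, width, kernel_size=patch_size,
                                     stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, width))
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, width) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=width, nhead=heads, dim_feedforward=4 * width,
            batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(width)
        self.proj = nn.Linear(width, embed_dim, bias=False)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.patch_embed(images)              # (B, width, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)          # (B, num_patches, width)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.blocks(x)
        x = self.norm(x[:, 0])                    # class token as the image feature
        return F.normalize(self.proj(x), dim=-1)  # unit-length embedding


encoder = SimpleImageEncoder()
emb = encoder(torch.randn(2, 3, 224, 224))  # two random "images"
print(emb.shape)  # torch.Size([2, 512])
```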
Text encoder (simplified version)
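A minimal sketch of a simplified text tower. It assumes causally masked self-attention and pooling at the end-of-text token, as in the original CLIP; a BERT-style bidirectional variant would simply drop the mask. The class name `SimpleTextEncoder`, the small width and depth, and the toy vocabulary size in the usage lines are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleTextEncoder(nn.Module):
    """Minimal text tower: embed tokens, masked Transformer, pool, project."""

    def __init__(self, vocab_size=49408, context_length=77, width=256,
                 depth=4, heads=4, embed_dim=512):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, width)
        self.pos_embed = nn.Parameter(torch.randn(context_length, width) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=width, nhead=heads, dim_feedforward=4 * width,
            batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(width)
        self.proj = nn.Linear(width, embed_dim, bias=False)

    def forward(self, token_ids: torch.Tensor, eos_index: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, T) integer ids; eos_index: (B,) position of the end-of-text token.
        seq_len = token_ids.size(1)
        x = self.token_embed(token_ids) + self.pos_embed[:seq_len]
        # True above the diagonal means "may not attend to later tokens".
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=token_ids.device),
            diagonal=1)
        x = self.norm(self.blocks(x, mask=causal))
        # Pool by picking the hidden state at each sequence's end-of-text position.
        x = x[torch.arange(x.size(0), device=x.device), eos_index]
        return F.normalize(self.proj(x), dim=-1)


text_encoder = SimpleTextEncoder(vocab_size=1000)   # toy vocabulary
tokens = torch.randint(0, 1000, (2, 16))            # random stand-in token ids
eos = torch.tensor([15, 9])                         # hypothetical end-of-text positions
text_features = text_encoder(tokens, eos)
print(text_features.shape)  # torch.Size([2, 512])
```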
2.2 Contrastive learning and InfoNCE loss
The heart of CLIP training is a contrastive loss called InfoNCE. Once you understand it, you understand CLIP.
Suppose there are N pairs of images and text in a batch, then:
- Positive samples: only N pairs, where the i-th image matches the i-th text.
- Negative samples: many more. Image i is paired with every text j (j≠i), and text i with every image j (j≠i), for a total of N×(N‑1) mismatched pairs.
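The counting above can be checked with a small sketch that marks positives and negatives in an N×N image-text grid. The batch size of 4 is an arbitrary assumption for illustration.

```python
import torch

N = 4  # hypothetical batch size

# Row i is image i, column j is text j.
positive_mask = torch.eye(N, dtype=torch.bool)  # diagonal: matching pairs
negative_mask = ~positive_mask                  # everything else: mismatched pairs

num_positives = int(positive_mask.sum())  # N
num_negatives = int(negative_mask.sum())  # N * (N - 1)
print(num_positives, num_negatives)       # 4 12
```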
CLIP does two things at the same time:
- Let each image find its correct text: compute the similarity between each image and all texts, and push the i-th image to be most similar to the i-th text.
- Let each text find its correct image: in the same way, each text should retrieve its own "original" image.
Together, these two parts form a bidirectional (symmetric) contrastive loss.
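Written out, the two directions can be expressed as follows. The notation is an assumption introduced here for illustration: $u_i$ and $v_j$ are the L2-normalized embeddings of image $i$ and text $j$, and $\tau$ is a temperature (a learnable parameter in the original CLIP).

```latex
\[
\mathcal{L}_{\text{i2t}} = -\frac{1}{N}\sum_{i=1}^{N}
  \log\frac{\exp(u_i^\top v_i/\tau)}{\sum_{j=1}^{N}\exp(u_i^\top v_j/\tau)},
\qquad
\mathcal{L}_{\text{t2i}} = -\frac{1}{N}\sum_{i=1}^{N}
  \log\frac{\exp(u_i^\top v_i/\tau)}{\sum_{j=1}^{N}\exp(u_j^\top v_i/\tau)}
\]
\[
\mathcal{L} = \tfrac{1}{2}\left(\mathcal{L}_{\text{i2t}} + \mathcal{L}_{\text{t2i}}\right)
\]
```

Each term is an ordinary softmax cross-entropy: the first treats the N texts as classes for image $i$, the second treats the N images as classes for text $i$.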
Loss function implementation
The code is simpler than you think:
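A minimal sketch of the symmetric loss in PyTorch. The function name `clip_contrastive_loss` is hypothetical, and the fixed temperature of 0.07 is an assumption for the sketch; the original CLIP learns this value during training.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # logits[i, j]: cosine similarity of image i and text j, scaled by 1/temperature.
    logits = image_emb @ text_emb.t() / temperature
    # The matching text for image i sits on the diagonal, so the target class is i.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2


torch.manual_seed(0)
emb = torch.randn(8, 64)
aligned = clip_contrastive_loss(emb, emb)                   # perfectly matched pairs
shuffled = clip_contrastive_loss(emb, emb.roll(1, dims=0))  # every pair mismatched
print(float(aligned) < float(shuffled))  # True
```

The whole loss is two calls to `cross_entropy` on the same similarity matrix, once row-wise and once column-wise.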
3. CLIP's trump card: zero-shot classification
3.1 Why can it be “self-taught without a teacher”?
Traditional image classification models need their categories defined in advance, and each category requires a large amount of annotated data. CLIP skips this step entirely: it needs no downstream annotation, and the categories do not even have to be known at training time.
The principle is actually very straightforward:
- Each category you want to recognize can be described in natural language, such as "a photo of a cat" or "a photo of a dog".
- CLIP encodes these category texts into vectors, which serve as the reference vector (prototype) for each category.
- Encode the images to be classified into vectors.
- Compare the similarity between the image vector and the text vectors of all categories. The most similar one is the classification result.
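The steps above can be sketched with hand-made vectors standing in for real encoder outputs. The function name `zero_shot_classify`, the 3-dimensional toy embeddings, and the logit scale of 100 are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def zero_shot_classify(image_emb, class_text_emb, class_names, logit_scale=100.0):
    """Pick the class whose text embedding is most similar to the image embedding."""
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    logits = logit_scale * (image_emb @ class_text_emb.t())  # (num_classes,)
    probs = logits.softmax(dim=-1)
    return class_names[int(probs.argmax())], probs


# Toy 3-d vectors in place of real text and image embeddings.
class_names = ["cat", "dog", "car"]
class_text_emb = torch.tensor([[1.0, 0.0, 0.0],
                               [0.0, 1.0, 0.0],
                               [0.0, 0.0, 1.0]])
image_emb = torch.tensor([0.2, 0.9, 0.1])  # points mostly along the "dog" axis

label, probs = zero_shot_classify(image_emb, class_text_emb, class_names)
print(label)  # dog
```

Adding a new category only requires encoding one more text; nothing is retrained.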
To improve robustness, practical use typically relies not on a single description but on a set of prompt templates with similar semantics, for example:
- “a photo of a {}”
- “a blurry photo of a {}”
- “a close-up of a {}”
Finally, averaging the similarities across these templates gives a more stable prediction (the original CLIP paper averages the text embeddings instead, which has a similar effect).
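A minimal sketch of this template averaging. The text encoder is passed in as a callable; `toy_encode_text` is a placeholder that returns fixed random vectors and is NOT a real model, so the scores it produces carry no meaning beyond demonstrating the shapes and the averaging step.

```python
import torch
import torch.nn.functional as F

TEMPLATES = ["a photo of a {}", "a blurry photo of a {}", "a close-up of a {}"]


def ensemble_scores(image_emb, class_names, encode_text, templates=TEMPLATES):
    """Average image-text similarity over several prompts per class."""
    image_emb = F.normalize(image_emb, dim=-1)
    scores = []
    for name in class_names:
        prompts = [t.format(name) for t in templates]
        text_emb = F.normalize(encode_text(prompts), dim=-1)  # (num_templates, D)
        scores.append((text_emb @ image_emb).mean())          # mean over templates
    return torch.stack(scores)                                # (num_classes,)


def toy_encode_text(prompts, dim=16):
    """Placeholder encoder: one fixed pseudo-random vector per prompt."""
    vectors = []
    for prompt in prompts:
        gen = torch.Generator().manual_seed(sum(prompt.encode()))
        vectors.append(torch.randn(dim, generator=gen))
    return torch.stack(vectors)


scores = ensemble_scores(torch.randn(16), ["cat", "dog"], toy_encode_text)
print(scores.shape)  # torch.Size([2])
```

With a real CLIP text encoder in place of the placeholder, the class with the highest averaged score is the prediction.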
3.2 Hands-on: zero-shot classification with Hugging Face
The original CLIP code has specific requirements for the PyTorch version. A more convenient route today is to use the Hugging Face transformers library directly.
In the following example, you only need to install transformers and pillow (on top of PyTorch) to run it directly:
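A minimal sketch using the `CLIPModel` and `CLIPProcessor` classes from transformers with the public `openai/clip-vit-base-patch32` checkpoint. The helper names (`build_prompts`, `classify`, `main`), the label list, and the file name `example.jpg` are assumptions for illustration; PyTorch must be installed as well, and the weights are downloaded on first use.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"


def build_prompts(labels):
    """Wrap each raw label in a simple prompt template."""
    return [f"a photo of a {label}" for label in labels]


def classify(image, labels, model, processor):
    """Return a {label: probability} dict for one PIL image."""
    inputs = processor(text=build_prompts(labels), images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image: (1, num_labels) scaled image-text similarities.
    probs = outputs.logits_per_image.softmax(dim=-1)[0]
    return {label: float(p) for label, p in zip(labels, probs)}


def main(image_path="example.jpg"):  # hypothetical local image file
    model = CLIPModel.from_pretrained(MODEL_NAME)          # downloads weights on first run
    processor = CLIPProcessor.from_pretrained(MODEL_NAME)
    image = Image.open(image_path)
    print(classify(image, ["cat", "dog", "car"], model, processor))
```

Calling `main()` with the path to any local image prints one probability per label; changing the label list is all it takes to define a new classifier.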
Even if the model has never seen the type of image you provide, it can still give a fairly reliable classification result. This is the charm of zero-shot generalization.
4. Advantages, disadvantages and improvement directions of CLIP
4.1 Strengths and weaknesses in one table
4.2 What improvements have been derived from CLIP?
CLIP is like a Swiss Army knife, but different scenarios call for more specialized tools. The current mainstream improvement directions include:
- Data efficiency:
  - ALBEF: Introduces momentum distillation and an additional image-text matching task to reduce dependence on the amount of data.
  - BLIP‑2: Freezes an off-the-shelf large language model and an off-the-shelf visual model, and trains only a lightweight Q‑Former to bridge the two, greatly reducing the compute overhead.
- Fine-grained understanding:
  - FLAVA: Combines global contrastive alignment with finer-grained masked multimodal modeling over image patches and text tokens.
  - CLIP‑Dissect: Uses CLIP to automatically describe what individual neurons in a vision network respond to, helping explain what a model has "learned".
- Vertical domain adaptation:
  - MedCLIP: Pre-trained specifically on medical images and clinical text.
  - AgriCLIP: Image-text alignment for agricultural scenarios.
5. Summary
CLIP's core contribution is not a complex architecture but a minimalist paradigm: it shows that large-scale weakly supervised data plus contrastive learning is enough to break down the barrier between vision and language.
Because the idea is so clean, CLIP has grown from a single model into a piece of "infrastructure": it can serve as a starting point for almost any task involving image-text alignment. Although we rarely pre-train a CLIP from scratch, its open-source pre-trained weights and the out-of-the-box tooling provided by communities such as Hugging Face make it easy to integrate this capability into our own projects.
🔗 Core Reference Papers
- Learning Transferable Visual Models From Natural Language Supervision (CLIP original paper)

