Vision-Language multi-modality: detailed explanation of CLIP model and image-text alignment

Introduction

In artificial intelligence, visual perception and language understanding long developed as two largely independent tracks. Vision-language multimodal technology bridges the two: it lets a machine "understand" both images and text and, more importantly, align these two different types of information.

CLIP (Contrastive Language-Image Pre-training), proposed by OpenAI in 2021, is a milestone on this bridge. Using only a simple contrastive learning objective trained on a large number of image-text pairs, it maps images and text into the same semantic space, which gives it capabilities such as zero-shot classification and image-text retrieval out of the box. Later popular text-to-image models such as DALL·E and Midjourney also owe much to this kind of alignment idea.


1. Multimodal foundation and task positioning of CLIP

1.1 Core application scenarios of vision-language

Multimodal learning is not just a paper exercise; it already powers a large number of real products. The following table gives a quick overview:

Task type	Typical applications
Image-text retrieval	E-commerce "search by image", text search in phone photo albums, cross-modal knowledge base queries within an enterprise
Zero-/few-shot classification	Automatic classification of categories never seen in training; rapid deployment in vertical fields such as medicine and agriculture
Upstream module for content understanding	Serves as the basic alignment module for image-to-text and text-to-image models, providing semantic support for subsequent generation
Other downstream tasks	Pre-trained backbone for visual question answering (VQA), image captioning, and similar tasks

1.2 Understand CLIP in one sentence

CLIP is a general-purpose image-text semantic aligner. Its training goal is very simple: pull originally "matching" images and texts closer together in the feature space, and push "mismatched" image-text pairs further apart.

The idea looks simple, but because it is trained on large-scale weakly supervised data (image-text pairs that occur naturally on the Internet), the alignment it learns is surprisingly good.

2. Breaking down CLIP's core technology

2.1 Dual encoder architecture: one for images and one for text

CLIP's structure can be summed up in two words: two towers. Images and text each have their own encoder, with no cross-modal attention in between; similarity is computed only at the very end with a simple dot product.

Image encoder: either a ResNet or a Vision Transformer (ViT). The strongest classic variant uses ViT‑L/14, a Large ViT that splits the input image into patches of 14×14 pixels.
Text encoder: a Transformer. The original CLIP uses a GPT-2-style stack with causal (masked) self-attention rather than BERT-style bidirectional attention, and takes the output at the end-of-text token as the sentence feature.
Projection layer: maps image features and text features to the same dimension (such as 512 or 768), then applies L2 normalization so that all vectors lie on the unit sphere.

This design has two obvious benefits:

Speed: at inference time, image encoding and text encoding run completely independently, and text features can even be computed in advance;
Deployment-friendly: the image service can run on GPU and the text service on CPU, for example, without interfering with each other.
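
As a minimal sketch of why the two towers decouple so cleanly, the snippet below uses two random linear layers as hypothetical stand-ins for the real encoders (the names `image_tower` and `text_tower` and the raw feature sizes are assumptions for illustration only). Text features are projected and normalized once, cached, and later compared with a fresh image feature using a single matrix multiply. Because both sides are L2-normalized, the dot product equals the cosine similarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for the two towers: each maps its own raw
# feature size to a shared 512-dimensional embedding space.
image_tower = nn.Linear(768, 512, bias=False)
text_tower = nn.Linear(640, 512, bias=False)

with torch.no_grad():
    # Offline: encode and L2-normalize 5 candidate texts once, then cache them.
    text_cache = F.normalize(text_tower(torch.randn(5, 640)), dim=-1)

    # Online: only the image tower runs for a new query image.
    raw_image_emb = image_tower(torch.randn(1, 768))
    image_emb = F.normalize(raw_image_emb, dim=-1)

    # On the unit sphere, a plain dot product is the cosine similarity.
    dot_scores = image_emb @ text_cache.t()                      # (1, 5)
    cos_scores = F.cosine_similarity(raw_image_emb, text_cache)  # (5,)

print(dot_scores.shape, torch.allclose(dot_scores[0], cos_scores, atol=1e-5))
```

In a real deployment the cached matrix would hold the embeddings of the label prompts or the search corpus, so the text tower never has to run at query time.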

The following is a streamlined PyTorch implementation. To keep the focus on the core logic, some details such as parameter initialization are simplified.

Image Encoder (ViT simplified version)

import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    def __init__(self, embed_dim=512, img_size=224, patch_size=16,
                 vision_width=768, vision_layers=12):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2

        # Split the image into patches and project each patch to vision_width
        # with a strided convolution
        self.patch_embed = nn.Conv2d(3, vision_width, patch_size, patch_size, bias=False)
        # Learnable class token (similar to BERT's [CLS])
        self.cls_token = nn.Parameter(torch.randn(1, 1, vision_width))
        # Learnable positional embedding: one per patch plus one for the class token
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, vision_width))
        self.ln_pre = nn.LayerNorm(vision_width)

        # Transformer encoder
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(vision_width, nhead=12, batch_first=True),
            num_layers=vision_layers
        )
        self.ln_post = nn.LayerNorm(vision_width)
        # Projection into the shared embedding dimension
        self.proj = nn.Parameter(torch.randn(vision_width, embed_dim))

    def forward(self, x):
        # x: (B, 3, H, W)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, num_patches, vision_width)
        # Prepend the class token
        x = torch.cat([self.cls_token.expand(x.shape[0], -1, -1), x], dim=1)
        x = x + self.pos_embed
        x = self.ln_pre(x)
        x = self.transformer(x)
        # Take the class-token output and map it through the projection matrix
        x = self.ln_post(x[:, 0, :]) @ self.proj
        return x
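
To make the patch arithmetic concrete, here is a small sketch of how the sequence length seen by the Transformer follows from the image size and the patch size. The helper name `vit_sequence_length` is hypothetical and only mirrors the `num_patches + 1` computation in the encoder above.

```python
def vit_sequence_length(img_size: int, patch_size: int) -> int:
    """Number of tokens a ViT sees: one per patch plus one class token."""
    patches_per_side = img_size // patch_size
    return patches_per_side ** 2 + 1

# Default of the simplified encoder above: 224 / 16 = 14 patches per side.
tokens_p16 = vit_sequence_length(224, 16)
# A /14 model at the same resolution: 224 / 14 = 16 patches per side.
tokens_p14 = vit_sequence_length(224, 14)

print(tokens_p16, tokens_p14)  # 197 257
```

Smaller patches mean a longer sequence, which is one reason a /14 model costs more compute than a /16 or /32 model at the same resolution.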

Text encoder (simplified version)

class TextEncoder(nn.Module):
    def __init__(self, embed_dim=512, context_len=77, vocab_size=49408,
                 text_width=512, text_layers=12):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, text_width)
        self.pos_embed = nn.Parameter(torch.randn(1, context_len, text_width))
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(text_width, nhead=8, batch_first=True),
            num_layers=text_layers
        )
        self.ln_final = nn.LayerNorm(text_width)
        self.proj = nn.Parameter(torch.randn(text_width, embed_dim))

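    def forward(self, text):
        # text: (B, 77) token ids. In CLIP's vocabulary the end-of-text (EOT) token
        # has the largest id, so argmax over the sequence locates its position.
        x = self.token_embed(text) + self.pos_embed
        # Simplification: the original CLIP applies a causal attention mask here
        x = self.transformer(x)
        x = self.ln_final(x)
        # Project the output at the EOT position
        x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.proj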
        return x
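
The `argmax` pooling trick is easy to misread, so here is a tiny sketch of it in isolation. The token ids are assumptions chosen to follow the layout used above (vocabulary size 49408, with the end-of-text token as the largest id and 0 as padding); the other ids are arbitrary placeholders, and the "hidden states" are fake values that simply encode their own position.

```python
import torch

# Assumed ids: 49406 = start-of-text, 49407 = end-of-text (EOT), 0 = padding.
# The remaining ids are arbitrary placeholders.
text = torch.tensor([
    [49406, 320, 1125, 49407, 0, 0],
    [49406, 2368, 49407, 0, 0, 0],
])

# Fake per-token outputs of shape (B, L, width): token t holds the value t,
# so the pooled vector reveals which position was selected.
hidden = torch.arange(6, dtype=torch.float32).view(1, 6, 1).expand(2, 6, 4)

eot_pos = text.argmax(dim=-1)                          # position of the largest id
pooled = hidden[torch.arange(text.shape[0]), eot_pos]  # (B, width)

print(eot_pos.tolist(), pooled[:, 0].tolist())  # [3, 2] [3.0, 2.0]
```

Each sequence is pooled at its own EOT position, so sentences of different lengths inside the same padded batch are handled without an explicit length tensor.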

2.2 Contrastive learning and InfoNCE loss

The heart of CLIP's training is a contrastive loss called InfoNCE. Understanding it is the key to understanding CLIP.

Suppose there are N pairs of images and text in a batch, then:

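Positive samples: only N pairs, where the i-th image matches the i-th text.
Negative samples: every mismatched combination of image i and text j (j≠i). Seen from either the image side or the text side these are the same N×(N‑1) pairs, so each direction of the loss contrasts one positive against N‑1 negatives.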

CLIP does two things at the same time:

Let the image find the correct text: compute the similarity between each image and all texts, and require the i-th image to be most similar to the i-th text.
Let the text find the correct image: in the same way, each text should pick out its own "original" image.

Together these two parts form a bidirectional (symmetric) contrastive loss.
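
Written out as a formula, the two directions look as follows. The symbols are introduced here for illustration: s_{ij} is the cosine similarity between image i and text j, and τ is a temperature that scales the similarities before the softmax.

```latex
\mathcal{L}_{\text{img}\to\text{text}}
  = -\frac{1}{N}\sum_{i=1}^{N}
    \log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ij}/\tau)},
\qquad
\mathcal{L}_{\text{text}\to\text{img}}
  = -\frac{1}{N}\sum_{i=1}^{N}
    \log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ji}/\tau)}

\mathcal{L} = \tfrac{1}{2}\left(
  \mathcal{L}_{\text{img}\to\text{text}} + \mathcal{L}_{\text{text}\to\text{img}}
\right)
```

Each term is an ordinary N-way cross-entropy in which the correct class for row i is column i, which is why the implementation reduces to two cross-entropy calls on a similarity matrix and its transpose.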

Loss function implementation

The code is simpler than you think:

class CLIPLoss(nn.Module):
    def __init__(self, temperature=0.07):
        super().__init__()
        # Temperature controls the "sharpness" of the similarity distribution.
        # It is stored as a learnable log value so it stays positive while training.
        self.log_temperature = nn.Parameter(torch.tensor(temperature).log())

    def forward(self, img_feat, text_feat):
        # 1. L2-normalize so every vector has length 1
        img_feat = F.normalize(img_feat, dim=-1)
        text_feat = F.normalize(text_feat, dim=-1)

        # 2. Similarity matrix, divided by the temperature (i.e. scaled by 1/temperature)
        logits_per_img = img_feat @ text_feat.t() / self.log_temperature.exp()
        logits_per_text = logits_per_img.t()

        # 3. Labels: the correct answer for image i is text i
        batch_size = img_feat.shape[0]
        labels = torch.arange(batch_size, device=img_feat.device)

        # 4. Cross-entropy in both directions, averaged
        loss_img = F.cross_entropy(logits_per_img, labels)
        loss_text = F.cross_entropy(logits_per_text, labels)

        return (loss_img + loss_text) / 2
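
A quick sanity check helps confirm that a loss like this behaves as intended. The sketch below is a self-contained functional variant with a fixed temperature (the function name `clip_loss` and the synthetic features are assumptions for illustration): text features that are noisy copies of the image features should give a much lower loss than unrelated random features.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_feat, text_feat, temperature=0.07):
    """Symmetric contrastive loss with a fixed temperature (illustrative only)."""
    img_feat = F.normalize(img_feat, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    logits = img_feat @ text_feat.t() / temperature
    labels = torch.arange(img_feat.shape[0])
    loss_img = F.cross_entropy(logits, labels)
    loss_text = F.cross_entropy(logits.t(), labels)
    return (loss_img + loss_text) / 2

torch.manual_seed(0)
N, D = 8, 64
img = torch.randn(N, D)

# "Aligned" texts are noisy copies of the image features; "random" texts are unrelated.
aligned_text = img + 0.1 * torch.randn(N, D)
random_text = torch.randn(N, D)

loss_aligned = clip_loss(img, aligned_text).item()
loss_random = clip_loss(img, random_text).item()
print(f"aligned: {loss_aligned:.4f}  random: {loss_random:.4f}")
```

If the aligned loss were not clearly lower, that would usually point to a scaling mistake, such as multiplying by the temperature instead of dividing by it.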

- The lower the temperature, the "sharper" the similarity distribution: the loss concentrates on the hardest negatives (the mismatched pairs most similar to the positive), which makes training less stable and more prone to overfitting.
- The higher the temperature, the smoother the distribution, and the model discriminates less between negatives.

CLIP's pre-training initializes the temperature to 0.07, an empirical balance point, and learns it from there. If training is unstable at the start, you can first set it to 0.1 and then gradually reduce it.
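
The effect of the temperature is easy to see numerically. The sketch below applies a softmax to one row of made-up cosine similarities (the values are assumptions chosen for illustration) at several temperatures and records the probability assigned to the matching pair.

```python
import torch

# Hypothetical cosine similarities between one image and four texts;
# the first entry is the matching pair.
sims = torch.tensor([0.30, 0.25, 0.20, 0.10])

top1_prob = {}
for temperature in (1.0, 0.1, 0.07, 0.01):
    probs = torch.softmax(sims / temperature, dim=-1)
    top1_prob[temperature] = probs[0].item()
    print(f"temperature={temperature}: top-1 prob={top1_prob[temperature]:.3f}")
```

At a temperature of 1.0 the four probabilities are nearly uniform, while at very small temperatures almost all of the mass moves to the largest similarity, which is the "sharpening" described above.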

3. CLIP’s trump card: zero-shot classification

3.1 Why can it be “self-taught without a teacher”?

Traditional image classification models need the categories to be defined in advance, and each category requires a large amount of annotated data. CLIP skips this step entirely: it needs no downstream annotation, and the categories do not even have to be known when the model is trained.

The principle is actually very straightforward:

Each category you want to classify is described in natural language, such as "cat" or "a photo of a dog".
CLIP encodes these category texts into vectors, which serve as "templates" for each category.
The image to be classified is encoded into a vector.
The image vector is compared with the text vectors of all categories; the most similar one is the classification result.

To improve robustness, practical use rarely relies on a single description. Instead, a set of prompt templates with similar semantics is constructed, for example:

“a photo of a {}”
“a blurry photo of a {}”
“a close-up of a {}”

Finally, averaging the similarities over these templates gives a more stable prediction.
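
The template averaging can be sketched without loading a real model. In the snippet below all embeddings are synthetic stand-ins (each class gets a random "prototype" direction, and every template embedding is that prototype plus a little noise), so the variable names and sizes are assumptions for illustration. It shows the averaging of similarities described above, alongside a commonly used alternative that averages the template embeddings first.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, num_templates, dim = 4, 3, 32

# Synthetic stand-ins: one prototype direction per class; each template
# embedding is that prototype plus a little noise.
prototypes = F.normalize(torch.randn(num_classes, dim), dim=-1)
noise = 0.05 * torch.randn(num_classes, num_templates, dim)
text_emb = F.normalize(prototypes[:, None, :] + noise, dim=-1)            # (C, T, D)
image_emb = F.normalize(prototypes[2] + 0.05 * torch.randn(dim), dim=-1)  # image of class 2

# Option A: average the per-template similarities for each class.
sims = text_emb @ image_emb                  # (C, T)
pred_avg_sim = sims.mean(dim=-1).argmax().item()

# Option B (alternative): average the template embeddings into one vector
# per class, re-normalize, then compare once.
class_emb = F.normalize(text_emb.mean(dim=1), dim=-1)  # (C, D)
pred_avg_emb = (class_emb @ image_emb).argmax().item()

print(pred_avg_sim, pred_avg_emb)
```

Option B has the practical advantage that only one vector per class needs to be cached, regardless of how many templates are used.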

3.2 Hands-on: zero-shot classification with Hugging Face

The original CLIP repository has specific requirements for the PyTorch version, so the more convenient route today is the Hugging Face transformers library. To run the following example you only need to install transformers and pillow (in addition to PyTorch):

from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

# 1. Load the model and processor (ViT‑B/32 is faster; ViT‑L/14 is more accurate)
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "openai/clip-vit-base-patch32"
processor = CLIPProcessor.from_pretrained(model_name)
model = CLIPModel.from_pretrained(model_name).to(device)

# 2. Prepare the image and the class descriptions (combined with prompt templates)
image = Image.open("cat_dog.jpg").convert("RGB")
class_names = ["cat", "dog", "bird", "car"]
templates = [
    "a photo of a {}",
    "a photo of the {}",
    "a blurry photo of a {}",
    "a close-up of a {}"
]
# Expand every class with every template; the resulting order is class-major
texts = [template.format(cls) for cls in class_names for template in templates]

# 3. Preprocess and run inference
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    outputs = model(**inputs)

# 4. Average the similarity per class, then convert to probabilities
logits_per_img = outputs.logits_per_image  # shape: [1, num_texts]
# Regroup by class (rows) and average over templates (columns)
class_logits = logits_per_img.view(len(class_names), len(templates)).mean(dim=-1)
probs = class_logits.softmax(dim=-1).cpu().numpy()  # shape: [num_classes]

for cls, prob in zip(class_names, probs):
    print(f"{cls}: {prob:.2%}")

Even if the model has never been trained on labeled examples of your categories, it can often still give a fairly reliable classification result. This is the appeal of zero-shot generalization.
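
The same embeddings also support image-text retrieval: encode a gallery of images once, then rank it against a text query. The sketch below uses random tensors as stand-ins for real embeddings (the gallery size, the index 42, and the noise level are assumptions for illustration), so it demonstrates only the ranking step, not the encoders.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for a gallery of 100 image embeddings that were encoded offline
# (random here; in practice they would come from the image encoder).
gallery = F.normalize(torch.randn(100, 512), dim=-1)

# Stand-in for a text query embedding, built close to gallery item 42 so
# that the expected best match is known.
query = F.normalize(gallery[42] + 0.01 * torch.randn(512), dim=-1)

# Cosine similarity against the whole gallery, then keep the 5 best matches.
scores = gallery @ query                # (100,)
top_scores, top_idx = scores.topk(k=5)

print(top_idx.tolist())
```

For large galleries the brute-force matrix product is typically replaced by an approximate nearest-neighbor index, but the scoring rule stays the same.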

4. Advantages, disadvantages and improvement directions of CLIP

4.1 Strengths and weaknesses in one table

Advantages	Limitations
✅ Out-of-the-box zero-shot/few-shot capability; generalizes to new categories far better than traditional supervised models	❌ Pre-training relies on 400 million image-text pairs, which is almost impossible for individuals or small teams to reproduce
✅ Two-tower structure and fast inference; image and text encoding can be deployed separately	❌ Weak at abstract concepts, fine-grained classification, and spatial relationships
✅ Provides a unified alignment framework for generative AI, with strong scalability	❌ Sensitive to adversarial examples; small image perturbations may change the classification result
✅ Supports any free text as a category, not limited to fixed labels	❌ Biases in the training data (gender, race, etc.) are directly reflected in the model output

4.2 What improvements have been derived from CLIP?

CLIP is like a Swiss Army Knife, but different scenarios require more sophisticated tools. The current mainstream improvement directions include:

Data efficiency:
ALBEF: Introduces momentum distillation and an additional image-text matching task to reduce dependence on the amount of data.
BLIP‑2: Freezes an off-the-shelf large language model and an off-the-shelf vision model, and trains only a lightweight Q‑Former to bridge the two, greatly reducing the computing overhead.
Fine-grained understanding:
FLAVA: Complements global image-text contrastive alignment with masked image, masked language, and masked multimodal modeling objectives that operate at the patch and token level.
CLIP‑Dissect: Uses CLIP to automatically describe what individual neurons inside a vision network respond to, which helps explain what a model has "learned".
Vertical field adaptation:
MedCLIP: Pre-trained specifically on medical images and clinical text.
AgriCLIP: Image-text alignment for agricultural scenarios.

5. Summary

The core contribution of CLIP is not a complex architecture but a minimalist paradigm: it shows that large-scale weakly supervised data plus contrastive learning is enough to break down the barrier between vision and language.

Because the idea is clean enough, CLIP has transformed from a model into an "infrastructure": it can be used as a starting point for almost all tasks involving image and text alignment. Although we rarely pre-train a CLIP from scratch, its open source pre-trained weights and out-of-the-box solutions provided by communities such as Hugging Face allow us to easily integrate this capability into our own projects.

A suggested learning path:

1. First understand how contrastive learning and the InfoNCE loss work;
2. Use Hugging Face to run zero-shot classification and image-text retrieval end to end, and see the results for yourself;
3. Go back and study the internal details of the ViT and the text encoder to understand the computation in each layer.

🔗 Core Reference Papers

Learning Transferable Visual Models From Natural Language Supervision (CLIP original paper)
