title: Detailed explanation of Diffusion Model: Diffusion model principle and PyTorch implementation | Daoman PythonAI
description: In-depth analysis of the diffusion model, introducing its application in tasks such as image generation and video generation, including detailed architecture analysis, PyTorch implementation and practical application scenarios, covering core technologies such as DDPM and Stable Diffusion.
keywords: [Diffusion Model, image generation, DDPM, Stable Diffusion, text-to-image, AI painting, deep learning, computer vision, PyTorch]
Detailed explanation of Diffusion Model: Diffusion model principle and PyTorch implementation
Hello everyone, this is Daoman PythonAI! Today we are breaking down the core technology behind Midjourney and Sora: the diffusion model.
If a GAN "forges" data through an adversarial game, and CycleGAN "translates" styles through a cycle, then the diffusion model has been the dominant approach to image generation since 2021. Its inspiration comes not from art but from non-equilibrium thermodynamics: picture a drop of ink diffusing through water until the liquid is uniformly dark, then learn to run that process in reverse, restoring clear data from noise.
1. Two cores of diffusion model
The diffusion model is a generative model with two stages: a fixed forward destruction process and a learnable reverse reconstruction process.
1.1 Forward destruction: adding noise until the image is ruined
A small amount of Gaussian noise is added to the original image x₀ again and again. After T steps (usually 1000-4000), x_T is indistinguishable from standard normal noise. This stage requires no training; it is a predefined Markov chain.
1.2 Reverse reconstruction: denoising to create new images
The model learns to predict the noise ε that was added at each step, and a closed-form update then steps back from x_t to x_{t-1}. After T such small denoising steps, a meaningful new image emerges from the pure noise x_T.
2. A minimal mathematical derivation
2.1 Forward formula: adding noise in one step
The original Markov chain adds noise step by step, but with the reparameterization trick we can jump directly from the original image x₀ to the noisy image x_t at any step t, which greatly simplifies the computation:
Noise addition formula (text description)
x_t = sqrt(alpha_bar_t) * x₀ + sqrt(1 - alpha_bar_t) * ε
Here ε is noise sampled from the standard normal distribution N(0, I).
The meaning of the key symbols in the formula:
- α_t = 1 - β_t, where β_t is the noise schedule that controls how much noise is added at each step. The larger β_t is, the faster the image is destroyed; in practice it usually does not exceed 0.02.
- alpha_bar_t is the cumulative product of all α values up to step t: alpha_bar_t = α_1 * α_2 * ... * α_t.
With this one-step formula, a single line of code is enough during training to produce an image at any noise level, with no need to simulate the chain step by step.
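Why the step-by-step chain collapses into one step can be seen by unrolling two steps. The sketch below assumes the standard DDPM single-step form, in which each step scales the previous image by sqrt(α_t) and adds independent Gaussian noise with variance 1 - α_t; the symbol \bar{\varepsilon} for the merged noise is introduced here only for illustration.

```latex
\begin{aligned}
x_t &= \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\varepsilon_t \\
    &= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_{t-1}}\,\varepsilon_{t-1}\right) + \sqrt{1-\alpha_t}\,\varepsilon_t \\
    &= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\,\bar{\varepsilon},
    \qquad \bar{\varepsilon}\sim\mathcal{N}(0, I)
\end{aligned}
```

The last line uses the fact that two independent zero-mean Gaussians add up to a single Gaussian whose variance is the sum α_t(1 - α_{t-1}) + (1 - α_t) = 1 - α_t α_{t-1}. Repeating the substitution down to x₀ turns the product of α values into alpha_bar_t, which gives the one-step formula.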
2.2 Reverse training: just predict the noise
The training objective of the diffusion model is extremely simple: take the mean squared error (MSE) between the noise ε_θ predicted by the model and the noise ε that was actually added. The loss function has the form:
Loss function
L_t = E[ || ε - ε_θ(x_t, t) ||² ]
That is, we train a neural network ε_θ that takes the current noisy image x_t and the time step t as input and outputs its guess of the noise; the loss is the squared difference between that guess and the true noise, averaged over images, time steps, and noise samples. This simple and stable loss function is an important cornerstone of the success of diffusion models.
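As a rough illustration of this loss, here is a minimal PyTorch sketch. The function name `ddpm_loss` and the convention that `model` is any callable taking `(x_t, t)` are assumptions made for this example, and time steps are 0-indexed so that index 0 corresponds to step 1.

```python
import torch
import torch.nn.functional as F


def ddpm_loss(model, x0, alpha_bar):
    """One-batch Monte Carlo estimate of E || eps - eps_theta(x_t, t) ||^2.

    model:     any callable (x_t, t) -> tensor shaped like x_t (assumption)
    x0:        clean images, shape (B, C, H, W), scaled to [-1, 1]
    alpha_bar: 1-D tensor of cumulative products, length T
    """
    batch = x0.shape[0]
    num_steps = alpha_bar.shape[0]
    # One random time step per image; randint returns torch.long by default,
    # which is what tensor indexing needs.
    t = torch.randint(0, num_steps, (batch,), device=x0.device)
    eps = torch.randn_like(x0)                      # the true noise
    a_bar = alpha_bar.to(x0.device)[t].view(batch, 1, 1, 1)
    # One-step forward formula from section 2.1
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return F.mse_loss(model(x_t, t), eps)
```

A quick sanity check is to pass a "model" that always predicts zero: the loss should then be close to 1, the variance of standard normal noise.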
3. DDPM core implementation
DDPM (Denoising Diffusion Probabilistic Models) is the pioneering classic of diffusion models. We use PyTorch to implement the core module.
3.1 Core Network: Simplified U-Net with Time Embedding
The diffusion model needs to know which noising step it is currently at, so the time step t is encoded into a vector with sinusoidal position encoding and fused with the image features.
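The original listing for this section is not available here, so the following is only a minimal sketch of the idea: a sinusoidal time embedding, a conv block that adds the projected embedding to its feature map, and a one-level U-Net with a single skip connection. The names `sinusoidal_embedding`, `Block`, `TinyUNet` and all layer sizes are assumptions chosen for illustration; the input height and width are assumed to be even and the embedding dimension is assumed to be even.

```python
import math

import torch
import torch.nn as nn


def sinusoidal_embedding(t, dim):
    """Map integer time steps (B,) to vectors (B, dim) of sin/cos features."""
    half = dim // 2
    # Geometrically spaced frequencies, as in Transformer position encoding.
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([args.sin(), args.cos()], dim=-1)


class Block(nn.Module):
    """Two 3x3 convs; the time embedding is added between them."""

    def __init__(self, in_ch, out_ch, t_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.t_proj = nn.Linear(t_dim, out_ch)
        self.act = nn.SiLU()

    def forward(self, x, t_emb):
        h = self.act(self.conv1(x))
        h = h + self.t_proj(t_emb)[:, :, None, None]  # fuse time into features
        return self.act(self.conv2(h))


class TinyUNet(nn.Module):
    """One down level, one up level, one skip connection (illustrative sizes)."""

    def __init__(self, channels=1, base=32, t_dim=64):
        super().__init__()
        self.t_dim = t_dim
        self.t_mlp = nn.Sequential(nn.Linear(t_dim, t_dim), nn.SiLU(), nn.Linear(t_dim, t_dim))
        self.down = Block(channels, base, t_dim)
        self.pool = nn.MaxPool2d(2)
        self.mid = Block(base, base * 2, t_dim)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec = Block(base * 2 + base, base, t_dim)
        self.out = nn.Conv2d(base, channels, 1)

    def forward(self, x, t):
        t_emb = self.t_mlp(sinusoidal_embedding(t, self.t_dim))
        d = self.down(x, t_emb)                      # full resolution
        m = self.mid(self.pool(d), t_emb)            # half resolution
        u = torch.cat([self.up(m), d], dim=1)        # skip connection
        return self.out(self.dec(u, t_emb))          # predicted noise, same shape as x
```

Adding the embedding inside every block, rather than only at the input, is a common design choice because each resolution level then sees the noise level directly.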
3.2 Noise scheduling and auxiliary functions
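A possible version of the schedule and helpers, as a sketch rather than the article's original code: a linear β schedule from 1e-4 to 0.02 over 1000 steps (the defaults used in the DDPM paper), the cumulative product alpha_bar, and a `q_sample` helper that applies the one-step forward formula. The function names `make_schedule`, `extract`, and `q_sample` are assumptions for this example.

```python
import torch


def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule plus the derived alpha and alpha_bar tensors."""
    betas = torch.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)   # alpha_1 * alpha_2 * ... * alpha_t
    return {"betas": betas, "alphas": alphas, "alpha_bar": alpha_bar}


def extract(values, t, x_shape):
    """Pick values[t] per sample and reshape to (B, 1, 1, ...) for broadcasting."""
    out = values.to(t.device)[t]
    return out.view(t.shape[0], *([1] * (len(x_shape) - 1)))


def q_sample(x0, t, alpha_bar, noise=None):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = extract(alpha_bar, t, x0.shape)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```

With these defaults alpha_bar at the last step is very close to zero, so x_T is almost pure noise, which matches the description of the forward process in section 1.1.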
3.3 Training and sampling loop
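The loop below is a self-contained sketch of one training step and of ancestral sampling, not the article's original listing. To stay short it uses a placeholder network, `ToyDenoiser`, which simply feeds t / T to the convolutions as an extra channel; in practice a U-Net with time embedding would take its place. The sampling variance is set to σ_t² = β_t, one of the two choices discussed in the DDPM paper. All names here are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)


class ToyDenoiser(nn.Module):
    """Placeholder for the U-Net: receives t / T as an extra input channel."""

    def __init__(self, channels=1, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, channels, 3, padding=1),
        )

    def forward(self, x, t):
        t_map = (t.float() / T)[:, None, None, None].expand(-1, 1, *x.shape[2:])
        return self.net(torch.cat([x, t_map], dim=1))


def train_step(model, optimizer, x0):
    """Sample t and eps, build x_t in one step, regress the noise with MSE."""
    t = torch.randint(0, T, (x0.shape[0],))        # torch.long by default
    eps = torch.randn_like(x0)
    a_bar = alpha_bar[t][:, None, None, None]
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    loss = F.mse_loss(model(x_t, t), eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def sample(model, shape):
    """Start from pure noise x_T and denoise step by step down to x_0."""
    x = torch.randn(shape)
    for i in reversed(range(T)):
        t = torch.full((shape[0],), i, dtype=torch.long)
        eps_hat = model(x, t)
        # Mean of x_{t-1} given x_t and the predicted noise
        mean = (x - betas[i] / (1.0 - alpha_bar[i]).sqrt() * eps_hat) / alphas[i].sqrt()
        if i > 0:
            x = mean + betas[i].sqrt() * torch.randn_like(x)   # sigma_t^2 = beta_t
        else:
            x = mean                                           # no noise on the last step
    return x
```

A typical use would be to create `torch.optim.Adam(model.parameters(), lr=1e-3)`, call `train_step` on batches normalized to [-1, 1], and then call `sample(model, (n, channels, height, width))` to generate new images.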
4. Mainstream improvements and applications
4.1 Latent Diffusion (Stable Diffusion Core)
Training a diffusion model in pixel space takes too much GPU memory (a 1024×1024 RGB image has more than 3 million values), so Stable Diffusion compresses the image into a latent space at 1/8 resolution (4 channels) and trains there:
- Compress/decompress images with a pretrained VAE
- Use the CLIP text encoder to convert prompts into conditioning vectors
- Conditional U-Net denoising in latent space
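To make the memory saving concrete, the arithmetic can be sketched in a few lines of plain Python. The helper names `latent_shape` and `compression_ratio` are made up for this example, and the defaults (factor 8, 4 latent channels) are taken from the figures quoted above.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (C, H, W) of the latent for an RGB image of the given size."""
    if height % downsample or width % downsample:
        raise ValueError("height and width must be divisible by the downsample factor")
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(height, width, downsample=8, latent_channels=4):
    """How many times fewer values the latent holds than the RGB image."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (3 * height * width) / (c * h * w)


print(latent_shape(1024, 1024))       # (4, 128, 128)
print(compression_ratio(1024, 1024))  # 48.0
```

So the denoising U-Net works on 48 times fewer values per image than it would in pixel space.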
4.2 Core application scenarios
- Text-to-image / image-to-image: Midjourney, Stable Diffusion WebUI
- Video generation: Sora, Pika Labs (diffusion in a spatiotemporal latent space)
- Image restoration and editing: inpainting, outpainting, super resolution
- Scientific research: molecular generation, material design
5. Daoman’s practical suggestions
5.1 Get started quickly
- First, reproduce a simplified DDPM on MNIST/CIFAR-10
- Next, try text-to-image generation with Stable Diffusion WebUI
- Finally, read the source code to understand conditional control and latent space
5.2 Pitfall avoidance guide
- Images **must be normalized to [-1, 1]**
- Time steps must be of type torch.long
- Try mixed precision (FP16) to speed up sampling
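The first two points can be sketched as follows. The helper names `to_model_range` and `to_image_range` are hypothetical, and the sketch assumes images arrive as uint8 tensors in [0, 255].

```python
import torch


def to_model_range(img_uint8):
    """uint8 image tensor in [0, 255] -> float tensor in [-1, 1]."""
    return img_uint8.float() / 127.5 - 1.0


def to_image_range(x):
    """Model output in [-1, 1] -> uint8 image tensor in [0, 255]."""
    return ((x.clamp(-1.0, 1.0) + 1.0) * 127.5).round().to(torch.uint8)


# torch.randint returns torch.long by default, so time steps sampled this way
# can index schedule tensors directly.
t = torch.randint(0, 1000, (4,))
```

Matching the data range to the noise matters because the forward process ends in zero-mean, unit-variance noise; data centered around zero keeps both ends of the chain on a comparable scale.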
Summary
With three major advantages (stable training, high-quality generation, and strong controllability), the diffusion model has become the core technology of generative AI. From DDPM to Stable Diffusion to Sora, it has evolved extremely fast, and its range of applications keeps growing.
If you are interested in learning more, we recommend the extended reading at the end of the article!
🔗 Extended reading

