Computer Vision (CV) Interview and Practical Red Book

Recently I have received many private messages from people who are stuck in practice or in interviews: either they can tune YOLOv8 hyperparameters but cannot get a model into production, or the interviewer asks "Why can ResNet be made so deep?" or "How did you choose alpha and gamma for Focal Loss?" and they have no answer.

This little red book is meant to fill that gap. It condenses 3 years of algorithm-job interview experience and 2 years of deployment pitfalls, from fundamentals to engineering. The text covers only what is actually used in interviews or in practice, and all code is meant to be reproducible and modifiable.

1. Fundamentals: the stepping stone for interviews and tuning

1. Color space and channel (🎯 common interview application scenarios)

In interviews, questions related to color space usually revolve around "when should I use which space?" The following code snippet demonstrates the most classic HSV color segmentation scenario (such as green screen matting, traffic sign extraction):

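import cv2
import numpy as np

def hsv_color_segment(path="traffic_red.jpg"):
    """Color segmentation for red traffic signs (the same pattern works for green screens)."""
    img = cv2.imread(path)
    if img is None:
        raise FileNotFoundError(path)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    # Red wraps around 0° on the hue circle, so it needs two ranges
    # (OpenCV stores 8-bit hue as 0-179, i.e. degrees / 2)
    lower1, upper1 = np.array([0, 50, 50]), np.array([10, 255, 255])
    lower2, upper2 = np.array([170, 50, 50]), np.array([179, 255, 255])
    # Combine the two masks with a bitwise OR; uint8 addition would wrap if they ever overlapped
    mask = cv2.bitwise_or(cv2.inRange(hsv, lower1, upper1), cv2.inRange(hsv, lower2, upper2))
    return cv2.bitwise_and(img, img, mask=mask)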

A comparison of several commonly used color spaces is summarized as follows:

Color space	Application scenarios	Advantages and disadvantages
RGB	Display, standard input for deep learning training	Channels are highly coupled and sensitive to illumination changes
HSV	Color segmentation (such as traffic signs, green screens)	More robust to illumination; the hue channel is largely independent of brightness
GRAY	Low-level feature extraction and dimensionality reduction	Loses color information but needs less computation

Interview technique: Why not segment in RGB? Because all three RGB channels change together when the lighting changes, whereas in HSV the hue H is largely independent of brightness (V) and saturation (S), so a threshold on H is easy to set.

2. Filtering and noise reduction (⚠️ avoiding pitfalls in practice: choose the right filter)

Filtering is often the first step of image preprocessing, and different kinds of noise call for different filters. Practical tips and the pitfalls they avoid:

Salt-and-pepper noise (random black and white pixels): use the median filter cv2.medianBlur, which removes isolated noise points while preserving edges.
Gaussian noise (fine grain spread over the whole image): use the Gaussian filter cv2.GaussianBlur.
Skin smoothing / edge-preserving denoising: the bilateral filter cv2.bilateralFilter is recommended; it smooths flat regions while keeping edges sharp. The drawback is that it is relatively slow.

If the interviewer asks "Why not use mean filtering?", you can answer: mean filtering averages noise into its neighbors and blurs edges at the same time, while median filtering discards outliers, so it removes salt-and-pepper noise better and preserves edges better.

3. Classic features and edges (🎯 must-ask for traditional CV roles and entry-level interviews)

Canny edge detection (4 steps to memorize)

Canny is a classic algorithm that is almost unavoidable in computer vision. The full pipeline has only 4 steps, and it is a frequent interview question:

Gaussian denoising – First smooth the image with a Gaussian filter to reduce noise interference.
Gradient magnitude and direction – Use the Sobel operator to obtain the gradient strength and direction at each pixel.
Non-maximum suppression – Keep only the local maximum along the gradient direction so that edges become thin.
Double-threshold hysteresis linking – Set a high and a low threshold: strong edges are kept directly, weak edges are kept only if they connect to a strong edge, and everything else is discarded.

Comparison of classic features (memorize these as ready-made interview answers)

Features	Invariance	Speed	Patent	Applications
SIFT	Rotation / scale / brightness	Slower	Yes (now expired)	High-precision stitching, image matching
ORB	Rotation / partial scale	Extremely fast	None	Real-time SLAM, mobile matching

Interview tip: When asked "Why not use SIFT?", you can say that SIFT is accurate but too slow for real-time scenarios and was encumbered by a patent in its early years; ORB, as a free alternative, is more commonly used in applications such as SLAM.

2. Deep learning core: the most heavily tested interview area (60%+ of questions)

1. CNN basics (3 core points)

Local perception + parameter sharing

Convolutional neural networks cut the number of parameters dramatically through local connections and shared weights. As the network goes from shallow to deep, the features go from simple to complex: edges → textures → object parts → whole scene.
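A quick parameter count shows how much weight sharing saves. This is a sketch under assumed sizes (a 3×32×32 input mapped to 64 feature maps); the fully connected figure is computed arithmetically rather than by building the layer.

```python
import torch.nn as nn

def count_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

# 3x3 convolution, 3 -> 64 channels: one small kernel bank shared across all positions
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
conv_params = count_params(conv)          # 3*3*3*64 weights + 64 biases

# A fully connected layer producing the same 64x32x32 output from a 3x32x32 input
# (arithmetic only; instantiating it would allocate about 200M weights)
in_features = 3 * 32 * 32
out_features = 64 * 32 * 32
fc_params = in_features * out_features + out_features

print(conv_params, fc_params, fc_params // conv_params)
```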

3 functions of 1×1 convolution (🚀 used by ResNet, MobileNet, and Inception alike)

1×1 convolution looks simple, but it plays a key role in modern networks:

Cross-channel feature fusion: linearly combines information from different channels.
Dimensionality reduction / expansion: reduces or increases the number of feature maps by changing the number of output channels, which controls the amount of computation.
Added nonlinearity: paired with an activation function, it increases the model's expressive power without changing the spatial size.

Interview question: Why do both MobileNet and ResNet use 1×1 convolution? Answer: in MobileNet the 1×1 convolution is the pointwise part, responsible for mixing information between channels; in the ResNet bottleneck, a 1×1 convolution first reduces the channel count, a 3×3 convolution follows, and a final 1×1 restores the channel count, which saves a large number of parameters.
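The parameter saving of the bottleneck can be checked directly. A minimal sketch assuming 256 input/output channels and a 64-channel bottleneck (the sizes commonly quoted for ResNet-50-style blocks); BatchNorm and ReLU are left out so that only convolution weights are counted.

```python
import torch
import torch.nn as nn

def count_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

# Plain 3x3 convolution on 256 channels
plain = nn.Conv2d(256, 256, 3, padding=1, bias=False)

# Bottleneck: 1x1 reduce -> 3x3 -> 1x1 restore
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, 1, bias=False),
    nn.Conv2d(64, 64, 3, padding=1, bias=False),
    nn.Conv2d(64, 256, 1, bias=False),
)

x = torch.randn(1, 256, 14, 14)
assert plain(x).shape == bottleneck(x).shape   # same output shape either way

plain_params = count_params(plain)             # 256*256*9
bottleneck_params = count_params(bottleneck)   # 256*64 + 64*64*9 + 64*256
print(plain_params, bottleneck_params)
```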

3 functions of BatchNorm (🎯 must be memorized for interviews)

Mitigates vanishing/exploding gradients: keeps the input distribution of each layer within a stable, reasonable range.
Speeds up convergence: makes the network less sensitive to initialization and learning rate.
Allows larger learning rates: training is more stable, so larger steps can be taken and training gets noticeably faster.
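The "stable input distribution" point is easy to verify numerically. A small sketch, assuming deliberately badly scaled activations (mean 5, std 3) as input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Activations with a badly scaled distribution (mean 5, std 3), chosen for illustration
x = torch.randn(32, 8, 16, 16) * 3.0 + 5.0

bn = nn.BatchNorm2d(8)
bn.train()                      # training mode: normalize with the current batch statistics
y = bn(x)

# Statistics are computed per channel over the batch and spatial dimensions
mean_per_channel = y.mean(dim=(0, 2, 3))
var_per_channel = y.var(dim=(0, 2, 3), unbiased=False)
print(mean_per_channel.abs().max().item(), var_per_channel.mean().item())

bn.eval()                       # inference mode: use the running averages instead
y_eval = bn(x)
```

In training mode each channel comes out with mean close to 0 and variance close to 1. Forgetting to call eval() at inference time is a classic source of accuracy mismatches.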

2. Classic architecture (only remember the most commonly tested ResNet residual block)

The core of ResNet is residual learning. The following code implements a basic residual block. Its shortcut (identity mapping) ensures that a deeper network is at least no worse than its shallower counterpart, which effectively addresses the degradation problem.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Interview core: solves deep-network degradation; with identity mapping the block is at least no worse than the previous layer"""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # Shortcut: a 1x1 conv + BN handles mismatched channels or spatial size
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch)
            )
    def forward(self, x):
        residual = self.shortcut(x)
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + residual)
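One way to see the "no worse than the shallower network" argument is to zero out the residual branch and check that the block simply passes its input through. This is a minimal sketch with a simplified block: TinyResidual is a hypothetical name, and zero-initializing the last BatchNorm scale is a commonly used trick, not something the ResBlock above does.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyResidual(nn.Module):
    """Simplified residual block (hypothetical; same-shape case only)."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(ch)
        nn.init.zeros_(self.bn.weight)   # residual branch F(x) starts at exactly zero

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)) + x)   # F(x) + x

torch.manual_seed(0)
x = torch.rand(2, 16, 8, 8)              # non-negative input, as after a previous ReLU
block = TinyResidual(16)
y = block(x)
is_identity = torch.allclose(y, x)
print(is_identity)
```

With F(x) = 0 the block is an identity on its input, so stacking more blocks cannot by construction make the representable function worse; training only has to learn the residual.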

For the other classic architectures, one sentence each is enough:

MobileNet: depthwise separable convolution = depthwise convolution + pointwise convolution; with 3×3 kernels the parameter count and computation drop to roughly 1/8 to 1/9 of a standard convolution.
Inception: uses convolution kernels of several sizes in parallel within the same layer, so the network can adaptively pick a suitable receptive field.
Transformer/ViT: ViT is a pure Transformer vision model, but in practice Transformer and CNN ideas are often combined. For example, Swin Transformer computes self-attention inside shifted local windows and builds CNN-style hierarchical feature maps.
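The MobileNet ratio can be checked with a few lines. A sketch assuming 128 input and output channels and a 3×3 kernel; the exact ratio is 1/C_out + 1/K², which approaches 1/9 as the channel count grows.

```python
import torch
import torch.nn as nn

def count_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

c_in, c_out, k = 128, 128, 3   # illustrative sizes

standard = nn.Conv2d(c_in, c_out, k, padding=1, bias=False)

separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in, bias=False),  # depthwise: one filter per channel
    nn.Conv2d(c_in, c_out, 1, bias=False),                         # pointwise: 1x1 channel mixing
)

x = torch.randn(1, c_in, 32, 32)
assert standard(x).shape == separable(x).shape

std_params = count_params(standard)     # c_in * c_out * k * k
sep_params = count_params(separable)    # c_in * k * k + c_in * c_out
ratio = sep_params / std_params         # equals 1/c_out + 1/k**2
print(std_params, sep_params, round(ratio, 4))
```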

3. Object detection (🎯 core question for algorithm roles)

One-stage vs. two-stage comparison (one-sentence summary)

Type	Representative	Speed	Accuracy	Application
One-stage	YOLO/SSD	Fast	Slightly lower	Real-time detection, mobile devices
Two-stage	Faster R-CNN	Slow	High	Medical imaging, precision quality inspection

Interview tip: Why is two-stage more accurate? Because it first uses an RPN to generate candidate regions and then classifies and refines them, so foreground/background screening is finer; one-stage detectors make dense predictions directly on the feature map, where easy background (negative) samples vastly outnumber foreground ones.

NMS (non-maximum suppression)

Standard NMS core logic: sort boxes by classification score, keep the highest-scoring box, delete every remaining box whose IoU with it exceeds a threshold, then repeat on the boxes that are left.
Improved variant, Soft-NMS: instead of deleting overlapping boxes outright, lower their scores, which alleviates missed detections in dense scenes (such as occluded pedestrians).
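Both variants fit in a few lines of NumPy. The following is a sketch rather than a production implementation: boxes are assumed to be in (x1, y1, x2, y2) format, and the 0.5 IoU threshold, the Gaussian decay with sigma=0.5, and the 0.001 score cutoff are illustrative defaults.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thr=0.5):
    """Standard NMS: keep the best box, delete heavy overlaps, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        ious = iou_one_to_many(boxes[best], boxes[order[1:]])
        order = order[1:][ious <= iou_thr]
    return keep

def soft_nms(boxes, scores, sigma=0.5, score_thr=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of deleting them."""
    scores = scores.astype(float).copy()
    idx = np.arange(len(scores))
    keep = []
    while idx.size > 0:
        best = idx[np.argmax(scores[idx])]
        keep.append(int(best))
        idx = idx[idx != best]
        if idx.size == 0:
            break
        ious = iou_one_to_many(boxes[best], boxes[idx])
        scores[idx] *= np.exp(-(ious ** 2) / sigma)
        idx = idx[scores[idx] > score_thr]
    return keep, scores

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])

hard_keep = nms(boxes, scores)
soft_keep, soft_scores = soft_nms(boxes, scores)
print(hard_keep, soft_keep, soft_scores.round(3))
```

In this toy case the second box overlaps the first with IoU of about 0.68: standard NMS drops it, while Soft-NMS keeps it with a reduced score, so a later score threshold decides its fate.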

3. Practical tuning: the leap from student to engineer

1. Loss function (⚠️ must be tuned for class imbalance and segmentation)

Focal Loss (object detection: background far outnumbers foreground)

The core idea: reduce the loss weight of easy samples so that the model pays more attention to hard samples. Key parameters:

α (alpha): balances positive and negative samples, usually in the range 0.25 to 0.75 (0.25 is the value used in the original paper).
γ (gamma): controls how strongly easy samples are down-weighted; 2 is the common default, and values up to about 5 are sometimes tried. The larger the value, the more attention goes to hard samples.

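# Focal Loss core computation (multi-class sketch with a scalar alpha)
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    # logits: (N, C) raw scores; target: (N,) class indices
    log_p_t = F.log_softmax(logits, dim=1).gather(1, target[:, None]).squeeze(1)  # log-probability of the true class
    p_t = log_p_t.exp()  # predicted probability of the true class
    return (-alpha * (1 - p_t) ** gamma * log_p_t).mean()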

Interview pitfalls: Don't just say "Focal Loss solves the imbalance"; you must be able to explain the role of gamma. When gamma=0, Focal Loss degenerates into cross-entropy weighted by alpha. The larger gamma is, the smaller the penalty on samples that are already classified correctly, and the more the loss concentrates on samples that are hard to classify.
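Both claims about gamma can be checked numerically. A minimal sketch, assuming a multi-class focal loss with a scalar alpha (a simplification: detection implementations usually apply a per-class sigmoid and a class-dependent alpha).

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Multi-class focal loss with a scalar alpha (a simplification)."""
    log_p_t = F.log_softmax(logits, dim=1).gather(1, target[:, None]).squeeze(1)
    p_t = log_p_t.exp()
    return (-alpha * (1 - p_t) ** gamma * log_p_t).mean()

torch.manual_seed(0)
logits = torch.randn(16, 5)
target = torch.randint(0, 5, (16,))

# gamma = 0 removes the modulating factor: the loss is alpha * cross-entropy
fl_gamma0 = focal_loss(logits, target, alpha=0.25, gamma=0.0)
ce_scaled = 0.25 * F.cross_entropy(logits, target)
fl_gamma2 = focal_loss(logits, target, alpha=0.25, gamma=2.0)

# Modulating factor (1 - p_t)^gamma for an easy (p_t = 0.9) and a hard (p_t = 0.1) sample
weights = {g: ((1 - 0.9) ** g, (1 - 0.1) ** g) for g in (0, 1, 2, 5)}
for g, (easy, hard) in weights.items():
    print(f"gamma={g}: easy weight={easy:.5f}, hard weight={hard:.5f}")
```

At gamma=2 the easy sample's loss is scaled by 0.01 while the hard sample keeps 0.81 of its loss, which is the concrete meaning of "focus on hard samples".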

Dice Loss (semantic segmentation of small targets)

Directly optimizes an overlap measure, the Dice coefficient 2|A∩B| / (|A| + |B|), which is closely related to IoU (intersection over union) but not identical to it. It is especially suitable for scenes where foreground pixels make up a very small proportion of the image. Because it measures the overlap between prediction and label rather than per-pixel error, it handles class imbalance better than cross-entropy.
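A soft Dice loss for binary segmentation can be sketched as follows. The per-sample reduction and the eps smoothing term are common conventions assumed here, not the only possible formulation.

```python
import torch

def dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss for binary segmentation.

    probs:  predicted foreground probabilities in [0, 1], shape (N, H, W)
    target: binary ground-truth mask, same shape
    """
    probs = probs.flatten(1)
    target = target.flatten(1).float()
    inter = (probs * target).sum(dim=1)
    denom = probs.sum(dim=1) + target.sum(dim=1)
    dice = (2 * inter + eps) / (denom + eps)     # eps guards against empty masks
    return 1 - dice.mean()

# Tiny foreground: 4 pixels out of 64*64
target = torch.zeros(1, 64, 64)
target[0, 10:12, 10:12] = 1

perfect = dice_loss(target.clone(), target)
all_background = dice_loss(torch.zeros_like(target), target)
print(perfect.item(), all_background.item())
```

Predicting all background scores about 99.9% pixel accuracy on this mask, yet its Dice loss is close to 1, which is why Dice is preferred for small targets.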

2. Optimizer (start with Adam, switch to SGD for final tuning)

Optimizer	Features	Applicability
Adam	Adaptive learning rate, fast convergence	First training runs, rapid iteration, prototype verification
SGD + Momentum	Stable, often higher final accuracy	Production tuning, competitions and leaderboards, squeezing out the last bit of performance

Interview experience: Why not just use Adam all the time? Because models trained with Adam's adaptive learning rate may generalize worse than SGD-trained ones in the later stage of training, so it is common to switch to SGD + Momentum for fine-tuning when final accuracy matters.
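The Adam-then-SGD switch can be sketched as below. The toy regression problem, the learning rates, and SWITCH_EPOCH are all hypothetical placeholders; in a real project the switch point and the SGD learning rate would be chosen on a validation set.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression problem standing in for a real training job
x = torch.randn(64, 1)
y = 3.0 * x + 1.0
model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()

SWITCH_EPOCH = 20   # hypothetical switch point; tune it on a validation set
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

history = []
for epoch in range(40):
    if epoch == SWITCH_EPOCH:
        # Hand the same parameters to SGD + momentum for the late-stage refinement
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    history.append(loss.item())

print(f"first loss {history[0]:.3f}, last loss {history[-1]:.3f}")
```

Creating a fresh optimizer discards Adam's moment estimates, which is intended here: SGD starts its own momentum buffer from the current weights.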

4. Engineering deployment: the last mile of algorithm implementation

1. Model acceleration (🚀 the 3-piece set for production)

Quantization: convert FP32 weights to INT8, which typically speeds up inference by about 2 to 4 times (depending on hardware and operators) and shrinks the weights to about 1/4 of their original size. PyTorch's torch.quantization module supports dynamic quantization and QAT (Quantization Aware Training).
Knowledge distillation: use the soft labels produced by a large model (the teacher network) to guide the training of a small model (the student network), so that the small model approaches the large model's accuracy while keeping its inference efficiency.
TensorRT: NVIDIA's inference optimization engine, which speeds things up further through operator fusion, memory optimization, and similar techniques; it is a standard tool for GPU production deployment.
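Of the three, knowledge distillation is the easiest to show in a few lines. This is a sketch of the widely used temperature-scaled formulation; the temperature T=4.0 and the mixing weight alpha=0.5 are illustrative assumptions, and random logits stand in for real teacher and student outputs.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, T=4.0, alpha=0.5):
    """Soft-label KL term (scaled by T^2) mixed with the usual hard-label cross-entropy."""
    soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    soft = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, target)
    return alpha * soft + (1 - alpha) * hard

torch.manual_seed(0)
teacher_logits = torch.randn(8, 10)                       # stand-in for a frozen teacher's output
student_logits = torch.randn(8, 10, requires_grad=True)   # stand-in for the student's output
target = torch.randint(0, 10, (8,))

loss = distillation_loss(student_logits, teacher_logits, target)
loss.backward()   # gradients flow only into the student
print(loss.item())
```

A higher temperature flattens the teacher's distribution, exposing the relative similarity between wrong classes, which is the extra signal the student learns from.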

2. Performance tuning checklist (⚠️ must be checked before deployment)

Slow speed troubleshooting

IO bottleneck: set num_workers on the DataLoader to load data in parallel, and preload data into memory if necessary.
Frequent data transfer: reduce unnecessary switching between .cpu() and .cuda().
Time-consuming preprocessing: batch the work where possible, and consider OpenCV instead of PIL for image operations.
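The first two items map onto a handful of DataLoader settings. A minimal sketch: build_loader is a hypothetical helper, the random TensorDataset stands in for a real dataset, and the demo call uses num_workers=0 only so that the snippet runs as a single process.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def build_loader(dataset, batch_size=32, num_workers=4):
    """DataLoader with the usual throughput-oriented settings (hypothetical helper)."""
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,                 # parallel worker processes for loading
        pin_memory=torch.cuda.is_available(),    # page-locked memory speeds up host-to-GPU copies
        persistent_workers=num_workers > 0,      # keep workers alive between epochs
    )

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in dataset (random tensors); raise num_workers on real datasets
dataset = TensorDataset(torch.randn(128, 3, 32, 32), torch.randint(0, 10, (128,)))
loader = build_loader(dataset, num_workers=0)

for images, labels in loader:
    # Move each batch to the device once and keep it there;
    # avoid .cpu() / .item() calls inside the hot loop
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    break

print(images.shape, images.device.type)
```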

Low accuracy troubleshooting

Input size: check that the inference input size matches the training size (a mismatch changes the feature scale the model sees).
Data augmentation strength: check that it is reasonable (too strong and targets may be cropped out entirely; too weak and generalization suffers).
Class imbalance: handle it with Focal Loss, oversampling, class weights, or similar methods.

5. Frequently asked calculation questions (a guide to avoid pitfalls, no need to memorize too many)

Output size formula

When designing networks or analyzing receptive fields, you often need to calculate the convolution output size. The formula itself is simple and reads clearly in code form:

H_out = floor((H_in + 2*P - K)/S) + 1
W_out is computed the same way

where:

H_in / W_in: input height/width
P: padding
K: convolution kernel size
S: stride
floor: round down

Interview example: Input 224×224, 3×3 convolution, stride 1, padding 1. What is the output size? → floor((224 + 2 - 3)/1) + 1 = 224, so 224×224 (called "same" convolution because the input and output sizes match).
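The formula translates into a one-line helper. A small sketch (conv_out_size is a hypothetical name, and dilation is assumed to be 1), cross-checked against an actual PyTorch layer:

```python
import torch
import torch.nn as nn

def conv_out_size(size_in, kernel, stride=1, padding=0):
    """Output height or width of a convolution (dilation 1)."""
    return (size_in + 2 * padding - kernel) // stride + 1

same = conv_out_size(224, kernel=3, stride=1, padding=1)   # the interview example
stem = conv_out_size(224, kernel=7, stride=2, padding=3)   # a 7x7 stride-2 stem layer

# Cross-check the second case against a real layer
layer = nn.Conv2d(3, 8, kernel_size=7, stride=2, padding=3)
actual = layer(torch.zeros(1, 3, 224, 224)).shape[-1]
print(same, stem, actual)
```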

6. Summary

Computer vision is much more than calling libraries and running YOLO. From interview to deployment, a complete knowledge system should include:

Fundamentals – Image processing and feature engineering are the basis for tuning and preprocessing.
Deep learning – CNN design paradigms and classic architectures are the core of algorithm-role interviews.
Practical tuning – Loss functions, optimizers, and data strategy determine whether you can take a baseline to a usable online model.
Engineering deployment – Quantization, distillation, and TensorRT are the last steps in realizing the value of the algorithm.

Combine theory and practice: first study the architecture and algorithm principles, then run open-source code to reproduce the results, and finally practice on a small dataset (such as Kaggle cats vs. dogs, or traffic light detection). That way you bring theory, projects, and details to the interview and can handle follow-up questions.
