Computer Vision (CV) Interview and Practical Red Book
Recently I have received many private messages from people who are stuck in practice or in interviews: either they only know how to tune YOLOv8 parameters but cannot implement anything themselves, or the interviewer asks "Why can ResNet be trained so deep?" or "How did you tune the alpha/gamma of your Focal Loss?" and they cannot answer.
This little red book is here to fill that gap. It condenses 3 years of interview experience for algorithm positions plus 2 years of deployment pitfalls, from the fundamentals to engineering. The text covers only what is "used in interviews/in practice", and the code is all "reproducible/modifiable".
1. Basic skills: the stepping stone for interviews and fine-tuning
1. Color space and channel (🎯 common interview application scenarios)
In interviews, questions related to color space usually revolve around "when should I use which space?" The following code snippet demonstrates the most classic HSV color segmentation scenario (such as green screen matting, traffic sign extraction):
A comparison of several commonly used color spaces is summarized as follows:
Interview tip: Why not segment directly in RGB? Because all three RGB channels change with lighting at the same time, whereas in HSV the hue H is largely decoupled from brightness (V) and saturation (S), so the threshold is much easier to set.
2. Filtering and noise reduction (⚠️ avoiding pitfalls in practice: choose the right filter)
Image filtering is often the first step in image preprocessing, and different kinds of noise call for different filters. Here are practical tips together with the scenarios where each one applies:
- Salt-and-pepper noise (random black and white pixels): use the median filter cv2.medianBlur, which removes isolated noise points effectively while protecting edges.
- Gaussian noise (fine grain spread over the whole image): use the Gaussian filter cv2.GaussianBlur.
- Skin smoothing / edge-preserving denoising: bilateral filtering with cv2.bilateralFilter is recommended. It smooths flat regions while keeping edges sharp; the drawback is that it is relatively slow.
If you are asked "Why not use mean filtering" during the interview, you can answer this: mean filtering will blur the noise and edges at the same time, while median filtering is better at removing salt and pepper noise and has stronger edge preservation capabilities.
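To make the mean-versus-median argument concrete, here is a small NumPy sketch. It uses a hand-rolled 3×3 filter (the helper filter3x3 is hypothetical, not an OpenCV function) on a tiny synthetic step edge with two noise pixels, purely for illustration.

```python
import numpy as np

def filter3x3(image, reducer):
    """Apply a 3x3 sliding-window filter using the given reducer (np.mean or np.median)."""
    h, w = image.shape
    padded = np.pad(image, 1, mode="edge")
    # Stack the 9 shifted views so every pixel sees its 3x3 neighborhood
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(3) for dx in range(3)], axis=-1)
    return reducer(windows, axis=-1)

# Step edge: left half is 50, right half is 200
clean = np.full((8, 8), 50.0)
clean[:, 4:] = 200.0

noisy = clean.copy()
noisy[2, 1] = 255.0   # "salt" pixel
noisy[5, 6] = 0.0     # "pepper" pixel

mean_out = filter3x3(noisy, np.mean)
median_out = filter3x3(noisy, np.median)
```

In this toy case the median filter recovers the clean image exactly, while the mean filter both spreads the noise into neighboring pixels and softens the step edge.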
3. Classic features and edges (🎯 a must-ask in traditional CV and entry-level interviews)
Canny edge detection (4 steps to memorize)
Canny is a traditional algorithm that is almost unavoidable in computer vision. The complete pipeline has only 4 steps, and it shows up in interviews all the time:
- Gaussian denoising – Smooth the image with a Gaussian filter first to reduce noise interference.
- Gradient magnitude and direction – Use the Sobel operator to obtain the gradient strength and direction at each pixel.
- Non-maximum suppression – Keep only the local maxima along the gradient direction so that edges become “thinner”.
- Double-threshold hysteresis linking – Set a high and a low threshold: strong edges are kept directly, weak edges are kept only if they are connected to a strong edge, and everything else is discarded.
Comparison of classic features (memorize these as ready-made interview answers)
Interview tip: When asked "Why not use SIFT?", you can say that SIFT is accurate but too slow for real-time scenarios, and that it was patent-encumbered for years (the patent expired in 2020); ORB, a free alternative, is more commonly used in applications such as SLAM.
2. Deep learning core: the hardest-hit area in interviews (60%+ of questions)
1. CNN basics (3 core points)
Local perception + parameter sharing
Convolutional neural networks significantly reduce the number of parameters through local connections and shared weights. The network goes from shallow to deep, and the features also go from simple to complex: Edge → Texture → Object Parts → Overall Scene.
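A quick back-of-the-envelope count shows how much local connections and weight sharing save. The layer sizes below are assumptions picked for illustration, and the two layers are not functionally equivalent; the point is only the order of magnitude.

```python
def fc_params(h, w, c_in, n_out):
    """Fully connected layer: every output unit is connected to every input value."""
    return h * w * c_in * n_out + n_out

def conv_params(k, c_in, c_out):
    """Convolution: one k x k x c_in kernel per output channel, shared across all positions."""
    return k * k * c_in * c_out + c_out

# 224x224 RGB input, 64 output units vs. 64 output channels
fc = fc_params(224, 224, 3, 64)     # 9,633,856 parameters
conv = conv_params(3, 3, 64)        # 1,792 parameters
```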
3 functions of 1×1 convolution (🚀 used in ResNet/MobileNet/Inception alike)
1×1 convolution may seem simple, but it plays a key role in modern networks:
- Cross-channel feature fusion: Linearly combines information from different channels.
- Dimensionality reduction / expansion: Reduces or increases the number of feature maps by changing the number of output channels, which controls the amount of computation.
- Adding nonlinearity: Paired with an activation function, it improves the model's expressive power without changing the spatial size.
Interview question: Why do both MobileNet and ResNet use 1×1 convolution? Answer: In MobileNet, the 1×1 convolution is the pointwise part, responsible for mixing information across channels. In the ResNet bottleneck structure, a 1×1 convolution first reduces the channel dimension, a 3×3 convolution then operates on the reduced features, and a final 1×1 convolution restores the dimension, which greatly reduces parameters.
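The savings of the bottleneck design can be checked with simple arithmetic. The sketch below assumes a 256-channel input reduced to 64 channels inside the bottleneck and ignores biases and BatchNorm parameters; it compares against two plain 3×3 convolutions at full width.

```python
def conv_weights(k, c_in, c_out):
    """Weight count of a k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

channels, reduced = 256, 64

# Two plain 3x3 convolutions at full width
plain = 2 * conv_weights(3, channels, channels)

# Bottleneck: 1x1 reduce -> 3x3 at reduced width -> 1x1 restore
bottleneck = (conv_weights(1, channels, reduced)
              + conv_weights(3, reduced, reduced)
              + conv_weights(1, reduced, channels))
```

Under these assumptions the plain pair needs 1,179,648 weights and the bottleneck 69,632, roughly 17 times fewer.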
3 functions of BatchNorm (🎯 must be memorized for interviews)
- Mitigates vanishing/exploding gradients: keeps the input distribution of each layer within a reasonable range.
- Speeds up convergence: makes the network less sensitive to initialization and learning rate.
- Allows larger learning rates: training is more stable and noticeably faster.
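What "stabilizing the input distribution" means can be seen in a few lines of NumPy. This is a simplified training-mode forward pass only (the function name batchnorm_forward is an assumption, and running statistics for inference are left out).

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm for an (N, C, H, W) array: normalize each channel over N, H, W."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learnable per-channel scale and shift
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=5.0, size=(8, 3, 4, 4))   # badly scaled activations
y = batchnorm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
```

Whatever the scale of the incoming activations, each channel of y ends up with mean about 0 and standard deviation about 1 before the learnable scale and shift take over.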
2. Classic architecture (just remember the most frequently tested one: the ResNet residual block)
The core of ResNet is residual learning. The following code implements a basic residual block. The identity mapping (shortcut) ensures that a deeper network is at least no worse than its shallower counterpart, which effectively addresses the degradation problem.
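A minimal PyTorch sketch of such a block is given here. The class name BasicBlock and the 1×1 projection used when shapes differ follow common convention and are assumptions, not necessarily the exact code the author had in mind.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions plus a shortcut: output = ReLU(F(x) + shortcut(x))."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU()
        # Identity shortcut when shapes match; otherwise a 1x1 projection to align them
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.shortcut(x)   # residual connection: F(x) + x
        return self.relu(out)

x = torch.randn(2, 64, 56, 56)
y = BasicBlock(64, 64)(x)              # identity shortcut, shape unchanged
z = BasicBlock(64, 128, stride=2)(x)   # projection shortcut, downsampled
```

If the two convolutions learn nothing useful (F(x) close to 0), the block simply passes x through, which is why stacking more blocks should not make the network worse.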
For the other classic architectures, remember one sentence each:
- MobileNet: Depthwise separable convolution = depthwise convolution + pointwise convolution. With 3×3 kernels, the parameter count and computation drop to roughly 1/8 to 1/9 of a standard convolution.
- Inception: Uses convolution kernels of several scales in parallel within the same layer, letting the network adaptively choose an appropriate receptive field.
- Transformer/ViT: ViT is a pure Transformer vision model, but in practice Transformer and CNN ideas are often combined. For example, Swin Transformer borrows the CNN-style hierarchical design (multi-scale feature maps, local windows) and computes self-attention within shifted windows.
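The "about 1/9" figure for MobileNet comes from a simple ratio, 1/C_out + 1/K². A short sketch with assumed sizes (3×3 kernel, 256 input and output channels, biases ignored):

```python
def standard_conv_weights(k, c_in, c_out):
    return k * k * c_in * c_out

def depthwise_separable_weights(k, c_in, c_out):
    depthwise = k * k * c_in     # one k x k filter per input channel
    pointwise = c_in * c_out     # 1x1 convolution that mixes channels
    return depthwise + pointwise

k, c_in, c_out = 3, 256, 256
ratio = (depthwise_separable_weights(k, c_in, c_out)
         / standard_conv_weights(k, c_in, c_out))
# ratio equals 1/c_out + 1/k**2, which approaches 1/9 as c_out grows
```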
3. Object detection (🎯 core question for algorithm positions)
One-stage vs. two-stage comparison (one-sentence summary)
Interview tip: Why is two-stage more accurate? Because it first uses an RPN to generate region proposals and then classifies and refines them, so foreground/background screening is more refined; one-stage makes dense predictions directly on the feature map, where the huge number of easy negative samples tends to dominate training.
NMS (non-maximum suppression)
- Standard NMS core logic: Sort boxes by classification score → keep the highest-scoring box → delete every remaining box whose IoU with it exceeds the threshold → repeat on what is left.
- Improved variant Soft-NMS: Instead of deleting overlapping boxes outright, it lowers their scores, which alleviates missed detections in dense scenes (such as occluded pedestrians).
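Both variants can be sketched in NumPy as below. The function names, the (x1, y1, x2, y2) box format, and the default thresholds are assumptions for illustration; the Soft-NMS shown is the Gaussian-decay variant.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Standard NMS: returns indices of kept boxes, highest score first."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        ious = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[ious <= iou_threshold]   # delete boxes that overlap too much
    return keep

def soft_nms_gaussian(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of deleting them."""
    scores = scores.astype(np.float64).copy()
    remaining = list(range(len(scores)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        keep.append(best)
        if remaining:
            idx = np.array(remaining)
            ious = iou_one_to_many(boxes[best], boxes[idx])
            scores[idx] *= np.exp(-(ious ** 2) / sigma)
            remaining = [i for i in remaining if scores[i] > score_threshold]
    return keep, scores

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=np.float64)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
soft_kept, soft_scores = soft_nms_gaussian(boxes, scores)
```

With these three boxes, standard NMS drops the second box (IoU of about 0.68 with the first), while Soft-NMS keeps it with a reduced score.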
3. Practical tuning: the leap from student to engineer
1. Loss function (⚠️ must be tuned for class imbalance/segmentation)
Focal Loss (object detection, where background far outnumbers foreground)
The core idea: reduce the loss weight of easy samples so that the model pays more attention to hard samples. Key parameters:
- α (alpha): balances positive and negative samples, usually in the range 0.25 to 0.75.
- γ (gamma): adjusts the weight of hard samples, usually set between 2 and 5. The larger the value, the more attention is paid to hard samples. (The RetinaNet paper uses alpha=0.25, gamma=2.)
Interview pitfalls: Don't just say "Using Focal Loss can solve the imbalance"; you must be able to explain the role of gamma. When gamma=0, Focal Loss degenerates into alpha-weighted cross entropy. The larger gamma is, the smaller the "penalty" on samples the model already classifies correctly, and the more its attention is focused on samples that are hard to classify.
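The role of gamma is easy to verify numerically. Below is a NumPy sketch of the binary form; the function name binary_focal_loss and the sample probabilities are assumptions for illustration.

```python
import numpy as np

def binary_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Per-sample binary focal loss. p: predicted probability of class 1, y: labels in {0, 1}."""
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)               # probability assigned to the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)   # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

p = np.array([0.95, 0.30, 0.10, 0.80])
y = np.array([1, 1, 0, 0])

loss_gamma0 = binary_focal_loss(p, y, gamma=0.0)   # plain alpha-weighted cross entropy
loss_gamma2 = binary_focal_loss(p, y, gamma=2.0)
```

With gamma=2, the easy positive (p=0.95) keeps only (1-0.95)² = 0.0025 of its cross-entropy loss, while the hard positive (p=0.30) keeps 0.49 of it.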
Dice Loss (semantic segmentation, small targets)
Directly optimizes a region-overlap metric, the Dice coefficient 2|A∩B| / (|A| + |B|), which is closely related to IoU (intersection over union). It is especially suitable for scenes where foreground pixels make up a very small proportion. Because it measures the overlap between prediction and label rather than averaging over all pixels, it handles class imbalance better than cross-entropy.
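A small NumPy sketch shows why this helps with tiny foregrounds. The function name dice_loss and the smoothing constant of 1.0 are assumptions; the toy mask has a foreground of only 0.25% of the pixels.

```python
import numpy as np

def dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss: 1 - (2*|P∩G| + s) / (|P| + |G| + s). pred: probabilities, target: 0/1 labels."""
    pred = pred.reshape(-1)
    target = target.reshape(-1)
    intersection = (pred * target).sum()
    dice = (2.0 * intersection + smooth) / (pred.sum() + target.sum() + smooth)
    return 1.0 - dice

# 100x100 mask whose foreground is a single 5x5 patch
target = np.zeros((100, 100))
target[10:15, 10:15] = 1.0

all_background = np.zeros_like(target)   # predicts nothing, yet is 99.75% pixel-accurate
perfect = target.copy()

loss_missed = dice_loss(all_background, target)
loss_perfect = dice_loss(perfect, target)
```

A prediction that misses the object entirely still looks excellent by pixel accuracy, but its Dice loss is close to 1, so the optimizer cannot ignore the small target.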
2. Optimizer (novices choose Adam, and for tuning, switch to SGD)
Interview Experience: Why not just use Adam all the time? Because Adam's adaptive learning rate may cause the model's generalization ability to be inferior to SGD in the later stage, it is often switched to SGD + Momentum for fine-tuning when pursuing final accuracy.
4. Engineering deployment: the last mile of algorithm implementation
1. Model acceleration (🚀 the production 3-piece set)
- Quantization: Convert FP32 weights to INT8, which typically speeds up inference by 2 to 4 times and reduces weight memory to 1/4 of the original. PyTorch provides the torch.quantization module, which supports dynamic quantization and QAT (quantization-aware training).
- Knowledge distillation: Use the soft labels produced by a large model (teacher network) to guide the training of a small model (student network), so that the small model approaches the large model's accuracy while keeping its inference efficiency.
- TensorRT: NVIDIA's inference optimization engine, which speeds things up further through operator fusion, memory optimization, and so on. It is the standard tool for GPU production deployment.
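To see where the 1/4 memory figure for quantization comes from, here is a hand-rolled NumPy sketch of symmetric per-tensor INT8 quantization. It is an illustration of the idea only, not how torch.quantization is implemented, and the helper names are assumptions.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values from INT8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w_fp32 = rng.normal(0.0, 0.1, size=(256, 256)).astype(np.float32)

w_int8, scale = quantize_int8(w_fp32)
w_restored = dequantize(w_int8, scale)

memory_ratio = w_int8.nbytes / w_fp32.nbytes       # 1 byte vs. 4 bytes per weight
max_error = np.abs(w_fp32 - w_restored).max()      # bounded by about scale / 2
```

The rounding error per weight is at most about half a quantization step, which is why accuracy often survives quantization with little loss, although that has to be verified per model.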
2. Performance tuning checklist (⚠️ must be checked before deployment)
Slow speed troubleshooting
- IO bottleneck: load data with multiple workers via num_workers, preloading data into memory if necessary.
- Frequent data transfer: reduce unnecessary .cpu()/.cuda() switching between devices.
- Time-consuming preprocessing: batch operations where possible, and use OpenCV instead of PIL for image operations.
Low accuracy troubleshooting
- Is the input size aligned with the training size (to avoid feature inconsistency caused by a size mismatch)?
- Is the data augmentation strength reasonable (too strong and the target may be cropped out entirely; too weak and generalization suffers)?
- Class imbalance: handle it with Focal Loss, oversampling, class weights, and similar methods.
5. Frequently asked calculation questions (a guide to avoid pitfalls, no need to memorize too many)
Output size formula
When designing networks or analyzing receptive fields, it is often necessary to calculate the convolution output size. The formula itself is very simple, and the code representation is clear at a glance:
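The formula is H_out = floor((H_in + 2P - K) / S) + 1, and likewise for the width. A small Python sketch follows; the function name conv_output_size is chosen here for illustration.

```python
def conv_output_size(H_in, W_in, K, S=1, P=0):
    """Spatial output size of a convolution with a square kernel."""
    H_out = (H_in + 2 * P - K) // S + 1   # "//" is floor division, i.e. floor(...)
    W_out = (W_in + 2 * P - K) // S + 1
    return H_out, W_out
```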
where:
- H_in/W_in: input height/width
- P: padding
- K: convolution kernel size
- S: stride
- floor: round down
Interview Example: Input 224×224, 3×3 convolution, stride 1, padding 1, what is the output size? → 224×224 (called “same” convolution because the input and output sizes are the same).
6. Summary
Computer vision is much more than calling libraries and running YOLO. From interview to deployment, a complete knowledge system should include:
- Basic skills – Image processing and feature engineering are the basis for tuning and preprocessing.
- Deep Learning – CNN design paradigms and classic architectures are the core of algorithm job interviews.
- Practical Tuning – Loss functions, optimizers, and data strategy determine whether you can turn a baseline into something usable online.
- Engineering Deployment – Quantization, distillation, and TensorRT are the final steps in realizing the value of the algorithm.

