Computer Vision Interview: Core High-Frequency Questions and Answers

Introduction

Computer vision is one of the most widely deployed branches of AI, and interviews test both theory and engineering integration skills. This guide narrows the field to fewer than 30 core high-frequency questions and trims the redundancy so you can prepare efficiently.

1. High-frequency basics of image processing

Q1: How to choose RGB vs HSV?

A：

RGB: The universal additive color model used by monitors. Its drawback is that brightness and color are strongly coupled: when the illumination changes, all three channels change together, which makes RGB a poor fit for color segmentation.
HSV: Separates hue (H, the color itself), saturation (S, vividness), and value (V, brightness). Once you threshold on H, changes in V and S cause little interference.
When to switch to HSV: green-screen matting, traffic light or red apple recognition, and other tasks that depend on a pure color.
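
A minimal sketch of why hue is robust to lighting, using Python's standard `colorsys` module. The sample color and the 0.5 dimming factor are illustrative assumptions, and real illumination changes are rarely a perfectly uniform scaling:

```python
import colorsys

# An illustrative "red apple" color in RGB, each channel in [0, 1].
bright = (0.8, 0.1, 0.1)
# Simulate dimmer lighting by scaling every channel by the same factor.
dim = tuple(0.5 * c for c in bright)

h1, s1, v1 = colorsys.rgb_to_hsv(*bright)
h2, s2, v2 = colorsys.rgb_to_hsv(*dim)

# All three RGB channels changed, but hue (and here saturation) did not;
# only V dropped, so a threshold on H still selects the same pixels.
print(f"bright: H={h1:.3f} S={s1:.3f} V={v1:.3f}")
print(f"dim:    H={h2:.3f} S={s2:.3f} V={v2:.3f}")
```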

Q2: How to choose Gaussian/median/bilateral filter?

A： A cheat sheet for when to use each filter:

Filter type	Applicable noise	Core features
Gaussian filtering	Gaussian noise	Weighted average over the neighborhood; blurs everything, including edges
Median filtering	Salt-and-pepper noise (isolated black and white pixels)	Takes the neighborhood median; removes the noise while preserving edges
Bilateral filtering	General noise where edges and details must be kept	Adds a weight based on pixel-value difference on top of the spatial weight: neighbors whose values differ too much contribute little, so it smooths skin but keeps eyebrows sharp (classic use: beauty filters)
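
To see the difference on salt-and-pepper noise, here is a small one-dimensional sketch. A plain mean stands in for the Gaussian filter (both are weighted averages), and the helper name `filter_1d` and the sample row are made up for illustration:

```python
from statistics import median

def filter_1d(signal, size, reduce):
    """Slide a window of odd `size` over `signal`, replicating the border values."""
    half = size // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [reduce(padded[i:i + size]) for i in range(len(signal))]

def mean(window):
    return sum(window) / len(window)

# A step edge (10 -> 200) corrupted by one "salt" pixel (255) and one "pepper" pixel (0).
row = [10, 10, 255, 10, 10, 200, 200, 0, 200, 200]

# Averaging smears the outliers into their neighbors and softens the edge.
print("mean  :", [round(v) for v in filter_1d(row, 3, mean)])
# The median discards the outliers and keeps the step edge sharp.
print("median:", filter_1d(row, 3, median))
```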

Q3: What is the core of Canny edge detection?

A： The full algorithm has 5 steps (double thresholding and hysteresis tracking are often counted separately); the simplified version usually expected in interviews is:

Gaussian filtering to suppress noise
Sobel operator to compute gradient magnitude and direction
Non-maximum suppression (NMS) to thin edges to 1 pixel wide
Double threshold + hysteresis tracking: only strong edges (above the high threshold) and weak edges (between the two thresholds) connected to a strong edge are retained.
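
The last step is the one candidates most often get wrong, so here is a minimal sketch of double threshold + hysteresis on a tiny gradient-magnitude grid. The function name, the 8-connectivity choice, and the threshold values are assumptions for illustration, not the exact behavior of any particular library:

```python
from collections import deque

def hysteresis(mag, low, high):
    """Keep strong pixels (>= high) plus weak pixels (>= low) 8-connected to a strong one."""
    rows, cols = len(mag), len(mag[0])
    keep = [[False] * cols for _ in range(rows)]
    queue = deque()
    # Seed the search with every strong pixel.
    for r in range(rows):
        for c in range(cols):
            if mag[r][c] >= high:
                keep[r][c] = True
                queue.append((r, c))
    # Grow outwards through neighboring weak pixels.
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and not keep[nr][nc] and mag[nr][nc] >= low):
                    keep[nr][nc] = True
                    queue.append((nr, nc))
    return keep

mag = [
    [0,  0,  0,  0, 0],
    [90, 60, 55, 0, 0],   # one strong pixel followed by a chain of weak ones
    [0,  0,  0,  0, 58],  # an isolated weak pixel
]
edges = hysteresis(mag, low=50, high=80)
```

The weak pixels 60 and 55 survive because they chain back to the strong pixel 90, while the isolated 58 is discarded even though it is above the low threshold.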

2. High-frequency basics of deep learning network

Q4: Why is CNN more suitable for images than MLP?

A： Three properties that cut the parameter count and match the structure of images:

Local perception: The convolution kernel only looks at a neighborhood (images are locally correlated; for example, the pixels around the eyes make up the eyes)
Parameter sharing: The same convolution kernel scans the entire image (for example, the rule for finding "eye edges" applies everywhere in the image)
Translation invariance: Convolution itself is translation equivariant (shift the input and the feature map shifts with it); adding MaxPooling makes the output approximately invariant to small shifts of the object.

The following code compares parameter counts to show how large the difference is:

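import torch
import torch.nn as nn

def cnn_mlp_params():
    # Input 224x224x3, output 224x224x64
    # CNN: a single 3x3 convolution, 3 -> 64 channels
    cnn = nn.Conv2d(3, 64, 3, padding=1)
    n_cnn = sum(p.numel() for p in cnn.parameters())  # 3*3*3*64 weights + 64 biases = 1792
    print(f"CNN parameters: {n_cnn / 1000:.1f}K")  # ~1.8K
    # MLP for comparison (theoretical only; nobody would build this fully connected layer)
    n_mlp = (3 * 224 * 224) * (64 * 224 * 224)  # weights only, every input to every output
    print(f"MLP theoretical parameters: {n_mlp / 1e9:.0f}B")  # ~483B, an astronomical number!

cnn_mlp_params()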

Q5: What is the core role of 1×1 convolution?

A： Don’t just say “dimensionality reduction”! There are 3 points worth mentioning in interviews:

Cross-channel information fusion: For example, linearly mixing the information from the three RGB channels
Raising/lowering the channel dimension to control computation: ResNet's Bottleneck uses two 1×1 convolutions, the first to reduce the channel count (typically to 1/4) before the 3×3 convolution and the second to restore it, which sharply cuts parameters and computation compared with running 3×3 convolutions at full width
Added nonlinearity: A 1×1 convolution followed by ReLU increases expressive power without changing the resolution.
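
A quick back-of-the-envelope count makes the Bottleneck saving concrete. The widths (256 channels outside the block, 64 inside) are assumed for illustration, and biases and BN parameters are ignored:

```python
def conv_params(c_in, c_out, k):
    """Weight count of a k x k convolution without bias."""
    return k * k * c_in * c_out

# Assumed widths: 256 channels outside the block, 64 inside (a 4x reduction).
wide, narrow = 256, 64

# Two 3x3 convolutions applied directly at full width.
plain = 2 * conv_params(wide, wide, 3)

# Bottleneck: 1x1 reduce -> 3x3 at reduced width -> 1x1 restore.
bottleneck = (conv_params(wide, narrow, 1)
              + conv_params(narrow, narrow, 3)
              + conv_params(narrow, wide, 1))

print(plain, bottleneck, round(plain / bottleneck, 1))
```

Under these assumptions the bottleneck needs 69,632 weights against 1,179,648 for the two full-width 3×3 convolutions, roughly a 17× reduction.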

Q6: What are the core principles and functions of BN?

A：

Principle: For each channel, normalize the features within the batch to mean 0 and variance 1, then learn γ (scale) and β (shift) to restore representational capacity
Role (by interview priority):

Mitigate exploding/vanishing gradients
Significantly accelerate convergence (a larger learning rate can be used)
Weak regularization (from the randomness of the batch statistics)
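
The training-time computation for one channel can be sketched in a few lines of plain Python. This is only an illustration of the formula: it ignores the running statistics used at inference, and the function name and `eps` default are assumptions:

```python
from statistics import fmean

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's activations over the batch, then scale and shift."""
    mu = fmean(values)
    var = fmean((v - mu) ** 2 for v in values)  # biased (population) variance
    return [gamma * (v - mu) / (var + eps) ** 0.5 + beta for v in values]

acts = [2.0, 4.0, 6.0, 8.0]
normed = batch_norm(acts)                        # mean ~0, variance ~1
rescaled = batch_norm(acts, gamma=2.0, beta=3.0) # mean ~3, std ~2

print([round(v, 3) for v in normed])
print([round(v, 3) for v in rescaled])
```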

3. Classic architecture high-frequency interview points

Q7: What problems does ResNet solve? How do you understand residuals?

A：

Core pain point: deep network degradation. As plain networks get deeper, training error goes up rather than down (for example, a 56-layer plain network has higher training error than a 20-layer one). This is not overfitting; it is an optimization problem: the extra layers struggle to learn even an identity mapping
Residual structure: Add a skip connection so the network learns H(x) = F(x) + x, where F(x) is the residual
If the residual F(x)→0, the block reduces to an identity mapping, so the deeper model can at least replicate the shallower network and should not perform worse
If F(x) is useful, deeper features can be learned.

The following is a minimal residual block implementation that shows how the skip connection is handled:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal ResNet BasicBlock (no Bottleneck)."""
    def __init__(self, in_c, out_c, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_c, out_c, 3, stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_c)
        self.conv2 = nn.Conv2d(out_c, out_c, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_c)
        # When the channel count or resolution changes, the skip connection
        # must be projected to the same shape (1x1 convolution + BN)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_c != out_c:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_c, out_c, 1, stride, bias=False),
                nn.BatchNorm2d(out_c)
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)  # H(x) = F(x) + x
        return torch.relu(out)

Q8: What is the core design of MobileNet?

A： Depthwise separable convolution: split the standard convolution into two steps, which cuts the computation to roughly 1/8 to 1/9 of the original for 3×3 kernels (the ratio is 1/C_out + 1/K²):

Depthwise convolution: Each 3×3 kernel filters a single input channel; channels are not mixed
Pointwise convolution: A 1×1 convolution spans all channels and is responsible for fusing information across them.
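
The saving can be checked by counting multiply-accumulate operations. The layer shape below (128 → 128 channels on a 56×56 feature map) is an assumption chosen for illustration:

```python
def standard_macs(c_in, c_out, k, h, w):
    """Multiply-accumulates of a standard k x k convolution (stride 1, same padding)."""
    return k * k * c_in * c_out * h * w

def separable_macs(c_in, c_out, k, h, w):
    """Multiply-accumulates of depthwise + pointwise convolution."""
    depthwise = k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = c_in * c_out * h * w   # 1x1 convolution mixes the channels
    return depthwise + pointwise

# Assumed layer shape for illustration: 128 -> 128 channels on a 56x56 feature map.
ratio = separable_macs(128, 128, 3, 56, 56) / standard_macs(128, 128, 3, 56, 56)
print(f"{ratio:.3f}")  # equals 1/128 + 1/9
```

The ratio simplifies to 1/C_out + 1/K², so with 3×3 kernels it approaches 1/9 as the number of output channels grows.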

4. Object detection/image segmentation high frequency

Q9: How to choose One-stage vs Two-stage detector?

A： In one sentence: choose Two-stage for maximum accuracy, One-stage for real-time speed.

Type	Representative model	Core process	Advantages and disadvantages	Applicable scenarios
Two-stage	Faster R-CNN	First propose candidate boxes → then classify and refine each box	High accuracy, but slow (typically under 10 FPS)	Offline annotation assistance, medical image analysis, and other tasks with loose speed requirements
One-stage	YOLO, SSD	Regress boxes and classify directly on the feature map	Very fast (YOLOv8n can exceed 300 FPS on a capable GPU); accuracy can trail the best Two-stage models slightly	Autonomous driving, security monitoring, real-time live-stream effects, and other tasks with strict speed requirements

Q10: What is the difference between NMS and Soft-NMS?

A：

NMS process: Sort by confidence → take the highest-scoring box → delete every box whose IoU with it exceeds the threshold → repeat on the remaining boxes
Soft-NMS improvement: Instead of deleting outright, lower the confidence of overlapping boxes according to the degree of overlap (for example, with linear decay a box with IoU=0.8 has its score multiplied by 1 − 0.8 = 0.2). This addresses the problem of heavily overlapping real objects being wrongly suppressed in dense scenes.
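
Both procedures fit in a short pure-Python sketch. This assumes boxes in `(x1, y1, x2, y2)` format and the linear decay variant of Soft-NMS; the function names and the `score_thr` cutoff are illustrative, and production code would use a vectorized or GPU implementation:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thr=0.5):
    """Hard NMS: returns indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the winner too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thr]
    return keep

def soft_nms(boxes, scores, iou_thr=0.5, score_thr=0.001):
    """Linear Soft-NMS: decay overlapping scores by (1 - IoU) instead of deleting."""
    scores = list(scores)  # work on a copy
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        keep.append((best, scores[best]))
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            if overlap > iou_thr:
                scores[i] *= 1.0 - overlap
        # Only boxes whose score decays below score_thr are discarded.
        remaining = [i for i in remaining if scores[i] >= score_thr]
    return keep

boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
print(soft_nms(boxes, scores))
```

With these sample boxes, hard NMS drops the second box entirely (its IoU with the first is about 0.82), while Soft-NMS keeps it with a reduced score, so a downstream confidence threshold decides its fate.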

Q11: What is the difference between semantic/instance/panoptic segmentation?

A： Understand by interview priority:

Semantic segmentation: Labels only the category, not the individual. For example, all "people" in the picture are painted red and all "cars" blue
Instance segmentation: Labels both the category and the individual. For example, "person A" is painted one shade of red and "person B" another
Panoptic segmentation: Semantic + instance, and it also distinguishes countable "things" from uncountable "stuff". For example, "people" are countable (split into individuals), while "sky" is uncountable (labeled by category only).

Q12: The core idea of U-Net?

A： Encoder-Decoder + skip connections at every resolution level:

Encoder: convolution + pooling; extracts deep semantic features (such as "there are cancer cells in this area") at the cost of reduced resolution
Decoder: transposed convolution + convolution; restores the resolution, but on its own cannot recover the fine details lost in downsampling
Skip connections: Concatenate the high-resolution feature maps from each Encoder level onto the corresponding Decoder level, which recovers much of the lost detail (classic scenario: medical image segmentation, where small lesions must not be missed).

5. Loss function/training optimization high frequency

Q13: How to choose cross entropy/Focal Loss/Dice Loss?

A： A table to help you make quick decisions:

Loss function	Applicable scenarios	Core principles
Cross-entropy	General classification/segmentation with balanced samples	The loss grows rapidly as the predicted probability moves away from the true label
Focal Loss	Object detection/segmentation with severe class imbalance (for example, background makes up 99%)	Down-weights easy samples (such as confidently classified background) so the model focuses on hard samples (such as small objects or boundaries)
Dice Loss	Medical image segmentation with very small targets (such as a 1 mm nodule)	Directly optimizes the Dice coefficient, an overlap measure closely related to IoU; it depends on the overlap between the predicted and true regions, not on the total pixel count.
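
A scalar sketch of the three losses for the binary case shows how they behave. The defaults `gamma=2.0` and `alpha=0.25` are the commonly cited Focal Loss settings; the function names and sample values are assumptions for illustration:

```python
import math

def bce(p, y):
    """Binary cross-entropy for predicted foreground probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal(p, y, gamma=2.0, alpha=0.25):
    """Focal loss: cross-entropy scaled by (1 - p_t)^gamma, with class weight alpha."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss over flattened probabilities and a binary mask."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)

easy_bg = (0.01, 0)  # background predicted confidently: p_t = 0.99
hard_fg = (0.30, 1)  # foreground predicted poorly:      p_t = 0.30

# Focal loss shrinks the easy sample far more than the hard one.
print(focal(*easy_bg) / bce(*easy_bg))
print(focal(*hard_fg) / bce(*hard_fg))
# Dice depends only on overlap: perfect match -> ~0, no overlap -> ~1.
print(dice_loss([1, 1, 0, 0], [1, 1, 0, 0]), dice_loss([0, 0, 1, 1], [1, 1, 0, 0]))
```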

Q14: How to deal with sample imbalance?

A： In order of priority in real projects:

Algorithm level (fastest implementation):

Loss weighting: Add greater Loss weight to the minority class
Change to Focal Loss

Data level:

Data augmentation: Apply more geometric/color transformations (such as rotation, flipping, brightness jitter) to minority classes
Oversampling (when data is scarce) / undersampling (when the majority class is large and redundant)
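
For loss weighting, one common heuristic is inverse class frequency. The sketch below assumes the formula N / (K · n_c), where N is the sample count, K the number of classes, and n_c the count of class c; the function name and labels are made up for illustration:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by N / (K * n_c): the rarer the class, the larger its weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

labels = ["background"] * 90 + ["defect"] * 10
weights = inverse_frequency_weights(labels)
print(weights)
```

Here the minority class ends up with a weight 9 times larger than the majority class, and the resulting values can be passed as per-class weights to a weighted cross-entropy loss.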

Q15: How to choose SGD vs Adam?

A： The core is a trade-off between convergence speed and generalization:

Adam: Converges quickly and is relatively insensitive to the initial learning rate → good for validating ideas quickly and debugging the pipeline early on
SGD with momentum: Converges more slowly and is harder to tune, but often generalizes better → use it when chasing maximum accuracy (such as competitions or accuracy-critical deployments).

Q16: How to solve overfitting?

A： Sorted by project implementation priority:

Data side (most efficient):

Add data
Strong data augmentation (Mixup, Cutout, CutMix)

Training side (quick to implement):

Early stopping: Stop training if the validation loss has not decreased for N consecutive epochs.
L2 regularization (weight decay)
Label smoothing (prevents the model from being too confident)

Model side:

Reduce model complexity (such as changing from ResNet101 to ResNet50)
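Add Dropout (note: Dropout and BN can interact badly, because Dropout changes the variance of activations between training and inference, which no longer matches the statistics BN accumulated; if you use both, a common recommendation is to place Dropout after the last BN layer, and remember to switch the model to evaluation mode at inference).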
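
Early stopping is simple enough to sketch directly. The function below replays a recorded validation-loss history rather than running a real training loop; the name `early_stop_epoch` and the default `patience=3` are assumptions for illustration:

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training stops, or None if it never triggers."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best validation loss: reset the counter (and save a checkpoint in practice).
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None

history = [0.90, 0.70, 0.60, 0.62, 0.61, 0.63, 0.59]
print(early_stop_epoch(history))
print(early_stop_epoch(history, patience=4))
```

With `patience=3` training stops at epoch index 5 and never sees the later improvement to 0.59, which is why the patience value is itself worth tuning.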

6. Project deployment and hands-on practice high frequency (super important!!)

Q17: What are the three core methods of model compression?

A： Sorted by ease of implementation + effect:

Quantization: Reduce FP32 to FP16 or INT8. INT8 shrinks the model about 4× (FP16 about 2×), inference is often 2-4× faster on hardware with low-precision support, and the accuracy loss is usually small
Pruning: Remove channels with small weights or weak contribution to the result, directly reducing computation
Knowledge distillation: A large model (teacher) teaches a small model (student) to match its output probability distribution (soft labels), so the small model can approach the large model's accuracy.
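
The idea behind INT8 quantization can be shown with a symmetric per-tensor scheme. This is a simplified sketch, not what any specific toolkit does (real pipelines add calibration and often per-channel scales); the function names and sample weights are assumptions:

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127].

    Assumes at least one value is non-zero.
    """
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return [qi * scale for qi in q]

weights = [0.02, -0.51, 0.33, 1.27, -0.90]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(q, scale)
print(max_err)
```

Each weight is stored in 1 byte instead of 4, and the rounding error per weight is bounded by half the scale.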

Q18: What do ONNX/TensorRT/NCNN do respectively?

A： Understand by deployment link:

ONNX: A cross-framework "translator" that converts PyTorch/TensorFlow models into a common format for easy migration
TensorRT: An NVIDIA GPU inference "accelerator" that applies operator fusion and memory optimization, often giving a 2-5× speedup
NCNN/MNN: Inference frameworks for phones and ARM embedded devices, with a small footprint and no third-party dependencies, well suited to mobile deployment.

Q19: What are the core pitfalls of pre-processing/post-processing?

A： **Rule number one: preprocessing must be 100% consistent with training!**

For example, if training used ImageNet normalization (mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]), inference must use exactly the same values
For example, if training scaled images to 640×640 by keeping the aspect ratio and padding the borders with black, inference cannot stretch the image directly (objects get deformed and accuracy plummets)
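
The "keep aspect ratio + pad" resize (often called letterboxing) boils down to a little arithmetic, and the same numbers are needed again in post-processing to map predictions back to the source image. A minimal sketch, with hypothetical helper names and centered padding assumed:

```python
def letterbox_params(src_w, src_h, dst=640):
    """Scale to fit inside a dst x dst square while keeping the aspect ratio, then pad."""
    scale = min(dst / src_w, dst / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = (dst - new_w) / 2, (dst - new_h) / 2  # padding on each side
    return scale, new_w, new_h, pad_x, pad_y

def to_original(x, y, scale, pad_x, pad_y):
    """Map a point predicted in letterboxed coordinates back to the source image."""
    return (x - pad_x) / scale, (y - pad_y) / scale

scale, new_w, new_h, pad_x, pad_y = letterbox_params(1920, 1080)
print(new_w, new_h, pad_x, pad_y)
print(to_original(320, 320, scale, pad_x, pad_y))
```

A 1920×1080 frame becomes 640×360 with 140 pixels of padding above and below, and the center of the network input maps back to the center of the original frame.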

Post-processing pitfalls:

Post-processing of large model outputs (such as NMS over many candidate boxes) is often slower than inference itself → use GPU acceleration where possible (such as a TensorRT plugin or a CUDA implementation of NMS).

Q20: How to locate the problem of slow model speed?

A： **The core is to time each stage separately!**

Measure the entire pipeline: data loading → preprocessing → inference → post-processing → result transfer
Find the bottleneck:
Low GPU utilization (<50%) → the bottleneck is CPU/IO (data loading cannot keep up, or preprocessing is too slow)
High GPU utilization → the model itself is too heavy → switch to a smaller Backbone, quantize, or prune
Long transfer time → don't send back the original image; return only the coordinates or a compressed mask.
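
Per-stage timing can be done with a small context manager. In this sketch `time.sleep` stands in for real work and the stage names are placeholders; note that GPU kernels run asynchronously, so with PyTorch you would call `torch.cuda.synchronize()` before reading the clock:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(list)

@contextmanager
def timed(stage):
    """Record the wall-clock duration of one pipeline stage in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage].append(time.perf_counter() - start)

def run_once():
    # Placeholder stages: time.sleep stands in for real work.
    with timed("preprocess"):
        time.sleep(0.002)
    with timed("inference"):
        time.sleep(0.005)
    with timed("postprocess"):
        time.sleep(0.001)

for _ in range(5):
    run_once()

# Report the average per stage; the largest number is where to optimize first.
for stage, samples in timings.items():
    print(f"{stage:<12} {1000 * sum(samples) / len(samples):.2f} ms")
```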

7. Project in-depth interviews (reflecting the core of personal abilities!)

Don't just say "I used YOLOv8 for object detection"! **You must add quantitative metrics, the specific pain points you solved, and project details.**

Reference question 1: What is the difficulty of your project?

Reference answer (combined with engineering):

“The difficulty was keeping the end-to-end latency of object detection within 100 ms on an edge device (a Jetson-class module with about 20 TOPS of compute) while limiting the mAP drop to no more than 2%. I first applied FP16 quantization, which doubled the speed but still left the latency at 150 ms. Then I changed the Backbone from ResNet50 to MobileNetV2, and the accuracy dropped by 3%. Finally, I added TensorRT's operator fusion and dynamic batching; the latency fell to 85 ms and the final accuracy drop was held to 1.8%.”

Reference question 2: Why did you choose this model?

Reference answer (reflecting Trade-off):

“I chose YOLOv8s because it strikes the best balance between accuracy, speed, and deployment difficulty. Compared with the Two-stage Faster R-CNN, it is 10 times faster and suits real-time scenarios; compared with the pure-Transformer DETR, it converges faster on small datasets (I only have 20,000 annotated images) and needs less hyperparameter tuning; compared with the lighter YOLOv8n, its mAP is 5% higher, which meets the business requirements.”

When preparing for the interview, know the 1-2 core projects you have worked on inside out: be ready to give the quantitative metrics, the pain points solved, and the details of every stage.
