Computer Vision Interviews: Core High-Frequency Questions and Answers


Introduction

Computer vision is one of the most widely deployed branches of AI, and interviews test both theory and engineering integration skills. This guide selects fewer than 30 of the most frequently asked core questions and trims the redundancy so that you can prepare efficiently.


1. High-frequency basics of image processing

Q1: How to choose RGB vs HSV?

A:

  • RGB: the universal additive color model used by displays. Its drawback is that brightness and color are strongly coupled: when the illumination changes, all three channels change together, which makes it a poor fit for color segmentation.
  • HSV: separates hue (H, the color itself), saturation (S, vividness), and value (V, brightness). Once a range of H is fixed, changes in V and S cause little interference.
  • When to switch to HSV: green-screen matting, traffic light or red apple recognition, and other tasks that depend mainly on color.
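
The decoupling described above can be checked with a minimal sketch using Python's standard `colorsys` module. The two sample colors and the helper name `hue_degrees` are illustrative assumptions, not part of any particular pipeline.

```python
import colorsys

def hue_degrees(r, g, b):
    """Return the hue (0-360 degrees) of an 8-bit RGB color."""
    h, _s, _v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return h * 360

bright_green = (40, 200, 40)
dim_green = (20, 100, 20)   # the same color at half the illumination

# All three RGB channels change, but the hue stays the same.
print(hue_degrees(*bright_green))
print(hue_degrees(*dim_green))
```

In a real project the same conversion would normally be done on whole images with an image library; the point here is only that H is unaffected when the brightness is scaled.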

Q2: How to choose Gaussian/median/bilateral filter?

A: A cheat sheet for when each filter applies:

| Filter | Suitable noise | Core characteristics |
| --- | --- | --- |
| Gaussian filter | Gaussian noise | Weighted average over the neighborhood; blurs everything, edges included |
| Median filter | Salt-and-pepper noise (isolated black and white pixels) | Takes the neighborhood median; removes the noise while largely preserving edges |
| Bilateral filter | General noise where detail must be preserved | Adds a weight based on pixel-value difference: where the difference is large, little smoothing is applied, so it can smooth skin yet keep eyebrows sharp (classic use: beauty filters) |
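
To see why the median handles salt-and-pepper noise better than an averaging filter, here is a small 1-D sketch. It uses a plain box average as a stand-in for Gaussian weighting (an assumption made to keep the example short), and `filter_1d` is a hypothetical helper.

```python
from statistics import mean, median

def filter_1d(signal, reducer, radius=1):
    """Apply `reducer` over a sliding window, replicating the border samples."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)] for j in range(i - radius, i + radius + 1)]
        out.append(reducer(window))
    return out

# A step edge (10 -> 200) with one "salt" outlier at index 2.
row = [10, 10, 255, 10, 10, 200, 200, 200]

averaged = filter_1d(row, mean)      # the outlier is smeared into its neighbors
medianed = filter_1d(row, median)    # the outlier disappears, the edge stays sharp
print(averaged)
print(medianed)
```

The averaged row still carries the outlier's energy in three positions and softens the step, while the median output is a clean step.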

Q3: What is the core of Canny edge detection?

A: Canny is usually described in five steps. The simplified version below merges the last two (double threshold and hysteresis tracking) and covers what interviews focus on:

  1. Gaussian filtering to remove noise
  2. Sobel operator to compute gradient magnitude and direction
  3. Non-maximum suppression (NMS) to thin edges down to 1 pixel wide
  4. Double threshold + hysteresis tracking: keep only strong edges (above the high threshold) and weak edges (between the two thresholds) that are connected to a strong edge.
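
Step 4 is the part candidates most often get wrong, so here is a minimal sketch of double thresholding with hysteresis on a tiny gradient-magnitude grid. The function name, the 8-connectivity choice, and the threshold values are assumptions for illustration; library implementations differ in detail.

```python
from collections import deque

def hysteresis(magnitude, low, high):
    """Keep strong pixels (>= high) and weak pixels (>= low) 8-connected to a strong one."""
    rows, cols = len(magnitude), len(magnitude[0])
    keep = [[False] * cols for _ in range(rows)]
    queue = deque()
    # Seed the search with every strong pixel.
    for r in range(rows):
        for c in range(cols):
            if magnitude[r][c] >= high:
                keep[r][c] = True
                queue.append((r, c))
    # Grow outward through weak pixels that touch an already kept pixel.
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and not keep[nr][nc] and magnitude[nr][nc] >= low):
                    keep[nr][nc] = True
                    queue.append((nr, nc))
    return keep

grad = [
    [90, 40, 0, 0, 40],
    [0, 0, 0, 0, 0],
]
edges = hysteresis(grad, low=30, high=80)
print(edges[0])
```

The weak pixel next to the strong one survives, while the isolated weak pixel at the far right is discarded.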

2. High-frequency basics of deep learning networks

Q4: Why is CNN more suitable for images than MLP?

A: Three core ideas that cut the parameter count and match the structure of images:

  1. Local connectivity: a convolution kernel only looks at a small neighborhood (images are locally correlated; for example, the pixels around an eye are what make up the eye)
  2. Parameter sharing: the same kernel scans the entire image (for example, a rule for finding "eye edges" applies equally in every corner of the image)
  3. Translation robustness: convolution itself is translation-equivariant (shift the input and the feature map shifts with it); adding MaxPooling makes the output features approximately unchanged when the object moves slightly.

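Here is a quick parameter-count comparison that shows how large the gap is:

import torch
import torch.nn as nn

def cnn_mlp_params():
    # Input: 224x224x3, output: 224x224x64
    # CNN: one 3x3 convolution, 3*64*3*3 weights + 64 biases
    cnn = nn.Conv2d(3, 64, 3, padding=1)
    cnn_params = sum(p.numel() for p in cnn.parameters())
    print(f"CNN parameters: {cnn_params / 1e3:.1f}K")  # 1.8K (1,792)
    # MLP (theoretical only; a fully connected layer of this size is impractical):
    # every input value connects to every output value
    mlp_params = (3 * 224 * 224) * (64 * 224 * 224)
    print(f"MLP theoretical parameters: {mlp_params / 1e9:.0f}B")  # ~483B weights

cnn_mlp_params()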

Q5: What is the core role of 1×1 convolution?

A: Don't just say "dimensionality reduction"! There are three points worth mentioning in interviews:

  1. Cross-channel information fusion: for example, linearly mixing the information of the three RGB channels
  2. Raising or lowering the channel count to control computation: ResNet's Bottleneck uses two 1×1 convolutions. The first reduces the channels (typically to 1/4), the 3×3 convolution runs at that reduced width, and the second 1×1 restores the channels, which sharply cuts parameters and computation.
  3. Added nonlinearity: a 1×1 convolution followed by ReLU increases expressive power without changing the spatial resolution.
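
The saving from point 2 is easy to verify by counting weights. The sketch below assumes a 256-channel block with a bottleneck width of 64 (one quarter), biases ignored; `conv_params` is a hypothetical helper and the numbers are illustrative, not taken from a specific model file.

```python
def conv_params(in_c, out_c, k):
    """Weight count of a k x k convolution without bias."""
    return in_c * out_c * k * k

channels = 256
mid = channels // 4  # bottleneck width, assumed to be 1/4 of the block width

# Two 3x3 convolutions at full width.
plain = 2 * conv_params(channels, channels, 3)

# Bottleneck: 1x1 reduce -> 3x3 at reduced width -> 1x1 restore.
bottleneck = (
    conv_params(channels, mid, 1)
    + conv_params(mid, mid, 3)
    + conv_params(mid, channels, 1)
)

print(plain, bottleneck, round(plain / bottleneck, 1))
```

Under these assumptions the bottleneck needs about 17 times fewer weights than two full-width 3×3 convolutions, because the expensive 3×3 only ever sees 64 channels.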

Q6: What are the core principles and functions of BN?

A:

  • Principle: for each channel, normalize the features to mean 0 and variance 1 using the statistics of the current mini-batch, then apply a learned scale γ and shift β to restore expressive power. At inference, running averages collected during training replace the batch statistics.
  • Role (by interview priority):
  1. Mitigate exploding/vanishing gradients
  2. Significantly accelerate convergence (a larger learning rate can be used)
  3. Weak regularization (from the noise in mini-batch statistics)
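
As a rough illustration of the principle, the following sketch normalizes one channel's activations across a batch in plain Python. It covers the training-mode computation only (no running statistics), and the function name and sample values are assumptions.

```python
from math import sqrt

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's activations across a batch, then scale and shift."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n   # biased variance over the batch
    return [gamma * (v - mu) / sqrt(var + eps) + beta for v in values]

activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)                    # mean ~0, variance ~1
rescaled = batch_norm(activations, gamma=2.0, beta=3.0) # mean ~3, std ~2

print(normalized)
print(rescaled)
```

With γ = 1 and β = 0 the output is simply standardized; learned values of γ and β let the layer undo the normalization where that helps.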

3. High-frequency interview points on classic architectures

Q7: What problems does ResNet solve? How do you understand residuals?

A:

  • Core pain point: deep network degradation. When plain networks are stacked deeper (the ResNet paper compares a 56-layer plain network with a 20-layer one), the deeper network ends up with a higher training error. This is not overfitting; the optimizer struggles to make the extra layers learn even an identity mapping.
  • Residual structure: add a skip connection so that the block outputs H(x) = F(x) + x, where F(x) is the residual the layers have to learn
  • If the residual F(x) → 0, the block reduces to an identity mapping, so the deeper model can at least reproduce the shallower network and, in principle, should not perform worse
  • If F(x) is useful, deeper features can be learned.

The following minimal residual block shows how the skip connection is handled in practice:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal ResNet BasicBlock (no Bottleneck)."""
    def __init__(self, in_c, out_c, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_c, out_c, 3, stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_c)
        self.conv2 = nn.Conv2d(out_c, out_c, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_c)
        # If the channel count or resolution changes, the skip connection
        # must be projected to the same shape with a 1x1 convolution.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_c != out_c:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_c, out_c, 1, stride, bias=False),
                nn.BatchNorm2d(out_c)
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)  # F(x) + x
        return torch.relu(out)

Q8: What is the core design of MobileNet?

A: Depthwise separable convolution. It splits a standard convolution into two steps, which for 3×3 kernels reduces the computation to roughly 1/8 to 1/9 of the original:

  1. Depthwise convolution: each 3×3 kernel filters exactly one input channel; channels are not mixed
  2. Pointwise convolution: a 1×1 convolution spans all channels and is responsible for mixing information across them.
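
The "roughly 1/9" figure follows from counting multiply-accumulate operations. The sketch below does that count for one layer; the feature-map size and channel counts are assumed values chosen for illustration.

```python
def standard_conv_macs(h, w, in_c, out_c, k):
    """Multiply-accumulates of a standard k x k convolution on an h x w output."""
    return h * w * in_c * out_c * k * k

def separable_conv_macs(h, w, in_c, out_c, k):
    """Multiply-accumulates of a depthwise k x k followed by a pointwise 1x1."""
    depthwise = h * w * in_c * k * k     # one k x k filter per input channel
    pointwise = h * w * in_c * out_c     # 1x1 convolution mixes the channels
    return depthwise + pointwise

h = w = 56
in_c, out_c, k = 128, 128, 3
ratio = separable_conv_macs(h, w, in_c, out_c, k) / standard_conv_macs(h, w, in_c, out_c, k)
print(f"{ratio:.3f}")  # equals 1/out_c + 1/k**2
```

The ratio simplifies to 1/out_c + 1/k², so with 3×3 kernels it approaches 1/9 as the number of output channels grows.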

4. High-frequency questions on object detection and image segmentation

Q9: How to choose One-stage vs Two-stage detector?

A: In one sentence: choose two-stage for maximum accuracy, one-stage for real-time speed.

| Type | Representative models | Core process | Pros and cons | Typical scenarios |
| --- | --- | --- | --- | --- |
| Two-stage | Faster R-CNN | Generate candidate boxes first, then classify and refine them | High accuracy but slow (typically under 10 FPS) | Tasks with loose speed requirements: offline annotation assistance, medical image analysis |
| One-stage | YOLO, SSD | Regress boxes and classify directly on the feature map | Very fast (YOLOv8n can run at 300+ FPS on suitable hardware); accuracy slightly below the latest two-stage models | Tasks with strict speed requirements: autonomous driving, security monitoring, real-time live-stream effects |

Q10: What is the difference between NMS and Soft-NMS?

A:

  • NMS process: sort by confidence → take the highest-scoring box → delete every box whose IoU with it exceeds the threshold → repeat on the remaining boxes
  • Soft-NMS improvement: instead of deleting outright, lower the confidence of overlapping boxes according to the degree of overlap (for example, with linear decay and IoU = 0.8, score × 0.2). This addresses the problem of adjacent, heavily overlapping objects being deleted by mistake in dense scenes.
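
Both procedures can be sketched in a few lines of plain Python. This is a minimal illustration assuming boxes in (x1, y1, x2, y2) format and the linear decay variant of Soft-NMS; the function names, thresholds, and sample boxes are hypothetical, and production code would use a vectorized or GPU implementation.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thr=0.5, soft=False, score_thr=0.001):
    """Return [(index, final_score)]; soft=True applies linear Soft-NMS decay."""
    remaining = [(s, i) for i, s in enumerate(scores)]
    kept = []
    while remaining:
        remaining.sort(reverse=True)          # scores may have changed, so re-sort
        best_score, best = remaining.pop(0)
        kept.append((best, best_score))
        survivors = []
        for s, i in remaining:
            overlap = iou(boxes[best], boxes[i])
            if overlap > iou_thr:
                if not soft:
                    continue                  # hard NMS: drop the box
                s *= 1.0 - overlap            # Soft-NMS: decay the score instead
            if s >= score_thr:
                survivors.append((s, i))
        remaining = survivors
    return kept

boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
hard = nms(boxes, scores)
soft = nms(boxes, scores, soft=True)
print(hard)
print(soft)
```

With hard NMS the second box (IoU of about 0.82 with the first) is removed; with Soft-NMS it survives with a much lower score, so a downstream score threshold decides its fate.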

Q11: What is the difference between semantic, instance, and panoptic segmentation?

A: Understand them in order of interview priority:

  1. Semantic segmentation: labels only the category, not the individual. For example, paint every "person" in the picture red and every "car" blue
  2. Instance segmentation: labels both the category and the individual. For example, give "person A" and "person B" separate masks (red 1 and red 2)
  3. Panoptic segmentation: semantic + instance combined, with a distinction between countable and uncountable classes. For example, "people" are countable (split into individuals), while "sky" is uncountable (labeled by category only).

Q12: What is the core idea of U-Net?

A: Encoder-decoder plus skip connections at every resolution level:

  • Encoder: convolution + pooling, extracting deep semantic features (such as "there are cancer cells in this area") at the cost of reduced resolution
  • Decoder: transposed convolution + convolution, restoring the resolution, although fine details lost in the encoder cannot be recovered from the deep features alone
  • Skip connections: concatenate the high-resolution feature maps of each encoder level directly onto the corresponding decoder level, which compensates for the lost detail (classic scenario: medical image segmentation, where small lesions must not be missed).

5. High-frequency questions on loss functions and training optimization

Q13: How to choose cross entropy/Focal Loss/Dice Loss?

A: A table for quick decisions:

| Loss function | Typical scenarios | Core principle |
| --- | --- | --- |
| Cross-entropy | General classification/segmentation with roughly balanced classes | The loss grows rapidly as the predicted probability moves away from the true label |
| Focal Loss | Object detection/segmentation with severe class imbalance (for example, background is 99% of samples) | Down-weights easy samples (such as confidently classified background) so that the model focuses on hard samples (such as small objects near boundaries) |
| Dice Loss | Medical image segmentation with very small targets (such as a 1 mm nodule) | Directly optimizes region overlap (the Dice coefficient, closely related to IoU); it depends on how well the predicted and true regions coincide, not on the total number of pixels |
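
The three rows can be compared numerically with a minimal per-pixel sketch. It assumes the binary case, the commonly used defaults gamma = 2 and alpha = 0.25 for focal loss, and a soft Dice formulation with a small smoothing term; function names are illustrative.

```python
from math import log

def cross_entropy(p, y):
    """Binary cross-entropy for predicted foreground probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -log(p_t)

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * log(p_t)

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss over flat lists of probabilities and 0/1 labels."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)

easy_bg = focal_loss(0.01, 0)   # confident, correct background pixel
hard_fg = focal_loss(0.10, 1)   # foreground pixel the model is missing
print(cross_entropy(0.01, 0), easy_bg)
print(cross_entropy(0.10, 1), hard_fg)
print(dice_loss([1, 1, 0, 0], [1, 1, 0, 0]), dice_loss([0, 0, 1, 1], [1, 1, 0, 0]))
```

The easy background pixel contributes almost nothing under focal loss, while the missed foreground pixel keeps a large loss; Dice loss goes from about 0 for a perfect mask to about 1 for a disjoint one regardless of how many background pixels surround the target.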

Q14: How to deal with sample imbalance?

A: Sorted by practical priority:

  1. Algorithm level (fastest to implement):
  • Loss weighting: give the minority class a larger loss weight
  • Switch to Focal Loss
  2. Data level:
  • Data augmentation: apply more geometric/color transformations (such as rotation, flipping, brightness jitter) to minority classes
  • Oversampling (when the minority class has little data) or undersampling (when the majority class has a lot of redundant data)
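
For the loss-weighting option, one common heuristic is inverse-frequency weights of the form n_samples / (n_classes × class_count). The sketch below computes them for a made-up label list; the helper name and the data are assumptions, and other weighting schemes are equally valid.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Hypothetical dataset: 99% background, 1% defect.
labels = ["background"] * 990 + ["defect"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

The resulting weights would typically be passed to the loss function's per-class weight argument so that each class contributes equally in aggregate.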

Q15: How to choose SGD vs Adam?

A: The core is a trade-off between speed and generalization:

  • Adam: converges very quickly and is relatively insensitive to the initial learning rate → good for validating ideas quickly and debugging the pipeline early on
  • SGD with momentum: converges more slowly and is harder to tune, but often generalizes better → good when pursuing the best possible accuracy (competitions, high-precision deployments).

Q16: How to solve overfitting?

A: Sorted by practical priority:

  1. Data side (most effective):
  • Add more data
  • Strong data augmentation (Mixup, Cutout, CutMix)
  2. Training side (quick to implement):
  • Early stopping: stop when the validation loss has not decreased for N consecutive epochs
  • L2 regularization (weight decay)
  • Label smoothing (prevents the model from becoming overconfident)
  3. Model side:
  • Reduce model complexity (such as switching from ResNet101 to ResNet50)
  • Add Dropout (note that Dropout must be switched off at inference by putting the model in evaluation mode, and that combining Dropout with BN can hurt, because the activation variance BN sees differs between training and inference).
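
Early stopping is simple enough to sketch in full. The class below is a minimal, framework-independent version; the class name, the `patience` and `min_delta` parameters, and the validation-loss sequence are all assumptions for illustration.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]
stopper = EarlyStopping(patience=3)
stopped_at = next(
    (epoch for epoch, loss in enumerate(val_losses) if stopper.step(loss)), None
)
print(stopped_at, stopper.best)
```

In practice the checkpoint saved at the best epoch is the one to keep, since the weights at the stopping epoch are already past the optimum.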

6. High-frequency questions on deployment and hands-on engineering (very important!)

Q17: What are the three core methods of model compression?

A: Sorted by ease of implementation and payoff:

  1. Quantization: convert FP32 to FP16 or INT8. Model size shrinks about 2× (FP16) or 4× (INT8), inference is often 2-4 times faster, and the accuracy loss is usually small
  2. Pruning: remove channels whose weights are small or that contribute little to the result, directly reducing the amount of computation
  3. Knowledge distillation: a large model (teacher) trains a small model (student) to match its output probability distribution (soft labels), so that the small model can approach the accuracy of the large one.
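
To make the quantization step concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. Real toolchains add calibration, per-channel scales, and zero points; the function names and sample weights are assumptions, and the input is assumed to contain at least one nonzero value.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0   # assumes a nonzero maximum
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [x * scale for x in q]

weights = [0.82, -0.41, 0.05, -1.27, 0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now fits in one byte instead of four, and the reconstruction error is bounded by half the scale, which is why accuracy usually degrades only slightly.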

Q18: What do ONNX/TensorRT/NCNN do respectively?

A: Understand them by their place in the deployment pipeline:

  1. ONNX: a cross-framework interchange format. It converts PyTorch/TensorFlow models into a common representation so that they can be moved between tools
  2. TensorRT: an inference optimizer and runtime for NVIDIA GPUs. Operator fusion and memory optimization often give a 2-5× speedup
  3. NCNN/MNN: inference frameworks for mobile phones and ARM embedded devices. They have a small footprint and no third-party dependencies, which suits on-device deployment.

Q19: What are the core pitfalls of pre-processing/post-processing?

A: **Rule number one: preprocessing at inference must match training exactly!**

  • For example, if training used ImageNet normalization (mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]), inference must apply exactly the same values, along with the same channel order and value range
  • For example, if training resized images to 640×640 by keeping the aspect ratio and padding the borders with black, inference must not stretch the image directly (objects become distorted and accuracy drops sharply)
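
Both bullets can be written down as small pure functions, which also makes them easy to share between the training and inference code. The sketch below is an assumption-laden illustration: the helper names are hypothetical, padding is split evenly between both sides, and only the geometry is computed (no actual image resizing).

```python
def letterbox_params(src_w, src_h, dst=640):
    """Scale to fit inside dst x dst while keeping the aspect ratio; return size and padding."""
    scale = min(dst / src_w, dst / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = dst - new_w, dst - new_h
    # Padding as (left, top, right, bottom); left/top get the smaller half.
    pad = (pad_x // 2, pad_y // 2, pad_x - pad_x // 2, pad_y - pad_y // 2)
    return scale, (new_w, new_h), pad

def box_to_original(box, scale, pad_left, pad_top):
    """Map a box predicted on the letterboxed image back to source-image pixels."""
    x1, y1, x2, y2 = box
    return ((x1 - pad_left) / scale, (y1 - pad_top) / scale,
            (x2 - pad_left) / scale, (y2 - pad_top) / scale)

MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Apply the ImageNet mean/std to one 8-bit RGB pixel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))

scale, new_size, pad = letterbox_params(1920, 1080)
full_frame = box_to_original((0, 140, 640, 500), scale, pad[0], pad[1])
print(scale, new_size, pad)
print(full_frame)
print(normalize_pixel((124, 116, 104)))
```

A 1920×1080 frame becomes 640×360 with 140 pixels of padding above and below, and the inverse mapping is what post-processing needs to report boxes in original image coordinates.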

Post-processing pitfalls:

  • Post-processing of large outputs (such as NMS over many candidate boxes) is often slower than inference itself → move it to the GPU where possible (for example, a TensorRT plugin or a CUDA implementation of NMS).

Q20: How to locate the problem of slow model speed?

A: **The core is to time each stage separately!**

  • Measure the whole pipeline: data loading → preprocessing → inference → post-processing → result transmission
  • Find the bottleneck:
  • Low GPU utilization (<50%) → the bottleneck is CPU/IO (data loading cannot keep up, or preprocessing is too slow)
  • High GPU utilization → the model itself is too heavy → switch to a smaller backbone, quantize, or prune
  • Long transfer time → do not send back the original image; return only the coordinates or a compressed mask.
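
Per-stage timing needs very little tooling. The following sketch uses a context manager around each stage; the stage names match the pipeline above, while `timed`, `run_once`, and the `time.sleep` calls standing in for real work are assumptions.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(list)

@contextmanager
def timed(stage):
    """Record the wall-clock duration of one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage].append(time.perf_counter() - start)

def run_once():
    # The sleeps are placeholders for the real stages.
    with timed("load"):
        time.sleep(0.002)
    with timed("preprocess"):
        time.sleep(0.005)
    with timed("inference"):
        time.sleep(0.001)
    with timed("postprocess"):
        time.sleep(0.001)

for _ in range(5):
    run_once()

for stage, samples in timings.items():
    print(f"{stage:12s} {1000 * sum(samples) / len(samples):.2f} ms")
```

One caveat when adapting this to GPU inference: GPU calls are typically asynchronous, so the device has to be synchronized before reading the clock, otherwise the inference stage looks faster than it is and the cost shows up in a later stage.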

7. Project deep-dive questions (where your personal ability really shows!)

Don't just say "I used YOLOv8 for object detection"! **You must add quantitative metrics, the specific pain points you solved, and project details.**

Reference question 1: What is the difficulty of your project?

Reference answer (grounded in engineering):

“The difficulty was keeping the end-to-end latency of object detection under 100 ms on an edge device (for example, a Jetson-class module with about 20 TOPS of compute) while limiting the mAP drop to at most 2%. I first applied FP16 quantization, which doubled the speed but still left latency at 150 ms. Then I changed the backbone from ResNet50 to MobileNetV2, which cost 3% in accuracy. Finally, I added TensorRT operator fusion and dynamic batching; latency dropped to 85 ms, and the final accuracy drop was only 1.8%.”

Reference question 2: Why did you choose this model?

Reference answer (showing the trade-off):

“I chose YOLOv8s because it gave the best balance between accuracy, speed, and deployment difficulty. Compared with the two-stage Faster R-CNN, it is 10 times faster and suits real-time scenarios; compared with the Transformer-based DETR, it converges faster on small datasets (I only had 20,000 annotated images) and needs less hyperparameter tuning; compared with the lighter YOLOv8n, its mAP is 5% higher, which meets the business requirements.”


When preparing for interviews, make sure you know your 1-2 core projects inside out: you should be able to state the quantitative metrics, the pain points solved, and the details of every stage.