🚀 Selected high-frequency interview questions for crawler engineers

Hello everyone! This article is a streamlined and optimized version of the 20 high-frequency crawler engineer interview questions that I compiled: duplicates from the original are removed, practical code snippets are added, and key tips are placed in collapsible blocks. It is kept to just over 2k words for quick review before an interview~

1. Basic understanding of crawlers (5 must-memorize introductory questions)

1. Briefly describe what a crawler is and the optimized big-tech version of its workflow

The basic version is "Request → Response → Parse → Store". The upgraded big-tech version looks like this:

Deduplication layer: deduplicate with a Bloom filter or Redis Set before sending the request
Request control layer: build compliant headers, choose a proxy IP, and add an adaptive delay
Request/response exception layer: try-except around timeouts, connection errors, and abnormal status codes
Response parsing layer: choose the appropriate parser
Pre-storage deduplication layer: avoid writing duplicates to the database
Monitoring and alerting layer: abnormal failure rates or parse success rates trigger notifications

2. The actual difference between GET and POST in crawlers (the key when choosing an API)

Dimensions	GET	POST
Parameter position	URL splicing (explicitly visible)	Request Body (can be hidden)
Length limit	Limited by browser/server (about 2KB is the commonly quoted figure; actual limits vary)	No clear limit (JSON can carry large objects)
Crawler use cases	Pagination, public data acquisition (preferred!)	Login, sensitive queries, large form submission
Cache trigger	Easily cached by browser/CDN	Rarely cached

3. Crawler response logic for common HTTP status codes

The original article only covered what each code means; here are responses you can give in an interview:

200 OK: First check whether the returned content is real data (anti-crawling may return empty-shell HTML!)
301/302 Redirect: requests follows redirects automatically by default; this can be turned off with allow_redirects=False so you can read the Location header yourself. In Scrapy, the handling can be customized in DOWNLOADER_MIDDLEWARES
403 Forbidden: Most likely anti-crawling was triggered (UA recognized, IP over its limit, Referer/Cookie missing)
404 Not Found: Either the URL is misspelled or the page no longer exists. Set a retry limit, then mark the URL as dead.
500 Internal Server Error: Server-side problem; retry with exponential backoff
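
The response logic above can be sketched as a small dispatch function plus a backoff helper. This is only an illustration: the action labels, the `max_retries` default, and the jitter strategy are assumptions, not something the original article prescribes.

```python
import random


def next_action(status: int, attempt: int, max_retries: int = 3) -> str:
    """Map a status code to a crawler action (labels are illustrative)."""
    if status == 200:
        return "validate_body"      # still check for empty-shell HTML
    if status in (301, 302):
        return "follow_redirect"
    if status == 403:
        return "rotate_identity"    # switch UA / proxy / cookies
    if status == 404:
        return "retry" if attempt < max_retries else "mark_dead"
    if status >= 500:
        return "retry" if attempt < max_retries else "give_up"
    return "log_and_skip"


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with jitter: a random wait in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

A caller would sleep for `backoff_delay(attempt)` seconds before each retry, so the wait ceiling doubles per attempt while the jitter keeps many workers from retrying at the same instant.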

4. Code example of GET request with parameters

Written with Python's requests library (a must for beginners); the httpx library (which supports async) exposes a very similar API:

# requests, synchronous version
import requests

url = "https://api.example.com/data"
params = {
    "page": 1,
    "size": 20,
    "keyword": "爬虫面试"
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
}

response = requests.get(url, params=params, headers=headers, timeout=10)
print(response.json())
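
To see what a query-string argument such as `params=` roughly turns into, the final URL can be built by hand with the standard library. This is a sketch for understanding only; the ASCII keyword is a stand-in chosen for readability, and an HTTP client may differ in encoding details.

```python
from urllib.parse import urlencode

url = "https://api.example.com/data"
params = {"page": 1, "size": 20, "keyword": "crawler interview"}

# urlencode percent-encodes each value and joins the pairs with "&"
full_url = f"{url}?{urlencode(params)}"
print(full_url)
```

Passing a dict instead of concatenating strings by hand avoids encoding mistakes with spaces and non-ASCII keywords.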

5. The difference between HTTP and HTTPS, how does the crawler production environment handle SSL?

HTTPS = HTTP + an SSL/TLS encryption layer. Transmission is more secure, but the handshake is slower and certificate problems can occur. Don't casually add verify=False in production (it is simple, but it triggers security warnings and may be flagged by anti-crawling systems). Suggestions:

Update the certifi library: pip install --upgrade certifi
If the site uses a self-signed certificate, specify its path manually: verify="/path/to/your/self-signed.crt"

2. Data parsing and deduplication (5 engineering questions)

6. Usage Scenario Decision Trees of the three mainstream parsers

The original article directly lists the advantages and disadvantages, and it is more intuitive to draw a decision tree:

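flowchart TD
    A[Raw data received] --> B{Data type}
    B -->|JSON / structured JS variable| C[Python json / execjs to extract the JSON fragment]
    B -->|HTML/XHTML| D{Locating complexity}
    D -->|Hierarchy tracing: find parent/ancestor nodes| E[lxml + XPath]
    D -->|Front-end syntax habits / simple locating| F[BeautifulSoup + CSS selectors / lxml + CSS selectors]
    D -->|Non-tag content / simple string extraction| G[Python re]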
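
As an illustration of the "JSON / structured JS variable" branch, here is a minimal sketch that pulls a JSON object out of an inline script using only `re` and `json`. The variable name `window.__INITIAL_STATE__` and the sample HTML are hypothetical.

```python
import json
import re

html = """
<script>
  window.__INITIAL_STATE__ = {"items": [{"id": 1, "title": "demo"}], "total": 1};
</script>
"""

# Capture the object literal assigned to the (hypothetical) variable,
# stopping at the first "};" that follows it
pattern = re.compile(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\});", re.DOTALL)
match = pattern.search(html)
data = json.loads(match.group(1)) if match else {}
print(data["items"][0]["title"])
```

The non-greedy pattern is fragile: it breaks if `};` appears inside a string value, which is one reason to fall back to executing the script when the embedded data gets complicated.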

7. Bloom filter: the first choice for deduplicating billions of URLs

The original article covered the advantages and principles. Here is a Python implementation plus a tip for the Redis version (interviewers often ask how to persist a Bloom filter with Redis):

# In-memory local version (for small-scale testing)
import mmh3
from bitarray import bitarray

class SimpleBloomFilter:
    def __init__(self, capacity=1e8, error_rate=0.001):
        self.capacity = int(capacity)
        self.error_rate = error_rate
        self.bit_size = self._get_bit_size()
        self.hash_count = self._get_hash_count()
        self.bit_array = bitarray(self.bit_size)
        self.bit_array.setall(0)
    
    def _get_bit_size(self):
        from math import log, ceil
        return ceil(-self.capacity * log(self.error_rate) / (log(2)**2))
    
    def _get_hash_count(self):
        from math import log
        return round(self.bit_size / self.capacity * log(2))
    
    def add(self, url):
        for i in range(self.hash_count):
            index = mmh3.hash(url, i) % self.bit_size
            self.bit_array[index] = 1
    
    def contains(self, url):
        for i in range(self.hash_count):
            index = mmh3.hash(url, i) % self.bit_size
            if not self.bit_array[index]:
                return False
        return True
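
To get a feel for the sizing formulas used in `_get_bit_size` and `_get_hash_count`, the same arithmetic can be run on its own with the class defaults. This is a standalone sketch; `bloom_parameters` is a helper name introduced here for illustration.

```python
from math import ceil, log


def bloom_parameters(capacity: int, error_rate: float) -> tuple[int, int]:
    """Return (bit_size, hash_count) using the same formulas as the class above."""
    bit_size = ceil(-capacity * log(error_rate) / (log(2) ** 2))
    hash_count = round(bit_size / capacity * log(2))
    return bit_size, hash_count


bits, hashes = bloom_parameters(100_000_000, 0.001)
print(f"{bits / 8 / 1024 ** 2:.0f} MiB, {hashes} hash functions")
```

For 100 million URLs at a 0.1% false-positive rate this works out to roughly 171 MiB of bits and 10 hash functions, far less memory than an exact set of URL strings would need.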

Production environment: Redis version

Recommended: the redis-py-cluster + pyprobables libraries, or the BF.ADD/BF.EXISTS Bloom filter commands available on Redis 4.0+ through the RedisBloom module!
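
A minimal sketch of the `BF.ADD` route, assuming a redis-py style client that exposes `execute_command` and a server with the Bloom filter commands available. The key name `crawler:seen` and the helper `is_new_url` are hypothetical.

```python
from typing import Any, Protocol


class RedisLike(Protocol):
    """Anything with redis-py's execute_command method."""

    def execute_command(self, *args: Any) -> Any: ...


def is_new_url(client: RedisLike, url: str, key: str = "crawler:seen") -> bool:
    """Add url to the Bloom filter; True means it was not there before.

    BF.ADD replies 1 when the item was newly added and 0 when it
    (probably) existed already, so one round trip both checks and records.
    """
    return bool(client.execute_command("BF.ADD", key, url))


# Usage, assuming redis-py is installed and the server supports BF.ADD:
#   import redis
#   client = redis.Redis(host="localhost", port=6379)
#   if is_new_url(client, "https://example.com/1"):
#       ...  # schedule the request
```

Because the filter lives in Redis, every worker in a distributed crawler shares the same "seen" state, and it survives restarts as long as Redis persistence is enabled.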

8. Three core reasons why MongoDB is the first-choice database for billion-record crawlers

The original article mentioned 3 points, but here they are simplified into what can be said quickly during an interview:

Schema-free (no fixed table structure): Target site page revisions and field changes do not require table modifications
Perfect support for JSON nesting: Directly store structured data returned by the API
Horizontal scaling (sharded cluster): Processing TB/PB level data is much simpler than MySQL

9. Two methods for field-based deduplication in a MongoDB crawler

Method 1: Set unique index (easiest)

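from pymongo import MongoClient, ASCENDING
from pymongo.errors import DuplicateKeyError

client = MongoClient("mongodb://localhost:27017/")
db = client["spider_db"]
collection = db["product_data"]

# Create a unique index on the url field
collection.create_index([("url", ASCENDING)], unique=True)

# insert_one has no upsert option; with the unique index in place it raises
# DuplicateKeyError for a repeated url, so catch that to skip duplicates
try:
    collection.insert_one({"url": "https://example.com/1", "name": "test"})
except DuplicateKeyError:
    pass  # already stored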

:::details Method 2: Use a Redis Set/Bloom filter pre-check before inserting (high performance)
If deduplication runs at both the crawler request layer and the storage layer, the duplication rate will be even lower!
:::
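
A minimal sketch of the Redis Set pre-check, assuming a redis-py style client with an `sadd` method. The key name `crawler:stored`, the SHA-1 fingerprint, and the helper names are illustrative choices, not part of the original article.

```python
import hashlib
from typing import Any, Protocol


class SetClient(Protocol):
    """Anything with redis-py's sadd method."""

    def sadd(self, name: str, *values: Any) -> int: ...


def fingerprint(url: str) -> str:
    """Fixed-length fingerprint so long URLs do not bloat the set."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def should_store(client: SetClient, url: str, key: str = "crawler:stored") -> bool:
    """SADD returns how many members were actually added: 1 = new, 0 = duplicate."""
    return client.sadd(key, fingerprint(url)) == 1
```

Unlike a Bloom filter, a Set never produces false positives, at the cost of memory that grows linearly with the number of stored fingerprints.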

3. Anti-crawling countermeasures and tool selection (8 core questions)

10. Shorthand classification of anti-crawling methods

The original article lists 4 categories; here they are condensed with emoji + key examples:

🕵️ Identity: UA/Referer/Cookie checks, TLS JA3 fingerprint recognition
⏱️ Behavior: limits on per-IP access frequency, duration, and request count; limits on multiple cookies from a single IP
🚫 Content: JS dynamic rendering, CSS offset visual deception, FontAwesome-style icon-font encryption
🤖 Interaction: slider / click-the-text / rotation CAPTCHAs

11. 4 core solutions to the IP ban problem

Proxy IP pool: build your own or connect to a high-anonymity proxy service (a transparent proxy exposes the real IP)
Adaptive crawling delay: don't use a fixed delay; use random.uniform(1, 3) to simulate human clicks
Distributed crawler: spread traffic across multiple machines or Docker containers
Find an alternate interface: give priority to the App interface (its verification is usually weaker than the PC interface)
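
The adaptive delay idea can be sketched as a random wait that widens after a block and shrinks back after successes. Treating 403 and 429 as block signals, the doubling factor, and the class name `AdaptiveDelay` are assumptions made for illustration.

```python
import random


class AdaptiveDelay:
    """Random delay that widens after blocks and relaxes after successes."""

    def __init__(self, low: float = 1.0, high: float = 3.0, max_factor: float = 8.0):
        self.low = low
        self.high = high
        self.max_factor = max_factor
        self.factor = 1.0

    def record(self, status: int) -> None:
        """Feed back the last response status."""
        if status in (403, 429):
            self.factor = min(self.factor * 2, self.max_factor)  # slow down
        else:
            self.factor = max(self.factor / 2, 1.0)              # recover

    def next_delay(self) -> float:
        """Seconds to sleep before the next request."""
        return random.uniform(self.low, self.high) * self.factor
```

A crawl loop would call `time.sleep(delay.next_delay())` before each request and `delay.record(response.status_code)` after it.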

12. 3 practical solutions for JS dynamic rendering

:::warning
The preferred solution: reverse-engineer the API! Capture the XHR/Fetch requests directly in the F12 Network panel and get structured JSON. Performance is typically 10-100 times higher than with a rendering tool!
:::

If the API cannot be found, use a rendering tool:

Playwright/Puppeteer: modern headless browser automation, better performance than Selenium, with built-in auto-waiting
Selenium + undetected-chromedriver: suitable for legacy projects and able to bypass most basic browser feature detection

13. 3 tips to reduce the risk of Selenium/Playwright being identified

Use undetected-chromedriver (Selenium) or playwright.chromium.launch(headless=False, args=["--disable-blink-features=AutomationControlled"]) (Playwright) to hide navigator.webdriver and other automation traits
Forge a realistic user UA, viewport size, mouse trajectory, and keyboard input speed
Don't fire automated actions too quickly; add page.wait_for_timeout(random.uniform(500, 1500)) to simulate human thinking time

14. Basic processing flow of font encryption (Iconfont)

Download the target site's .woff/.ttf font file
Use the fontTools library to convert the font file into XML or TXT format and obtain the mapping between character codes and glyph coordinates
Use OCR (such as pytesseract) or coordinate comparison to restore the real characters
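
Once the mapping has been recovered, the last step is a plain character substitution. A minimal sketch, where `GLYPH_MAP` is a hypothetical result of the font analysis (private-use code points mapped to the digits they actually render as):

```python
# Hypothetical mapping recovered from the font file:
# private-use code point -> the character it really displays
GLYPH_MAP = {0xE001: "1", 0xE002: "7", 0xE003: "9"}


def decode_obfuscated(text: str, glyph_map: dict[int, str]) -> str:
    """Replace obfuscated code points; other characters pass through unchanged."""
    return text.translate(glyph_map)


raw = "\ue002\ue001\ue003"  # what the scraped HTML contains
print(decode_obfuscated(raw, GLYPH_MAP))
```

Sites often rotate the font file per request or per day, so the mapping usually has to be rebuilt whenever the font URL changes.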

15. Quick reference: starter tools for JS reverse engineering

Debugging tools: Chrome DevTools (breakpoint debugging in the Sources panel)
Execution tools: PyExecJS, nodejs (write a JS script and run it directly)
Deobfuscation tools: AST (abstract syntax tree) based restoration (e.g. js-beautify for basic obfuscation, obfuscator-io-deobfuscator for obfuscator.io (OB) obfuscation)

16. Mainstream bypass solutions for TLS JA3 fingerprint detection

With JA3 fingerprinting, the server inspects the characteristics of the client's TLS handshake (such as TLS version, cipher suites, and elliptic curves) to determine whether the client is Python requests/urllib. Mainstream bypass libraries:

curl_cffi: simulates the JA3/JA4 fingerprints of Chrome/Firefox
httpx: supports custom TLS configuration (requires manual tuning, not as convenient as curl_cffi)

17. Downgrade-first priority strategy for CAPTCHA handling

:::warning
Never start with an AI model or a CAPTCHA-solving platform! They are costly and slow!
:::

Priority:

Look for the app / mini program / lite version of the site (usually no CAPTCHA, or a much weaker one)
Check whether cookies can be reused (save the cookies after logging in once and use them long term)
Only as a last resort, use a CAPTCHA-solving platform (a cheap one such as Super Eagle) or an AI model (CNN for character CAPTCHAs; YOLO + OpenCV template matching + Bezier-curve trajectory simulation for sliders)

4. Engineering and Compliance (2 closing questions)

18. Core design ideas of highly available crawlers (Must be memorized orally)

Research phase: analyze how the target site loads its data (static/dynamic), look for hidden interfaces, check robots.txt, and comply with the robots protocol
Architecture design phase: choose a coroutine library or framework (aiohttp + asyncio for small and medium scale, Scrapy-Redis for large scale), plus a monitoring and alerting module (Prometheus + Grafana + Feishu/DingTalk robot) and a persistence module
Robustness design phase: exception capture + exponential backoff retry, random UA + random Referer + random proxy switching, double deduplication (request layer + storage layer)
Compliance design phase: set a reasonable QPS (requests per second), do not put pressure on the other party's operations, and do not crawl sensitive data
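
For the small and medium scale asyncio route, keeping load on the target site reasonable usually starts with capping in-flight requests. A minimal sketch using only the standard library; `fetch` is a stand-in for a real aiohttp call, and `max_concurrency=5` is an arbitrary illustrative value.

```python
import asyncio


async def fetch(url: str) -> str:
    """Stand-in for a real aiohttp request (hypothetical)."""
    await asyncio.sleep(0.01)
    return f"body of {url}"


async def crawl(urls: list[str], max_concurrency: int = 5) -> list[str]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url: str) -> str:
        async with semaphore:  # at most max_concurrency requests in flight
            return await fetch(url)

    # gather preserves the order of the input URLs
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


results = asyncio.run(crawl([f"https://example.com/{i}" for i in range(20)]))
print(len(results))
```

A semaphore bounds concurrency rather than QPS directly; for a strict requests-per-second ceiling, a per-request delay or a token bucket would be added on top.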

19. 3 core tool chains for crawler engineering

Development Tools: Python 3.10+, PyCharm/VS Code, Chrome DevTools
Deployment tools: Docker, Docker Compose, K8s (for large-scale use)
Monitoring and Alert Tools: Prometheus (monitoring indicator collection), Grafana (visual display), Alertmanager (alarm rule configuration), Feishu/DingTalk/Enterprise WeChat robot (alarm notification)

5. Easter egg: bonus points during the interview

"I usually check robots.txt before crawling and comply with the Disallow rule"
"I usually give priority to the App interface. The verification intensity is lower than the PC interface and the performance is better."
"I will use double deduplication (request layer bloom + storage layer unique index) to reduce the duplication rate"
"I will use Prometheus+Grafana to monitor the crawling success rate, 403 rate, and data growth curve. If there is an abnormality, I will use Feishu Robot to notify you in seconds."

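The robots.txt check mentioned in the first talking point can be done with the standard library. A minimal sketch: the rules are parsed from an inline string here so the example runs offline, and the user agent name `my-crawler` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Inline sample rules; a real crawler would download the site's /robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def allowed(url: str, user_agent: str = "my-crawler") -> bool:
    """True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/public/page"))
print(allowed("https://example.com/private/page"))
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` loads the real file before the `can_fetch` checks.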