🚀 Selected high-frequency interview questions for crawler engineers
Hello everyone! This article is a simplified and optimized version of the 20 high-frequency crawler engineer interview questions that I compiled: duplicates from the original are removed, practical code snippets are added, and key points are tucked into collapsible tips. It runs to a little over 2k words, which makes it convenient for a quick review before an interview~
1. Basic understanding of crawlers (5 must-memorize introductory questions)
1. Briefly describe what a crawler is and the optimized workflow used at large companies
The ordinary version is "Request→Response→Parse→Store". The premium version used at large companies looks like this:
- Deduplication layer: check a Bloom filter/Redis Set before sending the request
- Request control layer: build compliant headers, select a proxy IP, and add an adaptive delay
- Request/response exception capture layer: try-except for timeouts, connection errors, and abnormal status codes
- Response parsing layer: choose the appropriate parser
- Pre-storage deduplication layer: avoid writing duplicates into the database
- Monitoring and alerting layer: an abnormal failure rate or parse success rate triggers a notification
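The layered flow above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a production design: `run_pipeline`, the injected `fetch` and `parse` callables, the `id` field, and the in-memory set/dict (standing in for Bloom/Redis and the database) are all hypothetical.

```python
from typing import Callable, Iterable


def run_pipeline(urls: Iterable[str],
                 fetch: Callable[[str], str],
                 parse: Callable[[str], dict],
                 failure_alert_ratio: float = 0.5) -> dict:
    """Run URLs through dedup -> fetch -> parse -> storage dedup -> monitoring."""
    seen_urls = set()   # stand-in for a Bloom filter / Redis Set
    stored = {}         # stand-in for the database, keyed by item id
    stats = {"requested": 0, "failed": 0, "stored": 0}

    for url in urls:
        # Deduplication layer: skip URLs that were already requested.
        if url in seen_urls:
            continue
        seen_urls.add(url)

        # Request control + exception capture layers (headers, proxy and
        # delay would be applied inside the injected `fetch`).
        stats["requested"] += 1
        try:
            body = fetch(url)
            item = parse(body)   # response parsing layer
        except Exception:
            stats["failed"] += 1
            continue

        # Pre-storage deduplication layer.
        key = item["id"]
        if key not in stored:
            stored[key] = item
            stats["stored"] += 1

    # Monitoring layer: flag the run when too many requests failed.
    requested = stats["requested"]
    stats["alert"] = requested > 0 and stats["failed"] / requested > failure_alert_ratio
    return stats
```

Passing `fetch` and `parse` in as arguments keeps each layer replaceable, which also makes the flow easy to exercise with fake functions.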
2. The practical difference between GET and POST in crawlers (key when choosing an API)
3. How a crawler should respond to common HTTP status codes
The original article only covered what the codes mean; here are responses you can actually give in an interview:
- 200 OK: first check whether the returned content is real data (anti-crawling may return an empty HTML shell!)
- 301/302 Redirect: `requests` follows redirects automatically by default; this can be turned off (`allow_redirects=False`) to read the `Location` field yourself; in Scrapy, handling can be customized in `DOWNLOADER_MIDDLEWARES`
- 403 Forbidden: most likely anti-crawling was triggered (UA recognized, IP over its limit, Referer/Cookie missing)
- 404 Not Found: either the URL is misspelled or the page no longer exists. Set a retry limit and mark the URL as dead once it is reached.
- 500 Internal Server Error: a server-side problem; retry with exponential backoff
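The responses above can be turned into a small decision helper. This is a sketch under assumptions: the action names (`parse`, `retry`, `rotate_identity`, `mark_dead`), the retry limit, and the backoff parameters are illustrative choices, and redirects are assumed to be handled by the HTTP client.

```python
import random


def next_action(status: int, body: str, attempt: int, max_retries: int = 3) -> str:
    """Map a response to one of: 'parse', 'retry', 'rotate_identity', 'mark_dead'."""
    if status == 200:
        # An anti-crawling page can still answer 200 with an empty shell.
        return "parse" if body.strip() else "rotate_identity"
    if status == 403:
        # Change UA / proxy / cookies before trying again.
        return "rotate_identity"
    if status == 404 or 500 <= status < 600:
        return "retry" if attempt < max_retries else "mark_dead"
    return "mark_dead"


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff: base * 2**attempt, capped, plus up to 1s of jitter."""
    return min(cap, base * (2 ** attempt)) + random.uniform(0, 1)
```

The jitter keeps many workers from retrying at the same instant after a shared failure.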
4. Code example of a GET request with parameters
Written with Python's `requests` library (a must for beginners) and the `httpx` library (which supports async):
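The original snippet is not reproduced here, so the following is a minimal sketch of what such a request could look like. The URL, the parameter names, and the `HEADERS` value are placeholders; `build_url` is a hypothetical helper that only shows how the parameters end up in the query string.

```python
from urllib.parse import urlencode

# Placeholder header; a real crawler would send a realistic browser UA.
HEADERS = {"User-Agent": "Mozilla/5.0 (example UA)"}


def build_url(base_url: str, params: dict) -> str:
    """Show the final URL once the parameters are encoded."""
    return f"{base_url}?{urlencode(params)}"


def get_with_requests(base_url: str, params: dict) -> str:
    """Synchronous GET; `params` is encoded into the query string."""
    import requests  # third-party: pip install requests
    resp = requests.get(base_url, params=params, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.text


async def get_with_httpx(base_url: str, params: dict) -> str:
    """Asynchronous GET with the same parameters."""
    import httpx  # third-party: pip install httpx
    async with httpx.AsyncClient(headers=HEADERS, timeout=10) as client:
        resp = await client.get(base_url, params=params)
        resp.raise_for_status()
        return resp.text
```

The async version would be run with `asyncio.run(get_with_httpx(url, params))`. The third-party imports sit inside the functions only so that the sketch loads even where one of the libraries is not installed.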
5. The difference between HTTP and HTTPS, and how a production crawler should handle SSL
HTTPS = HTTP + an SSL/TLS encryption layer. Transmission is more secure, but the handshake is slower and certificate problems can occur.
Don't casually add `verify=False` in production (it is simple, but it produces security warnings and may be recognized by anti-crawling). Suggestions:
- Update the `certifi` library: `pip install --upgrade certifi`
- If the site uses a self-signed certificate, specify its path manually: `verify="/path/to/your/self-signed.crt"`
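For clients that take an SSL context instead of a path, the same idea can be sketched with the standard library. `make_ssl_context` is a hypothetical helper name, and any certificate path passed to it is a placeholder.

```python
import ssl
from typing import Optional


def make_ssl_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a verifying TLS context; pass ca_file for a self-signed certificate."""
    # create_default_context() keeps certificate and hostname checks enabled,
    # which is the opposite of verify=False.
    return ssl.create_default_context(cafile=ca_file)
```

The resulting context can be passed to `urllib.request.urlopen(url, context=ctx)`; `httpx` also accepts an `ssl.SSLContext` as its `verify` argument.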
2. Data parsing and deduplication (5 engineering questions)
6. Decision tree for choosing among the three mainstream parsers
The original article lists the advantages and disadvantages directly; a decision tree is more intuitive:
7. Bloom filter: the first choice for deduplicating billions of URLs
The original article covered the advantages and the principle. Here is a Python implementation plus a tip on the Redis version (interviewers often ask how to use Redis for a persistent Bloom filter):
Recommended: the `redis-py-cluster` + `pyprobables` libraries, or the `BF.ADD`/`BF.EXISTS` Bloom filter commands on Redis 4.0+ (provided by the RedisBloom module)!
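Since the original code is not included here, the following is a minimal in-memory sketch of the principle, not the author's implementation. The class name, the SHA-256 double-hashing scheme, and the default error rate are assumptions made for illustration.

```python
import hashlib
import math


class BloomFilter:
    """In-memory Bloom filter: no false negatives, a small false-positive rate."""

    def __init__(self, capacity: int, error_rate: float = 0.001):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.size = max(8, math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2)))
        self.hash_count = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, value: str):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))
```

Usage is `bf.add(url)` followed by `url in bf`. With the RedisBloom module loaded, the persistent equivalent would be `BF.ADD key url` and `BF.EXISTS key url` against a shared Redis key, so that several crawler processes see the same filter.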
8. Three core reasons why MongoDB is the first-choice database for billion-record crawlers
The original article mentioned 3 points; here they are condensed into something you can say quickly in an interview:
- Schema-free (no fixed table structure): when the target site redesigns its pages or changes fields, no table changes are needed
- Native support for nested JSON: structured data returned by an API can be stored directly
- Horizontal scaling (sharded cluster): handling TB/PB-scale data is much simpler than with MySQL
9. Two methods for deduplicating records in a MongoDB-backed crawler
Method 1: Set a unique index (easiest)
Method 2: Pre-check with a Redis Set/Bloom filter before inserting (high performance)
If deduplication is done at both the crawler request layer and the storage layer, the duplication rate will be even lower!
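Both methods can be combined in a few lines. The sketch below assumes a `pymongo`-style collection object (only `create_index` and `update_one` are called) and uses a plain Python `set` as a stand-in for the Redis Set/Bloom filter; the `url_hash` field name and the helper names are hypothetical.

```python
import hashlib


def url_fingerprint(url: str) -> str:
    """Fixed-length key for a URL, suitable for an index or a Redis Set."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def ensure_unique_index(collection) -> None:
    # Method 1: a unique index makes MongoDB itself reject duplicates.
    collection.create_index("url_hash", unique=True)


def save_item(collection, seen: set, item: dict) -> bool:
    """Return True if the item was newly stored, False if it was a duplicate."""
    fp = url_fingerprint(item["url"])
    # Method 2: cheap pre-check before touching the database
    # (a Python set here; a Redis Set or Bloom filter in practice).
    if fp in seen:
        return False
    seen.add(fp)
    # The upsert only writes when no document with this hash exists yet.
    result = collection.update_one(
        {"url_hash": fp},
        {"$setOnInsert": {**item, "url_hash": fp}},
        upsert=True,
    )
    return result.upserted_id is not None
```

The pre-check saves a database round trip for most duplicates, while the unique index remains the final safeguard when the in-memory state is lost or several workers race.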
3. Anti-crawling countermeasures and tool selection (8 core questions)
10. Shorthand classification of anti-crawling methods
The original article lists 4 categories; here they are condensed with emoji + key examples:
- 🕵️ Identity category: UA/Referer/Cookie checks, TLS JA3 fingerprint recognition
- ⏱️ Behavior category: limits on per-IP access frequency, duration, and request count; limits on multiple cookies from a single IP
- 🚫 Content category: JS dynamic rendering, CSS-offset visual deception, FontAwesome font encryption
- 🤖 Interaction category: slider/text-click/rotation CAPTCHAs
11. 4 core solutions to the IP ban problem
- Proxy IP pool: build your own or use a high-anonymity proxy service (a transparent proxy exposes the real IP)
- Adaptive crawl delay: don't use a fixed delay; use `random.uniform(1, 3)` to simulate human clicks
- Distributed crawler: spread traffic across multiple machines/multiple Docker containers
- Find an alternative interface: give priority to the App interface (its verification is usually weaker than the PC interface's)
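The first two points can be sketched together. The `ProxyPool` class, the proxy addresses, and the rule of stretching the delay by the number of recent 403 responses are assumptions made for illustration.

```python
import random


class ProxyPool:
    """Keep a list of usable proxies and drop the ones that get banned."""

    def __init__(self, proxies):
        self.alive = list(proxies)

    def pick(self) -> str:
        if not self.alive:
            raise RuntimeError("no proxies left")
        return random.choice(self.alive)

    def ban(self, proxy: str) -> None:
        if proxy in self.alive:
            self.alive.remove(proxy)


def adaptive_delay(recent_403: int, low: float = 1.0, high: float = 3.0) -> float:
    """Random base delay, stretched as recent 403 responses accumulate."""
    return random.uniform(low, high) * (1 + recent_403)
```

A caller would `time.sleep(adaptive_delay(count_403))` before each request and call `pool.ban(proxy)` when a proxy starts returning 403.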
12. 3 practical solutions for JS dynamic rendering
:::warning
The preferred solution: reverse-engineer the API! Capture the XHR/Fetch requests directly in the F12 Network panel and obtain structured JSON. Performance is 10-100 times better than with a rendering tool!
:::
If no API endpoint can be found, use a rendering tool:
- Playwright/Puppeteer: modern headless browsers with better performance than Selenium and built-in auto-waiting
- Selenium + undetected-chromedriver: suitable for legacy projects; it can bypass most basic browser-feature detection
13. 3 tips for reducing the risk of Selenium/Playwright being detected
- Use `undetected-chromedriver` (Selenium) or `playwright.chromium.launch(headless=False, args=["--disable-blink-features=AutomationControlled"])` (Playwright) to hide `navigator.webdriver` and other telltale properties
- Spoof a realistic user UA, viewport size, mouse trajectory, and keyboard input speed
- Don't start automated actions too quickly; add `page.wait_for_timeout(random.uniform(500, 1500))` to simulate human thinking time
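Put together for Playwright, the three tips might look like the sketch below. The function names, the viewport size, and the caller-supplied UA are placeholders, and whether these settings are enough depends on the target site's detection.

```python
import random

# Launch flag from the tip above; hides the AutomationControlled Blink feature.
LAUNCH_ARGS = ["--disable-blink-features=AutomationControlled"]


def human_pause_ms(low: float = 500, high: float = 1500) -> float:
    """Random pause length in milliseconds, to avoid machine-regular timing."""
    return random.uniform(low, high)


def fetch_rendered_html(url: str, user_agent: str) -> str:
    """Open the page in a headed Chromium and return the rendered HTML."""
    from playwright.sync_api import sync_playwright  # third-party: pip install playwright
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False, args=LAUNCH_ARGS)
        context = browser.new_context(
            user_agent=user_agent,
            viewport={"width": 1366, "height": 768},
        )
        page = context.new_page()
        page.goto(url)
        page.wait_for_timeout(human_pause_ms())  # pause before touching the page
        html = page.content()
        browser.close()
        return html
```

Mouse trajectory and typing speed are not covered here; they would need extra calls such as `page.mouse.move(...)` with intermediate steps.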
14. Basic processing flow for font encryption (Iconfont)
- Download the target site's `.woff`/`.ttf` font file
- Use the `fontTools` library to convert the font file to XML or TXT format and obtain the mapping between character codes and glyph coordinates
- Use OCR (such as `pytesseract`) or coordinate comparison to recover the real characters
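The last two steps can be sketched as follows. `load_cmap` relies on `fontTools`' `TTFont.getBestCmap()`; the `glyph_to_char` table is a hypothetical result of the OCR or coordinate-comparison step, which is site-specific and not shown.

```python
def load_cmap(font_path: str) -> dict:
    """Return the font's mapping from Unicode code point to glyph name."""
    from fontTools.ttLib import TTFont  # third-party: pip install fonttools
    font = TTFont(font_path)
    return font.getBestCmap()


def decode_text(obfuscated: str, cmap: dict, glyph_to_char: dict) -> str:
    """Replace each obfuscated character by the real one its glyph draws."""
    out = []
    for ch in obfuscated:
        glyph = cmap.get(ord(ch))                  # code point -> glyph name
        out.append(glyph_to_char.get(glyph, ch))   # glyph name -> real character
    return "".join(out)
```

Characters that are not in the custom font fall through unchanged, so ordinary punctuation in the page text survives decoding.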
15. Quick reference: getting-started tools for JS reverse engineering
- Debugging tools: Chrome DevTools (breakpoint debugging in the Sources panel)
- Execution tools: `PyExecJS`, `nodejs` (write a JS script and execute it directly)
- Deobfuscation tools: AST (abstract syntax tree) based tools (such as `js-beautify` for basic obfuscation and `obfuscator-io-deobfuscator` for OB obfuscation)
16. Mainstream bypass solutions for TLS JA3 fingerprint detection
With JA3 fingerprinting, the server inspects characteristics of the client's TLS handshake (such as the TLS version, cipher suites, and elliptic curves) to determine whether the client is Python's `requests`/`urllib`.
Mainstream bypass libraries:
- curl_cffi: simulates the JA3/JA4 fingerprints of Chrome/Firefox
- httpx: supports custom TLS configuration (manual tuning is required, so it is less convenient than curl_cffi)
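To make the fingerprint concrete, the sketch below computes a JA3 hash from handshake fields and shows a `curl_cffi` request that impersonates a browser. The numeric values passed to `ja3_hash` in any example are illustrative rather than a real browser's handshake, and `get_like_chrome` is a hypothetical wrapper name.

```python
import hashlib


def ja3_hash(tls_version: int, ciphers, extensions, curves, point_formats) -> str:
    """JA3 = MD5 of 'version,ciphers,extensions,curves,point_formats'.

    Values inside each field are joined with '-', fields with ','.
    """
    fields = [str(tls_version)] + [
        "-".join(str(v) for v in values)
        for values in (ciphers, extensions, curves, point_formats)
    ]
    return hashlib.md5(",".join(fields).encode("ascii")).hexdigest()


def get_like_chrome(url: str) -> str:
    """Send a GET whose TLS handshake imitates Chrome instead of Python."""
    from curl_cffi import requests as cffi_requests  # third-party: pip install curl_cffi
    resp = cffi_requests.get(url, impersonate="chrome", timeout=10)
    return resp.text
```

Because the order of cipher suites and extensions is part of the hashed string, a client that offers the same ciphers in a different order still gets a different fingerprint, which is why swapping the User-Agent alone does not help.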
17. Downgrade-first priority strategy for CAPTCHA handling
:::warning
Never start with an AI model or a CAPTCHA-solving platform! They are expensive and slow!
:::
Priority:
- Look for an app/mini program/lite version of the web page (usually no CAPTCHA, or very weak verification)
- Check whether cookies can be reused (save the cookies after logging in once and use them long-term)
- Only as a last resort, use a CAPTCHA-solving platform (a cheap one such as Super Eagle) or an AI model (a CNN for character CAPTCHAs; YOLO + OpenCV template matching + Bezier-curve trajectory simulation for sliders)
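The Bezier-curve trajectory simulation mentioned for sliders can be sketched as below. The control-point ranges, the step count, and the ease-out timing are arbitrary illustrative choices, and finding the gap distance (template matching) is assumed to have been done already.

```python
import random


def bezier_track(distance: float, steps: int = 30):
    """Return (x, y) points along a cubic Bezier curve from (0, 0) to (distance, 0)."""
    # Two random control points bend the path so it is not a straight line.
    p0, p3 = (0.0, 0.0), (float(distance), 0.0)
    p1 = (distance * random.uniform(0.2, 0.4), random.uniform(-15, 15))
    p2 = (distance * random.uniform(0.6, 0.8), random.uniform(-15, 15))
    points = []
    for i in range(steps + 1):
        t = i / steps
        # Ease-out: cover most of the distance early and slow down near the gap.
        t = 1 - (1 - t) ** 2
        a, b, c, d = (1 - t) ** 3, 3 * (1 - t) ** 2 * t, 3 * (1 - t) * t ** 2, t ** 3
        x = a * p0[0] + b * p1[0] + c * p2[0] + d * p3[0]
        y = a * p0[1] + b * p1[1] + c * p2[1] + d * p3[1]
        points.append((x, y))
    return points
```

Each point would then be fed to the browser's mouse-move call with a short random pause between steps.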
4. Engineering and compliance (2 closing questions)
18. Core design ideas for a highly available crawler (be able to recite this)
- Research phase: analyze how the target site loads data (static/dynamic), look for hidden interfaces, check robots.txt, and comply with the robots protocol
- Architecture design phase: choose a coroutine library/framework (`aiohttp` + `asyncio` for small and medium scale, `Scrapy-Redis` for large scale), a monitoring and alerting module (Prometheus + Grafana + Feishu/DingTalk robot), and a persistence module
- Robustness design phase: exception capture + exponential-backoff retry; random UA + random Referer + random proxy switching; double deduplication (request layer + storage layer)
- Compliance design phase: set a reasonable QPS (requests per second), don't put pressure on the other side's operations team, and don't crawl sensitive data
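The research and compliance points can be sketched with the standard library alone. `build_robot_rules` and `QpsLimiter` are hypothetical helper names; a real crawler would download the site's actual robots.txt and pick a QPS value suited to the target.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class QpsLimiter:
    """Block so that calls to wait() are at least 1/qps seconds apart."""

    def __init__(self, qps: float):
        self.min_interval = 1.0 / qps
        self._last = float("-inf")   # no request has been sent yet

    def wait(self) -> None:
        sleep_for = self._last + self.min_interval - time.monotonic()
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()
```

Before each request the crawler would check `rules.can_fetch(user_agent, url)` and then call `limiter.wait()`.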
19. 3 core tool chains for crawler engineering
- Development Tools: Python 3.10+, PyCharm/VS Code, Chrome DevTools
- Deployment tools: Docker, Docker Compose, K8s (for large-scale use)
- Monitoring and Alert Tools: Prometheus (monitoring indicator collection), Grafana (visual display), Alertmanager (alarm rule configuration), Feishu/DingTalk/Enterprise WeChat robot (alarm notification)
5. Easter egg: bonus points during the interview
- "I usually check robots.txt before crawling and comply with the Disallow rule"
- "I usually give priority to the App interface. The verification intensity is lower than the PC interface and the performance is better."
- "I will use double deduplication (request layer bloom + storage layer unique index) to reduce the duplication rate"
- "I will use Prometheus+Grafana to monitor the crawl success rate, the 403 rate, and the data growth curve. If anything looks abnormal, a Feishu robot sends an alert within seconds."

