HTTP protocol basics and crawler principles
Introduction
If you want to become an excellent crawler developer, a deep understanding of the HTTP protocol is the first step. HTTP is the cornerstone of data transmission on the Internet and the "language" in which crawlers talk to web servers. This article walks you from basic concepts to practical applications so that you can systematically master the core of HTTP and lay a solid foundation for crawler development.
1. Detailed explanation of URI and URL
Basic concepts
We often hear the terms URI and URL. What is the difference between them?
- URI (Uniform Resource Identifier): used to uniquely identify a resource on the Internet, just like the "ID card number" of the resource.
- URL (Uniform Resource Locator): It is a subset of URI. It not only identifies the resource, but also tells us how to find it, which is equivalent to the "home address" of the resource.
- URN (Uniform Resource Name): only names the resource without specifying the location. It is rarely used in the modern Internet.
Simply put: all URLs are URIs, but not all URIs are URLs. Almost all the addresses we use every day are URLs.
URL structure parsing
A complete URL is like a detailed address and contains multiple components. Let's break it down with a practical example:
https://www.example.com:8080/articles/index.html?page=1&sort=time#section2
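The components of this example URL can be pulled apart with `urlsplit` from Python's standard library. This is only a small sketch; the labels in the comments (protocol, host, and so on) are the usual names for each part.

```python
from urllib.parse import urlsplit

url = "https://www.example.com:8080/articles/index.html?page=1&sort=time#section2"
parts = urlsplit(url)

print(parts.scheme)    # protocol: https
print(parts.hostname)  # host: www.example.com
print(parts.port)      # port: 8080
print(parts.path)      # path: /articles/index.html
print(parts.query)     # query string: page=1&sort=time
print(parts.fragment)  # fragment: section2
```

Note that the fragment is handled by the browser and is not sent to the server, so a crawler usually only cares about everything before the `#`.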
Common characteristics of modern URLs
In daily development, there are some rules in the use of URLs worth paying attention to:
- Query parameters: a core component, often used for paging, filtering, etc. The format is ?key1=value1&key2=value2.
- Fragment ID: widely used in front-end routing of single-page applications (Vue/React), or for anchor jumps within the page.
- Default port: HTTP defaults to 80, HTTPS defaults to 443, which can usually be omitted.
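As a rough illustration of these rules, the sketch below builds a paged URL from query parameters and works out which port a URL actually uses. The helper names `build_page_url` and `effective_port` and the `/articles` path are hypothetical, chosen only for this example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def build_page_url(base, page, sort):
    """Append paging and sorting query parameters to a base URL."""
    return f"{base}?{urlencode({'page': page, 'sort': sort})}"

def effective_port(url):
    """Return the explicit port, or the scheme's default when it is omitted."""
    parts = urlsplit(url)
    return parts.port or DEFAULT_PORTS.get(parts.scheme)

url = build_page_url("https://www.example.com/articles", 2, "time")
print(url)                            # https://www.example.com/articles?page=2&sort=time
print(parse_qs(urlsplit(url).query))  # {'page': ['2'], 'sort': ['time']}
print(effective_port(url))            # 443
```

Using `urlencode` rather than string concatenation also takes care of escaping special characters in parameter values.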
2. HTTP / HTTPS protocol
HTTP protocol
HTTP (Hypertext Transfer Protocol) is the core of the Web, and its development has gone through several important versions:
- HTTP/1.0 (1996): Early version, a new connection was established for each request.
- HTTP/1.1 (1997): The current mainstream, supporting features such as persistent connections and virtual hosts.
- HTTP/2 (2015): Based on binary, supports multiplexing and greatly improves performance.
- HTTP/3 (2022): Built on QUIC, which runs over UDP. It shortens connection setup and copes better with packet loss and network changes.
HTTPS protocol
HTTPS is the "secure version of HTTP". It adds an SSL/TLS encryption layer to HTTP and has three main advantages:
- Encrypted transmission: Data is encrypted during transmission to prevent theft.
- Authentication: Confirm the true identity of the website through the CA certificate to prevent phishing websites.
- Data Integrity: It can prevent data from being tampered with during transmission.
Today, major browsers flag plain-HTTP websites as "Not secure", and platforms such as WeChat mini programs and app stores also require HTTPS.
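For a crawler, the practical consequence is that certificate checks should stay switched on. As a minimal sketch, Python's `ssl.create_default_context()` returns a context with both certificate verification and hostname checking enabled:

```python
import ssl

# The default context verifies the server certificate against trusted CAs
# and checks that the certificate matches the requested hostname.
context = ssl.create_default_context()

print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(context.check_hostname)                    # True

# It can then be passed to an HTTPS request, for example:
# urllib.request.urlopen("https://www.example.com/", context=context)
```

Disabling these checks removes the authentication guarantee described above, so it is best reserved for local debugging.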
3. HTTP request-response process
Complete request process
When you enter a URL in the browser address bar and press Enter, the following events will occur in sequence:
- DNS resolution: The browser translates the domain name into the IP address of the server.
- Establish connection: Establish a TCP connection with the server (three-way handshake).
- Send request: The browser sends an HTTP request to the server.
- Processing request: The server receives the request and processes it.
- Return response: The server encapsulates the result as an HTTP response and sends it back.
- Render page: The browser parses the response content and renders the web page.
- Close connection: When communication ends, the TCP connection is closed (four-way handshake). With persistent connections it may first be reused for further requests.
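The steps above can be sketched with a raw socket, which makes the DNS lookup, the TCP connection and the plain-text request visible. This is a learning aid under simple assumptions (plain HTTP on port 80, no redirects, no TLS); the function names `build_raw_request` and `fetch_raw` are made up for this example.

```python
import socket

def build_raw_request(host, path="/"):
    """Build a minimal HTTP/1.1 GET request as bytes."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",  # ask the server to close the connection afterwards
        "",
        "",                   # blank line marks the end of the headers
    ]
    return "\r\n".join(lines).encode("ascii")

def fetch_raw(host, port=80, path="/", timeout=10):
    """Resolve the host, connect over TCP, send the request, read the reply."""
    ip = socket.gethostbyname(host)  # DNS resolution: domain name -> IP address
    # create_connection performs the TCP three-way handshake
    with socket.create_connection((ip, port), timeout=timeout) as sock:
        sock.sendall(build_raw_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)

# Example (needs network access):
# print(fetch_raw("www.example.com").split(b"\r\n", 1)[0])
```

In real crawlers a library handles all of this, but seeing the raw bytes once makes the request line, headers and blank-line separator easier to recognize in developer tools.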
Use developer tools to analyze
Chrome Developer Tools is one of the most practical tools for learning HTTP. Press F12 to open it, then switch to the Network panel:
- General area: View basic information such as request URL, method, status code, etc.
- Headers area: View the details of request headers and response headers.
- Preview/Response: View the actual content returned by the server.
- Timing: Analyze the time consumption of each stage of the request.
4. Detailed explanation of HTTP requests
Request method
HTTP defines multiple request methods, each with a different purpose:
Idempotence: Executing the request once has the same effect as executing it multiple times. For example, a GET request will not change the state on the server no matter how many times it is called.
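To make the idea concrete, here is a toy in-memory model, not a real server. `FakeServer` is a hypothetical class. It follows standard HTTP semantics, where PUT (replace a resource at a known location) is idempotent and POST (create a new resource) is not.

```python
class FakeServer:
    """Toy stand-in for server-side state, used only to illustrate idempotence."""

    def __init__(self):
        self.items = {}
        self._next_id = 1

    def get(self, item_id):
        # Reading never changes state, so GET is idempotent.
        return self.items.get(item_id)

    def put(self, item_id, value):
        # Writing the same value to the same ID again leaves the same state.
        self.items[item_id] = value

    def post(self, value):
        # Every call creates a new item, so repeating it changes the state.
        item_id = self._next_id
        self.items[item_id] = value
        self._next_id += 1
        return item_id

server = FakeServer()
server.put(10, "a")
server.put(10, "a")
print(len(server.items))  # 1: the repeated PUT changed nothing

server.post("b")
server.post("b")
print(len(server.items))  # 3: each POST added a new item
```

This matters for retry logic: repeating an idempotent request after a timeout is safe, while blindly repeating a POST may submit the same data twice.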
Request headers (Headers)
Request headers are "additional information" passed by the browser to the server, just like the notes written on the envelope when sending a letter. For crawlers, these headers are very important:
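As a hedged sketch, the snippet below attaches a few headers that crawlers commonly set, using the standard library's `urllib.request.Request`. The specific header values, including the User-Agent string, are illustrative assumptions rather than recommended settings.

```python
from urllib.request import Request

headers = {
    # Identifies the client; many sites reject requests without a browser-like value.
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    # Content types and languages the client is willing to accept.
    "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    # The page the request claims to come from.
    "Referer": "https://www.example.com/",
}

request = Request("https://www.example.com/articles/index.html", headers=headers)

print(request.get_method())  # GET
# urllib stores header names in capitalized form, e.g. "User-agent".
print(request.get_header("User-agent"))
```

Nothing is sent until the request object is passed to `urllib.request.urlopen`, so this block can be run safely offline.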
Request body (Body)
The request body is the actual data carried by the request. It is normally present only for methods such as POST and PUT. There are three common formats:
- application/x-www-form-urlencoded: traditional form format, such as username=admin&password=123.
- application/json: JSON format, for example {"username":"admin","password":"123"}; the most common choice for modern APIs.
- multipart/form-data: used for file upload.
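The first two formats can be produced from the same data with the standard library. This is a small sketch that only builds the bodies; multipart encoding is left out because it is usually delegated to an HTTP library.

```python
import json
from urllib.parse import urlencode

payload = {"username": "admin", "password": "123"}

# application/x-www-form-urlencoded
form_body = urlencode(payload).encode("utf-8")

# application/json (compact separators to match the example above)
json_body = json.dumps(payload, separators=(",", ":")).encode("utf-8")

print(form_body)  # b'username=admin&password=123'
print(json_body)  # b'{"username":"admin","password":"123"}'
```

Whichever format is used, the Content-Type request header has to match it, otherwise the server may fail to parse the body.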
5. Detailed explanation of HTTP response
Status code
The status code is a three-digit number with which the server tells the client how the request was handled. Codes fall into five classes:
- 1xx: Information prompt, temporary response.
- 2xx: Success, the request has been processed normally.
- 3xx: Redirect, further action required.
- 4xx: Client error, there is a problem with the request itself.
- 5xx: Server error, something went wrong on the server side.
Common status codes:
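A few codes that crawlers meet often can be looked up with Python's built-in `http.HTTPStatus` enum. The selection below (200, 301, 302, 403, 404, 500, 503) is an assumption made for illustration, and `status_class` is a hypothetical helper.

```python
from http import HTTPStatus

def status_class(code):
    """Map a status code to one of the five classes by its first digit."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "unknown")

for code in (200, 301, 302, 403, 404, 500, 503):
    status = HTTPStatus(code)
    print(code, status.phrase, "->", status_class(code))
```

In a crawler, the class is often enough to decide what to do next: follow a 3xx, give up on most 4xx, and retry a 5xx after a pause.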
Response headers (Headers)
The response header is the "additional information" returned by the server. The important fields are:
Response body (Body)
The response body is the actual content returned by the server. Its format depends on the Content-Type header:
- text/html: HTML web page, parsed with BeautifulSoup, etc.
- application/json: JSON data, parsed with json library.
- image/jpeg, image/png: pictures, saved directly.
- application/octet-stream: Binary files, such as downloaded PDF, ZIP.
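One way to act on this in code is to branch on the media type before decoding. The sketch below parses the header with the standard library's `email.message.Message`. The helpers `parse_content_type` and `decode_body` are hypothetical, and falling back to UTF-8 when no charset is given is an assumption.

```python
import json
from email.message import Message

def parse_content_type(header_value):
    """Split a Content-Type header into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()

def decode_body(header_value, body):
    """Turn raw response bytes into a Python value based on Content-Type."""
    media_type, charset = parse_content_type(header_value)
    if media_type == "application/json":
        return json.loads(body.decode(charset or "utf-8"))
    if media_type.startswith("text/"):
        # HTML or plain text: return a string for an HTML parser to work on.
        return body.decode(charset or "utf-8", errors="replace")
    # Images and other binary content: keep the raw bytes for saving to disk.
    return body

print(decode_body("application/json", b'{"ok": true}'))
print(decode_body("text/html; charset=utf-8", b"<h1>hi</h1>"))
print(decode_body("image/png", b"\x89PNG"))
```

The returned HTML string can then be passed to a parser such as BeautifulSoup, as mentioned in the list above.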
6. Application practice in crawlers
Key Notes
When writing a crawler, the following points are crucial:
- Set reasonable request headers: Especially User-Agent, it must simulate a real browser, otherwise it will be easily banned.
- Handle cookies: To maintain login status, use a Session object so that cookies are handled automatically.
- Control request frequency: Do not put too much pressure on the server and increase the delay appropriately.
- Exception handling and retry: Network requests may fail at any time, and a retry mechanism must be in place.
- Follow robots.txt: Comply with the crawling rules set by the website.
Crawler code example
Here is a simple but practical crawler example:
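What follows is a minimal sketch that puts the points above together. It is written with the standard library only, so it uses a cookie-aware opener in place of a requests Session. All function names, the User-Agent string, the retry count and the delays are assumptions for illustration, not tuned values.

```python
import time
import urllib.error
import urllib.request
import urllib.robotparser
from http.cookiejar import CookieJar
from urllib.parse import urljoin

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

def make_opener():
    """Create an opener that keeps cookies across requests, like a session."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    opener.addheaders = [("User-Agent", USER_AGENT)]
    return opener

def allowed_by_robots(url, user_agent=USER_AGENT):
    """Check the site's robots.txt before crawling a URL."""
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(urljoin(url, "/robots.txt"))
    parser.read()
    return parser.can_fetch(user_agent, url)

def backoff_delay(attempt, base=1.0):
    """Exponential backoff: base, 2*base, 4*base, ... for attempt 0, 1, 2, ..."""
    return base * (2 ** attempt)

def fetch(opener, url, retries=3, timeout=10):
    """Fetch a URL, retrying on network errors and 5xx responses."""
    for attempt in range(retries):
        try:
            with opener.open(url, timeout=timeout) as response:
                charset = response.headers.get_content_charset() or "utf-8"
                return response.read().decode(charset, errors="replace")
        except urllib.error.HTTPError as err:
            if err.code < 500:  # 4xx: the request itself is wrong, do not retry
                raise
        except OSError:
            pass  # connection problem or timeout: retry
        time.sleep(backoff_delay(attempt))
    raise RuntimeError(f"giving up on {url} after {retries} attempts")

def crawl(urls, delay=1.0):
    """Fetch each allowed URL in turn, pausing between requests."""
    opener = make_opener()
    pages = {}
    for url in urls:
        if not allowed_by_robots(url):
            continue  # respect the site's crawling rules
        pages[url] = fetch(opener, url)
        time.sleep(delay)  # control request frequency
    return pages

# Example (needs network access):
# pages = crawl(["https://www.example.com/"])
```

The same structure carries over to the requests library, where a Session object takes the place of the opener and manages cookies in the same way.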
7. Recommended commonly used tools
Development and debugging tools
- Chrome DevTools: The most practical debugging tool that comes with the browser.
- Postman: API testing tool that makes debugging interfaces very convenient.
- cURL: Command line tool to quickly test HTTP requests.
- Charles/Fiddler: HTTP proxy tool, capable of intercepting and analyzing requests.
Python libraries
- requests: The most popular HTTP request library, simple and easy to use.
- httpx: Modern request library supporting HTTP/2 and async.
- beautifulsoup4: HTML parsing library, suitable for extracting data.
- lxml: Efficient XML/HTML parsing library.
- selenium/playwright: Browser automation tool for handling JavaScript-rendered pages.
8. Best practices and learning suggestions
Compliant crawling
- Comply with robots.txt: Check the website’s crawling rules before crawling.
- Control request frequency: Avoid putting pressure on the target server.
- Respect Copyright: Use the crawled data legally.
- Protect privacy: Do not crawl personal sensitive information.
Learning suggestions
- Use developer tools often: Observing real HTTP requests and responses teaches more than reading alone.
- Start with a simple website: Crawl static pages first, and then gradually challenge complex targets.
- Learn JavaScript: Many modern websites rely on front-end rendering, and understanding JS will go a long way.
- Pay attention to anti-crawling mechanisms: Understanding common anti-crawling techniques and how to respond to them is essential for advancing.
Summary
The HTTP protocol is the basis for crawler development. From URL structure to request-response process, from status codes to various request headers, every link plays a key role in crawlers. With the development of web technology, HTTP is also constantly evolving. From HTTP/1.1 to HTTP/2, and then to HTTP/3, the transmission efficiency is getting higher and higher.
In practice, we must learn to set reasonable request headers, handle various status codes, maintain sessions, and control frequency. We must also abide by laws and regulations and perform compliant crawling.
Understanding the HTTP protocol is not only the starting point for writing crawlers, but also the key to a deep understanding of how the entire Web works. I hope this article can help you lay a solid foundation and successfully open the door to crawler development!

