HTTP protocol basics and crawler principles

Introduction

A solid understanding of the HTTP protocol is the first step toward becoming a capable crawler developer. HTTP is the foundation of data transfer on the Web and the "language" in which crawlers talk to web servers. This article moves from basic concepts to practical application, covering the core of the HTTP protocol so that you have a firm base for crawler development.

1. Detailed explanation of URI and URL

Basic concepts

We often hear the terms URI and URL. What is the difference between them?

URI (Uniform Resource Identifier): uniquely identifies a resource on the Internet, like the "ID card number" of the resource.
URL (Uniform Resource Locator): a subset of URI. It not only identifies the resource but also tells us how to find it, like the "home address" of the resource.
URN (Uniform Resource Name): only names the resource without specifying its location. It is rarely used on the modern Internet.

Simply put: all URLs are URIs, but not all URIs are URLs. Almost all the addresses we use every day are URLs.

URL structure parsing

A complete URL is like a detailed address and contains multiple components:

scheme://[username:password@]hostname[:port][/path][;parameters][?query][#fragment]

Let’s break it down with a practical example: https://www.example.com:8080/articles/index.html?page=1&sort=time#section2

Component	Content	Description
scheme	https	Protocol type, tells the browser how to access
hostname	www.example.com	Host address, that is, the "house number" of the server
port	8080	Port number, the specific "room number" on the server; HTTP defaults to 80, HTTPS defaults to 443, usually omitted
path	/articles/index.html	Resource path, specific "file location" on the server
query	?page=1&sort=time	Query parameters, used to pass additional information to the server
fragment	#section2	Fragment identifier, only used on the browser side and will not be sent to the server
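The breakdown in the table can be reproduced with Python's standard library. This is a minimal sketch using `urllib.parse.urlsplit` on the example URL above. Note that `urlsplit` returns the query and fragment without their leading `?` and `#` delimiters.

```python
from urllib.parse import urlsplit

url = "https://www.example.com:8080/articles/index.html?page=1&sort=time#section2"
parts = urlsplit(url)

print(parts.scheme)    # protocol type
print(parts.hostname)  # host address
print(parts.port)      # port number as an int, or None when the URL omits it
print(parts.path)      # resource path
print(parts.query)     # query string, without the leading "?"
print(parts.fragment)  # fragment, without the leading "#"
```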

Common characteristics of modern URLs

In daily development, a few conventions in how URLs are used are worth noting:

Query parameters: a core component of most URLs, often used for paging, filtering, and so on. The format is ?key1=value1&key2=value2.
Fragment identifier: widely used for front-end routing in single-page applications (Vue/React) and for anchor jumps within a page.
Default port: HTTP defaults to 80 and HTTPS to 443, so the port can usually be omitted.
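As a small illustration of these conventions, the sketch below builds a query string, parses it back, and resolves the effective port of a URL. The helper `effective_port` and the parameter names are hypothetical, chosen only for this example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Default ports for the two schemes discussed in this article
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(url):
    """Return the explicit port if the URL has one, else the scheme default."""
    parts = urlsplit(url)
    return parts.port or DEFAULT_PORTS.get(parts.scheme)


# Build a query string from a dict; special characters are escaped for us
query = urlencode({"page": 2, "sort": "time", "q": "http basics"})
page_url = "https://www.example.com/articles?" + query

# Parse the query string back into a dict of lists
params = parse_qs(urlsplit(page_url).query)

print(page_url)
print(params)
print(effective_port(page_url))
```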

2. HTTP / HTTPS protocol

HTTP protocol

HTTP (Hypertext Transfer Protocol) is the core of the Web, and its development has gone through several important versions:

HTTP/1.0 (1996): Early version, a new connection was established for each request.
HTTP/1.1 (1997): Still very widely used; added features such as persistent connections and virtual hosts (the Host header).
HTTP/2 (2015): A binary protocol that supports multiplexing several requests over one connection, which greatly improves performance.
HTTP/3 (2022): Built on QUIC, which runs over UDP; it reduces connection setup latency and copes better with packet loss.

HTTPS protocol

HTTPS is the "secure version of HTTP". It runs HTTP over an SSL/TLS encryption layer and has three main advantages:

Encrypted transmission: Data is encrypted in transit, which prevents eavesdropping.
Authentication: A certificate issued by a CA confirms the real identity of the website, which helps guard against phishing sites.
Data integrity: Data cannot be tampered with in transit without detection.

Major browsers now warn users about non-HTTPS websites (for example with a "Not secure" label), and platforms such as WeChat mini programs and app stores also require HTTPS.
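On the client side, the authentication step corresponds to certificate and hostname verification. The sketch below only inspects the defaults of Python's standard `ssl` module and makes no network connection. It is meant to show what a crawler gives up if verification is switched off.

```python
import ssl

# The default client context verifies the server certificate chain
# against the system CA store and checks that the hostname matches.
context = ssl.create_default_context()

print(context.verify_mode == ssl.CERT_REQUIRED)  # certificate is required
print(context.check_hostname)                    # hostname must match the certificate
```

HTTP libraries such as requests verify certificates by default as well. Disabling that check removes the authentication guarantee described above.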

3. HTTP request-response process

Complete request process

When you enter a URL in the browser address bar and press Enter, the following events will occur in sequence:

DNS resolution: The browser resolves the domain name to the server's IP address.
Establish connection: The browser opens a TCP connection to the server (three-way handshake); for HTTPS, a TLS handshake follows.
Send request: The browser sends an HTTP request to the server.
Process request: The server receives the request and processes it.
Return response: The server wraps the result in an HTTP response and sends it back.
Render page: The browser parses the response content and renders the web page.
Close connection: When communication ends, the TCP connection is closed (four-way handshake). With persistent connections, the connection is often kept open and reused for later requests.
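To make the "send request" and "return response" steps concrete, here is a minimal sketch of what travels over the connection in HTTP/1.1. It only builds and parses the text messages in memory. DNS resolution and the TCP connection are left out, and the function names and the sample response are made up for illustration.

```python
def build_get_request(host, path="/"):
    """Build the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, path, version
        f"Host: {host}",          # Host header is mandatory in HTTP/1.1
        "Connection: close",      # ask the server to close after responding
        "",                       # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status_line(raw_response):
    """Extract (version, status code, reason) from a raw HTTP response."""
    status_line = raw_response.split(b"\r\n", 1)[0].decode("ascii")
    parts = status_line.split(" ", 2)
    reason = parts[2] if len(parts) > 2 else ""
    return parts[0], int(parts[1]), reason


request_bytes = build_get_request("www.example.com", "/index.html")
sample_response = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"

print(request_bytes.decode("ascii"))
print(parse_status_line(sample_response))
```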

Use developer tools to analyze

Chrome DevTools is one of the most practical tools for learning HTTP. Press F12 to open it, then switch to the Network panel:

General area: View basic information such as request URL, method, status code, etc.
Headers area: View the details of request headers and response headers.
Preview/Response: View the actual content returned by the server.
Timing: Analyze the time consumption of each stage of the request.

4. Detailed explanation of HTTP requests

Request method

HTTP defines multiple request methods, each with a different purpose:

Method	Description	Idempotence	Crawler Application
GET	Get resources	Yes	Most commonly used, get web pages and API data
POST	Submit data	No	Form submission, login, upload data
PUT	Replace resource	Yes	Update entire resource
DELETE	Delete resource	Yes	Delete resource
HEAD	Get only response headers	Yes	Check if the resource exists, get only metadata

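Idempotence: Executing a request once has the same effect on the server as executing it multiple times. For example, repeating a PUT with the same body leaves the resource in the same state. GET and HEAD are also "safe" methods, meaning they are not supposed to change server state at all. POST is not idempotent: repeating it may, for example, create a duplicate record.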
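The difference can be sketched with a toy in-memory "server". The class and its methods are hypothetical and stand in for real PUT and POST handlers. The point is only that repeating the PUT leaves one resource, while repeating the POST creates a new one each time.

```python
class InMemoryServer:
    """A toy resource store that imitates PUT and POST semantics."""

    def __init__(self):
        self.resources = {}
        self.next_id = 1

    def put(self, resource_id, data):
        # PUT replaces the resource at a known ID: repeating it changes nothing
        self.resources[resource_id] = data
        return resource_id

    def post(self, data):
        # POST creates a new resource each time: repeating it adds duplicates
        resource_id = self.next_id
        self.next_id += 1
        self.resources[resource_id] = data
        return resource_id


server = InMemoryServer()

server.put(100, {"title": "HTTP basics"})
server.put(100, {"title": "HTTP basics"})
count_after_put = len(server.resources)

server.post({"title": "HTTP basics"})
server.post({"title": "HTTP basics"})
count_after_post = len(server.resources)

print(count_after_put, count_after_post)
```

This is why a crawler can usually retry a failed GET without worry, but should be careful about blindly retrying a POST.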

Request headers (Headers)

Request headers are "additional information" passed by the client to the server, like the notes written on an envelope when sending a letter. For crawlers, these headers are very important:

Field	Description	Crawler Notes
User-Agent	Client identifier	Should be set to a common browser UA; otherwise the request is easily identified as a crawler
Cookie	Session information	Used to maintain logged-in status
Referer	Source page	Tells the server which page you came from; anti-crawling mechanisms often check this field
Content-Type	Request body type	Must be set correctly for POST requests
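As a rough illustration, the headers in the table can be attached to a request object with the standard library's `urllib.request.Request`. Nothing is sent here. The cookie value and the URLs are placeholders, and the User-Agent string is just one example of a browser UA.

```python
from urllib.request import Request

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Referer": "https://www.example.com/articles/",
    "Cookie": "sessionid=abc123",  # placeholder session cookie
    "Content-Type": "application/x-www-form-urlencoded",
}

request = Request(
    "https://www.example.com/login",
    data=b"username=admin&password=123",
    headers=headers,
    method="POST",
)

# urllib stores header names capitalized, e.g. "User-agent"
for name, value in request.header_items():
    print(f"{name}: {value}")
```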

Request body (Body)

The request body carries the data sent with the request. It is typically present in POST, PUT and PATCH requests, while GET and HEAD requests normally have none. Three formats are common:

application/x-www-form-urlencoded: the traditional form format, such as username=admin&password=123.
application/json: JSON format, for example {"username":"admin","password":"123"}; the most common choice for modern APIs.
multipart/form-data: used for file uploads.
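The first two formats can be produced with the standard library alone. The sketch below encodes the same sample credentials both ways and pairs each body with the Content-Type it needs. The variable names are illustrative only.

```python
import json
from urllib.parse import urlencode

credentials = {"username": "admin", "password": "123"}

# application/x-www-form-urlencoded: key=value pairs joined by "&"
form_body = urlencode(credentials).encode("utf-8")
form_headers = {"Content-Type": "application/x-www-form-urlencoded"}

# application/json: the same data serialized as a JSON document
json_body = json.dumps(credentials).encode("utf-8")
json_headers = {"Content-Type": "application/json"}

print(form_body)
print(json_body)
```

With the requests library, passing a dict as `data=` produces the form encoding and passing it as `json=` produces the JSON encoding, with the matching Content-Type set automatically.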

5. Detailed explanation of HTTP response

Status code

The status code is a three-digit number with which the server tells the client how the request was handled. Codes fall into five classes:

1xx: Informational, an interim response.
2xx: Success, the request has been processed normally.
3xx: Redirect, further action required.
4xx: Client error, there is a problem with the request itself.
5xx: Server error, something went wrong on the server side.

Common status codes:

Status code	Meaning	Crawler handling
200	Success	Parse the response normally
301/302	Redirect	Follow the redirect to the new address
304	Not Modified	Use the cached copy
400	Bad Request	Check the request parameters
401	Unauthorized	Log in (authenticate) and try again
403	Forbidden	Access denied; the IP may be blocked
404	Not Found	The page does not exist
429	Too Many Requests	Slow down the request rate
500/502/503	Server error	Wait for a while and try again
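One way to turn this table into code is a small dispatch function that maps a status code to the action a crawler should take. The function name and the action labels below are invented for this sketch. A real crawler would likely attach more logic to each branch.

```python
def action_for_status(status_code):
    """Map an HTTP status code to a coarse crawler action (labels are illustrative)."""
    if 200 <= status_code < 300:
        return "parse"
    if status_code in (301, 302, 303, 307, 308):
        return "follow_redirect"
    if status_code == 304:
        return "use_cache"
    if status_code == 401:
        return "authenticate"
    if status_code == 403:
        return "stop"            # access denied, possibly blocked
    if status_code == 404:
        return "skip"
    if status_code == 429:
        return "slow_down"
    if 400 <= status_code < 500:
        return "check_request"   # other client errors: the request is at fault
    if 500 <= status_code < 600:
        return "retry_later"
    return "unknown"


for code in (200, 302, 304, 400, 403, 429, 503):
    print(code, action_for_status(code))
```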

Response headers (Headers)

Response headers are the "additional information" returned by the server. The important fields are:

Field	Description	Crawler Application
Set-Cookie	Set Cookie	Save login information
Content-Type	Response content type	Determines how to parse the response body
Location	Redirect address	Used to handle jumps
Retry-After	Retry time	Know how long to wait when being throttled
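Retry-After deserves some care, because the header may hold either a number of seconds or an HTTP date. The sketch below handles both forms with the standard library. The function name and the 60-second fallback are assumptions made for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None, default=60.0):
    """Convert a Retry-After header value into a number of seconds to wait."""
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        # Form 1: delay in seconds, e.g. "120"
        return float(value)
    try:
        # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # Never return a negative wait if the date is already in the past
    return max(0.0, (retry_at - now).total_seconds())


fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(retry_after_seconds("120"))
print(retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now))
```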

Response body (Body)

The response body is the actual content returned by the server. How to handle it depends on the Content-Type:

text/html: HTML web page, parsed with BeautifulSoup, etc.
application/json: JSON data, parsed with json library.
image/jpeg, image/png: pictures, saved directly.
application/octet-stream: Binary files, such as downloaded PDF, ZIP.
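A crawler typically branches on Content-Type before touching the body. The following sketch shows one possible dispatch using only the standard library. The helper names are hypothetical, and HTML is simply decoded to text here, whereas a real crawler would hand it to a parser such as BeautifulSoup.

```python
import json


def parse_content_type(header_value):
    """Split 'text/html; charset=utf-8' into ('text/html', 'utf-8')."""
    media_type, _, rest = header_value.partition(";")
    charset = None
    for param in rest.split(";"):
        name, _, value = param.strip().partition("=")
        if name.lower() == "charset" and value:
            charset = value.strip('"')
    return media_type.strip().lower(), charset


def decode_body(content_type, body):
    """Decode raw response bytes according to the Content-Type header."""
    media_type, charset = parse_content_type(content_type)
    if media_type == "application/json":
        return json.loads(body.decode(charset or "utf-8"))
    if media_type.startswith("text/"):
        return body.decode(charset or "utf-8", errors="replace")
    # Images and other binary types are returned unchanged, ready to be saved
    return body


print(decode_body("application/json; charset=utf-8", b'{"page": 1}'))
print(decode_body("text/html; charset=utf-8", b"<h1>Hello</h1>"))
```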

6. Application practice in crawlers

Key Notes

When writing a crawler, the following points are crucial:

Set reasonable request headers: Especially User-Agent. Without a browser-like value, requests are easily identified as coming from a crawler and blocked.
Handle cookies: To keep a logged-in state, use a Session object, which stores and sends cookies automatically.
Control request frequency: Do not put too much pressure on the server; add an appropriate delay between requests.
Handle exceptions and retry: Network requests can fail at any time, so a retry mechanism is needed.
Follow robots.txt: Comply with the crawling rules set by the website.
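The robots.txt check can be done with the standard library's `urllib.robotparser`. In the sketch below the rules are supplied as an inline list so that nothing is fetched over the network. The rules themselves and the crawler name "MyCrawler" are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example rules, as they might appear in a site's robots.txt
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(robots_lines)

allowed = parser.can_fetch("MyCrawler", "https://www.example.com/articles/index.html")
blocked = parser.can_fetch("MyCrawler", "https://www.example.com/private/data.html")
delay = parser.crawl_delay("MyCrawler")

print(allowed, blocked, delay)
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` downloads the real file instead, and the Crawl-delay value, when present, is a reasonable lower bound for the delay between requests.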

Crawler code example

Here is a simple but practical crawler example:

import random
import time

import requests


class SimpleSpider:
    def __init__(self):
        # A Session keeps cookies and reuses connections automatically
        self.session = requests.Session()
        # Set request headers that resemble a real browser
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        })

    def get(self, url, max_retries=3, **kwargs):
        """Send a GET request with a retry mechanism."""
        # Default timeout; callers can override it through kwargs
        kwargs.setdefault('timeout', 10)
        for attempt in range(max_retries):
            try:
                response = self.session.get(url, **kwargs)

                if response.status_code == 200:
                    return response
                elif response.status_code == 429:
                    # Too many requests: wait before trying again.
                    # Retry-After may also be an HTTP date, so fall back
                    # to 60 seconds when it is not a plain number.
                    retry_after = response.headers.get('Retry-After', '')
                    wait_time = int(retry_after) if retry_after.isdigit() else 60
                    print(f"Too many requests, waiting {wait_time} seconds")
                    time.sleep(wait_time)
                elif response.status_code >= 500:
                    # Server error: retry with exponential backoff
                    wait_time = 2 ** attempt
                    print(f"Server error, retrying in {wait_time} seconds")
                    time.sleep(wait_time)
                else:
                    # Other statuses (e.g. 403, 404) are not worth retrying
                    print(f"Request failed, status code: {response.status_code}")
                    return response

            except requests.exceptions.RequestException as e:
                print(f"Request exception: {e}")
                if attempt < max_retries - 1:
                    time.sleep(2 ** attempt)
                else:
                    raise
        # All retries were used up without a successful response
        return None

    def random_delay(self, min_sec=1, max_sec=3):
        """Sleep for a random interval to avoid requesting too fast."""
        delay = random.uniform(min_sec, max_sec)
        time.sleep(delay)


# Usage example
def main():
    spider = SimpleSpider()

    # Fetch a page
    response = spider.get("https://example.com")
    if response and response.status_code == 200:
        print("Page fetched successfully!")
        # Process the response content…

        # Random delay before the next request
        spider.random_delay()


if __name__ == "__main__":
    main()

7. Recommended tools

Development and debugging tools

Chrome DevTools: The most practical debugging tool, built into the browser.
Postman: API testing tool, very convenient for debugging endpoints.
cURL: Command-line tool for quickly testing HTTP requests.
Charles/Fiddler: HTTP proxy tools that can intercept and analyze requests.

Python libraries

requests: The most popular HTTP request library, simple and easy to use.
httpx: Modern request library supporting HTTP/2 and async.
beautifulsoup4: HTML parsing library, suitable for extracting data.
lxml: Efficient XML/HTML parsing library.
selenium/playwright: Browser automation tool for handling JavaScript-rendered pages.

8. Best practices and learning suggestions

Compliant crawling

Comply with robots.txt: Check the website's crawling rules before crawling.
Control request frequency: Avoid putting pressure on the target server.
Respect copyright: Use the crawled data legally.
Protect privacy: Do not crawl sensitive personal information.

Learning suggestions

Use developer tools often: Observing real HTTP requests and responses teaches more than reading alone.
Start with a simple website: Crawl static pages first, then gradually take on more complex targets.
Learn JavaScript: Many modern websites rely on front-end rendering, and understanding JS goes a long way.
Pay attention to anti-crawling mechanisms: Understanding common anti-crawling methods and how to respond to them is a necessary step toward more advanced work.

Summary

The HTTP protocol is the basis of crawler development. From URL structure to the request-response process, from status codes to the various request headers, every part plays a key role in a crawler. HTTP also keeps evolving along with web technology: from HTTP/1.1 to HTTP/2 and then HTTP/3, transmission has become steadily more efficient.

In practice, we must learn to set reasonable request headers, handle the various status codes, maintain sessions, and control request frequency. We must also abide by laws and regulations and crawl in a compliant way.

Understanding the HTTP protocol is not only the starting point for writing crawlers but also the key to understanding how the Web as a whole works. I hope this article helps you build a solid foundation and get started with crawler development!