HTTP protocol basics and crawler principles

Introduction

If you want to become an excellent crawler developer, a deep understanding of the HTTP protocol is the first step. HTTP is the cornerstone of data transmission on the Web and the "language" in which crawlers talk to web servers. This article takes you from basic concepts to practical applications so that you can systematically master the core of the HTTP protocol and lay a solid foundation for crawler development.

1. Detailed explanation of URI and URL

Basic concepts

We often hear the terms URI and URL. What is the difference between them?

  • URI (Uniform Resource Identifier): used to uniquely identify a resource on the Internet, just like the "ID card number" of the resource.
  • URL (Uniform Resource Locator): It is a subset of URI. It not only identifies the resource, but also tells us how to find it, which is equivalent to the "home address" of the resource.
  • URN (Uniform Resource Name): only names the resource without specifying the location. It is rarely used in the modern Internet.

Simply put: all URLs are URIs, but not all URIs are URLs. Almost all the addresses we use every day are URLs.

URL structure parsing

A complete URL is like a detailed address and contains multiple components:

scheme://[username:password@]hostname[:port][/path][;parameters][?query][#fragment]

Let’s break it down with a practical example: https://www.example.com:8080/articles/index.html?page=1&sort=time#section2

| Component | Content | Description |
|-----------|---------|-------------|
| scheme | https | Protocol type; tells the browser how to access the resource |
| hostname | www.example.com | Host address, that is, the "house number" of the server |
| port | 8080 | Port number, the specific "room number" on the server; HTTP defaults to 80 and HTTPS to 443, and the default is usually omitted |
| path | /articles/index.html | Resource path, the specific "file location" on the server |
| query | ?page=1&sort=time | Query parameters, used to pass additional information to the server |
| fragment | #section2 | Fragment identifier; used only on the browser side and not sent to the server |
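
The breakdown above can be reproduced with Python's standard library. This is a minimal sketch using urllib.parse.urlsplit on the example URL from this section; note that the parsed query and fragment values do not include the leading "?" and "#".

```python
from urllib.parse import urlsplit

url = "https://www.example.com:8080/articles/index.html?page=1&sort=time#section2"
parts = urlsplit(url)

print(parts.scheme)    # https
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /articles/index.html
print(parts.query)     # page=1&sort=time
print(parts.fragment)  # section2
```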

Common characteristics of modern URLs

In daily development, a few conventions in how URLs are used are worth noting:

  1. Query parameters: a core part of most URLs, often used for paging, filtering, and so on. The format is ?key1=value1&key2=value2
  2. Fragment ID: widely used in front-end routing of single-page applications (Vue/React), or anchor jumps within the page.
  3. Default port: HTTP defaults to 80, HTTPS defaults to 443, which can usually be omitted.
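
As a rough illustration of these three points, the sketch below builds and parses a query string, strips a fragment before a request would be sent, and fills in the default port. The helper effective_port and the DEFAULT_PORTS table are names made up for this example, not part of any library.

```python
from urllib.parse import parse_qs, urldefrag, urlencode, urlsplit

# Default ports for the two schemes discussed in this article
DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_port(url):
    """Return the explicit port, or the scheme's default when it is omitted."""
    parts = urlsplit(url)
    return parts.port or DEFAULT_PORTS.get(parts.scheme)

# 1. Build a query string from a dict, then parse it back
query = urlencode({"page": 2, "sort": "time"})
page_url = f"https://www.example.com/articles?{query}"
print(page_url)                            # https://www.example.com/articles?page=2&sort=time
print(parse_qs(urlsplit(page_url).query))  # {'page': ['2'], 'sort': ['time']}

# 2. The fragment stays on the client, so a crawler can drop it
clean_url, fragment = urldefrag(page_url + "#section2")
print(clean_url == page_url, fragment)     # True section2

# 3. The port is implied by the scheme when it is not written out
print(effective_port(page_url))                       # 443
print(effective_port("http://www.example.com:8080"))  # 8080
```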

2. HTTP / HTTPS protocol

HTTP protocol

HTTP (Hypertext Transfer Protocol) is the core of the Web, and its development has gone through several important versions:

  • HTTP/1.0 (1996): Early version, a new connection was established for each request.
  • HTTP/1.1 (1997): Still widely deployed; adds features such as persistent connections and virtual hosting (the Host header).
  • HTTP/2 (2015): Uses binary framing and multiplexes many requests over one connection, which greatly improves performance.
  • HTTP/3 (2022): Runs over QUIC, which is built on UDP; it reduces connection setup latency and avoids TCP head-of-line blocking.

HTTPS protocol

HTTPS is the "secure version of HTTP". It adds an SSL/TLS encryption layer to HTTP and has three main advantages:

  1. Encrypted transmission: Data is encrypted in transit to prevent eavesdropping.
  2. Authentication: A certificate issued by a CA confirms the true identity of the website, which guards against phishing sites.
  3. Data integrity: Data cannot be tampered with in transit without detection.

Today, all major browsers mark non-HTTPS websites as not secure, and platforms such as WeChat mini programs and app stores also mandate HTTPS.
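
On the client side, the authentication advantage corresponds to certificate and hostname verification. The following is only a sketch of the defaults in Python's ssl module; it inspects a default context and does not open any connection.

```python
import ssl

# A default client-side context, as used for verifying servers
context = ssl.create_default_context()

# The server must present a certificate that chains to a trusted CA
print(context.verify_mode == ssl.CERT_REQUIRED)  # True
# The certificate must also match the hostname being contacted
print(context.check_hostname)                    # True
```

Turning either check off removes the authentication guarantee, so it is best avoided outside local debugging.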

3. HTTP request-response process

Complete request process

When you enter a URL in the browser address bar and press Enter, the following events will occur in sequence:

  1. DNS resolution: The browser translates the domain name into the IP address of the server.
  2. Establish connection: Establish a TCP connection with the server (three-way handshake); for HTTPS, a TLS handshake follows.
  3. Send request: The browser sends an HTTP request to the server.
  4. Processing request: The server receives the request and processes it.
  5. Return response: The server encapsulates the result as an HTTP response and sends it back.
  6. Render page: The browser parses the response content and renders the web page.
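  7. Close connection: When communication ends, the TCP connection is closed (four-way teardown). With HTTP/1.1 persistent connections, the connection is often kept open and reused for later requests.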
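
To make steps 2 to 7 concrete, here is a minimal sketch that performs them by hand over a raw socket. So that it runs offline, it talks to a throwaway local server started in the same script; HelloHandler and fetch_raw are hypothetical names for this example, and DNS resolution (step 1) is skipped because an IP address is used directly.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    """Tiny local server that stands in for a real website."""
    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet

def fetch_raw(host, port, path="/"):
    """Connect, send one GET request, read the full response, then close."""
    # Step 2: open the TCP connection (the handshake happens here)
    with socket.create_connection((host, port), timeout=5) as sock:
        # Step 3: an HTTP request is plain text: request line, headers, blank line
        request = (
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}:{port}\r\n"
            "Connection: close\r\n"
            "\r\n"
        )
        sock.sendall(request.encode("ascii"))
        # Steps 4-5: the server processes the request; read until it closes
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    # Step 7: leaving the with-block closes the socket
    return b"".join(chunks)

server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
raw = fetch_raw("127.0.0.1", server.server_port)
server.shutdown()
server.server_close()

# Headers and body are separated by one blank line
head, _, body = raw.partition(b"\r\n\r\n")
status_line = head.split(b"\r\n")[0].decode("ascii")
print(status_line)
print(body)  # b'hello'
```

In real crawlers a library such as requests hides all of this, but the bytes on the wire have the same shape.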

Use developer tools to analyze

Chrome Developer Tools is one of the most practical tools for learning HTTP. Press F12 to open it, then switch to the Network panel:

  • General area: View basic information such as request URL, method, status code, etc.
  • Headers area: View the details of request headers and response headers.
  • Preview/Response: View the actual content returned by the server.
  • Timing: Analyze the time consumption of each stage of the request.

4. Detailed explanation of HTTP requests

Request method

HTTP defines multiple request methods, each with a different purpose:

| Method | Description | Idempotent | Crawler application |
|--------|-------------|------------|---------------------|
| GET | Get a resource | Yes | Most commonly used; fetches web pages and API data |
| POST | Submit data | No | Form submission, login, data upload |
| PUT | Replace a resource | Yes | Update an entire resource |
| DELETE | Delete a resource | Yes | Delete a resource |
| HEAD | Get only the response headers | Yes | Check whether a resource exists; fetch only metadata |

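Idempotence: Executing a request once has the same effect on the server as executing it multiple times. For example, sending the same PUT request repeatedly leaves the resource in the same state, whereas repeating a POST may create a new record each time. GET and HEAD go further: they are "safe" methods that should not change server state at all.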
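
The difference between an idempotent and a non-idempotent method can be shown with a toy in-memory model. ToyStore below is purely hypothetical and involves no network; it only mimics how a typical server treats PUT and POST.

```python
class ToyStore:
    """In-memory stand-in for a server-side collection (hypothetical)."""
    def __init__(self):
        self.items = {}
        self.next_id = 1

    def post(self, data):
        # POST: the server creates a new resource on every call
        item_id = self.next_id
        self.items[item_id] = data
        self.next_id += 1
        return item_id

    def put(self, item_id, data):
        # PUT: the client names the resource, which is replaced in place
        self.items[item_id] = data

# Repeating the same PUT leaves one resource in the same state
put_store = ToyStore()
put_store.put(1, {"title": "hello"})
put_store.put(1, {"title": "hello"})
print(len(put_store.items))   # 1

# Repeating the same POST creates a second resource
post_store = ToyStore()
post_store.post({"title": "hello"})
post_store.post({"title": "hello"})
print(len(post_store.items))  # 2
```

This is why a crawler can usually retry a failed GET or PUT safely, but should be careful about blindly retrying a POST.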

Request headers (Headers)

Request headers are "additional information" passed by the browser to the server, just like the notes written on the envelope when sending a letter. For crawlers, these headers are very important:

| Field | Description | Crawler notes |
|-------|-------------|---------------|
| User-Agent | Client identifier | Should be set to a common browser UA; otherwise the request is easily identified as a crawler |
| Cookie | Session information | Used to maintain the logged-in state |
| Referer | Source page | Tells the server which page the request came from; anti-crawling mechanisms often check this field |
| Content-Type | Request body type | Must be set correctly for POST requests |
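
As a small sketch of how such headers are attached to a request, the example below builds a request object with the standard library's urllib.request without sending it. The Referer and Cookie values are placeholders invented for illustration. With requests, the same dictionary would be passed through the headers= argument.

```python
import urllib.request

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Referer": "https://www.example.com/articles/",  # placeholder source page
    "Cookie": "sessionid=abc123",                    # placeholder session cookie
}

# Building the Request object does not touch the network
request = urllib.request.Request(
    "https://www.example.com/articles/index.html", headers=headers
)

# urllib normalizes the capitalization ("User-agent"); header names are
# case-insensitive in HTTP, so the server treats them the same way
sent = dict(request.header_items())
for name, value in sent.items():
    print(f"{name}: {value}")
```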

Request body (Body)

The request body is the actual data carried by the request. It is normally present only with methods such as POST and PUT. Three formats are common:

  1. application/x-www-form-urlencoded: traditional form format, such as username=admin&password=123
  2. application/json: JSON format, for example {"username":"admin","password":"123"}; the most common format in modern APIs.
  3. multipart/form-data: used for file upload.
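
The first two formats are easy to produce by hand, which helps when comparing against what DevTools shows. A minimal sketch using only the standard library follows; with requests, passing data= produces the form encoding and json= produces the JSON encoding.

```python
import json
from urllib.parse import urlencode

credentials = {"username": "admin", "password": "123"}

# application/x-www-form-urlencoded: key=value pairs joined with "&"
form_body = urlencode(credentials).encode("ascii")
print(form_body)  # b'username=admin&password=123'

# application/json: the same data serialized as a JSON document
json_body = json.dumps(credentials).encode("utf-8")
print(json_body)  # b'{"username": "admin", "password": "123"}'

# The Content-Type header must match the encoding that was actually used
bodies = {
    "application/x-www-form-urlencoded": form_body,
    "application/json": json_body,
}
```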

5. Detailed explanation of HTTP response

Status code

The status code is a three-digit number with which the server tells the client the result of processing the request. Status codes fall into five classes:

  • 1xx: Informational, a provisional response.
  • 2xx: Success, the request has been processed normally.
  • 3xx: Redirect, further action required.
  • 4xx: Client error, there is a problem with the request itself.
  • 5xx: Server error, something went wrong on the server side.

Common status codes:

| Status code | Meaning | Crawler handling |
|-------------|---------|------------------|
| 200 | Success | Parse the response normally |
| 301/302 | Redirect | Follow the redirect to the new address |
| 304 | Not Modified | Use the cached copy |
| 400 | Bad Request | Check the request parameters |
| 401 | Unauthorized | Log in, then try again |
| 403 | Forbidden | The IP may be blocked |
| 404 | Not Found | The page does not exist |
| 429 | Too Many Requests | Slow down the request frequency |
| 500/502/503 | Server error | Wait for a while and try again |
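
One way to turn the table into code is a small dispatch function. This is only a sketch: decide_action and its return labels are invented for this example, and a real crawler would tune the policy per site.

```python
from http import HTTPStatus

def decide_action(status_code):
    """Map a status code to a coarse crawler action (labels are illustrative)."""
    if 200 <= status_code < 300:
        return "parse"
    if status_code in (301, 302, 303, 307, 308):
        return "follow_redirect"
    if status_code == 304:
        return "use_cache"
    if status_code in (401, 403):
        return "check_login_or_block"
    if status_code == 429 or status_code >= 500:
        return "retry_later"
    return "give_up"  # 400, 404 and other client errors

for code in (200, 302, 304, 403, 404, 429, 503):
    # HTTPStatus supplies the standard reason phrase for each code
    print(code, HTTPStatus(code).phrase, "->", decide_action(code))
```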

Response headers (Headers)

The response header is the "additional information" returned by the server. The important fields are:

| Field | Description | Crawler application |
|-------|-------------|---------------------|
| Set-Cookie | Sets a cookie | Save login information |
| Content-Type | Response content type | Determines how to parse the response body |
| Location | Redirect address | Used to follow redirects |
| Retry-After | Time to wait before retrying | Tells you how long to wait when being throttled |
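
Retry-After deserves a little care, because the header may carry either a number of seconds or an HTTP date. The helper below is a sketch under that assumption; retry_after_seconds, its 60-second default, and the now parameter (added to make it testable) are choices made for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(value, default=60.0, now=None):
    """Convert a Retry-After header value into a number of seconds to wait."""
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        # Form 1: delay in seconds, e.g. "120"
        return float(value)
    try:
        # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default  # unparseable value: fall back to the default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means there is no need to wait
    return max(0.0, (when - now).total_seconds())

fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(retry_after_seconds("120"))                                           # 120.0
print(retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now))  # 60.0
```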

Response body (Body)

The response body is the actual content returned by the server. Its format depends on Content-Type:

  • text/html: HTML web page, parsed with BeautifulSoup, etc.
  • application/json: JSON data, parsed with json library.
  • image/jpeg, image/png: pictures, saved directly.
  • application/octet-stream: Binary files, such as downloaded PDF, ZIP.
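
A crawler typically branches on the media type before parsing. The sketch below shows one possible dispatch; decode_body is a hypothetical helper, and falling back to UTF-8 when no charset is given is an assumption rather than a rule.

```python
import json

def decode_body(content_type, body):
    """Decode raw response bytes according to the Content-Type header value."""
    media_type, _, params = content_type.partition(";")
    media_type = media_type.strip().lower()

    # Look for an optional charset parameter, e.g. "text/html; charset=gbk"
    charset = "utf-8"  # assumed fallback when the server does not say
    for param in params.split(";"):
        key, _, val = param.partition("=")
        if key.strip().lower() == "charset" and val:
            charset = val.strip().strip('"')

    if media_type == "application/json":
        return json.loads(body.decode(charset))
    if media_type.startswith("text/"):
        return body.decode(charset)  # HTML text, ready for an HTML parser
    return body  # images, application/octet-stream, etc.: keep raw bytes

print(decode_body("application/json; charset=utf-8", b'{"page": 1}'))  # {'page': 1}
print(decode_body("text/html", b"<h1>Hi</h1>"))                        # <h1>Hi</h1>
```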

6. Application practice in crawlers

Key Notes

When writing a crawler, the following points are crucial:

  1. Set reasonable request headers: Especially User-Agent, which should resemble a real browser; otherwise the crawler is easily identified and banned.
  2. Handle cookies: Maintain the login state; a Session object handles this automatically.
  3. Control request frequency: Do not put too much pressure on the server and increase the delay appropriately.
  4. Exception handling and retry: Network requests may fail at any time, and a retry mechanism must be in place.
  5. Follow robots.txt: Comply with the crawling rules set by the website.
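
For the robots.txt point, Python ships a parser in urllib.robotparser. The sketch below feeds it an inline robots.txt so that it runs offline; the rules and the "MyCrawler" user agent are made up for this example. Against a real site you would point the parser at the site's robots.txt with set_url() and read().

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, inlined so the example needs no network access
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask before fetching each URL
print(parser.can_fetch("MyCrawler", "https://www.example.com/articles/index.html"))  # True
print(parser.can_fetch("MyCrawler", "https://www.example.com/admin/users"))          # False

# Honor the requested delay between requests, if the site declares one
print(parser.crawl_delay("MyCrawler"))  # 2
```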

Crawler code example

Here is a simple but practical crawler example:

import requests
import time
import random

class SimpleSpider:
    def __init__(self):
        # A Session keeps cookies and reuses connections automatically
        self.session = requests.Session()
        # Set request headers that resemble a real browser
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        })

    def get(self, url, max_retries=3, **kwargs):
        """Send a GET request with retries; return None if every retry fails."""
        # Use a 10-second timeout unless the caller passes their own
        kwargs.setdefault('timeout', 10)
        for attempt in range(max_retries):
            try:
                response = self.session.get(url, **kwargs)

                if response.status_code == 200:
                    return response
                elif response.status_code == 429:
                    # Too many requests: wait as long as the server asks.
                    # Retry-After may also be an HTTP date, so fall back to 60 s.
                    try:
                        retry_after = int(response.headers.get('Retry-After', 60))
                    except ValueError:
                        retry_after = 60
                    print(f"Too many requests, waiting {retry_after} seconds")
                    time.sleep(retry_after)
                elif response.status_code >= 500:
                    # Server error: retry with exponential backoff
                    wait_time = 2 ** attempt
                    print(f"Server error, retrying in {wait_time} seconds")
                    time.sleep(wait_time)
                else:
                    # Other statuses (for example 403 or 404) are not retried
                    print(f"Request failed, status code: {response.status_code}")
                    return response

            except requests.exceptions.RequestException as e:
                print(f"Request error: {e}")
                if attempt < max_retries - 1:
                    time.sleep(2 ** attempt)
                else:
                    raise
        return None

    def random_delay(self, min_sec=1, max_sec=3):
        """Sleep for a random interval to avoid sending requests too fast."""
        delay = random.uniform(min_sec, max_sec)
        time.sleep(delay)

# Usage example
def main():
    spider = SimpleSpider()

    # Fetch a page
    response = spider.get("https://example.com")
    if response is not None and response.status_code == 200:
        print("Page fetched successfully!")
        # Process the response content…

        # Random delay before the next request
        spider.random_delay()

if __name__ == "__main__":
    main()

7. Tools and libraries

Debugging tools

  • Chrome DevTools: The most practical debugging tool that comes with the browser.
  • Postman: API testing tool, very convenient for debugging API endpoints.
  • cURL: Command line tool to quickly test HTTP requests.
  • Charles/Fiddler: HTTP proxy tool, capable of intercepting and analyzing requests.

Python libraries

  • requests: The most popular HTTP request library, simple and easy to use.
  • httpx: Modern request library supporting HTTP/2 and async.
  • beautifulsoup4: HTML parsing library, suitable for extracting data.
  • lxml: Efficient XML/HTML parsing library.
  • selenium/playwright: Browser automation tool for handling JavaScript-rendered pages.

8. Best practices and learning suggestions

Compliant crawling

  1. Comply with robots.txt: Check the website’s crawling rules before crawling.
  2. Control request frequency: Avoid putting pressure on the target server.
  3. Respect copyright: Use the crawled data legally.
  4. Protect privacy: Do not crawl sensitive personal information.

Study suggestions

  1. Use developer tools often: Observing real HTTP requests and responses is more reliable than just reading about them.
  2. Start with a simple website: Crawl static pages first, then gradually take on more complex targets.
  3. Learn JavaScript: Many modern websites rely on front-end rendering, and understanding JS will go a long way.
  4. Pay attention to anti-crawling mechanisms: Understanding common anti-crawling methods and their countermeasures is essential for advancing.

Summary

The HTTP protocol is the basis of crawler development. From URL structure to the request-response process, from status codes to request headers, every part plays a key role in a crawler. HTTP itself keeps evolving: from HTTP/1.1 to HTTP/2 and then HTTP/3, transmission has become steadily more efficient.

In practice, we must learn to set reasonable request headers, handle various status codes, maintain sessions, and control frequency. We must also abide by laws and regulations and perform compliant crawling.

Understanding the HTTP protocol is not only the starting point for writing crawlers, but also the key to a deep understanding of how the entire Web works. I hope this article can help you lay a solid foundation and successfully open the door to crawler development!