title: First experience with the urllib crawler
description: urllib is Python's built-in HTTP request library and can be used without additional installation. It contains the following main modules:

Python urllib module usage guide

1. Overview

urllib is the HTTP request toolkit built into Python's standard library, so it can be used directly without additional installation. It packages request sending, exception handling, URL processing, and even robots.txt parsing, which makes it well suited to getting started with crawlers.

It mainly contains 4 sub-modules:

  • urllib.request: Responsible for sending HTTP requests such as GET and POST
  • urllib.error: Defines the URL and HTTP exceptions raised during the request process
  • urllib.parse: Splits, joins, encodes, and decodes URLs
  • urllib.robotparser: Parses robots.txt rules (only occasionally needed at the beginner stage)

Note: Python 3 merged Python 2's confusingly split urllib and urllib2 into the single urllib package and divided it into the sub-modules above, each with a clearer responsibility, so there is no longer any need to tell the two apart.


2. Send request

2.1 The simplest request: urlopen

urlopen is the most direct way to send a request. One line of code completes a GET request:

from urllib.request import urlopen

# Send a GET request with no parameters
response = urlopen('https://www.python.org')
# Read the response bytes and decode them as UTF-8 text
print(response.read().decode('utf-8'))
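
The example above hard-codes UTF-8. As a hedged sketch, the charset can instead be read from the Content-Type header, falling back to UTF-8 when none is declared. The helper name decode_body is hypothetical; get_content_charset() is a method that the response's headers object inherits from email.message.Message, which is why the demonstration can run offline on a hand-built header.

```python
from email.message import Message


def decode_body(raw: bytes, headers: Message, default: str = "utf-8") -> str:
    """Decode response bytes using the charset declared in Content-Type."""
    charset = headers.get_content_charset() or default
    # errors="replace" keeps a mislabeled page from raising UnicodeDecodeError
    return raw.decode(charset, errors="replace")


# Offline demonstration with a hand-built header object; with a live
# response you would call decode_body(response.read(), response.headers).
demo_headers = Message()
demo_headers["Content-Type"] = "text/html; charset=gbk"
print(decode_body("你好".encode("gbk"), demo_headers))
```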

Besides the URL, urlopen also accepts several very practical parameters:

Parameter 1: data → switches the request to POST

To send a POST request, encode the parameter dictionary into a byte string and pass it in as data.

from urllib.parse import urlencode
from urllib.request import urlopen

# Parameters to submit
data_dict = {'word': 'hello urllib', 'author': 'germey'}
# urlencode produces a URL-encoded string, which is then encoded to UTF-8 bytes
data_bytes = bytes(urlencode(data_dict), encoding='utf-8')

# Passing the data argument makes urlopen send a POST request
response = urlopen('https://httpbin.org/post', data=data_bytes)
print(response.read().decode('utf-8'))
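
To see what the body looks like before anything is sent, the encoding step can be inspected offline. This is a small illustrative sketch using the same sample dictionary; parse_qs is used here only to check the round trip.

```python
from urllib.parse import parse_qs, urlencode

data_dict = {'word': 'hello urllib', 'author': 'germey'}

# urlencode returns a str; spaces become '+' and pairs are joined with '&'
encoded = urlencode(data_dict)
print(encoded)

# urlopen's data argument must be bytes, so encode the string
data_bytes = encoded.encode('utf-8')
print(data_bytes)

# parse_qs reverses the encoding, which is handy for checking the payload
print(parse_qs(encoded))
```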

Parameter 2: timeout → limits how long to wait

This keeps the program from hanging when the network is unstable:

import socket
import urllib.error
from urllib.request import urlopen

try:
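    # Deliberately use a very short timeout to trigger the exception
    response = urlopen('https://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # Check whether the underlying cause is a timeout
    if isinstance(e.reason, socket.timeout):
        print('⚠️ Request timed out; check the network or increase the timeout')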
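
Other parameters:

  • context: An ssl.SSLContext that customizes SSL verification (beginners can skip this, but do not turn off verification in production)
  • cafile / capath: Path to a local CA certificate file or directory (for intranet HTTPS sites with self-signed certificates). Both were deprecated in Python 3.6 and removed in Python 3.13; on current versions, load the certificate into an SSLContext and pass it as context instead.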
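
As a minimal sketch of the context parameter, the helper below builds a verifying SSLContext and optionally trusts an extra CA file. The helper name build_context, the ca_file argument, and the intranet URL in the final comment are hypothetical; ssl.create_default_context and load_verify_locations are standard-library calls.

```python
import ssl
from typing import Optional


def build_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Return a verifying SSL context, optionally trusting an extra CA file."""
    # create_default_context enables certificate and hostname verification
    context = ssl.create_default_context()
    if ca_file is not None:
        # Trust the given CA bundle in addition to the system defaults
        context.load_verify_locations(cafile=ca_file)
    return context


context = build_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)

# With a live site you would then call, for example:
# urlopen('https://intranet.example/', context=context, timeout=5)
```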

2.2 More flexible requests: the Request class

When you need custom request headers such as User-Agent and Referer, or want to specify the request method explicitly, urlopen alone is not enough. In that case, build a request object with Request first:

from urllib.request import Request, urlopen

# 1. Build the request with Request
req = Request('https://python.org')
# 2. Add request headers dynamically (a browser-like User-Agent matters a lot)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36')
# 3. Send the request
response = urlopen(req)
print(response.status)  # prints the status code, e.g. 200

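The full set of parameters accepted by the Request constructor (use as needed):

Request(
    url,
    data=None,               # bytes to send as the body for POST
    headers={},              # dict of request headers; add_header can add more later
    origin_req_host=None,    # host that originated the request (usually derived automatically)
    unverifiable=False,      # rarely needed; the default is fine
    method=None              # explicit request method: GET / POST / PUT / DELETE, etc.
)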
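
A Request object can be inspected before it is sent, which helps when checking how these parameters interact. The sketch below assumes a JSON body and the httpbin.org test endpoints purely for illustration; no network call is made.

```python
import json
from urllib.request import Request

payload = json.dumps({'name': 'germey'}).encode('utf-8')
req = Request(
    'https://httpbin.org/put',
    data=payload,
    headers={'Content-Type': 'application/json'},
    method='PUT',
)

# Nothing has been sent yet; these calls only inspect the request object
print(req.get_method())
print(req.full_url)
print(req.header_items())

# Without method=..., the method is inferred from data: POST if set, else GET
print(Request('https://httpbin.org/get').get_method())
print(Request('https://httpbin.org/post', data=b'a=1').get_method())
```

Note that Request normalizes header names with str.capitalize(), so the header above is stored as Content-type.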

2.3 Advanced usage: OpenerDirector

Behind the scenes, urlopen uses a default opener that Python prepares for us. To handle more complex scenarios such as authentication, proxies, and cookies, you need to build an opener yourself.

Scenario 1: The website requires Basic Auth authentication

For example, some test sites require a username and password to access:

from urllib.request import (
    HTTPPasswordMgrWithDefaultRealm,
    HTTPBasicAuthHandler,
    build_opener
)

# 1. Create the password manager
password_mgr = HTTPPasswordMgrWithDefaultRealm()
# 2. Add the credentials
password_mgr.add_password(
    realm=None,   # None acts as the default realm, used when no specific realm matches
    uri='https://ssr3.scrape.center/',
    user='admin',
    passwd='admin'
)
# 3. Create the authentication handler
auth_handler = HTTPBasicAuthHandler(password_mgr)
# 4. Build a custom opener from the handler
opener = build_opener(auth_handler)
# 5. Send the request with credentials
response = opener.open('https://ssr3.scrape.center/')
print(response.read().decode('utf-8'))
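
HTTPBasicAuthHandler sends the credentials only after the server answers with a 401 challenge. As an alternative sketch (not the handler-based approach shown above), the Authorization header can be built by hand and attached to a Request up front. The helper name basic_auth_header is hypothetical; the header format is "Basic " followed by the Base64 of user:password.

```python
import base64
from urllib.request import Request


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f'{user}:{password}'.encode('utf-8'))
    return 'Basic ' + token.decode('ascii')


req = Request('https://ssr3.scrape.center/')
req.add_header('Authorization', basic_auth_header('admin', 'admin'))
print(req.get_header('Authorization'))
```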

Scenario 2: Set proxy IP

To prevent the same IP from being banned due to too frequent requests:

from urllib.request import ProxyHandler, build_opener

# 1. Configure the proxy dict (HTTP and HTTPS get separate entries)
proxy_dict = {
    'http': 'http://127.0.0.1:8080',
    'https': 'https://127.0.0.1:8080'
}
# 2. Create the proxy handler
proxy_handler = ProxyHandler(proxy_dict)
# 3. Build the opener
opener = build_opener(proxy_handler)
# 4. Send the request through the proxy
try:
    response = opener.open('https://www.baidu.com', timeout=3)
    print('✅ Proxy connection succeeded')
except Exception as e:
    print(f'❌ Proxy connection failed: {e}')

Scenario 3: Handling Cookies

Continue to maintain session status after simulated login:

from http.cookiejar import CookieJar, MozillaCookieJar
from urllib.request import HTTPCookieProcessor, build_opener

# -- Option 1: keep cookies in memory only (temporary session)
cookie_jar = CookieJar()
cookie_handler = HTTPCookieProcessor(cookie_jar)
opener = build_opener(cookie_handler)
opener.open('https://www.baidu.com')

print('Cookies in memory:')
for cookie in cookie_jar:
    print(f'  {cookie.name}: {cookie.value}')

# -- Option 2: persist cookies to a local file
# MozillaCookieJar uses the browser-compatible cookies.txt format
local_cookie = MozillaCookieJar('baidu_cookies.txt')
local_handler = HTTPCookieProcessor(local_cookie)
local_opener = build_opener(local_handler)
local_opener.open('https://www.baidu.com')
# Save to file, including session cookies and expired cookies
local_cookie.save(ignore_discard=True, ignore_expires=True)
print('\n✅ Cookies saved to baidu_cookies.txt')
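
The three scenarios can be combined, since build_opener accepts several handlers at once. The sketch below is illustrative only: the proxy address is a placeholder and the User-Agent string is made up. It builds the opener and lists its handlers without sending any request.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener

cookie_jar = CookieJar()
opener = build_opener(
    ProxyHandler({'http': 'http://127.0.0.1:8080'}),  # placeholder proxy address
    HTTPCookieProcessor(cookie_jar),
)
# Headers listed here are added to each request sent through this opener,
# unless the request already sets that header itself
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (compatible; demo-crawler)')]

# List the handlers the opener will chain together
for handler in opener.handlers:
    print(type(handler).__name__)

# To make plain urlopen() use this opener as well:
# from urllib.request import install_opener; install_opener(opener)
```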

3. Exception handling

A crawler will inevitably run into network failures, 404/500 responses, and similar problems, so catch exceptions to make the program more robust. urllib provides two main exception classes. HTTPError is a subclass of URLError, so when catching both, catch the subclass first and the parent class second:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    response = urlopen('https://httpbin.org/status/404', timeout=3)
except HTTPError as e:
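    # HTTP errors, such as 404 Not Found or 500 Internal Server Error
    print(f'❌ HTTP error: status code {e.code}, reason {e.reason}')
    # The response headers are also available
    # print(e.headers)
except URLError as e:
    # URL errors, such as an unknown domain, unreachable network, or timeout
    print(f'❌ URL error: {e.reason}')
else:
    # Runs only when no exception was raised
    print('✅ Request succeeded')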
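
The same ordering rule can be packed into a small classifier, which is also easy to exercise offline. This is a sketch under assumptions: the function name describe_error and its labels are hypothetical, and the exceptions are built by hand rather than raised by a real request.

```python
import io
import socket
from urllib.error import HTTPError, URLError


def describe_error(exc: Exception) -> str:
    """Classify an exception raised by urlopen into a short label."""
    # HTTPError must be checked first because it subclasses URLError
    if isinstance(exc, HTTPError):
        return f'http {exc.code}'
    # Timeouts arrive either wrapped in URLError.reason or as a bare exception
    reason = exc.reason if isinstance(exc, URLError) else exc
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return 'timeout'
    if isinstance(exc, URLError):
        return f'url error: {exc.reason}'
    return f'other: {exc!r}'


# Offline demonstration with hand-built exceptions
not_found = HTTPError('https://httpbin.org/status/404', 404, 'Not Found', None, io.BytesIO())
print(describe_error(not_found))
print(describe_error(URLError(socket.timeout('timed out'))))
print(describe_error(URLError('Name or service not known')))
```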

4. URL parsing

Crawlers often need to split, join, or encode URLs. The urllib.parse module handles all of this.

4.1 Splitting URLs: urlparse / urlsplit

Split the complete URL into its component parts:

from urllib.parse import urlparse

url = 'https://www.baidu.com/index.html;user?id=5&name=test#comment'
result = urlparse(url)
print(result)
# Output: ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5&name=test', fragment='comment')

# Access individual parts by attribute
print(f'Scheme: {result.scheme}')
print(f'Host: {result.netloc}')
print(f'Path: {result.path}')

urlsplit is very similar to urlparse, except that it does not split out params separately (modern websites rarely use that part anymore).
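
The difference is easiest to see on the same sample URL. In this short sketch, urlsplit leaves the ;user part attached to the path, and parse_qs (an addition not covered above) breaks the query string into a dictionary.

```python
from urllib.parse import parse_qs, urlsplit

url = 'https://www.baidu.com/index.html;user?id=5&name=test#comment'
parts = urlsplit(url)

# urlsplit has no params field, so ';user' stays attached to the path
print(parts.path)
print(parts.query)

# parse_qs turns the query string into a dict of lists
print(parse_qs(parts.query))
```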

4.2 Joining URLs: urlunparse / urlunsplit / urljoin

  • urlunparse and urlunsplit are the inverse of the splitting functions above: they assemble a tuple or list back into a complete URL.
  • urljoin converts a relative path into an absolute URL, which is very useful when handling relative links in web pages.

from urllib.parse import urlunparse, urljoin

# Join the parts with urlunparse
data = ['https', 'www.baidu.com', '/index.html', '', 'id=6', 'new_comment']
print(urlunparse(data))  # Output: https://www.baidu.com/index.html?id=6#new_comment

# urljoin resolves relative links
base_url = 'https://www.example.com/blog/'
rel_url1 = 'article1.html'
rel_url2 = '../about.html'
print(urljoin(base_url, rel_url1))  # https://www.example.com/blog/article1.html
print(urljoin(base_url, rel_url2))  # https://www.example.com/about.html
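
Splitting and joining are often used together, for example to change one query parameter while paging through results. The helper below is a hypothetical sketch (with_query is not part of urllib); it relies on SplitResult being a named tuple, so _replace returns a modified copy.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query(url: str, **params) -> str:
    """Return url with the given query parameters added or overwritten."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query.update(params)
    # _replace returns a new SplitResult; urlunsplit joins it back into a string
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_query('https://www.example.com/search?q=python&page=1', page=2))
```

Because the query is held in a plain dict, repeated keys in the original URL collapse to their last value; a list of pairs would be needed to preserve them.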

4.3 URL encoding/decoding: urlencode / quote / unquote

Chinese characters and special symbols in a URL must be encoded before the server can interpret them correctly:

from urllib.parse import urlencode, quote, unquote

# urlencode: convert a dict into a URL query string
params = {'keyword': 'Python爬虫', 'page': 1, 'size': 10}
base_url = 'https://www.example.com/search?'
full_url = base_url + urlencode(params)
print(full_url)  # the Chinese characters are percent-encoded automatically

# quote / unquote: encode and decode a single string
chinese = 'Python入门到放弃'
encoded = quote(chinese)
decoded = unquote(encoded)
print(f'Encoded: {encoded}')
print(f'Decoded: {decoded}')
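
One detail worth knowing is that quote and urlencode treat spaces and slashes differently. The sketch below compares quote, quote with safe='', and quote_plus (the latter is an addition not covered above) on a short ASCII sample.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = 'a b/c'

# quote leaves '/' alone by default, which suits whole paths
print(quote(text))
# safe='' also escapes '/', useful when the value is a single path component
print(quote(text, safe=''))
# quote_plus encodes spaces as '+', the form-style encoding urlencode uses
print(quote_plus(text))

# Each decoder reverses its matching encoder
print(unquote(quote(text)), unquote_plus(quote_plus(text)))
```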

5. Best Practices (Pitfall Avoidance Guide)

  1. Always set a User-Agent. urllib's default User-Agent (Python-urllib/3.x) immediately identifies the client as a script, and many sites answer it with 403. Send a browser-like User-Agent instead.

  2. Set a reasonable timeout. 3 to 10 seconds is generally recommended; combine it with exception handling so the program never hangs for long.

  3. Handle encoding explicitly. When converting the response bytes to a string, first check the Content-Type response header to determine the encoding. Hard-coding utf-8 will sometimes produce garbled text.

  4. Comply with robots.txt. You can leave it aside while learning, but for a public crawler or large-scale collection, first use robotparser to check whether access is allowed.

  5. Switch libraries when requirements get complex. urllib is built in, but its syntax is relatively verbose. For cookie pools, proxy pools, asynchronous concurrency, and similar needs, requests (synchronous) or aiohttp (asynchronous) is a better fit.
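
For the robots.txt point, a minimal sketch with urllib.robotparser looks like the following. The rules and the demo-crawler agent name are made-up samples parsed offline; against a real site the parser would fetch the file itself, as noted in the comment.

```python
from urllib.robotparser import RobotFileParser

# Sample rules, parsed offline; against a real site you would instead call
# parser.set_url('https://www.example.com/robots.txt') and parser.read()
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(robots_lines)

print(parser.can_fetch('demo-crawler', 'https://www.example.com/blog/'))
print(parser.can_fetch('demo-crawler', 'https://www.example.com/private/data'))
```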


Summary

urllib is the stepping stone into Python crawling. Master it and you will understand the basic flow of an HTTP request: construct a request → send → receive a response → handle exceptions. With that foundation in place, learning third-party libraries later becomes much smoother.