A Guide to Python's urllib Module

1. Overview

urllib is Python's built-in HTTP request library; it works out of the box with no extra installation. It contains the following main modules:

  • request: the basic HTTP request module
  • error: the exception-handling module
  • parse: URL-handling utilities
  • robotparser: the robots.txt parsing module (rarely used)

Note: Python 2 had two separate libraries, urllib and urllib2, but in Python 3 they were unified into urllib.

2. Sending Requests

2.1 The urlopen Method

urlopen is the most basic way to send a request:

from urllib.request import urlopen

response = urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

Parameter notes:

  1. data parameter: used for POST requests

    from urllib.parse import urlencode
    data = bytes(urlencode({'word': 'hello'}), encoding='utf-8')
    response = urlopen('https://httpbin.org/post', data=data)
  2. timeout parameter: sets the request timeout in seconds

    import socket
    from urllib.error import URLError

    try:
        response = urlopen('https://httpbin.org/get', timeout=0.1)
    except URLError as e:
        if isinstance(e.reason, socket.timeout):
            print('TIME OUT')
  3. Other parameters

    • context: SSL settings
    • cafile/capath: CA certificate paths
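
As a sketch of how the context parameter is typically used: ssl.create_default_context() builds a context that verifies server certificates and hostnames by default, and cafile/capath serve the same purpose as loading CA bundles into such a context.

```python
import ssl

# create_default_context() verifies server certificates by default;
# cafile/capath in urlopen play the same role as load_verify_locations()
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(context.check_hostname)                    # True

# Passing it to urlopen (not executed here):
# response = urlopen('https://www.python.org', context=context)
```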

2.2 The Request Class

Used to build more complex requests:

from urllib.request import Request, urlopen

request = Request('https://python.org')
request.add_header('User-Agent', 'Mozilla/5.0')
response = urlopen(request)

Full constructor signature:

Request(url, data=None, headers={}, 
        origin_req_host=None, 
        unverifiable=False, 
        method=None)
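
A request built this way can be inspected before it is sent; a small sketch (httpbin.org serves as a placeholder URL, and nothing is actually transmitted):

```python
from urllib.parse import urlencode
from urllib.request import Request

data = bytes(urlencode({'word': 'hello'}), encoding='utf-8')
headers = {'User-Agent': 'Mozilla/5.0'}
request = Request('https://httpbin.org/post', data=data,
                  headers=headers, method='POST')

# Nothing is sent yet; the object just carries the request's parts
print(request.get_method())              # POST
print(request.get_header('User-agent')) # Mozilla/5.0 (header keys are capitalized)
print(request.full_url)                  # https://httpbin.org/post
```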

2.3 Advanced Usage

Authentication

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener

auth_handler = HTTPBasicAuthHandler(HTTPPasswordMgrWithDefaultRealm())
auth_handler.add_password(realm=None, uri='https://ssr3.scrape.center/', user='admin', passwd='admin')
opener = build_opener(auth_handler)
response = opener.open('https://ssr3.scrape.center/')

Proxy Settings

from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:8080',
    'https': 'https://127.0.0.1:8080'
})
opener = build_opener(proxy_handler)
response = opener.open('https://www.baidu.com')

Handling Cookies

from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener

cookie = CookieJar()
handler = HTTPCookieProcessor(cookie)
opener = build_opener(handler)
response = opener.open('https://www.baidu.com')

Saving cookies to a file:

from http.cookiejar import MozillaCookieJar
from urllib.request import HTTPCookieProcessor, build_opener

cookie = MozillaCookieJar('cookie.txt')
handler = HTTPCookieProcessor(cookie)
opener = build_opener(handler)
response = opener.open('https://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
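
To restore cookies later, MozillaCookieJar's load() reads the same file back. A minimal round trip (the jar here is empty, so the file contains only the Netscape header):

```python
from http.cookiejar import MozillaCookieJar

# Save an (empty) jar to disk, then reload it
jar = MozillaCookieJar('cookie.txt')
jar.save(ignore_discard=True, ignore_expires=True)

restored = MozillaCookieJar()
restored.load('cookie.txt', ignore_discard=True, ignore_expires=True)
print(len(restored))  # 0 cookies in this empty example
```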

3. Exception Handling

3.1 URLError

from urllib.error import URLError

try:
    response = urlopen('https://nonexistent.com')
except URLError as e:
    print(e.reason)

3.2 HTTPError

from urllib.error import HTTPError, URLError

try:
    response = urlopen('https://httpbin.org/status/404')
except HTTPError as e:
    print(e.code, e.reason, e.headers)
except URLError as e:
    print(e.reason)
else:
    print('Request succeeded')

4. URL Parsing

4.1 urlparse

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
print(result.scheme, result.netloc, result.path)
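
The result is a named tuple with exactly six components (scheme, netloc, path, params, query, fragment); continuing the example above:

```python
from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')

# ParseResult behaves like a plain six-tuple, with named access too
print(tuple(result))
# ('https', 'www.baidu.com', '/index.html', 'user', 'id=5', 'comment')
print(result.query, result.fragment)  # id=5 comment
```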

4.2 urlunparse

from urllib.parse import urlunparse

data = ['https', 'www.baidu.com', 'index.html', '', 'a=6', 'comment']
print(urlunparse(data))

4.3 urlencode

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'https://www.baidu.com?'
url = base_url + urlencode(params)
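
urlencode can also serialize a sequence value into repeated query parameters via doseq=True; a quick sketch:

```python
from urllib.parse import urlencode

# doseq=True expands sequence values into repeated parameters
query = urlencode({'tag': ['python', 'urllib']}, doseq=True)
print(query)  # tag=python&tag=urllib
```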

4.4 quote/unquote

from urllib.parse import quote, unquote

keyword = '壁纸'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(unquote(url))
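
quote percent-encodes the UTF-8 bytes of the keyword, and unquote reverses it exactly:

```python
from urllib.parse import quote, unquote

encoded = quote('壁纸')
print(encoded)  # %E5%A3%81%E7%BA%B8 (UTF-8 bytes, percent-encoded)
print(unquote(encoded))  # 壁纸
```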

5. Analyzing the Robots Protocol

5.1 Basic Usage

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.baidu.com/robots.txt')
rp.read()
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com/homepage/'))

5.2 Common Methods

  • set_url(): set the robots.txt URL
  • read(): fetch and parse robots.txt
  • can_fetch(): check whether a URL may be crawled
  • mtime(): return the time robots.txt was last fetched
  • modified(): set the last-fetched time to now
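
Besides read(), parse() accepts robots.txt lines directly, which is handy for trying the parser without a network fetch. A sketch with a made-up rule set (example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Feed robots.txt content directly instead of fetching it
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)
print(rp.can_fetch('*', 'https://example.com/public/page'))   # True
print(rp.can_fetch('*', 'https://example.com/private/data'))  # False
```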

6. Best Practices

  1. Always set a User-Agent: avoid being flagged as a bot
  2. Use timeout sensibly: prevent requests from hanging indefinitely
  3. Handle exceptions: make your program more robust
  4. Obey robots.txt: respect each site's rules
  5. Consider the requests library: for complex needs, requests is friendlier
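
Putting the first three recommendations together, a sketch of a small fetch helper (the URL in the last line uses the reserved .invalid TLD, so it deliberately fails to resolve):

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def fetch(url, timeout=5):
    """Fetch url with a browser-like User-Agent; return text, or None on failure."""
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        with urlopen(req, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except HTTPError as e:   # HTTP-level errors (4xx/5xx)
        print('HTTP error:', e.code, e.reason)
    except URLError as e:    # network-level errors, including timeouts
        print('URL error:', e.reason)
    return None

print(fetch('https://nonexistent.invalid/', timeout=2))  # None
```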

7. Changelog (2023)

  1. Removed Python 2 material
  2. Updated exception-handling best practices
  3. Added advice for modern web development
  4. Improved the readability of code examples
  5. Emphasized HTTPS and secure connections

For more complex crawling needs, consider third-party libraries such as requests or aiohttp.