title: First experience with urllib crawler
description: urllib is Python's built-in HTTP request library and can be used without additional installation. It contains four main submodules.
Python urllib module usage guide
1. Overview
urllib is the HTTP request toolkit that ships with Python's standard library, so it can be used directly without additional installation. It bundles request sending, exception handling, URL processing, and even robots.txt parsing, which makes it well suited to getting started with crawlers.
It mainly contains 4 sub-modules:
- urllib.request: sends HTTP requests such as GET and POST
- urllib.error: defines the URL and HTTP exceptions raised during a request
- urllib.parse: splits, joins, encodes, and decodes URLs
- urllib.robotparser: parses robots.txt rules (rarely needed at the beginner stage)
Note: Python 3 merged Python 2's confusing urllib/urllib2 pair into the current urllib package and split it into the submodules above with clearer responsibilities, so there is no longer any need to confuse the two.
2. Send request
2.1 The simplest request: urlopen
urlopen is the most direct way to issue a request. A single call is enough to complete a GET request.
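As a minimal sketch of that single-call GET, the helper below wraps urlopen in a with block so the connection is closed afterwards. The function name fetch_text and the example.com URL are illustrative assumptions, not taken from the original article.

```python
import urllib.request


def fetch_text(url):
    """Send a GET request and return the body decoded as UTF-8 text."""
    # urlopen returns a response object that also works as a context manager
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


# Example call (needs network access; the URL is a placeholder):
# html = fetch_text("https://example.com")
# print(html[:200])
```

Hard-coding UTF-8 keeps the sketch short; it is only safe when the page really is UTF-8.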
In addition to the URL, urlopen supports several practical parameters:
Parameter 1: data → send a POST request instead
To send a POST request, encode the parameter dictionary into a byte string and pass it in as data.
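The encoding step could look roughly like this: urlencode turns the dictionary into a query string, and encode converts that string into bytes. The helper names encode_form and post_form, the field names, and the httpbin.org URL are assumptions made for illustration.

```python
import urllib.parse
import urllib.request


def encode_form(params):
    """Turn a dict into application/x-www-form-urlencoded bytes."""
    return urllib.parse.urlencode(params).encode("utf-8")


def post_form(url, params, timeout=10):
    """POST the form fields and return the raw response body."""
    body = encode_form(params)
    # Passing data switches urlopen from GET to POST
    with urllib.request.urlopen(url, data=body, timeout=timeout) as resp:
        return resp.read()


print(encode_form({"name": "alice", "age": 30}))  # b'name=alice&age=30'

# Example call (placeholder URL, needs network access):
# post_form("https://httpbin.org/post", {"name": "alice", "age": 30})
```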
Parameter 2: timeout → control the timeout
Pass timeout (in seconds) so the program does not hang when the network is unstable.
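One possible way to combine timeout with exception handling is sketched below. The helper name fetch_with_timeout and the choice to return None on a timeout are assumptions for this example; a real crawler might retry or log instead.

```python
import socket
import urllib.error
import urllib.request


def fetch_with_timeout(url, timeout=5):
    """Return the response body, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except socket.timeout:
        # May be raised directly while waiting for or reading the response
        return None
    except urllib.error.URLError as exc:
        # A timeout while connecting is wrapped in URLError.reason
        if isinstance(exc.reason, socket.timeout):
            return None
        raise


# Example call (placeholder URL, needs network access):
# body = fetch_with_timeout("https://example.com", timeout=3)
```

Both branches are needed because the timeout can surface either as a bare socket.timeout or wrapped inside a URLError, depending on the phase of the request in which it occurs.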
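Parameter 3: Other security-related parameters
- context: customizes the SSL verification rules (beginners can skip it, but do not turn off verification in production)
- cafile/capath: specify a local CA certificate path (suited to intranet HTTPS sites with self-signed certificates). These two parameters were deprecated in Python 3.6 and removed in Python 3.13; on newer versions, load the certificates into an SSL context and pass it as context instead.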
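A hedged sketch of the context parameter follows. It keeps verification switched on and optionally trusts a specific CA file, which covers the intranet case without relying on cafile/capath. The helper names build_context and fetch_https are assumptions for illustration.

```python
import ssl
import urllib.request


def build_context(ca_file=None):
    """Create a verifying SSL context.

    With ca_file set, that file is trusted instead of the system CA store.
    """
    # create_default_context keeps certificate and hostname checks enabled
    return ssl.create_default_context(cafile=ca_file)


def fetch_https(url, ca_file=None, timeout=10):
    """GET an HTTPS URL using the context built above."""
    ctx = build_context(ca_file)
    with urllib.request.urlopen(url, context=ctx, timeout=timeout) as resp:
        return resp.read()


# Example call (hypothetical intranet host and certificate path):
# fetch_https("https://intranet.example/", ca_file="/path/to/internal-ca.pem")
```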
2.2 More flexible requests: the Request class
When you need custom request headers such as User-Agent and Referer, or want to specify the request method explicitly, urlopen on its own is not quite enough. In that case, first build a request object with Request and then pass it to urlopen.
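A small sketch of building such a request object is shown here. The helper name build_request, the User-Agent string, and the example.com URLs are placeholders chosen for illustration.

```python
import urllib.request


def build_request(url, referer=None, method="GET"):
    """Build a Request carrying browser-like headers."""
    headers = {
        # Example browser User-Agent string; swap in whichever fits your case
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    }
    if referer:
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers, method=method)


req = build_request("https://example.com/list", referer="https://example.com/")
print(req.get_method(), req.full_url)
print(req.header_items())

# Sending it works the same as with a plain URL (needs network access):
# with urllib.request.urlopen(req, timeout=10) as resp:
#     html = resp.read()
```

Note that Request normalizes header names with str.capitalize, so the header is stored as "User-agent" when you look it up with get_header.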
The Request constructor accepts the following parameters (use them as needed): url, data, headers, origin_req_host, unverifiable, and method.
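For reference, here is a sketch that passes every constructor parameter once. The URL, body, and header values are made up for the example; in practice most requests only need url, headers, and sometimes data or method.

```python
import urllib.request

# Signature: Request(url, data=None, headers={}, origin_req_host=None,
#                    unverifiable=False, method=None)
req = urllib.request.Request(
    url="https://example.com/api/items",            # target URL (required)
    data=b'{"name": "demo"}',                       # request body, must be bytes
    headers={"Content-Type": "application/json"},   # extra request headers
    origin_req_host="example.com",                  # host of the originating request
    unverifiable=False,                             # True if the user had no chance to approve it
    method="PUT",                                   # overrides the GET/POST default
)

print(req.get_method())  # PUT
print(req.data)
```

When method is omitted, urllib falls back to GET if data is None and POST otherwise.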
2.3 Advanced usage: OpenerDirector
Behind the scenes, urlopen uses a default opener that Python prepares for us. To handle more complex scenarios such as authentication, proxies, and cookies, you need to build an opener yourself.
Scenario 1: The website requires Basic Auth authentication
For example, some test sites require a username and password before they can be accessed.
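A minimal sketch of an opener that answers Basic Auth challenges might look like this. The helper name build_auth_opener, the example.com URLs, and the credentials are all placeholders.

```python
import urllib.request


def build_auth_opener(top_url, username, password):
    """Return an opener that answers Basic Auth challenges under top_url."""
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # realm=None means "use these credentials for any realm at this URL"
    mgr.add_password(None, top_url, username, password)
    handler = urllib.request.HTTPBasicAuthHandler(mgr)
    return urllib.request.build_opener(handler)


opener = build_auth_opener("https://example.com/protected/", "user", "secret")

# Needs network access and a server that actually asks for Basic Auth:
# with opener.open("https://example.com/protected/page", timeout=10) as resp:
#     print(resp.read()[:100])
```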
Scenario 2: Set proxy IP
Routing requests through a proxy helps keep a single IP from being banned for sending requests too frequently.
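The proxy setup can be sketched with ProxyHandler as follows. The helper name build_proxy_opener and the 127.0.0.1:8080 address are assumptions; substitute a proxy you actually control.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that routes HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# 127.0.0.1:8080 is a placeholder proxy address
opener = build_proxy_opener("http://127.0.0.1:8080")

# Requests made through this opener go via the proxy (needs a running proxy):
# with opener.open("https://example.com", timeout=10) as resp:
#     print(resp.read()[:100])
```

Calling urllib.request.install_opener(opener) afterwards would make plain urlopen calls use the same proxy settings.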
Scenario 3: Handling Cookies
Keep the session state across requests after a simulated login.
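One way to keep cookies between requests is to attach a CookieJar to the opener, as in the sketch below. The helper name build_cookie_opener, the URLs, and the form fields in the commented flow are placeholders.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, jar); the jar collects cookies across requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = build_cookie_opener()

# Typical flow (placeholder URLs, needs network access):
# opener.open("https://example.com/login", data=b"user=alice&pwd=secret")
# opener.open("https://example.com/profile")  # cookies from the login are sent back
# for cookie in jar:
#     print(cookie.name, cookie.value)
```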
3. Exception handling
A crawler will inevitably run into network failures, 404/500 responses, and similar problems, so be sure to catch exceptions to make the program more robust. urllib provides two main exception classes, URLError and HTTPError. Because HTTPError is a subclass of URLError, catch the subclass first and the parent class second.
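The catch order can be sketched like this. The helper name safe_fetch and its (label, value) return convention are assumptions made for the example.

```python
import urllib.error
import urllib.request


def safe_fetch(url, timeout=5):
    """Return ("ok", body), ("http_error", status_code) or ("url_error", reason)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "ok", resp.read()
    except urllib.error.HTTPError as exc:
        # Subclass first: the server answered, but with an error status (404, 500, ...)
        return "http_error", exc.code
    except urllib.error.URLError as exc:
        # Parent class second: DNS failure, refused connection, unknown scheme, ...
        return "url_error", str(exc.reason)


# Example call (placeholder URL, needs network access):
# print(safe_fetch("https://example.com/does-not-exist"))
```

If the two except clauses were swapped, the URLError branch would also swallow every HTTPError and the status code would never be inspected.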
4. URL parsing
Crawlers often need to split, join, or encode URLs, and the urllib.parse module makes this easy.
4.1 Splitting URLs: urlparse / urlsplit
Both functions split a complete URL into its component parts.
urlsplit is very similar to urlparse, except that it does not separate out params (modern websites rarely use this part anymore).
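The difference is easiest to see on a URL that actually contains a params part. The URL below is made up for the example.

```python
from urllib.parse import urlparse, urlsplit

url = "https://example.com/path/page;type=a?q=python&page=2#top"

parsed = urlparse(url)
print(parsed.scheme)    # https
print(parsed.netloc)    # example.com
print(parsed.path)      # /path/page
print(parsed.params)    # type=a
print(parsed.query)     # q=python&page=2
print(parsed.fragment)  # top

split = urlsplit(url)
print(split.path)       # /path/page;type=a  (params stay inside the path)
```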
4.2 Joining URLs: urlunparse / urlunsplit / urljoin
urlunparse and urlunsplit are the inverse of the splitting functions above: they assemble a tuple or list back into a complete URL. urljoin resolves a relative path against a base URL to produce an absolute URL, which is very useful for handling relative links in web pages.
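A short sketch of the three functions, using made-up example.com URLs:

```python
from urllib.parse import urljoin, urlunparse, urlunsplit

# urlunparse needs 6 parts: scheme, netloc, path, params, query, fragment
full = urlunparse(("https", "example.com", "/index.html", "", "a=1", "top"))
print(full)  # https://example.com/index.html?a=1#top

# urlunsplit needs 5 parts (no params)
same = urlunsplit(("https", "example.com", "/index.html", "a=1", "top"))
print(same)  # https://example.com/index.html?a=1#top

# urljoin resolves links found in a page against the page's own URL
base = "https://example.com/a/b/page.html"
print(urljoin(base, "next.html"))      # https://example.com/a/b/next.html
print(urljoin(base, "../c.html"))      # https://example.com/a/c.html
print(urljoin(base, "/img/logo.png"))  # https://example.com/img/logo.png
```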
4.3 URL encoding/decoding: urlencode / quote / unquote
Chinese characters and special symbols must be percent-encoded before they are placed in a URL so that the server can interpret them correctly.
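As an illustration, the sketch below encodes a query dictionary and a single component. The parameter names wd and page and the example.com URL are assumptions.

```python
from urllib.parse import quote, unquote, urlencode

# urlencode: dict -> query string (non-ASCII text is percent-encoded as UTF-8)
query = urlencode({"wd": "中文", "page": 1})
print(query)    # wd=%E4%B8%AD%E6%96%87&page=1

# quote / unquote: encode or decode a single URL component
encoded = quote("中文")
print(encoded)  # %E4%B8%AD%E6%96%87
decoded = unquote(encoded)
print(decoded == "中文")  # True

url = "https://example.com/search?" + query
print(url)
```

urlencode is the convenient choice for whole query strings, while quote fits individual path segments or values.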
5. Best Practices (Pitfall Avoidance Guide)
- Be sure to set User-Agent: The default urllib User-Agent (Python-urllib/x.y) immediately exposes your crawler, and many sites respond with 403. Send a browser-like User-Agent instead.
- Set timeout reasonably: 3 to 10 seconds is generally recommended, combined with exception handling, so the program does not hang for a long time.
- Handle encoding explicitly: When converting the response bytes into a string, first determine the encoding from the Content-Type response header. Hard-coding utf-8 sometimes produces garbled characters.
- Comply with robots.txt: You can skip this while learning, but before writing a public crawler or collecting data at scale, check with robotparser whether access is allowed.
- Switch libraries in time for complex requirements: urllib is built in, but its syntax is relatively cumbersome. For cookie pools, proxy pools, asynchronous concurrency, and similar needs, requests (synchronous) or aiohttp (asynchronous) is the better choice.
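Two of the practices above, reading the charset from Content-Type and checking robots.txt, can be sketched as follows. The helper names read_text and is_allowed, the crawler name MyCrawler, and the sample rules are assumptions for illustration.

```python
import urllib.request
import urllib.robotparser


def read_text(url, fallback="utf-8", timeout=10):
    """Fetch url and decode the body with the charset from Content-Type."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # get_content_charset returns None when the header names no charset
        charset = resp.headers.get_content_charset() or fallback
        return resp.read().decode(charset, errors="replace")


def is_allowed(robots_txt, user_agent, url):
    """Check url against robots.txt rules that are already in memory."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "MyCrawler", "https://example.com/private/data"))  # False
print(is_allowed(rules, "MyCrawler", "https://example.com/index.html"))    # True
```

Against a live site, calling set_url with the site's robots.txt address followed by read() would fetch the rules instead of parsing a local string.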
Summary
urllib is the "stepping stone" for getting started with Python crawlers. Once you master it, you will understand the basic flow of an HTTP request: construct the request → send it → receive the response → handle exceptions. With this foundation in place, learning third-party libraries later will go much more smoothly.

