title: Basic use of aiohttp
description: Python asynchronous crawler tutorial: aiohttp detailed explanation
Python asynchronous crawler tutorial: a detailed guide to aiohttp
Imagine you want to scrape data from 10 websites, each of which takes exactly 1 second to respond. With the familiar `requests` library, the program takes at least 10 seconds: the requests are queued one after another, and each has to wait for the previous one to finish. Today's protagonist, `aiohttp`, lets you issue all the requests at the same time, so the total time may be less than 1.1 seconds. This is the appeal of asynchronous crawlers.
1. Overview
`aiohttp` is the core HTTP library in the ecosystem of `asyncio`, Python's asynchronous library, and it provides both client-side and server-side capabilities. In crawler development, we mainly use its client side to issue highly concurrent, non-blocking HTTP requests.
Simply put, it makes your crawler work like an efficient restaurant waiter, who goes to greet the next table instead of standing still until the previous table's dish is served.
1.1 Why choose it?
- 🚀 Fully asynchronous and non-blocking: multiple requests are in flight concurrently, so 10 websites that each take 1 second to respond can finish in a little over 1 second in total
- 📡 Supports HTTP, HTTPS, and WebSocket: it handles both ordinary crawling and real-time data streams
- 🤝 Built-in connection pool, session, proxy, and cookie management: no need to reinvent the wheel
- ⚡ **Much higher throughput than the synchronous `requests` library for I/O-bound workloads**: the advantage is most obvious in large-scale collection scenarios
2. Basic introduction
2.1 Installation
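Install it with a single command, `pip install aiohttp`. This tutorial assumes Python 3.7 or later (older aiohttp releases also ran on 3.6, but upgrading is strongly recommended). Newer aiohttp releases have raised the minimum Python version, so check the requirements of the release you install.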
2.2 The first asynchronous crawler
Before writing asynchronous code, remember two core keywords:
- `async def`: defines an asynchronous function (a coroutine function). Calling it does not run its body; it returns a coroutine object that must be handed to the event loop for scheduling.
- `await`: waits for an asynchronous operation to complete (such as sending a request or reading response data). It can only appear inside an `async def` function.
Let's use them to fetch `example.com` and print the first 200 characters of the response:
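A minimal sketch of such a first crawler. The function names `fetch_preview` and `main` are illustrative and not taken from the original listing.

```python
import asyncio

import aiohttp


async def fetch_preview(url: str, length: int = 200) -> str:
    """Fetch a page and return the first `length` characters of its body."""
    # One session for the whole task; `async with` closes it when done.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # The body is read asynchronously, so it must be awaited.
            text = await response.text()
            return text[:length]


async def main() -> None:
    preview = await fetch_preview("https://example.com")
    print(preview)


# Start the event loop and run the coroutine (needs network access):
# asyncio.run(main())
```

`asyncio.run()` creates the event loop, runs `main()` to completion, and closes the loop afterwards.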
💡 Key Points:
`ClientSession` is similar to a browser's "incognito window": all pages opened in it share the same environment, and it can be closed when you are done. Do not create sessions repeatedly.
3. Common request configurations
3.1 URL with parameters
Avoid manual string splicing and escaping: pass the parameters as a dictionary via `params`, and aiohttp handles the encoding automatically:
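As a rough illustration, the helper below passes a dictionary through `params` and returns the URL that was actually requested. The `https://httpbin.org/get` endpoint, the parameter names, and the `search` helper are assumptions made for this example.

```python
import asyncio

import aiohttp


async def search(session: aiohttp.ClientSession, url: str, params: dict) -> str:
    """Send a GET request with query parameters and return the final URL."""
    # aiohttp URL-encodes the dictionary and appends it as the query string.
    async with session.get(url, params=params) as response:
        return str(response.url)


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # Hypothetical endpoint and parameters, for illustration only.
        final_url = await search(
            session, "https://httpbin.org/get", {"q": "python aiohttp", "page": 2}
        )
        print(final_url)


# asyncio.run(main())
```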
3.2 Set request headers
The most common disguise for a crawler is a modified `User-Agent`. Put the headers you want to customize in the `headers` dictionary:
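A small sketch of per-request headers. The header values in `HEADERS` and the helper name `fetch_with_headers` are placeholders chosen for illustration.

```python
import asyncio

import aiohttp

# Placeholder values; substitute whatever your target site expects.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; demo-crawler/1.0)",
    "Accept-Language": "en-US,en;q=0.9",
}


async def fetch_with_headers(session: aiohttp.ClientSession, url: str) -> str:
    """Fetch `url`, sending the custom headers with this request only."""
    async with session.get(url, headers=HEADERS) as response:
        return await response.text()


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        print(await fetch_with_headers(session, "https://example.com"))


# asyncio.run(main())
```

Headers passed to a single request apply to that request only; headers that every request should carry can be given to `ClientSession` instead.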
3.3 Timeout control
One slow request can bog down the entire application. `aiohttp.ClientTimeout` gives fine-grained control over the total timeout and over individual stages such as connecting and reading:
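One possible configuration is sketched below. The specific limits (10, 3, and 5 seconds) and the helper `fetch_with_timeout` are assumptions for illustration, not values from the original.

```python
import asyncio
from typing import Optional

import aiohttp

# total: limit for the whole request; connect: limit for obtaining a
# connection; sock_read: longest allowed pause between reads of the body.
DEFAULT_TIMEOUT = aiohttp.ClientTimeout(total=10, connect=3, sock_read=5)


async def fetch_with_timeout(
    url: str, timeout: aiohttp.ClientTimeout = DEFAULT_TIMEOUT
) -> Optional[str]:
    """Return the page text, or None if the request timed out."""
    async with aiohttp.ClientSession(timeout=timeout) as session:
        try:
            async with session.get(url) as response:
                return await response.text()
        except asyncio.TimeoutError:
            # Raised when one of the configured limits is exceeded.
            return None


# asyncio.run(fetch_with_timeout("https://example.com"))
```

A timeout set on the session applies to every request made through it; a single request can override it by passing its own `timeout=` argument.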
4. Supported request methods
Besides the most commonly used `GET`, `aiohttp` also supports `POST`, `PUT`, `DELETE`, and the other standard HTTP methods.
4.1 POST request
Form submission (corresponds to `data` in `requests`)
Commonly used to simulate logins or submit ordinary forms:
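A hedged sketch of a form login. The field names `username` and `password`, the login URL, and the `login` helper are hypothetical; real sites define their own form fields.

```python
import asyncio
from typing import Tuple

import aiohttp


async def login(
    session: aiohttp.ClientSession, url: str, username: str, password: str
) -> Tuple[int, str]:
    """Submit a URL-encoded form and return (status code, response text)."""
    # A dict passed as `data` is sent as application/x-www-form-urlencoded.
    form = {"username": username, "password": password}
    async with session.post(url, data=form) as response:
        return response.status, await response.text()


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # Hypothetical endpoint and credentials.
        status, body = await login(
            session, "https://httpbin.org/post", "alice", "secret"
        )
        print(status, body[:200])


# asyncio.run(main())
```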
JSON submission (corresponds to `json` in `requests`)
Used when interacting with RESTful APIs:
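For illustration, the helper below sends a JSON body and parses a JSON reply. The endpoint, the payload, and the name `create_item` are assumptions.

```python
import asyncio

import aiohttp


async def create_item(session: aiohttp.ClientSession, url: str, payload: dict) -> dict:
    """POST `payload` as JSON and return the decoded JSON response."""
    # `json=` serializes the dict and sets Content-Type: application/json.
    async with session.post(url, json=payload) as response:
        response.raise_for_status()
        return await response.json()


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # Hypothetical API endpoint and payload.
        result = await create_item(
            session, "https://httpbin.org/post", {"name": "widget", "price": 9.5}
        )
        print(result)


# asyncio.run(main())
```

Use either `data=` or `json=` for a given request, not both.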
4.2 PUT / DELETE etc.
Usage is almost identical to `GET` and `POST`; just swap in the method name the interface requires:
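A brief sketch with `PUT` and `DELETE`. The resource URL and the helper names `update_item` and `delete_item` are hypothetical.

```python
import asyncio

import aiohttp


async def update_item(session: aiohttp.ClientSession, url: str, payload: dict) -> int:
    """Replace a resource with PUT and return the status code."""
    async with session.put(url, json=payload) as response:
        return response.status


async def delete_item(session: aiohttp.ClientSession, url: str) -> int:
    """Remove a resource with DELETE and return the status code."""
    async with session.delete(url) as response:
        return response.status


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # Hypothetical resource URLs.
        print(await update_item(session, "https://httpbin.org/put", {"name": "new"}))
        print(await delete_item(session, "https://httpbin.org/delete"))


# asyncio.run(main())
```

Other methods follow the same pattern (`session.patch`, `session.head`, `session.options`), and `session.request("METHOD", url)` accepts the method name as a string.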
5. Process response data
5.1 Basic response information
Attributes such as the status code and response headers are available directly, but the response body must be read with `await` because it is an asynchronous data stream.
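The difference between plain attributes and the awaited body can be sketched as follows. The `inspect_response` helper and the keys of the returned dictionary are illustrative.

```python
import asyncio

import aiohttp


async def inspect_response(session: aiohttp.ClientSession, url: str) -> dict:
    """Collect basic response metadata plus the decoded body."""
    async with session.get(url) as response:
        info = {
            # Available as soon as the headers arrive; no await needed.
            "status": response.status,
            "content_type": response.headers.get("Content-Type"),
            "final_url": str(response.url),
        }
        # The body is a stream, so every way of reading it is awaited:
        #   await response.text()  -> str
        #   await response.read()  -> bytes
        #   await response.json()  -> parsed JSON
        info["text"] = await response.text()
        return info


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        info = await inspect_response(session, "https://example.com")
        print(info["status"], info["content_type"], len(info["text"]))


# asyncio.run(main())
```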
5.2 Large file streaming download
If you want to download a file of several hundred MB, calling `read()` directly loads the entire file into memory, which may crash the program. The correct approach is to read it in chunks:
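A minimal sketch of chunked downloading using `response.content.iter_chunked()`. The 64 KiB chunk size, the file name, the URL, and the helper name `download` are assumptions.

```python
import asyncio

import aiohttp


async def download(
    session: aiohttp.ClientSession, url: str, dest: str, chunk_size: int = 64 * 1024
) -> int:
    """Stream `url` into the file `dest`; return the number of bytes written."""
    written = 0
    async with session.get(url) as response:
        response.raise_for_status()
        with open(dest, "wb") as f:
            # Only one chunk is held in memory at a time.
            async for chunk in response.content.iter_chunked(chunk_size):
                f.write(chunk)
                written += len(chunk)
    return written


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # Hypothetical file URL and destination path.
        size = await download(session, "https://example.com/big.zip", "big.zip")
        print(f"saved {size} bytes")


# asyncio.run(main())
```

The file write here is an ordinary blocking call, which is usually acceptable for small chunks; for very heavy disk I/O it could be moved to a thread.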
6. Advanced practical functions
6.1 Limit the number of concurrent requests
Opening hundreds or thousands of concurrent requests without any limit may get your IP blocked by the server and can also saturate your local bandwidth. `asyncio.Semaphore` lets you limit the number of simultaneous requests:
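The idea can be sketched like this. The limit of 5, the URL list, and the helper names `fetch_limited` and `crawl` are illustrative choices.

```python
import asyncio
from typing import List, Tuple

import aiohttp


async def fetch_limited(
    session: aiohttp.ClientSession, semaphore: asyncio.Semaphore, url: str
) -> Tuple[str, int]:
    """Fetch `url` while holding one of the semaphore's slots."""
    # At most `max_concurrency` coroutines are inside this block at once;
    # the others wait here until a slot is released.
    async with semaphore:
        async with session.get(url) as response:
            await response.read()
            return url, response.status


async def crawl(urls: List[str], max_concurrency: int = 5) -> List[Tuple[str, int]]:
    """Fetch all URLs concurrently, never exceeding `max_concurrency`."""
    semaphore = asyncio.Semaphore(max_concurrency)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_limited(session, semaphore, url) for url in urls]
        # gather() returns results in the same order as `urls`.
        return await asyncio.gather(*tasks)


# Hypothetical URL list:
# asyncio.run(crawl(["https://example.com"] * 10, max_concurrency=5))
```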
6.2 Session persistence
`ClientSession` maintains cookies, the connection pool, and default request headers internally. Configure them once when creating the session, and all subsequent requests use them automatically:
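A sketch of a session configured once and reused for two requests, so that a cookie set by the first response is sent with the second. The URLs, the header value, and the helper name `login_then_fetch` are assumptions.

```python
import asyncio
from typing import Optional

import aiohttp

# Placeholder default header sent with every request of the session.
DEFAULT_HEADERS = {"User-Agent": "my-crawler/1.0"}


async def login_then_fetch(
    login_url: str, page_url: str, cookie_jar: Optional[aiohttp.CookieJar] = None
) -> str:
    """Visit `login_url`, then fetch `page_url` with the same session state."""
    # Note: the default cookie jar ignores cookies from bare IP-address
    # hosts; pass aiohttp.CookieJar(unsafe=True) if you need those.
    async with aiohttp.ClientSession(
        headers=DEFAULT_HEADERS,
        timeout=aiohttp.ClientTimeout(total=15),
        cookie_jar=cookie_jar,
    ) as session:
        async with session.get(login_url) as response:
            await response.read()  # cookies from Set-Cookie are stored
        async with session.get(page_url) as response:
            return await response.text()  # stored cookies were sent along


# Hypothetical URLs:
# asyncio.run(login_then_fetch(
#     "https://httpbin.org/cookies/set?sid=abc", "https://httpbin.org/cookies"))
```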
6.3 Proxy settings
Pass the `proxy` parameter to configure an HTTP/HTTPS proxy; proxies that require authentication are also supported:
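A hedged sketch of proxy usage. The proxy address, the credentials, and the helper name `fetch_via_proxy` are placeholders; substitute a proxy you actually control.

```python
import asyncio
from typing import Optional

import aiohttp

# Placeholder proxy address, for illustration only.
PROXY_URL = "http://127.0.0.1:8080"


async def fetch_via_proxy(
    session: aiohttp.ClientSession,
    url: str,
    proxy: str = PROXY_URL,
    proxy_auth: Optional[aiohttp.BasicAuth] = None,
) -> str:
    """Fetch `url` through `proxy`, optionally with basic authentication."""
    async with session.get(url, proxy=proxy, proxy_auth=proxy_auth) as response:
        return await response.text()


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # Credentials can be passed as a BasicAuth object (shown here) or
        # embedded in the proxy URL as http://user:password@host:port.
        auth = aiohttp.BasicAuth("user", "password")
        print(await fetch_via_proxy(session, "https://example.com", proxy_auth=auth))


# asyncio.run(main())
```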
7. Error handling
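Network errors are very common, so be sure to catch them proactively. Most aiohttp client exceptions inherit from `aiohttp.ClientError`, and it is recommended to handle the different types separately. Note that an expired total timeout is raised as `asyncio.TimeoutError`, which is not a subclass of `ClientError`, so it needs its own `except` clause: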
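One way to separate the common failure types is sketched below. The helper name `safe_fetch` and the choice to print and return `None` on failure are illustrative.

```python
import asyncio
from typing import Optional

import aiohttp


async def safe_fetch(session: aiohttp.ClientSession, url: str) -> Optional[str]:
    """Return the page text, or None after reporting what went wrong."""
    try:
        async with session.get(url) as response:
            response.raise_for_status()  # turn 4xx/5xx into an exception
            return await response.text()
    except aiohttp.ClientResponseError as exc:
        print(f"HTTP error {exc.status} for {url}")
    except asyncio.TimeoutError:
        # Listed before the connection errors so timeouts are reported as such.
        print(f"Timed out: {url}")
    except aiohttp.ClientConnectionError as exc:
        print(f"Connection problem for {url}: {exc!r}")
    except aiohttp.ClientError as exc:
        # Fallback for any other client-side error.
        print(f"Request failed for {url}: {exc!r}")
    return None


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        print(await safe_fetch(session, "https://example.com"))


# asyncio.run(main())
```

The more specific exception classes come first because Python uses the first matching `except` clause.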
8. Complete practice: crawling websites in batches
Now combine the points covered above into a real crawler: crawl 3 websites concurrently, limit concurrency to 2, and include complete error handling.
Run this code and you will find that, even though one page has a 2-second delay, the results of the other two pages come back almost at the same time, and the total time is less than the sum of the times the three pages would take one after another.
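A sketch of such a crawler, under stated assumptions: the three URLs (including an `httpbin.org` endpoint that delays its response by 2 seconds), the 10-second timeout, and all function names are chosen for illustration and may differ from the original listing.

```python
import asyncio
import time
from typing import List

import aiohttp

# Assumed targets; the last one responds after roughly 2 seconds.
URLS = [
    "https://example.com",
    "https://httpbin.org/get",
    "https://httpbin.org/delay/2",
]
MAX_CONCURRENCY = 2


async def fetch(session: aiohttp.ClientSession, semaphore: asyncio.Semaphore, url: str) -> dict:
    """Fetch one URL under the concurrency limit and report the outcome."""
    async with semaphore:
        try:
            async with session.get(url) as response:
                response.raise_for_status()
                body = await response.text()
                return {"url": url, "ok": True, "status": response.status, "length": len(body)}
        except asyncio.TimeoutError:
            return {"url": url, "ok": False, "error": "timeout"}
        except aiohttp.ClientError as exc:
            return {"url": url, "ok": False, "error": repr(exc)}


async def crawl(urls: List[str], max_concurrency: int = MAX_CONCURRENCY) -> List[dict]:
    """Crawl all URLs with one shared session and a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return await asyncio.gather(*(fetch(session, semaphore, u) for u in urls))


async def main() -> None:
    start = time.perf_counter()
    for result in await crawl(URLS):
        print(result)
    print(f"Total time: {time.perf_counter() - start:.2f}s")


# asyncio.run(main())
```

Because each failure is converted into a result dictionary inside `fetch`, one bad URL does not cancel the other requests in `gather()`.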
9. Summary of best practices
- Always reuse `ClientSession`: do not create a new one for every request; the initialization overhead is relatively high
- Always limit concurrency: adjust it to what the server can bear and to your own network conditions; 5 to 20 is the generally recommended range
- Set timeouts sensibly: prevent individual slow requests from dragging down the entire crawler
- Download large files in chunks: prevents running out of memory
- Handle errors thoroughly: the network environment is uncontrollable, so cover every exception you can anticipate
- Make good use of `async with` to manage lifecycles: it closes the session and releases the semaphore automatically, which is less error-prone
10. Extended learning
If you want to learn `aiohttp` in more depth, read its official documentation and study it together with `asyncio`, so that you thoroughly understand the event loop and the asynchronous programming model:
- aiohttp official documentation
- For more complex concurrent task scheduling, it can also be combined with libraries such as `aiomultiprocess` to further improve throughput for large batches of tasks.
I hope this tutorial helps you get started with `aiohttp` quickly and write efficient, stable asynchronous crawlers!

