Python asynchronous crawler tutorial (2024 updated version)
Is it time for the weekly crawler optimization session again? Crawling 100 test pages with a 1-second delay each takes more than a minute and a half when done synchronously. Multi-threading brings worries about race conditions and resource pile-up, and inter-process communication is a hassle. For IO-intensive tasks in 2024, Python's coroutine-based asynchronous crawler is the golden solution!
1. Overview of asynchronous crawlers
1.1 Why must it be asynchronous?
The core bottleneck of a crawler is almost never CPU work but waiting on network/disk IO. For example, after you call requests.get(), the program simply waits for the server to respond for the next tens or hundreds of milliseconds, or even seconds, unable to do anything else.
An asynchronous crawler switches to other runnable tasks as soon as one task starts waiting for IO, so the waiting time is put to use!
1.2 Three core advantages
2. Core concepts in brief (unofficial version)
There is no need to dig into the underlying scheduling algorithms; just remember these three keywords:
2.1 Coroutine
It can be understood as a task unit that “can be paused, resumed, and actively controlled by the programmer”. Compare it with watching a movie:
- Synchronous thread: watch a movie to the end and don’t answer the phone in the middle
- Coroutine: Pause watching the movie → Answer the emergency call → Hang up after processing → Return to the place where the movie was paused and continue watching the movie
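The movie analogy can be sketched with two coroutines. This is a minimal illustration using only the standard library; the names watch_movie and answer_call and the sleep durations are made up for the example, with asyncio.sleep standing in for real IO waits.

```python
import asyncio

events = []  # records the order in which things happen


async def watch_movie():
    events.append("movie: start")
    # Pause point: control goes back to the event loop while we "wait".
    await asyncio.sleep(0.05)
    events.append("movie: resume where it was paused")


async def answer_call():
    events.append("call: answer")
    await asyncio.sleep(0.01)
    events.append("call: hang up")


async def main():
    # Both coroutines run on one thread and interleave at their await points.
    await asyncio.gather(watch_movie(), answer_call())


asyncio.run(main())
for line in events:
    print(line)
```

The call is answered and finished while the movie is paused, and the movie then continues from its pause point rather than starting over.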
2.2 Event Loop
This is the "general dispatcher" of the coroutine, which does three things in a loop in the background:
- Check all coroutines: Which ones have been suspended but IO is completed (can be resumed)? Which ones can be run just after startup?
- Select a task to execute according to the rules
- Run the task until its next pause point (await), then return to the loop
Python 3.7+ provides a very convenient entry point, asyncio.run(), so there is no need to create or close the event loop manually!
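To make the convenience concrete, here is a small sketch that runs the same coroutine twice: once through asyncio.run(), and once with a hand-managed loop. The manual version is only a rough equivalent (asyncio.run also cancels leftover tasks and performs other cleanup), and main is a placeholder coroutine.

```python
import asyncio


async def main():
    await asyncio.sleep(0.01)  # stand-in for real IO
    return "done"


# Python 3.7+: one call creates the loop, runs main(), and closes the loop.
result = asyncio.run(main())

# Roughly what that one call replaces when managing the loop by hand.
loop = asyncio.new_event_loop()
try:
    manual_result = loop.run_until_complete(main())
finally:
    loop.close()

print(result, manual_result)
```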
2.3 async/await syntactic sugar
This is the magic that makes coroutine code look like synchronous code:
- async def: tells Python "this is not an ordinary function but a coroutine function; calling it returns a coroutine object and does not run the body immediately."
- await: can only be used inside async def, and means "wait until this asynchronous operation completes before proceeding; in the meantime the event loop can run other tasks."
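The "returns a coroutine object and does not run immediately" behaviour is easy to check. In this sketch, greet is a hypothetical coroutine function that logs each call so we can see when its body actually executes.

```python
import asyncio
import inspect

calls = []  # filled in only when the coroutine body actually runs


async def greet(name):
    calls.append(name)
    await asyncio.sleep(0)  # yield to the event loop once
    return f"hello {name}"


coro = greet("crawler")                   # creates a coroutine object; body not run
calls_after_creation = list(calls)        # snapshot taken before the loop starts
is_coroutine = inspect.iscoroutine(coro)

message = asyncio.run(coro)               # the event loop now drives the body
print(is_coroutine, calls_after_creation, message)
```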
3. Mainstream asynchronous HTTP tool: aiohttp
The most widely used HTTP client in the Python asynchronous ecosystem is aiohttp, and the 3.9+ releases current in 2024 offer a smoother experience!
3.1 The most basic single page crawling
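A minimal single-page fetch might look like the sketch below. It assumes aiohttp is installed; the URL is a placeholder, the 10-second timeout is an arbitrary choice, and fetch and main are names picked for this example.

```python
import asyncio

import aiohttp


async def fetch(session, url):
    """Fetch one page and return its body as text."""
    async with session.get(url) as response:
        response.raise_for_status()  # raise on 4xx/5xx responses
        return await response.text()


async def main(url="https://example.com"):  # placeholder URL
    timeout = aiohttp.ClientTimeout(total=10)  # assumed value
    async with aiohttp.ClientSession(timeout=timeout) as session:
        try:
            html = await fetch(session, url)
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            print(f"request failed: {exc!r}")
            return None
        print(f"fetched {len(html)} characters")
        return html


if __name__ == "__main__":
    asyncio.run(main())
```

Reusing a single ClientSession for all requests lets aiohttp share its connection pool instead of opening a new connection per request.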
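3.2 Small updates in the 3.9+ releases (2024) worth knowing
- HTTP/2: aiohttp's client speaks HTTP/1.1 and does not enable HTTP/2, whether or not the h2 dependency is installed; if HTTP/2 is required, httpx with h2 installed is an alternative
- Optimized concurrent caching of DNS resolution
- More fine-grained connection reuse and timeout control
- Built-in compatibility with httpx.ResponseFormat is sometimes cited as easing migration of old code; that name is not part of the documented aiohttp API, so verify it against the changelog before relying on it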
4. Hands-on: high-performance batch URL crawling
Here is a complete script skeleton with exception handling, concurrency control, and data parsing!
4.1 Complete code
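The following is a sketch of such a skeleton, not a definitive implementation. It assumes aiohttp is installed; HOST_LIMIT, TOTAL_LIMIT, the 15-second timeout, the placeholder URLs, and the regex-based parse_title helper are all illustrative choices.

```python
import asyncio
import re

import aiohttp

HOST_LIMIT = 5    # max simultaneous connections per host (assumed value)
TOTAL_LIMIT = 20  # max simultaneous connections overall (assumed value)
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def parse_title(html):
    """Data parsing step: extract the page title, or None if absent."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


async def fetch(session, url):
    """Fetch one URL and return a result dict instead of raising."""
    try:
        async with session.get(url) as response:
            response.raise_for_status()
            html = await response.text()
    except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
        return {"url": url, "ok": False, "error": repr(exc)}
    return {"url": url, "ok": True, "title": parse_title(html)}


async def crawl(urls):
    # The connector enforces the concurrency limits for the whole session.
    connector = aiohttp.TCPConnector(limit=TOTAL_LIMIT, limit_per_host=HOST_LIMIT)
    timeout = aiohttp.ClientTimeout(total=15)
    async with aiohttp.ClientSession(connector=connector, timeout=timeout) as session:
        tasks = [fetch(session, url) for url in urls]
        # Unexpected errors are returned in place rather than aborting the batch.
        return await asyncio.gather(*tasks, return_exceptions=True)


if __name__ == "__main__":
    urls = [f"https://example.com/?page={i}" for i in range(1, 11)]  # placeholders
    for result in asyncio.run(crawl(urls)):
        print(result)
```

Handling network errors inside fetch keeps each result self-describing, while return_exceptions=True acts as a safety net for anything fetch does not catch.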
Tips:
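asyncio.gather() returns results in the same order as the awaitables passed in. With return_exceptions=True, a task that fails (raises an exception) does not interrupt the batch: its exception object is returned in that task's slot. With the default return_exceptions=False, the first exception propagates to the caller.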
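The two behaviours of asyncio.gather() can be compared directly. In this standard-library sketch, job is a made-up coroutine that fails for one particular input.

```python
import asyncio


async def job(n):
    await asyncio.sleep(0.01)  # stand-in for a network request
    if n == 2:
        raise ValueError(f"job {n} failed")
    return n * 10


async def collect_all():
    # Failures come back as exception objects in their original position.
    return await asyncio.gather(*(job(n) for n in range(4)), return_exceptions=True)


async def collect_default():
    # Default behaviour: the first exception propagates to the caller.
    try:
        return await asyncio.gather(job(1), job(2), job(3))
    except ValueError as exc:
        return f"batch aborted: {exc}"


results = asyncio.run(collect_all())
aborted = asyncio.run(collect_default())
print(results)
print(aborted)
```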
5. Best practices and pitfalls to avoid
5.1 Preventing IP blocking is the first priority
In addition to limit_per_host in the code above, you can also add:
- Rate limiting: use the aiolimiter library to cap the number of requests per second
- Random User-Agent: use the fake_useragent_async library
- Random delay: add await asyncio.sleep(random.uniform(0.1, 0.5)) before await fetch
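The random delay and User-Agent rotation can be sketched with the standard library alone. Everything here is illustrative: polite_fetch is a hypothetical wrapper, USER_AGENTS is a tiny hand-written pool rather than the output of any library, and demo_fetch stands in for a real request so the example runs offline.

```python
import asyncio
import random

USER_AGENTS = [  # small illustrative pool (assumed values)
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]


async def polite_fetch(fetch, url, min_delay=0.1, max_delay=0.5):
    """Sleep a random interval, then call fetch(url, headers) with a random User-Agent."""
    await asyncio.sleep(random.uniform(min_delay, max_delay))
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return await fetch(url, headers)


async def demo_fetch(url, headers):
    """Stand-in for a real HTTP call; it only reports what it was given."""
    return f"{url} fetched as {headers['User-Agent']}"


print(asyncio.run(polite_fetch(demo_fetch, "https://example.com")))
```

With aiohttp, the fetch argument would be a coroutine that passes headers on to session.get(url, headers=headers).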
5.2 Don’t mix synchronous blocking code
If you call synchronous blocking functions such as requests.get(), time.sleep(), or open() inside a coroutine, the event loop stalls completely and the benefit of going asynchronous is lost!
- Switch to the asynchronous equivalents: requests → aiohttp/httpx, time.sleep → asyncio.sleep, open → aiofiles
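The cost of one blocking call is easy to measure. This sketch runs three tasks concurrently, first with time.sleep and then with asyncio.sleep; the 0.2-second delay and the helper names are arbitrary choices for the demonstration.

```python
import asyncio
import time


async def blocking_task():
    time.sleep(0.2)  # blocks the whole event loop; nothing else can run


async def friendly_task():
    await asyncio.sleep(0.2)  # yields to the event loop while waiting


async def timed(task_factory, count=3):
    """Run `count` tasks concurrently and return the elapsed wall time."""
    start = time.perf_counter()
    await asyncio.gather(*(task_factory() for _ in range(count)))
    return time.perf_counter() - start


blocking_elapsed = asyncio.run(timed(blocking_task))  # waits add up, one after another
friendly_elapsed = asyncio.run(timed(friendly_task))  # waits overlap
print(f"blocking: {blocking_elapsed:.2f}s, non-blocking: {friendly_elapsed:.2f}s")
```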
5.3 Is asynchronous parsing also important?
If your data parsing is heavy (for example, long texts of tens of thousands of words, or a large number of regular-expression matches), you can use:
- asyncio.to_thread(): hands synchronous parsing to a worker thread (built in since Python 3.9)
- A faster parsing library such as selectolax: it is synchronous, but reportedly more than 10 times faster than bs4, so for small to medium data volumes you can simply call it directly
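Handing a synchronous parser to a worker thread could look like this. parse_heavy is a made-up word counter standing in for real parsing work, and the sample text is synthetic. How much this helps depends on the parser: CPU-bound code that holds the GIL still competes with the event loop, so treat this as a sketch of the call pattern rather than a guaranteed speed-up.

```python
import asyncio
import re

WORD_RE = re.compile(r"\w+")


def parse_heavy(text):
    """Synchronous parsing step: count the words in a long text."""
    return len(WORD_RE.findall(text))


async def main():
    text = "async crawler " * 20_000  # synthetic long document
    # Run the synchronous parser in a worker thread (Python 3.9+).
    return await asyncio.to_thread(parse_heavy, text)


word_count = asyncio.run(main())
print(word_count)
```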
6. Quick performance comparison (10 URLs with 1 second delay test)
The comparison uses simplified synchronous and multi-threaded versions of the script above as baselines.
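A comparison of this kind can be reproduced offline with simulated latency. The sketch below is not the original benchmark: it replaces real requests with sleeps, uses a 0.1-second delay instead of 1 second to keep the run short, and prints whatever timings your machine produces.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.1   # simulated per-request latency in seconds (assumed; the heading uses 1 s)
N_URLS = 10


def fake_request_sync(_):
    time.sleep(DELAY)


async def fake_request_async(_):
    await asyncio.sleep(DELAY)


def run_sync():
    for i in range(N_URLS):
        fake_request_sync(i)


def run_threads():
    with ThreadPoolExecutor(max_workers=N_URLS) as pool:
        list(pool.map(fake_request_sync, range(N_URLS)))


async def run_async():
    await asyncio.gather(*(fake_request_async(i) for i in range(N_URLS)))


def measure(fn):
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


timings = {
    "sync": measure(run_sync),
    "threads": measure(run_threads),
    "async": measure(lambda: asyncio.run(run_async())),
}
for name, seconds in timings.items():
    print(f"{name:8s} {seconds:.2f}s")
```

The synchronous version pays the delay once per URL, while the threaded and asynchronous versions overlap the waits, so their totals stay close to a single delay.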
NOTE:
Adjust HOST_LIMIT according to the target website's robots.txt or its actual anti-crawling measures; bigger is not better!
7. Summary
In 2024, Python coroutine-based asynchronous crawlers are an entry-level yet highly efficient choice. You do not need to understand the complex underlying layers: just remember to use async def to define tasks, await to suspend and wait, and asyncio.run() to start them. Combine that with aiohttp's connection pool and exception handling, and you can write a crawler that is dozens of times faster than a synchronous one on IO-bound workloads!

