title: 🛠️ Python Crawler in Practice: Douban Movie Top250, a Hands-On Manual for Synchronous and Asynchronous Crawling
description: In web crawling, efficiency is everything. When you need to collect tens of thousands of records, the waiting time of single-threaded synchronous crawling is unacceptable. This article uses working code to take you from the most basic synchronous logic to a fully concurrent coroutine solution.
https://github.com/Annyfee/spider-js-reverse
https://github.com/Annyfee/spider-defense-bypass/tree/main
Foreword: In the world of crawlers, efficiency is king. When the data volume jumps from a few hundred records to tens of thousands, the stop-and-go rhythm of single-threaded synchronous crawling becomes unbearable. Rather than throwing a pile of theory at you, this article starts from the simplest synchronous crawler, upgrades it step by step to multi-threading, and finally pushes performance to the limit by using coroutines to achieve very high throughput. By the end you will not only be able to write a fast crawler, but also understand the core ideas of concurrent programming.
🛠️ 1. Core tool stack
This case study is all about pragmatism; each library directly addresses a crawler pain point:
- Data collection: requests (synchronous HTTP library, the first choice for getting started) / aiohttp (asynchronous HTTP library, the fastest option)
- Data parsing: lxml.etree → use XPath to locate page content precisely, which is much cleaner and more efficient than regular expressions
- Storage: DataRecorder → handles Excel writing, file locking and table headers automatically, and is safe to use from multiple threads or processes
- Data alignment: itertools.zip_longest → keeps all subsequent rows from shifting out of line when a movie has no short review
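To make the data-alignment point concrete, here is a minimal sketch comparing zip with itertools.zip_longest. The three lists are made-up sample data, not real Douban results.

```python
from itertools import zip_longest

# Hypothetical per-page results: one movie has no short review,
# so the quotes list is one item shorter than the other two.
titles = ["Movie A", "Movie B", "Movie C"]
ratings = ["9.7", "9.6", "9.5"]
quotes = ["Quote for A", "Quote for B"]

# zip() stops at the shortest list and silently drops "Movie C".
rows_zip = list(zip(titles, ratings, quotes))

# zip_longest() pads the short list, so every movie still gets a row.
rows = list(zip_longest(titles, ratings, quotes, fillvalue=""))

for row in rows:
    print(row)
```

Note that zip_longest only pads at the end of the shorter list. If the movie without a review sits in the middle of the page, extracting the fields movie by movie (one XPath query per item node) is the safer way to keep columns aligned.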
🧠 2. Core logic of task splitting
The difference in mindset between synchronous and concurrent code leads to a very different code structure:
- Synchronous flow: chain "send request → parse data → write file" into a single line. Only one thing happens at a time, each step starts only after the previous one finishes, and the program spends most of its time waiting on I/O.
- Concurrent/asynchronous flow: wrap the "request + parse" of each page into an independent, atomic task that is only responsible for returning that page's list of rows. When it runs and how it is scheduled is left entirely to the thread pool or event loop. The main thread only does the final merge and the write to disk, which keeps the code clean and efficient.
To put it simply: synchronous code is one person moving bricks one at a time, concurrency is a group of people moving bricks at the same time, and a coroutine is a single person who switches between piles quickly instead of standing idle while waiting.
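The task split described above can be sketched roughly as follows. The function names (build_page_urls, crawl_page, merge_pages) are hypothetical, and the listing URL with its start=0, 25, ..., 225 paging is an assumption about how the Top250 pages are addressed. The fetch and parse steps are stand-ins so the sketch runs offline.

```python
BASE_URL = "https://movie.douban.com/top250"  # assumed listing URL


def build_page_urls(total=250, page_size=25):
    """One URL per listing page: start=0, 25, ..., 225 (assumed paging scheme)."""
    return [f"{BASE_URL}?start={start}" for start in range(0, total, page_size)]


def crawl_page(url, fetch, parse):
    """Atomic task: request + parse for one page; returns only that page's rows."""
    return parse(fetch(url))


def merge_pages(pages):
    """Main-thread step: flatten the per-page lists into one list for a single write."""
    return [row for page in pages for row in page]


# Stand-ins for the real network and parsing code.
def fake_fetch(url):
    return f"<html for {url}>"


def fake_parse(html):
    return [(html, "row 1"), (html, "row 2")]


# A scheduler (plain loop, thread pool or event loop) decides when each task runs;
# here a simple list comprehension plays that role.
pages = [crawl_page(url, fake_fetch, fake_parse) for url in build_page_urls()]
all_rows = merge_pages(pages)
print(len(all_rows))
```

Because crawl_page touches no shared state and only returns data, the same function can be handed to a loop, a thread pool or an event loop without changes.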
🚀 3. Practical code implementation of the whole solution
Preparation: one-click environment configuration
Open a terminal and install all the dependencies from section 1 (requests, aiohttp, lxml and DataRecorder) in a single pip command.
1. Synchronous crawling: the starting point of the crawler
The logic is straightforward: every step blocks while waiting for the network or the disk. It is nevertheless the easiest version to understand and debug, which makes it ideal for getting the overall flow clear.
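A minimal sketch of what such a synchronous crawler might look like is shown below. It is not the author's original code: the XPath expressions, the User-Agent header, the URL pattern and the one-second delay are all assumptions about the target page and should be checked against the live markup. The third-party imports are done inside the functions only so the sketch can be loaded without them installed.

```python
import time

HEADERS = {"User-Agent": "Mozilla/5.0"}  # assumed; a browser-like UA is usually needed


def parse_page(html):
    """Extract (title, rating, quote) per movie. The XPaths are assumptions."""
    from lxml import etree  # third-party

    rows = []
    for item in etree.HTML(html).xpath('//div[@class="item"]'):
        title = item.xpath('.//span[@class="title"]/text()')
        rating = item.xpath('.//span[@class="rating_num"]/text()')
        quote = item.xpath('.//span[@class="inq"]/text()')
        # Querying per item keeps the columns aligned even if a field is missing.
        rows.append((
            title[0] if title else "",
            rating[0] if rating else "",
            quote[0] if quote else "",
        ))
    return rows


def crawl_sync(urls, fetch, parse=parse_page, delay=1.0):
    """Fetch and parse pages strictly one after another."""
    rows = []
    for url in urls:
        rows.extend(parse(fetch(url)))  # blocks until this page is downloaded
        time.sleep(delay)               # polite pause between requests
    return rows


def main():
    """Entry point for a real run; call main() yourself to hit the network."""
    import requests  # third-party

    urls = [f"https://movie.douban.com/top250?start={s}" for s in range(0, 250, 25)]
    with requests.Session() as session:  # one session reuses the TCP connection
        session.headers.update(HEADERS)

        def fetch(url):
            resp = session.get(url, timeout=10)
            resp.raise_for_status()
            return resp.text

        rows = crawl_sync(urls, fetch)
    print(f"collected {len(rows)} rows")
```

The collected rows would then be handed to the storage layer in one batch. Passing fetch and parse as arguments keeps the loop easy to test with fake pages.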
2. Multi-threading solution: the first choice for smooth speed increase
A thread pool is the quickest way to upgrade synchronous code. Its resource overhead is lower than that of multiple processes, and you can keep using the requests library you already know.
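As a rough illustration of the thread-pool approach, the sketch below uses the standard library's ThreadPoolExecutor. The crawl_page argument is a hypothetical "request + parse" function for one page; here a stand-in that sleeps briefly imitates network latency so the example runs offline. The URL pattern is an assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def crawl_threaded(urls, crawl_page, max_workers=5):
    """Run crawl_page(url) for every URL in a thread pool and merge the rows."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as urls, whatever the finish order.
        pages = list(pool.map(crawl_page, urls))
    return [row for page in pages for row in page]


def fake_crawl_page(url):
    """Stand-in for 'request + parse'; sleeps to imitate waiting on the network."""
    time.sleep(0.2)
    return [(url, "row")]


urls = [f"https://movie.douban.com/top250?start={s}" for s in range(0, 250, 25)]
started = time.perf_counter()
rows = crawl_threaded(urls, fake_crawl_page, max_workers=5)
elapsed = time.perf_counter() - started
print(f"{len(rows)} rows in {elapsed:.2f}s")
```

In a real run, fake_crawl_page would be replaced by a function that calls requests and the parser. max_workers doubles as the concurrency limit, so keeping it small also keeps the request rate modest.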
3. Coroutine solution: ultimate performance of single thread
Coroutines are the best fit for I/O-bound tasks: a single thread can comfortably schedule hundreds or thousands of requests with very low CPU and memory overhead, but they must be paired with an asynchronous library such as aiohttp.
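The coroutine version might be sketched as follows, under the same assumptions as before (URL pattern, User-Agent header, hypothetical function names). An asyncio.Semaphore caps the number of requests in flight. The aiohttp entry point is defined but not called, and a stand-in fetcher drives the offline demo.

```python
import asyncio


async def crawl_async(urls, fetch, parse, limit=5):
    """Crawl all URLs concurrently with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def crawl_page(url):
        async with semaphore:        # wait here if `limit` requests are running
            html = await fetch(url)  # the event loop switches tasks while waiting
        return parse(html)

    # gather() returns results in the same order as the input URLs.
    pages = await asyncio.gather(*(crawl_page(url) for url in urls))
    return [row for page in pages for row in page]


async def crawl_with_aiohttp(urls, parse, limit=5):
    """Real-network variant; requires the third-party aiohttp package."""
    import aiohttp

    headers = {"User-Agent": "Mozilla/5.0"}  # assumed header
    async with aiohttp.ClientSession(headers=headers) as session:  # shared session

        async def fetch(url):
            async with session.get(url) as resp:
                resp.raise_for_status()
                return await resp.text()

        return await crawl_async(urls, fetch, parse, limit)


async def fake_fetch(url):
    """Stand-in fetcher: sleeps instead of touching the network."""
    await asyncio.sleep(0.1)
    return f"<html>{url}</html>"


demo_urls = [f"https://movie.douban.com/top250?start={s}" for s in range(0, 250, 25)]
rows = asyncio.run(crawl_async(demo_urls, fake_fetch, lambda html: [(html,)], limit=5))
print(len(rows))
```

For a real run, something like asyncio.run(crawl_with_aiohttp(demo_urls, parse_page)) would be used, where parse_page is whatever parser turns one page of HTML into rows.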
📊 4. Selection and optimization suggestions
1. Solution selection guide
2. Pitfall avoidance and optimization suggestions
- Control concurrency: sites such as Douban limit request frequency. Keep the number of threads or concurrent coroutines between 5 and 10; going higher makes it easy to get blocked.
- Write to disk in batches: do not call record() for every item inside the loop. Writing once after all the data is ready can cut disk I/O by more than 90%.
- Reuse session objects: requests.Session() and aiohttp.ClientSession() reuse the underlying TCP connection, which can speed up requests by about 30%. Do not create a new one for every request.
- Handle exceptions properly: network fluctuations and anti-crawling interception can make a single page fail, so wrap each page in try/except as a safety net. Otherwise one bad page can ruin the whole run.
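The exception-handling advice can be illustrated with a small wrapper like the one below. safe_crawl is a hypothetical helper, and the retry count and linear backoff are arbitrary choices for the sketch. It wraps any per-page task so that one failing page yields an empty list instead of aborting the whole run.

```python
import logging
import time

logger = logging.getLogger("crawler")


def safe_crawl(crawl_page, url, retries=3, backoff=1.0):
    """Call crawl_page(url); retry on failure and return [] if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            return crawl_page(url)
        except Exception as exc:  # deliberately broad: network errors, blocked pages, parse errors
            logger.warning("attempt %d/%d failed for %s: %s", attempt, retries, url, exc)
            if attempt < retries:
                time.sleep(backoff * attempt)  # wait a little longer after each failure
    return []


# Demo with a stand-in task that fails twice before succeeding.
attempts = {"count": 0}


def flaky_page(url):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated network hiccup")
    return [(url, "ok")]


print(safe_crawl(flaky_page, "page-1", retries=3, backoff=0))
```

Since the wrapper has the same "URL in, list of rows out" shape as the task it wraps, it can be passed to a thread pool unchanged, for example via functools.partial(safe_crawl, crawl_page).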

