Python 3 crawler in practice: crawling an entire static website
1. Preface
Don't want to learn a complex parsing library or framework? Plain Python can handle full-site crawling too!
If you are new to crawlers and get a headache at the sight of the long configuration lists for BeautifulSoup and Scrapy, don't panic. Today we will use requests plus the built-in re module to build a complete project against a public practice movie site: grab links from the list pages, extract fields from the details pages, store them as JSON, and finally add multiple processes to speed things up. The target site's anti-crawling measures are very loose, which makes it well suited to beginners.
By working through this case you will learn:
- How to wrap lightweight HTTP requests
- Fast regex parsing of static HTML
- The two-level navigation logic of "list page → details page"
- Multi-process parallel crawling for a clear speed-up
- Common pitfalls in entry-level crawlers and how to fix them
2. Technical preparation
First set up a minimal environment; almost nothing extra needs to be installed:
- Make sure you have Python 3.6+ locally (needed for f-strings and the conventional multiprocessing Pool usage)
- Install the one core dependency, requests, with pip: pip install requests
💡 The libraries used are all lightweight or built in:
- requests: an HTTP library that is far easier to write than urllib
- re: Python's built-in regular expression module, sufficient for parsing simple static HTML
- logging: records crawling status, essential for debugging
- json/os: data storage and file management
- multiprocessing: multi-process speed-up (you can practice with a single process first, but it is worth learning because it stays useful)
3. Target website analysis
Practice site address: https://ssr1.scrape.center/. Once you open it you will see a very clear structure; it is practically a template for crawler practice.
3.1 Page structure
- List page: the URL pattern is fixed as /page/{page number}, from 1 to 10 (matching TOTAL_PAGE in the code), with 10 movies per page
- Details page: the title of each movie card is an <a> tag with class="name"; its href is a relative path and needs to be joined with the root domain BASE_URL
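The URL rules above can be sketched with the standard library as follows. BASE_URL and TOTAL_PAGE are the names used in the text; the helper names index_url and detail_url are assumptions made for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://ssr1.scrape.center"
TOTAL_PAGE = 10


def index_url(page):
    """Build the list page URL for a page number between 1 and TOTAL_PAGE."""
    return f"{BASE_URL}/page/{page}"


def detail_url(href):
    """Join a relative href taken from a movie card with the root domain."""
    return urljoin(BASE_URL, href)


# All list page URLs, from page 1 to page TOTAL_PAGE.
index_urls = [index_url(page) for page in range(1, TOTAL_PAGE + 1)]
```

Using urljoin() rather than plain string concatenation means that an href that already happens to be absolute is kept unchanged.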
3.2 Fields to be captured
The details page contains this public information:
- Cover image URL
- Movie title
- Movie categories (array of tags)
- Release date
- Rating
- Plot synopsis
4. Complete code with pitfall fixes
The final working version fixes several pitfalls that beginners tend to hit (escape warnings, hangs, illegal characters in file names, and so on) and has more standardized logic. You can copy and run it directly, and understand it by reading the comments as you go.
5. Breaking down the core flow
To make it easier to get started, we split the code above into 3 layers. Each layer does only one thing and has a clear responsibility.
5.1 A unified request entry point
Whether it is a list page or a details page, every request goes through the same scrape_page() function. The benefits of doing this are:
- No need to write try...except and status code checks over and over
- Timeouts and exception logs are managed in one place. If you later want to add request headers or proxies, you only need to change one spot.
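A minimal sketch of such an entry point might look like this. Only the name scrape_page() comes from the text; the 10-second timeout, the log messages, and the get parameter (added here purely so the function can be tried without network access) are assumptions.

```python
import logging

import requests

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s - %(levelname)s: %(message)s")


def scrape_page(url, timeout=10, get=requests.get):
    """Fetch a page and return its HTML text, or None on any failure.

    get defaults to requests.get; it is a parameter only so that a fake
    can be swapped in for testing (an assumption of this sketch).
    """
    logging.info("scraping %s", url)
    try:
        response = get(url, timeout=timeout)
        if response.status_code == 200:
            return response.text
        logging.error("got status code %s while scraping %s",
                      response.status_code, url)
    except requests.RequestException:
        # Covers connection errors, timeouts and other requests failures.
        logging.error("error occurred while scraping %s", url, exc_info=True)
    return None
```

Because failures come back as None instead of an exception, callers only need a single check before parsing.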
5.2 Layered parsing
- List page parsing (parse_index): use a regex to match every <a> tag with class="name", extract its href, then use urljoin() to build the complete details page URL. It returns a generator instead of a list, so URLs can be parsed while crawling, which saves memory.
- Details page parsing (parse_detail): write a regex for each field you want to capture, and add the re.S flag so that newlines do not interfere with matching. If a field is not found, substitute a default value so that the whole program does not crash because one page is missing data.
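The two parsers could be sketched roughly as below. The function names parse_index and parse_detail come from the text, but every regex here is an assumption about the page markup (for example, that href appears before class="name" inside the tag, and that the score sits in a <p> whose class starts with "score"), so check them against the real HTML before relying on them. The helpers first_match and to_float are hypothetical.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://ssr1.scrape.center"

# Assumed markup: <a ... href="/detail/1" class="name">
INDEX_PATTERN = re.compile(r'<a\s[^>]*?href="([^"]*)"[^>]*?class="name"', re.S)


def parse_index(html):
    """Yield the absolute details page URL of every movie card."""
    for href in INDEX_PATTERN.findall(html):
        yield urljoin(BASE_URL, href)


def first_match(pattern, html, default=None):
    """Return the first capture group, stripped, or default if absent."""
    match = re.search(pattern, html, re.S)
    return match.group(1).strip() if match else default


def to_float(text):
    """Convert text to float, falling back to None for missing or bad data."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def parse_detail(html):
    """Extract the movie fields; a missing field gets a default value."""
    categories = re.findall(
        r'<button[^>]*class="[^"]*category[^"]*"[^>]*>\s*<span>(.*?)</span>',
        html, re.S)
    return {
        "cover": first_match(r'<img[^>]*src="([^"]*)"[^>]*class="cover"', html),
        "name": first_match(r"<h2[^>]*>(.*?)</h2>", html),
        "categories": [c.strip() for c in categories],
        "published_at": first_match(r"(\d{4}-\d{2}-\d{2})", html),
        "score": to_float(
            first_match(r'<p[^>]*class="score[^"]*"[^>]*>(.*?)</p>', html)),
        "drama": first_match(
            r'<div[^>]*class="drama"[^>]*>.*?<p[^>]*>(.*?)</p>', html),
    }
```

Using [^>]* inside a tag instead of .*? keeps each match within a single tag, so a neighbouring link with a different class cannot be picked up by accident.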
5.3 Data storage + multi-process
- Storage: use the movie name as the file name, first using a regex to replace illegal characters such as colons and slashes with underscores. The JSON files are all kept in the results/ folder for later inspection.
- Multi-process: use multiprocessing.Pool() to distribute the 10 pages across multiple processes that run simultaneously, which is much faster than a single process fetching them serially.
⚠️ NOTE: pool.map() blocks the main process until all subtasks finish; finally, remember to call pool.close() and pool.join().
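The storage and multi-process steps might be sketched like this. The results folder and Pool usage follow the text; the helper names safe_filename, save_data and crawl_pages, and the exact set of characters treated as illegal, are assumptions for illustration.

```python
import json
import multiprocessing
import os
import re

RESULTS_DIR = "results"


def safe_filename(name):
    """Replace characters that are illegal in file names with underscores."""
    return re.sub(r'[\\/:*?"<>|]', "_", name)


def save_data(data, results_dir=RESULTS_DIR):
    """Write one movie record to <results_dir>/<movie name>.json."""
    os.makedirs(results_dir, exist_ok=True)
    path = os.path.join(results_dir, f"{safe_filename(data['name'])}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    return path


def crawl_pages(worker, pages, processes=None):
    """Run worker(page) for every page in a process pool, collect results."""
    pool = multiprocessing.Pool(processes)
    try:
        results = pool.map(worker, pages)  # blocks until every page is done
    finally:
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the worker processes to exit
    return results
```

The worker has to be a module-level function so that it can be pickled, and on platforms that spawn new processes the call to crawl_pages() belongs under an if __name__ == "__main__": guard.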
6. Advanced optimization suggestions
This version can be used as it is, but if you want to crawl more complex websites in the future, consider the following small optimizations:
- Add reasonable request headers: most websites check User-Agent, so set it in a headers dict passed to requests.get().
- Random delay: call time.sleep() for a random 1 to 3 seconds between requests to lower the request frequency and avoid triggering anti-crawling measures.
- Proxy support: if your IP gets blocked, you can introduce a proxy pool and pass the proxies parameter to requests.get().
- Resume from a breakpoint: use a simple database or a JSON file to record the URLs or movie names that have been crawled successfully, so the crawled parts can be skipped when restarting after an interruption.
- Replace the parsing library: when the page structure becomes complex and the regexes become hard to maintain, consider switching to BeautifulSoup4 or lxml to make the code more readable.
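The first four suggestions can be sketched together as below. The User-Agent string, the proxy address in the docstring, the done.json-style state file, and all function names are placeholders chosen for illustration, not values from the original project.

```python
import json
import os
import random
import time

import requests

# Hypothetical User-Agent string; substitute one that describes your client.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-crawler/0.1)"}


def random_delay(low=1.0, high=3.0):
    """Pick a random pause length in seconds between low and high."""
    return random.uniform(low, high)


def polite_get(url, proxies=None, timeout=10):
    """Sleep for a random interval, then fetch url with headers and proxies.

    proxies uses the requests format, for example
    {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"}
    (a placeholder address, not a working proxy).
    """
    time.sleep(random_delay())
    return requests.get(url, headers=HEADERS, proxies=proxies, timeout=timeout)


def load_done(path):
    """Load the set of URLs already crawled, or an empty set on first run."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return set(json.load(f))


def mark_done(path, done, url):
    """Add url to the done set and persist the set as a JSON list."""
    done.add(url)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sorted(done), f)
```

In the crawl loop you would skip any URL already in the set returned by load_done(), and call mark_done() only after that movie's JSON file has been written.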
7. Summary
Today's case covers the core flow of a static crawler: unified requests → layered parsing → structured storage, together with multi-process speed-up and fixes for 5 pitfalls that are easy to hit in practice. The whole project uses only Python's built-in libraries plus requests, which makes it well suited to beginners.
You can try replacing the fields in the code with content you are interested in, or practice on a public practice site with a similar structure (such as some book catalog or news list sites), and really put the ideas you have learned to use.
Finally, one more reminder: crawlers are useful, but please do not crawl unauthorized data, and do not put excessive load on the target site's servers~

