title: Practical crawler tutorial: full-site crawling of static websites
description: Python3 crawler practical tutorial: static website full-site crawling

Python3 crawler practical tutorial: static website full-site crawling

1. Preface

Don't want to learn complex parsing libraries or frameworks? Plain Python can handle full-site crawling too!

If you are new to crawlers and get a headache from the long configuration lists of BeautifulSoup and Scrapy, don't panic. Today we will use requests plus the built-in re module to build a complete project against a public practice movie site: grab links from the list pages, extract fields from the detail pages, store them as JSON, and finally add multiple processes to speed things up. The target site's anti-crawling measures are very lenient, which makes it well suited to beginners.

Through this case you will master:

  • Wrapping lightweight HTTP requests in one function
  • Fast regex parsing of static HTML
  • The two-level "list page → detail page" navigation logic
  • Multi-process parallel crawling for a noticeable speed-up
  • Common pitfalls in entry-level crawlers and how to fix them

2. Technical preparation

First set up a minimal environment; almost nothing extra needs to be installed:

  1. Make sure Python 3.6+ is installed locally (required for f-strings; the conventional multiprocessing.Pool usage works there too)
  2. Install the core dependency:
pip install requests

💡 The libraries used are either lightweight or built in:

  • requests: an HTTP library that is far more convenient to write than urllib
  • re: Python's built-in regular expression module, sufficient for parsing simple static HTML
  • logging: records crawling status, essential for debugging
  • json/os: data storage and file management
  • multiprocessing: multi-process speed-up (you can practice with a single process first, but it is worth learning)

3. Target website analysis

Practice site: https://ssr1.scrape.center/ Once you open it, the structure is very clear; it is practically a template for crawler practice.

3.1 Page structure

  • List page: the URL pattern is fixed as /page/{page number}, from 1 to 10 (matching TOTAL_PAGE in the code below), with 10 movies per page
  • Detail page: the title of each movie card is an <a> tag with class="name"; its href is a relative path and has to be joined with the root domain BASE_URL
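
The two URL rules above can be sketched in a few lines. This is only an illustration: BASE_URL and TOTAL_PAGE match the constants used in the full code, while sample_href is a hypothetical relative path standing in for whatever the href on a movie card actually contains.

```python
from urllib.parse import urljoin

BASE_URL = 'https://ssr1.scrape.center'
TOTAL_PAGE = 10

# List pages follow the fixed /page/{n} pattern, n = 1..TOTAL_PAGE.
index_urls = [f'{BASE_URL}/page/{page}' for page in range(1, TOTAL_PAGE + 1)]

# Hypothetical relative href, as it might appear on an <a class="name"> tag.
sample_href = '/detail/1'
# urljoin resolves the relative path against the root domain.
detail_url = urljoin(BASE_URL, sample_href)

print(index_urls[0])
print(index_urls[-1])
print(detail_url)
```

Using urljoin rather than plain string concatenation means the join stays correct whether or not the href starts with a slash or is already an absolute URL.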

3.2 Fields to be captured

The details page contains this public information:

  • Cover image URL
  • Movie title
  • Movie categories (array of tags)
  • Release date
  • Rating
  • Plot synopsis

4. Complete code with pitfall fixes

The final working version is given below. It fixes several pitfalls that beginners are likely to hit (escape warnings, requests that hang forever, illegal characters in file names, and so on), and the logic is more standardized. You can copy and run it directly, and read the comments to understand it.

import json
from os import makedirs
import requests
import logging
import re
from urllib.parse import urljoin
import multiprocessing

# Logging setup: record time, level and message so the run is easy to follow
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s: %(message)s')

# Core constants
BASE_URL = 'https://ssr1.scrape.center'
TOTAL_PAGE = 10
RESULTS_DIR = 'results'  # folder that holds the JSON files
# exist_ok=True avoids a race when several processes import this module at once
makedirs(RESULTS_DIR, exist_ok=True)


def scrape_page(url):
    """
    Generic fetch function: handles the request, status code and exceptions in one place
    :param url: target URL
    :return: HTML text on success, None on failure
    """
    logging.info('Scraping %s...', url)
    try:
        # Pitfall 1: set a timeout (10 seconds) so a request that never returns cannot hang the program
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            return response.text
        logging.error('Invalid status code %d, URL: %s', response.status_code, url)
    except requests.RequestException:
        logging.error('Request failed! URL: %s', url, exc_info=True)
    return None

def scrape_index(page):
    """
    Build the URL of a list page and fetch it
    :param page: page number
    :return: list page HTML
    """
    index_url = f'{BASE_URL}/page/{page}'
    return scrape_page(index_url)


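def parse_index(html):
    """
    Parse a list page and yield detail page URLs (a generator, so URLs are produced lazily)
    :param html: list page HTML
    :return: generator of detail page URLs
    """
    if not html:
        # Pitfall 2: guard against empty input so the regex match below does not fail
        return  # a bare return ends the generator without yielding anything
    # Pitfall 3: use a raw string (r'') for the regex to avoid escape warnings
    pattern = re.compile(r'<a.*?href="(.*?)".*?class="name">')
    items = re.findall(pattern, html)
    for item in items:
        detail_url = urljoin(BASE_URL, item)  # join with the root domain to resolve the relative path
        logging.info('Found detail page link: %s', detail_url)
        yield detail_url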


def scrape_detail(url):
    """
    Fetch a detail page
    :param url: detail page URL
    :return: detail page HTML
    """
    return scrape_page(url)


def parse_detail(html):
    """
    Parse a detail page and return a dict of movie information
    :param html: detail page HTML
    :return: dict of movie information
    """
    if not html:
        return None  # guard against empty input

    # One regex per field; re.S lets . match newlines as well
    cover_pattern = re.compile(r'class="item.*?<img.*?src="(.*?)".*?class="cover">', re.S)
    name_pattern = re.compile(r'<h2.*?>(.*?)</h2>')
    categories_pattern = re.compile(r'<button.*?category.*?<span>(.*?)</span>.*?</button>', re.S)
    # "上映" is the literal text on the page that follows the release date
    published_at_pattern = re.compile(r'(\d{4}-\d{2}-\d{2})\s?上映')
    drama_pattern = re.compile(r'<div.*?drama.*?>.*?<p.*?>(.*?)</p>', re.S)
    score_pattern = re.compile(r'<p.*?score.*?>(.*?)</p>', re.S)

    # Extract the fields one by one; fall back to a default when a field is missing so the program does not crash
    name_match = re.search(name_pattern, html)
    name = name_match.group(1).strip() if name_match else "Untitled movie"

    cover_match = re.search(cover_pattern, html)
    cover = cover_match.group(1).strip() if cover_match else None

    categories = re.findall(categories_pattern, html)

    pub_match = re.search(published_at_pattern, html)
    published_at = pub_match.group(1) if pub_match else None

    drama_match = re.search(drama_pattern, html)
    drama = drama_match.group(1).strip() if drama_match else None

    score_match = re.search(score_pattern, html)
    score = float(score_match.group(1).strip()) if score_match else None

    return {
        'cover': cover,
        'name': name,
        'categories': categories,
        'published_at': published_at,
        'drama': drama,
        'score': score
    }


def save_data(data):
    """
    Save one movie record to a JSON file
    :param data: dict of movie information
    """
    if not data:
        return
    name = data.get('name')
    # Pitfall 4: replace characters that are illegal in file names (works on both Windows and Linux)
    safe_name = re.sub(r'[\\/:*?"<>|]', '_', name)
    data_path = f'{RESULTS_DIR}/{safe_name}.json'

    # Pitfall 5: use with open so the file is closed automatically and no handle leaks
    try:
        with open(data_path, 'w', encoding='utf-8') as f:
            # ensure_ascii=False keeps Chinese text readable instead of \uxxxx escapes
            json.dump(data, f, ensure_ascii=False, indent=2)
        logging.info('Saved successfully: %s', data_path)
    except Exception as e:
        logging.error('Save failed! File path: %s, error: %s', data_path, e)


def main(page):
    """
    Per-page entry point: chains the whole workflow for one list page
    :param page: page number
    """
    index_html = scrape_index(page)
    detail_urls = parse_index(index_html)
    for detail_url in detail_urls:
        detail_html = scrape_detail(detail_url)
        data = parse_detail(detail_html)
        if data:
            logging.info('Got movie data: %s', data['name'])
            save_data(data)


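if __name__ == '__main__':
    # Crawl in parallel with multiple processes
    # For step-by-step debugging, a plain loop works too: for page in range(1, TOTAL_PAGE + 1): main(page)
    pool = multiprocessing.Pool()
    pages = range(1, TOTAL_PAGE + 1)
    pool.map(main, pages)
    pool.close()  # close the pool so it accepts no new tasks
    pool.join()   # wait for all worker processes to finish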

5. Core workflow breakdown

To make the code above easier to follow, we can split it into 3 layers. Each layer does only one thing and has a clear responsibility.

5.1 A unified request entry point

Both list pages and detail pages go through the same scrape_page() function. The benefits are:

  • No need to repeat try...except and status-code checks
  • Timeouts and exception logging are managed in one place. If you want to add request headers or proxies later, only one place has to change.

5.2 Layered parsing

  • List page parsing (parse_index)
    Use a regex to match every <a> tag with class="name", extract its href, then use urljoin() to build the full detail page URL. The function returns a generator rather than a list, so each detail page can be crawled as soon as its URL is produced, without building a separate URL list.

  • Detail page parsing (parse_detail)
    Write one regex for each field you want to capture, and add the re.S flag so that . also matches newlines and line breaks do not interfere with matching. If a field is not found, fall back to a default value so that the whole program does not crash because one page is missing some data.
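
To see why re.S matters and how the default-value fallback behaves, here is a minimal sketch. The HTML fragment is made up for illustration and is not copied from the real site; the two patterns are the drama and score patterns from parse_detail.

```python
import re

# Hypothetical HTML fragment with the <p> content spread over several lines.
html = '''
<div class="drama">
  <h3>Synopsis</h3>
  <p>
    A short plot summary
    spanning two lines.
  </p>
</div>
'''

pattern_text = r'<div.*?drama.*?>.*?<p.*?>(.*?)</p>'

# Without re.S, '.' does not match newlines, so the pattern cannot get
# from the <div> line to the <p> line and the search finds nothing.
without_flag = re.search(pattern_text, html)

# With re.S, '.' also matches newlines and the multi-line text is captured.
with_flag = re.search(pattern_text, html, re.S)
drama = with_flag.group(1).strip() if with_flag else None

# A field that is absent from the page falls back to a default (None here).
score_match = re.search(r'<p.*?score.*?>(.*?)</p>', html, re.S)
score = float(score_match.group(1).strip()) if score_match else None

print(without_flag)
print(drama)
print(score)
```

Checking the match object before calling group() is what keeps a missing field from raising AttributeError in the middle of a crawl.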

5.3 Data storage + multi-process

  • Storage: use the movie name as the file name, after a regex has replaced illegal characters such as colons and slashes with underscores. The JSON files all go into the results/ folder so they are easy to inspect later.
  • Multi-process: use multiprocessing.Pool() to distribute the 10 list pages across several processes that run at the same time, which is usually much faster than fetching them serially in a single process.

⚠️ NOTE: pool.map() blocks the main process until all subtasks have finished; afterwards, remember to call pool.close() and pool.join().
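
The storage step can be tried in isolation with a sketch like the one below. The helper names safe_filename and save_record are introduced here only for illustration (the full code does the same work inside save_data), the record is made up, and a temporary directory stands in for results/ so nothing is left on disk.

```python
import json
import re
import tempfile
from pathlib import Path


def safe_filename(name):
    """Replace characters that are illegal in file names with underscores."""
    return re.sub(r'[\\/:*?"<>|]', '_', name)


def save_record(record, directory):
    """Write one record to <directory>/<sanitized name>.json and return the path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{safe_filename(record['name'])}.json"
    with open(path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps non-ASCII text readable in the file
        json.dump(record, f, ensure_ascii=False, indent=2)
    return path


# Made-up record for illustration only.
record = {'name': 'Part 1: A/B?', 'score': 9.0}

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_record(record, tmp)
    saved_name = saved_path.name
    with open(saved_path, encoding='utf-8') as f:
        loaded = json.load(f)

print(saved_name)
print(loaded == record)
```

One thing to keep in mind with this naming scheme: two movies whose names sanitize to the same string would overwrite each other's file.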
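
As an alternative to calling close() and join() by hand, a pool can also be used as a context manager. The sketch below is a hedged illustration, not the tutorial's own code: crawl_page is a stand-in for main(page) that does no network I/O, and run_all and the processes=2 default are names chosen here for the example.

```python
import multiprocessing


def crawl_page(page):
    """Stand-in for main(page): returns a label instead of doing network I/O."""
    return f'page {page} done'


def run_all(pages, processes=2):
    """Run crawl_page over all pages in a process pool and return the results in order."""
    # Leaving the with-block terminates the pool. That is safe here because
    # pool.map() has already blocked until every task finished.
    with multiprocessing.Pool(processes=processes) as pool:
        return pool.map(crawl_page, pages)
```

Call it from under the `if __name__ == '__main__':` guard, for example `results = run_all(range(1, TOTAL_PAGE + 1))`, so that worker processes do not re-run the pool setup when they import the module. The worker must be a top-level function, as crawl_page is, because it has to be picklable.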


6. Advanced optimization suggestions

This version works as it is, but if you want to crawl more complex websites later, consider the following small optimizations:

  1. Add reasonable request headers. Many websites check User-Agent; you can add one like this:

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    }
    # In scrape_page, change the call to:
    response = requests.get(url, headers=headers, timeout=10)
  2. Random delay. Sleep for a random 1 to 3 seconds between requests, for example with time.sleep(random.uniform(1, 3)), to lower the request frequency and avoid triggering anti-crawling measures.

  3. Proxy support. If your IP gets blocked, you can introduce a proxy pool; requests.get() accepts a proxies parameter for this.

  4. Resumable crawling. Use a simple database or a JSON file to record the URLs or movie names that have been crawled successfully. When the program restarts after an interruption, it can skip the parts already crawled.

  5. Switch parsing libraries. When the page structure becomes complex and the regexes become hard to maintain, consider switching to BeautifulSoup4 or lxml to make the code more readable.
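
Suggestions 2 and 4 need nothing beyond the standard library, so they can be sketched together. This is a minimal sketch under stated assumptions: the function names polite_sleep, load_done and mark_done and the checkpoint format (a JSON list of URLs) are all choices made for this example, and the checkpoint logic assumes a single process writes the file.

```python
import json
import random
import time
from pathlib import Path


def polite_sleep(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random duration between the two bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


def load_done(path):
    """Return the set of URLs recorded as finished, or an empty set if no checkpoint exists."""
    path = Path(path)
    if not path.exists():
        return set()
    with open(path, encoding='utf-8') as f:
        return set(json.load(f))


def mark_done(path, done, url):
    """Add url to the finished set and rewrite the checkpoint file."""
    done.add(url)
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(sorted(done), f, ensure_ascii=False, indent=2)
```

In the crawl loop this would look roughly like: load the set once with load_done(), skip any detail_url already in it, and after a successful save_data() call mark_done() followed by polite_sleep(). With the multi-process version, several workers writing one checkpoint file would race, so each process would need its own file or some form of locking.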


7. Summary

Today's case covers the core workflow of a static crawler: unified requests → layered parsing → structured storage, together with a multi-process speed-up and fixes for 5 pitfalls that are easy to hit in practice. The whole project uses only Python's standard library plus requests, which makes it well suited to beginners.

You can try replacing the fields in the code with content you are interested in, or try the approach on a public practice site with a similar structure (such as a book catalog site or a news list site), and put the ideas you have learned to real use.

Finally, one more reminder: crawlers are useful, but please do not crawl data you are not authorized to access, and do not put excessive load on the target site's servers.