title: Practical Project 2: Social Media Monitoring
description: Use CrawlSpider rules to discover and crawl an entire site automatically.

Practical Project 2: Social Media Monitoring

Brands, content creators, and marketing teams often need to keep track of public opinion, competitor activity, and trending topics on social platforms. Manually refreshing hundreds or thousands of pages is inefficient, and it is easy to miss key information. Scrapy's built-in CrawlSpider is designed for exactly this kind of multi-page crawl, where links follow regular patterns: the crawler discovers links, follows them, and extracts data automatically.

📂 Stage: Stage 4 - Practical Exercise (Project Development)


1. Core tool: CrawlSpider rule-driven architecture

CrawlSpider inherits from Scrapy's base Spider and adds an automatic link-handling mechanism on top of it. With a plain Spider, we had to construct each Request by hand inside the parse method and specify its callback. With CrawlSpider, we only write rules that describe which links to extract, which callback handles the resulting pages, and whether to keep following the links found on those pages; the framework does the rest. One consequence is that a CrawlSpider must not override parse, because the rule machinery uses that method internally.

Get started quickly: a minimalist but complete monitoring template

The following is a monitoring crawler for a single-domain social platform such as Twitter or Xiaohongshu. The code is short, but it already covers the whole flow: rule definition, link following, and data extraction.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SocialMediaMonitor(CrawlSpider):
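    # Unique spider name, used to start the crawl: scrapy crawl social_media
    name = 'social_media'
    # Restrict crawling to the target domain so the spider does not wander off-site
    allowed_domains = ['example-social.com']
    # Entry pages: trending topics, competitor lists, home-page recommendations, etc.
    start_urls = ['https://example-social.com/trending']
    
    # rules is the heart of CrawlSpider; each Rule describes how one class of links is handled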
    rules = (
        # Rule 1: user profile pages -> parse_user extracts followers, bio, etc. -> keep following links on the page
        Rule(
            LinkExtractor(allow=r'/user/profile/\d{8,10}'),  # user IDs of 8-10 digits
            callback='parse_user',
            follow=True
        ),
        # Rule 2: post detail pages -> parse_post extracts title, body, likes -> do not follow links any further
        Rule(
            LinkExtractor(allow=r'/post/detail/\d{8,10}'),
            callback='parse_post',
            follow=False
        ),
    )
    
    def parse_user(self, response):
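        """Parse a user profile page, keeping only the fields the monitor cares about."""
        yield {
            'user_id': response.url.split('/')[-1],  # take the ID straight from the URL; less fragile than a CSS/XPath selector
            # default='' keeps .strip() from failing on None when the element is missing
            'username': response.css('h1.profile-name::text').get(default='').strip(),
            'followers_count': response.css('div.stats span:nth-child(2)::text').get(),
            'bio': response.css('p.profile-bio::text').get(default='No bio'),  # default value for an empty field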
        }
    
    def parse_post(self, response):
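        """Parse a post detail page, again keeping only the core fields."""
        yield {
            'post_id': response.url.split('/')[-1],
            # default='' keeps .split() and .strip() from failing on None when the element is missing
            'author_id': response.css('div.author-info a::attr(href)').get(default='').split('/')[-1],
            'title': response.css('h2.post-title::text').get(default='').strip(),
            'content': response.css('div.post-body::text').getall(),  # getall() returns every text node of a multi-paragraph body
            'likes_count': response.css('span.like-btn::text').get(),
            'publish_time': response.css('time::attr(datetime)').get(),  # prefer the machine-readable ISO timestamp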
        }
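
The template reads IDs with `response.url.split('/')[-1]`, which works as long as URLs look exactly like the patterns in the rules. If the site ever appends a trailing slash or a query string, that expression returns the wrong piece. The sketch below shows one way to harden it using only the standard library; the helper name `last_path_segment` and the sample URLs are assumptions made for illustration, not part of the original spider.

```python
from urllib.parse import urlsplit


def last_path_segment(url: str) -> str:
    """Return the final non-empty path segment, ignoring query string and fragment."""
    path = urlsplit(url).path.rstrip('/')
    return path.rsplit('/', 1)[-1]


# Hypothetical URL with a trailing slash and a tracking parameter
url = 'https://example-social.com/post/detail/12345678/?from=trending'

print(url.split('/')[-1])       # ?from=trending  (naive approach picks up the query string)
print(last_path_segment(url))   # 12345678
```

Inside the callbacks, `last_path_segment(response.url)` could stand in for the `split('/')[-1]` expressions if the target site's URLs turn out to vary in this way.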

Key details in the template (Guide to avoid pitfalls)

  1. Rule order matters only when a link matches more than one rule: the framework tries the rules in the order they appear in the rules tuple, and the first rule that matches a link handles it. In this template the user and post patterns never overlap, so swapping them changes nothing. If you later add a broader rule (for example, one that allows everything under /user/), place it after the more specific rules, otherwise it will claim their links first.

  2. Choosing the follow parameter (when omitted, it defaults to False if the rule has a callback and True otherwise):

    • follow=True: Suitable for pages that keep producing new target links, such as entry pages and user profiles, so the crawler continues to follow links in depth.
    • follow=False: Suitable for pages that are data endpoints, such as detail pages and comment pages, so the crawl range does not expand pointlessly.
  3. More LinkExtractor filtering options: in addition to matching with an allow regular expression, you can also use:

    • deny: Exclude some unnecessary links (such as help pages, reporting pages).
    • restrict_xpaths / restrict_css: Only search for links in specific areas of the page (such as the "User Card Area" and "Post List Area" of the hot topic page), thus greatly reducing invalid extraction and improving efficiency.
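
To make the interplay of allow, deny, and rule order concrete, here is a simplified standard-library model of how a link is assigned to a callback. It is a sketch for reasoning about patterns, not Scrapy's actual implementation: the function `pick_callback`, the `RULES` list layout, and the `/report` deny pattern are all assumptions. The two allow patterns are the ones from the template.

```python
import re

# (allow pattern, deny pattern or None, callback name), in priority order.
# The '/report' deny pattern is an assumed example of a page worth excluding.
RULES = [
    (r'/user/profile/\d{8,10}', r'/report', 'parse_user'),
    (r'/post/detail/\d{8,10}', None, 'parse_post'),
]


def pick_callback(url, rules=RULES):
    """Return the callback of the first rule whose allow matches and whose deny does not."""
    for allow, deny, callback in rules:
        if deny and re.search(deny, url):
            continue  # deny takes precedence over allow within a rule
        if re.search(allow, url):
            return callback  # first matching rule wins
    return None


base = 'https://example-social.com'
print(pick_callback(base + '/user/profile/12345678'))         # parse_user
print(pick_callback(base + '/post/detail/1234567890'))        # parse_post
print(pick_callback(base + '/user/profile/12345678/report'))  # None (denied)
print(pick_callback(base + '/user/profile/1234567'))          # None (only 7 digits)
# Patterns are searched rather than fully matched, so an 11-digit ID still
# matches: its first 10 digits satisfy \d{8,10}.
print(pick_callback(base + '/user/profile/12345678901'))      # parse_user
```

The last case is worth noting when the ID length is meant as a real constraint: the pattern needs something after the digits, such as an end-of-string anchor, to reject longer IDs.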

2. Project summary

Three core advantages of CrawlSpider

  1. Rule-driven, automatic discovery: There is no need to hand-write link scanning or pagination logic; you can focus on defining rules and extracting data.
  2. Flexible configuration, clear division of labor: Multiple rules can coexist, each with its own callback and follow setting, which makes it straightforward to handle link structures such as "category→product→review→store".
  3. Built-in deduplication to avoid repeated crawling: Requests generated by the rules pass through Scrapy's scheduler, whose duplicate filter drops URLs that have already been requested. The DUPEFILTER_CLASS setting lets you swap in a custom deduplication strategy to further reduce wasted resources.
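
As a starting point for the deduplication settings mentioned above, a settings.py fragment might look like the following. The setting names are standard Scrapy settings; the JOBDIR path is an assumed example, and whether you need persistence at all depends on how the monitor is scheduled.

```python
# settings.py (fragment)

# Scrapy's default request filter: it fingerprints each request and drops repeats.
# Point this at your own class to customize the deduplication strategy.
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

# Log every filtered duplicate instead of only the first one (useful while tuning rules).
DUPEFILTER_DEBUG = True

# Persist scheduler state, including seen-request fingerprints, between runs.
# The directory name here is an assumed example.
JOBDIR = 'crawls/social_media-1'
```

Note that a persisted JOBDIR also means already-seen pages are skipped on the next run, which suits archiving but may not suit a monitor that needs to re-read the same pages for updated counts.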

Suitable actual combat scenarios

In addition to social media monitoring, CrawlSpider is also great for:

  • Crawl all posts in vertical forums
  • Daily hot spot tracking on news aggregation platform
  • Vertical e-commerce competitive product price/inventory monitoring
  • Full site content archive of corporate official website

💡 Remember: CrawlSpider is not a silver bullet, but as a rough estimate it fits more than 90% of multi-page crawls whose URLs follow regular patterns. For those, development can be around 3-5 times faster than with a basic Spider, and later maintenance costs are lower as well.


🔗 Official extended reading