Web page basics

Basics of modern web crawlers: web page structure and parsing techniques

Have you ever wondered what core technology sits behind features such as best-seller roundups on Douyin, price monitoring on e-commerce platforms, and batch citation statistics for academic papers? It is the modern web crawler. The first step in writing a good crawler is to thoroughly understand the "web page cake" you want to bite into: it has three layers, and you need to learn to dig the data you need out of it accurately, whether it is static content or a dynamic page full of JavaScript.

1. Modern three-tier structure of web pages

Today's web pages are long past the era of "just pile up a bunch of static HTML". They are more like applications in which an HTML5 structural skeleton, a CSS3 styling skin, and ES6+ dynamic muscles work closely together. Understand these three layers and you will know where to look for data, and why data that is clearly on the page sometimes cannot be scraped.

1.1 HTML5: Defining "Meaningful Content Framework"

HTML5 does not just tell the browser "put an element here"; it also introduces a set of semantic tags. For crawlers this is great news: we no longer have to guess whether <div id="nav_abc123"> is a navigation bar, because the <nav> tag tells us directly.

<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>电商示例页</title>
</head>
<body>
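    <header><!-- Semantic top bar: navigation, logo, search --></header>
    <main><!-- Semantic main content area: crawlers should look here first --></main>
    <aside><!-- Semantic sidebar: related recommendations, ads, etc. --></aside>
    <footer><!-- Semantic footer: copyright, contact details, etc. --></footer>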
</body>
</html>

Crawler tip: prioritize parsing the content inside <main>; it usually holds the most valuable data, such as product lists and article text.
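
As a rough illustration of that tip, the sketch below collects only the text found inside <main> using Python's standard-library html.parser. The class name MainTextExtractor, the helper extract_main_text, and the sample markup are made up for this example; a real page may nest content differently or omit <main> altogether.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect the text fragments found inside the <main> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while the parser is inside <main>
        self.chunks = []    # non-empty text fragments seen inside <main>

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the list of text fragments located inside <main>."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical sample page: only the <main> content should be kept.
SAMPLE_HTML = """
<body>
  <header><a href="/">Home</a></header>
  <main><h1>Phones</h1><p>Model X, 2999</p></main>
  <footer>Copyright</footer>
</body>
"""

print(extract_main_text(SAMPLE_HTML))  # ['Phones', 'Model X, 2999']
```

The header link and the footer text are skipped because they sit outside <main>.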

1.2 CSS3: Determine "how to place and move elements"

Crawlers do not care whether a page looks good, but some CSS features directly affect how stable our parsing is, for example:

Layout method: Grid/Flex layouts (such as waterfall layouts) can make the visual order of elements differ from their order in the HTML source.
Responsive hiding: mobile and desktop viewports may show different parts of the DOM.
Dynamic class switching: for example, using .active to mark the selected product, or using custom attributes to store resource links.

/* Responsive hiding: the crawler needs to decide whether a piece of content is visible in the target viewport */
.mobile-only {
    display: none;
}
@media (max-width: 768px) {
    .mobile-only {
        display: block;
    }
}

/* Dynamic class name: marks newly listed or promotional products */
.product-card.active-promotion {
    border: 2px solid #ff4d4f;
}

Keep in mind: the rendered page you see is computed by layering HTML and CSS, but none of this show/hide logic has been applied to the raw HTML a crawler downloads. So when analyzing a page, it is best to open the "Elements" panel of the browser developer tools, which shows the live DOM, instead of only looking at "View Source".

1.3 ES6+: Implement “interaction, dynamic loading”

JavaScript (especially ES6+ syntax) is what drives modern web pages, and it is also the biggest challenge crawlers face. Content such as paginated search results, infinite scrolling that loads more as you pull down, and Shadow DOM encapsulated by Web Components does not appear in the initial HTML source; it is generated dynamically in the browser by JS.

// Typical AJAX loading scenario: the raw HTML contains no reviews, so a crawler that only fetches it gets nothing
async function loadReviews(productId) {
    const res = await fetch(`https://api.example.com/reviews/${productId}?page=2`);
    const data = await res.json();
    // The reviews are inserted into the DOM only after the JS has run
    document.querySelector('.review-list').innerHTML = data.map(r => `<p>${r.text}</p>`).join('');
}

Therefore, if your target is this kind of dynamic content, the source fetched with requests alone is not enough. You need a browser tool that can execute JavaScript (more on this later).
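
One way to spot this situation before reaching for a browser is to check whether the target container is already populated in the raw HTML. The following is only a heuristic sketch using the standard library: the names ContainerChildCounter and needs_browser and both sample pages are hypothetical, and the depth tracking assumes reasonably well-formed markup (every non-void tag is closed).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContainerChildCounter(HTMLParser):
    """Count the tags nested inside the first element carrying a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0        # nesting depth inside the container (0 = outside)
        self.found = False    # True once the container has been seen
        self.child_tags = 0   # start tags seen inside the container

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            self.child_tags += 1
            if tag not in VOID_TAGS:
                self.depth += 1
        elif not self.found:
            classes = (dict(attrs).get("class") or "").split()
            if self.class_name in classes and tag not in VOID_TAGS:
                self.found = True
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def needs_browser(html, class_name="review-list"):
    """Return True when the container is missing or empty in the raw HTML."""
    counter = ContainerChildCounter(class_name)
    counter.feed(html)
    counter.close()
    return not counter.found or counter.child_tags == 0


# Hypothetical responses: one server-rendered, one filled in later by JS.
STATIC_PAGE = '<div class="review-list"><p>Great phone</p></div>'
DYNAMIC_PAGE = '<div class="review-list"></div><script src="app.js"></script>'

print(needs_browser(STATIC_PAGE))   # False
print(needs_browser(DYNAMIC_PAGE))  # True
```

An empty or missing container only suggests that the content is rendered by JavaScript; confirming it still means comparing the raw response with what the browser shows.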

2. Core foundation of modern DOM parsing

Whether a page is static or dynamic, the browser ultimately turns it into a DOM tree, and a crawler's core job is to "pick fruit from this tree". Front-end frameworks and newer technologies have introduced a few concepts that are easy to trip over, so let's clear them up first.

2.1 New DOM concepts that are easy to trip over

Virtual DOM: lightweight JavaScript objects that frameworks such as React and Vue keep in memory; crawlers cannot access them. We only care about the real DOM that is finally rendered.
Shadow DOM: an "isolated DOM tree" created by Web Components; its contents do not appear in the initial HTML. To get at the data you need a dedicated access path (such as element.shadowRoot).

// Rough way to reach into Shadow DOM with Puppeteer / Playwright (this code runs in the page context)
const customBtn = document.querySelector('custom-button');
const shadowContent = customBtn?.shadowRoot?.querySelector('.btn-text');

Practical suggestion: start with your browser's "Inspect Element" feature. If you see a #shadow-root (open) marker, you have run into Shadow DOM, and your dynamic scraping code needs to reach through it.

2.2 Fast and easy-to-use modern DOM operation API

You do not need to pull in a large library every time; a few native APIs handle most simple scenarios:

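// Prefer these two; they are more flexible than getElementById and friends
document.querySelector('.product-card.active-promotion'); // first matching element
document.querySelectorAll('.product-card');               // all matching elements

// Some very handy node relationships
element.closest('.category-wrapper');   // nearest matching ancestor, starting from the element itself (handy for nested structures)
element.matches('.sold-out');          // whether the element matches the given selector

// Efficient traversal: a NodeList is array-like, so convert it to a real array before using map/filter
Array.from(document.querySelectorAll('.product-card'))
  .filter(card => !card.matches('.sold-out'));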

3. Two weapons for accurately positioning elements

The first step of a crawler is always to accurately locate the element you want to capture. There are currently two mainstream solutions: CSS selectors and XPath.

3.1 Quick comparison: Which one is more suitable for you?

Criterion	CSS selectors	XPath
Simple syntax, low learning cost	⭐⭐⭐⭐⭐	⭐⭐⭐
Feature coverage (reverse lookup, text matching)	⭐⭐	⭐⭐⭐⭐⭐
Execution performance in the browser	⭐⭐⭐⭐⭐	⭐⭐⭐
Crawler tool support	Supported by almost all tools	Supported by all mainstream tools

Recommendation: in about 90% of scenarios, CSS selectors are enough, and they are simple and intuitive. When you need to find elements by their text content, or find a parent element from one of its children, fall back on XPath.

3.2 Practical Level 4 CSS Selectors

Several selectors added in Selectors Level 4 (often loosely called "CSS4") make crawlers more capable:

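/* :has() selects a parent that contains a given descendant, e.g. "product cards that show a review count" */
.product-card:has(.review-count:not(:empty))

/* :is() shortens grouped selectors, e.g. "every a tag inside header or main" */
:is(header, main) a

/* Combining position and class: product cards with the active class that are among the first 3 children of their parent (not "the first 3 active cards") */
.product-card.active:nth-child(-n+3)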

3.3 XPath to the rescue

When CSS cannot do the job (for example, when you need to work backwards from the text "限时特惠", a limited-time offer label, to the card container that holds it), XPath comes into play:

// XPath example: find the .product-card parent container of the h2 whose text contains "限时特惠" (limited-time offer)
const xpath = '//div[contains(@class, "product-card")]/h2[contains(text(), "限时特惠")]/..';
const result = document.evaluate(
    xpath,
    document,
    null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, // returns a static snapshot that later DOM changes do not affect
    null
);

// Iterate over the result set
for (let i = 0; i < result.snapshotLength; i++) {
    console.log(result.snapshotItem(i));
}
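
The same child-to-parent lookup can be sketched offline in Python. The example below uses the standard library's xml.etree.ElementTree, which implements only a small XPath subset (no contains() or text() functions) and requires well-formed XML, so it matches the class attribute and the heading text exactly. The markup, the English label, and the prices are invented for illustration; for real-world HTML and full XPath 1.0, lxml is the more realistic choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a product listing.
SNIPPET = """
<section>
  <div class="product-card"><h2>Limited-time offer</h2><span>199</span></div>
  <div class="product-card"><h2>Regular item</h2><span>299</span></div>
</section>
""".strip()

root = ET.fromstring(SNIPPET)

# [h2='...'] keeps only the <div> elements that have an <h2> child with
# exactly this text, so the container is selected from its child's content.
cards = root.findall(".//div[@class='product-card'][h2='Limited-time offer']")

# Read another field from each matched container.
prices = [card.findtext("span") for card in cards]
print(prices)  # ['199']
```

Selecting the container through a predicate on its child avoids the trailing /.. step, but the idea is the same: the condition is on the child, the result is the parent.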

4. Mainstream tools for processing static/dynamic content

Depending on whether the data you want to capture exists directly in HTML or relies on dynamic rendering with JS, you need to choose different tools.

4.1 Static HTML: Just use a lightweight parsing library

If all target content is in the initial HTML returned by the server, it is most efficient to parse it directly with a lightweight library:

Node.js: cheerio (jQuery-like syntax, very quick to pick up)
Python: BeautifulSoup4 (beginner friendly), lxml (excellent performance, full XPath 1.0 support)

Here is a small example in Python:

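# Extract product titles and prices with BeautifulSoup4
import requests
from bs4 import BeautifulSoup

url = 'https://example.com/electronics'
headers = {'User-Agent': 'MyCrawler/1.0 (+https://example.com)'}
res = requests.get(url, headers=headers, timeout=10)
res.raise_for_status()  # stop early on 4xx/5xx responses
soup = BeautifulSoup(res.text, 'lxml')  # the lxml parser is usually faster than the built-in html.parser (requires the lxml package)

# Locate elements with CSS selectors
product_cards = soup.select('.product-card:not(.sold-out)')
for card in product_cards:
    title_el = card.select_one('.product-title')
    price_el = card.select_one('.product-price')
    if title_el is None or price_el is None:
        continue  # skip cards that lack a title or a price
    title = title_el.get_text(strip=True)
    price = price_el.get_text(strip=True)
    print(f'Product: {title}, price: {price}')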

4.2 Dynamic content: please use browser automation

If the content depends on JS loading (reviews, infinite scrolling, Shadow DOM, etc.), you need a headless browser (one with no visible UI) to simulate a real visit:

First choice, Playwright: good cross-browser support, a friendly API, and automatic waiting for elements, which greatly reduces flakiness.
Second choice, Puppeteer: focused on the Chrome/Edge ecosystem, with mature documentation.
Alternative, Selenium: compatible with all major browsers, but somewhat slower and relatively cumbersome to configure.

An example of Playwright grabbing dynamic comments:

// Use Playwright to scrape a review list that is loaded by JS
import { chromium } from 'playwright';

(async () => {
    const browser = await chromium.launch();
    const page = await browser.newPage();
    
    // Block images, fonts, and other irrelevant resources to speed up scraping
    await page.route('**/*.{png,jpg,jpeg,gif,woff,woff2}', route => route.abort());
    
    await page.goto('https://example.com/product/123');
    
    // Wait until the review list appears; far more reliable than a hard-coded sleep
    await page.waitForSelector('.review-list li');
    
    // Extract the data
    const reviews = await page.$$eval('.review-list li', items =>
        items.map(item => ({
            author: item.querySelector('.review-author')?.textContent.trim(),
            text: item.querySelector('.review-text')?.textContent.trim(),
            rating: item.querySelector('.review-rating')?.getAttribute('data-rating')
        }))
    );
    
    console.log(reviews);
    await browser.close();
})();

5. Anti-crawler identification and basic response (only used in legal scenarios)

Modern websites pay more and more attention to anti-crawling, but as long as we abide by robots.txt, control the frequency of requests, and reasonably simulate human behavior, most of the time we can still obtain data compliantly.

Common anti-crawling technique	Basic countermeasure
User-Agent detection	Use a pool of legitimate User-Agent strings (for example via the `fake-useragent` library), or the real UA that ships with browser automation tools.
IP restrictions	Use a stable proxy IP service, leave at least 1-2 seconds between requests from each IP, and limit concurrency.
Simple sliding CAPTCHA	Use the drag-and-drop APIs of tools like Playwright to simulate human behavior (advanced CAPTCHAs are hard to solve by pure automation these days).
Request interval detection	Add random delays between requests, such as Python's `time.sleep(random.uniform(1, 3))`.
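
The random-delay row above can be sketched as a small loop. This is a minimal illustration, not a full crawler: the crawl function and its fetch and sleep parameters are hypothetical names, and the demo passes in stand-ins so that nothing is downloaded and nothing actually sleeps.

```python
import random
import time


def crawl(urls, fetch, sleep=time.sleep, low=1.0, high=3.0):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Random pause between `low` and `high` seconds; none before the first request.
            sleep(random.uniform(low, high))
        results.append(fetch(url))
    return results


# Demo with stand-ins: `fetch` fabricates a page and `sleep` only records the pause.
pauses = []
pages = crawl(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    fetch=lambda url: f"<html>{url}</html>",
    sleep=pauses.append,
)

print(len(pages), len(pauses))  # 3 2
```

Passing the sleep function in as a parameter keeps the default behavior (a real time.sleep) while making the loop easy to exercise without waiting.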

Important reminder: any handling of anti-crawler measures must comply with laws, regulations, and the website's terms, and must never be used for illegal intrusion or malicious collection.

6. Legal and compliant crawler best practices

Strictly follow robots.txt: first check /robots.txt in the website's root directory to see which paths may be crawled.
Control request frequency: do not put pressure on the target server. No more than 1-2 requests per second is recommended.
Set a friendly User-Agent: for example MyCrawler/1.0 (+https://your-site.com/crawler-info), so the webmaster can contact you.
Cache captured data: avoid requesting the same page repeatedly, which saves resources and lowers the risk of being banned.
Respect copyright and privacy regulations: do not capture sensitive data such as personal information and trade secrets, and do not use captured data commercially without authorization.
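
The robots.txt check in the first item can be sketched with Python's standard urllib.robotparser. The robots.txt content below is hypothetical and is parsed from a string so the example runs offline; a real crawler would fetch the file from the target site (for instance with set_url() followed by read()). The helper name allowed is likewise made up for this sketch.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyCrawler/1.0 (+https://your-site.com/crawler-info)"

# Hypothetical robots.txt content, inlined so the example needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def allowed(url):
    """Return True if robots.txt permits this crawler to fetch the URL."""
    return robots.can_fetch(USER_AGENT, url)


print(allowed("https://example.com/products/1"))   # True
print(allowed("https://example.com/admin/users"))  # False
print(robots.crawl_delay(USER_AGENT))              # delay requested by the site, if any
```

Calling allowed() before every request, and honoring the crawl delay when one is declared, covers the first two items of the list in a few lines.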

7. Recommended learning resources

MDN Web Docs: The most authoritative web technical documentation, a must-read for in-depth understanding of HTML, CSS, and DOM.
Playwright official documentation: https://playwright.dev/
BeautifulSoup4 Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
"Python3 Web Crawler Development Practice": A practical tutorial suitable for beginners of Python crawlers.

After mastering these basics, you can write your first reliable crawler. We will continue to delve into advanced topics such as API reverse engineering, JavaScript reverse engineering, distributed crawlers, etc., so stay tuned.
