Web page basics
# Basics of modern web crawlers: web page structure and parsing technology
Have you ever wondered what core technology sits behind features such as Douyin's hot-sale lists, price monitoring on e-commerce platforms, and bulk citation statistics for academic papers? That's right: the modern web crawler. The first step in writing a good crawler is to thoroughly understand the "web page cake" you want to bite into. It has three layers, and you need to learn to dig the data you want out of it accurately, whether it is static content or a dynamic page full of JavaScript.
1. Modern three-tier structure of web pages
Today's web pages have moved past the era of "just pile up some static HTML". They are closer to applications in which an HTML5 structural skeleton, a CSS3 skin, and ES6+ dynamic muscles work closely together. Once you understand these three layers, you will know where to look for data, and why data that is visible on the page sometimes cannot be captured.
1.1 HTML5: Defining "Meaningful Content Framework"
HTML5 does more than tell the browser "put an element here"; it also introduces a set of semantic tags. For crawlers this is great news: we no longer have to guess whether `<div id="nav_abc123">` is a navigation bar, because the `<nav>` tag says so directly.
Crawler tip: parse `<main>` first. Its content is usually the most valuable data, such as product lists and article text.
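As a rough sketch of this tip, the snippet below uses only Python's standard-library `html.parser` to keep the text found inside `<main>` and ignore the navigation and footer. The sample markup and the names `MainTextExtractor` and `extract_main_text` are invented for illustration; a real page may omit `<main>` or structure it differently.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect non-empty text nodes that appear inside the <main> element."""

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0   # greater than 0 while the parser is inside <main>
        self.chunks = []  # text fragments found inside <main>

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._depth and text:
            self.chunks.append(text)


# Hypothetical page used only for this example.
SAMPLE_HTML = """
<body>
  <nav><a href="/">Home</a></nav>
  <main>
    <article><h1>Mechanical keyboard</h1><p>Price: 59.00</p></article>
  </main>
  <footer>Contact us</footer>
</body>
"""


def extract_main_text(html):
    """Return the text fragments located inside <main>, in document order."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


print(extract_main_text(SAMPLE_HTML))  # ['Mechanical keyboard', 'Price: 59.00']
```

The "Home" link and the footer text are skipped because they sit outside `<main>`, which is the filtering the semantic tag makes possible.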
1.2 CSS3: Determine "how to place and move elements"
Crawlers do not care whether a page looks good, but some CSS features directly affect how stable our parsing is:
- Layout methods (Grid/Flex): waterfall-style layouts may make the visual order of elements differ from their order in the HTML source.
- Responsive hiding: mobile and desktop visitors may be shown different DOM structures.
- Dynamic class-name switching: for example, `.active` marks the selected product, and custom attributes store resource links.
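To make the last point concrete, here is a small standard-library sketch that finds elements whose class list contains `active` and reads their custom attributes. The `data-sku` and `data-img` attribute names and the sample markup are assumptions made up for this example; real sites choose their own names.

```python
from html.parser import HTMLParser


class ActiveItemParser(HTMLParser):
    """Record the data-* attributes of every element whose class list contains 'active'."""

    def __init__(self) -> None:
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        attr_map = {name: value or "" for name, value in attrs}
        # Compare class tokens, not the raw string, so "product active" still matches.
        classes = attr_map.get("class", "").split()
        if "active" in classes:
            custom = {k: v for k, v in attr_map.items() if k.startswith("data-")}
            self.items.append(custom)


# Hypothetical markup: the second product is the selected one.
SAMPLE_HTML = """
<ul>
  <li class="product" data-sku="A1" data-img="/img/a1.jpg">Keyboard</li>
  <li class="product active" data-sku="B2" data-img="/img/b2.jpg">Mouse</li>
</ul>
"""


def find_active_items(html):
    """Return one dict of data-* attributes per element marked as active."""
    parser = ActiveItemParser()
    parser.feed(html)
    parser.close()
    return parser.items


print(find_active_items(SAMPLE_HTML))
```

Splitting the `class` attribute into tokens matters because class names are switched dynamically: an exact string comparison against `"active"` would miss elements that carry several classes at once.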
Keep in mind: the page you see is computed by the browser from HTML and CSS together, but when a crawler fetches the raw HTML, none of this show/hide logic has run yet. So when analyzing a page, open the "Elements" panel in the browser's developer tools rather than relying only on "View Source".
1.3 ES6+: Implement “interaction, dynamic loading”
JavaScript (especially ES6+ syntax) is the engine of modern web pages, and it is also the biggest challenge for crawlers. Content such as paginated search results, infinite scrolling that loads more as you pull down, and Shadow DOM encapsulated by Web Components does not appear directly in the initial HTML source; it is generated dynamically in the browser by JS.
So if your target is this kind of dynamic content, the source fetched with `requests` alone is not enough. You need a browser tool that can execute JavaScript (more on this later).
2. Core foundation of modern DOM parsing
Whether a page is static or dynamic, the browser eventually renders it into a DOM tree, and the crawler's core job is to "pick fruit from this tree". Front-end frameworks and newer technologies have introduced a few concepts that are easy to trip over, so let's clear them up first.
2.1 New DOM concepts that are easy to trip over
- Virtual DOM: lightweight JavaScript objects that frameworks such as React and Vue keep in memory. Crawlers cannot access them; we only care about the real DOM that is finally rendered.
- Shadow DOM: an "isolated DOM tree" created by Web Components whose internal content is not visible in the initial HTML. To get at its data you need a dedicated access path (such as `element.shadowRoot`).
Practical suggestion: start with your browser's "Inspect Element" feature. If you see a `#shadow-root (open)` marker, you have run into Shadow DOM. When scraping dynamic pages, remember to pierce it.
2.2 Fast and easy-to-use modern DOM APIs
You do not need to pull in a large library every time; a few native APIs can handle most simple scenarios.
3. Two weapons for accurately positioning elements
The first step in any crawler is to locate precisely the element you want to capture. There are two mainstream approaches: CSS selectors and XPath.
3.1 Quick comparison: Which one is more suitable for you?
Selection advice: in about 90% of scenarios, CSS selectors are enough, and they are simple and intuitive. When you need to find elements by their text content, or find a parent element from one of its children, switch to XPath.
3.2 Practical Level 4 CSS Selectors
Several selectors introduced in Selectors Level 4 (informally "CSS4") can make a crawler considerably more powerful.
3.3 XPath to the rescue
When CSS selectors are not enough (for example, when you need to work backwards from the text "limited time special offer" to the card container that holds it), XPath comes into play:
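One way to sketch this text-to-container lookup without extra dependencies is the limited XPath subset in Python's standard-library `xml.etree.ElementTree`. This is a stand-in, not full XPath: it needs well-formed markup and lacks functions such as `contains()`, so `lxml` is the more usual choice for real pages. The markup, the `card` and `badge` class names, and the `cards_with_badge` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet: two of the three cards carry the offer badge.
SAMPLE = """
<div id="list">
  <div class="card"><h3>Keyboard</h3><span class="badge">Limited-time offer</span></div>
  <div class="card"><h3>Mouse</h3></div>
  <div class="card"><h3>Monitor</h3><span class="badge">Limited-time offer</span></div>
</div>
"""


def cards_with_badge(markup, badge_text):
    """Return the <h3> titles of cards that contain a <span> with exactly badge_text."""
    root = ET.fromstring(markup.strip())
    # [@class='card'] filters on the attribute; [span='...'] keeps only cards
    # that have a <span> child whose full text equals badge_text.
    # Note: badge_text must not contain a single quote in this simple sketch.
    path = f".//div[@class='card'][span='{badge_text}']"
    return [card.findtext("h3") for card in root.findall(path)]


print(cards_with_badge(SAMPLE, "Limited-time offer"))  # ['Keyboard', 'Monitor']
```

The predicate is attached to the container, so the query returns the card itself rather than the badge, which is the "find the parent from the child's text" pattern described above.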
4. Mainstream tools for processing static/dynamic content
The right tool depends on whether the data you want is present directly in the HTML or is rendered dynamically by JS.
4.1 Static HTML: Just use a lightweight parsing library
If all the target content is in the initial HTML returned by the server, parsing it directly with a lightweight library is the most efficient approach:
- Node.js: `cheerio` (jQuery-like syntax, very quick to pick up)
- Python: `BeautifulSoup4` (beginner-friendly), `lxml` (excellent performance, strong XPath support)
Here is a small example in Python:
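The sketch below uses only the standard-library `html.parser` so that it runs without installing anything; BeautifulSoup4 or lxml would express the same extraction more compactly. The `product`, `name` and `price` class names and the sample list are assumptions for illustration, not taken from any real site.

```python
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Turn <li class="product"> blocks into {"name": ..., "price": ...} dicts."""

    FIELDS = {"name", "price"}  # class names assumed by the sample markup

    def __init__(self) -> None:
        super().__init__()
        self.products = []
        self._field = None  # field whose text is currently being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "product" in classes:
            self.products.append({})  # start a new product record
        elif self.products:
            # Remember which field (if any) this element represents.
            self._field = next((c for c in classes if c in self.FIELDS), None)

    def handle_data(self, data):
        if self._field and data.strip():
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None


# Hypothetical static HTML, as a server might return it.
SAMPLE_HTML = """
<ul id="hot-list">
  <li class="product"><span class="name">Keyboard</span><span class="price">59.00</span></li>
  <li class="product"><span class="name">Mouse</span><span class="price">19.90</span></li>
</ul>
"""


def parse_products(html):
    """Return one dict per product found in the static HTML."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.products


for product in parse_products(SAMPLE_HTML):
    print(product["name"], product["price"])
```

In a real crawler the `html` argument would come from an HTTP response body; it is kept as a string here so the parsing step can be tried offline.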
4.2 Dynamic content: please use browser automation
If the content is loaded by JS (comments, infinite scrolling, Shadow DOM, and so on), you need a headless browser (one that runs without a visible window) to simulate a real visit:
- First choice, Playwright: good cross-browser support, a friendly API, and automatic waiting for elements, which greatly reduces flakiness.
- Second choice, Puppeteer: focused on the Chrome/Edge ecosystem, with mature documentation.
- Alternative, Selenium: compatible with all major browsers, but somewhat slower and relatively cumbersome to configure.
An example of Playwright grabbing dynamic comments:
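A minimal sketch using Playwright's synchronous Python API, assuming the comments are rendered into elements matching a `.comment` selector; that selector, the `fetch_comments` and `clean_comments` names, and the timeout value are placeholders to adapt to the actual page.

```python
def clean_comments(raw_texts):
    """Strip surrounding whitespace and drop empty entries from scraped texts."""
    return [text.strip() for text in raw_texts if text and text.strip()]


def fetch_comments(url, selector=".comment", timeout_ms=10_000):
    """Open url in headless Chromium, wait for comments to render, return their text."""
    # Imported lazily so the helper above stays usable where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            # Wait until the page's JavaScript has rendered at least one matching element.
            page.wait_for_selector(selector, timeout=timeout_ms)
            return clean_comments(page.locator(selector).all_inner_texts())
        finally:
            browser.close()


print(clean_comments(["  Great product!  ", "", "Fast delivery\n"]))
```

Calling `fetch_comments("https://example.com/item/1")` (a placeholder URL) requires `pip install playwright` followed by `playwright install chromium`. Playwright's CSS locators pierce open shadow roots by default, so the same call can also reach comments rendered inside an open Shadow DOM.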
5. Anti-crawler identification and basic response (only used in legal scenarios)
Modern websites put more and more effort into anti-crawling measures, but as long as we obey robots.txt, limit our request rate, and simulate human behavior within reason, we can still obtain data compliantly most of the time.
Important reminder: any handling of anti-crawler measures must comply with laws, regulations, and the website's terms, and must never be used for illegal intrusion or malicious collection.
6. Legal and compliant crawler best practices
- Strictly follow robots.txt: first check `/robots.txt` in the website's root directory to see which paths may be crawled.
- Control request frequency: do not put pressure on the target server. No more than 1-2 requests per second is recommended.
- Set a friendly User-Agent: for example `MyCrawler/1.0 (+https://your-site.com/crawler-info)`, so the webmaster can contact you.
- Cache captured data: avoid requesting the same page repeatedly, which saves resources and reduces the risk of being banned.
- Respect copyright and privacy regulations: do not capture sensitive data such as personal information or trade secrets, and do not use captured data commercially without authorization.
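The first four practices can be combined in one small wrapper. The following is a sketch under stated assumptions rather than a production design: `PoliteFetcher` and `urllib_fetch` are hypothetical names, the robots.txt rules are passed in as lines, the cache is an in-memory dict, and the one-second interval is the conservative end of the 1-2 requests per second guideline.

```python
import time
import urllib.request
import urllib.robotparser

USER_AGENT = "MyCrawler/1.0 (+https://your-site.com/crawler-info)"


def urllib_fetch(url, user_agent):
    """Default fetcher: a plain GET that sends the crawler's User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class PoliteFetcher:
    """Fetch pages while honoring robots.txt, a minimum delay, and a cache."""

    def __init__(self, robots_lines, fetch=urllib_fetch, min_interval=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._robots = urllib.robotparser.RobotFileParser()
        self._robots.parse(robots_lines)   # rules from the site's /robots.txt
        self._fetch = fetch
        self._min_interval = min_interval  # seconds between two real requests
        self._clock = clock
        self._sleep = sleep
        self._last_request = None
        self._cache = {}

    def get(self, url):
        """Return the page body, or None if robots.txt disallows the URL."""
        if url in self._cache:             # cached: no new request is sent
            return self._cache[url]
        if not self._robots.can_fetch(USER_AGENT, url):
            return None
        if self._last_request is not None:
            wait = self._min_interval - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)          # throttle to the configured rate
        self._last_request = self._clock()
        body = self._fetch(url, USER_AGENT)
        self._cache[url] = body
        return body
```

A typical use would be `PoliteFetcher(robots_text.splitlines()).get(url)`, where `robots_text` is the downloaded content of the site's robots.txt. The `fetch`, `clock`, and `sleep` parameters exist so the throttling and caching logic can be exercised without touching the network.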
7. Recommended learning resources
- MDN Web Docs: authoritative web technology documentation, and essential reading for a deeper understanding of HTML, CSS, and the DOM.
- Playwright official documentation: https://playwright.dev/
- BeautifulSoup4 Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- "Python3 Web Crawler Development Practice": a hands-on tutorial suitable for beginners writing crawlers in Python.
Once you have mastered these basics, you can write your first reliable crawler. Later installments will go deeper into advanced topics such as API reverse engineering, JavaScript reverse engineering, and distributed crawlers, so stay tuned.

