The Complete Guide to Scrapy's ImagesPipeline and FilesPipeline - Downloading and Processing Multimedia Resources in Detail




Anyone who has worked with Scrapy has probably run into this scenario: the crawler has collected hundreds of large product images, and you need to download them locally and generate thumbnails in several sizes; or you need to save a batch of PDF and video files with automatic deduplication and sensible file names for easy retrieval. Writing the download logic by hand every time means reinventing the wheel, and it is easy to get wrong.

Scrapy's built-in `ImagesPipeline` and `FilesPipeline` were made for exactly this kind of multimedia download scenario. They automatically handle tedious tasks such as downloading, deduplication, format validation, and path management, keeping your crawler code clean and efficient.

This article starts with the differences between the two, then walks step by step through basic configuration and practical usage to advanced customization, and finishes with the most common problems. By the end, you should be able to handle the download pipeline for most multimedia resources.


ImagesPipeline vs FilesPipeline quick overview

Newcomers often confuse these two pipelines, but their roles are clearly distinct. This comparison table sums it up:

| Feature | ImagesPipeline | FilesPipeline |
| --- | --- | --- |
| Scope | Images only (JPG/PNG/GIF, etc.) | All file types |
| Core capabilities | Automatic download + deduplication + size/format/quality processing + thumbnails | Automatic download + deduplication + file path management |
| Running overhead | Slightly larger (extra image validation and format conversion steps) | Small |
| Typical scenarios | E-commerce product images, posters, avatars, asset libraries, etc. | PDF documents, videos, archives, code files, etc. |

In one sentence: **use ImagesPipeline when you only download images, and use FilesPipeline for mixed files or non-image resources.** If you have both images and other files, both can be enabled at the same time and Scrapy will run them in priority order.

💡 Tip: ImagesPipeline is a subclass of FilesPipeline that adds image processing based on Pillow (PIL). Images can therefore also be downloaded with FilesPipeline, but you lose extras such as thumbnails and format conversion.


Basic configuration and simplest practice

Next, let's get the pipeline running step by step. Whether you are downloading images or files, the configuration follows the same sequence: enable the pipeline → set parameters → define the Item → yield URLs from the Spider.

Step 1: Enable Pipeline

In `settings.py`, add the target pipeline to the `ITEM_PIPELINES` dictionary. The value is the priority: the smaller the value, the earlier the pipeline runs. If you want both, it is customary (but not mandatory) to let ImagesPipeline run first.

# settings.py
ITEM_PIPELINES = {
    # Images only
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    # Files only
    # 'scrapy.pipelines.files.FilesPipeline': 1,
    # Images and files together
    'scrapy.pipelines.images.ImagesPipeline': 1,
    'scrapy.pipelines.files.FilesPipeline': 2,
}
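To make the ordering rule concrete, the sketch below sorts an `ITEM_PIPELINES`-style dictionary by its priority values. The helper name `execution_order` is hypothetical and only illustrates the "lower value runs first" rule; it is not how Scrapy loads pipelines internally.

```python
# Hypothetical helper: illustrates "the smaller the value, the earlier it runs".
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 2,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}


def execution_order(pipelines):
    """Return pipeline paths sorted by ascending priority value."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = execution_order(ITEM_PIPELINES)
print(order)
```

The order in which the entries are written in the dictionary does not matter; only the numeric values do.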

Step 2: Configure core parameters

ImagesPipeline common configuration

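# settings.py - ImagesPipeline settings
IMAGES_URLS_FIELD = 'image_urls'      # Item field that holds the list of image URLs (default)
IMAGES_RESULT_FIELD = 'images'        # Item field the download results are written to (default)
IMAGES_STORE = './static/images'       # Local root directory for images (S3, FTP and other URIs are also supported)
IMAGES_EXPIRES = 90                   # Days before a stored image expires and is downloaded again (default: 90)
IMAGES_MIN_WIDTH = 100                # Images narrower than this (px) are filtered out
IMAGES_MIN_HEIGHT = 100               # Images shorter than this (px) are filtered out
IMAGES_THUMBS = {                     # Generate thumbnails of the given sizes automatically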
    'small': (100, 100),
    'medium': (300, 300),
}

⚠️ Note: `IMAGES_MIN_WIDTH` and `IMAGES_MIN_HEIGHT` only filter images that were downloaded successfully. If the image URL itself is invalid, the size filter is never reached and the failure has to be handled separately.
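The size rule can be sketched in plain Python. The function name `passes_size_filter` is made up for this illustration; it only mirrors the rule that an image is rejected when either dimension is below its minimum, and it is not Scrapy's own code.

```python
def passes_size_filter(width, height, min_width=100, min_height=100):
    """Return True when both dimensions meet their minimum (in pixels)."""
    return width >= min_width and height >= min_height


print(passes_size_filter(640, 480))  # large enough in both dimensions
print(passes_size_filter(640, 80))   # too short, would be filtered out
```

Note that one undersized dimension is enough for the image to be filtered, even if the other dimension is large.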

FilesPipeline common configuration

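# settings.py - FilesPipeline settings
FILES_URLS_FIELD = 'file_urls'        # Item field that holds the list of file URLs (default)
FILES_RESULT_FIELD = 'files'          # Item field the download results are written to (default)
FILES_STORE = './static/files'         # Local root directory for files
FILES_EXPIRES = 90                    # Days before a stored file expires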

📌 The `*_URLS_FIELD` and `*_RESULT_FIELD` settings of both pipelines let you use custom field names in the Item; just keep the names in settings and in the Item in sync.
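As a small illustration of keeping the two sides in sync, the fragment below uses custom field names. The names `photo_urls`, `photos`, `attachment_urls`, and `attachments` are hypothetical, and a plain dict stands in for the Item so the fragment runs on its own.

```python
# settings.py (sketch): the field names below are hypothetical examples
IMAGES_URLS_FIELD = 'photo_urls'
IMAGES_RESULT_FIELD = 'photos'
FILES_URLS_FIELD = 'attachment_urls'
FILES_RESULT_FIELD = 'attachments'

# A plain dict stands in for the Item here; it must use the same names
item = {
    IMAGES_URLS_FIELD: ['https://example.com/a.jpg'],
    FILES_URLS_FIELD: ['https://example.com/manual.pdf'],
}
print(sorted(item))
```

In a real project the Item class would declare all four fields, including the two result fields that the pipelines fill in.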

Step 3: Define Item

The Item needs two fields per pipeline: one to hold the URL list and one to receive the download results. The image and file fields can live in the same Item or in separate Items. The example below puts them together:

# items.py
import scrapy

class MediaItem(scrapy.Item):
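    # Common fields
    id = scrapy.Field()
    title = scrapy.Field()
    # Image fields
    image_urls = scrapy.Field()   # List in which each element is an image URL
    images = scrapy.Field()       # Filled with the result list by the pipeline after download
    # File fields
    file_urls = scrapy.Field()    # List in which each element is a file URL
    files = scrapy.Field()        # Filled with the result list by the pipeline after download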

Scrapy's convention is that the URL field must be an iterable (usually a list). Even a single URL must be wrapped in a list.
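The reason a bare string is a problem is that iterating over a string yields single characters, not URLs. A small defensive helper can normalize the value before it goes into the Item. The name `as_url_list` is an assumption made for this sketch, not part of Scrapy.

```python
def as_url_list(value):
    """Normalize a URL field value to a list of URLs."""
    if value is None:
        return []
    if isinstance(value, str):
        # Wrap a single URL so it is not iterated character by character
        return [value]
    return list(value)


print(as_url_list('https://example.com/a.jpg'))
print(as_url_list(('https://example.com/a.jpg', 'https://example.com/b.jpg')))
```

In a spider this could be used as `item['image_urls'] = as_url_list(response.css('img::attr(src)').get())` when a selector may return a single value or nothing.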

Step 4: Extract URL from Spider and yield Item

Finally, fill the Item with the extracted URLs in the Spider and `yield` it. The pipeline then takes over the rest of the download process automatically.

# spiders/media_spider.py
import scrapy
from my_project.items import MediaItem

class ExampleSpider(scrapy.Spider):
    name = 'media_demo'
    start_urls = ['https://example.com/media-page']

    def parse(self, response):
        item = MediaItem()
        item['id'] = '123'
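        item['title'] = 'Sample media pack'

        # Extract the src of every product image; urljoin turns relative
        # links into the absolute URLs the pipelines need
        item['image_urls'] = [
            response.urljoin(src)
            for src in response.css('.product-image::attr(src)').getall()
        ]

        # Extract every file download link
        item['file_urls'] = [
            response.urljoin(href)
            for href in response.css('.download-link::attr(href)').getall()
        ]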

        yield item

After running the crawler, you will find the original images under `./static/images/full/` and the corresponding thumbnails under `./static/images/thumbs/small/` and `./static/images/thumbs/medium/`; downloaded files appear under `./static/files/full/`. At the same time, `item['images']` and `item['files']` are populated with a list of dictionaries containing `url`, `path`, `checksum`, and other information, ready for later storage.
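To show how the result field might be consumed afterwards, here is a minimal sketch. The dictionaries are made-up sample data (the paths and checksums are placeholders, not real hashes), and a plain dict stands in for the Item.

```python
import posixpath

IMAGES_STORE = './static/images'

# Illustrative data only: the path and checksum values below are made up
item = {
    'images': [
        {'url': 'https://example.com/a.jpg', 'path': 'full/aaa111.jpg', 'checksum': 'c0ffee'},
        {'url': 'https://example.com/b.jpg', 'path': 'full/bbb222.jpg', 'checksum': 'decade'},
    ],
}

# 'path' is relative to the store root, so join it to get the location on disk
local_paths = [posixpath.join(IMAGES_STORE, img['path']) for img in item['images']]

# Map each source URL to its stored path, e.g. for a database record
url_to_path = {img['url']: img['path'] for img in item['images']}

print(local_paths)
print(url_to_path)
```

Storing the relative `path` rather than the absolute location keeps the records valid if the store directory is moved later.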


High-frequency practical customization skills

The default pipelines cover most download needs, but in real projects you will likely run into requirements such as **custom file names, request headers that get past anti-crawling checks, or converting all images to WebP**. For these you need to subclass the pipeline and override the corresponding methods.

Customize file names and say goodbye to unreadable hashes

The default file name is the SHA1 hash of the URL, which is unreadable for humans. By overriding the `file_path` method, we can give file names a `category/timestamp-ID-original filename` structure that avoids collisions and makes files easy to find.
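For reference, the default naming scheme for images can be sketched with the standard library alone. The function name `default_image_path` is made up for this illustration; it approximates the default behaviour rather than reproducing Scrapy's code.

```python
import hashlib


def default_image_path(url):
    """Sketch of the default scheme: full/<SHA1 hex digest of the URL>.jpg"""
    digest = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return f"full/{digest}.jpg"


path = default_image_path('https://example.com/products/shoe.png')
print(path)
```

The same URL always maps to the same path, which is what makes deduplication possible, but nothing in the name tells you which product the file belongs to.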

ImagesPipeline Custom file name

# pipelines.py
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from itemadapter import ItemAdapter
from urllib.parse import urlparse
import time
import os

class CustomImagesPipeline(ImagesPipeline):

    def file_path(self, request, response=None, info=None, *, item=None):
        adapter = ItemAdapter(item)
        item_id = adapter.get('id', 'unknown')
        category = adapter.get('category', 'uncategorized')

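        # Extract the original file name from the URL
        url_path = urlparse(request.url).path
        original_name = os.path.basename(url_path)
        if not original_name:
            original_name = 'image.jpg'

        # Caveat: Scrapy may call file_path more than once per request, so a
        # wall-clock timestamp can differ between calls; a stable value
        # carried on the item is a safer choice
        timestamp = str(int(time.time()))
        return f"{category}/{timestamp}-{item_id}-{original_name}"

    def thumb_path(self, request, thumb_id, response=None, info=None, *, item=None):
        # Derive the thumbnail path from the main image path, e.g. xxx_small.jpg
        base_path = self.file_path(request, item=item)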
        name, ext = os.path.splitext(base_path)
        return f"{name}_{thumb_id}{ext}"

🧠 Tip: You do not have to override `thumb_path`; the default works fine. But if you want the thumbnail to share the original image's name with only a suffix added, handle it as shown above.

FilesPipeline can be customized in exactly the same way: just inherit from `FilesPipeline` instead, so the details are not repeated here.

Add anti-anti-crawling headers to download requests

Many resource servers check the `Referer` and `User-Agent` headers, and some even require cookies. The requests generated by the default pipeline are quite bare, so we override `get_media_requests` to customize them.

class CustomImagesPipeline(ImagesPipeline):
    # file_path and thumb_path from above are omitted...

    def get_media_requests(self, item, info):
        adapter = ItemAdapter(item)
        # The Item can carry the URL of the current page to use as the Referer
        referer = adapter.get('source_url', 'https://example.com')

        for url in adapter.get(self.images_urls_field, []):
            yield scrapy.Request(
                url,
                headers={
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36',
                    'Referer': referer,
                },
                meta={
                    'download_timeout': 60,  # Large files may need a longer timeout
                }
            )

⚠️ About cookies: the usual recommendation is to keep Scrapy's global `COOKIES_ENABLED = True` and let the cookies middleware manage them. If you have special requirements, you can also pass cookies here through the `headers` or `cookies` argument of the request.
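The header logic can be factored into a small standalone function that is easy to test without running a crawl. The helper `build_download_headers` and its fallback rule (use the media URL's own site root when no source page is known) are assumptions made for this sketch.

```python
from urllib.parse import urlsplit


def build_download_headers(media_url, source_url=None, user_agent='Mozilla/5.0'):
    """Build request headers for a media download.

    Uses the page the link was found on as Referer when available,
    otherwise falls back to the root of the media URL's own site.
    """
    if source_url:
        referer = source_url
    else:
        parts = urlsplit(media_url)
        referer = f"{parts.scheme}://{parts.netloc}/"
    return {'User-Agent': user_agent, 'Referer': referer}


headers = build_download_headers('https://cdn.example.com/img/a.jpg')
print(headers['Referer'])
```

The returned dictionary could then be passed as the `headers` argument when the request is created inside `get_media_requests`.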

Convert all images to WebP (advanced)

WebP files are typically smaller than equivalent JPG/PNG files, and the format is widely supported by modern browsers. By overriding the `convert_image` method, we can convert every downloaded image to WebP and handle the necessary mode conversion and scaling in the same step.

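from io import BytesIO
from PIL import Image

class CustomImagesPipeline(ImagesPipeline):
    # ... other methods

    # Assumes a recent Scrapy release, where convert_image receives
    # response_body and must return an (image, buffer) tuple
    def convert_image(self, image, size=None, *, response_body=None):
        # WebP supports transparency: keep RGBA, promote palette and LA
        # images to RGBA, and convert any other non-RGB mode to RGB
        if image.mode in ('P', 'LA'):
            image = image.convert('RGBA')
        elif image.mode not in ('RGB', 'RGBA'):
            image = image.convert('RGB')

        # Scale on demand; thumbnail() keeps the aspect ratio
        if size:
            image = image.copy()
            image.thumbnail(size, Image.Resampling.LANCZOS)

        # The bytes written to the buffer are what gets stored, so the
        # output format is decided here, not by the file extension
        buf = BytesIO()
        image.save(buf, 'WEBP')
        return image, buf

Also change the extension to `.webp` in `file_path` so that the file name matches the encoded content. The extension alone does not change the format; Scrapy stores whatever bytes `convert_image` returns. If you generate thumbnails, `thumb_path` needs the same extension change.

def file_path(self, request, response=None, info=None, *, item=None):
    # ... build the path
    base_path = super().file_path(request, response, info, item=item)
    # Replace the existing extension with .webp
    path_without_ext, _ = os.path.splitext(base_path)
    return path_without_ext + '.webp'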

At this point, a fully automatic image downloading and converting pipeline to WebP is completed.


Frequently Asked Questions and Solutions

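❓ Image/file downloads fail, but nothing obvious shows up in the log

Cause: By default the pipeline does not drop the Item or stop the crawl when a single download fails. The failure is logged as a warning, and the failed URL is simply left out of the result field, so it is easy to miss.

Solution:

  • Run the crawler with `--logfile=scrapy.log`, then search the log for `Error downloading`.
  • You can also override the `media_failed` method to log failed URLs and reasons explicitly.
def media_failed(self, failure, request, info):
    # The pipeline has no self.logger attribute, so log through the spider
    info.spider.logger.error(f'Download failed: {request.url} - {failure.value}')
    # Optional: store the failed URL in a database or file for a later retry
    # Let the parent class finish its own failure handling
    return super().media_failed(failure, request, info)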
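Another place to notice failures is `item_completed`, which receives a `results` list of `(success, value)` pairs. The sketch below separates the two groups in plain Python; the helper `split_results` is hypothetical, and the sample data is made up, with a string standing in for the failure object Scrapy would pass.

```python
def split_results(results):
    """Separate successful entries from failed ones in a results list."""
    succeeded = [value for ok, value in results if ok]
    failed = [value for ok, value in results if not ok]
    return succeeded, failed


# Made-up sample: a string stands in for the failure object
results = [
    (True, {'url': 'https://example.com/a.jpg', 'path': 'full/a.jpg'}),
    (False, 'timeout while fetching https://example.com/b.jpg'),
]
succeeded, failed = split_results(results)
print(len(succeeded), len(failed))
```

Inside a pipeline subclass, a non-empty `failed` list could be logged or counted before the item is returned.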

❓ Files still get overwritten despite custom file names

Cause: The naming rule is not unique enough, for example several Items share the same `id`, or the name contains no timestamp or hash.

Solution: When building the file name, combine at least a timestamp, the Item's unique ID, and a short URL hash (the first 8 characters).

from hashlib import sha1

url_hash = sha1(request.url.encode()).hexdigest()[:8]
file_name = f"{timestamp}-{item_id}-{url_hash}-{original_name}"
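Put together as a self-contained function, the naming rule might look like the sketch below. The function name `build_file_name`, the `file.bin` fallback, and the optional `timestamp` parameter (added so the result is reproducible in tests) are assumptions for illustration.

```python
import posixpath
import time
from hashlib import sha1
from urllib.parse import urlparse


def build_file_name(url, item_id, timestamp=None):
    """Combine timestamp, item ID, short URL hash and original name."""
    timestamp = int(time.time()) if timestamp is None else timestamp
    # Fall back to a generic name when the URL path has no file name
    original_name = posixpath.basename(urlparse(url).path) or 'file.bin'
    url_hash = sha1(url.encode('utf-8')).hexdigest()[:8]
    return f"{timestamp}-{item_id}-{url_hash}-{original_name}"


# Same item, same second, same base name, but different URLs
name_a = build_file_name('https://example.com/a/logo.png', 42, timestamp=1700000000)
name_b = build_file_name('https://example.com/b/logo.png', 42, timestamp=1700000000)
print(name_a)
print(name_b)
```

The URL hash is what keeps the two names apart here, since the timestamp, ID, and original file name are all identical.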

❓ Cached images keep expiring and being downloaded again, but I want to keep them permanently

Cause: `IMAGES_EXPIRES` / `FILES_EXPIRES` is set too low.

Solution: Set the expiration to a very large number of days, for example `36500` (about 100 years). Avoid `0`: a stored file counts as expired as soon as its age in days exceeds the setting, so `0` makes every file expire immediately instead of never.

IMAGES_EXPIRES = 36500   # about 100 years
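The expiry check boils down to comparing a file's age in days with the setting. The sketch below reproduces that comparison with the standard library; `needs_redownload` is a hypothetical name, and the timestamps are arbitrary values chosen for the example.

```python
import time

SECONDS_PER_DAY = 60 * 60 * 24


def needs_redownload(last_modified, expires_days, now=None):
    """Return True when a stored file is older than the expiry window.

    last_modified and now are Unix timestamps in seconds.
    """
    now = time.time() if now is None else now
    age_days = (now - last_modified) / SECONDS_PER_DAY
    return age_days > expires_days


now = 1_700_000_000
one_year_ago = now - 365 * SECONDS_PER_DAY
print(needs_redownload(one_year_ago, 90, now))     # past the 90-day default
print(needs_redownload(one_year_ago, 36500, now))  # inside the 100-year window
```

With the large value, a file would have to sit untouched for a century before it is fetched again, which is effectively permanent.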

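❓ Memory spikes or the process crashes when handling large images

Cause: Image dimensions are not limited, so very large images are loaded and scaled in one go.

Solution:

  • Limit the maximum dimensions in `convert_image`; for example, scale the image down proportionally when its width or height exceeds 2048px.
  • Release large intermediate objects explicitly. Python has garbage collection, but using `del` on large objects you no longer need lets their memory be reclaimed sooner.
MAX_SIZE = 2048

# Assumes a recent Scrapy release, where convert_image receives
# response_body and returns an (image, buffer) tuple
def convert_image(self, image, size=None, *, response_body=None):
    # Limit the maximum dimensions
    if max(image.size) > MAX_SIZE:
        # thumbnail() shrinks in place and keeps the aspect ratio
        image.thumbnail((MAX_SIZE, MAX_SIZE), Image.Resampling.LANCZOS)
        # copy() clears the source format, so the parent class re-encodes
        # the shrunken image instead of reusing the original bytes
        image = image.copy()

    # Let the parent class handle mode conversion, thumbnails and encoding
    return super().convert_image(image, size, response_body=response_body)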
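The arithmetic behind proportional scaling is simple enough to check by hand. The sketch below computes the target dimensions for a given limit; `fit_within` is a made-up helper, and an image library may round slightly differently than this function does.

```python
def fit_within(width, height, max_size=2048):
    """Scale (width, height) down so the longer side is at most max_size."""
    longest = max(width, height)
    if longest <= max_size:
        # Already small enough: leave the dimensions untouched
        return width, height
    new_width = max(1, round(width * max_size / longest))
    new_height = max(1, round(height * max_size / longest))
    return new_width, new_height


print(fit_within(4096, 2048))
print(fit_within(800, 600))
```

Both sides are multiplied by the same factor, so the aspect ratio is preserved and only oversized images are touched.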

❓ Thumbnails are generated, but the original image is still in the full directory and I only want to keep the thumbnails

By default, the pipeline keeps the original image alongside the thumbnails. To keep only the thumbnails, override the `item_completed` method and delete the original file manually once processing has finished.

import os
from scrapy.pipelines.images import ImagesPipeline

class ThumbOnlyPipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        # results is a list of (success, image_info) tuples
        for ok, image_info in results:
            if ok:
                # Delete the original image (assumes local filesystem storage,
                # where self.store.basedir is the IMAGES_STORE directory)
                original_path = os.path.join(self.store.basedir, image_info['path'])
                if os.path.exists(original_path):
                    os.remove(original_path)
        # Let the parent class fill the result field before returning the item
        return super().item_completed(results, item, info)

⚠️ Note: With this approach the original image is gone, so it has to be downloaded again whenever you need it. Weigh this against your actual business needs.


Closing thoughts

`ImagesPipeline` and `FilesPipeline` are Scrapy's gift to developers of multimedia crawlers. Master them and you can quickly build a robust resource download pipeline, leaving you free to focus on data extraction and business logic. Combined with custom file names, request headers that get past anti-crawling checks, and unified formats, they cover the needs of most production environments.

If you run into other thorny problems in practice, go back to the official documentation, or print the attributes of `info.spider` in your code; Scrapy's pipelines are more powerful and flexible than they first appear. Happy crawling!

📌 Further reading: Next, you can explore more ways to use Item Pipelines and learn how to clean and validate the downloaded data.