Dockerized Crawlers - Cloud-Native Spider Deployment and Management in Depth

📂 Stage: Stage 6 - Operations & Monitoring (Engineering)
🔗 Related chapters: Scrapyd & ScrapydWeb · Crawl Monitoring Dashboard · Scrapy-Redis Distributed Architecture

Table of Contents

  • Docker Containerization Overview
  • Dockerfile Best Practices
  • Multi-Stage Build Optimization
  • Docker Compose Orchestration
  • Production Deployment
  • Networking and Storage Configuration
  • Security Configuration and Permission Management
  • CI/CD Integration
  • Performance Optimization and Monitoring
  • Troubleshooting and Maintenance
  • Best Practices Summary

Docker Containerization Overview

Docker containerization gives Scrapy spiders a standardized runtime environment, eliminating the classic "it works on my machine" problem and delivering environment consistency, portability, and scalability.

Advantages of Containerization

1. Environment Consistency

  • Unified development, testing, and production environments
  • Locked dependency versions
  • Isolated system libraries

2. Deployment Convenience

  • One-command deployment
  • Fast start/stop
  • Version management

3. Resource Isolation

  • CPU/memory limits
  • Network isolation
  • File system isolation

4. Scalability

  • Horizontal scaling
  • Load balancing
  • Autoscaling

Containerization Architecture Patterns

1. Monolithic

  • One container runs the whole crawler
  • Suited to small projects
  • Simple deployment

2. Microservices

  • Multiple containers working together
  • Crawling, storage, and monitoring separated
  • Suited to large projects

3. Hybrid

  • Combines monolithic and microservice approaches
  • Composed flexibly to match business needs

Dockerfile Best Practices

Basic Dockerfile

# Dockerfile - base image for a Scrapy crawler
# Use the official Python base image
FROM python:3.9-slim

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PYTHONPATH=/app \
    SCRAPY_SETTINGS_MODULE=myproject.settings

# Set the working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    libxml2-dev \
    libxslt1-dev \
    libffi-dev \
    libssl-dev \
    libpng-dev \
    libjpeg-dev \
    zlib1g-dev \
    wget \
    curl \
    git \
    && rm -rf /var/lib/apt/lists/*

# Copy the dependency manifest
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt

# Create a non-root user
RUN groupadd -r appuser && useradd -r -g appuser appuser
RUN chown -R appuser:appuser /app
USER appuser

# Copy the application code
COPY --chown=appuser:appuser . .

# Create required directories
RUN mkdir -p /app/logs /app/data

# Expose a port (if needed)
EXPOSE 8080

# Health check (raise_for_status makes non-2xx responses fail the check)
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD python -c "import requests; requests.get('http://localhost:8080').raise_for_status()" || exit 1

# Startup command
CMD ["python", "run_spider.py"]
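
The CMD above hands control to run_spider.py, which this chapter references repeatedly but never lists. Below is a minimal sketch of what such a launcher could look like, assuming a standard Scrapy project module named myproject and a spider registered as "example"; both names are illustrative and should be adapted to your project.

# run_spider.py - minimal launcher sketch (project and spider names are assumptions)
import logging

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

logging.basicConfig(level=logging.INFO)

def main():
    # get_project_settings() reads SCRAPY_SETTINGS_MODULE,
    # which the Dockerfile sets to myproject.settings
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    # "example" is a placeholder spider name
    process.crawl("example")
    process.start()  # blocks until the crawl finishes

if __name__ == "__main__":
    main()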

Optimized Dockerfile

# Dockerfile.optimized - optimized image
FROM python:3.9-slim

# Build arguments
ARG BUILD_DATE
ARG VCS_REF
ARG VERSION="1.0.0"

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PYTHONPATH=/app \
    SCRAPY_SETTINGS_MODULE=myproject.settings

# Set the working directory
WORKDIR /app

# Install system dependencies (one command keeps the layer count down)
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    g++ \
    libxml2-dev \
    libxslt1-dev \
    libffi-dev \
    libssl-dev \
    libpng-dev \
    libjpeg-dev \
    zlib1g-dev \
    wget \
    curl \
    git \
    ca-certificates \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Copy the requirements file and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt && \
    pip cache purge

# Create a non-root user (useradd options must come before the user name)
RUN groupadd -r appuser && useradd -r -g appuser -u 1000 appuser
RUN chown -R appuser:appuser /app
USER appuser

# Copy the application code
COPY --chown=appuser:appuser . .

# Set file permissions
RUN chmod +x run_spider.py entrypoint.sh

# Create required directories
RUN mkdir -p /app/logs /app/data /app/cache

# Labels
LABEL org.label-schema.build-date=$BUILD_DATE \
      org.label-schema.vcs-ref=$VCS_REF \
      org.label-schema.version=$VERSION \
      org.label-schema.schema-version="1.0"

# Expose the port
EXPOSE 8080

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD python -c "import requests; requests.get('http://localhost:8080').raise_for_status()" || exit 1

# Entrypoint script
ENTRYPOINT ["/app/entrypoint.sh"]
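
The ENTRYPOINT references an entrypoint.sh that is not shown anywhere in this chapter. Its usual job is to wait for backing services and then hand off to the spider process. A sketch of that logic in Python follows; it could be saved as entrypoint.py and wired in with ENTRYPOINT ["python", "entrypoint.py"]. The REDIS_URL variable matches the compose files later in this chapter, while the 30-attempt budget is an arbitrary assumption.

# entrypoint.py - startup gate sketch (a stand-in for the unshown entrypoint.sh)
import os
import socket
import sys
import time
from urllib.parse import urlparse

def wait_for(host: str, port: int, attempts: int = 30) -> None:
    """Block until a TCP port accepts connections, or give up."""
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=2):
                return
        except OSError:
            time.sleep(1)
    sys.exit(f"timed out waiting for {host}:{port}")

if __name__ == "__main__":
    redis_url = urlparse(os.environ.get("REDIS_URL", "redis://redis:6379"))
    wait_for(redis_url.hostname or "redis", redis_url.port or 6379)
    # Replace this process with the spider so it receives SIGTERM directly
    os.execvp("python", ["python", "run_spider.py"])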

Dedicated Scrapy Image

# Dockerfile.scrapy - dedicated Scrapy image
FROM python:3.9-slim

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    g++ \
    libxml2-dev \
    libxslt1-dev \
    libffi-dev \
    libssl-dev \
    libpng-dev \
    libjpeg-dev \
    zlib1g-dev \
    libgumbo-dev \
    libevent-dev \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir \
    scrapy \
    twisted[tls,http2] \
    cryptography \
    pyopenssl \
    service_identity \
    cssselect \
    lxml \
    && pip cache purge

# Create the working directory
WORKDIR /app

# Create a non-root user
RUN groupadd -r scrapy && useradd -r -g scrapy -u 1000 scrapy
USER scrapy

# Expose the port
EXPOSE 8080

# Default command
CMD ["scrapy", "--version"]

Multi-Stage Build Optimization

Multi-Stage Build Example

# Dockerfile.multi-stage - multi-stage build
# Build stage
FROM python:3.9-slim AS builder

# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    g++ \
    libxml2-dev \
    libxslt1-dev \
    libffi-dev \
    libssl-dev \
    && rm -rf /var/lib/apt/lists/*

# Set the working directory
WORKDIR /app

# Copy the dependency manifest and install
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Runtime stage
FROM python:3.9-slim

# Install runtime dependencies only
RUN apt-get update && apt-get install -y --no-install-recommends \
    libxml2 \
    libxslt1.1 \
    libffi7 \
    libssl1.1 \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Create the non-root user first so the copied packages can be owned by it
RUN groupadd -r appuser && useradd -r -g appuser -u 1000 -m appuser

# Copy the Python packages built in the previous stage
# (leaving them under /root/.local would make them unreadable to appuser)
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local

# Set environment variables
ENV PATH=/home/appuser/.local/bin:$PATH \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

# Create the working directory
WORKDIR /app
USER appuser

# Copy the application code
COPY --chown=appuser:appuser . .

# Expose the port
EXPOSE 8080

# Startup command
CMD ["python", "run_spider.py"]

Environment-Specific Builds

# Dockerfile.env-specific - environment-specific build
ARG TARGET_ENV=production
FROM python:3.9-slim AS base

# Base system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    g++ \
    libxml2-dev \
    libxslt1-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

# Development stage
FROM base AS development
RUN apt-get update && apt-get install -y --no-install-recommends \
    vim \
    curl \
    && rm -rf /var/lib/apt/lists/*
COPY requirements-dev.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Production stage
FROM base AS production
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Final image (selected via --build-arg TARGET_ENV=development|production)
FROM ${TARGET_ENV}
COPY . .
USER 1000
EXPOSE 8080
CMD ["python", "run_spider.py"]

Docker Compose Orchestration

Basic Compose Configuration

# docker-compose.yml - basic orchestration
version: '3.8'

services:
  # Redis cache
  redis:
    image: redis:7-alpine
    container_name: scrapy-redis
    restart: unless-stopped
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    command: redis-server --appendonly yes
    networks:
      - scrapy-network

  # MongoDB storage
  mongodb:
    image: mongo:6
    container_name: scrapy-mongodb
    restart: unless-stopped
    ports:
      - "27017:27017"
    environment:
      MONGO_INITDB_ROOT_USERNAME: scrapy
      MONGO_INITDB_ROOT_PASSWORD: scrapy123
    volumes:
      - mongodb_data:/data/db
    networks:
      - scrapy-network

  # Scrapy spider
  # Note: no container_name here - a fixed name conflicts with replicas
  spider:
    build: .
    restart: unless-stopped
    depends_on:
      - redis
      - mongodb
    environment:
      - REDIS_URL=redis://redis:6379
      - MONGODB_URI=mongodb://scrapy:scrapy123@mongodb:27017
      - SCRAPY_SETTINGS_MODULE=myproject.settings
    volumes:
      - ./logs:/app/logs
      - ./data:/app/data
    networks:
      - scrapy-network
    deploy:
      replicas: 2  # two spider instances (honored by Swarm; otherwise use --scale)

  # Monitoring dashboard
  grafana:
    image: grafana/grafana-enterprise
    container_name: scrapy-grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    networks:
      - scrapy-network
    depends_on:
      - spider

volumes:
  redis_data:
  mongodb_data:
  grafana_data:

networks:
  scrapy-network:
    driver: bridge
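
One detail worth noting: the compose file only injects REDIS_URL, MONGODB_URI, and the tuning values as environment variables; Scrapy does not pick those up by itself, so the project settings have to read them explicitly. A minimal sketch of that glue, assuming the myproject layout used throughout this chapter:

# myproject/settings.py - reading container configuration from the environment
import os

BOT_NAME = "myproject"
SPIDER_MODULES = ["myproject.spiders"]

# Connection strings injected by docker-compose / Kubernetes
REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379")
MONGODB_URI = os.environ.get("MONGODB_URI", "mongodb://localhost:27017")

# Tuning values with safe defaults when the variables are absent
CONCURRENT_REQUESTS = int(os.environ.get("CONCURRENT_REQUESTS", 16))
DOWNLOAD_DELAY = float(os.environ.get("DOWNLOAD_DELAY", 0))
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")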

Advanced Compose Configuration

# docker-compose.advanced.yml - advanced orchestration
version: '3.8'

services:
  # Load balancer
  nginx:
    image: nginx:alpine
    container_name: scrapy-nginx
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - ./ssl:/etc/nginx/ssl
    networks:
      - scrapy-network
    depends_on:
      - spider-cluster

  # Spider cluster
  spider-cluster:
    build: .
    restart: unless-stopped
    depends_on:
      - redis
      - postgres
    environment:
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://scrapy:scrapy123@postgres:5432/scrapy_db
      - LOG_LEVEL=INFO
      - CONCURRENT_REQUESTS=16
    volumes:
      - spider_logs:/app/logs
      - spider_data:/app/data
    networks:
      - scrapy-network
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: '0.5'
          memory: 512M
        reservations:
          cpus: '0.25'
          memory: 256M

  # Redis in cluster mode
  # Note: three replicas of one service do not form a Redis Cluster by
  # themselves; the nodes still have to be joined (e.g. redis-cli --cluster create)
  redis:
    image: redis:7-alpine
    restart: unless-stopped
    command: redis-server --appendonly yes --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --appendfilename appendonly.aof --appendfsync always
    volumes:
      - redis_cluster_data:/data
    networks:
      - scrapy-network
    deploy:
      replicas: 3

  # PostgreSQL database
  postgres:
    image: postgres:15
    restart: unless-stopped
    environment:
      POSTGRES_DB: scrapy_db
      POSTGRES_USER: scrapy
      POSTGRES_PASSWORD: scrapy123
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
    networks:
      - scrapy-network
    deploy:
      resources:
        limits:
          memory: 1G
        reservations:
          memory: 512M

  # Message queue
  rabbitmq:
    image: rabbitmq:3-management
    restart: unless-stopped
    environment:
      RABBITMQ_DEFAULT_USER: scrapy
      RABBITMQ_DEFAULT_PASS: scrapy123
    ports:
      - "15672:15672"  # management UI
      - "5672:5672"    # AMQP port
    volumes:
      - rabbitmq_data:/var/lib/rabbitmq
    networks:
      - scrapy-network

  # Monitoring stack
  prometheus:
    image: prom/prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    networks:
      - scrapy-network

  grafana:
    image: grafana/grafana-enterprise
    restart: unless-stopped
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    networks:
      - scrapy-network
    depends_on:
      - prometheus

volumes:
  spider_logs:
  spider_data:
  redis_cluster_data:
  postgres_data:
  rabbitmq_data:
  prometheus_data:
  grafana_data:

networks:
  # overlay networks require Docker Swarm mode
  scrapy-network:
    driver: overlay
    attachable: true

Environment-Specific Configuration

# docker-compose.override.yml - development overrides
version: '3.8'

services:
  spider:
    build:
      context: .
      target: development
    volumes:
      - .:/app  # live code reload
      - /app/__pycache__  # exclude the bytecode cache
    environment:
      - LOG_LEVEL=DEBUG
      - DEBUG=true
    ports:
      - "8080:8080"  # expose the port for debugging
    stdin_open: true  # pdb needs an interactive stdin/tty
    tty: true
    command: ["python", "-m", "pdb", "run_spider.py"]

# docker-compose.prod.yml - production configuration
version: '3.8'

services:
  spider:
    image: mycompany/scrapy-spider:latest
    environment:
      - LOG_LEVEL=WARNING
      - DEBUG=false
    security_opt:
      - no-new-privileges:true
    read_only: true
    tmpfs:
      - /tmp
      - /var/run
    ulimits:
      nproc: 65535
      nofile:
        soft: 20000
        hard: 40000

Production Deployment

Deployment Script

#!/bin/bash
# deploy.sh - production deployment script

set -e  # exit immediately on error

# Configuration
IMAGE_NAME="scrapy-spider"
REGISTRY="myregistry.com"
TAG="latest"
ENV_FILE="./.env.production"

echo "🚀 Starting production deployment..."

# Check that Docker is running
if ! docker info >/dev/null 2>&1; then
    echo "❌ Docker is not running. Please start Docker first."
    exit 1
fi

# Log in to the private registry
if [ -n "$DOCKER_REGISTRY" ]; then
    echo "🔐 Logging into Docker registry..."
    echo "$DOCKER_PASSWORD" | docker login -u "$DOCKER_USERNAME" --password-stdin "$DOCKER_REGISTRY"
fi

# Build the image
echo "🔨 Building Docker image..."
docker build -t "$IMAGE_NAME:$TAG" .

# Tag the image
echo "🏷️ Tagging image..."
docker tag "$IMAGE_NAME:$TAG" "$REGISTRY/$IMAGE_NAME:$TAG"

# Push the image
echo "📤 Pushing image to registry..."
docker push "$REGISTRY/$IMAGE_NAME:$TAG"

# Stop existing containers
echo "🛑 Stopping existing containers..."
docker-compose --env-file "$ENV_FILE" -f docker-compose.prod.yml down --remove-orphans || true

# Pull the latest image
echo "📥 Pulling latest image..."
docker-compose --env-file "$ENV_FILE" -f docker-compose.prod.yml pull

# Start the services
echo "▶️ Starting services..."
docker-compose --env-file "$ENV_FILE" -f docker-compose.prod.yml up -d

# Wait for the services to come up
echo "⏳ Waiting for services to start..."
sleep 10

# Verify service status
echo "🔍 Verifying service status..."
if docker-compose --env-file "$ENV_FILE" -f docker-compose.prod.yml ps | grep -q "Up"; then
    echo "✅ Deployment successful!"
else
    echo "❌ Deployment failed!"
    docker-compose --env-file "$ENV_FILE" -f docker-compose.prod.yml logs
    exit 1
fi

# Clean up old images
echo "🧹 Cleaning up old images..."
docker image prune -f

echo "🎉 Production deployment completed successfully!"

Kubernetes Deployment Configuration

# k8s-scrapy.yaml - Kubernetes deployment manifests
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scrapy-spider
  labels:
    app: scrapy-spider
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scrapy-spider
  template:
    metadata:
      labels:
        app: scrapy-spider
    spec:
      containers:
      - name: spider
        image: mycompany/scrapy-spider:latest
        ports:
        - containerPort: 8080
        env:
        - name: REDIS_URL
          value: "redis://redis-service:6379"
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-secret
              key: database-url
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: scrapy-service
spec:
  selector:
    app: scrapy-spider
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  type: LoadBalancer
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scrapy-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: scrapy-spider
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
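
The liveness and readiness probes above (and the container health checks earlier) assume the spider process serves /health and /ready over HTTP, which Scrapy does not provide by itself. One lightweight way to satisfy them is a tiny standard-library HTTP server on a daemon thread; this is a sketch, not part of the original project:

# health.py - minimal probe endpoint sketch (assumed, not built into Scrapy)
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer both the liveness (/health) and readiness (/ready) probes
        if self.path in ("/health", "/ready", "/"):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep probe chatter out of the spider logs

def start_probe_server(port: int = 8080) -> None:
    """Serve probes on a daemon thread so the crawl is not blocked."""
    server = HTTPServer(("0.0.0.0", port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()

# Call start_probe_server() from run_spider.py before process.start()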

Networking and Storage Configuration

Network Security Configuration

# network-config.yml - network security configuration
version: '3.8'

services:
  # Spider service (internal access only)
  spider-internal:
    build: .
    networks:
      - internal
    environment:
      - ALLOWED_DOMAINS=internal-only.com

  # API gateway (exposed externally)
  api-gateway:
    image: nginx:alpine
    ports:
      - "80:80"
    networks:
      - internal
      - external
    volumes:
      - ./gateway.conf:/etc/nginx/nginx.conf

  # Monitoring service
  monitoring:
    image: grafana/grafana
    networks:
      - internal
    ports:
      - "3000:3000"  # restrict to specific IPs, e.g. "127.0.0.1:3000:3000"

networks:
  # Internal network: "internal: true" blocks external access
  internal:
    driver: bridge
    internal: true
  external:
    driver: bridge

Storage Configuration

# storage-config.yml - storage configuration
version: '3.8'

services:
  # Persistent storage
  spider-data:
    build: .
    volumes:
      # Local persistent data
      - scrapy-data:/app/data
      # Persistent logs
      - scrapy-logs:/app/logs
      # Configuration files (read-only)
      - ./config:/app/config:ro
      # SSL certificates (read-only)
      - ./ssl:/app/ssl:ro
    
    # Ephemeral storage
    tmpfs:
      - /tmp:size=100M,mode=1777
      - /app/temp:size=50M,mode=1777

volumes:
  scrapy-data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /host/data/scrapy
  scrapy-logs:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /host/logs/scrapy

Security Configuration and Permission Management

Security Baseline Configuration

# Dockerfile.security - security baseline
FROM python:3.9-slim

# Create a dedicated user and group
RUN groupadd -r scrapy --gid=1001 && \
    useradd -r -g scrapy --uid=1001 -d /app scrapy

# Install only the minimal system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    libxml2 \
    libxslt1.1 \
    libffi7 \
    libssl1.1 \
    ca-certificates \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Set the working directory and hand it to the unprivileged user,
# so pip install --user can write to $HOME/.local (HOME is /app here)
WORKDIR /app
RUN chown scrapy:scrapy /app

# Copy and install Python dependencies as the unprivileged user
COPY --chown=scrapy:scrapy requirements.txt .
USER scrapy
RUN pip install --user --no-cache-dir -r requirements.txt

# Copy the application code (already owned by scrapy via --chown)
COPY --chown=scrapy:scrapy . .

# Read-only root file system
# Note: enforced at runtime with the --read-only flag (or read_only: true)
CMD ["python", "run_spider.py"]

Runtime Security Configuration

# security-compose.yml - runtime security configuration
version: '3.8'

services:
  secure-spider:
    build:
      context: .
      dockerfile: Dockerfile.security
    security_opt:
      - no-new-privileges:true
      - label=type:container_runtime_t
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
      - SETUID
      - SETGID
    read_only: true
    tmpfs:
      - /tmp
      - /run
      - /var/run
    user: "1001:1001"
    privileged: false
    devices: []
    shm_size: 64m
    ulimits:
      nproc: 1024
      nofile:
        soft: 1024
        hard: 2048
    environment:
      - PYTHONPATH=/app
      - HOME=/app
    volumes:
      - spider-data:/app/data:rw
      - spider-logs:/app/logs:rw
      - /etc/passwd:/etc/passwd:ro
      - /etc/group:/etc/group:ro

volumes:
  spider-data:
  spider-logs:

CI/CD Integration

GitHub Actions Configuration

# .github/workflows/docker.yml - Docker CI/CD workflow
name: Docker Build and Push

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.9'
    
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
        pip install pytest pytest-cov
    
    - name: Run tests
      run: |
        pytest tests/ -v --cov=scrapy_project
    
    - name: Run linting
      run: |
        pip install flake8
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics

  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    
    steps:
    - name: Checkout repository
      uses: actions/checkout@v3
    
    - name: Log in to Container Registry
      uses: docker/login-action@v2
      with:
        registry: ${{ env.REGISTRY }}
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    
    - name: Extract metadata
      id: meta
      uses: docker/metadata-action@v4
      with:
        images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
        tags: |
          type=ref,event=branch
          type=ref,event=pr
          type=sha,prefix={{branch}}-
          type=raw,value=latest,enable={{is_default_branch}}
    
    - name: Build and push Docker image
      uses: docker/build-push-action@v4
      with:
        context: .
        platforms: linux/amd64,linux/arm64
        push: true
        tags: ${{ steps.meta.outputs.tags }}
        labels: ${{ steps.meta.outputs.labels }}
        cache-from: type=gha
        cache-to: type=gha,mode=max

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    
    steps:
    - name: Deploy to production
      run: |
        echo "Deploying to production..."
        # Add your deployment commands here
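
The test job above runs pytest against a tests/ directory that this chapter does not include. As a placeholder, a smoke test that at least verifies the project settings load and the spider modules import might look like the following sketch (module names follow the myproject convention used earlier):

# tests/test_smoke.py - minimal smoke tests sketch (names are illustrative)
import importlib

from scrapy.utils.project import get_project_settings

def test_settings_load():
    # SCRAPY_SETTINGS_MODULE must point at myproject.settings (see Dockerfile)
    settings = get_project_settings()
    assert settings.get("BOT_NAME")  # a configured project always has a bot name

def test_spider_modules_importable():
    settings = get_project_settings()
    for module in settings.getlist("SPIDER_MODULES"):
        importlib.import_module(module)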

Jenkins Pipeline Configuration

// Jenkinsfile - Jenkins pipeline configuration
pipeline {
    agent any
    
    environment {
        DOCKER_IMAGE = 'scrapy-spider'
        DOCKER_REGISTRY = 'myregistry.com'
        KUBE_NAMESPACE = 'scrapy-prod'
    }
    
    stages {
        stage('Checkout') {
            steps {
                checkout scm
            }
        }
        
        stage('Test') {
            steps {
                script {
                    sh '''
                        python -m pip install --upgrade pip
                        pip install -r requirements.txt
                        pip install pytest
                        pytest tests/ -v
                    '''
                }
            }
        }
        
        stage('Security Scan') {
            steps {
                script {
                    sh '''
                        # Run security scans
                        pip install bandit safety
                        bandit -r . -f json -o bandit-report.json
                        safety check --json > safety-report.json
                    '''
                }
            }
            post {
                always {
                    publishHTML([
                        allowMissing: false,
                        alwaysLinkToLastBuild: true,
                        keepAll: true,
                        reportDir: '.',
                        reportFiles: 'bandit-report.json,safety-report.json',
                        reportName: 'Security Reports'
                    ])
                }
            }
        }
        
        stage('Build Docker Image') {
            steps {
                script {
                    def image = docker.build("${DOCKER_REGISTRY}/${DOCKER_IMAGE}:${env.BUILD_NUMBER}")
                    
                    sh """
                        docker tag ${DOCKER_REGISTRY}/${DOCKER_IMAGE}:${env.BUILD_NUMBER} ${DOCKER_REGISTRY}/${DOCKER_IMAGE}:latest
                    """
                }
            }
        }
        
        stage('Push to Registry') {
            steps {
                script {
                    docker.withRegistry("https://${DOCKER_REGISTRY}", 'docker-registry-credentials') {
                        sh """
                            docker push ${DOCKER_REGISTRY}/${DOCKER_IMAGE}:${env.BUILD_NUMBER}
                            docker push ${DOCKER_REGISTRY}/${DOCKER_IMAGE}:latest
                        """
                    }
                }
            }
        }
        
        stage('Deploy to Kubernetes') {
            when {
                branch 'main'
            }
            steps {
                script {
                    sh """
                        kubectl set image deployment/scrapy-spider spider=${DOCKER_REGISTRY}/${DOCKER_IMAGE}:${env.BUILD_NUMBER} -n ${KUBE_NAMESPACE}
                        kubectl rollout status deployment/scrapy-spider -n ${KUBE_NAMESPACE}
                    """
                }
            }
        }
    }
    
    post {
        always {
            cleanWs()
        }
        success {
            echo '✅ Pipeline completed successfully!'
        }
        failure {
            echo '❌ Pipeline failed!'
        }
    }
}

Performance Optimization and Monitoring

Performance Monitoring Configuration

# monitoring.py - performance monitoring script
import docker
import time
from prometheus_client import start_http_server, Gauge, Histogram
import logging

# Prometheus metric definitions
CONTAINER_CPU_USAGE = Gauge('container_cpu_usage_percent', 'Container CPU usage', ['container_name'])
CONTAINER_MEMORY_USAGE = Gauge('container_memory_usage_bytes', 'Container memory usage', ['container_name'])
# Docker reports cumulative byte totals, so export them as Gauges;
# incrementing a Counter by the running total every cycle would double-count.
CONTAINER_NETWORK_RX = Gauge('container_network_receive_bytes_total', 'Container network receive bytes', ['container_name'])
CONTAINER_NETWORK_TX = Gauge('container_network_transmit_bytes_total', 'Container network transmit bytes', ['container_name'])
SPIDER_REQUEST_DURATION = Histogram('spider_request_duration_seconds', 'Spider request duration')

class DockerMonitor:
    """
    Docker container monitor
    """
    
    def __init__(self, interval=10):
        self.client = docker.from_env()
        self.interval = interval
        self.logger = logging.getLogger(__name__)
    
    def collect_metrics(self):
        """
        Collect container metrics
        """
        try:
            containers = self.client.containers.list()
            
            for container in containers:
                try:
                    # Fetch a single stats snapshot
                    stats = container.stats(stream=False)
                    
                    # CPU usage
                    cpu_delta = stats['cpu_stats']['cpu_usage']['total_usage'] - \
                               stats['precpu_stats']['cpu_usage']['total_usage']
                    system_delta = stats['cpu_stats']['system_cpu_usage'] - \
                                  stats['precpu_stats']['system_cpu_usage']
                    
                    if system_delta > 0:
                        # 'percpu_usage' is absent on cgroup v2 hosts; prefer 'online_cpus'
                        num_cpus = stats['cpu_stats'].get('online_cpus') or \
                                   len(stats['cpu_stats']['cpu_usage'].get('percpu_usage', [])) or 1
                        cpu_percent = (cpu_delta / system_delta) * num_cpus * 100
                        CONTAINER_CPU_USAGE.labels(container_name=container.name).set(cpu_percent)
                    
                    # Memory usage
                    mem_usage = stats['memory_stats']['usage']
                    CONTAINER_MEMORY_USAGE.labels(container_name=container.name).set(mem_usage)
                    
                    # Network traffic (sum the cumulative totals across interfaces)
                    if 'networks' in stats:
                        rx_total = sum(n['rx_bytes'] for n in stats['networks'].values())
                        tx_total = sum(n['tx_bytes'] for n in stats['networks'].values())
                        CONTAINER_NETWORK_RX.labels(container_name=container.name).set(rx_total)
                        CONTAINER_NETWORK_TX.labels(container_name=container.name).set(tx_total)
                
                except Exception as e:
                    self.logger.error(f"Error collecting metrics for {container.name}: {e}")
        
        except Exception as e:
            self.logger.error(f"Error collecting container metrics: {e}")
    
    def start_monitoring(self):
        """
        Start the monitoring loop
        """
        self.logger.info("Starting Docker monitoring...")
        
        while True:
            try:
                self.collect_metrics()
                time.sleep(self.interval)
            except KeyboardInterrupt:
                self.logger.info("Monitoring stopped.")
                break
            except Exception as e:
                self.logger.error(f"Monitoring error: {e}")
                time.sleep(self.interval)

if __name__ == "__main__":
    # Start the Prometheus metrics server
    start_http_server(8000)
    
    # Start monitoring
    monitor = DockerMonitor(interval=5)
    monitor.start_monitoring()

Resource Optimization Configuration

# optimization-compose.yml - resource optimization
version: '3.8'

services:
  optimized-spider:
    build: .
    # Resource limits
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 1024M
        reservations:
          cpus: '0.5'
          memory: 512M
    
    # Restart policy
    restart: unless-stopped
    
    # Health check
    healthcheck:
      test: ["CMD", "python", "-c", "import requests; requests.get('http://localhost:8080/health').raise_for_status()"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    
    # Environment variable tuning
    environment:
      # Python tuning
      - PYTHONUNBUFFERED=1
      - PYTHONDONTWRITEBYTECODE=1
      # Scrapy tuning (read by the project settings, see the sketch earlier)
      - CONCURRENT_REQUESTS=16
      - DOWNLOAD_DELAY=1
      - RANDOMIZE_DOWNLOAD_DELAY=0.5
      # Memory tuning - CPython does not read this variable by itself;
      # it must be applied in code with gc.set_threshold() (see the sketch below)
      - PYTHON_GC_THRESHOLD=700,10,10
      # Logging
      - LOG_LEVEL=INFO
    
    # Storage tuning
    volumes:
      # Use fast storage
      - type: volume
        source: spider-data
        target: /app/data
        volume:
          nocopy: true
      # Keep temporary files on tmpfs
      - type: tmpfs
        target: /tmp
        tmpfs:
          size: 100M
          mode: 1777
    
    # Network tuning
    networks:
      - optimized
    dns:
      - 8.8.8.8
      - 8.8.4.4

networks:
  optimized:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/16

volumes:
  spider-data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /fast-ssd/scrapy-data
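
As noted in the comment above, PYTHON_GC_THRESHOLD is not an environment variable CPython honors on its own; it only takes effect if the application reads it and calls gc.set_threshold(). A small sketch of that glue, which could run at the top of run_spider.py:

# gc_tuning.py - apply PYTHON_GC_THRESHOLD manually (CPython ignores the variable)
import gc
import os

def apply_gc_threshold(default: str = "700,10,10") -> None:
    """Parse 'gen0,gen1,gen2' from the environment and hand it to the GC."""
    raw = os.environ.get("PYTHON_GC_THRESHOLD", default)
    try:
        gen0, gen1, gen2 = (int(part) for part in raw.split(","))
    except ValueError:
        return  # malformed value: keep the interpreter defaults
    gc.set_threshold(gen0, gen1, gen2)

apply_gc_threshold()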

Troubleshooting and Maintenance

Diagnostic Script

#!/bin/bash
# troubleshoot.sh - Docker container troubleshooting script

echo "🔍 Starting Docker container troubleshooting..."

# Check the Docker service status
echo "📋 Checking Docker service status..."
if ! systemctl is-active --quiet docker; then
    echo "❌ Docker service is not running"
    sudo systemctl start docker
else
    echo "✅ Docker service is running"
fi

# Check disk space
echo "💾 Checking disk space..."
df -h | grep -E '(overlay|/dev)' | while read line; do
    usage=$(echo $line | awk '{print $5}' | sed 's/%//')
    if [ "$usage" -gt 80 ]; then
        echo "⚠️ High disk usage: $line"
    fi
done

# Check container status
echo "🐳 Checking container status..."
docker ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}" | while read line; do
    if [[ $line =~ Exited|Restarting ]]; then
        echo "⚠️ Container issue: $line"
    fi
done

# Check logs for errors
echo "📝 Checking container logs for errors..."
for container in $(docker ps -aq); do
    if [ -n "$(docker logs $container 2>&1 | grep -i error)" ]; then
        echo "⚠️ Errors found in container: $(docker inspect --format='{{.Name}}' $container | sed 's/\///')"
        docker logs $container --tail 50 2>&1 | grep -i error
    fi
done

# Check Docker images
echo "📦 Checking Docker images..."
if [ $(docker images -q | wc -l) -gt 50 ]; then
    echo "⚠️ Too many Docker images, consider cleaning up"
    echo "Run: docker image prune -a"
fi

# Check Docker networks
# Iterate over network names only; piping whole `docker network ls`
# rows into `docker network inspect` would fail
echo "🌐 Checking Docker networks..."
docker network ls --format '{{.Name}}' | grep -vE '^(bridge|host|none)$' | while read network; do
    if [ "$(docker network inspect -f '{{len .Containers}}' "$network")" -eq 0 ]; then
        echo "⚠️ Unused network detected: $network"
    fi
done

# Performance check
echo "⚡ Checking container performance..."
docker stats --no-stream --format '{{.Name}} {{.CPUPerc}}' | while read name cpu; do
    cpu_usage=${cpu%.*}      # drop the decimals
    cpu_usage=${cpu_usage%\%}  # drop a trailing % if present
    if [ "${cpu_usage:-0}" -gt 80 ]; then
        echo "⚠️ High CPU usage: $name $cpu"
    fi
done

echo "✅ Troubleshooting completed!"

Maintenance Script

#!/bin/bash
# maintenance.sh - Docker container maintenance script

# Configuration
RETENTION_DAYS=30
LOG_RETENTION_DAYS=7

echo "🔧 Starting Docker maintenance tasks..."

# Remove stopped containers
echo "🗑️ Cleaning up stopped containers..."
docker container prune -f

# Remove unused images older than the retention window
echo "🗑️ Cleaning up unused images..."
docker image prune -f --filter "until=$((RETENTION_DAYS * 24))h"

# Remove unused volumes
echo "🗑️ Cleaning up unused volumes..."
docker volume prune -f

# Remove unused networks
echo "🗑️ Cleaning up unused networks..."
docker network prune -f

# Clear the build cache
echo "🗑️ Cleaning up build cache..."
docker builder prune -f

# Rotate oversized container logs
echo "📝 Rotating container logs..."
for container in $(docker ps -q); do
    log_file=$(docker inspect --format='{{.LogPath}}' $container)
    if [ -f "$log_file" ]; then
        size=$(stat -c%s "$log_file")
        if [ "$size" -gt 104857600 ]; then  # 100MB
            echo "Rotating large log: $log_file"
            mv "$log_file" "${log_file}.old"
            docker restart $container > /dev/null 2>&1
        fi
    fi
done

# Delete old log files
find /var/lib/docker/containers -name "*.log" -mtime +$LOG_RETENTION_DAYS -delete

# Check disk usage
echo "💾 Checking disk usage..."
docker system df

# Final system-wide prune
echo "🔄 Pruning the Docker system..."
docker system prune -f

echo "✅ Maintenance tasks completed!"

Best Practices Summary

Dockerfile Best Practices

  1. Use an official base image

    FROM python:3.9-slim  # official slim image
  2. Multi-stage builds

    FROM python:3.9-slim AS builder
    # build dependencies
    FROM python:3.9-slim AS runtime
    # runtime dependencies
  3. Non-root user

    RUN useradd -r -s /bin/false appuser
    USER appuser
  4. Layer cache optimization

    COPY requirements.txt .
    RUN pip install -r requirements.txt
    COPY . .  # copy code only after installing dependencies

Deployment Best Practices

  1. Environment variable management

    environment:
      - ENVIRONMENT=production
      - LOG_LEVEL=INFO
  2. Health checks

    HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
        CMD python -c "import requests; requests.get('http://localhost:8080/health').raise_for_status()"
  3. Resource limits

    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 512M
  4. Security settings

    security_opt:
      - no-new-privileges:true
    read_only: true

Monitoring Best Practices

  1. Metric collection

    • CPU/memory utilization
    • Network I/O
    • Disk I/O
    • Application performance metrics
  2. Log management

    • Structured logging
    • Log rotation
    • Centralized collection
  3. Alerting

    • Resource usage alerts
    • Service availability alerts
    • Performance degradation alerts

💡 Key takeaway: Docker containerization is the standard approach to deploying modern crawlers. With sound configuration, thorough monitoring, and disciplined operational practice, spiders can run efficiently and stably.
