Skip to content

Scraper Workers (Celery)

The Workers are the core execution units responsible for performing the actual web scraping tasks.

graph TD
    A[Receive Job Message] --> B{Parse Message};
    B --> C{Requires Playwright?};
    C -- Yes --> D[Run Playwright Fetch Logic];
    C -- No --> E[Run Requests Fetch Logic];
    D --> F{Fetch Success?};
    E --> F;
    F -- Yes --> G[Compress HTML];
    G --> H[Upload HTML to S3];
    H --> I[Log Success / Report Stats];
    I --> J[Acknowledge Queue Message];
    F -- No (After Retries) --> K[Log Error / Report Stats];
    K --> J;

    style J fill:#cfc,stroke:#333,stroke-width:2px

Role & Purpose

  • Consume job messages from the Message Queue.
  • Interpret the job message to understand the target URL and required actions/parameters.
  • Execute the data fetching logic using either lightweight HTTP clients (requests) or browser automation (Playwright) as specified.
  • Implement politeness delays, proxy rotation, and User-Agent rotation during fetching.
  • Handle fetcher-level errors and retries.
  • Process pagination or dynamic content loading logic.
  • Upon successful fetching, store the raw HTML content into the S3 Bucket.
  • Report task status (success, failure, retry needed) back via Celery and potentially to the Monitoring system.
  • Acknowledge messages from the queue upon definitive completion or failure.

Technology Choice: Celery

  • Rationale: Celery is the standard, feature-rich distributed task queue framework for Python.
  • Benefits:
    • Integrates seamlessly with message brokers (RabbitMQ, Redis, SQS).
    • Handles worker management, task distribution, and concurrency.
    • Provides built-in support for task retries with backoff.
    • Mature and well-documented.

Worker Implementation Details

Task Definition

  • Define Celery tasks corresponding to the scraping jobs (e.g., a single task type that adapts based on message parameters).
  • The task receives the job message content as input.

Fetching Logic Integration

  • The task code imports and uses the chosen fetching libraries (requests with Retry adapter, Playwright).
  • Logic branches based on requires_playwright flag in the job message.

Resource Management

  • Concurrency: Configure the number of concurrent tasks each Celery worker process can handle (--concurrency flag). This needs careful tuning, especially when using Playwright, to avoid overwhelming CPU/RAM. Consider using gevent or eventlet execution pools for I/O-bound requests tasks, but stick to prefork (default) or solo for CPU/RAM-intensive Playwright tasks.
  • Playwright Cleanup: Ensure Playwright browser instances and contexts are closed properly at the end of each task (or reused carefully) to prevent resource leaks.
  • Containerization: Run workers inside Docker containers managed by Kubernetes for scaling and resource isolation. Define appropriate CPU/Memory requests and limits in Kubernetes deployments.

Scaling

  • Scale the number of worker instances (pods in Kubernetes) horizontally based on the depth of the Message Queue or CPU/Memory load using Kubernetes HPA and potentially KEDA for queue-based scaling.

Error Handling within Workers

  • Implement try...except blocks around fetching and processing logic.
  • Utilize Celery's task.retry() for recoverable errors.
  • Log all errors comprehensively to the central logging system.
  • Ensure tasks eventually terminate (success or failure) and acknowledge the message queue correctly.