Scraper Workers (Celery)
The Workers are the core execution units responsible for performing the actual web scraping tasks.
```mermaid
graph TD
A[Receive Job Message] --> B{Parse Message};
B --> C{Requires Playwright?};
C -- Yes --> D[Run Playwright Fetch Logic];
C -- No --> E[Run Requests Fetch Logic];
D --> F{Fetch Success?};
E --> F;
F -- Yes --> G[Compress HTML];
G --> H[Upload HTML to S3];
H --> I[Log Success / Report Stats];
I --> J[Acknowledge Queue Message];
F -- No (After Retries) --> K[Log Error / Report Stats];
K --> J;
style J fill:#cfc,stroke:#333,stroke-width:2px
```
Role & Purpose
- Consume job messages from the Message Queue.
- Interpret the job message to understand the target URL and required actions/parameters.
- Execute the data fetching logic using either lightweight HTTP clients (`requests`) or browser automation (Playwright), as specified.
- Implement politeness delays, proxy rotation, and User-Agent rotation during fetching.
- Handle fetcher-level errors and retries.
- Process pagination or dynamic content loading logic.
- Upon successful fetching, store the raw HTML content into the S3 Bucket.
- Report task status (success, failure, retry needed) back via Celery and potentially to the Monitoring system.
- Acknowledge messages from the queue upon definitive completion or failure.
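For illustration, a job message consumed by a worker might look like the sketch below; every field name except `requires_playwright` (which appears later in this section) is an assumption:

```python
# Hypothetical job message, JSON-decoded from the Message Queue.
job_message = {
    "job_id": "a1b2c3",                 # assumed: unique ID, reused as the S3 key
    "url": "https://example.com/page",  # target URL to fetch
    "requires_playwright": False,       # browser automation vs. plain requests
}
```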
Technology Choice: Celery
- Rationale: Celery is the standard, feature-rich distributed task queue framework for Python.
- Benefits:
- Integrates seamlessly with message brokers (RabbitMQ, Redis, SQS).
- Handles worker management, task distribution, and concurrency.
- Provides built-in support for task retries with backoff.
- Mature and well-documented.
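A minimal configuration sketch illustrating these points; the broker URL and option values are assumptions to adapt, not recommendations:

```python
from celery import Celery

# Works equally with RabbitMQ ("amqp://..."), Redis ("redis://..."), or SQS.
app = Celery("scraper", broker="amqp://guest@localhost//")

app.conf.update(
    task_acks_late=True,           # acknowledge only after a task finishes
    worker_prefetch_multiplier=1,  # don't let one worker hoard queued jobs
)

# Built-in retries with exponential backoff can be declared per task:
@app.task(autoretry_for=(ConnectionError,), retry_backoff=True, max_retries=5)
def ping(url: str) -> None:
    ...
```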
Worker Implementation Details
Task Definition
- Define Celery tasks corresponding to the scraping jobs (e.g., a single task type that adapts based on message parameters).
- The task receives the job message content as input.
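A sketch of such a single adaptive task, reusing the hypothetical message fields from above; the bucket name is also an assumption, and `fetch_page` is a stub for the branching logic sketched in the next subsection:

```python
import gzip

import boto3
from celery import Celery

app = Celery("scraper", broker="redis://localhost:6379/0")  # assumed broker
s3 = boto3.client("s3")
BUCKET = "raw-html-bucket"  # hypothetical bucket name


def fetch_page(url: str, requires_playwright: bool) -> str:
    """Stub: the real branching fetch logic is sketched below."""
    raise NotImplementedError


@app.task(bind=True, acks_late=True)
def run_scrape_job(self, job: dict) -> None:
    """One task type that adapts to the parameters in the job message."""
    html = fetch_page(job["url"], job.get("requires_playwright", False))
    # Compress the raw HTML before storing it in the S3 Bucket.
    body = gzip.compress(html.encode("utf-8"))
    s3.put_object(Bucket=BUCKET, Key=f"raw/{job['job_id']}.html.gz", Body=body)
```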
Fetching Logic Integration
- The task code imports and uses the chosen fetching libraries (`requests` with a `Retry` adapter, Playwright).
- Logic branches based on the `requires_playwright` flag in the job message, as sketched below.
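A sketch of that branch, pairing `requests` with a mounted `Retry` adapter and using Playwright's sync API; the timeout, retry counts, and status codes are assumptions:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from playwright.sync_api import sync_playwright


def fetch_with_requests(url: str) -> str:
    """Lightweight path: requests with automatic retries on transient errors."""
    session = requests.Session()
    retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503])
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://", HTTPAdapter(max_retries=retry))
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    return resp.text


def fetch_with_playwright(url: str) -> str:
    """Browser path for pages that need JavaScript rendering."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()


def fetch_page(url: str, requires_playwright: bool) -> str:
    """Branch on the requires_playwright flag from the job message."""
    if requires_playwright:
        return fetch_with_playwright(url)
    return fetch_with_requests(url)
```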
Resource Management
- Concurrency: Configure the number of concurrent tasks each Celery worker process can handle (the `--concurrency` flag). This needs careful tuning, especially when using Playwright, to avoid overwhelming CPU/RAM. Consider using `gevent` or `eventlet` execution pools for I/O-bound `requests` tasks, but stick to `prefork` (the default) or `solo` for CPU/RAM-intensive Playwright tasks.
- Playwright Cleanup: Ensure Playwright browser instances and contexts are closed properly at the end of each task (or reused carefully) to prevent resource leaks; one reuse pattern is sketched after this list.
- Containerization: Run workers inside Docker containers managed by Kubernetes for scaling and resource isolation. Define appropriate CPU/Memory requests and limits in Kubernetes deployments.
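One way to reuse a browser safely under a `prefork` pool: launch a single browser per worker process via Celery's worker signals and close it on shutdown. The signals are real Celery APIs; the reuse pattern itself is an assumption, not the only option:

```python
from celery.signals import worker_process_init, worker_process_shutdown
from playwright.sync_api import sync_playwright

_pw = None
_browser = None


@worker_process_init.connect
def start_browser(**kwargs):
    """Launch one browser per worker process and keep it for reuse."""
    global _pw, _browser
    _pw = sync_playwright().start()
    _browser = _pw.chromium.launch()


@worker_process_shutdown.connect
def stop_browser(**kwargs):
    """Close the browser and stop Playwright so nothing leaks on shutdown."""
    if _browser is not None:
        _browser.close()
    if _pw is not None:
        _pw.stop()
```

Each task would then open its own `_browser.new_context()` and close it in a `finally` block, so per-task browser state never leaks between jobs.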
Scaling
- Scale the number of worker instances (pods in Kubernetes) horizontally based on Message Queue depth or CPU/Memory load, using the Kubernetes HPA and potentially KEDA for queue-based scaling.
Error Handling within Workers
- Implement `try...except` blocks around fetching and processing logic.
- Utilize Celery's `task.retry()` for recoverable errors (see the sketch below).
- Log all errors comprehensively to the central logging system.
- Ensure tasks eventually terminate (success or failure) and acknowledge the message queue correctly.
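A sketch of that pattern, assuming transient network errors are the recoverable case and everything else should fail the task permanently; the broker URL and retry counts are assumptions:

```python
import logging

import requests
from celery import Celery

app = Celery("scraper", broker="redis://localhost:6379/0")  # assumed broker
log = logging.getLogger(__name__)


@app.task(bind=True, acks_late=True, max_retries=3)
def fetch_job(self, job: dict) -> None:
    try:
        resp = requests.get(job["url"], timeout=30)
        resp.raise_for_status()
    except requests.RequestException as exc:
        # Recoverable network error: re-enqueue with exponential backoff.
        # Once max_retries is exhausted, Celery marks the task failed and the
        # message is still acknowledged, so the task always terminates.
        log.warning("job %s fetch failed (%s); retrying", job.get("job_id"), exc)
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)
    # Any other exception propagates: Celery logs it, marks the task FAILED,
    # and (with acks_late) acknowledges the message so it is not redelivered.
```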