Task Queue

The Task Queue is a message queue: the central communication channel that decouples the Dispatcher from the pool of Scraper Workers.

Role & Purpose

Acts as a buffer holding job messages sent by the Dispatcher.
Distributes job messages to available Scraper Workers.
Enhances system resilience: if workers are down, messages wait in the queue; if the Dispatcher is down, workers continue processing the messages already queued.
Enables independent scaling of Dispatcher and Workers.
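
The buffering and independent-scaling points above can be illustrated with a minimal in-process sketch. It uses Python's standard `queue.Queue` as a stand-in for a real broker; the `dispatch`, `worker`, and `run` names and the `{"url": ...}` message shape are assumptions made for illustration, not part of the system described here.

```python
import queue
import threading


def dispatch(q, job_urls):
    """Dispatcher side: publish one job message per URL, then return."""
    for url in job_urls:
        q.put({"url": url})


def worker(q, results, lock):
    """Worker side: consume messages until the queue is empty."""
    while True:
        try:
            job = q.get_nowait()
        except queue.Empty:
            return
        # Stand-in for the real scraping work: just record the URL.
        with lock:
            results.append(job["url"])
        q.task_done()


def run(job_urls, worker_count):
    q = queue.Queue()
    # No workers are running yet, so every message waits in the queue.
    dispatch(q, job_urls)
    buffered = q.qsize()

    # Workers start later; their number is chosen independently of the dispatcher.
    results, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(q, results, lock))
        for _ in range(worker_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return buffered, sorted(results)
```

Calling `run(["a", "b", "c"], worker_count=2)` buffers three messages before any worker exists and then drains them with two workers. The dispatcher never needs to know how many workers there are.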

sequenceDiagram
    participant D as Dispatcher
    participant Q as Message Queue Broker
    participant W as Celery Worker

    Note over D, W: Worker is idle, waiting for jobs.

    D->>+Q: Publish(Job Message)
    Note over Q: Message stored persistently.
    Q-->>-D: Publish Acknowledged (Optional)

    Note over Q, W: Broker makes message available.
    W->>+Q: Consume()/Get Message
    Note over Q: Message marked as "unacknowledged".
    Q-->>-W: Job Message

    Note over W: Worker begins processing job...
    W->>W: Execute Scraping Task (Fetch, Parse etc.)

    alt Task Successful
        Note over W: Job completed successfully.
        W->>+Q: Acknowledge(Message)
        Note over Q: Message permanently removed.
        Q-->>-W: Ack Confirmed (Optional)
    else Task Failed (After Retries)
        Note over W: Job failed after all retries.
        W->>+Q: Negative Acknowledge / Route to DLQ
        Note over Q: Message moved to Dead-Letter Queue or discarded.
        Q-->>-W: Nack/Route Confirmed (Optional)
    else Worker Crashes Mid-Task
        Note over W: Worker crashes before acknowledging.
        Note over Q: Message remains "unacknowledged".<br/>After the visibility timeout (SQS, Redis)<br/>or when the worker's connection closes (RabbitMQ),<br/>Broker makes message available again<br/>for another worker.
    end
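
The message lifecycle in the diagram can be sketched as a toy in-memory broker. This is an illustration under stated assumptions, not how any particular broker is implemented: the `InMemoryBroker` class and its method names are hypothetical, and time is passed in explicitly as `now` so that the redelivery behaviour is deterministic.

```python
import itertools
from collections import deque


class InMemoryBroker:
    """Toy broker: ready queue, in-flight (unacknowledged) map, dead-letter list."""

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self.ready = deque()    # (message_id, body) pairs waiting for a consumer
        self.in_flight = {}     # message_id -> (body, redelivery deadline)
        self.dead_letter = []   # bodies routed to the dead-letter queue
        self._ids = itertools.count(1)

    def publish(self, body):
        """Store a message and return its id."""
        message_id = next(self._ids)
        self.ready.append((message_id, body))
        return message_id

    def _requeue_expired(self, now):
        """Make unacknowledged messages available again once their deadline passes."""
        expired = [
            mid for mid, (_, deadline) in self.in_flight.items() if deadline <= now
        ]
        for mid in expired:
            body, _ = self.in_flight.pop(mid)
            self.ready.append((mid, body))

    def consume(self, now):
        """Hand out the next message and mark it unacknowledged, or return None."""
        self._requeue_expired(now)
        if not self.ready:
            return None
        message_id, body = self.ready.popleft()
        self.in_flight[message_id] = (body, now + self.visibility_timeout)
        return message_id, body

    def ack(self, message_id):
        """Permanently remove a successfully processed message."""
        self.in_flight.pop(message_id, None)

    def nack_to_dlq(self, message_id):
        """Move a terminally failed message to the dead-letter queue."""
        entry = self.in_flight.pop(message_id, None)
        if entry is not None:
            self.dead_letter.append(entry[0])
```

With a 30-second timeout, a message consumed at `now=0.0` and never acknowledged (the crash case) is invisible to `consume(now=10.0)` and is handed out again by `consume(now=30.0)`. Calling `ack` removes it for good, and `nack_to_dlq` moves it to `dead_letter` instead.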

Technology Choices

RabbitMQ: Feature-rich, mature, protocol-based (AMQP) message broker. Offers routing flexibility, acknowledgements, persistence, and DLQ support. Requires separate deployment and management.
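Redis: Often used as a simpler message broker for Celery. Fast in-memory store; durability depends on how persistence (RDB snapshots or AOF) is configured, and Celery emulates acknowledgements on Redis with a visibility timeout rather than broker-native acks, so reliability can be weaker than with RabbitMQ. Can serve multiple purposes (caching, broker).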
AWS SQS (Simple Queue Service): Fully managed cloud service. Highly scalable and durable. Offers standard and FIFO queues, DLQ support. Reduces operational overhead. Pay-per-use pricing.
Google Cloud Pub/Sub: Fully managed cloud service. Global scale, push/pull delivery, filtering. Reduces operational overhead.

Selection Criteria: The choice depends on existing infrastructure, operational preferences (managed vs. self-hosted), required features (e.g., message ordering guarantees, which scraping jobs usually do not need), and scalability needs. Managed services (SQS, Pub/Sub) are often preferred for cloud-native deployments because they reduce operational burden.

Key Features Used

Message Persistence: Ensure messages are not lost if the broker restarts.
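Worker Acknowledgements: Workers should acknowledge messages only after successfully completing the task (or deciding it's a terminal failure) to prevent message loss if a worker crashes mid-task. With Celery this means enabling late acknowledgement (acks_late), because by default Celery acknowledges a message just before the task starts executing.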
Dead-Letter Queue (DLQ) Support: Automatically route messages that fail repeatedly to a separate queue for investigation.
Scalability: The queue system must handle the peak rate of job dispatch and consumption.
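
On the worker side, the acknowledgement and DLQ rules above amount to the following decision: acknowledge only after the task has run to completion, and route to the dead-letter queue once retries are exhausted. The sketch below is a minimal illustration; `handle_message`, its `max_retries` parameter, and the returned outcome strings are hypothetical names, and a real worker would normally wait (back off) between attempts.

```python
def handle_message(task, body, max_retries=3):
    """Run task(body) up to 1 + max_retries times.

    Returns (outcome, attempts, last_error), where outcome is "ack" if an
    attempt succeeded and "dead_letter" if every attempt raised. The caller
    then acknowledges the message or routes it to the DLQ accordingly.
    """
    attempts = 0
    last_error = None
    while attempts <= max_retries:
        attempts += 1
        try:
            task(body)
        except Exception as exc:
            # Remember the failure and try again; nothing is acknowledged yet.
            last_error = exc
            continue
        # Only a completed task leads to an acknowledgement.
        return "ack", attempts, None
    # Every attempt failed: treat as terminal and hand off to the DLQ.
    return "dead_letter", attempts, last_error
```

Because the outcome is decided only after `task` has returned or the retries have run out, a crash in the middle of `task` produces no acknowledgement at all, which is what lets the broker redeliver the message.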

Interaction Flow

Dispatcher publishes a job message to a specific queue/topic.
The Message Broker makes the message available.
An available Celery Worker retrieves (consumes) the message.
The Worker processes the job.
Upon success, the Worker sends an acknowledgement (ack) to the Broker, which removes the message from the queue. After repeated failure, the Worker sends a negative acknowledgement (nack) and the message is routed to the DLQ or discarded.
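
The flow above does not specify what a job message contains. As an illustration only, the sketch below assumes a small JSON payload: the `JobMessage` fields (`job_id`, `url`, `attempt`) and the `encode`/`decode` helpers are hypothetical, chosen to show the round trip between the Dispatcher (step 1) and the Worker (step 3).

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JobMessage:
    # All fields are assumptions made for this example.
    job_id: str
    url: str
    attempt: int = 0


def encode(job: JobMessage) -> bytes:
    """Dispatcher side: serialize the job before publishing it."""
    return json.dumps(asdict(job), sort_keys=True).encode("utf-8")


def decode(payload: bytes) -> JobMessage:
    """Worker side: rebuild the job from the consumed payload."""
    return JobMessage(**json.loads(payload.decode("utf-8")))
```

Keeping the payload as plain JSON means the same message can travel over any of the brokers listed under Technology Choices, since each of them carries opaque message bodies.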