# Dispatcher Service
The Dispatcher acts as the brain of the system, coordinating scraping activity based on the central configuration.
```mermaid
sequenceDiagram
    participant S as Scheduler
    participant D as Dispatcher
    participant C as Config DB
    participant Q as Message Queue
    loop Check Schedule
        S->>D: Trigger Dispatch Check
        D->>C: Query Due ScrapeTargets
        C-->>D: Return Due Targets
        loop For Each Due Target
            D->>D: Construct Job Message
            D->>Q: Publish Job Message
            D->>C: Update last_scheduled
        end
    end
```
## Role & Purpose
- Periodically queries the Configuration Database (Django/PostgreSQL) to identify active `ScrapeTarget` combinations that are due for execution based on their defined `frequency` and `last_scheduled` time.
- Constructs specific job instructions for each due target.
- Formats these instructions into messages suitable for the Message Queue.
- Publishes these job messages to the Message Queue for consumption by the Scraper Workers.
- Updates the `last_scheduled` timestamp in the Configuration Database for dispatched targets.
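The dispatch cycle above can be sketched in plain Python. This is a minimal illustration, not the actual implementation: the target dicts, the `frequency_minutes` field name, and the injected `publish` callable are assumptions standing in for the ORM query and the queue client.

```python
import json
from datetime import datetime, timedelta, timezone

def run_dispatch_cycle(targets, publish, now=None):
    """Select due ScrapeTargets, publish a job message for each,
    and stamp last_scheduled. Returns the number of jobs dispatched.

    `targets` is a list of dicts standing in for ScrapeTarget rows;
    `publish` is any callable that accepts a serialised message body.
    """
    now = now or datetime.now(timezone.utc)
    dispatched = 0
    for target in targets:
        if not target["active"]:
            continue
        last = target["last_scheduled"]
        # Due if never scheduled, or if the configured frequency has elapsed.
        due = last is None or now - last >= timedelta(minutes=target["frequency_minutes"])
        if not due:
            continue
        message = {
            "target_id": target["id"],
            "search_url": target["search_url"],
            "attempt_count": 0,
        }
        publish(json.dumps(message))    # hand off to the Message Queue
        target["last_scheduled"] = now  # stand-in for the DB update
        dispatched += 1
    return dispatched
```

Note that `last_scheduled` is stamped only after the publish succeeds, so a crash between the two steps re-dispatches the target rather than silently dropping it.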
## Implementation Options
- Standalone Python Service: A script running continuously or triggered periodically (e.g., via `cron` or a systemd timer). Uses a library like `SQLAlchemy` or Django's ORM (if run within a Django context) to query the config DB, and a library like `pika` (RabbitMQ), `boto3` (SQS), or `redis-py` to publish messages.
- Django Management Command: A command integrated within the Django application, scheduled using `cron` or similar. Can directly use Django's ORM.
- Celery Beat Task: Utilize Celery's periodic task scheduler (Celery Beat) to run the dispatch logic as a recurring task within the Celery ecosystem, especially if Celery is already heavily used. Can directly use Django's ORM if configured.
- Serverless Function (e.g., AWS Lambda, Google Cloud Function): Triggered on a schedule (e.g., CloudWatch Events). Queries the database (requires network access/credentials) and publishes to the queue.
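For the Celery Beat option, the scheduling side reduces to a settings fragment. The task path `scraping.tasks.dispatch_due_targets`, the 5-minute interval, and the `CELERY_` settings namespace (which assumes the app is configured with `config_from_object('django.conf:settings', namespace='CELERY')`) are all illustrative assumptions, not values from this project.

```python
# settings.py fragment -- hypothetical Celery Beat entry.
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "dispatch-due-scrape-targets": {
        # Assumed dotted path to a task wrapping the dispatch logic.
        "task": "scraping.tasks.dispatch_due_targets",
        # Run the dispatch check every 5 minutes.
        "schedule": crontab(minute="*/5"),
    },
}
```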
## Job Message Content
Each message published to the queue needs to contain sufficient information for a worker to execute the scrape. Example contents:
```json
{
  "target_id": 123, // ID from ScrapeTarget table
  "website_id": 5,
  "website_name": "ExampleJobs.com",
  "search_url": "https://www.examplejobs.com/search?q=Python+Developer&loc=Remote&posted=24h",
  "requires_playwright": false,
  "pagination_type": "next_link",
  // ... other necessary site-specific metadata ...
  "keywords_used": ["Python Developer"],
  "location_used": "Remote",
  "attempt_count": 0 // Initial attempt
}
```
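Since malformed messages fail only later, inside a worker, it can help to validate the payload at construction time. A minimal sketch of such a builder is below; the function name, the defaults, and the choice of required fields are assumptions, not part of the documented schema.

```python
import json

# Fields a worker cannot do without (an assumption for illustration).
REQUIRED_FIELDS = {"target_id", "website_id", "search_url", "attempt_count"}

def build_job_message(target_id, website_id, website_name, search_url,
                      requires_playwright=False, pagination_type="next_link",
                      keywords_used=(), location_used="", attempt_count=0):
    """Serialise a job message, refusing to emit one missing a required field."""
    message = {
        "target_id": target_id,
        "website_id": website_id,
        "website_name": website_name,
        "search_url": search_url,
        "requires_playwright": requires_playwright,
        "pagination_type": pagination_type,
        "keywords_used": list(keywords_used),
        "location_used": location_used,
        "attempt_count": attempt_count,
    }
    missing = sorted(f for f in REQUIRED_FIELDS if message[f] is None)
    if missing:
        raise ValueError(f"job message missing required fields: {missing}")
    return json.dumps(message)
```

The returned string is what gets handed to the queue client (e.g., as the `body` argument of a `pika` `basic_publish` call).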