Technology Stack Summary

This section outlines the proposed core technologies for each major component of the system, along with justifications based on the design goals.

| Component | Technology Choices | Justification |
| --- | --- | --- |
| Language | Python 3.x | Mature ecosystem, excellent libraries for web scraping, data processing, web frameworks, ML, and cloud integration. |
| Configuration Mgmt | Django + PostgreSQL | Django: Built-in Admin UI for easy config management, robust ORM, migrations. PostgreSQL: Reliable relational DB for storing config. |
| Job Scheduling/Dispatch | Python Service (potentially Django Mgmt Command/Celery Beat) | Leverages Python ecosystem. Can be scheduled via cron, systemd timer, or Celery Beat integrated with the Django app. |
| Message Queue | RabbitMQ / Redis / AWS SQS / Google Pub/Sub | Decouples dispatcher & workers. Choice depends on scale/ops preference: RabbitMQ: Feature-rich. Redis: Simpler. SQS/PubSub: Managed cloud services. |
| Scraping Workers | Celery + Python | Celery: Mature, distributed task queue system for Python, integrates well with brokers. |
| Core Fetching Logic | requests + urllib3.Retry / backoff | Lightweight, standard Python HTTP library with robust, configurable retry/backoff mechanisms for simple fetches. |
| Dynamic Content Fetching | Playwright (or potential alternative: Splash/Commercial API) | Playwright: Modern browser automation for JS-heavy sites. Alternatives: Considered for resource optimization/offloading browser mgmt. |
| Proxy/UA Rotation | Custom Scrapy Middleware / Python Logic in Worker | Integrates rotation logic within the fetching process. Can use lists, files, or proxy service APIs. |
| Raw Data Storage | AWS S3 / Google Cloud Storage / Azure Blob Storage | Scalable, durable, cost-effective object storage for raw HTML. Cloud provider choice often aligns with other infrastructure. |
| Parsing Logic | Python (BeautifulSoup, lxml, potentially parsel) | Standard Python libraries for efficient HTML parsing. |
| Parsing Service/Worker | Celery Workers / AWS Lambda / Google Cloud Functions | Celery: Reuse worker infrastructure. Serverless: Option if parsing is stateless and triggered by S3 events. |
| Structured Data Storage | PostgreSQL (with JSONB) | PostgreSQL: Strong relational features, SQL querying, data integrity. JSONB: Flexibility for semi-structured fields (skills, salary). |
| Monitoring - Logs | ELK Stack / Grafana Loki / CloudWatch Logs | Centralized log aggregation for debugging and tracing. |
| Monitoring - Metrics | Prometheus + Grafana / CloudWatch Metrics / Datadog | Time-series metrics for performance monitoring, dashboarding, and alerting. |
| Orchestration (Scale) | Docker + Kubernetes (EKS/GKE/AKS) | Docker: Containerization. Kubernetes: Scalable deployment, management, auto-scaling of workers and services. |
| Infrastructure Provisioning | Terraform / Pulumi / CloudFormation | Infrastructure as Code for reproducible and automated environment setup. |
| CI/CD | GitHub Actions / GitLab CI / Jenkins | Automated building, testing, and deployment pipelines. |
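
To make the "Core Fetching Logic" and "Proxy/UA Rotation" rows more concrete, the following is a minimal, standard-library-only sketch of the idea: retry a fetch with exponential backoff while rotating the User-Agent on each attempt. It is not the urllib3.Retry implementation and not the system's actual worker code. The names `backoff_delays` and `fetch_with_retry`, the injected `fetch` callable, and the default delay values are all hypothetical, chosen for illustration; in the proposed stack the `fetch` argument would presumably wrap a requests call.

```python
import itertools
import time
from typing import Callable, Dict, Iterable, List, Optional


def backoff_delays(retries: int, factor: float = 0.5, cap: float = 30.0) -> List[float]:
    """Return exponential delays: factor * 2**attempt, capped at `cap` seconds."""
    return [min(cap, factor * (2 ** attempt)) for attempt in range(retries)]


def fetch_with_retry(
    fetch: Callable[[str, Dict[str, str]], str],
    url: str,
    user_agents: Iterable[str],
    retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url, headers), rotating the User-Agent and backing off on failure.

    `fetch` is a hypothetical callable supplied by the caller. `user_agents`
    is assumed to be non-empty.
    """
    agents = itertools.cycle(user_agents)
    delays = backoff_delays(retries)
    last_error: Optional[Exception] = None

    # One initial attempt plus up to `retries` retries.
    for attempt in range(retries + 1):
        headers = {"User-Agent": next(agents)}
        try:
            return fetch(url, headers)
        except OSError as exc:  # network-style errors; adjust for the real HTTP client
            last_error = exc
            if attempt < retries:  # no sleep after the final failure
                sleep(delays[attempt])

    raise RuntimeError(f"giving up on {url} after {retries + 1} attempts") from last_error
```

Passing `sleep` and `fetch` as parameters keeps the sketch testable without network access or real delays; a worker could pass a proxy list through the same rotation pattern.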

This stack balances established, well-supported technologies with scalable cloud-native patterns, in line with the system's design goals. Specific choices (e.g., RabbitMQ vs. SQS) may depend on existing infrastructure or operational preferences.
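
Because the message queue choice may vary by environment, one option is to keep it out of the code and resolve it from configuration. The sketch below assumes hypothetical environment variables `BROKER_URL` and `QUEUE_BACKEND` and a hypothetical helper `resolve_broker_url`; the hosts, ports, and credentials are placeholders. The URL schemes (`amqp://`, `redis://`, `sqs://`) are the ones Celery uses for those brokers. Google Pub/Sub is left out because its Celery transport support should be checked against the Celery/Kombu version in use.

```python
import os
from typing import Mapping, Optional

# Placeholder broker URLs for three of the queue options in the table.
# These values are illustrative defaults, not recommendations.
DEFAULT_BROKER_URLS = {
    "rabbitmq": "amqp://guest:guest@localhost:5672//",
    "redis": "redis://localhost:6379/0",
    "sqs": "sqs://",  # credentials are typically taken from the AWS environment
}


def resolve_broker_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a broker URL from configuration.

    An explicit BROKER_URL wins; otherwise QUEUE_BACKEND selects a default.
    Both variable names are assumptions made for this sketch.
    """
    env = os.environ if env is None else env

    explicit = env.get("BROKER_URL")
    if explicit:
        return explicit

    backend = env.get("QUEUE_BACKEND", "rabbitmq").lower()
    if backend not in DEFAULT_BROKER_URLS:
        known = ", ".join(sorted(DEFAULT_BROKER_URLS))
        raise ValueError(f"unknown QUEUE_BACKEND {backend!r}; expected one of: {known}")
    return DEFAULT_BROKER_URLS[backend]
```

With Celery installed, the result could be passed as `Celery("scraper", broker=resolve_broker_url())`, where the app name `"scraper"` is likewise a placeholder. Switching from RabbitMQ to SQS then becomes a deployment setting rather than a code change.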