Technology Stack Summary
This section outlines the proposed core technologies for each major component of the system, along with justifications based on the design goals.
| Component | Technology Choices | Justification |
|---|---|---|
| Language | Python 3.x | Mature ecosystem, excellent libraries for web scraping, data processing, web frameworks, ML, and cloud integration. |
| Configuration Mgmt | Django + PostgreSQL | Django: Built-in Admin UI for easy config management, robust ORM, migrations. PostgreSQL: Reliable relational DB for storing config. |
| Job Scheduling/Dispatch | Python Service (potentially Django Mgmt Command/Celery Beat) | Leverages Python ecosystem. Can be scheduled via cron, systemd timer, or Celery Beat integrated with the Django app. |
| Message Queue | RabbitMQ / Redis / AWS SQS / Google Pub/Sub | Decouples the dispatcher from the workers. The choice depends on scale and operational preference. RabbitMQ: Feature-rich. Redis: Simpler. SQS/Pub/Sub: Managed cloud services. |
| Scraping Workers | Celery + Python | Celery: Mature, distributed task queue system for Python, integrates well with brokers. |
| Core Fetching Logic | requests + urllib3.Retry / backoff | Lightweight, standard Python HTTP library with robust, configurable retry/backoff mechanisms for simple fetches. |
| Dynamic Content Fetching | Playwright (possible alternatives: Splash / commercial rendering API) | Playwright: Modern browser automation for JS-heavy sites. Alternatives: Worth considering to reduce resource usage or to offload browser management. |
| Proxy/UA Rotation | Python logic in the worker (or custom Scrapy middleware, if Scrapy is adopted) | Integrates rotation logic within the fetching process. Can draw proxies and user agents from in-memory lists, files, or proxy service APIs. |
| Raw Data Storage | AWS S3 / Google Cloud Storage / Azure Blob Storage | Scalable, durable, cost-effective object storage for raw HTML. Cloud provider choice often aligns with other infrastructure. |
| Parsing Logic | Python (BeautifulSoup, lxml, potentially parsel) | Standard Python libraries for efficient HTML parsing. |
| Parsing Service/Worker | Celery Workers / AWS Lambda / Google Cloud Functions | Celery: Reuse worker infrastructure. Serverless: Option if parsing is stateless and triggered by S3 events. |
| Structured Data Storage | PostgreSQL (with JSONB) | PostgreSQL: Strong relational features, SQL querying, data integrity. JSONB: Flexibility for semi-structured fields (skills, salary). |
| Monitoring - Logs | ELK Stack / Grafana Loki / CloudWatch Logs | Centralized log aggregation for debugging and tracing. |
| Monitoring - Metrics | Prometheus + Grafana / CloudWatch Metrics / Datadog | Time-series metrics for performance monitoring, dashboarding, and alerting. |
| Orchestration (Scale) | Docker + Kubernetes (EKS/GKE/AKS) | Docker: Containerization. Kubernetes: Scalable deployment, management, auto-scaling of workers and services. |
| Infrastructure Provisioning | Terraform / Pulumi / CloudFormation | Infrastructure as Code for reproducible and automated environment setup. |
| CI/CD | GitHub Actions / GitLab CI / Jenkins | Automated building, testing, and deployment pipelines. |

This stack balances established, well-supported technologies with scalable cloud-native patterns, in line with the system's design goals. Where the table lists several options (e.g., RabbitMQ vs. SQS), the final choice may depend on existing infrastructure or operational preferences.
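
To make the Proxy/UA Rotation row more concrete, here is a minimal sketch of rotation logic that could live inside a worker. The `Rotator` class, the `next_request_kwargs` method, and the example proxy and user-agent values are all hypothetical and chosen for illustration; the design itself does not prescribe a rotation policy. This sketch assumes round-robin over proxies and a random pick over user agents.

```python
import itertools
import random

# Hypothetical pools. Real values would come from a list, a file,
# or a proxy service API, as the table notes.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBot/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBot/1.0",
]
PROXIES = [
    "http://proxy-a.example.com:8080",
    "http://proxy-b.example.com:8080",
]


class Rotator:
    """Round-robin over proxies, random choice over user agents."""

    def __init__(self, proxies, user_agents, rng=None):
        if not proxies or not user_agents:
            raise ValueError("proxies and user_agents must be non-empty")
        self._proxies = itertools.cycle(proxies)
        self._user_agents = list(user_agents)
        # An injectable random generator keeps the behaviour reproducible in tests.
        self._rng = rng or random.Random()

    def next_request_kwargs(self):
        """Return keyword arguments in the shape that requests.get() accepts."""
        proxy = next(self._proxies)
        return {
            "headers": {"User-Agent": self._rng.choice(self._user_agents)},
            "proxies": {"http": proxy, "https": proxy},
        }
```

A worker could then call something like `requests.get(url, timeout=10, **rotator.next_request_kwargs())` for each fetch, so that the rotation policy stays in one place and can be swapped without touching the fetching code.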
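
The retry/backoff behaviour named in the Core Fetching Logic row can be illustrated with a small standard-library sketch. This is not the `urllib3.Retry` API; in the proposed stack that policy would more likely be configured by mounting a `urllib3.Retry` instance on a `requests` session through `HTTPAdapter`. The function name `fetch_with_backoff` and its parameters are hypothetical, and the delays shown are illustrative defaults rather than values from the design.

```python
import random
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=0.5,
                       max_delay=30.0, sleep=time.sleep, rng=random.random):
    """Call fetch(url), retrying on exceptions with exponential backoff.

    fetch, sleep, and rng are injectable so the policy can be exercised
    without network access or real waiting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            # A real worker would catch only transient errors
            # (timeouts, connection resets, 429/5xx responses).
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Upper bound doubles each attempt: base, 2*base, 4*base, ... capped.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter: sleep a random fraction of the bound to avoid
            # many workers retrying in lockstep.
            sleep(delay * rng())
```

Spreading retries out this way matters most when many workers hit the same site: without jitter, workers that failed together would also retry together.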