Structured Data Storage (PostgreSQL)

This component is the destination for the cleaned, processed, and structured job posting data extracted from the raw HTML.

Role & Purpose

Store the final, usable job data in a structured format suitable for querying and analysis.
Enforce data types and potentially basic constraints (e.g., uniqueness on source URL + timestamp).
Serve as the primary data source for downstream applications, analytics platforms, or ML model training pipelines.

Rationale:
- Relational Integrity: Strong support for data types, constraints, and relationships.
- Powerful SQL: Standard, expressive query language for complex data retrieval and aggregation.
- JSONB Support: Excellent built-in support for storing and efficiently querying semi-structured JSON data within relational tables. This is ideal for handling fields with variable structures (e.g., salary details, extracted skills lists, location breakdowns).
- Maturity & Reliability: Proven, ACID-compliant database with a large ecosystem and community support.
- Scalability: Scales well vertically and offers various replication and high-availability options. Extensions like pgvector can add specialized capabilities if needed later.

A primary table (e.g., JobPostings) would store the core information.

id: Primary Key (Serial).
source_website: Text (e.g., 'linkedin', 'indeed_com').
source_url: Text (Unique identifier for the specific posting).
job_id_on_source: Text (Nullable, ID used by the source site).
scraped_timestamp: TimestampTZ (When this record was scraped/parsed).
first_seen_timestamp: TimestampTZ (When this job URL was first encountered).
last_processed_timestamp: TimestampTZ (When this record was last updated by the parser).
raw_html_s3_path: Text (Link back to the raw source in S3).
job_title_raw: Text.
company_name: Text (Nullable).
location_raw: Text (Nullable).
location_structured: JSONB (Nullable, e.g., {"city": "London", "country": "UK", "remote": false}).
description_text: Text (Cleaned job description).
description_html: Text (Nullable, Raw HTML description if needed).
publish_date: Date/TimestampTZ (Nullable, Standardized).
employment_type: Text (Nullable, e.g., 'Full-time', 'Contract').
salary_raw: Text (Nullable, The raw salary string).
salary_structured: JSONB (Nullable, e.g., {"min": 50000, "max": 70000, "currency": "USD", "period": "annual"}).
extracted_skills: JSONB (Nullable, Array of strings, e.g., ["Python", "SQL", "AWS"]).
categorized_role: Text (Nullable, Standardized role category).
seniority_level: Text (Nullable, Inferred seniority).
... other relevant fields.

Create indexes on frequently queried columns like source_url, scraped_timestamp, source_website, job_title_raw, company_name, and potentially use GIN indexes for efficient querying within JSONB fields.

The Parsing Service/Worker is responsible for INSERTing new records or potentially UPDATEing existing ones (based on source_url) if re-processing occurs.
Downstream analysis tools, ML pipelines, or applications query this database using SQL.

This setup provides a robust and flexible foundation for storing and accessing the valuable structured job data.