Skip to content

Challenge Overview

This document outlines the proposed design for a system addressing the data engineering challenge presented. The core objective is to design and plan the implementation of a scalable web scraping system focused on collecting publicly available job listing data across various sources.

Project Focus:

The specific application domain for this design is the collection of job postings from:

  • Multiple job boards (e.g., LinkedIn, Indeed, specific industry boards).
  • Various industry sectors.
  • Different geographic locations/cities.
  • A diverse set of roles and job titles.

Core Tasks Addressed (Based on Exercise 1 & 2):

  1. Web Scraping Strategy & Implementation (Exercise 1):

    • Define an approach to identify relevant websites and scrape job data efficiently and ethically.
    • Plan for handling common scraping challenges like dynamic content loading, pagination, rate limiting, and anti-scraping measures.
    • Design a data model for storing the collected information (both raw and structured).
    • Address data quality concerns like duplicates and missing values during collection.
    • Select and justify appropriate programming languages, libraries, and frameworks (e.g., Python, Scrapy, Playwright, Requests).
    • Outline storage solutions for raw scraped data (e.g., S3).
  2. Data Processing & Optimization (Exercise 2):

    • Design a workflow to clean, standardize, and preprocess the scraped data for usability.
    • Plan steps required if the data were to be used for machine learning applications (e.g., text processing, feature extraction).
    • Outline strategies for optimizing the scraping and processing system for large-scale data collection and long-term efficiency (e.g., caching, parallel processing, monitoring).
    • Define storage solutions for processed, structured data suitable for analysis (e.g., PostgreSQL).

The ultimate goal is to propose a robust, scalable, and maintainable architecture capable of collecting, processing, and storing valuable job market data for further analysis or application development.