
Core Fetching Implementation (Requests)

For websites or specific endpoints that serve content as static HTML or via predictable APIs, a lightweight fetching approach is preferred for efficiency.

Technology Choice: requests Library

  • Rationale: requests is the de facto standard Python library for making HTTP requests, and it is robust and well maintained. Because it issues plain HTTP requests without launching a browser process or executing JavaScript, it uses far less CPU and RAM than full browser automation.

Key Implementation Features

graph TD
    A[Start Fetch Request] --> B{Make HTTP Request};
    B --> C{Check Response Status};
    C -- OK (2xx) --> D[Success: Return Response];
    C -- Retryable Error (e.g., 503, Timeout)? --> E{Retry Count < Max?};
    E -- Yes --> F[Apply Backoff Delay];
    F --> B;
    E -- No --> G[Fail: Raise Exception];
    C -- Non-Retryable Error (e.g., 404) --> G;

    style D fill:#cfc,stroke:#333,stroke-width:2px
    style G fill:#fcc,stroke:#333,stroke-width:2px
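
The flow in the diagram can be sketched as a plain retry loop. This is a minimal illustration rather than the system's actual implementation: the names fetch_with_retries, FetchError, and RETRYABLE_STATUSES are hypothetical, the default values are assumptions, and send stands for any zero-argument callable that returns an object with a status_code attribute (for example, lambda: session.get(url, timeout=10)).

```python
import random
import time

# Assumed set of retryable statuses, mirroring the server errors named in this section.
RETRYABLE_STATUSES = frozenset({500, 502, 503, 504})


class FetchError(Exception):
    """Permanent fetch failure (hypothetical exception name)."""


def fetch_with_retries(send, max_retries=3, base_delay=0.5,
                       retryable_exceptions=(TimeoutError, ConnectionError),
                       sleep=time.sleep):
    """Call send() until success, a non-retryable error, or retries run out.

    With requests, pass retryable_exceptions=(requests.exceptions.Timeout,
    requests.exceptions.ConnectionError) instead of the built-in defaults.
    """
    retries = 0
    while True:
        try:
            response = send()
        except retryable_exceptions as exc:
            reason = f"transient error: {exc!r}"
        else:
            if 200 <= response.status_code < 300:
                return response  # "Success: Return Response"
            if response.status_code not in RETRYABLE_STATUSES:
                # Non-retryable error (e.g., 404): fail immediately.
                raise FetchError(f"non-retryable status {response.status_code}")
            reason = f"retryable status {response.status_code}"

        if retries >= max_retries:
            raise FetchError(f"giving up after {retries} retries ({reason})")

        # Exponential backoff with full jitter: wait a random time in
        # [0, base_delay * 2**retries] before the next attempt.
        sleep(random.uniform(0, base_delay * 2 ** retries))
        retries += 1
```

Passing sleep as a parameter keeps the loop easy to test without real delays.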

Retry Logic

  • Mechanism: Use urllib3's Retry class, mounted on a requests.Session through an HTTPAdapter, or alternatively the backoff decorator library.
  • Configuration: Configure automatic retries for:
    • Transient network errors (timeouts, connection issues).
    • Specific server-side error codes (e.g., 500, 502, 503, 504).
    • Rate-limiting responses (e.g., 429) when the limit is temporary, honoring any Retry-After header the server sends.
  • Backoff Strategy: Employ exponential backoff with jitter between retries to avoid overwhelming servers and to keep concurrent workers from retrying in lockstep.
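
One way to wire this up is sketched below, assuming the requests and urllib3 packages are installed. The function name build_retrying_session and the default values are illustrative assumptions, not settings taken from this design. By default Retry only retries idempotent methods such as GET and HEAD, and it honors Retry-After headers on 429 and 503 responses.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_retrying_session(total_retries=3, backoff_factor=0.5):
    """Return a Session that retries transient failures automatically."""
    retry = Retry(
        total=total_retries,                          # overall retry budget per request
        status_forcelist=(429, 500, 502, 503, 504),   # statuses that trigger a retry
        backoff_factor=backoff_factor,                # exponential delay between attempts
        # urllib3 2.x also accepts backoff_jitter; it is omitted here so the
        # sketch also runs on older urllib3 releases.
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Apply the same retry policy to both schemes.
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

Connection errors and read timeouts count against the same total budget, so the policy covers both the transient network errors and the server-side codes listed above.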

Session Management

  • Use requests.Session objects for:
    • Connection Pooling: Reusing underlying TCP connections for better performance when making multiple requests to the same host.
    • Cookie Persistence: Automatically handle cookies if needed for session management on the target site (though less common for API endpoints).
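
A short sketch of both points, assuming requests is installed. The helper name build_pooled_session, the pool size, the example.com URL, and the cookie values are all placeholders chosen for illustration. No network request is made: prepare_request only shows what the session would send.

```python
import requests
from requests.adapters import HTTPAdapter


def build_pooled_session(pool_size=10):
    """Return a Session whose adapters keep up to pool_size connections per host."""
    session = requests.Session()
    adapter = HTTPAdapter(pool_connections=pool_size, pool_maxsize=pool_size)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


with build_pooled_session() as session:
    # Cookies stored on the session are attached to later matching requests.
    session.cookies.set("sessionid", "abc123", domain="example.com", path="/")
    prepared = session.prepare_request(
        requests.Request("GET", "https://example.com/jobs?page=2")
    )
    cookie_header = prepared.headers.get("Cookie")
```

Using the session as a context manager closes the pooled connections when the block exits.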

Headers & User-Agent

  • User-Agent Rotation: Set appropriate User-Agent headers, ideally rotating them via middleware or logic within the worker, so that requests resemble those of different browsers and are less likely to be blocked.
  • Other Headers: Include other standard headers (Accept, Accept-Language, etc.) to appear more like a regular browser request.
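
A minimal sketch of header construction with round-robin User-Agent rotation follows. The User-Agent strings are illustrative placeholders rather than a vetted list, and build_headers is a hypothetical helper name. A real deployment would likely load the pool from configuration.

```python
import itertools

# Illustrative placeholder values; replace with a maintained list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

_user_agent_cycle = itertools.cycle(USER_AGENTS)


def build_headers():
    """Return browser-like headers, using the next User-Agent in the rotation."""
    return {
        "User-Agent": next(_user_agent_cycle),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }
```

The result can be passed per request, for example session.get(url, headers=build_headers()), so each call picks up the next User-Agent.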

Timeout Configuration

  • Set explicit timeouts for both connection and read operations on every request. requests applies no timeout unless one is passed, so without it a task can hang indefinitely on an unresponsive server.
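
As a sketch, a thin wrapper can make sure no request goes out without a timeout. The name fetch and the values in DEFAULT_TIMEOUT are assumptions for illustration. requests accepts a (connect, read) tuple in seconds for its timeout argument.

```python
import requests

# (connect timeout, read timeout) in seconds; assumed values, tune per target site.
DEFAULT_TIMEOUT = (3.05, 30)


def fetch(session, url, timeout=DEFAULT_TIMEOUT):
    """GET url through session, always passing an explicit timeout."""
    try:
        return session.get(url, timeout=timeout)
    except requests.exceptions.Timeout as exc:
        # Covers both ConnectTimeout and ReadTimeout.
        raise TimeoutError(f"{url} timed out (timeout={timeout})") from exc
```

The read timeout bounds the wait between bytes received from the server, not the total download time, so very large responses may still need a separate overall deadline.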

When to Use

  • Fetching robots.txt.
  • Accessing static HTML pages where job data is directly embedded.
  • Interacting with identified APIs (AJAX/XHR/Fetch) that return data (often JSON).
  • Simple pagination scenarios handled by URL parameters.
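
The pagination case can be sketched as a generator. It assumes a hypothetical endpoint that takes a page query parameter, returns a JSON list, and returns an empty list once the last page has been passed. The name iter_pages and all defaults are illustrative, and session is expected to be a requests.Session.

```python
def iter_pages(session, url, page_param="page", start=1, max_pages=100,
               timeout=(3.05, 30)):
    """Yield items from successive pages until an empty page is returned."""
    for page in range(start, start + max_pages):
        response = session.get(url, params={page_param: page}, timeout=timeout)
        response.raise_for_status()  # surface 4xx/5xx responses as exceptions
        items = response.json()
        if not items:
            return  # empty page: assume there are no more results
        yield from items
```

The max_pages cap is a safety limit so a misbehaving endpoint cannot keep the loop running forever.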

By defaulting to this lightweight approach and only escalating to browser automation when necessary, the system optimizes resource usage and improves overall throughput.