The Problem
Building reliable ML training datasets requires reproducible, high-volume data ingestion pipelines — not one-off manual downloads. The same engineering problem applies to job market data: LinkedIn's listings are ephemeral, disappearing within days, and no public API exists. I needed a robust automated pipeline that could run weekly, ingest ~50 new postings, deduplicate against previous runs, and deliver clean structured records without human intervention.
What I Built
A Playwright-based ingestion pipeline that:
- Scrapes LinkedIn job search results — navigates paginated results for ML/AI engineer roles in Germany, capturing job IDs, titles, companies, and posting URLs (see the sketch after this list).
- Writes to Google Sheets — four tabs: `CronCaptures` (raw scrape timestamps), `Sheet1` (enriched job records), `TrackingIDs` (deduplicated IDs so the same job isn't re-processed), and `JobDescriptions` (full text of target job descriptions).
- Runs on a cron — scheduled to scrape weekly, appending new results and skipping jobs already in the tracking list.
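A minimal sketch of the pagination loop, assuming Playwright's Python sync API; the search URL, CSS selectors, and ID extraction are illustrative stand-ins, not necessarily what the pipeline uses:

```python
# Sketch of the paginated scrape. Selectors and the ID-extraction logic
# are assumptions; LinkedIn's markup changes often.
from playwright.sync_api import sync_playwright

SEARCH_URL = ("https://www.linkedin.com/jobs/search/"
              "?keywords=machine%20learning%20engineer&location=Germany")

def scrape_search_results(max_pages: int = 5) -> list[dict]:
    jobs = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        for page_num in range(max_pages):
            # LinkedIn paginates with a `start` offset of 25 results per page.
            page.goto(f"{SEARCH_URL}&start={page_num * 25}")
            page.wait_for_selector("ul.jobs-search__results-list li")  # assumed selector
            for card in page.query_selector_all("ul.jobs-search__results-list li"):
                link = card.query_selector("a.base-card__full-link")        # assumed
                title = card.query_selector("h3.base-search-card__title")   # assumed
                company = card.query_selector("h4.base-search-card__subtitle")
                if not (link and title and company):
                    continue
                url = link.get_attribute("href")
                jobs.append({
                    # Crude ID extraction: take the trailing path segment
                    # and drop query parameters.
                    "job_id": url.split("/")[-1].split("?")[0],
                    "title": title.inner_text().strip(),
                    "company": company.inner_text().strip(),
                    "url": url,
                })
        browser.close()
    return jobs
```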
Technical Challenges
LinkedIn anti-bot detection — Playwright with realistic browser headers and randomised wait times between page interactions kept the scraper running without CAPTCHA challenges across weeks of use.
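Roughly what that looks like in Playwright's Python API; the user-agent string, viewport, and pause bounds here are assumptions:

```python
# Sketch of the anti-bot measures: a realistic user agent plus jittered
# pauses between interactions so request timing doesn't look scripted.
import random
import time

from playwright.sync_api import sync_playwright

UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")

def human_pause(low: float = 2.0, high: float = 6.0) -> None:
    """Sleep for a random interval between interactions."""
    time.sleep(random.uniform(low, high))

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    # A browser context carries the user agent, locale, and viewport,
    # so every page opened from it presents consistent, plausible headers.
    context = browser.new_context(
        user_agent=UA,
        locale="en-US",
        viewport={"width": 1366, "height": 768},
    )
    page = context.new_page()
    page.goto("https://www.linkedin.com/jobs/search/")
    human_pause()              # wait before the first interaction
    page.mouse.wheel(0, 800)   # scroll the way a reader would
    human_pause()
    browser.close()
```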
Google Sheets rate limits — bulk writes batched using the Sheets API's `batchUpdate`, with exponential backoff on quota errors.
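A sketch of that pattern, assuming the official `googleapiclient` Sheets v4 client; `batch_write`, the ranges, and the example values are hypothetical:

```python
# Sketch of batched writes with exponential backoff. Assumes `service`
# was built with googleapiclient for the Sheets v4 API.
import time

from googleapiclient.errors import HttpError

def batch_write(service, spreadsheet_id: str, updates: list[dict], max_retries: int = 5):
    """Write many ranges in one request, retrying on quota errors (HTTP 429)."""
    body = {"valueInputOption": "RAW", "data": updates}
    for attempt in range(max_retries):
        try:
            return (service.spreadsheets().values()
                    .batchUpdate(spreadsheetId=spreadsheet_id, body=body)
                    .execute())
        except HttpError as err:
            if err.resp.status != 429:
                raise  # only quota errors are retryable here
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, ...
    raise RuntimeError("Sheets quota retries exhausted")

# Example payload: one entry per tab, each a range plus rows of values.
updates = [
    {"range": "CronCaptures!A1", "values": [["2024-05-06T06:00:00Z"]]},
    {"range": "Sheet1!A2", "values": [["3898901234", "ML Engineer", "Acme GmbH"]]},
]
```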
Deduplication — job IDs are hashed and stored in the `TrackingIDs` tab; the scraper skips anything already present, so repeated runs are idempotent.
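A minimal sketch of the idempotency check; `job_key` and `filter_new_jobs` are hypothetical helper names, and the set of tracked hashes stands in for a read of the `TrackingIDs` tab:

```python
# Hash each job ID and skip anything already recorded, so re-running
# the scraper never re-processes a job.
import hashlib

def job_key(job_id: str) -> str:
    """Stable fingerprint for a job ID: SHA-256 hex digest."""
    return hashlib.sha256(job_id.encode("utf-8")).hexdigest()

def filter_new_jobs(jobs: list[dict], tracked: set[str]) -> list[dict]:
    """Return only jobs whose hashed ID is not yet tracked."""
    fresh = []
    for job in jobs:
        key = job_key(job["job_id"])
        if key in tracked:
            continue  # already processed on a previous run
        tracked.add(key)
        fresh.append(job)
    return fresh
```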
Outcome
Built and used personally. Maintained a live feed of ~50 new ML roles per week with zero manual browsing.