The Problem
Building reliable ML training datasets requires reproducible, high-volume data ingestion pipelines — not one-off manual downloads. The same engineering problem applies to job market data: LinkedIn's listings are ephemeral, disappearing within days, and no public API exists. I needed a robust automated pipeline that could run weekly, ingest ~50 new postings, deduplicate against previous runs, and deliver clean structured records without human intervention.
What I Built
A Playwright-based ingestion pipeline that:
- Scrapes LinkedIn job search results — navigates paginated results for ML/AI engineer roles in Germany, capturing job IDs, titles, companies, and posting URLs (see the sketch after this list).
- Writes to Google Sheets — four tabs: `CronCaptures` (raw scrape timestamps), `Sheet1` (enriched job records), `TrackingIDs` (deduplicated IDs so the same job isn't re-processed), and `JobDescriptions` (full text of target job descriptions).
- Runs on a cron — scheduled to scrape weekly, appending new results and skipping jobs already in the tracking list.
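A minimal sketch of the pagination loop, assuming Playwright's Python sync API; the search URL, CSS selectors, and ID extraction are illustrative stand-ins, not necessarily what the pipeline uses:

```python
# Sketch of the paginated scrape. Selectors and the ID-extraction logic
# are assumptions; LinkedIn's markup changes often.
from playwright.sync_api import sync_playwright

SEARCH_URL = ("https://www.linkedin.com/jobs/search/"
              "?keywords=machine%20learning%20engineer&location=Germany")

def scrape_search_results(max_pages: int = 5) -> list[dict]:
    jobs = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        for page_num in range(max_pages):
            # LinkedIn paginates with a `start` offset of 25 results per page.
            page.goto(f"{SEARCH_URL}&start={page_num * 25}")
            page.wait_for_selector("ul.jobs-search__results-list li")  # assumed selector
            for card in page.query_selector_all("ul.jobs-search__results-list li"):
                link = card.query_selector("a.base-card__full-link")        # assumed
                title = card.query_selector("h3.base-search-card__title")   # assumed
                company = card.query_selector("h4.base-search-card__subtitle")
                if not (link and title and company):
                    continue
                url = link.get_attribute("href")
                jobs.append({
                    # Crude ID extraction: take the trailing path segment
                    # and drop query parameters.
                    "job_id": url.split("/")[-1].split("?")[0],
                    "title": title.inner_text().strip(),
                    "company": company.inner_text().strip(),
                    "url": url,
                })
        browser.close()
    return jobs
```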
Technical Challenges
LinkedIn anti-bot detection — Playwright with realistic browser headers and randomised wait times between page interactions kept the scraper running without CAPTCHA challenges across weeks of use.
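Roughly what that looks like in Playwright's Python API; the user-agent string, viewport, and pause bounds here are assumptions:

```python
# Sketch of the anti-bot measures: a realistic user agent plus jittered
# pauses between interactions so request timing doesn't look scripted.
import random
import time

from playwright.sync_api import sync_playwright

UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")

def human_pause(low: float = 2.0, high: float = 6.0) -> None:
    """Sleep for a random interval between interactions."""
    time.sleep(random.uniform(low, high))

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    # A browser context carries the user agent, locale, and viewport,
    # so every page opened from it presents consistent, plausible headers.
    context = browser.new_context(
        user_agent=UA,
        locale="en-US",
        viewport={"width": 1366, "height": 768},
    )
    page = context.new_page()
    page.goto("https://www.linkedin.com/jobs/search/")
    human_pause()              # wait before the first interaction
    page.mouse.wheel(0, 800)   # scroll the way a reader would
    human_pause()
    browser.close()
```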
Google Sheets rate limits — bulk writes batched using the Sheets API's `batchUpdate`, with exponential backoff on quota errors.
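A sketch of that pattern, assuming the official `googleapiclient` Sheets v4 client; `batch_write`, the ranges, and the example values are hypothetical:

```python
# Sketch of batched writes with exponential backoff. Assumes `service`
# was built with googleapiclient for the Sheets v4 API.
import time

from googleapiclient.errors import HttpError

def batch_write(service, spreadsheet_id: str, updates: list[dict], max_retries: int = 5):
    """Write many ranges in one request, retrying on quota errors (HTTP 429)."""
    body = {"valueInputOption": "RAW", "data": updates}
    for attempt in range(max_retries):
        try:
            return (service.spreadsheets().values()
                    .batchUpdate(spreadsheetId=spreadsheet_id, body=body)
                    .execute())
        except HttpError as err:
            if err.resp.status != 429:
                raise  # only quota errors are retryable here
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, ...
    raise RuntimeError("Sheets quota retries exhausted")

# Example payload: one entry per tab, each a range plus rows of values.
updates = [
    {"range": "CronCaptures!A1", "values": [["2024-05-06T06:00:00Z"]]},
    {"range": "Sheet1!A2", "values": [["3898901234", "ML Engineer", "Acme GmbH"]]},
]
```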
Deduplication — job IDs are hashed and stored in the `TrackingIDs` tab; the scraper skips anything already present, so repeated runs are idempotent.
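A minimal sketch of the idempotency check; `job_key` and `filter_new_jobs` are hypothetical helper names, and the set of tracked hashes stands in for a read of the `TrackingIDs` tab:

```python
# Hash each job ID and skip anything already recorded, so re-running
# the scraper never re-processes a job.
import hashlib

def job_key(job_id: str) -> str:
    """Stable fingerprint for a job ID: SHA-256 hex digest."""
    return hashlib.sha256(job_id.encode("utf-8")).hexdigest()

def filter_new_jobs(jobs: list[dict], tracked: set[str]) -> list[dict]:
    """Return only jobs whose hashed ID is not yet tracked."""
    fresh = []
    for job in jobs:
        key = job_key(job["job_id"])
        if key in tracked:
            continue  # already processed on a previous run
        tracked.add(key)
        fresh.append(job)
    return fresh
```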
Outcome
Built and used personally. Maintained a live feed of ~50 new ML roles per week with zero manual browsing.