The Architecture Challenge
Building large-scale ML datasets requires high-throughput ingestion pipelines that saturate network bandwidth without blocking on I/O. This project demonstrates that pattern: crawl a site of 800+ pages, discover URLs dynamically, and extract structured records at maximum speed using Python's async concurrency primitives.
The client domain was event listings — the engineering challenge was performance.
What I Built
An async scraper using aiohttp + asyncio, capped at 50 concurrent requests — designed to maximise throughput without overloading the origin server.
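A minimal sketch of the concurrency pattern, assuming the 50-request cap is enforced with an asyncio.Semaphore; the names fetch, fetch_all, and MAX_CONCURRENCY are illustrative, not the project's actual identifiers:

```python
import asyncio

import aiohttp

MAX_CONCURRENCY = 50  # matches the 50 concurrent requests described above


async def fetch(session: aiohttp.ClientSession, sem: asyncio.Semaphore, url: str) -> str:
    # The semaphore caps in-flight requests, so the event loop keeps many
    # connections open without flooding the server.
    async with sem:
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.text()


async def fetch_all(urls: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, url) for url in urls))

# Usage: pages = asyncio.run(fetch_all(list_of_urls))
```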
Key design choices
Async concurrency — asyncio.PriorityQueue manages URL discovery. New links found on each page are enqueued immediately; already-visited URLs are skipped via a visited_urls set. Priority ordering ensures city index pages are scraped before event detail pages (see the crawl-frontier sketch after this list).
Normalised URL handling — all discovered URLs are resolved relative to the base domain and stripped of fragments to avoid duplicates from different link formats.
Structured output — events are extracted with BeautifulSoup into a flat schema: name, city, country, date, venue, description. Output is written to both CSV (for the client's pipeline) and Excel (for review); see the extraction sketch after this list.
Scraping time tracking — a separate scraping_time.csv logs per-page fetch latency, useful for profiling slow pages and tuning concurrency.
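A sketch of the crawl-frontier logic behind the first two choices, shown sequentially for clarity (the real crawler runs this across 50 concurrent workers). The "/event/" priority rule and the start-page priority are illustrative assumptions, not the site's actual URL structure:

```python
import asyncio
from urllib.parse import urljoin, urldefrag, urlparse

import aiohttp
from bs4 import BeautifulSoup


def extract_links(html: str) -> list[str]:
    # Collect every href on the page; BeautifulSoup is already a project dependency.
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


async def crawl(start_url: str) -> set[str]:
    queue = asyncio.PriorityQueue()  # holds (priority, url); lower numbers drain first
    visited_urls = set()
    await queue.put((0, start_url))  # start page is a city index, so highest priority

    async with aiohttp.ClientSession() as session:
        while not queue.empty():
            _, url = await queue.get()
            if url in visited_urls:
                continue
            visited_urls.add(url)

            async with session.get(url) as resp:
                html = await resp.text()

            for href in extract_links(html):
                # Resolve relative links against the current page and drop #fragments,
                # so differently formatted links to the same page are not re-queued.
                link = urldefrag(urljoin(url, href)).url
                # Stay on the base domain and skip anything already seen.
                if urlparse(link).netloc == urlparse(start_url).netloc and link not in visited_urls:
                    # Illustrative rule: index pages (0) before event detail pages (1).
                    priority = 1 if "/event/" in link else 0
                    await queue.put((priority, link))

    return visited_urls
```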
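And a sketch of the extraction, output, and timing side. The CSS selectors and the CSV filename are placeholders (the target site's markup isn't shown here), and writing Excel via pandas assumes openpyxl is installed:

```python
import csv

import pandas as pd
from bs4 import BeautifulSoup


def _text(soup, selector):
    # Return stripped text for a selector, or None if the element is missing.
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node else None


def parse_event(html: str) -> dict:
    # Flat schema from the write-up; the selectors below are illustrative placeholders.
    soup = BeautifulSoup(html, "html.parser")
    return {
        "name": _text(soup, "h1"),
        "city": _text(soup, ".city"),
        "country": _text(soup, ".country"),
        "date": _text(soup, ".date"),
        "venue": _text(soup, ".venue"),
        "description": _text(soup, ".description"),
    }


def write_outputs(events: list[dict]) -> None:
    df = pd.DataFrame(events)
    df.to_csv("milongas_events.csv", index=False)     # CSV for the client's pipeline (name assumed)
    df.to_excel("milongas_events.xlsx", index=False)  # Excel for review; requires openpyxl


def log_fetch_time(url: str, seconds: float) -> None:
    # One row per page in scraping_time.csv, e.g. measured with time.perf_counter()
    # around each fetch, for profiling slow pages and tuning concurrency later.
    with open("scraping_time.csv", "a", newline="") as f:
        csv.writer(f).writerow([url, f"{seconds:.3f}"])
```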
Outcome
Full crawl of the target site (~800 pages) completed in ~12 seconds with 50 concurrent connections. Delivered milongas_events.xlsx with 1,200+ structured event records, organised by city.