The Architecture Challenge
Building large-scale ML datasets requires high-throughput ingestion pipelines that saturate network bandwidth without blocking on I/O. This project demonstrates that pattern: crawl a site of 800+ pages, discover URLs dynamically, and extract structured records at maximum speed using Python's async concurrency primitives.
The client domain was event listings — the engineering challenge was performance.
What I Built
An async scraper using aiohttp + asyncio, capped at 50 concurrent requests — designed to maximise throughput without overloading the origin server.
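A minimal sketch of the concurrency pattern, assuming the 50-request cap is enforced with an asyncio.Semaphore; the names fetch, fetch_all, and MAX_CONCURRENCY are illustrative, not the project's actual identifiers:

```python
import asyncio

import aiohttp

MAX_CONCURRENCY = 50  # matches the 50 concurrent requests described above


async def fetch(session: aiohttp.ClientSession, sem: asyncio.Semaphore, url: str) -> str:
    # The semaphore caps in-flight requests, so the event loop keeps many
    # connections open without flooding the server.
    async with sem:
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.text()


async def fetch_all(urls: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, url) for url in urls))

# Usage: pages = asyncio.run(fetch_all(list_of_urls))
```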
Key design choices
Async concurrency — asyncio.PriorityQueue manages URL discovery. New links found on each page are enqueued immediately; already-visited URLs are skipped via a visited_urls set. Priority ordering ensures city index pages are scraped before event detail pages (see the crawl-frontier sketch after this list).
Normalised URL handling — all discovered URLs are resolved relative to the base domain and stripped of fragments to avoid duplicates from different link formats.
Structured output — events are extracted with BeautifulSoup into a flat schema: name, city, country, date, venue, description. Output is written to both CSV (for the client's pipeline) and Excel (for review); see the extraction sketch after this list.
Scraping time tracking — a separate scraping_time.csv logs per-page fetch latency, useful for profiling slow pages and tuning concurrency.
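A sketch of the crawl-frontier logic behind the first two choices, shown sequentially for clarity (the real crawler runs this across 50 concurrent workers). The "/event/" priority rule and the start-page priority are illustrative assumptions, not the site's actual URL structure:

```python
import asyncio
from urllib.parse import urljoin, urldefrag, urlparse

import aiohttp
from bs4 import BeautifulSoup


def extract_links(html: str) -> list[str]:
    # Collect every href on the page; BeautifulSoup is already a project dependency.
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


async def crawl(start_url: str) -> set[str]:
    queue = asyncio.PriorityQueue()  # holds (priority, url); lower numbers drain first
    visited_urls = set()
    await queue.put((0, start_url))  # start page is a city index, so highest priority

    async with aiohttp.ClientSession() as session:
        while not queue.empty():
            _, url = await queue.get()
            if url in visited_urls:
                continue
            visited_urls.add(url)

            async with session.get(url) as resp:
                html = await resp.text()

            for href in extract_links(html):
                # Resolve relative links against the current page and drop #fragments,
                # so differently formatted links to the same page are not re-queued.
                link = urldefrag(urljoin(url, href)).url
                # Stay on the base domain and skip anything already seen.
                if urlparse(link).netloc == urlparse(start_url).netloc and link not in visited_urls:
                    # Illustrative rule: index pages (0) before event detail pages (1).
                    priority = 1 if "/event/" in link else 0
                    await queue.put((priority, link))

    return visited_urls
```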
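And a sketch of the extraction, output, and timing side. The CSS selectors and the CSV filename are placeholders (the target site's markup isn't shown here), and writing Excel via pandas assumes openpyxl is installed:

```python
import csv

import pandas as pd
from bs4 import BeautifulSoup


def _text(soup, selector):
    # Return stripped text for a selector, or None if the element is missing.
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node else None


def parse_event(html: str) -> dict:
    # Flat schema from the write-up; the selectors below are illustrative placeholders.
    soup = BeautifulSoup(html, "html.parser")
    return {
        "name": _text(soup, "h1"),
        "city": _text(soup, ".city"),
        "country": _text(soup, ".country"),
        "date": _text(soup, ".date"),
        "venue": _text(soup, ".venue"),
        "description": _text(soup, ".description"),
    }


def write_outputs(events: list[dict]) -> None:
    df = pd.DataFrame(events)
    df.to_csv("milongas_events.csv", index=False)     # CSV for the client's pipeline (name assumed)
    df.to_excel("milongas_events.xlsx", index=False)  # Excel for review; requires openpyxl


def log_fetch_time(url: str, seconds: float) -> None:
    # One row per page in scraping_time.csv, e.g. measured with time.perf_counter()
    # around each fetch, for profiling slow pages and tuning concurrency later.
    with open("scraping_time.csv", "a", newline="") as f:
        csv.writer(f).writerow([url, f"{seconds:.3f}"])
```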
Outcome
Full crawl of the target site (~800 pages) completed in ~12 seconds with 50 concurrent connections. Delivered milongas_events.xlsx with 1,200+ structured event records, organised by city.