The Problem
Real-world data rarely arrives in a clean tabular format. This project is a good example: a construction contractor's proprietary Excel template used complex merged-cell geometry to encode daily work orders — nested tables for labour hours, equipment, materials, and sign-off rows within a single locked sheet. Standard pandas.read_excel produces garbage on merged cells. Extracting clean, ML-ready structured records from this kind of source requires navigating the file's internal geometry, not just reading it row by row.
The Challenge
The Excel format was not generated programmatically — it was hand-filled using a locked template with complex merged cell ranges. Standard pandas.read_excel produced garbage due to the merging. The extraction needed to:
- Identify the start and end rows of each ticket block (by
LOG_IDmarker cells). - Navigate merged cell geometry with
openpyxl.worksheet.cell_range.CellRangeto find logical cell boundaries. - Extract each field (date, crew, hours, materials, location, sign-off) by relative position within the block.
- Output a flat CSV/Excel with one row per ticket, preserving all embedded images (signature photos) as linked files.
What I Built
extract_daily_tickets.py — a command-line tool that:
- Loads the source workbook with
openpyxlin read-only mode. - Scans for
LOG_IDmarkers to delimit ticket blocks (BlockSlicedataclass). - Resolves merged cell ranges to find the actual cell holding each value.
- Copies embedded images to an
images/output folder and references them in the output. - Writes a clean Excel output with one row per ticket and a validation summary sheet.
Outcome
Processed 12 months of daily tickets (300+ work orders) in under 30 seconds. The output fed directly into the client's payroll spreadsheet without manual re-entry.