Excel Work-Order Data Extraction — Christos Prapas

The Problem

Real-world data rarely arrives in a clean tabular format. This project is a good example: a construction contractor's proprietary Excel template used complex merged-cell geometry to encode daily work orders — nested tables for labour hours, equipment, materials, and sign-off rows within a single locked sheet. Standard pandas.read_excel produces garbage on merged cells. Extracting clean, ML-ready structured records from this kind of source requires navigating the file's internal geometry, not just reading it row by row.

The Challenge

The Excel format was not generated programmatically — it was hand-filled using a locked template with complex merged cell ranges. Standard pandas.read_excel produced garbage due to the merging. The extraction needed to:

Identify the start and end rows of each ticket block (by LOG_ID marker cells).
Navigate merged cell geometry with openpyxl.worksheet.cell_range.CellRange to find logical cell boundaries.
Extract each field (date, crew, hours, materials, location, sign-off) by relative position within the block.
Output a flat CSV/Excel with one row per ticket, preserving all embedded images (signature photos) as linked files.

What I Built

extract_daily_tickets.py — a command-line tool that:

Loads the source workbook with openpyxl in read-only mode.
Scans for LOG_ID markers to delimit ticket blocks (BlockSlice dataclass).
Resolves merged cell ranges to find the actual cell holding each value.
Copies embedded images to an images/ output folder and references them in the output.
Writes a clean Excel output with one row per ticket and a validation summary sheet.

Outcome

Processed 12 months of daily tickets (300+ work orders) in under 30 seconds. The output fed directly into the client's payroll spreadsheet without manual re-entry.