How We Cracked OCR Through Extraction Pipelines
Jan 26, 2026

Our OCR pipelines looked accurate in demos and failed silently in production. The breakthrough wasn’t a better model; it was changing how extraction was designed.
OCR systems often look solid in controlled settings and then falter in practice. Over time, we watched pipelines that initially performed well degrade, producing outputs that looked correct at a glance but contained subtle, hard-to-detect errors.
The core issue was not the OCR engine's accuracy but the reliability of the extraction built on top of it. Layout shifts, visual ambiguities, and erroneous values undermine simple extraction systems in ways that typical benchmarks rarely capture.
Our key shift was to treat OCR output as a probabilistic signal rather than ground truth. Reliable extraction comes from the system's architecture: augmenting OCR with visual context, applying semantic constraints, and structuring extraction as a debuggable, verifiable process rather than a single model call.
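A rough sketch of what "probabilistic signal plus semantic constraints" means in practice is below. This is not our production code; the names, the confidence threshold, and the time-format constraint are illustrative. The idea is simply that a reading only passes downstream if it clears both the OCR engine's own confidence score and a check against the format the cell is expected to contain.

```python
from dataclasses import dataclass
import re


@dataclass
class OcrCandidate:
    """A single OCR reading treated as evidence, not ground truth."""
    text: str
    confidence: float  # score reported by the OCR engine, 0.0-1.0


def accept_time_label(candidate: OcrCandidate,
                      min_confidence: float = 0.85) -> str | None:
    """Accept a reading only if it clears a confidence threshold and a
    semantic constraint (here: an HH:MM time format). Returning None lets
    a later stage flag the cell for interpolation or review instead of
    silently passing a wrong value downstream."""
    if candidate.confidence < min_confidence:
        return None
    match = re.fullmatch(r"([01]?\d|2[0-3]):([0-5]\d)", candidate.text.strip())
    if not match:
        return None
    hours, minutes = match.groups()
    return f"{int(hours):02d}:{minutes}"


# A malformed or low-confidence reading is rejected rather than propagated.
print(accept_time_label(OcrCandidate("9:30", 0.93)))   # -> "09:30"
print(accept_time_label(OcrCandidate("9:3O", 0.91)))   # -> None (letter O)
print(accept_time_label(OcrCandidate("09:30", 0.42)))  # -> None (low confidence)
```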
Our pipeline detects regions of interest using a vision model, then extracts text with OCR while analyzing cell colours in parallel. Colour thresholds distinguish headers, empty slots, and data cells. Task-specific pipelines reconstruct grid structures from detected cells, interpolate missing values such as time labels, and validate that extracted content matches expected formats.
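To make two of those steps concrete, here is a compressed sketch of colour-based cell classification and time-label interpolation. The thresholds, helper names, and the fixed 30-minute step are assumptions for illustration; real values are tuned per document family.

```python
import numpy as np


def classify_cell(cell_bgr: np.ndarray,
                  header_threshold: float = 90.0,
                  empty_threshold: float = 235.0) -> str:
    """Classify a detected cell crop by its mean intensity: dark fills are
    treated as headers, near-white cells as empty, everything else as data
    cells that go on to OCR. Thresholds here are illustrative."""
    mean_intensity = float(cell_bgr.mean())
    if mean_intensity < header_threshold:
        return "header"
    if mean_intensity > empty_threshold:
        return "empty"
    return "data"


def interpolate_time_labels(labels: list[str | None],
                            step_minutes: int = 30) -> list[str | None]:
    """Fill gaps in a column of time labels, assuming consecutive rows
    advance by a fixed step. Each missing entry is reconstructed from the
    row above it; gaps with no anchor above are left for review."""
    def to_minutes(label: str) -> int:
        hours, minutes = label.split(":")
        return int(hours) * 60 + int(minutes)

    def to_label(total: int) -> str:
        return f"{(total // 60) % 24:02d}:{total % 60:02d}"

    filled: list[str | None] = []
    for index, label in enumerate(labels):
        if label is not None:
            filled.append(label)
        elif index > 0 and filled[index - 1] is not None:
            filled.append(to_label(to_minutes(filled[index - 1]) + step_minutes))
        else:
            filled.append(None)  # no anchor above this row
    return filled


# "09:00" and "10:00" were read by OCR; the gaps are reconstructed
# on a 30-minute grid.
print(interpolate_time_labels(["09:00", None, "10:00", None]))
# -> ['09:00', '09:30', '10:00', '10:30']
```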
We wrote a paper documenting this approach from a production-systems perspective, focusing on real failure modes, design tradeoffs, and what actually improves reliability over time, rather than proposing a new model or benchmark. For the full technical breakdown and design decisions, read the paper here.