OCR pipeline: structured field extraction from 941 PDFs via native Anthropic PDF support

2026-04-01

Phase 5 of the honest-cam pipeline: read the 941 reorganized PDFs and pull structured financial fields out of each one into a JSON sidecar that sits next to the source. This is the step that turns a pile of documents into queryable data.

Design decisions

A few of these are opinionated and worth calling out:

| Decision | Choice | Why |
| --- | --- | --- |
| PDF processing | Anthropic native PDF support (document content blocks) | Skips pypdf entirely — the API accepts the PDF as-is, the model sees layout, no brittle text extraction |
| Confidence | Validation-based, computed from rules | Self-reported confidence from the model is noisy; I trust deterministic post-extraction validation more |
| Amounts | Decimal, never float | Because this is money and 0.1 + 0.2 != 0.3 is unacceptable in a ledger |
| Unit numbers | list[int] | Supports property-wide charges (empty list), single-unit, and multi-unit splits |
| Period dates | Proper period_start / period_end dates | Not free text, not just "April 2026" |
| Bank statements | Line-item extraction | Per-transaction, so the ledger can reconcile against individual rows |
| Sidecar writes | Atomic: write to .json.tmp, then rename | Crash-safe; never leaves a half-written sidecar behind |

The big unlock is Anthropic's native PDF support. I was ready to plumb pypdf (or worse, Tesseract) through this thing, and instead the pipeline just sends the PDF bytes as a document content block and asks for a tool-use response matching a Pydantic schema. Zero OCR text-extraction code to maintain.
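The request shape is simple enough to show. A minimal sketch of what "send the PDF bytes as a document content block and ask for a tool-use response" looks like — the function name, model alias, and prompt wiring here are illustrative, not the actual `ocr.py` internals:

```python
import base64


def build_extraction_request(pdf_bytes: bytes, tool_schema: dict, prompt: str) -> dict:
    """Build kwargs for Anthropic messages.create: PDF in, tool call out.

    tool_schema is the per-category tool definition (name, description,
    input_schema) — e.g. derived from a Pydantic model's
    model_json_schema(). tool_choice forces the model to answer by
    calling that tool, so the response is always structured.
    """
    return {
        "model": "claude-3-5-haiku-latest",  # placeholder tier-1 model id
        "max_tokens": 4096,
        "tools": [tool_schema],
        "tool_choice": {"type": "tool", "name": tool_schema["name"]},
        "messages": [{
            "role": "user",
            "content": [
                {
                    # Native PDF support: the raw bytes, base64-encoded,
                    # as a document content block. No text extraction.
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": base64.standard_b64encode(pdf_bytes).decode("ascii"),
                    },
                },
                {"type": "text", "text": prompt},
            ],
        }],
    }
```

The structured fields then come back in the response's `tool_use` block input, ready for Pydantic validation.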

The two-tier model ladder

Every document goes to Haiku first. Haiku is fast and cheap, and for clean invoices with well-formatted line items it nails the fields on the first pass. If validation fails — missing required fields, amount doesn't reconcile, unit number outside known range — the document is retried on Sonnet. Sonnet eats the ambiguous ones (handwritten receipts, scanned utility bills where the meter read landed on a crease) and gets them right.

This ladder matters because the cost delta between Haiku and Sonnet is real, and on a 941-doc batch the savings compound fast. The validation layer is what makes the ladder safe — nothing slips through on Haiku without being audited against the rules.
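The ladder itself reduces to a few lines. A sketch under assumed names (the real pipeline lives in `ocr.py` and the model ids are placeholders): extract on the cheap model, run the deterministic validators, and only escalate on failure.

```python
from typing import Callable

HAIKU = "claude-3-5-haiku-latest"    # placeholder tier-1 model id
SONNET = "claude-3-7-sonnet-latest"  # placeholder tier-2 model id


def extract_with_ladder(
    doc: object,
    extract: Callable[[object, str], dict],
    validate: Callable[[dict], list[str]],
) -> tuple[dict, str]:
    """Cheap model first; escalate only when validation finds errors."""
    fields = extract(doc, HAIKU)
    if not validate(fields):
        return fields, HAIKU  # clean invoice: done on the first pass
    # Haiku's output failed the deterministic rules -> retry on Sonnet
    return extract(doc, SONNET), SONNET
```

Because escalation is driven by the validators rather than by model self-report, a Haiku answer can only be accepted after passing the same audit a Sonnet answer would.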

New files

  • packages/py/src/propco/ingest/ocr.py — core pipeline: detect category, call Anthropic with the category-specific tool schema, validate, write sidecar.
  • packages/py/src/propco/ingest/prompts.py — 8 per-category prompt templates (invoices, bank statements, compliance notices, contracts, …) plus matching tool-use schemas.
  • packages/py/src/propco/ingest/validation.py — post-extraction validation rules and confidence computation. Everything deterministic.
  • packages/py/src/propco/models/document.py — Pydantic models: FieldValue, LineItem, DocumentFields, DocumentSidecar, OcrRunStats.
  • CHANGELOG.md — backfilled with 0.1.0 and now in Keep a Changelog format.
  • properties/bamboo-house/ocr.yaml — per-property OCR config (which categories apply, which model ladder, batch size).
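To make "everything deterministic" concrete, here is a sketch of what rule-based validation and confidence might look like for the invoice category. The field names and the 0.25-per-error penalty are assumptions for illustration, not the actual `validation.py` rules:

```python
from decimal import Decimal


def validate_invoice(fields: dict) -> list[str]:
    """Deterministic post-extraction checks; returns error strings."""
    errors: list[str] = []
    # Required-field check (hypothetical field set for an invoice)
    for name in ("vendor", "total", "period_start", "period_end"):
        if not fields.get(name):
            errors.append(f"missing required field: {name}")
    # Reconciliation: line items must sum exactly to the total.
    # Decimal throughout — never float — because this is money.
    items = fields.get("line_items", [])
    if items and fields.get("total") is not None:
        line_sum = sum(Decimal(str(i["amount"])) for i in items)
        if line_sum != Decimal(str(fields["total"])):
            errors.append(f"line items sum to {line_sum}, total is {fields['total']}")
    return errors


def confidence(errors: list[str]) -> float:
    """Confidence computed from rule outcomes, not model self-report."""
    return max(0.0, 1.0 - 0.25 * len(errors))
```

An empty error list is both "accept the Haiku pass" and "confidence 1.0"; anything else triggers the Sonnet retry.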

CLI

propco ocr <slug> [--dry-run | --execute] [--batch-size N] [--category <name>]

Dry run prints what it would process; --execute actually makes API calls. --category lets me pilot on just invoices first before turning the whole pipeline loose on every category.
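For reference, the flag surface above could be modeled like this — an argparse sketch only; the real CLI may well be built on a different framework, and the defaults here are guesses:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of the `propco ocr` flag surface (illustrative only)."""
    p = argparse.ArgumentParser(prog="propco ocr")
    p.add_argument("slug", help="property slug, e.g. bamboo-house")
    # --dry-run and --execute are mutually exclusive modes
    mode = p.add_mutually_exclusive_group()
    mode.add_argument("--dry-run", action="store_true",
                      help="print what would be processed")
    mode.add_argument("--execute", action="store_true",
                      help="actually make API calls")
    p.add_argument("--batch-size", type=int, metavar="N", default=50)
    p.add_argument("--category", help="limit to one category, e.g. invoices")
    return p
```

Nothing here changes the semantics described above; it just pins down the shape of the pilot invocation.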

Verification

  • 93 tests passing (25 pipeline + 37 validation + 31 existing reorg).
  • ruff check clean.
  • A deliberate pilot is up next: propco ocr bamboo-house --execute --batch-size 20 --category invoices, then eyeball the first 20 sidecars before scaling.

Once the sidecars land for the whole property, the next phase is wiring them into the Xero sync so invoices extracted here automatically become bills there. That's PR #4.


PR: https://github.com/StevieIsmagic/honest-cam/pull/3