OCR pipeline: structured field extraction from 941 PDFs via native Anthropic PDF support

2026-04-01

Phase 5 of the honest-cam pipeline: read the 941 reorganized PDFs and pull structured financial fields out of each one into a JSON sidecar that sits next to the source. This is the step that turns a pile of documents into queryable data.

Design decisions

A few of these are opinionated and worth calling out:

| Decision | Choice | Why |
| --- | --- | --- |
| PDF processing | Anthropic native PDF support (document content blocks) | Skips pypdf entirely — the API accepts the PDF as-is, the model sees layout, no brittle text extraction |
| Confidence | Validation-based, computed from rules | Self-reported confidence from the model is noisy; I trust deterministic post-extraction validation more |
| Amounts | Decimal, never float | Because this is money and 0.1 + 0.2 != 0.3 is unacceptable in a ledger |
| Unit numbers | list[int] | Supports property-wide charges (empty list), single-unit, and multi-unit splits |
| Period dates | Proper period_start / period_end dates | Not free text, not just "April 2026" |
| Bank statements | Line-item extraction | Per-transaction, so the ledger can reconcile against individual rows |
| Sidecar writes | Atomic: write to .json.tmp, then rename | Crash-safe; never leaves a half-written sidecar behind |

The big unlock is Anthropic's native PDF support. I was ready to plumb pypdf (or worse, Tesseract) through this thing, and instead the pipeline just sends the PDF bytes as a document content block and asks for a tool-use response matching a Pydantic schema. Zero OCR text-extraction code to maintain.
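The request shape is simple enough to show. A minimal sketch of what "send the PDF bytes as a document content block and ask for a tool-use response" looks like — the function name, model alias, and prompt wiring here are illustrative, not the actual `ocr.py` internals:

```python
import base64


def build_extraction_request(pdf_bytes: bytes, tool_schema: dict, prompt: str) -> dict:
    """Build kwargs for Anthropic messages.create: PDF in, tool call out.

    tool_schema is the per-category tool definition (name, description,
    input_schema) — e.g. derived from a Pydantic model's
    model_json_schema(). tool_choice forces the model to answer by
    calling that tool, so the response is always structured.
    """
    return {
        "model": "claude-3-5-haiku-latest",  # placeholder tier-1 model id
        "max_tokens": 4096,
        "tools": [tool_schema],
        "tool_choice": {"type": "tool", "name": tool_schema["name"]},
        "messages": [{
            "role": "user",
            "content": [
                {
                    # Native PDF support: the raw bytes, base64-encoded,
                    # as a document content block. No text extraction.
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": base64.standard_b64encode(pdf_bytes).decode("ascii"),
                    },
                },
                {"type": "text", "text": prompt},
            ],
        }],
    }
```

The structured fields then come back in the response's `tool_use` block input, ready for Pydantic validation.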

The two-tier model ladder

Every document goes to Haiku first. Haiku is fast and cheap, and for clean invoices with well-formatted line items it nails the fields on the first pass. If validation fails — missing required fields, amount doesn't reconcile, unit number outside known range — the document is retried on Sonnet. Sonnet eats the ambiguous ones (handwritten receipts, scanned utility bills where the meter read landed on a crease) and gets them right.

This ladder matters because the cost delta between Haiku and Sonnet is real, and on a 941-doc batch the savings compound fast. The validation layer is what makes the ladder safe — nothing slips through on Haiku without being audited against the rules.
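The ladder itself reduces to a few lines. A sketch under assumed names (the real pipeline lives in `ocr.py` and the model ids are placeholders): extract on the cheap model, run the deterministic validators, and only escalate on failure.

```python
from typing import Callable

HAIKU = "claude-3-5-haiku-latest"    # placeholder tier-1 model id
SONNET = "claude-3-7-sonnet-latest"  # placeholder tier-2 model id


def extract_with_ladder(
    doc: object,
    extract: Callable[[object, str], dict],
    validate: Callable[[dict], list[str]],
) -> tuple[dict, str]:
    """Cheap model first; escalate only when validation finds errors."""
    fields = extract(doc, HAIKU)
    if not validate(fields):
        return fields, HAIKU  # clean invoice: done on the first pass
    # Haiku's output failed the deterministic rules -> retry on Sonnet
    return extract(doc, SONNET), SONNET
```

Because escalation is driven by the validators rather than by model self-report, a Haiku answer can only be accepted after passing the same audit a Sonnet answer would.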

New files

  • packages/py/src/propco/ingest/ocr.py — core pipeline: detect category, call Anthropic with the category-specific tool schema, validate, write sidecar.
  • packages/py/src/propco/ingest/prompts.py — 8 per-category prompt templates (invoices, bank statements, compliance notices, contracts, …) plus matching tool-use schemas.
  • packages/py/src/propco/ingest/validation.py — post-extraction validation rules and confidence computation. Everything deterministic.
  • packages/py/src/propco/models/document.py — Pydantic models: FieldValue, LineItem, DocumentFields, DocumentSidecar, OcrRunStats.
  • CHANGELOG.md — backfilled with 0.1.0 and now in Keep a Changelog format.
  • properties/bamboo-house/ocr.yaml — per-property OCR config (which categories apply, which model ladder, batch size).
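To make "everything deterministic" concrete, here is a sketch of what rule-based validation and confidence might look like for the invoice category. The field names and the 0.25-per-error penalty are assumptions for illustration, not the actual `validation.py` rules:

```python
from decimal import Decimal


def validate_invoice(fields: dict) -> list[str]:
    """Deterministic post-extraction checks; returns error strings."""
    errors: list[str] = []
    # Required-field check (hypothetical field set for an invoice)
    for name in ("vendor", "total", "period_start", "period_end"):
        if not fields.get(name):
            errors.append(f"missing required field: {name}")
    # Reconciliation: line items must sum exactly to the total.
    # Decimal throughout — never float — because this is money.
    items = fields.get("line_items", [])
    if items and fields.get("total") is not None:
        line_sum = sum(Decimal(str(i["amount"])) for i in items)
        if line_sum != Decimal(str(fields["total"])):
            errors.append(f"line items sum to {line_sum}, total is {fields['total']}")
    return errors


def confidence(errors: list[str]) -> float:
    """Confidence computed from rule outcomes, not model self-report."""
    return max(0.0, 1.0 - 0.25 * len(errors))
```

An empty error list is both "accept the Haiku pass" and "confidence 1.0"; anything else triggers the Sonnet retry.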

CLI

propco ocr <slug> [--dry-run | --execute] [--batch-size N] [--category <name>]

Dry run prints what it would process; --execute actually makes API calls. --category lets me pilot on just invoices first before turning the whole pipeline loose on every category.
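For reference, the flag surface above could be modeled like this — an argparse sketch only; the real CLI may well be built on a different framework, and the defaults here are guesses:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of the `propco ocr` flag surface (illustrative only)."""
    p = argparse.ArgumentParser(prog="propco ocr")
    p.add_argument("slug", help="property slug, e.g. bamboo-house")
    # --dry-run and --execute are mutually exclusive modes
    mode = p.add_mutually_exclusive_group()
    mode.add_argument("--dry-run", action="store_true",
                      help="print what would be processed")
    mode.add_argument("--execute", action="store_true",
                      help="actually make API calls")
    p.add_argument("--batch-size", type=int, metavar="N", default=50)
    p.add_argument("--category", help="limit to one category, e.g. invoices")
    return p
```

Nothing here changes the semantics described above; it just pins down the shape of the pilot invocation.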

Verification

  • 93 tests passing (25 pipeline + 37 validation + 31 existing reorg).
  • ruff check clean.
  • A deliberate pilot is up next: propco ocr bamboo-house --execute --batch-size 20 --category invoices, then eyeball the first 20 sidecars before scaling.

Once the sidecars land for the whole property, the next phase is wiring them into the Xero sync so invoices extracted here automatically become bills there. That's PR #4.


PR: https://github.com/StevieIsmagic/honest-cam/pull/3