Phase 5 of the honest-cam pipeline: read the 941 reorganized PDFs and pull structured financial fields out of each one into a JSON sidecar that sits next to the source. This is the step that turns a pile of documents into queryable data.
## Design decisions
A few of these are opinionated and worth calling out:
| Decision | Choice | Why |
|---|---|---|
| PDF processing | Anthropic native PDF support (document content blocks) | Skips pypdf entirely — the API accepts the PDF as-is, the model sees layout, no brittle text extraction |
| Confidence | Validation-based, computed from rules | Self-reported confidence from the model is noisy; I trust deterministic post-extraction validation more |
| Amounts | Decimal, never float | Because this is money and 0.1 + 0.2 != 0.3 is unacceptable in a ledger |
| Unit numbers | list[int] | Supports property-wide charges (empty list), single-unit, and multi-unit splits |
| Period dates | Proper period_start / period_end dates | Not free text, not just "April 2026" |
| Bank statements | Line-item extraction | Per-transaction, so the ledger can reconcile against individual rows |
| Sidecar writes | Atomic: write to .json.tmp, then rename | Crash-safe; never leaves a half-written sidecar behind |
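Two of those decisions are easy to show concretely: the atomic sidecar write and Decimal-only amounts. This is a minimal sketch, not the real module — `write_sidecar` and the path layout are illustrative:

```python
import json
import os
from decimal import Decimal
from pathlib import Path


def write_sidecar(sidecar_path: Path, fields: dict) -> None:
    """Write extracted fields atomically: temp file first, then rename.

    os.replace() is atomic on POSIX and Windows, so a crash mid-write
    leaves either the old sidecar or no sidecar -- never a torn file.
    """
    tmp_path = sidecar_path.parent / (sidecar_path.name + ".tmp")
    # default=str serializes Decimal as a string, so no float rounding
    # ever touches a monetary amount on the way to disk.
    tmp_path.write_text(json.dumps(fields, default=str, indent=2))
    os.replace(tmp_path, sidecar_path)


# The float problem the table alludes to, and the Decimal fix:
assert 0.1 + 0.2 != 0.3
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
```

`os.replace` rather than `os.rename` is the portable choice here: on Windows, `rename` refuses to overwrite an existing file, while `replace` is atomic on both platforms.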
The big unlock is Anthropic's native PDF support. I was ready to plumb pypdf (or worse, Tesseract) through this thing, and instead the pipeline just sends the PDF bytes as a document content block and asks for a tool-use response matching a Pydantic schema. Zero OCR text-extraction code to maintain.
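The shape of that request looks roughly like this — a sketch of packaging PDF bytes as a base64 document content block for the Messages API. The helper names are mine, and the real pipeline's prompts and tool schemas live in `propco.ingest.prompts`:

```python
import base64
from pathlib import Path


def pdf_content_block(pdf_path: Path) -> dict:
    """Package raw PDF bytes as a base64 document content block."""
    data = base64.standard_b64encode(pdf_path.read_bytes()).decode("ascii")
    return {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": data,
        },
    }


def build_messages(pdf_path: Path, instruction: str) -> list[dict]:
    # One user turn: the document block first, then the extraction prompt.
    return [{
        "role": "user",
        "content": [
            pdf_content_block(pdf_path),
            {"type": "text", "text": instruction},
        ],
    }]
```

These messages would then go to `client.messages.create(...)` along with the category-specific tool schema, so the response comes back as structured tool use rather than free text.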
## The two-tier model ladder
Every document goes to Haiku first. Haiku is fast and cheap, and for clean invoices with well-formatted line items it nails the fields on the first pass. If validation fails — missing required fields, amount doesn't reconcile, unit number outside known range — the document is retried on Sonnet. Sonnet eats the ambiguous ones (handwritten receipts, scanned utility bills where the meter read landed on a crease) and gets them right.
This ladder matters because the cost delta between Haiku and Sonnet is real, and on a 941-doc batch the savings compound fast. The validation layer is what makes the ladder safe — nothing slips through on Haiku without being audited against the rules.
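The ladder reduces to a short loop; here is a sketch with placeholder model IDs, where `extract` and `validate` stand in for the real API call and the deterministic rule checks:

```python
from dataclasses import dataclass
from typing import Callable

# Placeholder tier names, not real model IDs.
HAIKU = "haiku"
SONNET = "sonnet"


@dataclass
class Extraction:
    model: str
    fields: dict
    errors: list[str]


def extract_with_ladder(
    doc: bytes,
    extract: Callable[[bytes, str], dict],
    validate: Callable[[dict], list[str]],
) -> Extraction:
    """Try the cheap model first; escalate only when validation fails."""
    for model in (HAIKU, SONNET):
        fields = extract(doc, model)
        errors = validate(fields)  # deterministic rules, not self-reported confidence
        if not errors:
            return Extraction(model, fields, [])
    # Sonnet also failed validation: surface the errors for manual review.
    return Extraction(SONNET, fields, errors)
```

The key property is that the same `validate` gate runs on both tiers, so a cheap Haiku pass can never be accepted on weaker evidence than a Sonnet pass.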
## New files
- `packages/py/src/propco/ingest/ocr.py` — core pipeline: detect category, call Anthropic with the category-specific tool schema, validate, write sidecar.
- `packages/py/src/propco/ingest/prompts.py` — 8 per-category prompt templates (invoices, bank statements, compliance notices, contracts, …) plus matching tool-use schemas.
- `packages/py/src/propco/ingest/validation.py` — post-extraction validation rules and confidence computation. Everything deterministic.
- `packages/py/src/propco/models/document.py` — Pydantic models: `FieldValue`, `LineItem`, `DocumentFields`, `DocumentSidecar`, `OcrRunStats`.
- `CHANGELOG.md` — backfilled with 0.1.0 and now in Keep a Changelog format.
- `properties/bamboo-house/ocr.yaml` — per-property OCR config (which categories apply, which model ladder, batch size).
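To give a feel for the sidecar schema, here is a rough sketch using stdlib dataclasses rather than the real Pydantic models in `propco.models.document` — the model names come from the file list above, but every field choice here is a guess:

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class LineItem:
    description: str
    amount: Decimal  # Decimal, never float (see design decisions)
    unit_numbers: list[int] = field(default_factory=list)  # [] = property-wide


@dataclass
class FieldValue:
    value: str
    confidence: float  # computed from validation rules, not self-reported


@dataclass
class DocumentSidecar:
    category: str
    fields: dict[str, FieldValue]
    line_items: list[LineItem]
    model_used: str  # which tier of the ladder produced the accepted pass
```

In the real package these would be Pydantic models, which also supply the JSON serialization and the tool-use schema for the API call.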
## CLI
`propco ocr <slug> [--dry-run | --execute] [--batch-size N] [--category <name>]`

Dry run prints what it would process; `--execute` actually makes API calls. `--category` lets me pilot on just invoices first before turning the whole pipeline loose on every category.
## Verification
- 93 tests passing (25 pipeline + 37 validation + 31 existing reorg).
- `ruff check` clean.
- A deliberate pilot is up next: `propco ocr bamboo-house --execute --batch-size 20 --category invoices`, then eyeball the first 20 sidecars before scaling.
Once the sidecars land for the whole property, the next phase is wiring them into the Xero sync so invoices extracted here automatically become bills there. That's PR #4.