Teaching AI to Read Cursive: The MDL Transcription Pipeline
1. Project Overview
The Marshall District Library (MDL) archive in Marshall, MI (archive.yourmdl.org) holds thousands of scanned historical documents — school board ledgers, financial records, and handwritten letters dating back to the 1800s. The content had been digitized but was not machine-readable: cursive handwriting from that era defeats traditional OCR entirely.
I built two custom Omeka S modules — TranscriptionManager and TypesenseSync — that together form a pipeline: scan a document, transcribe it with Claude Vision via AWS Lambda, run it through human review, publish it, and index it for sub-50ms full-text search. Researchers can now search for a name that only appears in a 150-year-old handwritten ledger and find it.
Live site: archive.yourmdl.org
2. Challenge / Problem
The archive had two problems:
Locked content. Thousands of pages of handwritten cursive sat behind scanned PDFs. The information existed — names, dates, financial records, meeting minutes — but none of it was searchable. A researcher could only find a document if a librarian had manually tagged it with the right metadata. The actual handwritten text was invisible to search.
Inadequate search. Even where metadata existed, Omeka's built-in search was slow and lacked fuzzy matching, relevance ranking, and faceted filtering. Searching for "Walter Martin" wouldn't find "W. Martin" or tolerate a typo.
Traditional OCR tools like Tesseract can handle printed text, but 1800s cursive — with inconsistent letterforms, faded ink, and writer-specific styles — breaks them completely. The gap between what was in the archive and what was discoverable was enormous.
3. Design Decisions
Why decouple AI processing via Lambda instead of running it in PHP?
Omeka S is a PHP application with no built-in support for background workers or long-running processes. A single transcription job can take minutes across dozens of pages. Running that synchronously in PHP would block the web server and time out.
I used a chain-dispatch pattern: the module sends the first page to an AWS Lambda function with a short cURL timeout, then forgets about it. When Lambda finishes a page, it sends results back via webhook and the module fires the next page. Each page triggers the next one sequentially. The Omeka server never blocks, the archivist can close the browser, and the job keeps running.
This also keeps costs predictable — sequential processing avoids rate limits and makes per-page costs easy to track.
Why human-in-the-loop review?
Claude Vision is surprisingly good at 1800s cursive, but it's not perfect. It occasionally misreads faded letters, hallucinates text on damaged pages, or misaligns columns in dense ledger entries. Publishing raw AI output to a research archive would undermine trust.
Every transcription goes through a side-by-side review interface: original scan on the left, AI-generated text on the right in an editable field. The archivist reads along, corrects mistakes, and publishes only what they've verified. This isn't a limitation — it's the design. Archivists need the final say.
Why Typesense over Omeka's built-in search?
- Speed: Sub-50ms queries across the full archive vs. multi-second responses from MySQL full-text search
- Typo tolerance: Fuzzy matching handles misspellings automatically
- Faceted filtering: Filter by collection, document type, date range, tags
- Relevance ranking: Configurable field weights — titles rank highest, then extracted names, then metadata, then transcription text
- Highlighting: Search terms highlighted in results
I considered Elasticsearch but chose Typesense for lower resource requirements, simpler setup, and strong TypeScript support on the frontend.
4. Architecture Overview
Core pattern: Omeka S (headless CMS) → AWS Lambda (AI processing) → Human Review → Typesense (search index) → Next.js (frontend)
The Pipeline
Scan document → Upload to Omeka S → Select pages for transcription
→ Dispatch to Lambda (Claude Vision) → Webhook returns results
→ Archivist reviews side-by-side → Publish transcription
→ TypesenseSync indexes content → Searchable on archive.yourmdl.org
TranscriptionManager Module (PHP)
- Adds transcription workflow UI to Omeka S admin
- Page selection interface with thumbnails
- Dispatches pages to API Gateway → Lambda via chain-dispatch pattern
- HMAC-SHA256 signed requests in both directions
- Webhook handler stores results and triggers next page
- Side-by-side review interface (scan image + editable transcription)
- Draft → Review → Publish lifecycle with unpublish support
- Name annotation workflow: sends transcriptions back to Claude for bounding box coordinates, archivist reviews with SVG overlay tool
- Published transcriptions become first-class Omeka media resources
TypesenseSync Module (PHP)
- Hooks into Omeka S events (item created, updated, deleted)
- Keeps Typesense search index in sync with Omeka data including transcription content
- Weighted field ranking: title → extracted names → metadata → transcription text
- Admin UI for manual sync trigger
- Logs sync status and errors to Omeka admin dashboard
Lambda Function (Python)
- Receives page image via API Gateway
- Renders PDF page as PNG at 150 DPI using PyMuPDF (auto-downscales if >5MB)
- Base64-encodes and sends to Claude Vision API with cursive-tuned prompt
- Returns structured JSON: transcribed text, extracted person names, readability notes
- HMAC-SHA256 validation on incoming requests
- Sends results back to Omeka via signed webhook
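The auto-downscaling step above can be sketched in Python. This is a simplified illustration, not the production function: the step-down factor, DPI floor, and function names are assumptions, and the renderer is injected so the logic stands alone (with PyMuPDF it would be roughly `page.get_pixmap(dpi=dpi).tobytes("png")`).

```python
MAX_BYTES = 5 * 1024 * 1024  # payload ceiling noted above (>5MB triggers downscale)

def render_under_limit(render_png, start_dpi=150, min_dpi=72):
    """Render a page as PNG, stepping the DPI down until it fits.

    render_png(dpi) -> bytes is the injected renderer, so this sketch
    runs without PyMuPDF installed.
    """
    dpi = start_dpi
    while dpi >= min_dpi:
        png = render_png(dpi)
        if len(png) <= MAX_BYTES:
            return png, dpi
        dpi = int(dpi * 0.8)  # shrink ~20% per attempt (illustrative factor)
    raise ValueError("page cannot be rendered under the size limit")
```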
Security
Every request between Omeka and Lambda is signed with HMAC-SHA256 using a shared secret. The module signs outgoing requests; Lambda validates them. Webhooks returning results use the same pattern in reverse. No unsigned requests get through.
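The signing scheme is standard HMAC-SHA256, so it can be sketched with Python's standard library. The exact header name and payload framing the modules use are not shown here; this only illustrates the sign/verify pair, with constant-time comparison on the receiving side.

```python
import hashlib
import hmac

def sign(body: bytes, secret: bytes) -> str:
    """Compute the signature an outgoing request would carry."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify(body: bytes, secret: bytes, signature: str) -> bool:
    """Receiving side (Lambda, or the webhook handler): reject anything
    unsigned or tampered with, using a timing-safe comparison."""
    return hmac.compare_digest(sign(body, secret), signature)
```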
5. Implementation Highlights
Chain-Dispatch Pattern
The key architectural decision. PHP can't run background jobs natively, and Omeka doesn't have a queue system. Instead of bolting on a job runner, I made the webhook handler do double duty: store the result for the completed page, then immediately dispatch the next page. The chain runs itself.
This means a 40-page document processes as 40 sequential Lambda invocations, each triggered by the previous one's webhook. The Omeka server handles only short HTTP requests — never a long-running process.
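The pattern is easiest to see as code. The sketch below is Python with illustrative names (the production module is PHP, and the "Lambda" here is a direct call standing in for the real HTTPS round trip): the webhook handler stores the finished page, then immediately dispatches the next one, so the chain drives itself.

```python
results = {}  # page index -> transcription result
pages = ["p1.png", "p2.png", "p3.png"]

def dispatch_page(index):
    """Fire-and-forget: in production, a short-timeout HTTP call to
    API Gateway. Here we invoke the Lambda stand-in directly."""
    if index < len(pages):
        lambda_stub(index, pages[index])

def lambda_stub(index, page):
    """Stand-in for the Lambda transcriber, which calls back into the
    webhook over signed HTTPS when the page is done."""
    webhook_handler(index, f"transcribed text of {page}")

def webhook_handler(index, text):
    """The double duty: persist this page's result, then kick off the
    next page. No long-running process ever exists on the Omeka side."""
    results[index] = text
    dispatch_page(index + 1)

dispatch_page(0)  # start the chain with the first page
```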
Name Annotation System
Beyond transcription, the module has a "Find Names" workflow. It sends completed transcriptions back to Claude, asking for bounding box coordinates of person names on the page image. The Lambda returns annotations with normalized coordinates (percentages of page dimensions), confidence scores, and parsed name components.
Archivists review annotations in a two-panel interface — page image with SVG overlay boxes on the left, approval list on the right. They can approve, reject, adjust, or manually draw annotations Claude missed. Approved annotations publish through a public API that the Next.js frontend consumes as clickable overlays on document pages.
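Because the Lambda returns coordinates as percentages of page dimensions, the frontend can scale an annotation onto any rendered image size. The annotation shape below is an assumption based on the description above, not the actual API schema:

```python
def to_svg_rect(annotation, img_width, img_height):
    """Map a normalized annotation (x, y, w, h as percentages 0-100 of
    the page) to pixel-space attributes for an SVG <rect> overlay."""
    return {
        "x": annotation["x"] / 100 * img_width,
        "y": annotation["y"] / 100 * img_height,
        "width": annotation["w"] / 100 * img_width,
        "height": annotation["h"] / 100 * img_height,
    }
```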
Typesense Weighted Search
The search index uses field weights to rank results sensibly. A search for "Walter Martin" returns results in this priority order:
- Documents with that name in the title (highest weight)
- Documents where the name was extracted by the annotation system
- Documents with the name in metadata fields
- Documents where the name appears somewhere in the transcription text (lowest weight)
This means a researcher finds the most relevant document first, not just the one with the most mentions.
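In Typesense terms, this ranking comes down to parallel `query_by` and `query_by_weights` lists, where a higher weight boosts matches in that field. The field names and weight values below are illustrative, not the production configuration:

```python
# Illustrative Typesense search parameters for the weighted ranking
# described above (field names and weights are assumptions).
search_params = {
    "q": "Walter Martin",
    "query_by": "title,extracted_names,metadata,transcription",
    "query_by_weights": "4,3,2,1",  # title outranks raw transcription text
    "num_typos": 2,                 # fuzzy matching for misspellings
    "highlight_full_fields": "transcription",
}
```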
6. Technical Stack Summary
Omeka S Modules (PHP):
- TranscriptionManager — transcription workflow, review UI, annotation system
- TypesenseSync — search index synchronization, weighted ranking
AI Processing:
- AWS Lambda (Python)
- Claude Vision API (Anthropic)
- PyMuPDF for PDF rendering
Search:
- Typesense (self-hosted on EC2)
- Fuzzy matching, faceted filtering, field-weight ranking
Infrastructure:
- AWS API Gateway, Lambda, EC2
- HMAC-SHA256 request signing
- Webhook-based async communication
Frontend:
- Next.js (App Router) on Vercel
- Typesense client for search
- SVG annotation overlays
7. Conclusion
This pipeline turns locked historical documents into discoverable archive content. Researchers can now search for names, dates, and terms that only existed in 150-year-old cursive handwriting — content that was completely invisible to search before.
The architecture reflects a few principles I keep coming back to: decouple where the platform constrains you (Lambda for AI processing because PHP can't do background work), keep humans in the loop where accuracy matters more than speed (archivist review before publish), and choose tools that solve the actual problem (Typesense for search performance Omeka couldn't provide).
Both modules are in production at archive.yourmdl.org, processing real documents for real researchers. The collection is expanding — more ledgers, letters, and records are queued for transcription.