Teaching AI to Read Cursive: The MDL Transcription Pipeline
1. Project Overview
The Marshall District Library (MDL) archive in Marshall, MI (archive.yourmdl.org) holds thousands of scanned historical documents — school board ledgers, financial records, and handwritten letters dating back to the 1800s. The content had been digitized but was not machine-readable: cursive handwriting from that era defeats traditional OCR entirely.
I built two custom Omeka S modules — TranscriptionManager and TypesenseSync — that together form a pipeline: scan a document, transcribe it with Claude Vision via AWS Lambda, run it through human review, publish it, and index it for sub-50ms full-text search. Researchers can now search for a name that only appears in a 150-year-old handwritten ledger and find it.
Live site: archive.yourmdl.org
2. Challenge / Problem
The archive had two problems:
Locked content. Thousands of pages of handwritten cursive sat behind scanned PDFs. The information existed — names, dates, financial records, meeting minutes — but none of it was searchable. A researcher could only find a document if a librarian had manually tagged it with the right metadata. The actual handwritten text was invisible to search.
Inadequate search. Even where metadata existed, Omeka's built-in search was slow and lacked fuzzy matching, relevance ranking, and faceted filtering. Searching for "Walter Martin" wouldn't find "W. Martin" or tolerate a typo.
Traditional OCR tools like Tesseract can handle printed text, but 1800s cursive — with inconsistent letterforms, faded ink, and writer-specific styles — breaks them completely. The gap between what was in the archive and what was discoverable was enormous.
3. Design Decisions
Why decouple AI processing via Lambda instead of running it in PHP?
Omeka S is a PHP application with no built-in support for background workers or long-running processes. A single transcription job can take minutes across dozens of pages. Running that synchronously in PHP would block the web server and time out.
I used a chain-dispatch pattern: the module sends the first page to an AWS Lambda function with a short cURL timeout, then forgets about it. When Lambda finishes a page, it sends results back via webhook and the module fires the next page. Each page triggers the next one sequentially. The Omeka server never blocks, the archivist can close the browser, and the job keeps running.
This also keeps costs predictable — sequential processing avoids rate limits and makes per-page costs easy to track.
Why human-in-the-loop review?
Claude Vision is surprisingly good at 1800s cursive, but it's not perfect. It occasionally misreads faded letters, hallucinates text on damaged pages, or misaligns columns in dense ledger entries. Publishing raw AI output to a research archive would undermine trust.
Every transcription goes through a side-by-side review interface: original scan on the left, AI-generated text on the right in an editable field. The archivist reads along, corrects mistakes, and publishes only what they've verified. This isn't a limitation — it's the design. Archivists need the final say.
Why Typesense over Omeka's built-in search?
- Speed: Sub-50ms queries across the full archive vs. multi-second responses from MySQL full-text search
- Typo tolerance: Fuzzy matching handles misspellings automatically
- Faceted filtering: Filter by collection, document type, date range, tags
- Relevance ranking: Configurable field weights — titles rank highest, then extracted names, then metadata, then transcription text
- Highlighting: Search terms highlighted in results
I considered Elasticsearch but chose Typesense for lower resource requirements, simpler setup, and strong TypeScript support on the frontend.
4. Architecture Overview
Core pattern: Omeka S (headless CMS) → AWS Lambda (AI processing) → Human Review → Typesense (search index) → Next.js (frontend)
The Pipeline
Scan document → Upload to Omeka S → Select pages for transcription
→ Dispatch to Lambda (Claude Vision) → Webhook returns results
→ Archivist reviews side-by-side → Publish transcription
→ TypesenseSync indexes content → Searchable on archive.yourmdl.org
TranscriptionManager Module (PHP)
- Adds transcription workflow UI to Omeka S admin
- Page selection interface with thumbnails
- Dispatches pages to API Gateway → Lambda via chain-dispatch pattern
- HMAC-SHA256 signed requests in both directions
- Webhook handler stores results and triggers next page
- Side-by-side review interface (scan image + editable transcription)
- Draft → Review → Publish lifecycle with unpublish support
- Name annotation workflow: sends transcriptions back to Claude for bounding box coordinates, archivist reviews with SVG overlay tool
- Published transcriptions become first-class Omeka media resources
TypesenseSync Module (PHP)
- Hooks into Omeka S events (item created, updated, deleted)
- Keeps Typesense search index in sync with Omeka data including transcription content
- Weighted field ranking: title → extracted names → metadata → transcription text
- Admin UI for manual sync trigger
- Logs sync status and errors to Omeka admin dashboard
Lambda Function (Python)
- Receives page image via API Gateway
- Renders PDF page as PNG at 150 DPI using PyMuPDF (auto-downscales if >5MB)
- Base64-encodes and sends to Claude Vision API with cursive-tuned prompt
- Returns structured JSON: transcribed text, extracted person names, readability notes
- HMAC-SHA256 validation on incoming requests
- Sends results back to Omeka via signed webhook
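The auto-downscaling step above can be sketched in Python. This is a simplified illustration, not the production function: the step-down factor, DPI floor, and function names are assumptions, and the renderer is injected so the logic stands alone (with PyMuPDF it would be roughly `page.get_pixmap(dpi=dpi).tobytes("png")`).

```python
MAX_BYTES = 5 * 1024 * 1024  # payload ceiling noted above (>5MB triggers downscale)

def render_under_limit(render_png, start_dpi=150, min_dpi=72):
    """Render a page as PNG, stepping the DPI down until it fits.

    render_png(dpi) -> bytes is the injected renderer, so this sketch
    runs without PyMuPDF installed.
    """
    dpi = start_dpi
    while dpi >= min_dpi:
        png = render_png(dpi)
        if len(png) <= MAX_BYTES:
            return png, dpi
        dpi = int(dpi * 0.8)  # shrink ~20% per attempt (illustrative factor)
    raise ValueError("page cannot be rendered under the size limit")
```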
Security
Every request between Omeka and Lambda is signed with HMAC-SHA256 using a shared secret. The module signs outgoing requests; Lambda validates them. Webhooks returning results use the same pattern in reverse. No unsigned requests get through.
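The signing scheme is standard HMAC-SHA256, so it can be sketched with Python's standard library. The exact header name and payload framing the modules use are not shown here; this only illustrates the sign/verify pair, with constant-time comparison on the receiving side.

```python
import hashlib
import hmac

def sign(body: bytes, secret: bytes) -> str:
    """Compute the signature an outgoing request would carry."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify(body: bytes, secret: bytes, signature: str) -> bool:
    """Receiving side (Lambda, or the webhook handler): reject anything
    unsigned or tampered with, using a timing-safe comparison."""
    return hmac.compare_digest(sign(body, secret), signature)
```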
5. Implementation Highlights
Chain-Dispatch Pattern
The key architectural decision. PHP can't run background jobs natively, and Omeka doesn't have a queue system. Instead of bolting on a job runner, I made the webhook handler do double duty: store the result for the completed page, then immediately dispatch the next page. The chain runs itself.
This means a 40-page document processes as 40 sequential Lambda invocations, each triggered by the previous one's webhook. The Omeka server handles only short HTTP requests — never a long-running process.
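The pattern is easiest to see as code. The sketch below is Python with illustrative names (the production module is PHP, and the "Lambda" here is a direct call standing in for the real HTTPS round trip): the webhook handler stores the finished page, then immediately dispatches the next one, so the chain drives itself.

```python
results = {}  # page index -> transcription result
pages = ["p1.png", "p2.png", "p3.png"]

def dispatch_page(index):
    """Fire-and-forget: in production, a short-timeout HTTP call to
    API Gateway. Here we invoke the Lambda stand-in directly."""
    if index < len(pages):
        lambda_stub(index, pages[index])

def lambda_stub(index, page):
    """Stand-in for the Lambda transcriber, which calls back into the
    webhook over signed HTTPS when the page is done."""
    webhook_handler(index, f"transcribed text of {page}")

def webhook_handler(index, text):
    """The double duty: persist this page's result, then kick off the
    next page. No long-running process ever exists on the Omeka side."""
    results[index] = text
    dispatch_page(index + 1)

dispatch_page(0)  # start the chain with the first page
```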
Name Annotation System
Beyond transcription, the module has a "Find Names" workflow. It sends completed transcriptions back to Claude, asking for bounding box coordinates of person names on the page image. The Lambda returns annotations with normalized coordinates (percentages of page dimensions), confidence scores, and parsed name components.
Archivists review annotations in a two-panel interface — page image with SVG overlay boxes on the left, approval list on the right. They can approve, reject, adjust, or manually draw annotations Claude missed. Approved annotations publish through a public API that the Next.js frontend consumes as clickable overlays on document pages.
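Because the Lambda returns coordinates as percentages of page dimensions, the frontend can scale an annotation onto any rendered image size. The annotation shape below is an assumption based on the description above, not the actual API schema:

```python
def to_svg_rect(annotation, img_width, img_height):
    """Map a normalized annotation (x, y, w, h as percentages 0-100 of
    the page) to pixel-space attributes for an SVG <rect> overlay."""
    return {
        "x": annotation["x"] / 100 * img_width,
        "y": annotation["y"] / 100 * img_height,
        "width": annotation["w"] / 100 * img_width,
        "height": annotation["h"] / 100 * img_height,
    }
```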
Typesense Weighted Search
The search index uses field weights to rank results sensibly. A search for "Walter Martin" returns results in this priority order:
- Documents with that name in the title (highest weight)
- Documents where the name was extracted by the annotation system
- Documents with the name in metadata fields
- Documents where the name appears somewhere in the transcription text (lowest weight)
This means a researcher finds the most relevant document first, not just the one with the most mentions.
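In Typesense terms, this ranking comes down to parallel `query_by` and `query_by_weights` lists, where a higher weight boosts matches in that field. The field names and weight values below are illustrative, not the production configuration:

```python
# Illustrative Typesense search parameters for the weighted ranking
# described above (field names and weights are assumptions).
search_params = {
    "q": "Walter Martin",
    "query_by": "title,extracted_names,metadata,transcription",
    "query_by_weights": "4,3,2,1",  # title outranks raw transcription text
    "num_typos": 2,                 # fuzzy matching for misspellings
    "highlight_full_fields": "transcription",
}
```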
6. Technical Stack Summary
Omeka S Modules (PHP):
- TranscriptionManager — transcription workflow, review UI, annotation system
- TypesenseSync — search index synchronization, weighted ranking
AI Processing:
- AWS Lambda (Python)
- Claude Vision API (Anthropic)
- PyMuPDF for PDF rendering
Search:
- Typesense (self-hosted on EC2)
- Fuzzy matching, faceted filtering, field-weight ranking
Infrastructure:
- AWS API Gateway, Lambda, EC2
- HMAC-SHA256 request signing
- Webhook-based async communication
Frontend:
- Next.js (App Router) on Vercel
- Typesense client for search
- SVG annotation overlays
7. Conclusion
This pipeline turns locked historical documents into discoverable archive content. Researchers can now search for names, dates, and terms that only existed in 150-year-old cursive handwriting — content that was completely invisible to search before.
The architecture reflects a few principles I keep coming back to: decouple where the platform constrains you (Lambda for AI processing because PHP can't do background work), keep humans in the loop where accuracy matters more than speed (archivist review before publish), and choose tools that solve the actual problem (Typesense for search performance Omeka couldn't provide).
Both modules are in production at archive.yourmdl.org, processing real documents for real researchers. The collection is expanding — more ledgers, letters, and records are queued for transcription.