A financial-services firm in the UAE used to process Arabic bank statements by hand. Roughly 200 documents a month, read line by line by people. Six months after we put a document data extraction pipeline in front of that work, they were processing 4,000 documents a month. Same team. More than ten times the throughput. Still running in production today.
This is the anatomy of that 10x. Not a demo, not a benchmark on clean PDFs. Messy, multi-format, Arabic-language financial documents, in a regulated business where a wrong number is a real problem.
What follows is what we actually built, what made the difference, and what did not. The interesting part is not the bank statements. It is that the same pattern handles invoices, onboarding forms, and a dozen other document types once you get the shape right.
The starting point: 200 a month, all by hand
When we walked in, the process was simple and slow. A document arrived. A person opened it, read it, typed the fields they needed into an internal system, and moved to the next one. Bank statement extraction was a fully manual job.
That works fine at low volume. It does not scale. Every new account, every new banking partner, every busy month meant more hours of the same careful, boring, error-prone reading. Hiring more people to read more pages is not a strategy. It is a ceiling.
The firm did not have a data problem in the abstract sense. They had a very specific bottleneck: a human eye was the only thing standing between a stack of PDFs and a structured record they could act on.
Why bank statement extraction is harder than it looks
People assume a bank statement is a solved format. It is not. There is no standard.
- Every bank lays out its statements differently. Columns move, headers change, date formats vary, running balances appear or do not.
- Scan quality is all over the place. Some documents are crisp exports, others are photos of printouts, skewed and compressed.
- Tables break across pages, with totals that have to reconcile against line items that started two pages earlier.
- Arabic adds a real layer of difficulty: right-to-left text, a different script, and mixed RTL and Latin content in the same document when amounts and codes are involved.
Plain OCR gets you characters off the page. It does not get you a correct, structured statement you can trust in a financial workflow. The gap between "we read the pixels" and "we have a reconciled record" is the entire job. That gap is where naive tools fall over and where intelligent document processing earns its name.
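To make the mixed-script point concrete, here is a minimal sketch of one small normalization step: turning an amount written with Arabic-Indic digits and Arabic separators into a number you can reconcile. The function name `parse_amount` is hypothetical, and the sketch assumes commas are thousands separators and periods are decimal points once normalized. It illustrates the kind of cleanup involved, not the production code.

```python
from decimal import Decimal

# Map Arabic-Indic (U+0660..U+0669) and Extended Arabic-Indic (U+06F0..U+06F9)
# digits to ASCII digits.
_DIGITS = {base + i: str(i) for base in (0x0660, 0x06F0) for i in range(10)}

# Arabic decimal separator (U+066B) and Arabic thousands separator (U+066C).
_SEPARATORS = {0x066B: ".", 0x066C: ","}

# Invisible bidirectional control characters that often surround numbers in
# right-to-left text. Mapping a code point to None deletes it in str.translate.
_BIDI_CONTROLS = dict.fromkeys(
    [0x061C, 0x200E, 0x200F, *range(0x202A, 0x202F), *range(0x2066, 0x206A)]
)

_TABLE = {**_DIGITS, **_SEPARATORS, **_BIDI_CONTROLS}


def parse_amount(raw: str) -> Decimal:
    """Normalize a mixed-script amount string and parse it as a Decimal.

    Assumption: after normalization, ',' groups thousands and '.' marks decimals.
    """
    text = raw.translate(_TABLE)
    text = text.replace(",", "").replace(" ", "").strip()
    return Decimal(text)


# The escaped string below is 1,234.50 written with Arabic-Indic digits
# and Arabic separators.
print(parse_amount("\u0661\u066c\u0662\u0663\u0664\u066b\u0665\u0660"))  # 1234.50
```

Using `Decimal` rather than `float` keeps amounts exact, which matters once the values feed a reconciliation check.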
What we built: extract, validate, human-in-the-loop
The pipeline has four stages, and the order matters.
- Extract. Pull the raw fields and tables off each document. This is the part everyone thinks is the whole job. It is maybe a quarter of it.
- Validate. Check what came out against rules. Do the line items reconcile with the stated balance? Are the dates in range? Does the document type match what was expected? Validation is where confidence gets assigned to every field.
- Human-in-the-loop on low-confidence items. High-confidence extractions flow straight through. Anything the pipeline is unsure about gets routed to a person, with the document and the flagged field side by side. People stop reading 4,000 documents and start reviewing the handful that are genuinely ambiguous.
- Audit trail. Every field, every decision, every human override is logged. In a regulated business you do not get to say "the AI did it." You have to show what was extracted, what was checked, and who touched it.
The human-in-the-loop stage is the part that makes the whole thing safe to run at volume. We did not remove people. We moved them from reading everything to judging the few cases worth their judgment.
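The routing and audit stages described above can be sketched in a few lines. This is an illustrative outline under stated assumptions, not the pipeline itself: the class names, the `REVIEW_THRESHOLD` value of 0.90, and the action labels are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

REVIEW_THRESHOLD = 0.90  # hypothetical cut-off; a real one is tuned per document type


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, assumed to be assigned during validation


@dataclass
class AuditEvent:
    document_id: str
    field_name: str
    action: str   # "auto_accepted", "queued_for_review" or "human_override"
    actor: str    # "pipeline" or a reviewer identifier
    value: str
    timestamp: str


@dataclass
class RoutingResult:
    accepted: list = field(default_factory=list)
    needs_review: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


def route_fields(document_id, fields, threshold=REVIEW_THRESHOLD) -> RoutingResult:
    """Pass high-confidence fields through, queue the rest, and log every decision."""
    result = RoutingResult()
    for item in fields:
        if item.confidence >= threshold:
            result.accepted.append(item)
            action = "auto_accepted"
        else:
            result.needs_review.append(item)
            action = "queued_for_review"
        result.audit_log.append(
            AuditEvent(document_id, item.name, action, "pipeline", item.value, _now())
        )
    return result


def record_override(result, document_id, item, new_value, reviewer) -> None:
    """Apply a reviewer's correction and record who made it."""
    item.value = new_value
    result.needs_review.remove(item)
    result.accepted.append(item)
    result.audit_log.append(
        AuditEvent(document_id, item.name, "human_override", reviewer, new_value, _now())
    )
```

The point of the design is that the audit log is written in the same place the routing decision is made, so there is no path through the code that changes a field without leaving a record.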
From invoice data extraction to KYC automation: same pattern, different documents
Here is the part that travels. Once the extract-validate-review-audit pattern works, the document type becomes a configuration detail, not a rewrite.
We run the same shape in production across very different documents:
| Document type | What changes | What stays the same |
|---|---|---|
| Bank statements | Layout per bank, reconciliation rules | Extract, validate, review, audit |
| Invoices | Vendor formats, tax fields, totals | Extract, validate, review, audit |
| Acceptance documents | Approval fields, signatures | Extract, validate, review, audit |
| Construction estimates and specifications | Line-item structure, units | Extract, validate, review, audit |
| KYC and onboarding applications | Identity fields, company vs individual | Extract, validate, review, audit |
Invoice data extraction is the same problem wearing different clothes: different vendors, different tax fields, different totals to reconcile, but the same extract-then-validate-then-review loop. KYC automation is the same again: pull the identity and company fields off an onboarding application, validate them against your rules, and escalate anything that does not look right to a person before an account opens.
This is why we treat document type as a starting point and not a product boundary. In insurance we ran the same idea across nine document types in a claims flow and got to 98% automation. See our insurance work for that one. The lesson is consistent: the pattern is portable, the tuning is per-document.
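One way to picture "document type as configuration" is a registry that maps each type to its required fields and rules, with a single validation function shared by all of them. The sketch below is an assumption about how such a registry could look. The field names, the rule, and the `CONFIGS` entries are hypothetical examples, not a schema from the project.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Rule = Callable[[Record], bool]


@dataclass(frozen=True)
class DocumentConfig:
    required_fields: Tuple[str, ...]
    rules: Tuple[Rule, ...] = ()


def total_is_non_negative(record: Record) -> bool:
    """Hypothetical rule: the 'total' field parses as a number and is not negative."""
    try:
        return Decimal(record["total"]) >= 0
    except InvalidOperation:
        return False


# What changes per document type lives here; the validate() function does not change.
CONFIGS: Dict[str, DocumentConfig] = {
    "invoice": DocumentConfig(("vendor", "total", "tax"), (total_is_non_negative,)),
    "kyc_application": DocumentConfig(("full_name", "id_number")),
}


def validate(doc_type: str, record: Record) -> List[str]:
    """Return a list of problems; an empty list means the record passed."""
    config = CONFIGS[doc_type]
    problems = [
        f"missing field: {name}"
        for name in config.required_fields
        if name not in record
    ]
    if problems:
        return problems  # rules assume the required fields are present
    problems.extend(
        f"failed rule: {rule.__name__}" for rule in config.rules if not rule(record)
    )
    return problems
```

Adding a new document type in this sketch means adding one entry to `CONFIGS` and any rules it needs, which is the sense in which the tuning is per-document while the pattern stays the same.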
What made it 10x, and what did not
The 10x did not come from a smarter model. Models matter, but they are not the lever people think they are.
What actually moved the number:
- Validation rules tuned to the documents. The reconciliation and range checks caught the errors that would otherwise have forced a human to re-read everything. Cheap to add, huge in effect.
- Confidence routing. Sending only the uncertain items to people is the difference between reviewing 4,000 documents and reviewing a small fraction of them.
- Handling the messy cases on purpose. Skewed scans, broken tables, mixed-script Arabic. The volume lives in the messy documents, so the messy documents are where the work is.
What did not matter as much as expected:
- Chasing a marginally better OCR engine. Better pixel reading on already-readable documents is a rounding error next to good validation.
- A bigger, more expensive model. We are model-agnostic for a reason. The architecture around the model carries the result, and the firm wanted it running on-premise anyway.
The honest summary: the 10x is mostly engineering discipline around the model, not the model itself.
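The reconciliation and range checks mentioned above are simple to express. Here is a minimal sketch, assuming a statement has an opening balance, a closing balance, a period, and signed transaction amounts (credits positive, debits negative). The `Statement` and `Transaction` types and the `check_statement` function are hypothetical names for illustration.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import List


@dataclass
class Transaction:
    posted: date
    amount: Decimal  # assumption: credits are positive, debits are negative


@dataclass
class Statement:
    period_start: date
    period_end: date
    opening_balance: Decimal
    closing_balance: Decimal
    transactions: List[Transaction]


def check_statement(stmt: Statement, tolerance: Decimal = Decimal("0")) -> List[str]:
    """Return a list of problems; an empty list means the statement reconciles."""
    problems = []

    # Reconciliation: opening balance plus all line items should equal the
    # stated closing balance.
    expected = stmt.opening_balance + sum(
        (t.amount for t in stmt.transactions), Decimal("0")
    )
    if abs(expected - stmt.closing_balance) > tolerance:
        problems.append(
            f"balance mismatch: expected {expected}, stated {stmt.closing_balance}"
        )

    # Range check: every transaction date should fall inside the statement period.
    for index, t in enumerate(stmt.transactions):
        if not (stmt.period_start <= t.posted <= stmt.period_end):
            problems.append(
                f"transaction {index} dated {t.posted} is outside the statement period"
            )

    return problems
```

A failed check does not say which field is wrong, only that something is. That is enough to lower confidence on the document and send it to a reviewer instead of letting it flow through.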
How to test this on your own documents
If you have a pile of documents and a team reading them by hand, you do not need a year-long program to find out whether this works for you. You need a small, real test on your actual documents.
The open-source core of our extraction work lives in docfold. It is the document-processing foundation we build on, and it is public, so you can look before you talk to anyone.
The way we run an engagement is deliberately short at the front:
- AI Readiness Scan (1 to 2 weeks). We look at your documents and your process and tell you, honestly, whether this is a fit.
- Proof-of-Value Sprint (2 to 4 weeks). We build a working pipeline on a slice of your real documents so you see the actual numbers, not a slide.
- Deployment. If the proof holds up, we put it into production, on-premise if you need it.
Our team has run more than 100 enterprise AI projects, senior people only, in document-heavy regulated industries. We have seen most of the ways this goes wrong, which is mostly how you learn to avoid them.
If you have the document pile and want to know what a real pipeline does with it, book a call and bring a few of your messiest examples.