Industry · June 11, 2026 · 9 min read · Mike Sadofyev

Private LLM Deployment: What a Self Hosted LLM Actually Takes in a Regulated Enterprise

Most of the AI conversations I have with regulated enterprises start the same way. Someone on the call says some version of: "We love what this can do. Now tell me it never leaves our network." That is the moment a generic AI project turns into a private LLM project.

A private LLM is exactly what it sounds like: a large language model that runs inside infrastructure you control, on data that never gets shipped to a third party's API. For a mining company sitting on decades of geological reports, or an insurer holding millions of claim files, that is not a nice-to-have. It is the only version of the project that can exist.

I have spent the last few years doing this kind of work with a team of 15+ senior specialists that has shipped 100+ enterprise AI projects, with backgrounds across PwC, Accenture, EY, KPMG, UniCredit, AngloAmerican, AkerBP, and XBRL International. The industries that keep showing up are the regulated, document-heavy ones: mining, energy, finance, insurance, manufacturing. They all eventually arrive at the same question. So this is the honest version of what a self hosted LLM deployment actually involves, and what we have learned shipping AI where the data cannot leave the building.

Why cloud APIs are a non-starter in regulated industries

The easy path is a cloud API. You send text out, you get text back, you pay per token, you ship in a week. For a startup that is the right call. For a bank's credit team or an insurer's claims department, it is a dead end, and not because anyone is being paranoid.

A few concrete reasons it stalls:

  • Data residency. The data is contractually or legally required to stay in a specific jurisdiction, sometimes on specific hardware. A US-region API endpoint fails the requirement before anyone reads the model card.
  • Confidentiality. Claim files, contracts, and engineering reports contain third-party personal and commercial data the company is not allowed to disclose to a vendor, full stop.
  • Auditability. Regulated teams have to explain what happened to a given document and why. "We sent it to a provider's API and they processed it somewhere" is not an answer an auditor accepts.
  • Vendor risk. A model provider's terms change, a region goes offline, a price moves. When the workflow is load-bearing, that dependency is itself a risk the security team will flag.

None of this is hypothetical. In every regulated engagement I have been in, the security review arrives early and it is decisive. If the architecture cannot answer "where does the data physically sit and who can touch it", the project does not move forward. That constraint is the starting point, not a footnote.

What "private LLM" actually means

"Private LLM" gets used as if it were one thing, but it is really a spectrum. Where a team lands on it depends on how strict the data rules are and how much they want to operate themselves. It helps to name the points on the line.

  • Local LLM on a workstation. A model running on a single analyst's machine or a small on-site box. Genuinely useful for prototyping, sensitive ad-hoc analysis, or a single power user. A local LLM keeps the data on one device, but it does not scale to a department and it has no governance around it. Good for an experiment, not for production.
  • Self hosted LLM in your VPC. The common production answer. The model runs in the company's own cloud account or private data center, inside the network boundary, behind existing access controls. Data stays within the perimeter the security team already governs. Most regulated deployments I have seen settle here, because it balances control with operability.
  • Fully air-gapped on premise LLM. No internet path at all. The model, the serving stack, and the data all live on hardware with no outbound connection. This is what you build for the most sensitive environments, where on premise AI is a hard requirement and even an outbound metrics call is unacceptable. It is the most work to stand up and maintain, and for some teams it is the only acceptable shape.

The practical job at the start of an engagement is figuring out which of these a workflow actually needs. Plenty of teams assume they need full air-gap and discover a self hosted LLM in their own VPC clears every requirement at a fraction of the operational cost. Some genuinely need the air-gap. Naming the real constraint early saves months.
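The "name the real constraint" step can be written down as a tiny decision rule. The sketch below is an illustration only: the two boolean fields and the `pick_tier` function are hypothetical simplifications, and a real security review weighs far more than two flags.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LOCAL_WORKSTATION = "local LLM on a workstation"
    SELF_HOSTED_VPC = "self hosted LLM in your VPC"
    AIR_GAPPED = "fully air-gapped on premise LLM"


@dataclass(frozen=True)
class WorkflowConstraints:
    # Hypothetical flags standing in for the outcome of a security review.
    outbound_network_forbidden: bool  # even an outbound metrics call is unacceptable
    needs_production_governance: bool  # department-scale use with audit and access control


def pick_tier(constraints: WorkflowConstraints) -> Tier:
    """Return the least operationally heavy tier that satisfies the constraints."""
    if constraints.outbound_network_forbidden:
        # The strictest rule wins: no internet path at all.
        return Tier.AIR_GAPPED
    if constraints.needs_production_governance:
        # Inside the perimeter the security team already governs.
        return Tier.SELF_HOSTED_VPC
    # Prototyping or a single power user.
    return Tier.LOCAL_WORKSTATION
```

The ordering matters: the function checks the strictest requirement first and otherwise falls through to the cheaper option, which mirrors the point that assuming full air-gap by default costs more than it needs to.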

The model-agnostic selection question

Once you are running the model yourself, a second question shows up: which model. My answer is that you should not pick one on principle. We are model-agnostic, and that is deliberate. Open-weight or commercial, the choice gets made per workflow against three constraints: accuracy on the actual documents, cost at the volume you expect, and the data-residency rules.

For a private LLM deployment, open-weight models have an obvious appeal because you can run them anywhere, including fully on premise, with no per-token meter and no data leaving. They have closed a lot of the quality gap. But "open-weight" is not automatically "better for you". For a hard extraction task, a commercial model run in a compliant private region sometimes wins on accuracy by enough to justify it, and the residency box still gets ticked. The right move is to test candidates on the team's own documents and let the numbers decide, instead of arguing brands in a slide.
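"Let the numbers decide" can be as simple as a small evaluation harness. The following is a minimal sketch under stated assumptions: `Candidate`, `LabeledDoc`, `field_accuracy`, and `rank_candidates` are hypothetical names, each candidate is wrapped as a plain function from document text to extracted fields, and accuracy is measured as exact field matches against hand-labeled documents. Real evaluations usually need fuzzier matching and a larger sample.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    extract: Callable[[str], dict]  # document text -> extracted fields
    cost_per_doc: float             # hypothetical unit cost at expected volume
    meets_residency: bool           # outcome of the data-residency review


@dataclass(frozen=True)
class LabeledDoc:
    text: str
    expected: dict  # hand-labeled ground truth for this document


def field_accuracy(candidate: Candidate, docs: Sequence[LabeledDoc]) -> float:
    """Share of expected fields the candidate extracts exactly right."""
    correct = 0
    total = 0
    for doc in docs:
        got = candidate.extract(doc.text)
        for key, want in doc.expected.items():
            total += 1
            if got.get(key) == want:
                correct += 1
    return correct / total if total else 0.0


def rank_candidates(
    candidates: Sequence[Candidate],
    docs: Sequence[LabeledDoc],
    max_cost_per_doc: float,
) -> list:
    """Drop candidates that fail residency or cost, then rank by accuracy."""
    eligible = [
        c for c in candidates
        if c.meets_residency and c.cost_per_doc <= max_cost_per_doc
    ]
    scored = [(field_accuracy(c, docs), c) for c in eligible]
    # Highest accuracy first; cheaper candidate wins a tie.
    scored.sort(key=lambda pair: (-pair[0], pair[1].cost_per_doc))
    return [(c.name, accuracy) for accuracy, c in scored]
```

Residency and cost act as hard filters here and accuracy as the ranking criterion, which is one reasonable reading of the three constraints. A team that is willing to trade accuracy for cost would replace the sort key with a weighted score.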

This is also where the unglamorous engineering lives. A model is the smallest part of a working system. The document processing around it - parsing messy PDFs, scans, tables, mixed-language files - is usually what makes or breaks accuracy. We open-sourced our core for that layer (docfold) along with a scraping engine (scrapefold), precisely because that layer is reusable across whichever model you end up serving.

What the deployment actually looks like

A private LLM on its own does nothing useful. The value is in the system around it, and in regulated work most of that system is the governance layer. The model is maybe a fifth of the build.

The shape of a typical deployment:

Layer | What it does
Ingestion | Pulls documents from existing systems, normalizes formats, handles scans and tables
Model serving | Runs the chosen model inside the VPC or on-prem, behind the network boundary
Extraction and validation | Turns documents into structured, checked data with confidence scores
Governance | Audit trails, data lineage, access control, human-in-the-loop
Integration | Writes results back into the systems people already use

The governance toolkit is not optional decoration. Audit trails and data lineage let the team prove what happened to any document. Access control keeps the model and its outputs in the right hands. Human-in-the-loop is the part teams underestimate: you do not automate a regulated decision and walk away. The system handles the confident cases and routes the low-confidence ones to a person. That design is what makes the whole thing deployable in an environment that answers to auditors, and it is how we keep deployments SOC 2-aligned.
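The routing rule described above is small enough to sketch. This is an illustration, not the production logic: the `Extraction` record, the `route` function, and the 0.9 threshold are all assumptions, and in practice the threshold is tuned per document type against measured error rates.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Extraction:
    document_id: str
    fields: dict
    confidence: float  # 0.0-1.0, as reported by the extraction and validation layer


def route(extractions: Iterable[Extraction], threshold: float = 0.9):
    """Split results into an automated queue and a human review queue."""
    automated = []
    needs_review = []
    for item in extractions:
        if not 0.0 <= item.confidence <= 1.0:
            raise ValueError(
                f"confidence out of range for {item.document_id}: {item.confidence}"
            )
        if item.confidence >= threshold:
            automated.append(item)
        else:
            # Low-confidence cases always go to a person.
            needs_review.append(item)
    return automated, needs_review
```

Rejecting out-of-range scores instead of silently clamping them is a deliberate choice in this sketch: a malformed confidence value is a bug upstream, and in a regulated workflow it should surface loudly.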

Here is the proof point I keep coming back to. We built a production deployment for a top-5 insurer handling claims across 9 document types. Before, a case took 2-3 days to process by hand, with roughly a 15% error rate. Now 98% of the processing is automated, each claim runs in seconds, and 40+ FTEs were redeployed to higher-value work. The human-in-the-loop piece is doing exactly its job: people review only the low-confidence items, not the whole queue. That is the pattern for AI for insurance and, more broadly, for any document-heavy regulated workflow. None of it would have been allowed to ship without the private, governed architecture underneath. The 98% automation figure is the headline, but the architecture is the reason it exists.

What it costs in effort, honestly

I will not put money figures on this, because they vary too much to be useful. But I can be honest about effort, measured in weeks and roles, because that is what teams actually need to plan around.

A self hosted LLM deployment is heavier than a cloud API call in three ways:

  • Infrastructure. Someone has to stand up and run the serving stack. In a VPC that is a few weeks of an ML engineer's time. Fully air-gapped, it takes longer, because you also own the update path and the hardware.
  • The document layer. Parsing real enterprise documents reliably is usually the longest pole. This is engineering work on the ingestion and extraction layers, not a model you download.
  • Governance. Audit, lineage, access control, and the human-in-the-loop workflow are real build, and they involve the security and compliance people, not just engineers.
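To give a feel for why the governance item is real build, here is a minimal sketch of one piece of it: an append-only audit trail where each entry carries a hash of the previous one, so later tampering is detectable. Hash chaining is a technique swapped in for illustration, not a description of any specific deployment. The `AuditTrail` class and its field names are hypothetical, and a production version would also need durable storage, access control, and trusted timestamps.

```python
import hashlib
import json
from dataclasses import dataclass, field


def _digest(payload: dict) -> str:
    """Stable SHA-256 over a JSON-serializable dict."""
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


@dataclass
class AuditTrail:
    entries: list = field(default_factory=list)

    def record(self, document_id: str, actor: str, action: str, timestamp: str) -> dict:
        """Append an entry linked to the previous one by its hash."""
        prev_hash = self.entries[-1]["hash"] if self.entries else ""
        body = {
            "document_id": document_id,
            "actor": actor,        # a system component or a human reviewer
            "action": action,      # e.g. "ingested", "extracted", "approved"
            "timestamp": timestamp,
            "prev_hash": prev_hash,
        }
        entry = {**body, "hash": _digest(body)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash and link; False if anything was altered."""
        prev_hash = ""
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev_hash"] != prev_hash or _digest(body) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True

    def lineage(self, document_id: str) -> list:
        """Everything that happened to one document, in order."""
        return [e for e in self.entries if e["document_id"] == document_id]
```

The `lineage` method is what answers the auditor's question about a single file, and `verify` is what lets the team claim the answer has not been edited after the fact.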

The trade you are making is clear. More upfront effort, in exchange for a system that can actually run on your data inside your walls. For the workflows that justify a private LLM at all, that trade is the entire point. The companies that try to shortcut it with a cloud API usually come back six months later to do it properly.

How to start small

The mistake I see most is trying to boil the ocean: a twelve-month "AI platform" program before anyone has proven a single workflow works on the real documents. We do the opposite, and I would recommend it even to teams who never work with us.

The path we use:

  1. AI Readiness Scan (1-2 weeks). Look at the actual workflows, data, and constraints. Decide where a private LLM genuinely earns its keep, and where it does not. Name the real data-residency requirement instead of assuming the strictest one.
  2. Proof-of-Value Sprint (2-4 weeks). Take one painful, high-volume workflow and build a working version on real documents, with the governance scaffolding in place. Small scope, real data, measurable result.
  3. AI Deployment Program. Only after the proof of value clears does the full self hosted or on premise deployment make sense, scoped to what the proof of value actually showed.

The point of starting small is that you learn what the real constraints are with weeks of effort instead of quarters. Most of the surprises in private LLM deployment - a document format nobody mentioned, a residency rule that rewrites the architecture, an accuracy gap on the messy 10% of files - show up in the first sprint. Far cheaper to find them there.

Where to start

If you are in a regulated, document-heavy business and you have a workflow that cannot use a cloud API, that is exactly the kind of problem a private LLM is built for. The honest answer to "can we do this on-prem" is almost always yes, and the honest follow-up is that the model is the easy part.

Bring your hardest workflow. The one with nine document types and a two-day turnaround and an auditor who wants to know what happened to every file. Let us look at it together, and we will tell you what a private deployment would actually take, in weeks and roles, no slideware.

Book a discovery call and we will start with the scan.
