For most teams, "agent in the cloud" is the default. For banks, defence contractors, regulated healthcare, and a long tail of other compliance-bound industries, it is not an option. Here is the working on-prem agent stack in 2026 and the honest trade-offs against SaaS.
## Why on-prem at all
Three drivers, in declining order of frequency:
- Data residency / sovereignty — the data legally cannot leave a jurisdiction or a network.
- Latency — when agents need sub-100 ms responses inside a closed network.
- Risk posture — the security team will not approve any external dependency.
If none of those apply, do not go on-prem. The total cost is 2–4x SaaS for the same capability.
## The on-prem stack
Five layers:
- Model serving — vLLM, TGI, or a vendor-provided enterprise stack. Mistral, Llama, and select Anthropic models are available for on-prem licensing in 2026.
- MCP infrastructure — internal self-hosted registry, a gateway, the servers themselves.
- Memory layer — Postgres + pgvector or a self-hosted vector DB. See persistent memory architecture.
- Observability — Langfuse or self-hosted Phoenix. See observability platforms.
- Identity — your existing IdP, with agent SSO patterns.
Every layer has a SaaS analogue; pick the on-prem one only where required.
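That per-layer decision can be sketched as a simple lookup. The component names below are illustrative placeholders drawn from the examples above, not a mandated bill of materials:

```python
# Each layer has an on-prem option and a SaaS analogue; pick on-prem only where required.
LAYERS = {
    "model_serving": {"on_prem": "vLLM",                          "saas": "hosted API"},
    "mcp":           {"on_prem": "self-hosted registry + gateway", "saas": "managed MCP"},
    "memory":        {"on_prem": "Postgres + pgvector",            "saas": "managed vector DB"},
    "observability": {"on_prem": "Langfuse (self-hosted)",         "saas": "hosted tracing"},
    "identity":      {"on_prem": "existing IdP",                   "saas": "hosted IdP"},
}

def pick_stack(required_on_prem: set) -> dict:
    """Choose the on-prem option only for layers that legally or technically require it."""
    return {
        layer: opts["on_prem"] if layer in required_on_prem else opts["saas"]
        for layer, opts in LAYERS.items()
    }

# e.g. only the data-bearing layers must stay inside the network
stack = pick_stack({"model_serving", "memory"})
```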
## Hardware sizing
For a 50-developer engineering org running a coding agent on-prem:
| Component | Spec | Notes |
|---|---|---|
| Inference servers | 4–8x H100 or equivalent | Sized for Sonnet-class models |
| Memory DB | 32 GB RAM, 1 TB SSD | Postgres + pgvector |
| Observability | 16 GB RAM, 500 GB SSD | Langfuse-style stack |
| Gateway/auth | 8 GB RAM | Stateless, scales horizontally |
Roughly $250k–$500k in upfront capex, plus about 25% of capex annually for ops. The SaaS equivalent: $50k–$120k/yr in opex.
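The gap is easy to sanity-check with a three-year total-cost comparison using the figures above (applying the 25% ops rate to capex is an assumption about how the annual figure is calculated):

```python
def three_year_tco(capex: float, ops_rate: float = 0.25) -> float:
    """Upfront capex plus annual ops at ops_rate of capex, over three years."""
    return capex + 3 * ops_rate * capex

on_prem_low  = three_year_tco(250_000)   # 437,500
on_prem_high = three_year_tco(500_000)   # 875,000
saas_low, saas_high = 3 * 50_000, 3 * 120_000  # 150,000 .. 360,000
```

At these numbers the on-prem path lands at roughly 2.4–2.9x the SaaS spend over three years, consistent with the 2–4x rule of thumb earlier.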
## Model choice
Three viable paths in 2026:
- Open-weight models (Llama, Mistral, DeepSeek) — full control, weaker on agentic tasks than frontier models.
- Vendor on-prem (Anthropic Enterprise, Google Distributed Cloud) — better quality, strict licensing.
- Hybrid — on-prem for sensitive prompts, SaaS for public-data prompts.
Hybrid covers most regulated orgs without paying the full on-prem tax everywhere.
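A hybrid deployment needs a routing rule in front of both endpoints. A minimal sketch, assuming a crude keyword classifier and hypothetical endpoint URLs (a real deployment would use a proper data classifier, not substring matching):

```python
# Hypothetical markers for sensitive content; a real system would use a data classifier.
SENSITIVE_MARKERS = ("ssn", "account_number", "patient")

def route(prompt: str, contains_internal_data: bool) -> str:
    """Send sensitive prompts to the on-prem endpoint, everything else to SaaS."""
    if contains_internal_data or any(m in prompt.lower() for m in SENSITIVE_MARKERS):
        return "https://llm.internal.example/v1"   # on-prem inference
    return "https://api.vendor.example/v1"         # SaaS endpoint
```

The key property is fail-closed: anything flagged as internal data goes on-prem regardless of what the text classifier says.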
## Networking
The agent network must be:
- Air-gappable if your sector requires it. See air-gapped deployment.
- Egress-controlled so internal MCP servers cannot reach the public internet by default.
- mTLS everywhere between agent host, gateway, and MCP servers.
This is the cheapest part to get right early and the most expensive to retrofit.
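Two of those properties can be sketched in a few lines: a default-deny egress allowlist and an mTLS client context. Hostnames and file paths below are hypothetical; the `ssl` calls are Python standard library:

```python
import ssl

# Hypothetical internal hosts; egress is default-deny, allowlist-only.
ALLOWED_EGRESS = {"mirror.internal.example", "idp.internal.example"}

def egress_allowed(host: str) -> bool:
    """Only hosts on the explicit allowlist may be reached from inside the network."""
    return host in ALLOWED_EGRESS

def mtls_context(ca_path: str, cert_path: str, key_path: str) -> ssl.SSLContext:
    """Client context that presents a cert and verifies the server against an internal CA."""
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_path)
    ctx.load_cert_chain(certfile=cert_path, keyfile=key_path)
    ctx.verify_mode = ssl.CERT_REQUIRED
    ctx.check_hostname = True
    return ctx
```

In practice the allowlist lives in the network layer (firewall or service mesh), not application code; the sketch only shows the policy shape.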
## Observability on-prem
Three principles:
- Telemetry never leaves the network. Self-host the collector, the store, and the UI.
- Audit logs go to immutable storage with object-lock or WORM.
- SIEM integration uses your existing stack — do not build a parallel one.
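The immutability principle can be illustrated with a hash-chained audit log, where each record commits to its predecessor so tampering is detectable. This is a sketch of the idea, not a substitute for object-lock or WORM storage:

```python
import hashlib
import json

def _record_hash(event: dict, prev: str) -> str:
    payload = json.dumps({"event": event, "prev": prev}, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_audit(log: list, event: dict) -> list:
    """Append an event chained to the previous record's hash (tamper-evident)."""
    prev = log[-1]["hash"] if log else "0" * 64
    return log + [{"event": event, "prev": prev, "hash": _record_hash(event, prev)}]

def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev = "0" * 64
    for rec in log:
        if rec["prev"] != prev or rec["hash"] != _record_hash(rec["event"], prev):
            return False
        prev = rec["hash"]
    return True
```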
## Where it makes sense to compromise
Even on-prem deployments usually allow:
- Public docs retrieval through a cached, sanitised gateway.
- Code intelligence that reads public registry metadata.
- Model upgrade pulls via a controlled mirror.
A 100% air gap is possible (see the dedicated guide) but roughly 3x more expensive again.
## Operational reality
On-prem agents need:
- An ops rotation for the inference cluster.
- A relationship with the model vendor for security updates.
- A procedure for refreshing model weights without breaking your eval suite.
- A patching cadence that does not lag the SaaS equivalent by more than 30 days.
If you cannot staff this, on-prem is not the right answer.
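The patching-cadence requirement is the easiest of these to automate a check for. A trivial sketch, assuming you track the vendor's SaaS release date and your own deployment date:

```python
from datetime import date

MAX_LAG_DAYS = 30  # from the patching-cadence requirement above

def patch_lag_ok(saas_release: date, on_prem_deployed: date) -> bool:
    """Flag when the on-prem deployment lags the SaaS release by more than 30 days."""
    return (on_prem_deployed - saas_release).days <= MAX_LAG_DAYS
```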
## Common mistakes
- Underestimating ops cost — the inference cluster needs SREs, not just ML engineers.
- No model upgrade path — going on-prem and never updating is worse than SaaS.
- Missing observability — running blind on-prem because the SaaS option was excluded.
- Air-gapping by accident — losing the model upgrade path through over-strict egress rules.
## Where this is heading
Three trends to watch: turnkey on-prem agent platforms (Anthropic, NVIDIA, IBM all shipping in 2026), edge inference for sub-region residency, and managed-on-prem (vendor runs the stack inside your network). The last one becomes the dominant model by 2028.