For most teams, "agent in the cloud" is the default. For banks, defence contractors, regulated healthcare, and a long tail of other compliance-bound industries, it is not an option. Here is the working on-prem agent stack in 2026 and the honest trade-offs against SaaS.
## Why on-prem at all
Three drivers, in declining order of frequency:
- Data residency / sovereignty — the data legally cannot leave a jurisdiction or a network.
- Latency — when agents need sub-100 ms responses inside a closed network.
- Risk posture — the security team will not approve any external dependency.
If none of those apply, do not go on-prem. The total cost is 2–4x SaaS for the same capability.
## The on-prem stack
Five layers:
- Model serving — vLLM, TGI, or a vendor-provided enterprise stack. Mistral, Llama, and select Anthropic models are available for on-prem licensing in 2026.
- MCP infrastructure — internal self-hosted registry, a gateway, the servers themselves.
- Memory layer — Postgres + pgvector or a self-hosted vector DB. See persistent memory architecture.
- Observability — Langfuse or self-hosted Phoenix. See observability platforms.
- Identity — your existing IdP, with agent SSO patterns.
Every layer has a SaaS analogue; pick the on-prem one only where required.
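That per-layer decision can be sketched as a simple lookup. The component names below are illustrative placeholders drawn from the examples above, not a mandated bill of materials:

```python
# Each layer has an on-prem option and a SaaS analogue; pick on-prem only where required.
LAYERS = {
    "model_serving": {"on_prem": "vLLM",                          "saas": "hosted API"},
    "mcp":           {"on_prem": "self-hosted registry + gateway", "saas": "managed MCP"},
    "memory":        {"on_prem": "Postgres + pgvector",            "saas": "managed vector DB"},
    "observability": {"on_prem": "Langfuse (self-hosted)",         "saas": "hosted tracing"},
    "identity":      {"on_prem": "existing IdP",                   "saas": "hosted IdP"},
}

def pick_stack(required_on_prem: set) -> dict:
    """Choose the on-prem option only for layers that legally or technically require it."""
    return {
        layer: opts["on_prem"] if layer in required_on_prem else opts["saas"]
        for layer, opts in LAYERS.items()
    }

# e.g. only the data-bearing layers must stay inside the network
stack = pick_stack({"model_serving", "memory"})
```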
## Hardware sizing
For a 50-developer engineering org running a coding agent on-prem:
| Component | Spec | Notes |
|---|---|---|
| Inference servers | 4–8x H100 or equivalent | Sized for Sonnet-class models |
| Memory DB | 32 GB RAM, 1 TB SSD | Postgres + pgvector |
| Observability | 16 GB RAM, 500 GB SSD | Langfuse-style stack |
| Gateway/auth | 8 GB RAM | Stateless, scales horizontally |
Roughly $250k–$500k in upfront capex, plus about 25% of capex annually for ops. The SaaS equivalent: $50k–$120k/yr in opex.
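The gap is easy to sanity-check with a three-year total-cost comparison using the figures above (applying the 25% ops rate to capex is an assumption about how the annual figure is calculated):

```python
def three_year_tco(capex: float, ops_rate: float = 0.25) -> float:
    """Upfront capex plus annual ops at ops_rate of capex, over three years."""
    return capex + 3 * ops_rate * capex

on_prem_low  = three_year_tco(250_000)   # 437,500
on_prem_high = three_year_tco(500_000)   # 875,000
saas_low, saas_high = 3 * 50_000, 3 * 120_000  # 150,000 .. 360,000
```

At these numbers the on-prem path lands at roughly 2.4–2.9x the SaaS spend over three years, consistent with the 2–4x rule of thumb earlier.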
## Model choice
Three viable paths in 2026:
- Open-weight models (Llama, Mistral, DeepSeek) — full control, weaker on agentic tasks than frontier models.
- Vendor on-prem (Anthropic Enterprise, Google Distributed Cloud) — better quality, strict licensing.
- Hybrid — on-prem for sensitive prompts, SaaS for public-data prompts.
Hybrid covers most regulated orgs without paying the full on-prem tax everywhere.
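A hybrid deployment needs a routing rule in front of both endpoints. A minimal sketch, assuming a crude keyword classifier and hypothetical endpoint URLs (a real deployment would use a proper data classifier, not substring matching):

```python
# Hypothetical markers for sensitive content; a real system would use a data classifier.
SENSITIVE_MARKERS = ("ssn", "account_number", "patient")

def route(prompt: str, contains_internal_data: bool) -> str:
    """Send sensitive prompts to the on-prem endpoint, everything else to SaaS."""
    if contains_internal_data or any(m in prompt.lower() for m in SENSITIVE_MARKERS):
        return "https://llm.internal.example/v1"   # on-prem inference
    return "https://api.vendor.example/v1"         # SaaS endpoint
```

The key property is fail-closed: anything flagged as internal data goes on-prem regardless of what the text classifier says.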
## Networking
The agent network must be:
- Air-gappable if your sector requires it. See air-gapped deployment.
- Egress-controlled so internal MCP servers cannot reach the public internet by default.
- mTLS everywhere between agent host, gateway, and MCP servers.
This is the cheapest part to get right early and the most expensive to retrofit.
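Two of those properties can be sketched in a few lines: a default-deny egress allowlist and an mTLS client context. Hostnames and file paths below are hypothetical; the `ssl` calls are Python standard library:

```python
import ssl

# Hypothetical internal hosts; egress is default-deny, allowlist-only.
ALLOWED_EGRESS = {"mirror.internal.example", "idp.internal.example"}

def egress_allowed(host: str) -> bool:
    """Only hosts on the explicit allowlist may be reached from inside the network."""
    return host in ALLOWED_EGRESS

def mtls_context(ca_path: str, cert_path: str, key_path: str) -> ssl.SSLContext:
    """Client context that presents a cert and verifies the server against an internal CA."""
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_path)
    ctx.load_cert_chain(certfile=cert_path, keyfile=key_path)
    ctx.verify_mode = ssl.CERT_REQUIRED
    ctx.check_hostname = True
    return ctx
```

In practice the allowlist lives in the network layer (firewall or service mesh), not application code; the sketch only shows the policy shape.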
## Observability on-prem
Three principles:
- Telemetry never leaves the network. Self-host the collector, the store, and the UI.
- Audit logs go to immutable storage with object-lock or WORM.
- SIEM integration uses your existing stack — do not build a parallel one.
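The immutability principle can be illustrated with a hash-chained audit log, where each record commits to its predecessor so tampering is detectable. This is a sketch of the idea, not a substitute for object-lock or WORM storage:

```python
import hashlib
import json

def _record_hash(event: dict, prev: str) -> str:
    payload = json.dumps({"event": event, "prev": prev}, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_audit(log: list, event: dict) -> list:
    """Append an event chained to the previous record's hash (tamper-evident)."""
    prev = log[-1]["hash"] if log else "0" * 64
    return log + [{"event": event, "prev": prev, "hash": _record_hash(event, prev)}]

def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev = "0" * 64
    for rec in log:
        if rec["prev"] != prev or rec["hash"] != _record_hash(rec["event"], prev):
            return False
        prev = rec["hash"]
    return True
```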
## Where it makes sense to compromise
Even on-prem deployments usually allow:
- Public docs retrieval through a cached, sanitised gateway.
- Code intelligence that reads public registry metadata.
- Model upgrade pulls via a controlled mirror.
A 100% air gap is possible (see the dedicated guide) but roughly 3x more expensive again.
## Operational reality
On-prem agents need:
- An ops rotation for the inference cluster.
- A relationship with the model vendor for security updates.
- A procedure for refreshing model weights without breaking your eval suite.
- A patching cadence that does not lag the SaaS equivalent by more than 30 days.
If you cannot staff this, on-prem is not the right answer.
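The patching-cadence requirement is the easiest of these to automate a check for. A trivial sketch, assuming you track the vendor's SaaS release date and your own deployment date:

```python
from datetime import date

MAX_LAG_DAYS = 30  # from the patching-cadence requirement above

def patch_lag_ok(saas_release: date, on_prem_deployed: date) -> bool:
    """Flag when the on-prem deployment lags the SaaS release by more than 30 days."""
    return (on_prem_deployed - saas_release).days <= MAX_LAG_DAYS
```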
## Common mistakes
- Underestimating ops cost — the inference cluster needs SREs, not just ML engineers.
- No model upgrade path — going on-prem and never updating is worse than SaaS.
- Missing observability — running blind on-prem because the SaaS option was excluded.
- Air-gapping by accident — losing the model upgrade path through over-strict egress rules.
## Where this is heading
Three trends to watch: turnkey on-prem agent platforms (Anthropic, NVIDIA, IBM all shipping in 2026), edge inference for sub-region residency, and managed-on-prem (vendor runs the stack inside your network). The last one becomes the dominant model by 2028.