Guide · 3 min read

On-premise agent infrastructure: when the cloud is not an option

Some agents cannot leave your network. Banking, defence, healthcare — the on-prem stack for production agents in 2026: model hosting, MCP infrastructure, observability, and the trade-offs vs SaaS.

For most teams, "agent in the cloud" is the default. For banks, defence contractors, healthcare providers and a long tail of regulated industries, it is not an option. Here is the working on-prem agent stack in 2026, and the honest trade-offs against SaaS.

Why on-prem at all

Three drivers, in declining order of frequency:

  • Data residency / sovereignty — the data legally cannot leave a jurisdiction or a network.
  • Latency — you need sub-100 ms agent responses inside a closed network.
  • Risk posture — the security team will not approve any external dependency.

If none of those apply, do not go on-prem. The total cost is 2–4x SaaS for the same capability.

The on-prem stack

Five layers:

  1. Model serving — vLLM, TGI, or a vendor-provided enterprise stack. Mistral, Llama, and select Anthropic models are available for on-prem licensing in 2026.
  2. MCP infrastructure — internal self-hosted registry, a gateway, the servers themselves.
  3. Memory layer — Postgres + pgvector or a self-hosted vector DB. See persistent memory architecture.
  4. Observability — Langfuse or self-hosted Phoenix. See observability platforms.
  5. Identity — your existing IdP, with agent SSO patterns.

Every layer has a SaaS analogue; pick the on-prem one only where required.
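To make layer 1 concrete: vLLM exposes an OpenAI-compatible HTTP API, so agent code only needs an internal base URL. A minimal sketch with the standard library; the `inference.internal` host and the model name are hypothetical placeholders for your own deployment:

```python
import json
import urllib.request

# Hypothetical internal endpoint; substitute your own host and port.
VLLM_URL = "https://inference.internal:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible chat payload for a self-hosted vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
    }

def post_chat(payload: dict) -> dict:
    """POST the payload to the internal inference server and parse the reply."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Because the API surface matches the hosted one, swapping a SaaS endpoint for this internal one is a base-URL change, not a rewrite.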

Hardware sizing

For a 50-developer engineering org running a coding agent on-prem:

| Component | Spec | Notes |
| --- | --- | --- |
| Inference servers | 4–8x H100 or equivalent | Sized for Sonnet-class models |
| Memory DB | 32 GB RAM, 1 TB SSD | Postgres + pgvector |
| Observability | 16 GB RAM, 500 GB SSD | Langfuse-style stack |
| Gateway/auth | 8 GB RAM | Stateless, scales horizontally |
Roughly $250k–$500k upfront capex, plus about 25% of that annually for ops. SaaS equivalent: $50k–$120k/yr opex.
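A back-of-envelope comparison over five years, using the figures above (all numbers are rough ranges, not quotes):

```python
def five_year_cost(capex: float, ops_rate: float = 0.25, years: int = 5) -> float:
    """Upfront hardware plus annual operations, modelled as a flat fraction of capex."""
    return capex + capex * ops_rate * years

onprem = (five_year_cost(250_000), five_year_cost(500_000))  # 562,500 .. 1,125,000
saas = (50_000 * 5, 120_000 * 5)                             # 250,000 .. 600,000
```

On these assumptions the five-year on-prem total lands at roughly 2x the comparable SaaS spend, at the low end of the 2–4x figure cited above.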

Model choice

Three viable paths in 2026:

  • Open-weight models (Llama, Mistral, DeepSeek) — full control, weaker on agentic tasks than frontier models.
  • Vendor on-prem (Anthropic Enterprise, Google Distributed Cloud) — better quality, strict licensing.
  • Hybrid — on-prem for sensitive prompts, SaaS for public-data prompts.

Hybrid covers most regulated orgs without paying the full on-prem tax everywhere.
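A hybrid router can be as simple as a sensitivity check in front of two endpoints. The keyword list below is a stand-in for a real DLP classifier, and both names are illustrative:

```python
# Stand-in for a proper data-classification service; do not ship keyword
# matching as your only control.
SENSITIVE_MARKERS = {"customer", "patient", "account"}

def route(prompt: str) -> str:
    """Send anything that looks sensitive on-prem; everything else to SaaS."""
    if any(marker in prompt.lower() for marker in SENSITIVE_MARKERS):
        return "on-prem"
    return "saas"
```

The design point is that routing happens before the prompt leaves the network, so a misclassification fails closed only if you default to "on-prem" when the classifier is unsure.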

Networking

The agent network must be:

  • Air-gappable if your sector requires it. See air-gapped deployment.
  • Egress-controlled so internal MCP servers cannot reach the public internet by default.
  • mTLS everywhere between agent host, gateway, and MCP servers.

This is the cheapest part to get right early and the most expensive to retrofit.
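Client-side, mTLS between agent host and gateway is a few lines with Python's standard library. A sketch, assuming an internal CA bundle and a client certificate issued by it (the file paths you pass in are deployment-specific):

```python
import ssl

def mtls_client_context(ca_file=None, client_cert=None, client_key=None):
    """TLS context for agent-to-gateway calls: verify the server against an
    internal CA and, when a certificate is supplied, present our own
    (that second half is what makes it mutual TLS)."""
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if client_cert:
        ctx.load_cert_chain(certfile=client_cert, keyfile=client_key)
    return ctx
```

Server verification stays on by default; the context refuses connections to anything the internal CA has not signed.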

Observability on-prem

Three principles:

  1. Telemetry never leaves the network. Self-host the collector, the store, and the UI.
  2. Audit logs go to immutable storage with object-lock or WORM.
  3. SIEM integration uses your existing stack — do not build a parallel one.
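Object-lock storage handles immutability at rest; chaining each audit record to the previous one makes tampering detectable at the application layer too. A minimal sketch of that chaining:

```python
import hashlib
import json

def append_audit(log: list, event: dict) -> dict:
    """Append an event chained to the previous record's hash, so any later
    edit to an earlier record breaks the chain and is detectable."""
    prev = log[-1]["hash"] if log else "0" * 64
    record = {"event": event, "prev": prev}
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    log.append(record)
    return record
```

Verification is a linear walk re-hashing each record and comparing `prev` pointers; the WORM store guarantees the chain itself cannot be rewritten.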

Where it makes sense to compromise

Even on-prem deployments usually allow:

  • Public docs retrieval through a cached, sanitised gateway.
  • Code intelligence that reads public registry metadata.
  • Model upgrade pulls via a controlled mirror.

A 100% air gap is possible (see the dedicated guide) but roughly 3x more expensive again.
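The sanitised gateway's egress policy boils down to a default-deny allowlist. A sketch; the hosts listed are illustrative, and a real gateway would load them from config:

```python
from urllib.parse import urlparse

# Illustrative allowlist; in practice this comes from reviewed configuration.
ALLOWED_HOSTS = {"docs.python.org", "registry.npmjs.org"}

def egress_allowed(url: str) -> bool:
    """Default-deny: only hosts on the explicit allowlist are reachable."""
    return urlparse(url).hostname in ALLOWED_HOSTS
```

Note the default-deny shape: a host missing from the list is blocked, which is also exactly how you air-gap by accident if the list never gets maintained.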

Operational reality

On-prem agents need:

  • An ops rotation for the inference cluster.
  • A relationship with the model vendor for security updates.
  • A procedure for refreshing model weights without breaking eval.
  • A patching cadence that does not lag the SaaS equivalent by more than 30 days.

If you cannot staff this, on-prem is not the right answer.
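The 30-day patching cadence is easy to enforce as a check in your ops tooling. A trivial sketch:

```python
from datetime import date

def patch_lag_ok(last_patch: date, today: date, max_days: int = 30) -> bool:
    """Flag when the on-prem stack has drifted more than max_days behind."""
    return (today - last_patch).days <= max_days
```

Wire this into whatever alerting you already run; the point is that lag is measured and paged on, not noticed during an incident.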

Common mistakes

  • Underestimating ops cost — the inference cluster needs SREs, not just ML engineers.
  • No model upgrade path — going on-prem and never updating is worse than SaaS.
  • Missing observability — running blind on-prem because the SaaS option was excluded.
  • Air-gapping by accident — losing the model upgrade path through over-strict egress rules.

Where this is heading

Three trends to watch: turnkey on-prem agent platforms (Anthropic, NVIDIA, IBM all shipping in 2026), edge inference for sub-region residency, and managed-on-prem (vendor runs the stack inside your network). The last one becomes the dominant model by 2028.


© 2026 Loadout. Built on Angular 21 SSR.