After working a lot with GenAI over the last couple of years, I see that some organisations must keep data and LLM inference inside their own data centres for compliance, latency or cost reasons. Therefore I wanted to show how to build a PoC/MVP running locally, but also look at the enterprise ecosystem and what alternatives exist there.
First off, when running any LLM on private infrastructure you need to understand the use case, the users and the workload you want to run. There is no exact formula for what kind of hardware you need to run an LLM on your own infrastructure, but there are many online calculators available, such as Can You Run This LLM? VRAM Calculator (Nvidia GPU and Apple Silicon). The challenge is also to understand how much capacity a single user actually requires to generate output, and whether the use case is, for instance, a RAG application.

But a general rule of thumb when you are planning for hardware is the formula below (in combination with the online calculator 😉):
Required GPU memory (GB) ≈ (model parameters in billions × 2) × 1.20
The 2× factor corresponds to two bytes per parameter at FP16/BF16 precision; the +20 % covers activations, KV cache and framework overhead at small batch sizes.
As an example, a model like Llama 3.2 with 11 billion parameters would require at least 11 × 2 = 22 GB, plus 20 % overhead = 26.4 GB of GPU memory. It is therefore recommended to use at least a GPU like the NVIDIA A100, which has 80 GB of memory; that leaves enough headroom to both store the model and handle inference processing. However, this formula does not tell you how many tokens/sec the hardware will generate. Luckily NVIDIA has good benchmark content showing model performance on their GPUs at different context sizes (Performance — NVIDIA NIM LLMs Benchmarking)
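To make the rule of thumb concrete, here is a small Python helper (my own illustration, not from any particular library) that reproduces the calculation above; the bytes_per_param argument also hints at how quantisation changes the picture, which we come back to below.

```python
def required_vram_gb(params_billion: float,
                     bytes_per_param: float = 2.0,
                     overhead: float = 0.20) -> float:
    """Rule-of-thumb estimate: model weights at the given precision plus ~20 % overhead."""
    weights_gb = params_billion * bytes_per_param  # FP16/BF16 = 2 bytes per parameter
    return weights_gb * (1 + overhead)

# Llama 3.2 11B at FP16: (11 x 2) x 1.20 = 26.4 GB
print(f"{required_vram_gb(11):.1f} GB")

# The same model quantised to roughly 4-bit (~0.5 bytes per parameter): ~6.6 GB
print(f"{required_vram_gb(11, bytes_per_param=0.5):.1f} GB")
```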

From a hardware perspective, the interesting part is that AMD is now pretty well positioned with their new GPU cards, offering much higher memory per card than NVIDIA's. AMD's software stack ROCm, now released in version 7, also claims significant performance gains, as seen in the screenshot below from last week's conference.

That higher memory capacity allows larger models to run on a single GPU, as the comparison below shows.
GPU | GPU memory | GPU memory bandwidth |
---|---|---|
NVIDIA H100 | 80 GB | 3.35 TB/s |
NVIDIA H200 | 141 GB | 4.80 TB/s |
AMD MI300X | 192 GB | 5.30 TB/s |
AMD MI325X | 256 GB | 6.00 TB/s |
AMD MI400 (2026) | 432 GB | 19.6 TB/s |
The memory requirement also depends on the inference quantization used, and if you plan to do fine-tuning as well, that adds a lot of additional overhead on the GPU.
Lightweight, sub-10B-parameter models can now run on-device. Over the last few years a wave of runtimes has arrived (Ollama, LM Studio, Windows AI Foundry), each exposing an OpenAI-compatible REST/gRPC API. That means you can embed a local LLM in your app, then layer higher-level services such as retrieval-augmented generation (RAG) or agent frameworks on top without changing client code.
Just yesterday, Microsoft announced Mu, a new language model embedded into Windows (Introducing Mu language model and how it enabled the agent in Windows Settings | Windows Experience Blog), powered by the Copilot runtime on Windows that I've written about here –> How does Windows Recall work? – msandbu.org
Microsoft also introduced Windows AI Foundry, which provides an AI framework to manage and deploy models locally, similar to Ollama (What is Windows AI Foundry? | Microsoft Learn).
Ollama and LM Studio package the entire runtime, model registry, and REST gateway into a single install. After installation, Ollama starts as a background daemon (on Linux/macOS/Windows) that exposes an OpenAI-compatible endpoint on localhost. One command pulls the optimised weights and registers the service; a second command drops you into an interactive shell:
ollama pull llama3.2
ollama run llama3.2
Ollama also spins up an OpenAI-compatible REST API (default http://127.0.0.1:11434) that third-party tooling can consume. Point AnythingLLM or Continue (VS Code) at that endpoint and they will treat the local model exactly as they would ChatGPT; Continue even works as a drop-in GitHub Copilot replacement. The screenshot below illustrates AnythingLLM invoking Ollama for inference.
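As a minimal sketch, assuming you have already pulled llama3.2 and have the openai Python package installed, this is all it takes to call the local Ollama endpoint with the same client code you would use against the hosted OpenAI API:

```python
# pip install openai -- any OpenAI-compatible client works against Ollama's local endpoint
from openai import OpenAI

# Ollama's OpenAI-compatible API lives under /v1 on the default port; the API key is ignored
client = OpenAI(base_url="http://127.0.0.1:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2",  # any model you have pulled locally
    messages=[{"role": "user", "content": "Summarise why on-prem LLM inference matters."}],
)
print(response.choices[0].message.content)
```

Swapping base_url back to a hosted endpoint is the only change needed to move between local and cloud inference, which is exactly why these OpenAI-compatible runtimes are so convenient to build against.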
AnythingLLM ships with a built-in embedding pipeline: drop in PDFs, Word docs, or images and it vectorises the content so you can chat with your own data—all on the local machine. In effect, you get a self-contained ChatGPT clone without data ever leaving the device. However, runtimes like Ollama are optimised for a single user. If you need a centrally hosted assistant that serves multiple concurrent users, you’ll have to migrate to an inference layer designed for scale—e.g. vLLM, NVIDIA Triton, or a turnkey private-AI platform that handles batching, load balancing, and quota enforcement.
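AnythingLLM hides all of this, but to show roughly what such a pipeline does under the hood, here is a rough sketch against Ollama's endpoints; the embedding model nomic-embed-text and the documents are just examples I picked for illustration, not AnythingLLM's actual implementation:

```python
# pip install requests numpy -- rough sketch of a local RAG loop against Ollama
import requests
import numpy as np

OLLAMA = "http://127.0.0.1:11434"

def embed(text: str) -> np.ndarray:
    # Ollama's embeddings endpoint; "nomic-embed-text" must be pulled first (ollama pull nomic-embed-text)
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return np.array(r.json()["embedding"])

# 1. Vectorise the documents (this is what happens when you drop PDFs/Word docs into AnythingLLM)
docs = ["Our VPN policy requires MFA for all remote access.",
        "GPU clusters are patched every second Tuesday."]
doc_vectors = [embed(d) for d in docs]

# 2. Retrieve the most relevant chunk for a question using cosine similarity
question = "How often are the GPU clusters patched?"
q = embed(question)
scores = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))) for v in doc_vectors]
context = docs[int(np.argmax(scores))]

# 3. Ask the local model, grounded in the retrieved context
r = requests.post(f"{OLLAMA}/api/generate", json={
    "model": "llama3.2",
    "prompt": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
    "stream": False,
})
print(r.json()["response"])
```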
When a single-user runtime is no longer enough—think company-wide chatbots, developer copilots, or RAG services—you need an architecture and platform built for larger scale. That requires you to have the ability to scale across multiple nodes and GPUs.
Strategy | What it does | Tooling |
---|---|---|
Model parallelism | Shards one large model across several GPUs | TensorRT-LLM, Megatron-LM |
Data parallelism | Replicates the model; each replica handles different requests | vLLM |
Smart batching | Dynamically merges incoming requests into the running batch to boost throughput | vLLM "continuous batching" |
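As a sketch of what that scale-out layer looks like in practice, vLLM's offline Python API can shard a model across GPUs with a single argument; the model name below is only an example, and tensor_parallel_size=2 assumes a node with two GPUs:

```python
# pip install vllm -- requires a CUDA-capable node; the model name is just an example
from vllm import LLM, SamplingParams

# tensor_parallel_size shards the model weights across two GPUs in the same node
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)

sampling = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM batches these prompts together (continuous batching) instead of running them one by one
outputs = llm.generate(["What is MIG on NVIDIA GPUs?",
                        "Explain tensor parallelism in one sentence."], sampling)
for out in outputs:
    print(out.outputs[0].text)
```

For serving multiple users you would normally run the same engine behind vLLM's OpenAI-compatible HTTP server instead of the offline API, which again lets clients keep using the same OpenAI-style client code as in the Ollama example above.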
While there are many inference runtime engines, such as vLLM, NVIDIA's TensorRT-LLM, and the upcoming open-source distributed inference runtime llm-d (llm-d: Kubernetes-native distributed inferencing | Red Hat Developer), the interesting part is how many "Enterprise" solutions are already available in the market that provide GenAI and AI services on private infrastructure:
• HPE Private Cloud AI
• OpenShift Enterprise AI
• VMware Private Cloud AI
• Nutanix Enterprise AI
• Open-source with vLLM
• NVIDIA AI Enterprise (RUN:AI)
Some are more integrated than others: VMware combines everything with its virtualisation stack, while OpenShift approaches it from a more cloud-native perspective. Most of them use vLLM underneath, and if you want additional isolation between GenAI workloads to support multi-tenancy, you can also use NVIDIA MIG. MIG lets you run dedicated workloads in an isolated partition on the GPU (as seen below); the problem is that in most cases this limits the amount of GPU memory per partition so much that it can no longer hold the LLM.
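As a rough illustration of that memory trade-off, a sketch like the one below (using the nvidia-ml-py bindings on a MIG-enabled GPU you have already set up) lists how much memory each MIG partition actually gets, so you can compare it against the rule-of-thumb estimate from earlier:

```python
# pip install nvidia-ml-py -- rough sketch; requires an NVIDIA driver and a MIG-enabled GPU
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first physical GPU

current_mode, _pending = pynvml.nvmlDeviceGetMigMode(handle)
print(f"MIG enabled: {current_mode == 1}")  # 1 == MIG mode enabled

# Walk the MIG partitions and show how much memory each one actually gets
for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(handle)):
    try:
        mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(handle, i)
    except pynvml.NVMLError:
        continue  # slot not populated
    mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
    print(f"MIG device {i}: {mem.total / 1024**3:.1f} GB total, {mem.free / 1024**3:.1f} GB free")

pynvml.nvmlShutdown()
```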

Most of the platforms listed above provide management capabilities covering inference, fine-tuning and MLOps services, typically built on vLLM in combination with other inference runtimes.