LLM Inference on Dedicated Bare Metal

Deploy and serve large language models on hardware you don't share with anyone.
Full VRAM, full bandwidth, consistent latency, every request.

Run your LLM pipeline on infrastructure built for it

Full GPU, Full VRAM

96 GB GDDR7 ECC per GPU, 768 GB across the 8-GPU server. No virtualisation layer, no shared tenants, no infrastructure-imposed memory limits.

FP4 & FP8 Precision

Native Blackwell precision support for modern LLM inference frameworks: vLLM, TGI, TensorRT-LLM, and Ollama (see the sketch below).

Predictable Cost

No egress fees, no hidden infrastructure costs. From $1.34/GPU/hr on a 12-month term.
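As a sketch of what deployment looks like with one of those frameworks, the snippet below uses vLLM's offline Python API to load FP8-quantised weights sharded across all eight GPUs. The model name is a placeholder for your own checkpoint, and all throughput settings are left at vLLM's defaults; treat it as a starting point rather than a production configuration.

```python
# Minimal vLLM sketch: FP8 weights, tensor-parallel across all 8 GPUs.
# The checkpoint name below is a placeholder; swap in your own model.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder checkpoint
    quantization="fp8",        # native FP8 on Blackwell-class GPUs
    tensor_parallel_size=8,    # shard weights across the 8 RTX Pro 6000s
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain bare metal inference in one sentence."], params)
print(outputs[0].outputs[0].text)
```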

Why bare metal matters for LLM inference

Shared GPU infrastructure introduces latency variance you cannot predict or control. When multiple tenants compete for the same memory bandwidth, inference throughput degrades, especially under load.

On 1Legion bare metal, you get the full GPU. No competing jobs. The same throughput at request 1 and at request 10,000.

The 8x RTX Pro 6000 Blackwell Max-Q server runs models of up to 70B parameters in full precision, and larger models with FP4 quantisation, all on dedicated hardware with direct engineering support.
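A back-of-the-envelope check on that claim, counting weight memory only and reading "full precision" as unquantised 16-bit weights (an assumption; KV cache and activations add overhead on top of these figures):

```python
# Weight-only memory for a 70B-parameter model at common precisions,
# against the server's 768 GB total VRAM (8 GPUs x 96 GB).
# KV cache, activations, and framework overhead are not counted here.
PARAMS = 70e9
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}

for precision, bytes_per in BYTES_PER_PARAM.items():
    gb = PARAMS * bytes_per / 1e9
    print(f"{precision}: {gb:.0f} GB of weights ({gb / 768:.0%} of 768 GB)")

# FP16: 140 GB (18%) -> a 70B model fits unquantised with ample headroom.
# FP4:   35 GB  (5%) -> room for far larger models, long contexts, batching.
```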

Apply for a Bare Metal Pilot →

Ready to deploy your LLM on bare metal?

Tell us about your model and workload.
We will match you with the right configuration.

Contact our Engineers