r/LocalLLaMA Shares a University-Hospital Stack Serving 1B+ Tokens Per Day Locally
Original: Serving 1B+ tokens/day locally in my research lab
A high-signal self-post on r/LocalLLaMA offers one of the more concrete deployment write-ups the community has seen this week. The author says they lead a research lab at a university hospital and have spent the last few weeks refining an internal LLM server that now processes more than 1 billion tokens per day locally. The split is roughly two-thirds prefill (prompt ingestion) and one-third decode, and the post is written as an operations note for people trying to run similar stacks rather than as a simple benchmark boast.
The hardware is modest outside the GPUs: 2x H200, 124 GB of RAM, a 16-core CPU, and 512 GB of disk. After testing Qwen 3 models, GLM-Air, and GPT-OSS, the author settled on GPT-OSS-120B. The argument is practical: the model delivered roughly 220 to 250 tok/s for single-user decode, followed JSON instructions well, handled tool calling reliably enough for the team's workflows, and benefited from the fact that its deployed weights match the published evaluations. On this setup, the author found mxfp4 on Hopper clearly better optimized than other paths they had tried.
The architecture is equally pragmatic. LiteLLM sits in front as the OpenAI-compatible proxy handling keys, rate limits, routing, and the priority queue. Behind it are two vLLM instances, one per GPU, with PostgreSQL for usage tracking, Prometheus and Grafana for observability, and MkDocs for internal documentation. The author specifically chose one replica per GPU instead of tensor parallel across both cards because GPT-OSS-120B fits comfortably on a single H200 in mxfp4 and the independent replicas avoid NCCL overhead. With LiteLLM's simple-shuffle routing, the prompt-token split after about six days was reportedly 2.10B versus 2.11B, which is almost perfectly balanced.
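A minimal sketch of what that front end might look like, assuming LiteLLM's YAML proxy config with two local vLLM replicas behind one model alias (the alias, ports, and file names here are illustrative; the post does not reproduce its actual config):

```shell
# Hypothetical LiteLLM setup: one OpenAI-compatible alias fanned out over two
# independent vLLM replicas, with LiteLLM's default simple-shuffle routing.
cat > litellm_config.yaml <<'EOF'
model_list:
  - model_name: gpt-oss-120b            # alias clients see
    litellm_params:
      model: openai/gpt-oss-120b
      api_base: http://localhost:8000/v1   # replica on GPU 0
  - model_name: gpt-oss-120b
    litellm_params:
      model: openai/gpt-oss-120b
      api_base: http://localhost:8001/v1   # replica on GPU 1
router_settings:
  routing_strategy: simple-shuffle      # random-shuffle load balancing
EOF
litellm --config litellm_config.yaml --port 4000
```

With simple-shuffle, each request is assigned to a replica at random rather than by measured load, which is consistent with the near-even 2.10B/2.11B prompt-token split the author reports.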
The operational details are what made the post travel. The configuration uses mxfp4 quantization, a 128k context window, 0.80 GPU-memory utilization, chunked prefill, prefix caching, and 128 max sequences per instance, plus environment variables such as VLLM_USE_FLASHINFER_MXFP4_MOE=1 and NCCL_P2P_DISABLE=1. The author argues that decode throughput, not KV cache, is the real limit in this setup, and that leaving 20% VRAM headroom helps absorb logprobs-related memory spikes without materially hurting steady-state throughput.
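The post names the settings but not an exact command line; a plausible reconstruction of one replica's launch, assuming vLLM's standard CLI flags, would look like this:

```shell
# Hypothetical launch line for one of the two replicas; flag names are vLLM's
# standard engine arguments, mapped from the settings described in the post.
export VLLM_USE_FLASHINFER_MXFP4_MOE=1   # from the post: FlashInfer mxfp4 MoE path
export NCCL_P2P_DISABLE=1                # from the post: replicas are independent, no P2P needed
CUDA_VISIBLE_DEVICES=0 vllm serve openai/gpt-oss-120b \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.80 \
  --max-num-seqs 128 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --port 8000
# The second instance would pin CUDA_VISIBLE_DEVICES=1 and use --port 8001.
```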
The reported results give the post real weight: 6.57B total tokens over roughly six days, 2.76M requests, and a 1-hour average combined throughput of 24,225 tok/s. But the most useful section may be the unresolved problem. When LiteLLM cools down one overloaded vLLM replica, the traffic shifts to the other replica, which then overloads in turn and creates a ping-pong failure pattern. That is exactly the kind of production-edge detail that makes the post valuable. r/LocalLLaMA is not just reacting to a big number here; it is reacting to a deployment note that looks close to something a real team could copy, adapt, and learn from.
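The headline figures hold up to a quick back-of-envelope check (the six-day window is the post's own approximation):

```shell
# Sanity-check arithmetic on the reported totals: 6.57B tokens, ~6 days, 2.76M requests.
total_tokens=6570000000
window_seconds=$((6 * 86400))
echo "whole-window average: $((total_tokens / window_seconds)) tok/s"   # ~12673
echo "average tokens per request: $((total_tokens / 2760000))"          # ~2380
```

The ~12.7k tok/s whole-window average sitting well below the 24,225 tok/s 1-hour figure is not a contradiction; it is what bursty, working-hours hospital traffic would be expected to produce.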
Related Articles
A LocalLLaMA thread pulled attention to DFlash, a block-diffusion draft model for speculative decoding whose paper claims lossless acceleration above 6x and direct support for vLLM, SGLang, and selected Transformers backends.
A March 14, 2026 LocalLLaMA post outlined a CUTLASS and FlashInfer patch for SM120 Blackwell workstations, claiming major gains for Qwen3.5-397B NVFP4 inference and linking the work to FlashInfer PR #2786.
A March 26, 2026 r/LocalLLaMA post about serving Qwen 3.5 27B on Google Cloud B200 clusters reached 205 points and 52 comments at crawl time. The linked write-up reports 1,103,941 total tokens per second on 12 nodes after switching from tensor to data parallelism, shrinking context length, enabling FP8 KV cache, and using MTP-1 speculative decoding.