r/LocalLLaMA Shares a University-Hospital Stack Serving 1B+ Tokens Per Day Locally
Original: Serving 1B+ tokens/day locally in my research lab
A high-signal self-post on r/LocalLLaMA offers one of the more concrete deployment write-ups the community has seen this week. The author says they lead a research lab at a university hospital and have spent the last few weeks refining an internal LLM server that now processes more than 1 billion tokens per day locally. The split is roughly two-thirds prefill (prompt ingestion) and one-third decode, and the post is written as an operations note for people trying to run similar stacks rather than as a simple benchmark boast.
The hardware is modest outside the GPUs: 2x H200, 124 GB of RAM, a 16-core CPU, and 512 GB of disk. After testing Qwen 3 models, GLM-Air, and GPT-OSS, the author settled on GPT-OSS-120B. The argument is practical: the model delivered roughly 220 to 250 tok/s for single-user decode, followed JSON instructions well, handled tool calling reliably enough for the team's workflows, and benefited from the fact that its deployed weights match the published evaluations. On this setup, the author found mxfp4 on Hopper clearly better optimized than other paths they had tried.
The architecture is equally pragmatic. LiteLLM sits in front as the OpenAI-compatible proxy handling keys, rate limits, routing, and the priority queue. Behind it are two vLLM instances, one per GPU, with PostgreSQL for usage tracking, Prometheus and Grafana for observability, and MkDocs for internal documentation. The author specifically chose one replica per GPU instead of tensor parallel across both cards because GPT-OSS-120B fits comfortably on a single H200 in mxfp4 and the independent replicas avoid NCCL overhead. With LiteLLM's simple-shuffle routing, the prompt-token split after about six days was reportedly 2.10B versus 2.11B, which is almost perfectly balanced.
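A minimal sketch of what that front end might look like, assuming LiteLLM's YAML proxy config with two local vLLM replicas behind one model alias (the alias, ports, and file names here are illustrative; the post does not reproduce its actual config):

```shell
# Hypothetical LiteLLM setup: one OpenAI-compatible alias fanned out over two
# independent vLLM replicas, with LiteLLM's default simple-shuffle routing.
cat > litellm_config.yaml <<'EOF'
model_list:
  - model_name: gpt-oss-120b            # alias clients see
    litellm_params:
      model: openai/gpt-oss-120b
      api_base: http://localhost:8000/v1   # replica on GPU 0
  - model_name: gpt-oss-120b
    litellm_params:
      model: openai/gpt-oss-120b
      api_base: http://localhost:8001/v1   # replica on GPU 1
router_settings:
  routing_strategy: simple-shuffle      # random-shuffle load balancing
EOF
litellm --config litellm_config.yaml --port 4000
```

With simple-shuffle, each request is assigned to a replica at random rather than by measured load, which is consistent with the near-even 2.10B/2.11B prompt-token split the author reports.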
The operational details are what made the post travel. The configuration uses mxfp4 quantization, a 128k context window, 0.80 GPU-memory utilization, chunked prefill, prefix caching, and 128 max sequences per instance, plus environment variables such as VLLM_USE_FLASHINFER_MXFP4_MOE=1 and NCCL_P2P_DISABLE=1. The author argues that decode throughput, not KV cache, is the real limit in this setup, and that leaving 20% VRAM headroom helps absorb logprobs-related memory spikes without materially hurting steady-state throughput.
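The post names the settings but not an exact command line; a plausible reconstruction of one replica's launch, assuming vLLM's standard CLI flags, would look like this:

```shell
# Hypothetical launch line for one of the two replicas; flag names are vLLM's
# standard engine arguments, mapped from the settings described in the post.
export VLLM_USE_FLASHINFER_MXFP4_MOE=1   # from the post: FlashInfer mxfp4 MoE path
export NCCL_P2P_DISABLE=1                # from the post: replicas are independent, no P2P needed
CUDA_VISIBLE_DEVICES=0 vllm serve openai/gpt-oss-120b \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.80 \
  --max-num-seqs 128 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --port 8000
# The second instance would pin CUDA_VISIBLE_DEVICES=1 and use --port 8001.
```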
The reported results give the post real weight: 6.57B total tokens over roughly six days, 2.76M requests, and a 1-hour average combined throughput of 24,225 tok/s. But the most useful section may be the unresolved problem. When LiteLLM cools down one overloaded vLLM replica, the traffic shifts to the other replica, which then overloads in turn and creates a ping-pong failure pattern. That is exactly the kind of production-edge detail that makes the post valuable. r/LocalLLaMA is not just reacting to a big number here; it is reacting to a deployment note that looks close to something a real team could copy, adapt, and learn from.
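The headline figures hold up to a quick back-of-envelope check (the six-day window is the post's own approximation):

```shell
# Sanity-check arithmetic on the reported totals: 6.57B tokens, ~6 days, 2.76M requests.
total_tokens=6570000000
window_seconds=$((6 * 86400))
echo "whole-window average: $((total_tokens / window_seconds)) tok/s"   # ~12673
echo "average tokens per request: $((total_tokens / 2760000))"          # ~2380
```

The ~12.7k tok/s whole-window average sitting well below the 24,225 tok/s 1-hour figure is not a contradiction; it is what bursty, working-hours hospital traffic would be expected to produce.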
Related Articles
A LocalLLaMA thread pulled attention to DFlash, a block-diffusion draft model for speculative decoding whose paper claims lossless acceleration above 6x and direct support for vLLM, SGLang, and selected Transformers backends.
A March 14, 2026 LocalLLaMA post outlined a CUTLASS and FlashInfer patch for SM120 Blackwell workstations, claiming major gains for Qwen3.5-397B NVFP4 inference and linking the work to FlashInfer PR #2786.
A March 26, 2026 r/LocalLLaMA post about serving Qwen 3.5 27B on Google Cloud B200 clusters reached 205 points and 52 comments at crawl time. The linked write-up reports 1,103,941 total tokens per second on 12 nodes after switching from tensor to data parallelism, shrinking context length, enabling FP8 KV cache, and using MTP-1 speculative decoding.