r/LocalLLaMA Benchmarks: Krasis reports 3,324 tok/s prefill for 80B MoE on one RTX 5080

What the Reddit post is claiming

The r/LocalLLaMA thread reports benchmark results for Krasis, described as a hybrid runtime for large MoE models. At crawl time, the post had a score of 180 and 53 comments. The central design idea is operational separation: run prefill on the GPU (where parallel matrix work is strongest), then run decode on the CPU while using system RAM aggressively to compensate for limited VRAM.

Published benchmark numbers

The headline numbers in the post are for Qwen3-Coder-Next (80B, Q4) on a system with a single 16GB RTX 5080: 3,324 tok/s prefill, 9.7s TTFT at 35K context, and 14.9 tok/s decode. The same post also includes EPYC-based runs with an RTX 2000 Ada 16GB, showing Q4 and Q8 variants and additional models such as Qwen3.5-35B-A3B, Qwen3-235B-A22B, and DeepSeek V2-Lite.

The author states that prefill tests use long prompts (10K-50K tokens) and decode is averaged over short generation windows (64 tokens). That framing matters: these numbers are intended to emphasize input processing latency, not only raw generation throughput.
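To see how the quoted figures relate, here is a rough latency model in Python. It is a sketch under a strong assumption that the post does not make: that prefill and decode each run at a constant rate. The function name `estimate_latency` is illustrative, and only the numeric inputs come from the post.

```python
# Back-of-the-envelope latency model.
# Assumption: prefill and decode rates are constant, which real runs will not match exactly.

def estimate_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (ttft_seconds, decode_seconds, prefill_share) under constant-rate assumptions."""
    ttft = prompt_tokens / prefill_tps      # time to process the whole prompt
    decode = output_tokens / decode_tps     # time to generate the reply
    share = ttft / (ttft + decode)          # fraction of total time spent in prefill
    return ttft, decode, share

# Figures reported for the RTX 5080 run: 3,324 tok/s prefill, 14.9 tok/s decode,
# a 35K-token context, and the 64-token decode window used in the benchmarks.
ttft, decode, share = estimate_latency(35_000, 64, 3324, 14.9)
print(f"TTFT ~{ttft:.1f}s, decode ~{decode:.1f}s, prefill share {share:.0%}")
```

With these inputs the model gives a TTFT of about 10.5 s, close to but not equal to the reported 9.7 s. The post does not spell out how the headline rate and the TTFT figure were derived from each other, so the gap is left unexplained here. Even so, the estimate shows prefill taking most of the wall time for a long prompt with a short reply.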

Why this matters for practical workflows

Agentic and IDE-integrated use cases often send large tool specs, file context, and conversation history, so prefill delay dominates user-perceived latency.
Many hybrid/offload setups still spend too much wall-clock time in CPU-heavy prefill paths as context grows.
A runtime that keeps prefill GPU-centric can improve time-to-first-token in long-context sessions without requiring enterprise-class GPUs.

Tradeoffs and open verification points

The same post and project documentation describe clear constraints: large RAM requirements, NVIDIA-only support, expensive first-run preprocessing, and significant disk cache footprints. The approach is also tuned for MoE models, so behavior on dense models may not match the headline claims.
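The VRAM constraint behind this design can be illustrated with simple arithmetic. The sketch below assumes Q4 and Q8 mean exactly 4 and 8 bits per weight. Real quantization formats add per-block overhead, and the project's actual RAM requirement is not derived here. The helper `weight_footprint_gb` is hypothetical.

```python
def weight_footprint_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and quantization metadata.
    """
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

VRAM_GB = 16  # capacity of the RTX 5080 quoted in the post

for label, bits in [("Q4", 4), ("Q8", 8)]:
    size = weight_footprint_gb(80, bits)
    print(f"80B at {label}: ~{size:.0f} GB of weights, "
          f"~{size - VRAM_GB:.0f} GB more than {VRAM_GB} GB of VRAM")
```

Under these assumptions an 80B model needs roughly 40 GB for weights at 4 bits, well over what a 16GB card holds, which is consistent with the runtime leaning on system RAM.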

For teams evaluating deployment options, the practical next step is reproducibility: can independent users match prefill gains under similar memory bandwidth and PCIe constraints? If yes, this pattern could become a useful middle ground between lightweight local runtimes and full datacenter inference stacks.
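For anyone attempting such a reproduction, one possible way to time a streaming runtime is sketched below. It does not use the Krasis API. `fake_stream` is a hypothetical stand-in for a real token stream, and computing prefill throughput as prompt tokens divided by TTFT is an assumption, not necessarily the method used in the post.

```python
import time

def measure_stream(token_stream, prompt_tokens):
    """Time a token iterator and return (ttft_s, prefill_tps, decode_tps)."""
    start = time.perf_counter()
    first = last = None
    count = 0
    for _ in token_stream:
        last = time.perf_counter()
        if first is None:
            first = last  # arrival time of the first token
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    prefill_tps = prompt_tokens / ttft
    # Decode rate counts only tokens after the first, over the time they took.
    decode_tps = (count - 1) / (last - first) if count > 1 else float("nan")
    return ttft, prefill_tps, decode_tps

def fake_stream(prefill_delay_s, n_tokens, per_token_s):
    """Hypothetical stand-in for a runtime's streaming output."""
    time.sleep(prefill_delay_s)  # simulated prompt processing
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)  # simulated per-token decode time
        yield f"tok{i}"

ttft, prefill_tps, decode_tps = measure_stream(
    fake_stream(0.05, 5, 0.01), prompt_tokens=100
)
print(f"TTFT {ttft:.3f}s, prefill {prefill_tps:.0f} tok/s, decode {decode_tps:.0f} tok/s")
```

Replacing `fake_stream` with a real runtime's token iterator, and recording prompt length, RAM configuration, and PCIe generation alongside each run, would make results easier to compare across machines.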

Sources: Reddit thread, Krasis GitHub repository
