r/LocalLLaMA Benchmarks: <code>Krasis</code> reports 3,324 tok/s prefill for 80B MoE on one RTX 5080

Original: I built a hybrid MoE runtime that does 3,324 tok/s prefill on a single 5080. Here are the benchmarks.

LLM · Mar 1, 2026 · By Insights AI (Reddit) · 2 min read

What the Reddit post is claiming

The r/LocalLLaMA thread reports benchmark results for Krasis, described as a hybrid runtime for large MoE models. At crawl time, the post had a score of 180 and 53 comments. The central design idea is operational separation: run prefill on the GPU (where parallel matrix work is strongest), then run decode on the CPU, using system RAM aggressively to compensate for limited VRAM.
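The split described above can be sketched as two phases with different hardware affinities. This is an illustrative stand-in, not the Krasis API: `prefill_gpu` and `decode_cpu` are hypothetical callables, and the toy implementations exist only so the sketch runs end to end.

```python
import time

def run_hybrid(prompt_tokens, max_new_tokens, prefill_gpu, decode_cpu):
    """Sketch of the prefill/decode separation described in the post.

    prefill_gpu: hypothetical callable that consumes the whole prompt in
                 one parallel pass and returns a KV cache.
    decode_cpu:  hypothetical callable that produces one token per step
                 from that cache.
    """
    t0 = time.perf_counter()
    kv_cache = prefill_gpu(prompt_tokens)      # parallel, GPU-bound phase
    ttft = time.perf_counter() - t0            # time to first token

    out = []
    for _ in range(max_new_tokens):            # sequential, CPU-bound phase
        tok, kv_cache = decode_cpu(kv_cache)
        out.append(tok)
    return out, ttft

# Toy stand-ins so the sketch is runnable:
toy_prefill = lambda toks: {"len": len(toks)}
def toy_decode(kv):
    kv = {"len": kv["len"] + 1}
    return kv["len"], kv

tokens, ttft = run_hybrid(list(range(100)), 5, toy_prefill, toy_decode)
```

The point of the structure is that the expensive, parallelizable pass over the full prompt happens once on the device best suited to it, while the inherently serial decode loop runs where memory capacity, not parallel throughput, is the binding constraint.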

Published benchmark numbers

The headline number in the post is for Qwen3-Coder-Next (80B, Q4) on a single RTX 5080 16GB system: 3,324 tok/s prefill, 9.7s TTFT at 35K context, and 14.9 tok/s decode. The same post also includes EPYC-based runs with RTX 2000 Ada 16GB, showing Q4 and Q8 variants and additional models such as Qwen3.5-35B-A3B, Qwen3-235B-A22B, and DeepSeek V2-Lite.

The author states that prefill tests use long prompts (10K-50K tokens) and that decode is averaged over short generation windows (64 tokens). That framing matters: these numbers are designed to highlight input-processing speed, which drives time-to-first-token, rather than raw generation throughput alone.
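The reported figures can be sanity-checked against each other. A rough back-of-envelope check, using only numbers from the post:

```python
# Consistency check of the reported numbers: a 35K-token prompt with a
# 9.7 s TTFT implies an effective prefill rate close to the headline.
context_tokens = 35_000
ttft_s = 9.7
implied_prefill = context_tokens / ttft_s   # ~3,608 tok/s

headline_prefill = 3_324                    # tok/s, as reported
ratio = implied_prefill / headline_prefill  # ~1.09
```

The implied rate sits within about 10% of the headline figure (slightly above it), which is plausible if the headline number is an average across runs or was measured at a different context length. The two claims are at least internally consistent.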

Why this matters for practical workflows

  • Agentic and IDE-integrated use cases often send large tool specs, file context, and conversation history, so prefill delay dominates user-perceived latency.
  • Many hybrid/offload setups still spend too much wall time in CPU-heavy prefill paths when context grows.
  • A runtime that keeps prefill GPU-centric can improve time-to-first-token in long-context sessions without requiring enterprise-class GPUs.

Tradeoffs and open verification points

The same post and project documentation describe clear constraints: large RAM requirements, NVIDIA-only support, expensive first-run preprocessing, and significant disk cache footprints. The approach is also tuned for MoE models, so behavior on dense models may not match the headline claims.
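The RAM requirement is easy to motivate with arithmetic. A back-of-envelope weight footprint for the headline model (ignoring KV cache, activations, and quantization overhead):

```python
# Why a 16 GB card forces aggressive system-RAM use for an 80B model:
# at 4-bit quantization the weights alone far exceed VRAM.
params = 80e9
bytes_per_param = 0.5                         # 4-bit quantization
weights_gb = params * bytes_per_param / 1e9   # 40.0 GB of weights

vram_gb = 16
spill_gb = weights_gb - vram_gb               # at least 24 GB must live in RAM
```

Even before accounting for the KV cache at 35K context, more than half the model has to reside in system memory, which is why memory bandwidth and PCIe transfer behavior show up as verification points below.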

For teams evaluating deployment options, the practical next step is reproducibility: can independent users match prefill gains under similar memory bandwidth and PCIe constraints? If yes, this pattern could become a useful middle ground between lightweight local runtimes and full datacenter inference stacks.
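Independent reproduction mostly requires careful timing of the two phases. A minimal harness sketch, assuming a hypothetical `generate` callable that signals when its first output token is ready (this mirrors the post's method of long prompts for prefill and a short 64-token window for decode, but is not the Krasis interface):

```python
import time

def bench(generate, prompt_tokens, new_tokens=64):
    """Time prefill and decode separately for any generator callable.

    `generate(prompt_tokens, max_new_tokens, on_first_token)` is a
    hypothetical signature: it must invoke on_first_token() the moment
    the first output token exists. Returns (prefill_tok_s, decode_tok_s).
    """
    marks = {}
    t0 = time.perf_counter()
    generate(prompt_tokens, new_tokens,
             on_first_token=lambda: marks.setdefault(
                 "ttft", time.perf_counter() - t0))
    total = time.perf_counter() - t0

    ttft = marks["ttft"]
    prefill_rate = len(prompt_tokens) / ttft
    decode_rate = (new_tokens - 1) / max(total - ttft, 1e-9)
    return prefill_rate, decode_rate
```

Running this against the same model, quantization, and context lengths on hardware with comparable memory bandwidth and PCIe topology is the cheapest way to confirm or refute the headline figures.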

Sources: Reddit thread, Krasis GitHub repository

