#speculative-decoding

LLM Reddit Jun 14, 2026 1 min read

Xiaomi MiMo 1T model's 1000 tps claim: the real issue as LocalLLaMA saw it

LocalLLaMA's attention centered less on the speed figure than on how far the combination of FP4, DFlash speculative decoding, and commodity GPUs can actually be reproduced.

LLM Hacker News May 16, 2026 1 min read

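Orthrus-Qwen3: 7.8x faster inference with identical output quality

The Orthrus framework achieved up to 7.8x faster token generation per forward pass on Qwen3 models. Thanks to a dual-view architecture that unifies autoregressive and diffusion views over a single KV cache, the output distribution is identical to the original model's.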

#inference #qwen3 #speculative-decoding

LLM Reddit May 6, 2026 1 min read

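Qwen 3.6 27B + MTP makes local inference 2.5x faster, with 262k context in 48GB

A method was shared for making Qwen 3.6 27B inference 2.5x faster using a new MTP support PR in llama.cpp. It makes local agentic coding possible with a 262,000-token context in 48GB of memory.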

#qwen #mtp #local-llm

LLM Reddit May 6, 2026 1 min read

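Google releases MTP drafters for Gemma 4: up to 3x faster inference

Google has released Multi-Token Prediction (MTP) drafters for the Gemma 4 model family. Through a speculative decoding architecture, they raise inference speed by up to 3x without degrading output quality.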

#gemma #google #mtp

LLM Reddit Apr 28, 2026 1 min read

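Nearly 2x on an RTX 3090: why LocalLLaMA flocked to Luce DFlash

LocalLLaMA did not scroll past this post as just another benchmark image. It raised Qwen3.6-27B throughput by an average of 1.98x on a single RTX 3090, and the claim that it holds up at long context without retraining is what heated up the thread.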

#qwen #speculative-decoding #gguf

LLM Reddit Apr 14, 2026 2 min read

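Reddit takes note of a DFlash MLX implementation that speeds up Qwen3.5 inference roughly 4x on Apple Silicon

On LocalLLaMA, an MLX implementation of DFlash that speeds up Qwen3.5 inference roughly 4x on Apple Silicon drew attention because it was not an inflated demo but engineering work that re-established the baseline and was released as open source. In a post dated April 13, 2026, the author reported raising Qwen3.5-9B at 2048 tokens from 30.96 tok/s to 127.07 tok/s against a stock MLX baseline, with 89.36% acceptance.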

#dflash #speculative-decoding #mlx

LLM Reddit Apr 13, 2026 1 min read

DFlash for Apple Silicon, as tracked by r/LocalLLaMA: lossless speculative decoding at 4.1x on MLX

A new r/LocalLLaMA post released DFlash speculative decoding on an M5 Max with MLX 0.31.1 and reported 127.07 tok/s and a 4.13x speedup on Qwen3.5-9B. What matters more than the headline is that the reproduction conditions and the bandwidth-bottleneck analysis are specific.

#mlx #apple-silicon #speculative-decoding

LLM Reddit Apr 12, 2026 1 min read

LocalLLaMA benchmark reports an average 29% speedup from speculative decoding on Gemma 4 31B

A new r/LocalLLaMA benchmark reported that, with Gemma 4 31B paired with an E2B draft model, speculative decoding can deliver an average speedup of 29%, and about 50% on code generation.

#gemma-4 #speculative-decoding #llama-cpp

LLM Reddit Apr 11, 2026 2 min read

LocalLLaMA reports 2x to 3x faster Qwen inference with DFlash on Apple Silicon

An implementation report on LocalLLaMA claims that a native MLX DFlash runtime for Apple Silicon accelerated Qwen-family inference by 2x to more than 3x. What matters is not only the speedup but also the statement that the output stayed bit-for-bit identical to the greedy baseline.

#apple-silicon #mlx #speculative-decoding

LLM Reddit Apr 7, 2026 1 min read

LocalLLaMA eyes DFlash as an open-source path to faster speculative decoding

A LocalLLaMA thread drew attention to DFlash, a block-diffusion draft model for speculative decoding. The paper touts lossless acceleration of more than 6x and support for vLLM, SGLang, and some Transformers backends.

#speculative-decoding #inference #vllm

LLM X/Twitter Apr 1, 2026 2 min read

Together Research releases Aurora, an RL-based adaptive speculative decoding system

On March 31, 2026, Together Research released Aurora, an open-source framework that learns from live inference traces and updates the speculative draft model asynchronously without interrupting serving. The company's blog and paper explain that Aurora reframes the problem as asynchronous RL and can deliver an additional 1.25x speedup over a strong static speculator under traffic shift.

#together-ai #aurora #speculative-decoding

LLM Reddit Mar 21, 2026 2 min read

mlx-lm's native MTP for Qwen3.5 and a possible 1.5x inference improvement, as spotted by r/LocalLLaMA

mlx-lm PR #990, which drew attention on r/LocalLLaMA, uses Qwen3.5's built-in MTP head for native speculative decoding and reported 15.3 -> 23.3 tok/s (~1.5x throughput boost) with a ~80.6% acceptance rate. However, operational constraints such as the need for a converted checkpoint, disabled batching, and untested MoE support should also be checked.

#mlx-lm #qwen3.5 #mtp