r/LocalLLaMA: Community benchmark data turns Apple Silicon local LLM claims into something measurable

Original: Almost 10,000 Apple Silicon benchmark runs submitted by the community — here's what the data actually shows

LLM · Mar 14, 2026 · By Insights AI (Reddit) · 2 min read

A popular r/LocalLLaMA thread this week tried to solve a familiar problem in local LLM discussions on Macs: everyone has screenshots, almost nobody has comparable data. The author says LM Studio's new submission flow and the open-source oMLX app have turned that frustration into a shared dataset, with nearly 10,000 Apple Silicon benchmark runs collected in roughly two weeks across more than 400 unique models. The post drew attention because it offers something the local inference community rarely gets at this scale: a reference point that is at least structured enough to compare chips, context lengths, and model sizes without leaning entirely on one-off anecdotes.

The post adds several details that help explain why the dataset grew quickly. The author says oMLX hit 3.8k GitHub stars in 3 days and that benchmark submissions started arriving "like a flood" once the tool spread outside its initial audience. With that volume, the headline is less the absolute number of runs and more the shape of the hardware curves. The author highlights that an M5 Max can reach about 1,200 PP tok/s at 1k to 8k context on Qwen 3.5 122B 4bit and stay above 1,000 through 16k, while an M3 Ultra starts around 893 PP tok/s at 1k and remains steady through 8k before tapering. The M4 Max, by contrast, sits in the 500s across most context lengths, clearly a lower tier than the top-end chips.
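The kind of per-chip curve comparison the author describes can be sketched as a simple aggregation over submitted runs: group by chip and context length, then take a robust statistic like the median. A minimal sketch follows; the record fields and the sample numbers are illustrative assumptions loosely based on the figures quoted above, not the actual oMLX or LM Studio submission schema.

```python
from statistics import median
from collections import defaultdict

# Hypothetical run records in the shape a community benchmark dump might use;
# the field names and values are illustrative, not the real dataset schema.
runs = [
    {"chip": "M5 Max",   "context": 1024,  "pp_tok_s": 1210.0},
    {"chip": "M5 Max",   "context": 8192,  "pp_tok_s": 1185.0},
    {"chip": "M5 Max",   "context": 16384, "pp_tok_s": 1020.0},
    {"chip": "M3 Ultra", "context": 1024,  "pp_tok_s": 893.0},
    {"chip": "M3 Ultra", "context": 8192,  "pp_tok_s": 880.0},
    {"chip": "M4 Max",   "context": 1024,  "pp_tok_s": 560.0},
    {"chip": "M4 Max",   "context": 1024,  "pp_tok_s": 540.0},
]

def curves(runs):
    """Group runs by (chip, context) and take the median prompt-processing
    throughput, so a chip's curve across context lengths can be compared
    instead of single cherry-picked numbers."""
    grouped = defaultdict(list)
    for r in runs:
        grouped[(r["chip"], r["context"])].append(r["pp_tok_s"])
    return {key: median(vals) for key, vals in grouped.items()}

print(curves(runs)[("M4 Max", 1024)])  # median of 560.0 and 540.0 -> 550.0
```

Using the median rather than the mean makes the curves less sensitive to outlier submissions, which matters for self-reported community data.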

That framing matters because the author argues the interesting comparison is not the single best number at 1k context. It is the crossover behavior at longer contexts, where unified memory bandwidth, cache behavior, and model size interact in ways that simple "which Mac is fastest" debates usually miss. The thread also links a live comparison tool at omlx.ai/c/jmxd8a4, making the discussion more inspectable than a static chart. Comments immediately pushed on the next-order questions: how to verify community-submitted results, what happens beyond 128k context, and how different engines behave on the same chip under large prompts and concurrent workloads.

The practical value of the post is that it nudges the Apple local inference conversation away from vibes and toward a shared measurement culture. That does not make the data perfect, and community-submitted benchmarks will always need skepticism. But even an imperfect dataset is better than isolated tok/s bragging when buyers and developers are choosing hardware or deciding whether a local coding workflow is viable. For anyone building or buying around Apple Silicon, the thread is worth watching because it may become the de facto public baseline for MLX-heavy local LLM performance.



© 2026 Insights. All rights reserved.