r/LocalLLaMA: Community benchmark data turns Apple Silicon local LLM claims into something measurable
Original: Almost 10,000 Apple Silicon benchmark runs submitted by the community — here's what the data actually shows
A popular r/LocalLLaMA thread this week tried to solve a familiar problem in local LLM discussions on Macs: everyone has screenshots, almost nobody has comparable data. The author says LM Studio's new submission flow and the open-source oMLX app have turned that frustration into a shared dataset, with nearly 10,000 Apple Silicon benchmark runs collected in roughly two weeks across more than 400 unique models. The post drew attention because it offers something the local inference community rarely gets at this scale: a reference point that is at least structured enough to compare chips, context lengths, and model sizes without leaning entirely on one-off anecdotes.
The post adds several details that help explain why the dataset grew quickly. The author says oMLX hit 3.8k GitHub stars in three days and that benchmark submissions started arriving "like a flood" once the tool spread outside its initial audience. With that volume, the headline is less the absolute number of runs and more the shape of the hardware curves. The author highlights that an M5 Max can reach about 1,200 PP tok/s at 1k to 8k context on Qwen 3.5 122B 4bit and stay above 1,000 through 16k, while an M3 Ultra starts around 893 PP tok/s at 1k and remains steady through 8k before tapering. The M4 Max, by contrast, sits in the 500s of PP tok/s across most context lengths, a clear tier below the top-end chips.
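To make the quoted throughput figures concrete, here is a minimal back-of-the-envelope sketch in Python that converts PP tok/s into prefill wait time for a given prompt size. The per-chip rates are the short-context numbers reported in the post; the M4 Max value is an assumed midpoint of "in the 500s", and the script ignores the taper at longer contexts the post describes, so treat the output as an optimistic estimate rather than a measurement.

```python
# Back-of-the-envelope prefill time from the PP tok/s figures quoted in
# the post. M5 Max ~1,200 and M3 Ultra ~893 are the reported short-context
# rates; the M4 Max value is an assumed midpoint of "in the 500s".
# Real throughput tapers at longer contexts, so these estimates are optimistic.

PP_RATES_TOK_PER_S = {
    "M5 Max": 1200.0,
    "M3 Ultra": 893.0,
    "M4 Max": 550.0,  # assumption: midpoint of "in the 500s"
}

def prefill_seconds(prompt_tokens: int, pp_rate: float) -> float:
    """Seconds spent processing the prompt before the first output token."""
    return prompt_tokens / pp_rate

for chip, rate in PP_RATES_TOK_PER_S.items():
    times = ", ".join(
        f"{ctx // 1000}k ctx: {prefill_seconds(ctx, rate):.1f}s"
        for ctx in (1_000, 8_000, 16_000)
    )
    print(f"{chip:<9} {times}")
```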
That framing matters because the author argues the interesting comparison is not the single best number at 1k context. It is the crossover behavior at longer contexts, where unified memory bandwidth, cache behavior, and model size interact in ways that simple "which Mac is fastest" debates usually miss. The thread also links a live comparison tool at omlx.ai/c/jmxd8a4, making the discussion more inspectable than a static chart. Comments immediately pushed on the next-order questions: how to verify community-submitted results, what happens beyond 128k context, and how different engines behave on the same chip under large prompts and concurrent workloads.
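The same crossover point can be illustrated with a toy latency model. The sketch below uses hypothetical rates, not figures from the thread, and splits request time into prefill plus generation; that is enough to show how a setup that looks slower in a short-context tok/s screenshot can come out ahead once prompts run to tens of thousands of tokens.

```python
# Minimal sketch of the crossover argument: total request latency is
# prefill plus generation, and only prefill grows with the prompt length.
# The engine rates below are hypothetical placeholders, not numbers from
# the thread; they only illustrate how rankings can flip at long context.

def total_latency(prompt_tokens: int, output_tokens: int,
                  pp_rate: float, tg_rate: float) -> float:
    """Seconds for one request: prompt processing + token generation."""
    return prompt_tokens / pp_rate + output_tokens / tg_rate

# (PP tok/s, generation tok/s): engine A generates faster, engine B prefills faster.
ENGINES = {"engine A": (600.0, 80.0), "engine B": (1200.0, 60.0)}

for prompt in (1_000, 32_000, 128_000):
    row = ", ".join(
        f"{name}: {total_latency(prompt, 500, pp, tg):7.1f}s"
        for name, (pp, tg) in ENGINES.items()
    )
    print(f"{prompt:>7}-token prompt -> {row}")
```

With these placeholder rates, engine A wins the 1k-token case on faster generation, while engine B pulls ahead at 32k and 128k because prefill dominates total time, which is exactly the trade-off the commenters asked about for large prompts and concurrent workloads.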
The practical value of the post is that it nudges the Apple local inference conversation away from vibes and toward a shared measurement culture. That does not make the data perfect, and community-submitted benchmarks will always need skepticism. But even an imperfect dataset is better than isolated tok/s bragging when buyers and developers are choosing hardware or deciding whether a local coding workflow is viable. For anyone building or buying around Apple Silicon, the thread is worth watching because it may become the de facto public baseline for MLX-heavy local LLM performance.
Related Articles
The spark in LocalLLaMA was not the raw score alone. The post landed because a 38.2% Terminal-Bench 2.0 result for Qwen 3.6-27B was framed as roughly late-2025 frontier quality, putting air-gapped and privacy-heavy coding teams into a new decision zone.
A recent r/LocalLLaMA benchmark thread argues that tokens-per-second screenshots hide the real trade-offs between MLX and llama.cpp on Apple Silicon. MLX still wins on short-context generation, but long-context workloads can erase that headline speedup because prefill dominates total latency.
A rerun benchmark posted to r/LocalLLaMA argues that Apple’s M5 Max shows its clearest gains on prompt processing rather than raw generation alone. The post reports 2,845 tok/s PP512 for Qwen 3.5 35B-A3B MoE and 92.2 tok/s generation, but these remain community measurements rather than independent lab benchmarks.