r/LocalLLaMA: Community benchmark data turns Apple Silicon local LLM claims into something measurable

Original: Almost 10,000 Apple Silicon benchmark runs submitted by the community — here's what the data actually shows

LLM · Mar 14, 2026 · By Insights AI (Reddit) · 2 min read

A popular r/LocalLLaMA thread this week tried to solve a familiar problem in local LLM discussions on Macs: everyone has screenshots, almost nobody has comparable data. The author says LM Studio's new submission flow and the open-source oMLX app have turned that frustration into a shared dataset, with nearly 10,000 Apple Silicon benchmark runs collected in roughly two weeks across more than 400 unique models. The post drew attention because it offers something the local inference community rarely gets at this scale: a reference point that is at least structured enough to compare chips, context lengths, and model sizes without leaning entirely on one-off anecdotes.

The post adds several details that help explain why the dataset grew quickly. The author says oMLX hit 3.8k GitHub stars in 3 days and that benchmark submissions started arriving "like a flood" once the tool spread outside its initial audience. With that volume, the headline is less the absolute number of runs and more the shape of the hardware curves. The author highlights that an M5 Max can reach about 1,200 PP tok/s at 1k to 8k context on Qwen 3.5 122B 4bit and stay above 1,000 through 16k, while an M3 Ultra starts around 893 PP tok/s at 1k and remains steady through 8k before tapering. The M4 Max, by contrast, sits in the 500s across most context lengths, clearly a lower tier than the top-end chips.
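The kind of per-chip curve comparison the author describes can be sketched as a simple aggregation over submitted runs: group by chip and context length, then take a robust statistic like the median. A minimal sketch follows; the record fields and the sample numbers are illustrative assumptions loosely based on the figures quoted above, not the actual oMLX or LM Studio submission schema.

```python
from statistics import median
from collections import defaultdict

# Hypothetical run records in the shape a community benchmark dump might use;
# the field names and values are illustrative, not the real dataset schema.
runs = [
    {"chip": "M5 Max",   "context": 1024,  "pp_tok_s": 1210.0},
    {"chip": "M5 Max",   "context": 8192,  "pp_tok_s": 1185.0},
    {"chip": "M5 Max",   "context": 16384, "pp_tok_s": 1020.0},
    {"chip": "M3 Ultra", "context": 1024,  "pp_tok_s": 893.0},
    {"chip": "M3 Ultra", "context": 8192,  "pp_tok_s": 880.0},
    {"chip": "M4 Max",   "context": 1024,  "pp_tok_s": 560.0},
    {"chip": "M4 Max",   "context": 1024,  "pp_tok_s": 540.0},
]

def curves(runs):
    """Group runs by (chip, context) and take the median prompt-processing
    throughput, so a chip's curve across context lengths can be compared
    instead of single cherry-picked numbers."""
    grouped = defaultdict(list)
    for r in runs:
        grouped[(r["chip"], r["context"])].append(r["pp_tok_s"])
    return {key: median(vals) for key, vals in grouped.items()}

print(curves(runs)[("M4 Max", 1024)])  # median of 560.0 and 540.0 -> 550.0
```

Using the median rather than the mean makes the curves less sensitive to outlier submissions, which matters for self-reported community data.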

That framing matters because the author argues the interesting comparison is not the single best number at 1k context. It is the crossover behavior at longer contexts, where unified memory bandwidth, cache behavior, and model size interact in ways that simple "which Mac is fastest" debates usually miss. The thread also links a live comparison tool at omlx.ai/c/jmxd8a4, making the discussion more inspectable than a static chart. Comments immediately pushed on the next-order questions: how to verify community-submitted results, what happens beyond 128k context, and how different engines behave on the same chip under large prompts and concurrent workloads.

The practical value of the post is that it nudges the Apple local inference conversation away from vibes and toward a shared measurement culture. That does not make the data perfect, and community-submitted benchmarks will always need skepticism. But even an imperfect dataset is better than isolated tok/s bragging when buyers and developers are choosing hardware or deciding whether a local coding workflow is viable. For anyone building or buying around Apple Silicon, the thread is worth watching because it may become the de facto public baseline for MLX-heavy local LLM performance.



© 2026 Insights. All rights reserved.