LocalLLaMA jumped on DeepSeek's visual-primitives idea, then watched the repo vanish
Original: DeepSeek released 'Thinking-with-Visual-Primitives' framework
The LocalLLaMA thread on DeepSeek's Thinking with Visual Primitives took off for two reasons: the underlying idea looked genuinely important, and the repo disappeared fast enough to turn the discussion into a small archival scramble.
According to the Reddit write-up, the framework was released by DeepSeek with collaborators at Peking University and Tsinghua University. The core move is simple to describe and unusually concrete: instead of keeping image reasoning fully in natural language, the model can interleave coordinate points and bounding boxes into its chain of thought as explicit spatial tokens. In other words, the model is not only describing what it thinks it sees. It can point during reasoning. That matters because multimodal systems often fail at reference precision. They talk around an object instead of grounding attention on the exact region that matters.
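To make "pointing during reasoning" concrete, here is a minimal Python sketch of what consuming such a trace could look like. The Reddit write-up does not specify the token format, so the `<point>x,y</point>` and `<box>x1,y1,x2,y2</box>` markup, the pixel-coordinate convention, and every name below (`Point`, `Box`, `parse_trace`) are assumptions made for illustration, not DeepSeek's actual syntax or API.

```python
import re
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Point:
    x: int
    y: int


@dataclass
class Box:
    x1: int
    y1: int
    x2: int
    y2: int

    def contains(self, p: Point) -> bool:
        # Inclusive check: is the point inside this box?
        return self.x1 <= p.x <= self.x2 and self.y1 <= p.y <= self.y2


# Hypothetical markup for spatial tokens; the real format may differ.
_PRIMITIVE = re.compile(r"<(point|box)>([\d,\s]+)</\1>")

Step = Union[str, Point, Box]


def parse_trace(trace: str) -> List[Step]:
    """Split a reasoning trace into prose spans and spatial primitives, in order."""
    steps: List[Step] = []
    pos = 0
    for match in _PRIMITIVE.finditer(trace):
        prose = trace[pos:match.start()].strip()
        if prose:
            steps.append(prose)
        kind = match.group(1)
        nums = [int(n) for n in match.group(2).split(",")]
        if kind == "point" and len(nums) == 2:
            steps.append(Point(*nums))
        elif kind == "box" and len(nums) == 4:
            steps.append(Box(*nums))
        else:
            raise ValueError(f"malformed {kind} token: {match.group(0)!r}")
        pos = match.end()
    tail = trace[pos:].strip()
    if tail:
        steps.append(tail)
    return steps


if __name__ == "__main__":
    trace = (
        "The mug is at <box>40,60,120,160</box> and its handle is near "
        "<point>115,100</point>, so the handle is on the right."
    )
    for step in parse_trace(trace):
        print(step)
```

Once points and boxes are structured values rather than prose, a downstream check such as `box.contains(point)` can test a spatial claim directly instead of trusting the surrounding sentence.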
The paper mirror and repo link gave commenters enough to see why the idea landed. Several users framed it as the kind of mechanism frontier labs have likely been using internally, but that open-model communities rarely get to inspect in detail. One high-voted reaction called it a big deal for open models because it replaces vague verbal scaffolding with a minimal visual language the model can manipulate. Another thread kept circling the practical upside: if points and boxes become first-class reasoning units, tasks such as counting, locating, or multi-step spatial comparison may depend less on prose that drifts away from the image.
Then came the second half of the drama. The Reddit post notes that DeepSeek removed the repository shortly after release, and commenters quickly traded mirror links and jokes about how familiar that release pattern already feels. The disappearance amplified the thread rather than killing it. In communities like LocalLLaMA, a deleted repo is not just scarcity theater. It is also a signal to preserve the artifact before it vanishes behind a cleanup pass or internal review.
That combination is why the post traveled. The community did not just see another multimodal paper. It saw a rare, inspectable attempt to make visual grounding part of the model's actual reasoning loop, then watched the window half-close in real time.