LocalLLaMA jumped on DeepSeek's visual-primitives idea, then watched the repo vanish
Original: DeepSeek released 'Thinking-with-Visual-Primitives' framework
LocalLLaMA jumped on DeepSeek's Thinking with Visual Primitives post for two reasons at once: the underlying idea looked genuinely important, and the repo then disappeared fast enough to turn the thread into a small archival scramble.
According to the Reddit write-up, the framework was released by DeepSeek with collaborators at Peking University and Tsinghua University. The core move is simple to describe and unusually concrete: instead of keeping image reasoning fully in natural language, the model can interleave coordinate points and bounding boxes into its chain of thought as explicit spatial tokens. In other words, the model is not only describing what it thinks it sees. It can point during reasoning. That matters because multimodal systems often fail at reference precision. They talk around an object instead of grounding attention on the exact region that matters.
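The post doesn't spell out the exact token format, but the mechanism is easy to picture. Below is a minimal Python sketch of what an interleaved trace could look like; the `<point ...>` and `<box ...>` syntax, and the `render_trace` helper, are illustrative assumptions for this article, not DeepSeek's published schema.

```python
# A minimal sketch of an interleaved "visual primitive" reasoning trace.
# The token format (<point ...>, <box ...>) and helper names are illustrative
# assumptions, not DeepSeek's actual schema.
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Point:
    x: int  # pixel column in the input image
    y: int  # pixel row in the input image

    def to_token(self) -> str:
        return f"<point x={self.x} y={self.y}>"


@dataclass
class Box:
    x1: int
    y1: int
    x2: int
    y2: int

    def to_token(self) -> str:
        return f"<box x1={self.x1} y1={self.y1} x2={self.x2} y2={self.y2}>"


# A chain of thought becomes an ordered mix of prose and spatial primitives.
Step = Union[str, Point, Box]


def render_trace(steps: List[Step]) -> str:
    """Serialize the interleaved trace into a single string the model could emit."""
    parts = [step if isinstance(step, str) else step.to_token() for step in steps]
    return " ".join(parts)


if __name__ == "__main__":
    # Hypothetical trace for "how many mugs are on the desk?"
    trace = [
        "The desk occupies the lower half of the image",
        Box(40, 310, 600, 470),
        "I can see one mug near the laptop",
        Point(220, 355),
        "and a second mug by the lamp",
        Point(480, 340),
        "so the answer is 2.",
    ]
    print(render_trace(trace))
```

The point of the sketch is only that the spatial references live inside the reasoning text itself, as tokens the model emits mid-thought, rather than in a separate detection pass bolted on afterward.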
The paper mirror and repo link gave commenters enough to see why the idea landed. Several users framed it as the kind of mechanism frontier labs have likely been using internally, but that open-model communities rarely get to inspect in detail. One high-voted reaction called it a big deal for open models because it replaces vague verbal scaffolding with a minimal visual language the model can manipulate. Another thread kept circling the practical upside: if points and boxes become first-class reasoning units, tasks such as counting, locating, or multi-step spatial comparison may depend less on prose that drifts away from the image.
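To make that upside concrete: once points are explicit tokens, a counting answer can be cross-checked against the locations the model actually pointed at. The companion sketch below reuses the illustrative `<point ...>` format from above; the regex and the `count_is_grounded` helper are assumptions for demonstration, not part of the framework.

```python
# Because spatial references are explicit tokens rather than prose, they can be
# pulled back out and checked. The <point ...> format matches the illustrative
# schema sketched above, not any confirmed DeepSeek spec.
import re
from typing import List, Tuple

POINT_PATTERN = re.compile(r"<point x=(\d+) y=(\d+)>")


def extract_points(trace: str) -> List[Tuple[int, int]]:
    """Return every grounded point the model emitted while reasoning."""
    return [(int(x), int(y)) for x, y in POINT_PATTERN.findall(trace)]


def count_is_grounded(trace: str, claimed_count: int) -> bool:
    """A counting answer is grounded if the trace points at that many distinct locations."""
    return len(set(extract_points(trace))) == claimed_count


if __name__ == "__main__":
    trace = (
        "I can see one mug near the laptop <point x=220 y=355> "
        "and a second mug by the lamp <point x=480 y=340> so the answer is 2."
    )
    print(extract_points(trace))        # [(220, 355), (480, 340)]
    print(count_is_grounded(trace, 2))  # True
```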
Then came the second half of the drama. The Reddit post notes that DeepSeek removed the repository shortly after release, and commenters quickly traded mirror links and jokes about how familiar that release pattern already feels. The disappearance amplified the thread rather than killing it. In communities like LocalLLaMA, a deleted repo is not just scarcity theater. It is also a signal to preserve the artifact before it vanishes behind a cleanup pass or internal review.
That combination is why the post traveled. The community did not just see another multimodal paper. It saw a rare, inspectable attempt to make visual grounding part of the model's actual reasoning loop, then watched the window half-close in real time.
Related Articles
Google DeepMind has introduced Gemma 4 as a new open-model family built from Gemini 3 research. The lineup spans E2B and E4B edge models through 26B and 31B local-workstation models, with function calling, multimodal reasoning, and 140-language support at the center of the release.
HN did not latch onto DeepSeek V4 because of a polished launch page. The thread took off when commenters realized the front-page link was just updated docs while the weights and base models were already live for inspection.
LocalLLaMA upvoted this because it felt like real plumbing, not another benchmark screenshot. The excitement was about DeepSeek open-sourcing faster expert-parallel communication and reusable GPU kernels.