Why it matters: open models rarely arrive with both headline context claims and parameter splits you can actually deploy. DeepSeek put hard numbers on the release: a 1M-token context design, a Pro model at 1.6T total / 49B active parameters, and a 284B/13B Flash variant.
HN did not upvote Browser Harness because it was just another browser wrapper. It took off because the repo lets an LLM patch its own browser helpers in the middle of a task, trading safety rails for raw flexibility.
HN did not latch onto DeepSeek V4 because of a polished launch page. The thread took off when commenters realized the front-page link was just updated docs while the weights and base models were already live for inspection.
HN latched onto a pain every heavy coding-tool user knows: the bug is tiny, but the diff balloons anyway. A new write-up turns that annoyance into a measurable benchmark and argues that better prompting and RL can make models edit with more restraint.
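The write-up's own metric isn't quoted here; as a rough illustration, a minimal sketch (names hypothetical) of one way to score edit restraint, comparing the model's patch against a reference patch by changed-line count:

```python
import difflib

def changed_lines(before: str, after: str) -> int:
    """Count lines added or removed between two versions of a file."""
    diff = difflib.unified_diff(before.splitlines(), after.splitlines(), lineterm="")
    return sum(
        1 for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )

def edit_bloat(original: str, model_fix: str, reference_fix: str) -> float:
    """Ratio of the model's edit size to the reference (minimal) edit size.

    1.0 means the model touched no more lines than the minimal fix;
    higher values quantify the ballooning diff the write-up complains about.
    """
    minimal = max(changed_lines(original, reference_fix), 1)
    return changed_lines(original, model_fix) / minimal
```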
Hacker News focused less on the Copilot plan mechanics and more on what the change reveals: long-running coding agents are turning flat AI subscriptions into a compute-cost problem.
HN read Kimi K2.6 as a test of whether open-weight coding agents can last through real engineering work. The 12- and 13-hour coding sessions drew attention, while commenters immediately pressed on speed, provider accuracy, and benchmark realism.
r/singularity reacted because the post turned LLM consciousness into a fight over computation itself. Alexander Lerchner’s “Abstraction Fallacy” paper argues that computation depends on a mapmaker, while commenters pushed back with questions about definitions, Chinese Room echoes, and philosophy versus neuroscience.
HN upvoted this because it turned vague limit anxiety into numbers. Tokenomics says 541 anonymous submissions averaged 466 request tokens on Opus 4.7 versus 349 on Opus 4.6, a roughly 33.5% increase, and the thread immediately argued over what that means for real Claude usage.
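The relative increase follows directly from the two reported averages:

```latex
% Mean request tokens: 349 (Opus 4.6) -> 466 (Opus 4.7)
\[
  \frac{466 - 349}{349} = \frac{117}{349} \approx 0.335
\]
```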
HN treated “AI cybersecurity is not proof of work” as a serious argument about search, model capability, and security asymmetry. The thread pushed past hype into a harder question: when an LLM flags a bug, did it understand the exploit path or just sample a suspicious pattern?
The r/singularity thread did not just react to Opus 4.7 scoring 41.0% where Opus 4.6 scored 94.7%. The interesting part was the community trying to separate real capability loss from refusal behavior, routing, and benchmark interpretation.
A new arXiv paper shows why low average violation rates can make LLM judges look more consistent than they are. On SummEval, 33-67% of documents showed at least one directed 3-cycle, and prediction-set width tracked absolute error closely.
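For intuition, a directed 3-cycle means the judge's pairwise preferences are intransitive: it ranks A over B, B over C, yet C over A. A minimal sketch of detecting one (the `beats` structure and summary names are illustrative, not the paper's actual interface):

```python
from itertools import combinations, permutations

# Hypothetical pairwise judgments for one document's candidate summaries:
# (a, b) in beats means the LLM judge ranked summary a above summary b.
beats = {("s1", "s2"), ("s2", "s3"), ("s3", "s1")}  # an intransitive loop

def has_directed_3cycle(items: list[str], beats: set[tuple[str, str]]) -> bool:
    """True if some triple forms a cycle a > b > c > a in the judgments."""
    for triple in combinations(items, 3):
        for a, b, c in permutations(triple):
            if (a, b) in beats and (b, c) in beats and (c, a) in beats:
                return True
    return False

print(has_directed_3cycle(["s1", "s2", "s3"], beats))  # True
```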
HN did not just ask whether Claude Opus 4.7 scores higher; it asked whether the product behavior is stable enough to build around. The thread quickly moved into adaptive thinking, tokenizer costs, safety filters, and bruised trust after recent Claude complaints.