#inference

AI Twitter Mar 14, 2026 1 min read

Together AI unveils one-cloud stack for real-time voice agents

Together AI announced on March 12, 2026 that it is releasing a one-cloud stack for real-time voice agents. The published materials cite under-500ms latency, expansion to 25+ regions, and a kernel optimization case that cut time-to-first-64-tokens to 77ms in a voice-agent deployment.

#voice-agents #inference #realtime

LLM Mar 14, 2026 2 min read

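Ares paper reports up to 52.7% lower inference cost for LLM agents

The arXiv paper Ares, submitted on March 9, 2026, proposes dynamically adjusting per-step reasoning effort in multi-step LLM agents. The authors report cutting reasoning token usage by up to 52.7% versus a fixed high-effort baseline, with only a small drop in success rate.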

#llm-agents #reasoning #efficiency

LLM Hacker News Mar 13, 2026 2 min read

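Hacker News takes note of Percepta's claim of program execution inside a transformer

In a post published March 11, 2026, Percepta claimed it can build a computer inside a transformer, execute arbitrary C programs for millions of steps, and speed up inference exponentially with 2D attention heads. HN users saw it as an interesting research direction but asked for clearer explanations, benchmarks, and evidence of real-world scalability.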

#transformers #inference #llm-research

AI Mar 12, 2026 1 min read

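Meta unveils MTIA roadmap at a pace of four generations in two years

Meta published its Meta Training and Inference Accelerator (MTIA) roadmap, calling custom silicon essential to scaling next-gen AI. The company said it shipped four generations in two years to narrow the gap between traditional chip cycles and fast-changing model architectures.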

#meta #mtia #ai-chips

LLM Reddit Mar 12, 2026 1 min read

Reddit spotlights llama.cpp Metal speedup for Qwen 3.5 on Mac

An r/LocalLLaMA post pointed Mac users to llama.cpp pull request #20361, merged on March 11, 2026. The PR adds a fused GDN recurrent Metal kernel and reports roughly 12-36% higher throughput on the Qwen 3.5 family. Reddit commenters added that the change has landed in master but that MLX may still be faster in some local benchmarks.

#llama.cpp #qwen #apple-silicon

LLM Reddit Mar 12, 2026 1 min read

llama.cpp reasoning budget control draws r/LocalLLaMA's attention

A new llama.cpp change turns <code>--reasoning-budget</code> from a template stub into actual sampler-level control. The LocalLLaMA thread focused on the tradeoff between cutting long think loops and preserving answer quality, particularly what it means for local Qwen 3.5 setups.

#llama.cpp #reasoning #local-llms

LLM Twitter Mar 11, 2026 1 min read

NVIDIA releases Nemotron 3 Super for multi-agent AI

NVIDIA AI Developer announced Nemotron 3 Super on March 11, 2026, highlighting an open 120B-parameter hybrid MoE model with 12B active parameters and a native 1M-token context. NVIDIA said the model targets agentic workloads with up to 5x higher throughput than the previous Nemotron Super.

#nvidia #nemotron #open-models

LLM Twitter Mar 11, 2026 1 min read

Microsoft Foundry strengthens open model inference on Azure with Fireworks AI

Microsoft said that with Fireworks AI joining Microsoft Foundry, it offers high-performance, low-latency open model inference on Azure. The core message is day-zero access, bring-your-own custom models, and enterprise controls in one place.

#azure #microsoft-foundry #open-models

LLM Hacker News Mar 11, 2026 1 min read

Hacker News pushes up an on-device voice AI stack for Apple Silicon

A Launch HN thread lifted RunAnywhere's MetalRT and RCLI, drawing interest in a low-latency voice AI pipeline that chains STT, LLM, and TTS on Apple Silicon without the cloud.

#apple-silicon #on-device-ai #voice-ai

LLM Reddit Mar 11, 2026 1 min read

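LocalLLaMA resurfaces the Qwen2-72B layer duplication experiment

A post resurfaced on LocalLLaMA covered David Noel Ng's experiment, which reportedly raised benchmark scores by re-running a block of seven middle layers of Qwen2-72B without modifying any weights.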

#open-llm #benchmarks #transformers

LLM Hacker News Mar 11, 2026 1 min read

Hacker News highlights RunAnywhere's local voice AI stack for Apple Silicon

A Launch HN thread put the spotlight on RunAnywhere's RCLI. The project is an attempt to build voice AI for macOS by bundling STT, LLM, TTS, local RAG, and 38 macOS actions, all running locally on Apple Silicon.

#apple-silicon #local-ai #voice-ai

LLM Hacker News Mar 10, 2026 2 min read

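HN debates whether the Claude Code '$5k user' meme confuses API pricing with actual inference cost

The widely discussed HN thread argues that the figure of $5,000 per month per Claude Code user most likely reflects usage priced at retail API rates, not Anthropic's actual serving cost.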

#anthropic #claude-code #inference