#gpu

AI Hacker News Apr 20, 2026 1 min read

Zero-copy inference from Wasm to the GPU: HN asked where the speedup comes from

HN found this post interesting because of an implementation question: can Apple Silicon unified memory actually reduce the copy boundary between the Wasm sandbox and GPU buffers?

#wasm #gpu #inference

AI sources.twitter Apr 18, 2026 1 min read

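Cloudflare Unweight: GPU kernels released that shrink Llama bundles losslessly by up to 22%

The key point is that Cloudflare is not trying to buy more GPUs but to directly reduce the memory-bandwidth bottleneck in LLM serving. The post reports a 15-22% model size reduction on Llama 3.1 8B, roughly 3 GB of VRAM saved, and publicly released GPU kernels.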

#cloudflare #llm-inference #gpu

AI Apr 14, 2026 1 min read

Hugging Face ships GPU kernels straight from the Hub, up to 2.5x over PyTorch

Hugging Face is trying to remove the trickier steps of PyTorch deployment by turning optimized GPU code into a Hub-native artifact. Clement Delangue wrote that the new Kernels flow delivers precompiled binaries matched to the GPU, PyTorch build, and OS, and targets a 1.7x to 2.5x speedup over a PyTorch baseline.

#hugging-face #kernels #pytorch

AI Hacker News Apr 13, 2026 1 min read

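AMD's ROCm strategy, surfaced by Hacker News: 'one step after another' to cross the CUDA moat

An EE Times interview that reached the Hacker News front page lays out AMD's approach of gradually reducing CUDA dependence through ROCm, Triton, OneROCm, and an open-source strategy. The point is less about flashy compatibility announcements than about boring software maturity, where vLLM and SGLang simply run.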

#rocm #cuda #amd

AI Reddit Apr 11, 2026 2 min read

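Reddit flags a cuBLAS performance anomaly in batched FP32 workloads on the RTX 5090

A post in the MachineLearning community suggests that cuBLAS may be selecting an inefficient kernel for batched FP32 MatMul on the RTX 5090. The point is that this is not just a perceived slowdown but a dispatch issue raised with a reproducible benchmark and profiling data.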

#cublas #rtx-5090 #cuda

AI Reddit Apr 11, 2026 1 min read

Claim: the cuBLAS FP32 dispatch path on the RTX 5090 has a performance problem

A benchmark writeup linked from an r/MachineLearning post claims that batched FP32 SGEMM on the RTX 5090 takes an inefficient cuBLAS path, leaving much of the GPU's compute capacity unused.

#cuda #cublas #gpu

LLM Reddit Apr 8, 2026 1 min read

r/LocalLLaMA rates Qwen3.5 27B as the sweet spot for local inference

A post on r/LocalLLaMA argues that Qwen3.5 27B strikes a rare balance between quality and deployability. The post reported about 19.7 tokens/sec on an RTX A6000 48GB using llama.cpp with CUDA at 32K context, and the comments actively compared the VRAM economics of the dense 27B against the 35B-A3B MoE.

#qwen #local-llm #llama-cpp

LLM Hacker News Apr 8, 2026 1 min read

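MegaTrain: the HN-discussed paper aiming at full-precision training of 100B+ parameter LLMs on a single GPU

MegaTrain targets full-precision training of 100B+ parameter LLMs on a single GPU by keeping parameters and optimizer states in host memory and streaming layers to the GPU. The paper drew attention on Hacker News recently because it reframes the training bottleneck as a memory system design problem rather than a matter of GPU count.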

#llm-training #systems #gpu

LLM Hacker News Apr 3, 2026 1 min read

Lemonade, a local AI server for GPUs and NPUs, spotlighted by Hacker News

Lemonade is a stack that packages local AI inference as an OpenAI-compatible server targeting GPUs and NPUs, aiming to make open-model deployment easier on everyday PCs.

#local-ai #llm #gpu

LLM Reddit Mar 29, 2026 2 min read

r/MachineLearning post on TurboQuant for weights: making 4-bit weight quantization practical

A new post on r/MachineLearning brings TurboQuant from the KV cache discussion into weight compression. The GitHub implementation aims to be a drop-in path for low-bit LLM inference.

#quantization #llm #inference

LLM Hacker News Mar 28, 2026 1 min read

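ATLAS, highlighted on Hacker News, raises questions about the cost model of local coding agents

ATLAS, which drew attention on Hacker News, heavily emphasizes the cost efficiency of local coding agents running on consumer GPUs. However, the README's 74.6% LiveCodeBench figure assumes a best-of-3 plus repair pipeline and a different task count, so the comparison with Claude 4.5 Sonnet should be read as an uncontrolled comparison.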

#coding-agents #benchmarks #local-inference

LLM Reddit Mar 27, 2026 1 min read

Intel Arc Pro B70: a new sub-$1,000 baseline for 32GB local inference?

This post rose quickly on LocalLLaMA because it translated Intel GPU news into the metrics local inference users actually care about: VRAM, bandwidth, software support, and cost.

#intel #gpu #vram