#systems

LLM Hacker News Apr 8, 2026 1 min read

MegaTrain, the HN-trending paper targeting full-precision training of 100B+ parameter LLMs on a single GPU

MegaTrain keeps parameters and optimizer states in host memory and streams layers to the GPU, targeting full-precision training of 100B+ parameter LLMs on a single GPU. The paper recently drew attention on Hacker News because it reframes the training bottleneck as a memory-system design problem rather than a question of GPU count.

#llm-training #systems #gpu

AI Hacker News Mar 21, 2026 2 min read

Flash-KMeans resurfaces on Hacker News: Exact K-Means as a GPU online primitive

Flash-KMeans, an arXiv paper submitted on 10 Mar 2026, directly targets the GPU bottlenecks of Exact K-Means: materializing the N x K distance matrix in HBM, and atomic contention during the centroid update. It collected 180 points and 14 comments on Hacker News because the result led straight into practical questions: FlashAttention-style systems optimization, the differences between CPU and GPU, and turning K-Means into an online primitive.

#k-means #gpu #systems
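
The point about the N x K distance matrix in the Flash-KMeans summary above can be illustrated with a minimal CPU sketch. This is not the paper's GPU kernel: the function name `assign_tiled` and the `tile` parameter are hypothetical, and plain Python lists stand in for device memory. It only shows the general idea of scanning centroids in tiles and keeping a running minimum per point, so the full N x K matrix never has to exist at once.

```python
def assign_tiled(points, centroids, tile=2):
    """Return the index of the nearest centroid for each point.

    Centroids are scanned in tiles of size `tile` (a hypothetical
    parameter). Only a running best distance and best index per point
    are kept, so no N x K distance matrix is ever stored.
    """
    n = len(points)
    best_dist = [float("inf")] * n
    best_idx = [-1] * n
    for start in range(0, len(centroids), tile):
        block = centroids[start:start + tile]
        for i, p in enumerate(points):
            for j, c in enumerate(block):
                # Squared Euclidean distance between point p and centroid c.
                d = sum((a - b) ** 2 for a, b in zip(p, c))
                if d < best_dist[i]:
                    best_dist[i] = d
                    best_idx[i] = start + j
    return best_idx


points = [(0, 0), (10, 10), (0, 1), (9, 9)]
centroids = [(0, 0), (10, 10), (5, 5)]
labels = assign_tiled(points, centroids, tile=2)
```

The assignment is exact regardless of the tile size; only the amount of intermediate state changes.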

AI Reddit Mar 17, 2026 1 min read

r/MachineLearning: GraphZero, a C++ engine that handles large graphs without loading them into RAM, using mmap and zero-copy tensors

On March 15, 2026, a post introducing GraphZero v0.2 collected 334 points and 27 comments on r/MachineLearning. The post and the GitHub README describe how it handles 100M+ node graphs on consumer hardware using SSD mmap, a custom binary format, and a nanobind bridge.

#graph-neural-networks #pytorch #c++
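
The mmap approach in the GraphZero summary above can be sketched with the Python standard library alone. Everything here is an assumption made for illustration: the toy CSR file layout, `write_csr`, and `MappedGraph` are hypothetical and are not GraphZero's actual binary format or API. The sketch only shows how a memory-mapped file lets neighbor lookups touch just the bytes they need, with the OS paging data in from disk on demand.

```python
import mmap
import os
import struct
import tempfile


def write_csr(path, adjacency):
    """Write a toy CSR file (hypothetical layout, little-endian):
    [num_nodes: u64][offsets: (n + 1) x u64][neighbors: u32 ...]."""
    offsets, flat = [0], []
    for nbrs in adjacency:
        flat.extend(nbrs)
        offsets.append(len(flat))
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(adjacency)))
        f.write(struct.pack(f"<{len(offsets)}Q", *offsets))
        f.write(struct.pack(f"<{len(flat)}I", *flat))


class MappedGraph:
    """Read-only view over the toy CSR file via mmap."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        (self.num_nodes,) = struct.unpack_from("<Q", self._map, 0)
        self._offsets_at = 8
        self._neighbors_at = 8 + 8 * (self.num_nodes + 1)

    def neighbors(self, node):
        # Read only the two offsets bounding this node's neighbor slice.
        start, end = struct.unpack_from(
            "<2Q", self._map, self._offsets_at + 8 * node
        )
        if end == start:
            return []
        return list(struct.unpack_from(
            f"<{end - start}I", self._map, self._neighbors_at + 4 * start
        ))

    def close(self):
        self._map.close()
        self._file.close()


def demo():
    """Round-trip a four-node graph through a temporary file."""
    adjacency = [[1, 2], [2], [], [0]]
    fd, path = tempfile.mkstemp(suffix=".csr")
    os.close(fd)
    try:
        write_csr(path, adjacency)
        graph = MappedGraph(path)
        result = [graph.neighbors(n) for n in range(graph.num_nodes)]
        graph.close()
        return result
    finally:
        os.remove(path)
```

A fixed-width layout like this is what makes random access cheap: a node's neighbor slice is located with arithmetic on offsets rather than by parsing the file.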

LLM Reddit Feb 26, 2026 1 min read

DeepSeek DualPath catches Reddit's attention: relieving the KV-Cache I/O bottleneck in agentic LLMs

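The DualPath paper, which drew attention on r/LocalLLaMA, proposes a system design that eases the I/O bottleneck by splitting the KV-Cache loading path. According to the arXiv abstract, it reports throughput improvements of up to 1.87x offline and 1.96x on average online.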

#llm-inference #kv-cache #rdma