MegaTrain proposes training 100B+ parameter LLMs at full precision on a single GPU by keeping parameters and optimizer states in host memory and streaming layers through the device. The recent Hacker News interest is notable because the paper reframes the problem as one of memory-system design rather than raw GPU count.
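The streaming pattern the paper describes can be illustrated with a minimal NumPy sketch (this is an illustration of the general host-resident/stream-through idea, not MegaTrain's actual implementation; all names and sizes here are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 8     # illustrative tiny sizes
N_LAYERS = 4

# All layer weights stay resident on the host ("CPU RAM").
host_weights = [rng.standard_normal((HIDDEN, HIDDEN)).astype(np.float32)
                for _ in range(N_LAYERS)]

def forward_streaming(x):
    """Forward pass that holds only one layer's weights in the
    'device' buffer at a time, copying each layer in on demand."""
    for w_host in host_weights:
        w_device = np.array(w_host, copy=True)   # stand-in for host->GPU transfer
        x = np.maximum(x @ w_device, 0.0)        # compute while the layer is resident
        del w_device                             # layer leaves the device buffer
    return x

out = forward_streaming(rng.standard_normal((2, HIDDEN)).astype(np.float32))
print(out.shape)
```

Peak "device" memory here is one layer plus activations, independent of model depth, which is the property that lets total model size exceed device memory.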
#systems
Flash-KMeans is an arXiv paper submitted on 10 Mar 2026 that targets two concrete GPU bottlenecks in exact K-Means: materializing the N x K distance matrix in HBM and atomic contention during centroid updates. The Hacker News thread reached 180 points and 14 comments because systems-minded readers immediately connected the work to FlashAttention-style dataflow optimization, practical deployment questions, and the broader shift of K-Means from offline preprocessing to an online AI primitive.
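The first bottleneck can be sketched in NumPy: instead of building the full N x K distance matrix, process points in tiles and keep only a running argmin per point. This is a generic tiling sketch of the idea, not Flash-KMeans's kernel; the function name and tile size are invented for the example:

```python
import numpy as np

def tiled_kmeans_assign(X, C, tile=1024):
    """Nearest-centroid assignment without materializing the full
    N x K distance matrix: only a (tile x K) block exists at a time."""
    N = X.shape[0]
    labels = np.empty(N, dtype=np.int64)
    c_norms = (C * C).sum(axis=1)          # ||c_k||^2, reused for every tile
    for start in range(0, N, tile):
        Xt = X[start:start + tile]
        # argmin over ||x - c||^2 == argmin over ||c||^2 - 2 x.c
        # (||x||^2 is constant per row, so it can be dropped)
        d = c_norms - 2.0 * (Xt @ C.T)
        labels[start:start + tile] = d.argmin(axis=1)
    return labels

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
C = rng.standard_normal((5, 3))
labels = tiled_kmeans_assign(X, C, tile=16)
```

The second bottleneck (atomic contention) has an analogous host-side counterpart: accumulating centroid sums per label with a segmented reduction such as `np.add.at` rather than per-point scattered writes.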
A March 15, 2026 post on r/MachineLearning reached 334 points and 27 comments by presenting GraphZero v0.2, a C++ and Python graph engine that memory-maps graph topology and features from SSD, keeping giant GNN datasets out of RAM while handing zero-copy tensors to PyTorch on demand.
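The memory-mapping idea behind this can be shown with a small NumPy sketch (this illustrates the general mmap-from-disk pattern, not GraphZero's API; the file path and array are invented for the example):

```python
import os
import tempfile
import numpy as np

# Write a feature matrix to disk once.
path = os.path.join(tempfile.mkdtemp(), "features.npy")
feats = np.arange(20, dtype=np.float32).reshape(5, 4)
np.save(path, feats)

# Map it instead of loading it: no bulk copy into RAM; pages are
# read from disk only when the corresponding rows are touched.
mapped = np.load(path, mmap_mode="r")
row = np.asarray(mapped[2])   # only the touched region is paged in
# A framework-side handoff would wrap such a buffer as a tensor
# (e.g., torch.from_numpy on a writable copy) without re-reading the file.
print(row)
```

The zero-copy part is the key design choice: the OS page cache, not the Python process, decides which parts of the dataset occupy RAM at any moment.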
A trending r/LocalLLaMA thread highlighted the DualPath paper on KV-cache bottlenecks in disaggregated inference systems. The arXiv abstract reports up to 1.87x offline throughput and 1.96x average online throughput gains while meeting SLOs.