What the post showed

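Cloudflare wrote that running LLMs on its own network requires using GPU memory bandwidth more efficiently, and it reported “up to a 22% model footprint reduction.” This is an important research and infrastructure signal, because inference cost is constrained not only by raw compute but also, to a large degree, by memory bandwidth.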

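The Cloudflare account posts frequently about production infrastructure for Workers AI, network performance, and its developer platform. The linked post calls the system Unweight. It is designed to compress model weights losslessly, preserve bit-exact outputs, and run without special hardware. Because the technical paper and GPU kernels have also been published, the claims are easier to verify externally than those of a typical product teaser.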

The technical claims

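The central results were measured on Llama 3.1 8B. Cloudflare explained that Unweight achieves about 30% compression on MLP weights, which translates to a 15-22% overall model-size reduction and about 3 GB of VRAM savings. The distribution bundle shrinks by about 22%, while the inference bundle, which compresses only selected projections, shrinks by about 13%.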
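The relationship between the 30% MLP figure and the roughly 22% bundle figure can be checked with back-of-envelope arithmetic. The sketch below is not from Cloudflare's post: it assumes the publicly documented Llama 3.1 8B shape (hidden size 4096, MLP intermediate size 14336, 32 layers, roughly 8.03 billion parameters) and assumes every MLP projection is compressed, so it gives an upper-end estimate rather than a reproduction of the reported numbers.

```python
# Back-of-envelope estimate. The architecture constants are assumptions taken
# from the public Llama 3.1 8B configuration, not from the Cloudflare post.
HIDDEN = 4096
INTERMEDIATE = 14336
LAYERS = 32
TOTAL_PARAMS = 8.03e9      # approximate total parameter count
BYTES_PER_BF16 = 2
MLP_COMPRESSION = 0.30     # "about 30%" on MLP weights, as stated in the post


def mlp_params(hidden: int, intermediate: int, layers: int) -> int:
    """A SwiGLU MLP has three projections per layer: gate, up, and down."""
    return 3 * hidden * intermediate * layers


def footprint_reduction(mlp_fraction: float, compression: float) -> float:
    """Whole-model reduction if only the MLP share of weights is compressed."""
    return mlp_fraction * compression


mlp = mlp_params(HIDDEN, INTERMEDIATE, LAYERS)
mlp_fraction = mlp / TOTAL_PARAMS
reduction = footprint_reduction(mlp_fraction, MLP_COMPRESSION)
saved_gb = mlp * BYTES_PER_BF16 * MLP_COMPRESSION / 1e9  # decimal GB

print(f"MLP share of parameters: {mlp_fraction:.1%}")
print(f"Estimated footprint reduction: {reduction:.1%}")
print(f"Estimated bytes saved: {saved_gb:.2f} GB")
```

Under these assumptions the MLP projections hold about 70% of the parameters, so 30% compression on them gives a reduction of about 21% and a saving of a little over 3 GB, in line with the distribution-bundle figure. Compressing only selected projections, as the inference bundle does, would lower the estimate accordingly.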

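Unlike quantization, this method is deliberately lossless. It uses Huffman coding to compress the redundancy in the exponent byte of BF16 weights, then restores the weights in fast on-chip shared memory just before they are handed to the Hopper tensor cores. This avoids a round trip in which the decompressed weights would be written back to slower main GPU memory. The tradeoff is stated explicitly, however: the current Llama 3.1 8B implementation carries a 30-40% throughput overhead on H100 SXM5, and the gap narrows as batch size grows.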
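To illustrate why the BF16 exponent is compressible, here is a minimal CPU-side sketch in Python. It is not Cloudflare's kernel and says nothing about the on-chip decompression path. It assumes synthetic Gaussian weights with a standard deviation of 0.02 as a stand-in for real MLP weights, extracts the 8-bit exponent field from each BF16 value, builds a plain Huffman code over those exponents, and checks that decoding recovers them exactly. All function names are hypothetical.

```python
import heapq
import random
import struct
from collections import Counter


def bf16_bits(x: float) -> int:
    """Truncate a float to BF16 (the top 16 bits of its float32 encoding)."""
    (f32,) = struct.unpack("<I", struct.pack("<f", x))
    return f32 >> 16


def exponent_field(bits16: int) -> int:
    """BF16 layout: 1 sign bit, 8 exponent bits, 7 mantissa bits."""
    return (bits16 >> 7) & 0xFF


def huffman_code(freqs: Counter) -> dict:
    """Build a prefix-free code {symbol: bitstring} from symbol counts."""
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    # The integer tie-breaker keeps the heap from ever comparing the dicts.
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


def decode(bitstring: str, code: dict) -> list:
    """Walk the bitstring and emit a symbol each time a codeword matches."""
    inverse = {bits: sym for sym, bits in code.items()}
    out, current = [], ""
    for bit in bitstring:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


# Synthetic stand-in for model weights (an assumption, not real Llama data).
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(20000)]

exps = [exponent_field(bf16_bits(w)) for w in weights]
code = huffman_code(Counter(exps))
encoded = "".join(code[e] for e in exps)

roundtrip_ok = decode(encoded, code) == exps  # lossless check
avg_exp_bits = len(encoded) / len(exps)
bits_per_weight = 8 + avg_exp_bits            # sign + mantissa stay raw
saving = 1 - bits_per_weight / 16

print(f"lossless round trip: {roundtrip_ok}")
print(f"average exponent bits: {avg_exp_bits:.2f} (raw: 8)")
print(f"estimated per-weight saving: {saving:.1%}")
```

Because weights of similar magnitude share a small set of exponent values, the Huffman code should spend far fewer than 8 bits on each exponent for this synthetic data, while the sign and mantissa bits are left untouched. Real model weights may be distributed differently, so the printed saving is illustrative only.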

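The things to watch next are whether external teams can reproduce the memory savings with the open kernels, how far Cloudflare can cut the throughput overhead, and whether the method extends beyond Llama-style SwiGLU models. Sources: Cloudflare source tweet · Cloudflare technical post · GPU kernels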

#unweight

Cloudflare Unweight: GPU kernels released that losslessly trim Llama bundles by up to 22%