Two Strix Halo boards as a vLLM cluster: the hard part is RDMA

Original: AMD Strix Halo RDMA Cluster Setup Guide

LLM | Jun 28, 2026 | By Insights AI (HN)

The AMD Strix Halo RDMA Cluster Setup Guide captures a practical shift in local LLM work. The goal is not simply to run a model on one small machine, but to connect two Framework Desktop Mainboards with AMD Ryzen AI Max 300-series chips, 128GB of unified memory each, and Intel E810 100GbE NICs so vLLM can serve a model with tensor parallelism across both nodes.
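A rough weight-footprint estimate shows why the second node matters. The sketch below is illustrative arithmetic, not taken from the guide: the 70B parameter count, the 16-bit precision, and the `usable_fraction` of 0.8 (headroom for KV cache, activations, and the OS) are all assumptions, and it assumes tensor parallelism splits the weights evenly across nodes.

```python
def weight_footprint_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


def fits(params_billion: float, bits_per_param: int, nodes: int,
         mem_per_node_gb: float = 128.0, usable_fraction: float = 0.8) -> bool:
    """True if an even split of the weights fits in each node's usable memory.

    usable_fraction is an assumed safety margin, not a measured value.
    """
    per_node_gb = weight_footprint_gb(params_billion, bits_per_param) / nodes
    return per_node_gb <= mem_per_node_gb * usable_fraction


if __name__ == "__main__":
    # Hypothetical 70B-parameter model stored at 16 bits per weight.
    print("weights (GB):", weight_footprint_gb(70, 16))
    print("fits on one node:", fits(70, 16, nodes=1))
    print("fits on two nodes:", fits(70, 16, nodes=2))
```

Under these assumptions the weights alone come to 140 GB, which exceeds a single 128GB node but drops to 70 GB per node when split across two.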

The central detail is RDMA. The guide describes the serving stack as Ray for the control plane, RCCL for AMD collective communication, and RoCE v2 over Ethernet for the data plane. In tensor parallelism, the nodes exchange partial results after every layer, so latency matters as much as raw bandwidth. The guide contrasts roughly 70-100 microseconds over TCP/IP with about 5 microseconds over RDMA. Because that cost is paid on every layer of every generated token, the network path directly shapes how fast the model responds.
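The latency figures above can be turned into a back-of-the-envelope per-token cost. This is a simplified sketch rather than a measurement: it counts only fixed per-message latency (no bandwidth or compute time), the 80-layer model is hypothetical, and `syncs_per_layer=1` follows the "after every layer" description here; some tensor-parallel implementations synchronize more than once per layer, so the parameter can be raised.

```python
def per_token_sync_overhead_us(layers: int, syncs_per_layer: int,
                               latency_us: float) -> float:
    """Total fixed network latency paid per generated token, in microseconds."""
    return layers * syncs_per_layer * latency_us


def comm_bound_tokens_per_s(overhead_us: float) -> float:
    """Upper bound on tokens/s if communication latency were the only cost."""
    return 1_000_000 / overhead_us


if __name__ == "__main__":
    LAYERS = 80  # hypothetical model depth, not from the guide
    for name, latency_us in [("TCP low", 70), ("TCP high", 100), ("RDMA", 5)]:
        overhead = per_token_sync_overhead_us(LAYERS, 1, latency_us)
        print(f"{name}: {overhead / 1000:.1f} ms/token of latency, "
              f"ceiling {comm_bound_tokens_per_s(overhead):.0f} tok/s")
```

With these assumptions, RDMA adds about 0.4 ms of latency per token, while TCP/IP adds 5.6 to 8 ms, before any compute or bandwidth cost is counted.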

The setup is specific rather than aspirational. It covers Fedora 43, BIOS and kernel parameters, static addresses, MTU 9000, firewall trust, passwordless SSH, RDMA device exposure inside the container, and a custom librccl.so patch. It also calls out a hardware wrinkle: the Framework board exposes a physical PCIe x4 slot, so 100GbE cards require a riser or adapter. A modified slot is mentioned as a test setup, but the guide explicitly steers users toward safer risers.
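Two of the items in that list, the MTU setting and RDMA device visibility, can be checked from Linux sysfs. The following is a minimal preflight sketch, not part of the guide: the interface name `enp1s0` is a placeholder, and the script only reads `/sys/class/net/<iface>/mtu` and lists `/sys/class/infiniband`, so it says nothing about whether RoCE traffic actually flows.

```python
from pathlib import Path
from typing import List, Optional


def read_mtu(iface: str, sysfs_root: str = "/sys") -> Optional[int]:
    """Return the interface MTU from sysfs, or None if it cannot be read."""
    path = Path(sysfs_root) / "class" / "net" / iface / "mtu"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def list_rdma_devices(sysfs_root: str = "/sys") -> List[str]:
    """Return the names of RDMA devices registered under class/infiniband."""
    path = Path(sysfs_root) / "class" / "infiniband"
    if not path.is_dir():
        return []
    return sorted(p.name for p in path.iterdir())


def preflight(iface: str, expected_mtu: int = 9000,
              sysfs_root: str = "/sys") -> List[str]:
    """Collect human-readable problems; an empty list means both checks passed."""
    problems = []
    mtu = read_mtu(iface, sysfs_root)
    if mtu is None:
        problems.append(f"interface {iface} not found")
    elif mtu != expected_mtu:
        problems.append(f"interface {iface} has MTU {mtu}, expected {expected_mtu}")
    if not list_rdma_devices(sysfs_root):
        problems.append("no RDMA devices visible")
    return problems


if __name__ == "__main__":
    # "enp1s0" is a placeholder; substitute the NIC's real interface name.
    for problem in preflight("enp1s0"):
        print("WARN:", problem)
```

Running the same script inside the serving container is one way to confirm that the RDMA device was actually passed through, since an empty device list there points at the container configuration rather than the host.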

HN discussion centered on the homelab boundary. Commenters liked the possibility of bridging the gap between 24GB consumer GPUs and much larger memory pools by combining two unified-memory boxes. At the same time, they questioned cost, token speed, PCIe limits, NIC heat, and whether Apple machines could eventually expose similar RDMA benefits over Thunderbolt.

The guide is not a turnkey product announcement, and that is the point. Local LLM performance is now shaped by memory layout, interconnect latency, containers, and serving orchestration as much as by the model file. For builders trying to run larger models outside cloud GPU rentals, this is a concrete map of the work still required.
