Reddit is into a headless Gemma 4 server built from a Xiaomi phone, not another 48 GB rig
Original: 24/7 Headless AI Server on Xiaomi 12 Pro (Snapdragon 8 Gen 1 + Ollama/Gemma4)
r/LocalLLaMA loved this because it flips the usual local-LLM flex on its head. Instead of another tower full of GPUs, the post shows a Xiaomi 12 Pro turned into a 24/7 headless Gemma 4 node. With 929 upvotes and 235 comments on the Reddit thread, the reaction was basically: yes, this is the kind of practical weird build people actually want to see.
The author says they flashed LineageOS, stripped away the Android UI and background bloat, and left roughly 9GB of RAM available for LLM work. The phone runs headless, keeps Wi-Fi networking alive with a manually compiled wpa_supplicant, and uses a custom daemon that monitors CPU temperature and switches on an external active-cooling module through a Wi-Fi smart plug once the reading hits 45°C. To avoid cooking the battery during 24/7 use, a power-delivery script cuts charging at 80%. The current setup serves Gemma 4 through Ollama as a LAN-accessible API.
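The post does not include the daemon or the charging script, so the following is only a minimal sketch of the decision logic such a setup might use, not the author's code. The 45°C and 80% thresholds come from the post. The lower switch-off and resume thresholds, all function names, and the assumption that temperature arrives as a sysfs-style millidegree string are illustrative guesses. Actually toggling the smart plug or the charger is device-specific and is left out.

```python
# Thresholds: the first and third come from the post, the others are assumptions.
FAN_ON_C = 45.0         # from the post: cooling is triggered at 45°C
FAN_OFF_C = 40.0        # assumption: switch-off point, not stated in the post
CHARGE_STOP_PCT = 80    # from the post: charging is cut at 80%
CHARGE_RESUME_PCT = 75  # assumption: resume point, not stated in the post


def parse_millidegrees(raw: str) -> float:
    """Convert a millidegree reading such as '45000\\n' to degrees Celsius.

    Linux thermal zones commonly report millidegrees; which zone maps to the
    CPU on a given phone is an assumption the reader has to verify.
    """
    return int(raw.strip()) / 1000.0


def fan_should_run(temp_c: float, fan_on: bool) -> bool:
    """Decide the fan state with hysteresis so the plug does not flap."""
    if temp_c >= FAN_ON_C:
        return True
    if temp_c <= FAN_OFF_C:
        return False
    return fan_on  # between the two thresholds: keep the current state


def charging_allowed(level_pct: int, charging: bool) -> bool:
    """Decide whether charging should be enabled, again with hysteresis."""
    if level_pct >= CHARGE_STOP_PCT:
        return False
    if level_pct <= CHARGE_RESUME_PCT:
        return True
    return charging  # between the two thresholds: keep the current state
```

The gap between each pair of thresholds is a design choice: without it, a reading hovering around 45°C or 80% would switch the plug or the charger on and off every polling cycle.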
The comments explain why the post hit so hard. One technically minded reply immediately suggested compiling llama.cpp directly on the device and dropping Ollama to squeeze out more inference speed. Another highly upvoted response said they were tired of seeing 48GB and 96GB build showcases and wanted good models running on normal consumer hardware instead. That is the real community angle here: this is not benchmark theater, it is an existence proof that local AI experiments do not have to start with workstation-class gear.
A phone like this is not replacing a serious GPU server, and the thread does not pretend otherwise. The appeal is different. A repurposed handset can become a quiet always-on endpoint for lightweight assistants, home-lab APIs, and personal local inference experiments. For a community obsessed with turning what it already owns into something useful, this Xiaomi build landed exactly where it should.
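For a sense of what using such an endpoint looks like from another machine on the LAN, here is a minimal client sketch against Ollama's `/api/generate` route on its default port 11434. The IP address and the `gemma4` model tag in the usage note are placeholders, not values from the post, and the sketch assumes the server has been configured to listen on the LAN rather than only on localhost.

```python
import json
import urllib.request

OLLAMA_PORT = 11434  # Ollama's default listening port


def build_generate_request(host: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate route."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"http://{host}:{OLLAMA_PORT}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(host: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    req = build_generate_request(host, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

A call would look like `ask("192.168.1.50", "gemma4", "Summarize today's notes.")`, with the host and model tag swapped for whatever the phone actually exposes. The generous timeout is deliberate, since a phone SoC will answer far more slowly than a desktop GPU.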