LocalLLaMA upvoted this because it pushes against the endless ‘48GB build’ arms race with something more practical and more fun: repurposing a phone as a local LLM box. The post describes a Xiaomi 12 Pro running LineageOS, headless networking, thermal automation, battery protection, and Gemma 4 served through Ollama on a home LAN.
#gemma4
The LocalLLaMA thread took off because native speech-to-text inside llama.cpp is exactly the kind of feature that removes an extra pipeline from local agent setups. The post says llama-server can now run STT with Gemma-4 E2A and E4A models, and commenters immediately started comparing the practical experience to Whisper and Voxtral.
A `r/LocalLLaMA` benchmark claims Gemma 4 31B can run at 256K context on a single RTX 5090 using TurboQuant KV cache compression. The post is notable because it pairs performance numbers with detailed build notes, VRAM measurements, and community skepticism about long-context quality under heavy KV quantization.
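To see why KV cache compression matters at this context length, here is a rough back-of-the-envelope sketch of the usual KV cache size formula. The layer count, KV head count, and head dimension below are hypothetical placeholders, not Gemma 4 31B's actual configuration, and the 4-bit width is an assumed stand-in since the summary does not state TurboQuant's bit-width. The formula also ignores sliding-window layers and quantization metadata, so treat it as an upper-bound illustration only.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bits_per_value):
    """Estimate KV cache size for plain full attention.

    Each layer stores one key and one value vector per KV head per token,
    hence the factor of 2.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * context_len
    return values * bits_per_value / 8


GIB = 2**30
CONTEXT = 256 * 1024  # 256K tokens

# Hypothetical architecture numbers, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

fp16_gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, 16) / GIB
q4_gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, 4) / GIB

print(f"16-bit KV cache: {fp16_gib:.0f} GiB")
print(f" 4-bit KV cache: {q4_gib:.0f} GiB")
```

Under these assumed numbers, the uncompressed cache alone would exceed a 32GB card before any weights are loaded, which is why the quality cost of heavy KV quantization is the central question in the thread.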
A practical HN gist lays out how to run Ollama and Gemma 4 on an Apple Silicon Mac mini, including auto-start, periodic preload, and `OLLAMA_KEEP_ALIVE=-1`. The author says `gemma4:26b` nearly exhausted 24GB unified memory, making the default 8B model a safer operational choice.
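The periodic preload idea can be sketched roughly as follows. This is not the gist's own script, just a minimal illustration that assumes Ollama is listening on its default local port and uses the `/api/generate` endpoint, which loads a model when called without a prompt; `keep_alive: -1` asks Ollama to keep it resident. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama's default local endpoint.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_preload_request(model, url=OLLAMA_URL):
    """Build a request that loads `model` and keeps it in memory."""
    # No "prompt" field: Ollama loads the model without generating text.
    payload = {"model": model, "keep_alive": -1}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def preload(model):
    """Return True if Ollama accepted the load request, False otherwise."""
    try:
        with urllib.request.urlopen(build_preload_request(model), timeout=300) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP errors.
        return False
```

Calling `preload("gemma4:26b")` from a scheduled job (for example a launchd timer on macOS) would approximate the periodic preload the gist describes; on a 24GB machine, a smaller model tag is the safer choice per the author's note.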