
r/LocalLLaMA: VoiceShelf Runs Kokoro TTS Offline on Android for EPUB Audiobooks

Original post: "I built an Android audiobook reader that runs Kokoro TTS fully offline on-device"

AI · Mar 9, 2026 · By Insights AI (Reddit) · 2 min read

What surfaced on r/LocalLLaMA

A well-received r/LocalLLaMA post described VoiceShelf, an Android audiobook reader that runs the Kokoro speech model fully offline on-device. As of March 9, 2026, the thread had 90 points. The author says the app turns EPUB text into streamed narration locally instead of sending book content to a cloud service, which makes the project more interesting as a mobile inference system than as a simple TTS demo.

The post lays out a concrete pipeline: EPUB parsing, sentence and segment chunking, G2P via Misaki, Kokoro inference, and streaming playback while an audio buffer is built. On the author's Samsung Galaxy Z Fold 7 with Snapdragon 8 Elite, the app reportedly generates audio at about 2.8x real-time. That matters because real-time factor is the hard constraint for products like this. If generation falls below playback speed, the user experiences buffering rather than narration.
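That buffering constraint can be made precise. The sketch below is not VoiceShelf's code; it is a minimal illustrative model of streamed narration, assuming playback begins as soon as the first chunk is synthesized. The only number taken from the post is the ~2.8x real-time figure; chunk durations are made up.

```python
def stalls(chunk_secs, rtf):
    """Return True if streamed playback would stall.

    chunk_secs: audio duration of each generated chunk, in seconds.
    rtf: generation speed as a multiple of real time (e.g. 2.8 means
         one second of audio takes 1/2.8 seconds to synthesize).
    Playback starts once chunk 0 is ready; a stall occurs whenever a
    chunk is not finished by the time the listener reaches it."""
    gen_time = 0.0                     # wall clock when each chunk finishes
    audio_ready = 0.0                  # total narration already buffered
    play_start = chunk_secs[0] / rtf   # playback begins here
    for dur in chunk_secs:
        gen_time += dur / rtf
        if gen_time > play_start + audio_ready:
            return True                # listener caught up to the generator
        audio_ready += dur
    return False

# At the reported ~2.8x real time, generation always stays ahead;
# below 1.0x, the listener overtakes the buffer and playback stalls.
print(stalls([5.0] * 100, 2.8))   # False
print(stalls([5.0] * 100, 0.9))   # True
```

The model makes the article's point concrete: any sustained real-time factor above 1.0 keeps the buffer growing, while anything below it guarantees eventual buffering, no matter how large the head start.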

Why the implementation details matter

The same post also exposes the engineering costs that usually stay hidden. The APK is about 1 GB because it bundles the model and custom libraries needed to run without quality loss on Android. Current features include EPUB support, experimental PDF support, fully offline inference, screen-off narration, a sleep timer, and local library management. The author is explicitly asking testers on Snapdragon, Tensor, and Dimensity devices to measure throughput and thermal throttling over longer sessions.
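For testers answering that call, the measurement the author wants reduces to logging per-chunk synthesis times and watching the real-time factor degrade over a session. The sketch below is a hypothetical analysis helper, not part of VoiceShelf; the window size and drop threshold are illustrative assumptions.

```python
from statistics import mean

def throttle_report(samples, window=10, drop=0.8):
    """Flag likely thermal throttling in a long TTS session.

    samples: list of (audio_secs, gen_secs) pairs, one per chunk.
    Baseline RTF is the mean over the first `window` chunks (the
    device is presumed cool); a chunk index is flagged when the
    rolling-window RTF falls below `drop` x that baseline."""
    rtfs = [audio / gen for audio, gen in samples]
    baseline = mean(rtfs[:window])
    flagged = []
    for i in range(window, len(rtfs)):
        rolling = mean(rtfs[i - window + 1 : i + 1])
        if rolling < drop * baseline:
            flagged.append(i)
    return baseline, flagged

# Simulated session: ten chunks at 2.8x, then the chip throttles to 1.5x.
session = [(2.8, 1.0)] * 10 + [(1.5, 1.0)] * 10
baseline, flagged = throttle_report(session)
print(round(baseline, 2), flagged[:1])
```

A rolling window rather than a per-chunk check matters here: single slow chunks (a background app waking, a long sentence) are noise, while a sustained drop across a window is the throttling signature the author is asking testers to capture.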

That is exactly the right bottleneck to investigate. Running a speech model once on a flagship phone is no longer the interesting part. The harder question is whether the system remains usable across chipsets, battery conditions, and one-hour listening sessions. In other words, the community is moving from "can this model run locally?" to "can this model power a product people would actually keep installed?"

What the thread signals

VoiceShelf is a small but useful marker for local AI on mobile. It shows that offline neural narration on consumer phones is becoming viable enough to test with real content, real buffering constraints, and real thermals. The remaining issues are product issues rather than demo issues: install size, hardware variability, and sustained performance. That is a healthier place for the community to be, because it means on-device AI is being judged on operational reality instead of novelty alone.



