r/LocalLLaMA: StepFun Releases the SFT Dataset Behind Step 3.5 Flash
Original: StepFun releases SFT dataset used to train Step 3.5 Flash
r/LocalLLaMA reacted positively when StepFun published a meaningful part of its training stack instead of stopping at a model release. The Reddit post linking Step-3.5-Flash-SFT had 124 upvotes and 16 comments at crawl time. On Hugging Face, StepFun describes the dataset as a general-domain supervised fine-tuning release for chat models and ships the full training surface in a single repository: raw JSON shards, tokenizer snapshots, and compiled variants intended for StepTronOSS training.
The README makes the release technically useful. It documents a conversations structure with ordered turns and an optional reasoning_content field on assistant messages. The repo includes tokenizer snapshots for both Step-3.5-Flash and Qwen3 specifically to preserve chat-template alignment, plus tokenizer-specific compiled shards for StepTronOSS. StepFun also publishes compatibility rules that matter if someone tries to reproduce the recipe: use a sequential sampler, do not mix tokenizer and compiled variants, and keep transformers<5.0 when relying on apply_chat_template(...).
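To make the documented structure concrete, here is a minimal sketch of one record and of rendering it through a bundled tokenizer snapshot. The conversations and reasoning_content names come from the README; the role keys, example strings, and local snapshot path are illustrative assumptions, not the official loader.

```python
# Minimal sketch, not the official loader. "conversations" and
# "reasoning_content" come from the README; the role keys, example
# strings, and snapshot path below are illustrative assumptions.
from transformers import AutoTokenizer  # README pins transformers<5.0

record = {
    "conversations": [
        {"role": "user", "content": "What is 17 * 24?"},
        {
            "role": "assistant",
            "content": "17 * 24 = 408.",
            # Optional field, present only on assistant turns.
            "reasoning_content": "17 * 20 = 340; 17 * 4 = 68; 340 + 68 = 408.",
        },
    ]
}

# Render with one of the bundled tokenizer snapshots (path assumed).
tokenizer = AutoTokenizer.from_pretrained("./tokenizer_snapshots/qwen3")
prompt = tokenizer.apply_chat_template(record["conversations"], tokenize=False)
print(prompt)
```

If the snapshots work as the README describes, the same record should render under the Step-3.5-Flash snapshot too, which is the point of shipping both: the chat template, not just the vocabulary, stays aligned with the data.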
Why the community cared
- Raw data plus tokenizer snapshots make the recipe far more reproducible than the usual weight-only “open” release.
- The optional reasoning_content field gives finetuners something they can keep, strip, or transform depending on their own training recipe (see the sketch after this list).
- The comment thread quickly surfaced licensing tension because StepFun says users must comply with both Apache-2.0 and CC-BY-NC-2.0 simultaneously.
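As one concrete version of the “strip” option, here is a hypothetical transform that removes reasoning_content from every turn, assuming the record shape sketched above:

```python
def strip_reasoning(record: dict) -> dict:
    """Drop the optional reasoning_content field from a record.

    Hypothetical helper: the field name comes from the README; the
    {"conversations": [...]} shape is the structure assumed above.
    """
    return {
        **record,
        "conversations": [
            {k: v for k, v in turn.items() if k != "reasoning_content"}
            for turn in record["conversations"]
        ],
    }
```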
That combination of openness and friction is exactly what made the thread interesting. Several commenters praised StepFun for releasing a real training surface instead of vague transparency claims. Others immediately focused on the dual-license structure and whether a non-commercial condition can coexist cleanly with the more permissive expectations people associate with Apache-style releases. Another practical point from the discussion was that shipping Qwen3 tokenizer snapshots reduces the usual pain of chat-template mismatch when developers want to reuse the data outside the original model family.
For the open-model ecosystem, this sits in an important middle ground. StepFun did not merely publish a dataset URL for optics. It exposed more of the path connecting data, tokenizer behavior, and a reference training stack. That does not resolve licensing questions, but it does make the release technically substantial for researchers and builders who want to understand how reasoning-, code-, and agent-oriented chat models are assembled in practice.
Source: Hugging Face · Community discussion: r/LocalLLaMA