HY-World 2.0 opens code and weights for navigable 3D worlds
Original: HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
HY-World 2.0 is a fresh entry in the race to make world models useful beyond video clips. In an arXiv paper submitted on April 15, 2026, Team HY-World describes a multimodal framework that can reconstruct, generate, and simulate 3D worlds from text prompts, single-view images, multi-view images, or videos.
The output is not just another flat video generation. HY-World 2.0 produces 3D world representations, including high-fidelity navigable 3D Gaussian Splatting scenes from text or a single image. The pipeline has four named stages: Panorama Generation with HY-Pano 2.0, Trajectory Planning with WorldNav, World Expansion with WorldStereo 2.0, and World Composition with WorldMirror 2.0.
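A minimal sketch of how those four stages might chain together, assuming a simple Python orchestration; none of the classes, methods, or field names below come from the HY-World 2.0 release, they only illustrate the data flow described in the paper.

```python
from dataclasses import dataclass

@dataclass
class WorldArtifacts:
    panorama: object        # equirectangular panorama from HY-Pano 2.0
    trajectory: list        # camera poses planned by WorldNav
    expanded_views: list    # novel views produced by WorldStereo 2.0
    scene: object           # composed 3D Gaussian Splatting scene

def build_world(prompt_or_image, pano_model, nav_planner, stereo_model, composer):
    """Chain the four named stages in order (illustrative interface only)."""
    panorama = pano_model.generate(prompt_or_image)        # 1. Panorama Generation
    trajectory = nav_planner.plan(panorama)                # 2. Trajectory Planning
    expanded = stereo_model.expand(panorama, trajectory)   # 3. World Expansion
    scene = composer.compose(expanded, trajectory)         # 4. World Composition
    return WorldArtifacts(panorama, trajectory, expanded, scene)
```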
The paper also introduces WorldLens, a rendering platform meant to make those generated worlds interactive. The authors describe an engine-agnostic architecture with automatic image-based lighting (IBL), efficient collision detection, and training-rendering co-design, plus support for character exploration. That matters because a world model becomes more useful when a user, simulator, or embodied agent can move through the generated space rather than merely watch it.
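To make the collision claim concrete, here is a hedged sketch of the kind of check an interactive viewer performs before moving a character through a splat scene. It is not WorldLens code; the array layout, radii, and function names are assumptions for illustration.

```python
import numpy as np

def blocked(proposed_pos, splat_centers, splat_radii, agent_radius=0.3):
    """Return True if moving the agent to proposed_pos would intersect a splat."""
    deltas = splat_centers - proposed_pos      # (N, 3) offsets to each Gaussian center
    dists = np.linalg.norm(deltas, axis=1)     # Euclidean distance to each center
    return bool(np.any(dists < (splat_radii + agent_radius)))

def step(agent_pos, direction, scene, speed=0.1):
    """Advance the agent one step unless the move collides with the scene."""
    proposed = agent_pos + speed * direction / np.linalg.norm(direction)
    if blocked(proposed, scene["centers"], scene["radii"]):
        return agent_pos                       # stay put on collision
    return proposed
```

A real viewer would use a spatial index rather than a brute-force distance check, but the interface is the point: navigation only works if the scene exposes geometry that queries like this can run against.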
The release is notable because it is open. The authors say they are releasing model weights, code, and technical details, and report that HY-World 2.0 reaches the strongest results among open-source approaches on several benchmarks, with results comparable to the closed-source model Marble. Those claims still need outside testing, especially on unusual scenes and downstream simulation tasks, but open artifacts give researchers a path to check the work instead of only watching demos.
For developers, code and weights also change the evaluation conversation. It becomes possible to test camera paths, lighting assumptions, memory consistency, and collision behavior directly, instead of inferring quality from a curated video. That is the difference between an impressive media model and a tool that can be stress-tested.
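As one example of the kind of direct test open weights allow, a minimal consistency probe might render the same camera path forward and backward and compare matching frames. The `render` callable and the metric below are assumptions standing in for whatever interface the released code exposes.

```python
import numpy as np

def orbit_path(n_frames=60, radius=2.0, height=0.5):
    """Simple circular camera path around the scene origin."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_frames, endpoint=False)
    return [np.array([radius * np.cos(a), height, radius * np.sin(a)]) for a in angles]

def path_consistency(render, scene, path):
    """Mean per-pixel difference between forward and reverse traversals."""
    forward = [render(scene, pose) for pose in path]
    backward = [render(scene, pose) for pose in reversed(path)]
    diffs = [np.abs(f.astype(np.float32) - b.astype(np.float32)).mean()
             for f, b in zip(forward, reversed(backward))]
    return float(np.mean(diffs))
```

For a deterministic renderer this difference is zero by construction; the probe only matters when generation or memory mechanisms can drift between traversals, which is exactly the failure mode a curated demo video hides.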
The near-term audience is broader than generative media. Navigable 3D world models could support game prototyping, synthetic data, robotics simulation, spatial reasoning research, and interactive scene editing. The open question is whether these systems can preserve geometry, physics cues, and object consistency when users push them beyond the polished examples in a paper.
Related Articles
NVIDIA is aiming generative video research at simulation-ready 3D environments rather than short clips. The tweet says Lyra 2.0 maintains per-frame 3D geometry and uses self-augmented training, while the project page shows outputs as Gaussian splats and meshes that can be exported to Isaac Sim.
Google DeepMind announced Genie 3, a world model that generates interactive environments from text or image prompts. The system targets 720p at 24fps and sustains coherent interactive worlds for over one minute.
Kitten TTS v0.8 drew Hacker News attention by promising ONNX-based speech synthesis in models of roughly 15M to 80M parameters that can run locally on CPUs, while commenters stress-tested real-world usability.