HY-World 2.0 opens code and weights for navigable 3D worlds

Original paper: HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

AI · Apr 17, 2026 · By Insights AI

HY-World 2.0 is a fresh entry in the race to make world models useful beyond video clips. In an arXiv paper submitted on April 15, 2026, Team HY-World describes a multimodal framework that can reconstruct, generate, and simulate 3D worlds from text prompts, single-view images, multi-view images, or videos.

The output is not just another flat generation. HY-World 2.0 produces 3D world representations, including high-fidelity navigable 3D Gaussian Splatting scenes from text or a single image. The pipeline has four named stages: Panorama Generation with HY-Pano 2.0, Trajectory Planning with WorldNav, World Expansion with WorldStereo 2.0, and World Composition with WorldMirror 2.0.
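The paper presents these four stages as a sequential pipeline. A minimal sketch of that flow, assuming the stages hand a shared world state from one to the next (all function names and data shapes here are illustrative assumptions, not the released API):

```python
# Hypothetical sketch of the four-stage HY-World 2.0 pipeline described in
# the paper. Stage names match the paper; everything else is an assumption.
from dataclasses import dataclass, field

@dataclass
class WorldState:
    """Accumulates artifacts as the pipeline runs; fields are placeholders."""
    prompt: str
    stages_run: list = field(default_factory=list)

def panorama_generation(state):   # HY-Pano 2.0: prompt or image -> panorama
    state.stages_run.append("HY-Pano 2.0")
    return state

def trajectory_planning(state):   # WorldNav: plan camera paths through the scene
    state.stages_run.append("WorldNav")
    return state

def world_expansion(state):       # WorldStereo 2.0: extend geometry along paths
    state.stages_run.append("WorldStereo 2.0")
    return state

def world_composition(state):     # WorldMirror 2.0: fuse views into a 3DGS scene
    state.stages_run.append("WorldMirror 2.0")
    return state

def build_world(prompt):
    state = WorldState(prompt=prompt)
    for stage in (panorama_generation, trajectory_planning,
                  world_expansion, world_composition):
        state = stage(state)
    return state

world = build_world("a sunlit courtyard")
print(world.stages_run)
```

The sequential hand-off is the point of the sketch: each stage consumes the previous stage's output, so errors in panorama generation or trajectory planning propagate into the final composed scene.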

The paper also introduces WorldLens, a rendering platform meant to make those generated worlds interactive. The authors describe an engine-agnostic architecture with automatic image-based lighting (IBL), efficient collision detection, and training-rendering co-design, plus support for character exploration. That matters because a world model becomes more useful when a user, simulator, or embodied agent can move through the generated space rather than merely watch it.

The release is notable because it is open. The authors say they are releasing model weights, code, and technical details, and report that HY-World 2.0 reaches the strongest results among open-source approaches on several benchmarks, with results comparable to the closed-source model Marble. Those claims still need outside testing, especially on unusual scenes and downstream simulation tasks, but open artifacts give researchers a path to check the work instead of only watching demos.

For developers, code and weights also change the evaluation conversation. It becomes possible to test camera paths, lighting assumptions, memory consistency, and collision behavior directly, instead of inferring quality from a curated video. That is the difference between an impressive media model and a tool that can be stress-tested.
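As one concrete illustration of that kind of stress test (not part of the HY-World 2.0 release), a developer with access to a generated 3D Gaussian Splatting scene could probe collision behavior directly: take the Gaussian centers as a coarse proxy for scene geometry and flag camera-path samples that pass too close to them. The function name and clearance threshold below are assumptions for the sketch:

```python
# Illustrative collision probe for a generated 3DGS scene. Gaussian centers
# stand in for scene surfaces; a real test would also use scales/opacities.
import numpy as np

def path_clearance_violations(centers, path, min_clearance=0.2):
    """Return indices of path samples closer than min_clearance to any center.

    centers: (N, 3) Gaussian means approximating scene geometry.
    path:    (M, 3) camera positions sampled along a trajectory.
    """
    # Pairwise distances between every path sample and every Gaussian center.
    d = np.linalg.norm(path[:, None, :] - centers[None, :, :], axis=-1)
    return np.nonzero(d.min(axis=1) < min_clearance)[0]

# Toy scene: a 5x5 wall of Gaussians at x = 1.0; the camera flies straight
# through it along the x-axis, so only the sample at the wall should trip.
wall = np.stack(np.meshgrid([1.0], np.linspace(-1, 1, 5),
                            np.linspace(-1, 1, 5)), axis=-1).reshape(-1, 3)
flight = np.stack([np.linspace(0, 2, 9),
                   np.zeros(9), np.zeros(9)], axis=-1)
print(path_clearance_violations(wall, flight))
```

A curated demo video cannot reveal this kind of failure; a script like this, run over many generated scenes and trajectories, can.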

The near-term audience is broader than generative media. Navigable 3D world models could support game prototyping, synthetic data, robotics simulation, spatial reasoning research, and interactive scene editing. The open question is whether these systems can preserve geometry, physics cues, and object consistency when users push them beyond the polished examples in a paper.


© 2026 Insights. All rights reserved.