LoGeR Pushes Feedforward 3D Reconstruction to 19,000-Frame Videos
Original: LoGeR – 3D reconstruction from extremely long videos (DeepMind, UC Berkeley)
What LoGeR is trying to solve
A new Hacker News thread pointed to LoGeR, short for Long-Context Geometric Reconstruction with Hybrid Memory, from researchers at Google DeepMind and UC Berkeley. The project targets a hard computer vision problem: how to reconstruct stable 3D geometry from very long video sequences without falling back on expensive backend optimization. At crawl time the HN thread had 115 points and 25 comments, a meaningful level of engagement for a technical research project rather than a consumer product launch.
On the project page, the authors frame the problem around two bottlenecks. The first is a context wall: full bidirectional models can model local geometry well, but their quadratic cost does not scale to long video streams. The second is a data wall: simply making attention cheaper is not enough if training still happens on short video “bubbles” that do not generalize to expansive real scenes.
Hybrid memory instead of one giant sequence
LoGeR’s answer is to process the input causally in chunks and bridge those chunks with a hybrid memory module. The local path uses Sliding Window Attention (SWA) to preserve high-precision alignment around neighboring chunk boundaries. The global path uses Test-Time Training (TTT) to keep a compressed long-range state that reduces scale drift as the sequence grows. The architecture also combines per-frame attention and chunk-wise bi-attention, so the model is not forced to choose between local fidelity and long-horizon consistency.
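The local SWA path can be illustrated with a small sketch. This is not the authors' code; it is a minimal NumPy construction of a causal sliding-window attention mask, showing why the local path scales linearly in sequence length rather than quadratically:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal sliding-window attention mask: token i may attend to
    tokens j with i - window < j <= i. True means "may attend"."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=6, window=3)
# Each row has at most `window` True entries, so attention cost is
# O(seq_len * window) rather than the O(seq_len^2) of full attention.
```

A full-attention causal mask would instead be `j <= i` alone; the extra `j > i - window` condition is what caps per-token cost, at the price of losing direct long-range links, which is the gap the global TTT memory is meant to fill.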
The practical claim is important: the system reports handling sequences up to 19,000 frames while staying fully feedforward and avoiding post-hoc bundle-adjustment style cleanup. That makes the method interesting for robotics, AR, mapping, and embodied systems where latency and deployment simplicity matter as much as raw reconstruction quality.
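The constant-memory streaming property can be sketched as a toy loop. Everything here is an assumption for illustration: `process_stream`, the random projection standing in for SWA encoding, and the running-average update standing in for the paper's TTT fast-weight updates are all hypothetical, not LoGeR's actual components. The point the sketch makes is structural: per-chunk work is bounded, and the global state has a fixed size regardless of how many frames arrive.

```python
import numpy as np

def process_stream(frames, chunk_size=64, state_dim=32, seed=0):
    """Toy feedforward streaming loop: encode each chunk locally,
    then fold a chunk summary into a fixed-size global state, so
    memory stays constant whether the video has 300 or 19,000 frames."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((frames.shape[1], state_dim)) * 0.01
    state = np.zeros(state_dim)            # compressed long-range memory
    outputs = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        local = chunk @ W                  # stand-in for the SWA local path
        # Fold the chunk summary into the global state. LoGeR uses
        # TTT-style updates here; this exponential blend is a placeholder.
        state = 0.9 * state + 0.1 * local.mean(axis=0)
        outputs.append(local + state)      # local detail + global context
    return np.concatenate(outputs), state
```

Because the loop never revisits earlier chunks, there is no post-hoc optimization pass, which is the feedforward property the project emphasizes for latency-sensitive robotics and AR deployments.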
Reported results
The project page reports an average ATE of 18.65 on KITTI and says LoGeR delivers a 30.8% relative improvement over prior feedforward approaches on a 19k-frame VBR benchmark. It also reports large gains on shorter-sequence tasks, including a 69.2% relative gain on 7-Scenes reconstruction and strong pose improvements on ScanNet and TUM-Dynamics. Those are project-reported numbers, but they are specific enough to show that the contribution is not just “works on long videos” marketing. The method appears to preserve competitive short-context accuracy while materially extending the usable horizon.
The broader significance is that long-context video understanding is increasingly becoming an architecture problem, not only a scale problem. LoGeR suggests there is room between full-attention systems that do not scale and compressed-memory systems that lose too much geometry. If the released code and paper reproduce cleanly, this could become a useful reference point for future long-horizon visual mapping models.