LoGeR Pushes Feedforward 3D Reconstruction to 19,000-Frame Videos
Original: LoGeR – 3D reconstruction from extremely long videos (DeepMind, UC Berkeley)
What LoGeR is trying to solve
A new Hacker News thread pointed to LoGeR, short for Long-Context Geometric Reconstruction with Hybrid Memory, from researchers at Google DeepMind and UC Berkeley. The project targets a hard computer vision problem: how to reconstruct stable 3D geometry from very long video sequences without falling back on expensive backend optimization. At crawl time the HN thread had 115 points and 25 comments, a meaningful level of engagement for a technical research project rather than a consumer product launch.
On the project page, the authors frame the problem around two bottlenecks. The first is a context wall: full bidirectional models can model local geometry well, but their quadratic cost does not scale to long video streams. The second is a data wall: simply making attention cheaper is not enough if training still happens on short video “bubbles” that do not generalize to expansive real scenes.
Hybrid memory instead of one giant sequence
LoGeR’s answer is to process the input causally in chunks and bridge those chunks with a hybrid memory module. The local path uses Sliding Window Attention (SWA) to preserve high-precision alignment around neighboring chunk boundaries. The global path uses Test-Time Training (TTT) to keep a compressed long-range state that reduces scale drift as the sequence grows. The architecture also combines per-frame attention and chunk-wise bi-attention, so the model is not forced to choose between local fidelity and long-horizon consistency.
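The local SWA path can be illustrated with a small sketch. This is not the authors' code; it is a minimal NumPy construction of a causal sliding-window attention mask, showing why the local path scales linearly in sequence length rather than quadratically:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal sliding-window attention mask: token i may attend to
    tokens j with i - window < j <= i. True means "may attend"."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=6, window=3)
# Each row has at most `window` True entries, so attention cost is
# O(seq_len * window) rather than the O(seq_len^2) of full attention.
```

A full-attention causal mask would instead be `j <= i` alone; the extra `j > i - window` condition is what caps per-token cost, at the price of losing direct long-range links, which is the gap the global TTT memory is meant to fill.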
The practical claim is important: the system reports handling sequences up to 19,000 frames while staying fully feedforward and avoiding post-hoc bundle-adjustment style cleanup. That makes the method interesting for robotics, AR, mapping, and embodied systems where latency and deployment simplicity matter as much as raw reconstruction quality.
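The constant-memory streaming property can be sketched as a toy loop. Everything here is an assumption for illustration: `process_stream`, the random projection standing in for SWA encoding, and the running-average update standing in for the paper's TTT fast-weight updates are all hypothetical, not LoGeR's actual components. The point the sketch makes is structural: per-chunk work is bounded, and the global state has a fixed size regardless of how many frames arrive.

```python
import numpy as np

def process_stream(frames, chunk_size=64, state_dim=32, seed=0):
    """Toy feedforward streaming loop: encode each chunk locally,
    then fold a chunk summary into a fixed-size global state, so
    memory stays constant whether the video has 300 or 19,000 frames."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((frames.shape[1], state_dim)) * 0.01
    state = np.zeros(state_dim)            # compressed long-range memory
    outputs = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        local = chunk @ W                  # stand-in for the SWA local path
        # Fold the chunk summary into the global state. LoGeR uses
        # TTT-style updates here; this exponential blend is a placeholder.
        state = 0.9 * state + 0.1 * local.mean(axis=0)
        outputs.append(local + state)      # local detail + global context
    return np.concatenate(outputs), state
```

Because the loop never revisits earlier chunks, there is no post-hoc optimization pass, which is the feedforward property the project emphasizes for latency-sensitive robotics and AR deployments.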
Reported results
The project page reports an average ATE of 18.65 on KITTI and says LoGeR delivers a 30.8% relative improvement over prior feedforward approaches on a 19k-frame VBR benchmark. It also reports large gains on shorter-sequence tasks, including a 69.2% relative gain on 7-Scenes reconstruction and strong pose improvements on ScanNet and TUM-Dynamics. Those are project-reported numbers, but they are specific enough to show that the contribution is not just “works on long videos” marketing. The method appears to preserve competitive short-context accuracy while materially extending the usable horizon.
The broader significance is that long-context video understanding is increasingly becoming an architecture problem, not only a scale problem. LoGeR suggests there is room between full-attention systems that do not scale and compressed-memory systems that lose too much geometry. If the released code and paper reproduce cleanly, this could become a useful reference point for future long-horizon visual mapping models.