LoGeR Pushes Feedforward 3D Reconstruction to 19,000-Frame Videos
Original: LoGeR – 3D reconstruction from extremely long videos (DeepMind, UC Berkeley)
What LoGeR is trying to solve
A new Hacker News thread pointed to LoGeR, short for Long-Context Geometric Reconstruction with Hybrid Memory, from researchers at Google DeepMind and UC Berkeley. The project targets a hard computer vision problem: how to reconstruct stable 3D geometry from very long video sequences without falling back on expensive backend optimization. At crawl time the HN thread had 115 points and 25 comments, a meaningful level of engagement for a technical research project rather than a general product launch.
On the project page, the authors frame the problem around two bottlenecks. The first is a context wall: full bidirectional models can model local geometry well, but their quadratic cost does not scale to long video streams. The second is a data wall: simply making attention cheaper is not enough if training still happens on short video “bubbles” that do not generalize to expansive real scenes.
Hybrid memory instead of one giant sequence
LoGeR’s answer is to process the input causally in chunks and bridge those chunks with a hybrid memory module. The local path uses Sliding Window Attention (SWA) to preserve high-precision alignment around neighboring chunk boundaries. The global path uses Test-Time Training (TTT) to keep a compressed long-range state that reduces scale drift as the sequence grows. The architecture also combines per-frame attention and chunk-wise bi-attention, so the model is not forced to choose between local fidelity and long-horizon consistency.
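To make the two-path design concrete, here is a minimal, self-contained sketch of the general pattern the paragraph describes: process a stream in chunks, run causal sliding-window attention for local precision, and keep a compressed global state that is updated at inference time in TTT style. All names (`TTTMemory`, `process_stream`, the additive fusion of the two paths, the chunk and window sizes) are hypothetical illustrations, not LoGeR's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sliding_window_attention(q, k, v, window):
    """Causal attention: each frame attends only to the last `window` frames,
    so per-chunk cost is O(T * window) instead of quadratic in T."""
    T, d = q.shape
    out = np.zeros_like(v)
    for t in range(T):
        lo = max(0, t - window + 1)
        scores = q[t] @ k[lo:t + 1].T / np.sqrt(d)
        out[t] = softmax(scores) @ v[lo:t + 1]
    return out

class TTTMemory:
    """Toy test-time-training memory: a linear map W, updated by gradient
    steps on a reconstruction loss during inference. A stand-in for the
    compressed long-range state described in the article, not LoGeR's module."""
    def __init__(self, d, lr=0.01):
        self.W = np.zeros((d, d))
        self.lr = lr

    def update(self, keys, values):
        # One SGD step on ||W k - v||^2 per frame: the memory "trains"
        # itself on the stream as it arrives.
        for k, v in zip(keys, values):
            err = self.W @ k - v
            self.W -= self.lr * np.outer(err, k)

    def read(self, queries):
        # Retrieve the compressed global context for each query frame.
        return queries @ self.W.T

def process_stream(frame_feats, chunk=64, window=16):
    """Fully feedforward pass over an arbitrarily long feature stream:
    memory and compute per step stay bounded regardless of total length."""
    d = frame_feats.shape[1]
    mem = TTTMemory(d)
    outputs = []
    for s in range(0, len(frame_feats), chunk):
        x = frame_feats[s:s + chunk]
        local = sliding_window_attention(x, x, x, window)   # local path (SWA)
        global_ctx = mem.read(x)                            # global path (TTT state)
        outputs.append(local + global_ctx)                  # fuse the two paths
        mem.update(x, x)                                    # compress chunk into memory
    return np.concatenate(outputs, axis=0)
```

The point of the sketch is the cost structure: each chunk touches only a fixed window of neighbors plus a fixed-size memory, so total work grows linearly with video length, which is what makes 19k-frame sequences feasible without full bidirectional attention.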
The practical claim is important: the system reports handling sequences of up to 19,000 frames while staying fully feedforward and avoiding post-hoc, bundle-adjustment-style cleanup. That makes the method interesting for robotics, AR, mapping, and embodied systems, where latency and deployment simplicity matter as much as raw reconstruction quality.
Reported results
The project page reports an average absolute trajectory error (ATE) of 18.65 on KITTI and says LoGeR delivers a 30.8% relative improvement over prior feedforward approaches on a 19k-frame VBR benchmark. It also reports large gains on shorter-sequence tasks, including a 69.2% relative gain on 7-Scenes reconstruction and strong pose improvements on ScanNet and TUM-Dynamics. These are project-reported numbers, but they are specific enough to show that the contribution is not just "works on long videos" marketing. The method appears to preserve competitive short-context accuracy while materially extending the usable horizon.
The broader significance is that long-context video understanding is increasingly becoming an architecture problem, not only a scale problem. LoGeR suggests there is room between full-attention systems that do not scale and compressed-memory systems that lose too much geometry. If the released code and paper reproduce cleanly, this could become a useful reference point for future long-horizon visual mapping models.
Related Articles
Highlighted in r/MachineLearning, VeridisQuo is an open-source deepfake detector that fuses an EfficientNet-B4 spatial stream with FFT and DCT frequency features, then uses GradCAM remapping to show which facial regions triggered a prediction.
Google says Cinematic Video Overviews are rolling out to NotebookLM Ultra users in English. The company says the feature combines Gemini 3, Nano Banana Pro, and Veo 3 to generate more immersive videos than the earlier narrated-slide format.