This paper argues that image generators may be turning into the vision equivalent of large language models. DeepMind says Vision Banana, built on Nano Banana Pro, beats or rivals specialist systems such as Segment Anything and Depth Anything on 2D and 3D tasks after lightweight instruction tuning.
#computer-vision
Meta introduced SAM 3.1 on March 27, 2026 as a drop-in update to SAM 3 that improves real-time video detection and tracking through object multiplexing. The release notes say shared-memory joint multi-object tracking lets the model track up to 16 objects in one forward pass, doubling throughput from 16 to 32 FPS on a single H100 for medium-object-count videos, and delivering roughly a 7x speedup at 128 objects compared with the November 2025 SAM 3 release, alongside new checkpoints.
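The core idea behind object multiplexing is to stack all object queries on a leading axis and run one vectorized forward pass instead of one model call per object. The sketch below is purely illustrative (a toy update rule, not Meta's SAM 3.1 code); the function name and the reweighting step are assumptions made for the example.

```python
import numpy as np

def propagate_masks_multiplexed(frame, masks):
    """Toy multiplexed propagation: all object masks share one
    vectorized pass instead of one model call per object.
    (Illustrative only; not Meta's SAM 3.1 implementation.)"""
    # masks: (N, H, W) soft masks in [0, 1]; frame: (H, W) grayscale.
    # Hypothetical update rule: reweight each mask by frame intensity,
    # then renormalize, all in a single broadcasted operation.
    weighted = masks * frame[None, :, :]
    totals = weighted.sum(axis=(1, 2), keepdims=True) + 1e-8
    return weighted / totals

frame = np.random.rand(64, 64)
masks = np.random.rand(16, 64, 64)   # 16 objects, one forward pass
out = propagate_masks_multiplexed(frame, masks)
print(out.shape)  # (16, 64, 64)
```

Because the object axis is just another batch dimension, the per-frame cost grows sublinearly with object count on hardware that parallelizes the batched operation, which is the kind of gain the release notes describe.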
Google DeepMind introduced D4RT on January 22, 2026 as a unified model for dynamic 4D scene reconstruction and tracking. The company says it runs 18x to 300x faster than prior methods and is efficient enough for real-time applications in robotics and augmented reality.
A post on r/artificial drew attention to painter Michael Hafftka publishing his catalogue raisonné as an open dataset on Hugging Face. The dataset card lists roughly 3,780 works, structured metadata, and a CC-BY-NC-4.0 license.
A March 16, 2026 r/artificial post linking a Popular Science report reached 590 points and 62 comments. The story says Niantic Spatial trained its Visual Positioning System on more than 30 billion Pokémon Go images and is now partnering with Coco Robotics so delivery robots can localize with centimeter-level precision in GPS-challenged streets.
A Hacker News discussion highlighted LoGeR, a Google DeepMind and UC Berkeley project that uses hybrid memory to scale dense 3D reconstruction across extremely long videos without post-hoc optimization.
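One common way to make memory scale across extremely long videos is to pair a dense short-term window with a sparse long-term store of keyframes. The class below sketches that idea; the structure, parameter names, and keyframe policy are assumptions for illustration, and LoGeR's actual hybrid memory may differ.

```python
from collections import deque

class HybridMemory:
    """Sketch of a hybrid memory for long-video reconstruction:
    a dense short-term window plus a sparse long-term keyframe store.
    (Hypothetical structure; not LoGeR's actual design.)"""
    def __init__(self, window=8, keyframe_stride=30):
        self.short_term = deque(maxlen=window)   # dense recent frames
        self.long_term = []                      # sparse keyframes
        self.stride = keyframe_stride
        self.count = 0

    def insert(self, frame_feat):
        self.short_term.append(frame_feat)
        if self.count % self.stride == 0:        # keep every Nth frame
            self.long_term.append(frame_feat)
        self.count += 1

    def context(self):
        # Memory grows as window + count // stride entries,
        # not linearly with every frame seen.
        return list(self.long_term) + list(self.short_term)

mem = HybridMemory(window=8, keyframe_stride=30)
for t in range(300):
    mem.insert(t)
print(len(mem.context()))  # 18: 10 keyframes + 8 recent frames
```

The payoff is that reconstruction over thousands of frames can attend to a bounded context rather than the full history, which is what avoids post-hoc optimization over the whole sequence.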
A well-received r/MachineLearning post introduced VeridisQuo, an open-source deepfake detector that fuses an EfficientNet-B4 spatial stream with FFT and DCT frequency features, then uses GradCAM remapping to overlay heatmaps showing which facial regions triggered a prediction onto manipulated video frames. The project stands out because the author shared concrete architecture and training details instead of just a demo clip.
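The spatial-plus-frequency fusion idea can be sketched with NumPy alone: compute a log-magnitude FFT spectrum, pool it into radial frequency bands, and concatenate those statistics with a spatial embedding. This is a minimal illustration of the concept, not VeridisQuo's pipeline (which uses an EfficientNet-B4 backbone and DCT features as well); the function names and band pooling are assumptions.

```python
import numpy as np

def frequency_features(gray, bands=8):
    """Log-magnitude FFT features pooled into radial frequency bands.
    (Sketch of the frequency branch; not VeridisQuo's actual code.)"""
    spec = np.fft.fftshift(np.fft.fft2(gray))
    logmag = np.log1p(np.abs(spec))
    h, w = gray.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2)        # distance from DC term
    edges = np.linspace(0, r.max() + 1e-6, bands + 1)
    return np.array([logmag[(r >= lo) & (r < hi)].mean()
                     for lo, hi in zip(edges[:-1], edges[1:])])

def fused_features(gray, spatial_feat, bands=8):
    # Concatenate a spatial embedding with frequency-band statistics.
    return np.concatenate([spatial_feat, frequency_features(gray, bands)])

img = np.random.rand(128, 128)
feat = fused_features(img, np.zeros(16))        # toy spatial embedding
print(feat.shape)  # (24,)
```

Frequency-band statistics are a popular deepfake cue because generative upsampling tends to leave periodic artifacts in the high-frequency bands that a purely spatial CNN can miss.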