GenCAD is an AI system that generates parametric CAD command sequences from image inputs. Unlike mesh- or voxel-based 3D generation, it outputs the complete CAD program history, which keeps designs fully editable. The system combines an autoregressive transformer, contrastive learning, and a latent diffusion model.
#computer-vision
LocalLLaMA reacted hard because DeepSeek's visual-primitives idea makes points and boxes part of reasoning itself, and the repo going private only made the thread hotter.
This paper argues that image generators may be turning into the vision equivalent of large language models. DeepMind says Vision Banana, built on Nano Banana Pro, beats or rivals specialist systems such as Segment Anything and Depth Anything on 2D and 3D tasks after lightweight instruction tuning.
Meta said on March 27, 2026 that SAM 3.1 is a drop-in update to SAM 3 that improves video processing efficiency through object multiplexing. The project's release notes say the update introduces shared-memory joint multi-object tracking, new checkpoints, and about 7x speedup at 128 objects on a single H100 compared with the November 2025 SAM 3 release.
Meta introduced SAM 3.1 on March 27, 2026 as a drop-in upgrade for real-time video detection and tracking. The company says object multiplexing lets the model track up to 16 objects in one forward pass and doubles throughput from 16 to 32 FPS on a single H100 for medium-object-count videos.
A post on r/artificial drew attention to painter Michael Hafftka publishing his catalogue raisonné as an open dataset on Hugging Face. The dataset card lists roughly 3,780 works, structured metadata, and a CC-BY-NC-4.0 license.
A March 16, 2026 r/artificial post linking a Popular Science report reached 590 points and 62 comments. The story says Niantic Spatial trained its Visual Positioning System on more than 30 billion Pokémon Go images and is now partnering with Coco Robotics so delivery robots can localize with centimeter-level precision in GPS-challenged streets.
A Hacker News discussion highlighted LoGeR, a Google DeepMind and UC Berkeley project that uses hybrid memory to scale dense 3D reconstruction across extremely long videos without post-hoc optimization.
Highlighted in r/MachineLearning, VeridisQuo fuses an EfficientNet-B4 spatial stream with FFT and DCT frequency features, then uses GradCAM remapping to show which facial regions triggered a deepfake prediction.
Google DeepMind introduced D4RT, a single-model framework for dynamic 4D scene reconstruction and tracking. The company reports up to 300x efficiency gains versus prior methods, highlighting real-time potential for robotics and AR workloads.