VeridisQuo combines spatial and frequency cues for explainable deepfake detection
Original: [P] VeridisQuo - open-source deepfake detector that combines spatial + frequency analysis and shows you where the face was manipulated
VeridisQuo is a student-built deepfake detection project that tries to make video forensics more explainable, not just more accurate. The core premise is that many detectors focus mainly on pixel-level visual features, while generated media also leaves traces in the frequency domain through compression artifacts and spectral inconsistencies. VeridisQuo therefore combines a standard spatial vision backbone with dedicated frequency analysis and then visualizes where the model believes manipulation is happening.
According to the README and the r/MachineLearning write-up, the spatial branch uses an ImageNet-pretrained EfficientNet-B4 to produce a 1,792-dimensional representation. The frequency branch computes both FFT and DCT features from each cropped face image, producing two 512-dimensional vectors that are fused into a 1,024-dimensional representation by a small MLP. Those streams are concatenated into a 2,816-dimensional input for the final classifier. The full model is reported at about 25.05 million parameters.
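To make the fusion concrete, here is a minimal PyTorch sketch of such a dual-branch model. Only the reported dimensions (1,792 spatial features, two 512-dimensional frequency vectors fused to 1,024, and a 2,816-dimensional classifier input) come from the write-up; the timm backbone loading, the spectrum pooling size, the hidden widths, and all layer names are assumptions, not the project's actual implementation.

```python
# Minimal sketch of the dual-branch architecture described above.
# Dimensions (1792 / 512+512 -> 1024 / 2816) follow the write-up; the pooling
# grid, hidden widths, and layer names are assumptions.
import math
import torch
import torch.nn as nn
import timm  # assumed backbone source; the project may load EfficientNet-B4 differently


def dct_2d(x: torch.Tensor) -> torch.Tensor:
    """Orthonormal 2D DCT-II of a (B, H, W) batch via a basis-matrix multiply."""
    n = x.shape[-1]
    k = torch.arange(n, dtype=x.dtype, device=x.device)
    basis = torch.cos(math.pi / n * (k[None, :] + 0.5) * k[:, None]) * math.sqrt(2.0 / n)
    basis[0, :] = math.sqrt(1.0 / n)
    return basis @ x @ basis.T


class FrequencyBranch(nn.Module):
    """FFT and DCT features from a grayscale face crop, each mapped to 512 dims."""
    def __init__(self, grid: int = 32):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(grid)  # reduce 224x224 spectra to a fixed grid (assumption)
        self.fft_mlp = nn.Sequential(nn.Linear(grid * grid, 512), nn.ReLU())
        self.dct_mlp = nn.Sequential(nn.Linear(grid * grid, 512), nn.ReLU())
        self.fuse = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())  # small fusion MLP -> 1024 dims

    def forward(self, gray: torch.Tensor) -> torch.Tensor:  # gray: (B, 224, 224)
        fft_mag = torch.log1p(torch.abs(torch.fft.fft2(gray)))  # log-magnitude spectrum
        dct_mag = torch.log1p(torch.abs(dct_2d(gray)))
        fft_vec = self.fft_mlp(self.pool(fft_mag.unsqueeze(1)).flatten(1))
        dct_vec = self.dct_mlp(self.pool(dct_mag.unsqueeze(1)).flatten(1))
        return self.fuse(torch.cat([fft_vec, dct_vec], dim=1))  # (B, 1024)


class SpatialFrequencyDetector(nn.Module):
    def __init__(self):
        super().__init__()
        # EfficientNet-B4 with the classifier head removed yields 1792-dim pooled features.
        self.backbone = timm.create_model("efficientnet_b4", pretrained=True, num_classes=0)
        self.freq = FrequencyBranch()
        self.classifier = nn.Sequential(
            nn.Linear(1792 + 1024, 512), nn.ReLU(), nn.Dropout(0.3), nn.Linear(512, 2)
        )

    def forward(self, face: torch.Tensor) -> torch.Tensor:  # face: (B, 3, 224, 224)
        spatial = self.backbone(face)              # (B, 1792)
        freq = self.freq(face.mean(dim=1))         # grayscale by channel mean (assumption)
        return self.classifier(torch.cat([spatial, freq], dim=1))  # (B, 2) real/fake logits
```

The DCT here is written as an explicit basis-matrix multiply so the whole frequency branch stays differentiable inside PyTorch; the actual project may instead compute frequency features during preprocessing.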
- The model operates on 224x224 RGB face crops.
- The training data is based on FaceForensics++ (C23) and a preprocessed set of 716,438 face images.
- The preprocessing pipeline uses 1 FPS frame extraction, YOLOv11n face detection, and padded face crops (sketched after this list).
- GradCAM heatmaps are remapped back onto the original video to show suspected manipulation regions.
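The preprocessing steps in the list above map naturally onto OpenCV and Ultralytics. The following is a hedged sketch: the checkpoint name `yolov11n-face.pt`, the 20% padding factor, and the function name are assumptions and are not taken from the repository.

```python
# Hedged sketch of the listed preprocessing: 1 FPS frame sampling, YOLO face
# detection, and padded 224x224 crops. Checkpoint name and padding factor are
# assumptions, not values from the repo.
import cv2
from ultralytics import YOLO

detector = YOLO("yolov11n-face.pt")  # assumed face-detection checkpoint

def extract_face_crops(video_path: str, pad: float = 0.2, size: int = 224):
    cap = cv2.VideoCapture(video_path)
    fps = max(int(cap.get(cv2.CAP_PROP_FPS)), 1)
    crops, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % fps == 0:  # keep roughly one frame per second
            result = detector(frame, verbose=False)[0]
            for box in result.boxes.xyxy.cpu().numpy():
                x1, y1, x2, y2 = box.astype(int)
                dw, dh = int((x2 - x1) * pad), int((y2 - y1) * pad)  # pad the box
                x1, y1 = max(x1 - dw, 0), max(y1 - dh, 0)
                x2, y2 = min(x2 + dw, frame.shape[1]), min(y2 + dh, frame.shape[0])
                crops.append(cv2.resize(frame[y1:y2, x1:x2], (size, size)))
        idx += 1
    cap.release()
    return crops
```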
The explainability layer is what makes the release notable. Deepfake detectors often look strong on curated benchmarks but are hard to trust in deployment because users cannot see whether the model is responding to true manipulation artifacts or to accidental shortcuts. By projecting GradCAM signals back onto the source frames, VeridisQuo gives researchers at least one way to inspect whether attention is landing on blend boundaries, jaw regions, and other facial areas that plausibly correlate with generated edits.
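Projecting a crop-level GradCAM map back onto the source frame is mostly bookkeeping: resize the heatmap to the padded crop box and alpha-blend it into the frame. A minimal sketch, assuming the heatmap has already been computed on the 224x224 crop and normalized to [0, 1]:

```python
# Overlay a Grad-CAM heatmap (computed on the face crop) onto the original
# frame. `box` is assumed to be the padded crop box (x1, y1, x2, y2) in frame
# coordinates; `alpha` is an assumed blending weight.
import cv2
import numpy as np

def overlay_cam(frame: np.ndarray, heatmap: np.ndarray, box, alpha: float = 0.4) -> np.ndarray:
    x1, y1, x2, y2 = box
    cam = cv2.resize(heatmap, (x2 - x1, y2 - y1))                 # crop-space CAM -> frame space
    cam = cv2.applyColorMap(np.uint8(255 * cam), cv2.COLORMAP_JET)
    out = frame.copy()
    out[y1:y2, x1:x2] = cv2.addWeighted(cam, alpha, out[y1:y2, x1:x2], 1 - alpha, 0)
    return out
```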
The authors also shared limitations instead of overselling the result. They reported roughly 96% accuracy on the held-out test split and a false-positive rate of around 7-8%, but also noted that predictions on random real-world videos still skew too often toward “FAKE.” That admission is important because it acknowledges the usual generalization gap between benchmark evaluation and open-world use. For a university project, that level of transparency makes the release more useful to the community.
The community post is on r/MachineLearning. The original materials are available in the GitHub repository and the Hugging Face demo.