VeridisQuo combines spatial and frequency cues for explainable deepfake detection

VeridisQuo is a student-built deepfake detection project that tries to make video forensics more explainable, not just more accurate. The core premise is that many detectors focus mainly on pixel-level visual features, while generated media also leaves traces in the frequency domain through compression artifacts and spectral inconsistencies. VeridisQuo therefore combines a standard spatial vision backbone with dedicated frequency analysis and then visualizes where the model believes manipulation is happening.

According to the README and the r/MachineLearning write-up, the spatial branch uses an ImageNet-pretrained EfficientNet-B4 to produce a 1,792-dimensional representation. The frequency branch computes both FFT and DCT features from each cropped face image, producing two 512-dimensional vectors that a small MLP fuses into a 1,024-dimensional representation. The spatial and fused frequency representations are then concatenated into a 2,816-dimensional input (1,792 + 1,024) for the final classifier. The full model is reported at about 25.05 million parameters.
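To make the dimension bookkeeping concrete, here is a minimal NumPy sketch of the two-stream fusion. Only the sizes (1,792, 512, 512, 1,024, 2,816) come from the write-up. The block-average pooling in `pool_to_512`, the single-layer fusion in `fuse`, the grayscale input, and the random weights are assumptions made for illustration, and the spatial vector is a random stand-in for the EfficientNet-B4 output. It is not the project's actual code.

```python
import numpy as np

SPATIAL_DIM, FREQ_DIM, FUSED_DIM = 1792, 512, 1024  # sizes quoted in the write-up

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m

def pool_to_512(spectrum):
    """Block-average a 224x224 spectrum onto a 16x32 grid (assumed reduction)."""
    assert spectrum.shape == (224, 224)
    return spectrum.reshape(16, 14, 32, 7).mean(axis=(1, 3)).ravel()

def frequency_vectors(gray):
    """Return (fft_vec, dct_vec), each of length 512, from a 224x224 grayscale crop."""
    fft_mag = np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(gray))))
    d = dct_matrix(gray.shape[0])
    dct_mag = np.log1p(np.abs(d @ gray @ d.T))  # separable 2D DCT-II
    return pool_to_512(fft_mag), pool_to_512(dct_mag)

def fuse(spatial, fft_vec, dct_vec, w, b):
    """One ReLU layer stands in for the 'small MLP'; then concatenate both streams."""
    freq = np.maximum(0.0, w @ np.concatenate([fft_vec, dct_vec]) + b)
    return np.concatenate([spatial, freq])

rng = np.random.default_rng(0)
gray = rng.random((224, 224))                    # placeholder face crop
spatial = rng.standard_normal(SPATIAL_DIM)       # placeholder backbone features
w = rng.standard_normal((FUSED_DIM, 2 * FREQ_DIM)) * 0.01  # untrained weights
b = np.zeros(FUSED_DIM)

fft_vec, dct_vec = frequency_vectors(gray)
classifier_input = fuse(spatial, fft_vec, dct_vec, w, b)
print(classifier_input.shape)  # (2816,)
```

The point of the sketch is the shape arithmetic: two 512-dimensional frequency vectors become one 1,024-dimensional vector, which joins the 1,792-dimensional spatial vector to give the 2,816-dimensional classifier input.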

The model operates on 224x224 RGB face crops.
The training data is based on FaceForensics++ (C23) and a preprocessed dataset of 716,438 face images.
The preprocessing pipeline uses 1 FPS frame extraction, YOLOv11n face detection, and padded face crops.
GradCAM heatmaps are remapped back onto the original video to show suspected manipulation regions.
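The preprocessing and remapping steps above can be sketched in plain NumPy. This is an illustration under stated assumptions, not the project's pipeline: the 20% padding value, the nearest-neighbour resize, and all function names are hypothetical, and the face box is hard-coded where the real system would use a YOLOv11n detection.

```python
import numpy as np

def frames_at_1fps(total_frames, fps):
    """Indices of frames sampled about once per second."""
    step = max(1, int(round(fps)))
    return list(range(0, total_frames, step))

def padded_box(box, frame_shape, pad=0.2):
    """Grow an (x1, y1, x2, y2) box by `pad` of its size per side, clamped to the frame."""
    x1, y1, x2, y2 = box
    h, w = frame_shape[:2]
    dx, dy = pad * (x2 - x1), pad * (y2 - y1)
    return (max(0, int(round(x1 - dx))), max(0, int(round(y1 - dy))),
            min(w, int(round(x2 + dx))), min(h, int(round(y2 + dy))))

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize for 2D or 3D arrays."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows[:, None], cols]

def heatmap_to_frame(heatmap, box, frame_shape):
    """Place a crop-sized heatmap back at the crop's location in the full frame."""
    x1, y1, x2, y2 = box
    full = np.zeros(frame_shape[:2], dtype=float)
    full[y1:y2, x1:x2] = resize_nearest(heatmap, y2 - y1, x2 - x1)
    return full

frame = np.zeros((720, 1280, 3), dtype=np.uint8)       # placeholder video frame
box = padded_box((600, 200, 700, 320), frame.shape)    # hypothetical detection
x1, y1, x2, y2 = box
crop = resize_nearest(frame[y1:y2, x1:x2], 224, 224)   # model input size
overlay = heatmap_to_frame(np.ones((224, 224)), box, frame.shape)
print(box, crop.shape, overlay.shape)
```

Keeping the padded box alongside each crop is what makes the remapping possible: the heatmap is resized to the box dimensions and written back at the same coordinates.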

The explainability layer is what makes the release notable. Deepfake detectors often look strong on curated benchmarks but are hard to trust in deployment because users cannot see whether the model is responding to true manipulation artifacts or to accidental shortcuts. By projecting GradCAM signals back onto the source frames, VeridisQuo gives researchers at least one way to inspect whether attention is landing on blend boundaries, jaw regions, and other facial areas that plausibly correlate with generated edits.
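For readers unfamiliar with the technique, the standard Grad-CAM computation weights each feature map by the spatial average of its gradient, sums the weighted maps, and keeps only the positive part. The sketch below shows that step on toy arrays. The inputs are invented, and which layer VeridisQuo hooks for its heatmaps is not stated here, so treat this as a generic illustration.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM map from (K, H, W) activations and matching class-score gradients."""
    weights = gradients.mean(axis=(1, 2))                  # one weight per channel
    cam = np.maximum(0.0, np.tensordot(weights, activations, axes=1))
    peak = cam.max()
    return cam / peak if peak > 0 else cam                 # scale to [0, 1]

# Toy example: channel 0 supports the class, channel 1 argues against it.
activations = np.array([[[1.0, 0.0], [0.0, 0.0]],
                        [[0.0, 0.0], [0.0, 1.0]]])
gradients = np.array([[[2.0, 2.0], [2.0, 2.0]],
                      [[-1.0, -1.0], [-1.0, -1.0]]])
heatmap = grad_cam(activations, gradients)
print(heatmap)
```

In the toy case only the top-left cell stays lit, because the second channel's negative weight is clipped away by the ReLU.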

The authors also shared limitations instead of overselling the result. They reported roughly 96% accuracy on the held-out test split and a false-positive rate around 7-8%, but also noted that random real-world videos still skew too often toward “FAKE.” That admission is important because it acknowledges the usual generalization gap between benchmark evaluation and open-world use. For a university project, that level of transparency makes the release more useful to the community.
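A short calculation shows how a high accuracy and a noticeable false-positive rate can coexist. The confusion-matrix counts below are invented purely for illustration and are not the project's results. They are chosen so the totals land near the reported figures.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def false_positive_rate(fp, tn):
    """Share of real samples that are wrongly flagged as fake."""
    return fp / (fp + tn)

# Hypothetical counts: 1,000 real and 4,000 fake test images.
tn, fp = 925, 75       # real images: correctly passed / wrongly flagged
tp, fn = 3875, 125     # fake images: correctly flagged / missed

acc = accuracy(tp, tn, fp, fn)
fpr = false_positive_rate(fp, tn)
print(f"accuracy={acc:.3f}, false-positive rate={fpr:.3f}")
```

Because accuracy averages over both classes, a test set with many more fake than real samples can hide errors on real footage, which is the kind of input an open-world user is most likely to upload.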

The community post is on r/MachineLearning. The original materials are available in the GitHub repository and the Hugging Face demo.
