Reddit Project Watch: VeridisQuo Combines EfficientNet, FFT, and DCT for Explainable Deepfake Detection

Original: [P] VeridisQuo - open-source deepfake detector that combines spatial + frequency analysis and shows you where the face was manipulated

AI · Mar 7, 2026 · By Insights AI (Reddit) · 2 min read

A r/MachineLearning project post that crossed the 100-point mark this week introduced VeridisQuo, an open-source deepfake detector built around a two-stream design. The post is notable because it includes a rare amount of implementation detail for a community showcase: model architecture, training data, hardware budget, and the explainability path used to visualize manipulated regions in video.

The core idea is to combine conventional spatial features with frequency-domain cues. According to the author, VeridisQuo uses an EfficientNet-B4 backbone for the spatial branch, producing a 1792-dimensional representation from each face crop. In parallel, a frequency module computes FFT features (radially binned magnitudes, with a Hann window applied first) and DCT features on 8x8 blocks. Each frequency path yields a 512-dimensional vector; the two are fused through an MLP into a 1024-dimensional representation, which is then concatenated with the 1792-dimensional spatial features for a 2816-dimensional classifier input.
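The FFT path of the frequency branch can be sketched in plain NumPy. The function name, bin count, normalization, and log scaling below are illustrative assumptions, not VeridisQuo's exact implementation:

```python
import numpy as np

def fft_radial_features(gray, n_bins=64):
    """Radially binned FFT magnitude spectrum of a grayscale face crop.

    A minimal sketch of the frequency branch described in the post;
    the project's actual windowing and normalization may differ.
    """
    h, w = gray.shape
    # 2D Hann window to suppress edge artifacts before the FFT
    win = np.outer(np.hanning(h), np.hanning(w))
    spectrum = np.fft.fftshift(np.fft.fft2(gray * win))
    mag = np.log1p(np.abs(spectrum))

    # Distance of every frequency coefficient from the spectrum center
    cy, cx = h // 2, w // 2
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)
    r_norm = r / r.max()

    # Average the magnitude over concentric rings (radial binning)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(r_norm, bins) - 1, 0, n_bins - 1)
    return np.array([mag[idx == i].mean() if np.any(idx == i) else 0.0
                     for i in range(n_bins)])

crop = np.random.rand(224, 224)  # stand-in for a detected face crop
print(fft_radial_features(crop).shape)  # (64,)
```

The appeal of radial binning is rotation invariance: high-frequency energy in the outer rings tends to behave differently for generated faces than for camera footage, which is the cue the fused model is meant to exploit.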

The explainability piece is what makes the project easy to evaluate from the outside. VeridisQuo computes GradCAM heatmaps on the EfficientNet backbone and remaps them back onto the original video frames, so users can inspect which face regions influenced the detector. The author says the model often highlights blending boundaries and jawline areas, which is consistent with the kinds of local artifacts many practitioners expect in compressed or composited deepfake footage.
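The remapping step reduces to a small geometric operation: upsample the low-resolution CAM to the size of the face box, then paste it into frame coordinates. The function name, nearest-neighbor interpolation, and min-max normalization here are assumptions for illustration, not the project's code:

```python
import numpy as np

def remap_cam_to_frame(cam, box, frame_shape):
    """Paste a low-res GradCAM map back into original-frame coordinates.

    `cam` is the backbone's coarse activation map; `box` is the
    (x1, y1, x2, y2) face crop in frame coordinates.
    """
    x1, y1, x2, y2 = box
    bh, bw = y2 - y1, x2 - x1
    # Nearest-neighbor upsample of the CAM to the face-box size
    ys = np.arange(bh) * cam.shape[0] // bh
    xs = np.arange(bw) * cam.shape[1] // bw
    cam_up = cam[np.ix_(ys, xs)]
    # Normalize to [0, 1] so heatmaps are comparable across frames
    cam_up = (cam_up - cam_up.min()) / (np.ptp(cam_up) + 1e-8)
    heat = np.zeros(frame_shape[:2], dtype=np.float32)
    heat[y1:y2, x1:x2] = cam_up
    return heat

cam = np.random.rand(7, 7)  # e.g., a CAM over the backbone's last conv map
heat = remap_cam_to_frame(cam, (10, 20, 74, 84), (480, 640))
print(heat.shape)  # (480, 640)
```

Overlaying `heat` with alpha blending on the frame gives the kind of jawline/blending-boundary visualization the author describes.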

The training setup is also concrete. The project uses FaceForensics++ (C23), including Face2Face, FaceShifter, FaceSwap, and NeuralTextures. Frames were extracted at 1 FPS, faces were detected with YOLOv11n, and the resulting training set reached roughly 716K face images. Training reportedly ran for 7 epochs on a rented RTX 3090 in about four hours using AdamW, cosine annealing, and CrossEntropyLoss. The author’s main claim is that the frequency branch alone does not beat the spatial backbone, but the fused model helps more on higher-quality fakes where pixel-level artifacts are harder to spot.
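For reference, the reported recipe can be collected into a single config-style dict; the field names are illustrative, not the project's actual schema:

```python
# Training setup as reported in the Reddit post; values are the author's
# claims, and the key names here are illustrative only.
train_config = {
    "dataset": "FaceForensics++ (C23)",
    "manipulations": ["Face2Face", "FaceShifter", "FaceSwap", "NeuralTextures"],
    "frame_rate_fps": 1,
    "face_detector": "YOLOv11n",
    "num_face_images": 716_000,   # approximate, per the post
    "epochs": 7,
    "hardware": "rented RTX 3090, ~4 hours",
    "optimizer": "AdamW",
    "lr_schedule": "cosine annealing",
    "loss": "CrossEntropyLoss",
}
print(len(train_config["manipulations"]))  # 4
```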

That is why the Reddit post resonated. Many deepfake demos stop at qualitative output, while this one offers a readable architectural hypothesis: compressed video artifacts live partly in the frequency domain, and the detector gets stronger when those cues are fused with spatial features and paired with visual explanations. The GitHub repository and Hugging Face demo make it easy for other practitioners to test whether that tradeoff holds outside the original training set.

Primary sources: the Reddit post and the VeridisQuo repository.


© 2026 Insights. All rights reserved.