Reddit Project Watch: VeridisQuo Combines EfficientNet, FFT, and DCT for Explainable Deepfake Detection
Original: [P] VeridisQuo - open-source deepfake detector that combines spatial + frequency analysis and shows you where the face was manipulated
A r/MachineLearning project post that crossed the 100-point mark this week introduced VeridisQuo, an open-source deepfake detector built around a two-stream design. The post is notable because it includes a rare amount of implementation detail for a community showcase: model architecture, training data, hardware budget, and the explainability path used to visualize manipulated regions in video.
The core idea is to combine conventional spatial features with frequency-domain cues. According to the author, VeridisQuo uses an EfficientNet-B4 backbone for the spatial branch, producing a 1792-dimensional representation from each face crop. In parallel, a frequency module computes FFT features (with a Hann window and radial binning) and DCT features on 8x8 blocks. These two 512-dimensional frequency vectors are fused by an MLP into a 1024-dimensional representation, which is then concatenated with the 1792-dimensional spatial features to give the classifier a 2816-dimensional input.
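To make the frequency branch concrete, here is a minimal sketch of the two feature types the post describes: a windowed log-magnitude FFT spectrum reduced by radial binning, and blockwise DCT coefficients averaged across 8x8 blocks. Function names, the 64-bin count, and the averaging scheme are assumptions for illustration, not the repository's exact implementation (which reportedly produces 512-dimensional vectors per branch).

```python
import numpy as np
from scipy.fft import dctn

def fft_radial_features(gray, n_bins=64):
    # Apply a 2D Hann window to reduce spectral leakage, then take the
    # log-magnitude FFT spectrum (hypothetical sketch of the FFT branch;
    # the bin count is an assumption).
    h, w = gray.shape
    win = np.outer(np.hanning(h), np.hanning(w))
    log_spec = np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(gray * win))))
    # Radial binning: average spectral energy at each distance from DC.
    cy, cx = h // 2, w // 2
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)
    bins = np.minimum((r / r.max() * n_bins).astype(int), n_bins - 1)
    feats = np.zeros(n_bins)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            feats[b] = log_spec[mask].mean()
    return feats

def dct_block_features(gray, block=8):
    # Type-II DCT on non-overlapping 8x8 blocks, with coefficient
    # magnitudes averaged over blocks to yield one 64-dim profile
    # (again a sketch, not the project's exact layout).
    h, w = gray.shape
    h, w = h - h % block, w - w % block
    blocks = gray[:h, :w].reshape(h // block, block, w // block, block)
    blocks = blocks.transpose(0, 2, 1, 3).reshape(-1, block, block)
    coeffs = np.abs(dctn(blocks, axes=(1, 2), norm="ortho"))
    return np.log1p(coeffs).mean(axis=0).ravel()

face = np.random.default_rng(0).random((224, 224))  # stand-in face crop
f_fft = fft_radial_features(face)  # radial spectrum profile, shape (64,)
f_dct = dct_block_features(face)   # averaged DCT profile, shape (64,)
```

In the fused model, vectors like these would be projected to 512 dimensions each, passed through the fusion MLP, and concatenated with the EfficientNet-B4 features.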
The explainability piece is what makes the project easy to evaluate from the outside. VeridisQuo computes GradCAM heatmaps on the EfficientNet backbone and remaps them back onto the original video frames, so users can inspect which face regions influenced the detector. The author says the model often highlights blending boundaries and jawline areas, which is consistent with the kinds of local artifacts many practitioners expect in compressed or composited deepfake footage.
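The Grad-CAM step is standard enough to sketch. Below is a minimal PyTorch version using a toy CNN as a stand-in for the EfficientNet-B4 backbone; the model, layer choice, and resolution are assumptions, but the mechanics (gradient-weighted activation maps, ReLU, upsampling back to the input frame) match the usual Grad-CAM recipe the project appears to follow.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TinyConvNet(nn.Module):
    # Stand-in for the EfficientNet-B4 backbone: any CNN whose last
    # convolutional activations we can access works the same way.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(16, 2)  # real vs. fake logits

    def forward(self, x):
        acts = self.features(x)
        return self.head(acts.mean(dim=(2, 3))), acts

def grad_cam(model, x, target_class):
    # Grad-CAM: weight each activation map by the spatial mean of the
    # target logit's gradient w.r.t. that map, sum channels, ReLU.
    logits, acts = model(x)
    acts.retain_grad()
    logits[0, target_class].backward()
    weights = acts.grad.mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    # Upsample to input resolution so the heatmap can be overlaid on
    # the original face crop, as the project does per video frame.
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear",
                        align_corners=False)
    return (cam / (cam.max() + 1e-8))[0, 0].detach()

x = torch.rand(1, 3, 64, 64)                      # stand-in face crop
heat = grad_cam(TinyConvNet(), x, target_class=1)  # 64x64 heatmap in [0, 1]
```

Remapping the heatmap onto the full frame then only requires the inverse of the face-crop transform, which is why this visualization path is cheap to bolt onto an existing detector.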
The training setup is also concrete. The project uses FaceForensics++ (C23), including Face2Face, FaceShifter, FaceSwap, and NeuralTextures. Frames were extracted at 1 FPS, faces were detected with YOLOv11n, and the resulting training set reached roughly 716K face images. Training reportedly ran for 7 epochs on a rented RTX 3090 in about four hours using AdamW, cosine annealing, and CrossEntropyLoss. The author’s main claim is that the frequency branch alone does not beat the spatial backbone, but the fused model helps more on higher-quality fakes where pixel-level artifacts are harder to spot.
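The reported recipe (AdamW, cosine annealing, CrossEntropyLoss, 7 epochs) maps onto a few lines of standard PyTorch. The sketch below uses a placeholder model and random batches; the learning rate, weight decay, and batch size are assumptions, since the post does not state them.

```python
import torch
from torch import nn

# Hypothetical training-loop skeleton matching the reported recipe:
# AdamW + CosineAnnealingLR over 7 epochs + CrossEntropyLoss.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=7)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(7):
    # Stand-in batch; real training would iterate a DataLoader over
    # the ~716K extracted face crops.
    x = torch.rand(8, 3, 32, 32)
    y = torch.randint(0, 2, (8,))
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    sched.step()  # anneal the learning rate once per epoch
```

With `T_max=7` the learning rate follows one cosine arc to zero across the full run, which fits the short, fixed-budget training the author describes.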
That is why the Reddit post resonated. Many deepfake demos stop at qualitative output, while this one offers a readable architectural hypothesis: compressed video artifacts live partly in the frequency domain, and the detector gets stronger when those cues are fused with spatial features and paired with visual explanations. The GitHub repository and Hugging Face demo make it easy for other practitioners to test whether that tradeoff holds outside the original training set.
Primary sources: the Reddit post and the VeridisQuo repository.