
Nikhil Reddy

Posted on • Originally published at Medium

A Robust DeepFake Detection system

Deepfake videos are a growing concern in today’s digital landscape due to their potential to spread misinformation and manipulate public opinion. In this project, we aim to develop a system that detects deepfake videos using the DeepFake Detection Challenge (DFDC) dataset from Kaggle. Our ultimate goal is to create a reliable tool for real-time detection of deepfakes, helping to preserve the authenticity of media content.

Problem Domain and Brief Project Description

The advancement of artificial intelligence technologies has made it possible to create highly realistic yet fake videos. Deepfake videos use AI to superimpose faces or manipulate expressions, making it increasingly challenging to distinguish between real and fake content. This poses a threat to the credibility of online media and can have serious implications for society.

Our project seeks to address this issue by creating a system capable of detecting deepfake videos. The system will take video content as input and provide a prediction on whether it is real or fake. The DFDC dataset, which contains a variety of real and fake videos, serves as the foundation for our training and evaluation.

Exploring Deep Fake

Deepfakes have spurred a lot of research and development in the field of deepfake detection, leading to the creation and evaluation of numerous models and architectures. Some general models and architectures that have shown promising progress include:

  1. EfficientNet: EfficientNet is a family of models known for their efficient use of parameters and computational resources while maintaining high performance. Variants such as EfficientNet-B4 and EfficientNet-B4-ST have been employed in deepfake detection with success.

  2. XceptionNet: XceptionNet is an extension of InceptionNet and is known for its use of depthwise separable convolutions. It has been widely used in image classification tasks and is effective in detecting deepfake videos due to its strong feature extraction capabilities. XceptionNet for FaceForensics++: https://paperswithcode.com/paper/faceforensics-learning-to-detect-manipulated

  3. Attention Mechanisms: Attention-based models, such as self-attention and multi-head attention, have been used in conjunction with CNN architectures to capture long-range dependencies and improve deepfake detection accuracy.

  4. Ensemble Models: Ensemble methods combine multiple models to improve overall performance and robustness. Ensembles of CNNs with self-attention mechanisms and other deep learning architectures have shown effectiveness in deepfake detection.
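The attention mechanisms mentioned above can be illustrated in a few lines. Below is a minimal sketch of single-head scaled dot-product self-attention in NumPy, applied to a sequence of patch features; a trained model would additionally apply learned query/key/value projections, which are omitted here for brevity:

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product self-attention over a sequence
    of feature vectors x with shape (seq_len, dim)."""
    d = x.shape[-1]
    # For illustration, x serves directly as queries, keys, and values.
    scores = x @ x.T / np.sqrt(d)                 # (seq_len, seq_len) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ x                            # each output attends to all inputs

# Example: 4 patch features of dimension 8
feats = np.random.default_rng(0).normal(size=(4, 8))
out = self_attention(feats)
print(out.shape)  # (4, 8)
```

Because every output position is a weighted sum over all input positions, attention captures the long-range dependencies that plain convolutions, with their local receptive fields, can miss.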

These models and architectures have contributed significantly to the progress of deepfake detection.

Experimenting with Model Architectures

While many deepfake detection approaches leverage temporal information from video data, our project explores model architectures that focus on spatial modeling without incorporating explicit temporal components.

The rationale behind this approach is that even in the absence of temporal modeling, there may be inherent visual cues and artifacts within individual video frames that can reliably distinguish real from deepfake content. Facial textures, lighting inconsistencies, and other localized anomalies may be present in deepfake images that can be captured by models that prioritize spatial feature extraction and analysis.

Our experiments involve the following types of model architectures:

  • CoAtNet: This model represents a hybrid approach, combining the strengths of CNNs and transformers. It incorporates depthwise convolutions from CNNs for efficient feature extraction and relative attention from transformers to capture long-range dependencies within image data.


  • DenseNet with Vision Transformers: We are experimenting with integrating DenseNet architecture with vision transformers to leverage the rich feature extraction capabilities of DenseNet and the long-range context-handling ability of transformers.
  • Efficient Vision Transformer: Similar to a CNN-ViT integration, this approach has proven successful in detecting deepfakes on the datasets we use. It offers an efficient architecture that balances convolutional feature extraction with transformer-based context analysis.


Data Preprocessing

Proper data preprocessing is essential for training accurate and robust models. In our project, we focus on:

  • Face Extraction: We employ various methods such as MTCNN (Multi-task Cascaded Convolutional Networks) for effective face extraction from videos.
  • Data Augmentation: We apply techniques such as CutMix, which combines patches from different training samples (with correspondingly mixed labels) to expose models to a wider variety of data during training.
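As a rough sketch of the CutMix idea: paste a random rectangle from one image into another and mix the labels in proportion to the pasted area. The canonical implementation samples the mix ratio from a Beta distribution; this illustration uses a simple uniform range instead:

```python
import numpy as np

def cutmix(img_a, img_b, label_a, label_b, rng=None):
    """Minimal CutMix: paste a random rectangle from img_b into img_a
    and mix the labels by the area actually pasted.
    Images are (H, W, C) arrays; labels are scalars in [0, 1]."""
    if rng is None:
        rng = np.random.default_rng()
    h, w, _ = img_a.shape
    lam = rng.uniform(0.3, 0.7)                       # target fraction of img_a kept
    cut_h = int(h * np.sqrt(1 - lam))
    cut_w = int(w * np.sqrt(1 - lam))
    y = rng.integers(0, h - cut_h + 1)
    x = rng.integers(0, w - cut_w + 1)
    mixed = img_a.copy()
    mixed[y:y + cut_h, x:x + cut_w] = img_b[y:y + cut_h, x:x + cut_w]
    kept = 1 - (cut_h * cut_w) / (h * w)              # exact fraction of img_a kept
    return mixed, kept * label_a + (1 - kept) * label_b

# Example: mix a "real" (label 0) and a "fake" (label 1) face crop
rng = np.random.default_rng(1)
a, b = np.zeros((32, 32, 3)), np.ones((32, 32, 3))
mixed, label = cutmix(a, b, 0.0, 1.0, rng)
```

The soft label teaches the model that authenticity is a property of image regions, not just whole frames, which is useful for deepfakes where only the face region is manipulated.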

Model Evaluation

After training the model, we will evaluate its performance using a test set from the DFDC dataset. This evaluation will help us understand how well the model is performing in detecting deepfakes and where improvements might be needed.
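One headline metric for this task is ROC AUC, which can be computed directly from its rank interpretation: the probability that a randomly chosen fake scores higher than a randomly chosen real. A minimal NumPy sketch:

```python
import numpy as np

def auc(labels: np.ndarray, scores: np.ndarray) -> float:
    """ROC AUC via the rank statistic: P(score of a random fake (label 1)
    > score of a random real (label 0)), with ties counted as half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()   # pairwise comparisons
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)

# Toy check: perfectly separated predictions give AUC = 1.0
y = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.2, 0.8, 0.9])
print(auc(y, p))  # 1.0
```

The pairwise version is O(n²) and meant only to make the definition concrete; for large test sets a rank-based or library implementation is the practical choice.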

Current best models


1. Cross Efficient Vision Transformer

  • AUC: 0.951
  • Model: Cross Efficient Vision Transformer
  • Tags: CNN+Transformer, Vision Transformer, EfficientNet
  • Paper: Combining EfficientNet and Vision Transformers for Video Deepfake Detection
  • Year: 2021

The Cross-Efficient Vision Transformer model represents an innovative approach by combining EfficientNet and Vision Transformers for deepfake detection. This hybrid architecture leverages the strengths of EfficientNet’s feature extraction capabilities and Vision Transformer’s long-range dependency analysis. The model’s AUC of 0.951 signifies its high effectiveness in detecting deepfake videos.

2. Efficient Vision Transformer

  • AUC: 0.919
  • Model: Efficient Vision Transformer
  • Tags: CNN+Transformer, Vision Transformer, EfficientNet
  • Paper: Combining EfficientNet and Vision Transformers for Video Deepfake Detection
  • Year: 2021

The Efficient Vision Transformer is another powerful model that integrates EfficientNet and Vision Transformers. With an AUC of 0.919, this architecture has demonstrated strong performance in detecting deepfake videos, showcasing the potential of combining CNNs and transformers for this task.

3. EfficientNetB4 + EfficientNetB4ST + B4Att

This ensemble model combines variants of EfficientNet (EfficientNetB4 and EfficientNetB4ST) with B4Att to enhance deepfake detection. The approach uses multiple models in an ensemble to improve overall performance and robustness. The reported LogLoss of 0.4640 indicates a solid performance in detecting manipulated videos.
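The DFDC competition metric is binary log loss, and the simplest form of ensembling averages each model's predicted fake probability. A sketch of both, where the per-model predictions and labels are hypothetical placeholders, not results from our experiments:

```python
import numpy as np

def log_loss(labels, probs, eps=1e-7):
    """Binary cross-entropy, the DFDC metric (lower is better).
    Probabilities are clipped away from 0 and 1 to avoid infinite loss."""
    probs = np.clip(probs, eps, 1 - eps)
    return float(-np.mean(labels * np.log(probs)
                          + (1 - labels) * np.log(1 - probs)))

# Hypothetical fake probabilities from three models on three clips
preds_b4    = np.array([0.90, 0.20, 0.70])
preds_b4st  = np.array([0.80, 0.30, 0.60])
preds_b4att = np.array([0.85, 0.10, 0.80])
labels      = np.array([1, 0, 1])

ensemble = (preds_b4 + preds_b4st + preds_b4att) / 3  # simple average
print(log_loss(labels, ensemble))
```

Averaging tends to help because individual models make partly uncorrelated errors; more elaborate ensembles learn per-model weights instead of a plain mean.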

Challenges and Future Work

While our model is still under development, we anticipate several challenges and areas for future work:

  • Distinguishing Real and Fake: The sophistication of deepfake technology presents a challenge in accurately distinguishing between real and fake content. Continued improvement and experimentation with model architectures will be key.
  • Real-Time Detection: Achieving real-time deepfake detection is a long-term goal. This requires optimizing the model for speed and efficiency while maintaining accuracy.
  • Scalability and Adaptability: Our system needs to be scalable and adaptable to different types of video content and evolving deepfake techniques.

Tools and Resources

For this project, we leveraged several tools and resources:

  • DFDC Dataset from Kaggle: This comprehensive dataset provided real and fake videos for training and evaluation.
  • Face Detection Libraries: BlazeFace, MTCNN, and face_recognition for locating and cropping faces in video frames.
  • Code and Resources: We are building upon existing code and resources for deep learning and computer vision, which provides a strong foundation for our project.

Note: Due to limited computational resources, we train our model on only a subset of the DFDC dataset.

Conclusion

While our deepfake detection model is still under development, the project holds great potential for addressing the pressing issue of deepfake videos. By leveraging advanced techniques such as vision transformers, we aim to create an accurate system for detecting deepfakes in real time.

As we continue our work, we are excited to explore new approaches and make improvements to our model. This project will ultimately contribute to a safer and more trustworthy digital environment by combating the spread of misinformation and ensuring the authenticity of online media. We look forward to sharing our progress and results as our project evolves.
