Introduction
With the proliferation of deepfake technology, the ability to discern real from manipulated content has become a critical concern. Deepfake videos, often indistinguishable from authentic footage, pose serious threats to privacy, security, and the integrity of information.
In response to this challenge, cutting-edge research in computer vision and machine learning has yielded innovative solutions for detecting deepfakes. Among these, the fusion of convolutional neural networks (ConvNets) and attention-based models has emerged as a promising approach.
In this project, we present an in-depth exploration of deepfake detection using ConvNets with attention, specifically focusing on the CoAtNet architecture. CoAtNet, a novel family of image recognition models, seamlessly integrates ConvNets and attention mechanisms, offering a powerful tool for analyzing facial images extracted from videos.
Data Processing
We used only 3 of the 50 data chunks in the DFDC dataset for our project, which amounted to roughly 30 GB of video.
Overview of preprocessing steps:
Face Extraction Using BlazeFace:
BlazeFace is a lightweight and efficient face-detection model. We use it to extract faces from each frame of the videos in the DFDC dataset, so that only the relevant facial regions are considered for analysis. Here are some example face extractions from a video sample (a minimal code sketch of this step follows the examples).
Real video:
DeepFake video:
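Below is a minimal sketch of this per-frame face-cropping step. The detect_faces() helper is a stand-in for whatever BlazeFace wrapper is used (we relied on the BlazeFace weights shared for the DFDC competition); it is assumed to return bounding boxes as (x1, y1, x2, y2) pixel coordinates, and the frame-sampling interval, margin, and 224x224 crop size are illustrative choices.

import cv2

def extract_faces(video_path, detect_faces, every_n_frames=10, margin=0.2):
    """Sample frames from a video and return 224x224 RGB face crops."""
    faces = []
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % every_n_frames == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # detectors expect RGB
            for (x1, y1, x2, y2) in detect_faces(rgb):
                h, w = y2 - y1, x2 - x1
                # enlarge the box a little so the whole face is kept
                y1 = max(0, int(y1 - margin * h))
                y2 = int(y2 + margin * h)
                x1 = max(0, int(x1 - margin * w))
                x2 = int(x2 + margin * w)
                faces.append(cv2.resize(rgb[y1:y2, x1:x2], (224, 224)))
        frame_idx += 1
    cap.release()
    return faces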
Normalization of Pixel Values:
After face extraction, pixel values of the extracted facial images are normalized. Normalization standardizes the pixel values to a common scale, typically between 0 and 1, to improve the convergence and stability of the training process.
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
These mean and std values were provided by Kaggle for the DFDC competition (they are the standard ImageNet statistics); we use them to normalize the pixel values.
The figure above shows an example of normalized pixel values: we first normalized the face crop and then applied an inverse normalization to visualize it.
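Here is a minimal sketch of the normalization and inverse-normalization steps, assuming a torchvision-based pipeline (the exact transforms in our training code may differ slightly):

from torchvision import transforms

mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

# HWC uint8 image in [0, 255] -> CHW float tensor in [0, 1] -> (x - mean) / std
normalize = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean, std),
])

# inverse normalization, used only to visualize normalized crops
inv_normalize = transforms.Normalize(
    mean=[-m / s for m, s in zip(mean, std)],
    std=[1 / s for s in std],
)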
Augmentation Techniques Using Albumentations:
Augmentation techniques from the Albumentations library are applied to increase the diversity and robustness of the training dataset. Albumentations introduces variation into the training data by applying transformations such as rotation, flipping, scaling, and color adjustments to the facial images. The transforms we use are listed below, followed by a sketch of the corresponding pipeline.
Basic geometric transformations:
- RandomRotate90: rotates the image by 90, 180, or 270 degrees (applied with probability p=0.2).
- Transpose: swaps rows and columns, flipping the image along its main diagonal (p=0.2).
- HorizontalFlip: mirrors the image horizontally (p=0.5).
- VerticalFlip: mirrors the image vertically (p=0.5).
Random effects:
- OneOf([GaussNoise()], p=0.2): adds random Gaussian noise with probability 0.2 (GaussNoise is the only option in this group, so it is always the one applied when the group fires).
Combined transformations:
- ShiftScaleRotate: applies a random shift, scale, and rotation in a single step (p=0.2).
Pixel-level adjustments:
- OneOf([CLAHE(clip_limit=2), Sharpen(), Emboss(), RandomBrightnessContrast()], p=0.2): with probability 0.2, exactly one of the following is applied:
  - CLAHE: Contrast Limited Adaptive Histogram Equalization (improves local contrast).
  - Sharpen: enhances image edges.
  - Emboss: creates a raised or sunken effect.
  - RandomBrightnessContrast: randomly adjusts brightness and contrast.
Color adjustments:
- HueSaturationValue: randomly modifies the image's hue (color), saturation (intensity), and value (brightness) with probability 0.2.
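Here is a sketch of this pipeline as an Albumentations Compose, reconstructed from the list above. The exact parameters in our code may differ, and a recent Albumentations release is assumed (older versions name the sharpen/emboss transforms IAASharpen and IAAEmboss).

import albumentations as A

train_aug = A.Compose([
    # basic geometric transformations
    A.RandomRotate90(p=0.2),
    A.Transpose(p=0.2),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    # random effects
    A.OneOf([A.GaussNoise()], p=0.2),
    # combined shift / scale / rotate
    A.ShiftScaleRotate(p=0.2),
    # pixel-level adjustments: exactly one fires with probability 0.2
    A.OneOf([
        A.CLAHE(clip_limit=2),
        A.Sharpen(),
        A.Emboss(),
        A.RandomBrightnessContrast(),
    ], p=0.2),
    # color adjustments
    A.HueSaturationValue(p=0.2),
])

# usage: augmented = train_aug(image=face_crop)["image"]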
Temporal Consistency vs. Face Extraction and Classification
- Temporal Consistency refers to maintaining coherence across sequential frames in video analysis, often achieved through models integrating time-based architectures like LSTM or GRU to capture temporal dependencies. However, recent advancements demonstrate that face extraction and classification alone can yield effective results without explicitly modeling temporal relationships.
- By focusing solely on face extraction and classification, without considering temporal consistency, the model can efficiently detect deepfake content while simplifying the architecture and reducing computational complexity.
- We focused solely on detecting image and video manipulation and ignored audio, whereas many of the current best models detect audio manipulation as well.
- Many current models leverage efficient vision transformers, such as the Cross Efficient Vision Transformer (CEViT), which combines an efficient convolutional backbone with transformer attention for improved performance across computer vision tasks.
CoAtNet Architecture
CoAtNet is a family of image recognition models that combines the strengths of convolutional neural networks (ConvNets) and attention-based models (like Transformers). The CoAtNet models are designed for efficient image classification, making them well suited to processing large volumes of facial images extracted from videos.
The CoAtNet architecture comprises five stages (S0, S1, S2, S3, S4), each tailored to specific characteristics of the data and task at hand. Beginning with a simple 2-layer convolutional stem in S0, the subsequent stages employ a combination of MBConv blocks with squeeze-excitation (SE) and Transformer blocks.
To balance performance and capacity, stage types are chosen strategically: convolution stages precede Transformer stages, since convolutions are better at processing the local patterns that dominate early stages. This leads to four variants, C-C-C-C, C-C-C-T, C-C-T-T, and C-T-T-T, with varying numbers of convolution and Transformer stages. The experiments in the CoAtNet paper show that the C-C-T-T configuration yields the best balance between generalization ability and model capacity, so that is the configuration we use.
Project Architecture
Our approach in this project is to use CoAtNet-0. The CoAtNet authors proposed five reference architectures (CoAtNet-0 to CoAtNet-4); CoAtNet-0 is the smallest, and we used it to keep our detector small and compact. Here is a brief explanation of our model layers:
model = CoAtNet(image_size=(224, 224), in_channels=3, num_blocks=[2, 2, 3, 5, 2], channels=[64, 96, 192, 384, 768], num_classes=2)
summary(model, (3, 224, 224))  # summary printed with the torchsummary package
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 64, 112, 112] 1,728
BatchNorm2d-2 [-1, 64, 112, 112] 128
GELU-3 [-1, 64, 112, 112] 0
Conv2d-4 [-1, 64, 112, 112] 36,864
BatchNorm2d-5 [-1, 64, 112, 112] 128
GELU-6 [-1, 64, 112, 112] 0
MaxPool2d-7 [-1, 64, 56, 56] 0
Conv2d-8 [-1, 96, 56, 56] 6,144
BatchNorm2d-9 [-1, 64, 112, 112] 128
Conv2d-10 [-1, 256, 56, 56] 16,384
BatchNorm2d-11 [-1, 256, 56, 56] 512
GELU-12 [-1, 256, 56, 56] 0
Conv2d-13 [-1, 256, 56, 56] 2,304
BatchNorm2d-14 [-1, 256, 56, 56] 512
GELU-15 [-1, 256, 56, 56] 0
AdaptiveAvgPool2d-16 [-1, 256, 1, 1] 0
Linear-17 [-1, 16] 4,096
GELU-18 [-1, 16] 0
Linear-19 [-1, 256] 4,096
Sigmoid-20 [-1, 256] 0
SE-21 [-1, 256, 56, 56] 0
Conv2d-22 [-1, 96, 56, 56] 24,576
BatchNorm2d-23 [-1, 96, 56, 56] 192
PreNorm-24 [-1, 96, 56, 56] 0
MBConv-25 [-1, 96, 56, 56] 0
BatchNorm2d-26 [-1, 96, 56, 56] 192
Conv2d-27 [-1, 384, 56, 56] 36,864
BatchNorm2d-28 [-1, 384, 56, 56] 768
GELU-29 [-1, 384, 56, 56] 0
Conv2d-30 [-1, 384, 56, 56] 3,456
BatchNorm2d-31 [-1, 384, 56, 56] 768
GELU-32 [-1, 384, 56, 56] 0
AdaptiveAvgPool2d-33 [-1, 384, 1, 1] 0
Linear-34 [-1, 24] 9,216
GELU-35 [-1, 24] 0
Linear-36 [-1, 384] 9,216
Sigmoid-37 [-1, 384] 0
SE-38 [-1, 384, 56, 56] 0
Conv2d-39 [-1, 96, 56, 56] 36,864
BatchNorm2d-40 [-1, 96, 56, 56] 192
PreNorm-41 [-1, 96, 56, 56] 0
MBConv-42 [-1, 96, 56, 56] 0
MaxPool2d-43 [-1, 96, 28, 28] 0
Conv2d-44 [-1, 192, 28, 28] 18,432
BatchNorm2d-45 [-1, 96, 56, 56] 192
Conv2d-46 [-1, 384, 28, 28] 36,864
BatchNorm2d-47 [-1, 384, 28, 28] 768
GELU-48 [-1, 384, 28, 28] 0
Conv2d-49 [-1, 384, 28, 28] 3,456
BatchNorm2d-50 [-1, 384, 28, 28] 768
GELU-51 [-1, 384, 28, 28] 0
AdaptiveAvgPool2d-52 [-1, 384, 1, 1] 0
Linear-53 [-1, 24] 9,216
GELU-54 [-1, 24] 0
Linear-55 [-1, 384] 9,216
Sigmoid-56 [-1, 384] 0
SE-57 [-1, 384, 28, 28] 0
Conv2d-58 [-1, 192, 28, 28] 73,728
BatchNorm2d-59 [-1, 192, 28, 28] 384
PreNorm-60 [-1, 192, 28, 28] 0
MBConv-61 [-1, 192, 28, 28] 0
BatchNorm2d-62 [-1, 192, 28, 28] 384
Conv2d-63 [-1, 768, 28, 28] 147,456
BatchNorm2d-64 [-1, 768, 28, 28] 1,536
GELU-65 [-1, 768, 28, 28] 0
Conv2d-66 [-1, 768, 28, 28] 6,912
BatchNorm2d-67 [-1, 768, 28, 28] 1,536
GELU-68 [-1, 768, 28, 28] 0
AdaptiveAvgPool2d-69 [-1, 768, 1, 1] 0
Linear-70 [-1, 48] 36,864
GELU-71 [-1, 48] 0
Linear-72 [-1, 768] 36,864
Sigmoid-73 [-1, 768] 0
SE-74 [-1, 768, 28, 28] 0
Conv2d-75 [-1, 192, 28, 28] 147,456
BatchNorm2d-76 [-1, 192, 28, 28] 384
PreNorm-77 [-1, 192, 28, 28] 0
MBConv-78 [-1, 192, 28, 28] 0
BatchNorm2d-79 [-1, 192, 28, 28] 384
Conv2d-80 [-1, 768, 28, 28] 147,456
BatchNorm2d-81 [-1, 768, 28, 28] 1,536
GELU-82 [-1, 768, 28, 28] 0
Conv2d-83 [-1, 768, 28, 28] 6,912
BatchNorm2d-84 [-1, 768, 28, 28] 1,536
GELU-85 [-1, 768, 28, 28] 0
AdaptiveAvgPool2d-86 [-1, 768, 1, 1] 0
Linear-87 [-1, 48] 36,864
GELU-88 [-1, 48] 0
Linear-89 [-1, 768] 36,864
Sigmoid-90 [-1, 768] 0
SE-91 [-1, 768, 28, 28] 0
Conv2d-92 [-1, 192, 28, 28] 147,456
BatchNorm2d-93 [-1, 192, 28, 28] 384
PreNorm-94 [-1, 192, 28, 28] 0
MBConv-95 [-1, 192, 28, 28] 0
MaxPool2d-96 [-1, 192, 14, 14] 0
Conv2d-97 [-1, 384, 14, 14] 73,728
MaxPool2d-98 [-1, 192, 14, 14] 0
Rearrange-99 [-1, 196, 192] 0
LayerNorm-100 [-1, 196, 192] 384
Linear-101 [-1, 196, 768] 147,456
Softmax-102 [-1, 8, 196, 196] 0
Linear-103 [-1, 196, 384] 98,688
Dropout-104 [-1, 196, 384] 0
Attention-105 [-1, 196, 384] 0
PreNorm-106 [-1, 196, 384] 0
Rearrange-107 [-1, 384, 14, 14] 0
Rearrange-108 [-1, 196, 384] 0
LayerNorm-109 [-1, 196, 384] 768
Linear-110 [-1, 196, 768] 295,680
GELU-111 [-1, 196, 768] 0
Dropout-112 [-1, 196, 768] 0
Linear-113 [-1, 196, 384] 295,296
Dropout-114 [-1, 196, 384] 0
FeedForward-115 [-1, 196, 384] 0
PreNorm-116 [-1, 196, 384] 0
Rearrange-117 [-1, 384, 14, 14] 0
Transformer-118 [-1, 384, 14, 14] 0
Rearrange-119 [-1, 196, 384] 0
LayerNorm-120 [-1, 196, 384] 768
Linear-121 [-1, 196, 768] 294,912
Softmax-122 [-1, 8, 196, 196] 0
Linear-123 [-1, 196, 384] 98,688
Dropout-124 [-1, 196, 384] 0
Attention-125 [-1, 196, 384] 0
PreNorm-126 [-1, 196, 384] 0
Rearrange-127 [-1, 384, 14, 14] 0
Rearrange-128 [-1, 196, 384] 0
LayerNorm-129 [-1, 196, 384] 768
Linear-130 [-1, 196, 1536] 591,360
GELU-131 [-1, 196, 1536] 0
Dropout-132 [-1, 196, 1536] 0
Linear-133 [-1, 196, 384] 590,208
Dropout-134 [-1, 196, 384] 0
FeedForward-135 [-1, 196, 384] 0
PreNorm-136 [-1, 196, 384] 0
Rearrange-137 [-1, 384, 14, 14] 0
Transformer-138 [-1, 384, 14, 14] 0
Rearrange-139 [-1, 196, 384] 0
LayerNorm-140 [-1, 196, 384] 768
Linear-141 [-1, 196, 768] 294,912
Softmax-142 [-1, 8, 196, 196] 0
Linear-143 [-1, 196, 384] 98,688
Dropout-144 [-1, 196, 384] 0
Attention-145 [-1, 196, 384] 0
PreNorm-146 [-1, 196, 384] 0
Rearrange-147 [-1, 384, 14, 14] 0
Rearrange-148 [-1, 196, 384] 0
LayerNorm-149 [-1, 196, 384] 768
Linear-150 [-1, 196, 1536] 591,360
GELU-151 [-1, 196, 1536] 0
Dropout-152 [-1, 196, 1536] 0
Linear-153 [-1, 196, 384] 590,208
Dropout-154 [-1, 196, 384] 0
FeedForward-155 [-1, 196, 384] 0
PreNorm-156 [-1, 196, 384] 0
Rearrange-157 [-1, 384, 14, 14] 0
Transformer-158 [-1, 384, 14, 14] 0
Rearrange-159 [-1, 196, 384] 0
LayerNorm-160 [-1, 196, 384] 768
Linear-161 [-1, 196, 768] 294,912
Softmax-162 [-1, 8, 196, 196] 0
Linear-163 [-1, 196, 384] 98,688
Dropout-164 [-1, 196, 384] 0
Attention-165 [-1, 196, 384] 0
PreNorm-166 [-1, 196, 384] 0
Rearrange-167 [-1, 384, 14, 14] 0
Rearrange-168 [-1, 196, 384] 0
LayerNorm-169 [-1, 196, 384] 768
Linear-170 [-1, 196, 1536] 591,360
GELU-171 [-1, 196, 1536] 0
Dropout-172 [-1, 196, 1536] 0
Linear-173 [-1, 196, 384] 590,208
Dropout-174 [-1, 196, 384] 0
FeedForward-175 [-1, 196, 384] 0
PreNorm-176 [-1, 196, 384] 0
Rearrange-177 [-1, 384, 14, 14] 0
Transformer-178 [-1, 384, 14, 14] 0
Rearrange-179 [-1, 196, 384] 0
LayerNorm-180 [-1, 196, 384] 768
Linear-181 [-1, 196, 768] 294,912
Softmax-182 [-1, 8, 196, 196] 0
Linear-183 [-1, 196, 384] 98,688
Dropout-184 [-1, 196, 384] 0
Attention-185 [-1, 196, 384] 0
PreNorm-186 [-1, 196, 384] 0
Rearrange-187 [-1, 384, 14, 14] 0
Rearrange-188 [-1, 196, 384] 0
LayerNorm-189 [-1, 196, 384] 768
Linear-190 [-1, 196, 1536] 591,360
GELU-191 [-1, 196, 1536] 0
Dropout-192 [-1, 196, 1536] 0
Linear-193 [-1, 196, 384] 590,208
Dropout-194 [-1, 196, 384] 0
FeedForward-195 [-1, 196, 384] 0
PreNorm-196 [-1, 196, 384] 0
Rearrange-197 [-1, 384, 14, 14] 0
Transformer-198 [-1, 384, 14, 14] 0
MaxPool2d-199 [-1, 384, 7, 7] 0
Conv2d-200 [-1, 768, 7, 7] 294,912
MaxPool2d-201 [-1, 384, 7, 7] 0
Rearrange-202 [-1, 49, 384] 0
LayerNorm-203 [-1, 49, 384] 768
Linear-204 [-1, 49, 768] 294,912
Softmax-205 [-1, 8, 49, 49] 0
Linear-206 [-1, 49, 768] 197,376
Dropout-207 [-1, 49, 768] 0
Attention-208 [-1, 49, 768] 0
PreNorm-209 [-1, 49, 768] 0
Rearrange-210 [-1, 768, 7, 7] 0
Rearrange-211 [-1, 49, 768] 0
LayerNorm-212 [-1, 49, 768] 1,536
Linear-213 [-1, 49, 1536] 1,181,184
GELU-214 [-1, 49, 1536] 0
Dropout-215 [-1, 49, 1536] 0
Linear-216 [-1, 49, 768] 1,180,416
Dropout-217 [-1, 49, 768] 0
FeedForward-218 [-1, 49, 768] 0
PreNorm-219 [-1, 49, 768] 0
Rearrange-220 [-1, 768, 7, 7] 0
Transformer-221 [-1, 768, 7, 7] 0
Rearrange-222 [-1, 49, 768] 0
LayerNorm-223 [-1, 49, 768] 1,536
Linear-224 [-1, 49, 768] 589,824
Softmax-225 [-1, 8, 49, 49] 0
Linear-226 [-1, 49, 768] 197,376
Dropout-227 [-1, 49, 768] 0
Attention-228 [-1, 49, 768] 0
PreNorm-229 [-1, 49, 768] 0
Rearrange-230 [-1, 768, 7, 7] 0
Rearrange-231 [-1, 49, 768] 0
LayerNorm-232 [-1, 49, 768] 1,536
Linear-233 [-1, 49, 3072] 2,362,368
GELU-234 [-1, 49, 3072] 0
Dropout-235 [-1, 49, 3072] 0
Linear-236 [-1, 49, 768] 2,360,064
Dropout-237 [-1, 49, 768] 0
FeedForward-238 [-1, 49, 768] 0
PreNorm-239 [-1, 49, 768] 0
Rearrange-240 [-1, 768, 7, 7] 0
Transformer-241 [-1, 768, 7, 7] 0
AvgPool2d-242 [-1, 768, 1, 1] 0
Linear-243 [-1, 1000] 768,000
================================================================
Total params: 17,757,760
Trainable params: 17,757,760
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.57
Forward/backward pass size (MB): 382.18
Params size (MB): 67.74
Estimated Total Size (MB): 450.49
Layers 1–7: Convolutional stem and pooling (stage S0)
Layers 8–42: MBConv blocks with squeeze-excitation (stage S1)
Layers 43–95: Downsampling and further MBConv blocks (stage S2)
Layers 96–198: Transformer blocks with downsampling (stage S3)
Layers 199–241: Transformer blocks (stage S4)
Layers 242–243: Classification head
Stem Layer (S0):
- Conv2d-1 & BatchNorm2d-2: Initial convolutional layer followed by batch normalization for basic feature extraction.
- GELU-3: Applies the Gaussian Error Linear Unit (GELU) activation function to introduce non-linearity.
- Conv2d-4 & BatchNorm2d-5: Additional convolutional layer with batch normalization for feature enhancement.
- MaxPool2d-7: Max-pooling operation to reduce spatial dimensions and aggregate features.
Convolution Blocks (S1–S2):
- Conv2d-22: Convolutional layer with batch normalization.
- SE-21: Squeeze-Excitation module for channel-wise feature recalibration (a minimal sketch of this module follows the list).
- MBConv-25: Mobile inverted bottleneck convolution block for efficient feature extraction.
- MaxPool2d-43: Max-pooling to downsample feature maps.
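The squeeze-excitation pattern can be read directly from the summary above (AdaptiveAvgPool2d -> Linear -> GELU -> Linear -> Sigmoid). Below is a minimal, generic sketch of such a module; the reduction ratio is an illustrative assumption, and the reference coatnet-pytorch implementation sizes the bottleneck slightly differently.

import torch
from torch import nn

class SE(nn.Module):
    """Squeeze-Excitation: rescale each channel by a learned gate in [0, 1]."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.GELU(),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        gate = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * gate  # excite: channel-wise recalibration

# example: SE(256)(torch.randn(1, 256, 56, 56)) keeps the input shape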
Transformer Blocks (S3-S4):
- Attention Modules (Attention-105, Attention-125): Self-attention mechanism for capturing long-range dependencies.
- FeedForward Modules (FeedForward-115, FeedForward-135): Fully connected layers with activation functions for feature processing.
- Layer Normalization (LayerNorm-100, LayerNorm-109, LayerNorm-120, LayerNorm-129, LayerNorm-140): Normalizes activations across the feature dimension. A simplified sketch of this pre-norm attention/feed-forward pattern follows this list.
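The following is a simplified sketch of one pre-norm Transformer block matching the LayerNorm -> Attention -> LayerNorm -> FeedForward structure in the summary. Note that CoAtNet actually uses relative self-attention; plain multi-head attention is substituted here only to keep the sketch short.

import torch
from torch import nn

class PreNormTransformerBlock(nn.Module):
    def __init__(self, dim, heads=8, mlp_ratio=4, dropout=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(dim * mlp_ratio, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):  # x: (batch, tokens, dim), e.g. (B, 196, 384)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        x = x + self.ff(self.norm2(x))                      # residual around feed-forward
        return x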
Output Layer:
- AvgPool2d-242: Average pooling operation to reduce spatial dimensions.
- Linear-243: Fully connected layer mapping feature representation to output space, typically representing class probabilities for deepfake detection.
CoAtNet-2:
It is very similar to CoAtNet-0 but uses more blocks and wider channels, which makes it a considerably larger model (~600 MB) than CoAtNet-0 (~200 MB). We followed the same C-C-T-T stage structure for the CoAtNet-2 model.
model = CoAtNet(image_size=(224, 224), in_channels=3, num_blocks=[2, 2, 6, 14, 2], channels = [128, 128, 256, 512, 1026], num_classes=2)
Training
Because we trained our models for only 25 epochs, owing to computation and time constraints, the learning curves give only a limited picture of the models' behavior.
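As a reference, here is a minimal sketch of the training setup (25 epochs, and batch size 32 as discussed in the Challenges section). The Adam optimizer and learning rate are assumptions; the actual training script in the repository may differ.

import torch
from torch import nn
from torch.utils.data import DataLoader

def train(model, train_ds, val_ds, epochs=25, batch_size=32, lr=1e-4, device="cuda"):
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_ds, batch_size=batch_size)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device)
    for epoch in range(epochs):
        model.train()
        train_loss = 0.0
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            train_loss += loss.item() * x.size(0)
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                val_loss += criterion(model(x), y).item() * x.size(0)
        print(f"epoch {epoch + 1}: "
              f"train loss {train_loss / len(train_ds):.4f}, "
              f"val loss {val_loss / len(val_ds):.4f}")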
Learning curves of CoAtNet-0 model:
- Convergence: It’s difficult to determine convergence definitively from the plot. As we can observe both training and validation loss curves flatten out towards the end, which suggests some degree of convergence. We need to analyze a few more epochs to confirm if the flattening continues or if there are minor fluctuations.
- Gap between losses: There is a persistent gap between training and validation loss, with validation loss being higher. A moderate gap like this is expected and indicates the model still generalizes reasonably well; a much larger gap would point to overfitting, while both curves staying high would suggest underfitting.
- Decreasing trend: The overall decreasing trend in both loss curves is very positive. It confirms the model is actively learning and improving its performance throughout training. The rate of decrease can also be informative. A sharp initial decrease followed by a plateau suggests the model captured the key patterns quickly.
Learning curves of CoAtNet-2 model:
- Convergence: Both the training and validation loss curves converge towards the end, indicating the model has learned and is not overfitting or underfitting significantly.
- Gap between losses: There is a noticeable gap between the training and validation loss, with validation loss being higher. This is expected and suggests some generalization error, as the model performs slightly better on training data.
- Decreasing trend: The overall decreasing trend of both loss curves over epochs is a positive sign, indicating the model is learning and improving its performance as training progresses.
Results
Results of CoAtNet-0 model:
We achieved 79% accuracy with our CoAtNet-0 model, which is decent but not satisfactory. Here is a plot of confidence scores for predictions on the test set (a 10% split).
With CoAtNet-0, we were able to produce correct predictions for 8 out of 10 videos from the test set. An example prediction is shown below, with a confidence score of around 0.77.
Test-Video:
Our Prediction score: 0.7742524743080139
The prediction score is on a scale of 0–1, where a prediction score >0.5 is fake and a prediction score <0.5 is real.
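Here is a minimal sketch of how a video-level score like this can be produced from the per-face model outputs. The class ordering ([real, fake]) and the choice to average the per-face fake probabilities are assumptions made for illustration.

import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_video(model, face_batch, device="cpu"):
    """face_batch: normalized face crops of shape (N, 3, 224, 224)."""
    model.eval()
    logits = model(face_batch.to(device))       # (N, 2) logits: [real, fake]
    fake_prob = F.softmax(logits, dim=1)[:, 1]  # per-face probability of "fake"
    return fake_prob.mean().item()              # average over the sampled faces

score = 0.7742  # example video-level score
label = "fake" if score > 0.5 else "real"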
Some Fun Test:
Recently, Reid Hoffman, the co-founder of LinkedIn, shared a video showcasing his AI twin, which looks incredibly realistic and could fool almost anyone. However, our small model classifies the AI twin as real, highlighting its limitations.
Results of CoAtNet 2 model:
We achieved 89% accuracy with our CoAtNet-2 model, which extends CoAtNet-0 with more blocks and wider channels.
We held out 10% of our training data as the test set. The test set is not balanced: it contains 30 real videos and 70 fake videos.
Here we can clearly observe that the CoAtNet-2 model performs very well on real videos, with zero errors, and maintains good confidence scores: most real videos score below 0.2 and most fake videos score above 0.7.
Some Fun Test:
CoAtNet-2 correctly predicts Reid Hoffman's AI twin as a fake video, which shows the larger model performing well.
Note: We used the face_recognition library to extract faces when running predictions and testing the model, because of the quality of its face crops; as mentioned earlier, we used BlazeFace during training because its lightweight architecture saves a lot of time and computation.
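A minimal sketch of the inference-time face extraction with face_recognition (the frame-sampling interval and the 224x224 crop size are illustrative choices):

import cv2
import face_recognition

def crop_faces_for_inference(video_path, every_n_frames=10):
    crops = []
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % every_n_frames == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            # face_locations returns boxes as (top, right, bottom, left)
            for top, right, bottom, left in face_recognition.face_locations(rgb):
                crops.append(cv2.resize(rgb[top:bottom, left:right], (224, 224)))
        frame_idx += 1
    cap.release()
    return crops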
Challenges Faced
Initially, we attempted to train the model on individual data chunks, saving and loading the model between chunks. However, this approach proved suboptimal and failed to yield satisfactory results.
Subsequently, we experimented with training the model on the entire dataset using various batch sizes and epochs. Through rigorous testing, we discovered that the model performed best when the data was skewed to contain more real data than fake, with an optimal batch size of 32 and diminishing returns observed after 25 epochs.
To address these challenges, we considered scaling up to larger models like CoAtNet-2 or CoAtNet-3, which can offer improved performance but come with significantly larger file sizes (over 900 MB after training).
In our reference project, CViT, over 90% accuracy was achieved, but with a model size of around 1 GB. In contrast, our CoAtNet-0 model, at approximately 200 MB, correctly detects around 8 out of 10 videos in the DFDC test set and shows promising performance on real-world deepfake videos we encountered.
Despite our efforts, we hit a plateau in further improving the small model's performance. At the time of writing, we are training CoAtNet-2 on three data chunks for 25 epochs with a batch size of 12, which takes approximately 30 hours on a limited GPU.
Check out the Project Code:
https://github.com/Nikhilreddy024/Deepfake_detection-using-Coatnet
References:
- CoAtNet: Marrying Convolution and Attention for All Data Sizes: https://arxiv.org/pdf/2106.04803
- https://github.com/chinhsuanwu/coatnet-pytorch
- CViT: https://github.com/erprogs/CViT
- https://link.springer.com/article/10.1007/s42979-023-02294-y
- https://arxiv.org/abs/2102.11126
- https://www.youtube.com/@LazyProgrammerOfficial