Nikhil Reddy

Originally published at Medium
DeepFake Detection Using Convolutions with Attention (CoAtNet)

Introduction

With the proliferation of deepfake technology, the ability to discern real from manipulated content has become a critical concern. Deepfake videos, often indistinguishable from authentic footage, pose serious threats to privacy, security, and the integrity of information.

In response to this challenge, cutting-edge research in computer vision and machine learning has yielded innovative solutions for detecting deepfakes. Among these, the fusion of convolutional neural networks (ConvNets) and attention-based models has emerged as a promising approach.

In this project, we present an in-depth exploration of deepfake detection using ConvNets with attention, specifically focusing on the CoAtNet architecture. CoAtNet, a novel family of image recognition models, seamlessly integrates ConvNets and attention mechanisms, offering a powerful tool for analyzing facial images extracted from videos.

Data processing

We used only 3 of the 50 data chunks from the DFDC dataset for this project, which amounts to roughly 30 GB of video.

Overview of preprocessing steps:

Face Extraction Using BlazeFace:
BlazeFace is a lightweight and efficient face-detection model. It is utilized to extract faces from each frame of the videos in the DFDC dataset. This step ensures that only the relevant facial regions are considered for analysis. Here are some examples of face extractions from a video sample.

Real video:

[Figure: face crops extracted from the real video sample]

DeepFake video:

[Figure: face crops extracted from the deepfake video sample]
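For illustration, the face-extraction step can be sketched roughly as follows. This is not the exact project code: it assumes the PyTorch BlazeFace port popularized in the DFDC Kaggle kernels (a blazeface module exposing a BlazeFace class with load_weights, load_anchors, and predict_on_image), and the weight/anchor paths are placeholders.

import cv2
import torch
from blazeface import BlazeFace  # PyTorch BlazeFace port (assumed available)

def extract_faces(video_path, every_n_frames=10, out_size=224):
    """Sample frames from a video and return resized face crops (illustrative sketch)."""
    detector = BlazeFace()
    detector.load_weights("blazeface.pth")   # placeholder paths
    detector.load_anchors("anchors.npy")

    faces, idx = [], 0
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            h, w = rgb.shape[:2]
            # BlazeFace expects a 128x128 RGB input; detections are normalized boxes
            detections = detector.predict_on_image(cv2.resize(rgb, (128, 128)))
            for det in detections:   # [ymin, xmin, ymax, xmax, ...] in [0, 1]
                y1, x1, y2, x2 = (det[:4] * torch.tensor([h, w, h, w])).int().tolist()
                crop = rgb[max(y1, 0):y2, max(x1, 0):x2]
                if crop.size:
                    faces.append(cv2.resize(crop, (out_size, out_size)))
        idx += 1
    cap.release()
    return faces

Sampling every Nth frame keeps preprocessing tractable while still yielding many face crops per video.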

Normalization of Pixel Values:

After face extraction, pixel values of the extracted facial images are normalized. Normalization standardizes the pixel values to a common scale, typically between 0 and 1, to improve the convergence and stability of the training process.

mean = [0.485, 0.456, 0.406]

std = [0.229, 0.224, 0.225]

These mean and std values were provided by Kaggle for the DFDC dataset during the competition; we use them to normalize the pixel values.

[Figure: example of a normalized face image]

The above figure shows an example of normalized pixel values: we first normalized the image and then applied invert_normalization to visualize it.
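A minimal sketch of this normalization, together with the invert_normalization step used for visualization (assuming torchvision transforms; not necessarily the exact project code):

import torch
from torchvision import transforms

mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

# Scale pixels to [0, 1] and standardize with the DFDC mean/std
normalize = transforms.Compose([
    transforms.ToTensor(),                     # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=mean, std=std),
])

def invert_normalization(img_t: torch.Tensor) -> torch.Tensor:
    """Undo Normalize so the image can be displayed again (values clamped to [0, 1])."""
    m = torch.tensor(mean).view(3, 1, 1)
    s = torch.tensor(std).view(3, 1, 1)
    return (img_t * s + m).clamp(0, 1)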

Augmentation Using Albumentations:

Augmentation techniques from the Albumentations library are applied to increase the diversity and robustness of the training dataset. Albumentations introduces variations in the training data by applying transformations such as rotation, flipping, scaling, and color adjustments to the facial images.

# Basic Geometric Transformations

RandomRotate90: Rotates the image by 90, 180, or 270 degrees (applied with probability p=0.2).

Transpose: Swaps the rows and columns of the image (p=0.2).

HorizontalFlip: Mirrors the image horizontally (p=0.5).

VerticalFlip: Mirrors the image vertically (p=0.5).

# Random Effects

OneOf([GaussNoise()], p=0.2): Adds random Gaussian noise to the image with a 0.2 probability for this group (other options in the group are skipped once one is chosen).

# Combined Transformations

ShiftScaleRotate: Applies a combination of random shift, scale, and rotation in a single step (p=0.2).

# Pixel-Level Adjustments

OneOf([CLAHE(clip_limit=2), Sharpen(), Emboss(), RandomBrightnessContrast()], p=0.2): Within this group, one of these transformations is applied with a 0.2 probability:

CLAHE: Contrast Limited Adaptive Histogram Equalization (improves local contrast).

Sharpen: Enhances image edges.

Emboss: Creates a raised or sunken effect.

RandomBrightnessContrast: Randomly adjusts brightness and contrast.

# Color Adjustments

HueSaturationValue: Randomly modifies the image’s hue (color), saturation (intensity), and value (brightness) with a 0.2 probability.
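Putting the list above together, the pipeline can be expressed with Albumentations roughly as follows (a sketch of the configuration described, assuming a recent albumentations release where Sharpen and Emboss are available; not necessarily the exact project code):

import albumentations as A

train_transforms = A.Compose([
    # Basic geometric transformations
    A.RandomRotate90(p=0.2),
    A.Transpose(p=0.2),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    # Random effects
    A.OneOf([A.GaussNoise()], p=0.2),
    # Combined transformations
    A.ShiftScaleRotate(p=0.2),
    # Pixel-level adjustments
    A.OneOf([
        A.CLAHE(clip_limit=2),
        A.Sharpen(),
        A.Emboss(),
        A.RandomBrightnessContrast(),
    ], p=0.2),
    # Color adjustments
    A.HueSaturationValue(p=0.2),
])

# Usage: augmented = train_transforms(image=face_crop)["image"]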

[Figure: examples of augmented face crops]

Temporal Consistency vs. Face Extraction and Classification

  • Temporal Consistency refers to maintaining coherence across sequential frames in video analysis, often achieved by integrating time-based architectures like LSTMs or GRUs to capture temporal dependencies. However, recent work demonstrates that face extraction and classification alone can yield effective results without explicitly modeling temporal relationships.
  • By focusing solely on face extraction and classification, without considering temporal consistency, the model can efficiently detect deepfake content while simplifying the architecture and reducing computational complexity.
  • We focused solely on detecting image/video manipulation and ignored audio, whereas many of the current best models detect audio manipulation as well.
  • Several current models leverage efficient vision transformers, such as the Cross Efficient Vision Transformer (CEViT), which combines the efficiency of vision transformers with cross-modal fusion for improved performance across various computer-vision tasks.

CoAtNet Architecture

CoAtNet is a new family of image recognition models that combines the strengths of convolutional neural networks (ConvNets) and attention-based models (like Transformers). The CoAtNet model is specifically designed for efficient image classification tasks, making it well suited for processing large volumes of facial images extracted from videos.

[Figure: CoAtNet architecture overview (stages S0–S4)]

  • The CoAtNet architecture comprises five stages (S0, S1, S2, S3, S4), each tailored to specific characteristics of the data and task at hand. Beginning with a simple 2-layer convolutional stem in S0, the subsequent stages employ a combination of MBConv blocks with squeeze-excitation (SE) and Transformer blocks.

  • To optimize model performance, the stage types are chosen strategically. Convolution stages precede Transformer stages, leveraging the former’s proficiency at processing the local patterns common in early stages. This leads to four variants: C-C-C-C, C-C-C-T, C-C-T-T, and C-T-T-T, with varying numbers of convolution and Transformer stages. Through their experiments, the CoAtNet authors determined that the C-C-T-T configuration yields the best balance between generalization ability and model capacity (a schematic of this layout follows this list).
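For intuition, the C-C-T-T layout can be written down as a simple configuration (illustrative only; the depths and widths match the CoAtNet-0 instantiation shown later in this post):

# S0 is a two-layer convolutional stem; S1-S4 follow the C-C-T-T pattern
stage_types = ["stem", "MBConv", "MBConv", "Transformer", "Transformer"]   # S0..S4
num_blocks  = [2, 2, 3, 5, 2]            # blocks per stage (CoAtNet-0)
channels    = [64, 96, 192, 384, 768]    # output channels per stage (CoAtNet-0)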

[2106.04803] CoAtNet: Marrying Convolution and Attention for All Data Sizes

Transformers have attracted increasing interests in computer vision, but they still fall behind state-of-the-art convolutional networks. In this work, we show that while Transformers tend to have larger model capacity, their generalization can be worse than convolutional networks due to the lack of the right inductive bias. To effectively combine the strengths from both architectures, we present CoAtNets(pronounced "coat" nets), a family of hybrid models built from two key insights: (1) depthwise Convolution and self-Attention can be naturally unified via simple relative attention; (2) vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity and efficiency. Experiments show that our CoAtNets achieve state-of-the-art performance under different resource constraints across various datasets: Without extra data, CoAtNet achieves 86.0% ImageNet top-1 accuracy; When pre-trained with 13M images from ImageNet-21K, our CoAtNet achieves 88.56% top-1 accuracy, matching ViT-huge pre-trained with 300M images from JFT-300M while using 23x less data; Notably, when we further scale up CoAtNet with JFT-3B, it achieves 90.88% top-1 accuracy on ImageNet, establishing a new state-of-the-art result.


Project Architecture

Our approach in this project is to use CoAtNet-0. The CoAtNet authors proposed five reference architectures (CoAtNet-0 to CoAtNet-4). CoAtNet-0 is the smallest of these; we used it to keep our detector small and compact. Here is a brief explanation of our model layers:

model = Coatnet(image_size=(224, 224), in_channels=3, num_blocks=[2, 2, 3, 5, 2], channels=[64, 96, 192, 384, 768], num_classes=2)

model.summary()

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 64, 112, 112]           1,728 
       BatchNorm2d-2         [-1, 64, 112, 112]             128
              GELU-3         [-1, 64, 112, 112]               0
            Conv2d-4         [-1, 64, 112, 112]          36,864
       BatchNorm2d-5         [-1, 64, 112, 112]             128
              GELU-6         [-1, 64, 112, 112]               0
         MaxPool2d-7           [-1, 64, 56, 56]               0
            Conv2d-8           [-1, 96, 56, 56]           6,144
       BatchNorm2d-9         [-1, 64, 112, 112]             128
           Conv2d-10          [-1, 256, 56, 56]          16,384
      BatchNorm2d-11          [-1, 256, 56, 56]             512
             GELU-12          [-1, 256, 56, 56]               0
           Conv2d-13          [-1, 256, 56, 56]           2,304
      BatchNorm2d-14          [-1, 256, 56, 56]             512
             GELU-15          [-1, 256, 56, 56]               0
AdaptiveAvgPool2d-16            [-1, 256, 1, 1]               0
           Linear-17                   [-1, 16]           4,096
             GELU-18                   [-1, 16]               0
           Linear-19                  [-1, 256]           4,096
          Sigmoid-20                  [-1, 256]               0
               SE-21          [-1, 256, 56, 56]               0
           Conv2d-22           [-1, 96, 56, 56]          24,576
      BatchNorm2d-23           [-1, 96, 56, 56]             192
          PreNorm-24           [-1, 96, 56, 56]               0
           MBConv-25           [-1, 96, 56, 56]               0
      BatchNorm2d-26           [-1, 96, 56, 56]             192
           Conv2d-27          [-1, 384, 56, 56]          36,864
      BatchNorm2d-28          [-1, 384, 56, 56]             768
             GELU-29          [-1, 384, 56, 56]               0
           Conv2d-30          [-1, 384, 56, 56]           3,456
      BatchNorm2d-31          [-1, 384, 56, 56]             768
             GELU-32          [-1, 384, 56, 56]               0
AdaptiveAvgPool2d-33            [-1, 384, 1, 1]               0
           Linear-34                   [-1, 24]           9,216
             GELU-35                   [-1, 24]               0
           Linear-36                  [-1, 384]           9,216
          Sigmoid-37                  [-1, 384]               0
               SE-38          [-1, 384, 56, 56]               0
           Conv2d-39           [-1, 96, 56, 56]          36,864
      BatchNorm2d-40           [-1, 96, 56, 56]             192
          PreNorm-41           [-1, 96, 56, 56]               0
           MBConv-42           [-1, 96, 56, 56]               0
        MaxPool2d-43           [-1, 96, 28, 28]               0
           Conv2d-44          [-1, 192, 28, 28]          18,432
      BatchNorm2d-45           [-1, 96, 56, 56]             192
           Conv2d-46          [-1, 384, 28, 28]          36,864
      BatchNorm2d-47          [-1, 384, 28, 28]             768
             GELU-48          [-1, 384, 28, 28]               0
           Conv2d-49          [-1, 384, 28, 28]           3,456
      BatchNorm2d-50          [-1, 384, 28, 28]             768
             GELU-51          [-1, 384, 28, 28]               0
AdaptiveAvgPool2d-52            [-1, 384, 1, 1]               0
           Linear-53                   [-1, 24]           9,216
             GELU-54                   [-1, 24]               0
           Linear-55                  [-1, 384]           9,216
          Sigmoid-56                  [-1, 384]               0
               SE-57          [-1, 384, 28, 28]               0
           Conv2d-58          [-1, 192, 28, 28]          73,728
      BatchNorm2d-59          [-1, 192, 28, 28]             384
          PreNorm-60          [-1, 192, 28, 28]               0
           MBConv-61          [-1, 192, 28, 28]               0
      BatchNorm2d-62          [-1, 192, 28, 28]             384
           Conv2d-63          [-1, 768, 28, 28]         147,456
      BatchNorm2d-64          [-1, 768, 28, 28]           1,536
             GELU-65          [-1, 768, 28, 28]               0
           Conv2d-66          [-1, 768, 28, 28]           6,912
      BatchNorm2d-67          [-1, 768, 28, 28]           1,536
             GELU-68          [-1, 768, 28, 28]               0
AdaptiveAvgPool2d-69            [-1, 768, 1, 1]               0
           Linear-70                   [-1, 48]          36,864
             GELU-71                   [-1, 48]               0
           Linear-72                  [-1, 768]          36,864
          Sigmoid-73                  [-1, 768]               0
               SE-74          [-1, 768, 28, 28]               0
           Conv2d-75          [-1, 192, 28, 28]         147,456
      BatchNorm2d-76          [-1, 192, 28, 28]             384
          PreNorm-77          [-1, 192, 28, 28]               0
           MBConv-78          [-1, 192, 28, 28]               0
      BatchNorm2d-79          [-1, 192, 28, 28]             384
           Conv2d-80          [-1, 768, 28, 28]         147,456
      BatchNorm2d-81          [-1, 768, 28, 28]           1,536
             GELU-82          [-1, 768, 28, 28]               0
           Conv2d-83          [-1, 768, 28, 28]           6,912
      BatchNorm2d-84          [-1, 768, 28, 28]           1,536
             GELU-85          [-1, 768, 28, 28]               0
AdaptiveAvgPool2d-86            [-1, 768, 1, 1]               0
           Linear-87                   [-1, 48]          36,864
             GELU-88                   [-1, 48]               0
           Linear-89                  [-1, 768]          36,864
          Sigmoid-90                  [-1, 768]               0
               SE-91          [-1, 768, 28, 28]               0
           Conv2d-92          [-1, 192, 28, 28]         147,456
      BatchNorm2d-93          [-1, 192, 28, 28]             384
          PreNorm-94          [-1, 192, 28, 28]               0
           MBConv-95          [-1, 192, 28, 28]               0
        MaxPool2d-96          [-1, 192, 14, 14]               0
           Conv2d-97          [-1, 384, 14, 14]          73,728
        MaxPool2d-98          [-1, 192, 14, 14]               0
        Rearrange-99             [-1, 196, 192]               0
       LayerNorm-100             [-1, 196, 192]             384
          Linear-101             [-1, 196, 768]         147,456
         Softmax-102          [-1, 8, 196, 196]               0
          Linear-103             [-1, 196, 384]          98,688
         Dropout-104             [-1, 196, 384]               0
       Attention-105             [-1, 196, 384]               0
         PreNorm-106             [-1, 196, 384]               0
       Rearrange-107          [-1, 384, 14, 14]               0
       Rearrange-108             [-1, 196, 384]               0
       LayerNorm-109             [-1, 196, 384]             768
          Linear-110             [-1, 196, 768]         295,680
            GELU-111             [-1, 196, 768]               0
         Dropout-112             [-1, 196, 768]               0
          Linear-113             [-1, 196, 384]         295,296
         Dropout-114             [-1, 196, 384]               0
     FeedForward-115             [-1, 196, 384]               0
         PreNorm-116             [-1, 196, 384]               0
       Rearrange-117          [-1, 384, 14, 14]               0
     Transformer-118          [-1, 384, 14, 14]               0
       Rearrange-119             [-1, 196, 384]               0
       LayerNorm-120             [-1, 196, 384]             768
          Linear-121             [-1, 196, 768]         294,912
         Softmax-122          [-1, 8, 196, 196]               0
          Linear-123             [-1, 196, 384]          98,688
         Dropout-124             [-1, 196, 384]               0
       Attention-125             [-1, 196, 384]               0
         PreNorm-126             [-1, 196, 384]               0
       Rearrange-127          [-1, 384, 14, 14]               0
       Rearrange-128             [-1, 196, 384]               0
       LayerNorm-129             [-1, 196, 384]             768
          Linear-130            [-1, 196, 1536]         591,360
            GELU-131            [-1, 196, 1536]               0
         Dropout-132            [-1, 196, 1536]               0
          Linear-133             [-1, 196, 384]         590,208
         Dropout-134             [-1, 196, 384]               0
     FeedForward-135             [-1, 196, 384]               0
         PreNorm-136             [-1, 196, 384]               0
       Rearrange-137          [-1, 384, 14, 14]               0
     Transformer-138          [-1, 384, 14, 14]               0
       Rearrange-139             [-1, 196, 384]               0
       LayerNorm-140             [-1, 196, 384]             768
          Linear-141             [-1, 196, 768]         294,912
         Softmax-142          [-1, 8, 196, 196]               0
          Linear-143             [-1, 196, 384]          98,688
         Dropout-144             [-1, 196, 384]               0
       Attention-145             [-1, 196, 384]               0
         PreNorm-146             [-1, 196, 384]               0
       Rearrange-147          [-1, 384, 14, 14]               0
       Rearrange-148             [-1, 196, 384]               0
       LayerNorm-149             [-1, 196, 384]             768
          Linear-150            [-1, 196, 1536]         591,360
            GELU-151            [-1, 196, 1536]               0
         Dropout-152            [-1, 196, 1536]               0
          Linear-153             [-1, 196, 384]         590,208
         Dropout-154             [-1, 196, 384]               0
     FeedForward-155             [-1, 196, 384]               0
         PreNorm-156             [-1, 196, 384]               0
       Rearrange-157          [-1, 384, 14, 14]               0
     Transformer-158          [-1, 384, 14, 14]               0
       Rearrange-159             [-1, 196, 384]               0
       LayerNorm-160             [-1, 196, 384]             768
          Linear-161             [-1, 196, 768]         294,912
         Softmax-162          [-1, 8, 196, 196]               0
          Linear-163             [-1, 196, 384]          98,688
         Dropout-164             [-1, 196, 384]               0
       Attention-165             [-1, 196, 384]               0
         PreNorm-166             [-1, 196, 384]               0
       Rearrange-167          [-1, 384, 14, 14]               0
       Rearrange-168             [-1, 196, 384]               0
       LayerNorm-169             [-1, 196, 384]             768
          Linear-170            [-1, 196, 1536]         591,360
            GELU-171            [-1, 196, 1536]               0
         Dropout-172            [-1, 196, 1536]               0
          Linear-173             [-1, 196, 384]         590,208
         Dropout-174             [-1, 196, 384]               0
     FeedForward-175             [-1, 196, 384]               0
         PreNorm-176             [-1, 196, 384]               0
       Rearrange-177          [-1, 384, 14, 14]               0
     Transformer-178          [-1, 384, 14, 14]               0
       Rearrange-179             [-1, 196, 384]               0
       LayerNorm-180             [-1, 196, 384]             768
          Linear-181             [-1, 196, 768]         294,912
         Softmax-182          [-1, 8, 196, 196]               0
          Linear-183             [-1, 196, 384]          98,688
         Dropout-184             [-1, 196, 384]               0
       Attention-185             [-1, 196, 384]               0
         PreNorm-186             [-1, 196, 384]               0
       Rearrange-187          [-1, 384, 14, 14]               0
       Rearrange-188             [-1, 196, 384]               0
       LayerNorm-189             [-1, 196, 384]             768
          Linear-190            [-1, 196, 1536]         591,360
            GELU-191            [-1, 196, 1536]               0
         Dropout-192            [-1, 196, 1536]               0
          Linear-193             [-1, 196, 384]         590,208
         Dropout-194             [-1, 196, 384]               0
     FeedForward-195             [-1, 196, 384]               0
         PreNorm-196             [-1, 196, 384]               0
       Rearrange-197          [-1, 384, 14, 14]               0
     Transformer-198          [-1, 384, 14, 14]               0
       MaxPool2d-199            [-1, 384, 7, 7]               0
          Conv2d-200            [-1, 768, 7, 7]         294,912
       MaxPool2d-201            [-1, 384, 7, 7]               0
       Rearrange-202              [-1, 49, 384]               0
       LayerNorm-203              [-1, 49, 384]             768
          Linear-204              [-1, 49, 768]         294,912
         Softmax-205            [-1, 8, 49, 49]               0
          Linear-206              [-1, 49, 768]         197,376
         Dropout-207              [-1, 49, 768]               0
       Attention-208              [-1, 49, 768]               0
         PreNorm-209              [-1, 49, 768]               0
       Rearrange-210            [-1, 768, 7, 7]               0
       Rearrange-211              [-1, 49, 768]               0
       LayerNorm-212              [-1, 49, 768]           1,536
          Linear-213             [-1, 49, 1536]       1,181,184
            GELU-214             [-1, 49, 1536]               0
         Dropout-215             [-1, 49, 1536]               0
          Linear-216              [-1, 49, 768]       1,180,416
         Dropout-217              [-1, 49, 768]               0
     FeedForward-218              [-1, 49, 768]               0
         PreNorm-219              [-1, 49, 768]               0
       Rearrange-220            [-1, 768, 7, 7]               0
     Transformer-221            [-1, 768, 7, 7]               0
       Rearrange-222              [-1, 49, 768]               0
       LayerNorm-223              [-1, 49, 768]           1,536
          Linear-224              [-1, 49, 768]         589,824
         Softmax-225            [-1, 8, 49, 49]               0
          Linear-226              [-1, 49, 768]         197,376
         Dropout-227              [-1, 49, 768]               0
       Attention-228              [-1, 49, 768]               0
         PreNorm-229              [-1, 49, 768]               0
       Rearrange-230            [-1, 768, 7, 7]               0
       Rearrange-231              [-1, 49, 768]               0
       LayerNorm-232              [-1, 49, 768]           1,536
          Linear-233             [-1, 49, 3072]       2,362,368
            GELU-234             [-1, 49, 3072]               0
         Dropout-235             [-1, 49, 3072]               0
          Linear-236              [-1, 49, 768]       2,360,064
         Dropout-237              [-1, 49, 768]               0
     FeedForward-238              [-1, 49, 768]               0
         PreNorm-239              [-1, 49, 768]               0
       Rearrange-240            [-1, 768, 7, 7]               0
     Transformer-241            [-1, 768, 7, 7]               0
       AvgPool2d-242            [-1, 768, 1, 1]               0
          Linear-243                 [-1, 1000]         768,000
================================================================
Total params: 17,757,760
Trainable params: 17,757,760
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.57
Forward/backward pass size (MB): 382.18
Params size (MB): 67.74
Estimated Total Size (MB): 450.49

Layers 1–7: Convolutional and Pooling Layers

Layers 8–21: Mobile Inverted Bottleneck Convolution (MBConv)

Layers 22–57: Additional MBConv Blocks and Downsampling

Layers 58–74: Repeat MBConv Blocks and Attention Mechanisms

Layers 75–98: Additional MBConv Blocks and Attention Mechanisms

Layers 99–138: Transformer Blocks

Layers 139–243: Transformer Blocks and Classification Head

Stem Layer (S0):

  • Conv2d-1 & BatchNorm2d-2: Initial convolutional layer followed by batch normalization for basic feature extraction.
  • GELU-3: Applies the Gaussian Error Linear Unit (GELU) activation function to introduce non-linearity.
  • Conv2d-4 & BatchNorm2d-5: Additional convolutional layer with batch normalization for feature enhancement.
  • MaxPool2d-7: Max-pooling operation to reduce spatial dimensions and aggregate features.

Convolution Blocks (S1-S2):

  • Conv2d-22: Convolutional layer with batch normalization.

  • SE-21: Squeeze-Excitation module for channel-wise feature recalibration (a minimal sketch follows this list).

  • MBConv-25: Mobile inverted bottleneck convolutional block for efficient feature extraction.

  • MaxPool2d-43: Max-pooling to downsample feature maps.
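A minimal squeeze-excitation sketch matching the pattern visible in the summary above (AdaptiveAvgPool2d, Linear, GELU, Linear, Sigmoid, with a channel reduction factor of 16); this is an illustrative re-implementation, not the project's exact module:

import torch
from torch import nn

class SE(nn.Module):
    """Squeeze-and-Excitation: global average pool, bottleneck MLP, sigmoid gating per channel."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.GELU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)  # per-channel weights in (0, 1)
        return x * w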

Transformer Blocks (S3-S4):

  • Attention Modules (Attention-105, Attention-125): Self-attention mechanism for capturing long-range dependencies.
  • FeedForward Modules (FeedForward-115, FeedForward-135): Fully connected layers with activation functions for feature processing.
  • Layer Normalization (LayerNorm-100, LayerNorm-109, LayerNorm-120, LayerNorm-129, LayerNorm-140): Normalizes activations across the feature dimension before each attention and feed-forward sub-layer (a simplified pre-norm block is sketched below).
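For orientation, here is a simplified pre-norm Transformer block in the same spirit. It uses PyTorch's standard nn.MultiheadAttention in place of CoAtNet's relative attention, so it is an approximation rather than the exact block:

import torch
from torch import nn

class TransformerBlock(nn.Module):
    """Pre-norm block: LayerNorm -> self-attention -> residual, then LayerNorm -> feed-forward -> residual."""
    def __init__(self, dim, heads=8, mlp_ratio=4, dropout=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(dim * mlp_ratio, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):            # x: (batch, tokens, dim), e.g. (B, 196, 384) in stage S3
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ff(self.norm2(x))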

Output Layer:

  • AvgPool2d-242: Average pooling operation to reduce spatial dimensions.
  • Linear-243: Fully connected layer mapping feature representation to output space, typically representing class probabilities for deepfake detection.

CoAtNet-2:

It is very similar to CoAtNet-0 but modifies the number of blocks and channels, which makes it a larger model (~600 MB) than CoAtNet-0 (~200 MB). We followed the same C-C-T-T stage structure for the CoAtNet-2 model.

model = CoAtNet(image_size=(224, 224), in_channels=3, num_blocks=[2, 2, 6, 14, 2], channels = [128, 128, 256, 512, 1026], num_classes=2)

Training

Because we trained our models for only 25 epochs, owing to computation and time constraints, the learning curves offer only a limited basis for drawing inferences.
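For reference, the training procedure can be sketched as a standard PyTorch loop like the one below; the loss function, optimizer, and learning rate (cross-entropy, Adam, 1e-4) are assumptions for illustration, not details stated above.

import torch
from torch import nn, optim

def train(model, train_loader, val_loader, epochs=25, lr=1e-4, device="cuda"):
    """Minimal training-loop sketch: cross-entropy on (face crop, real/fake label) pairs."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):
        model.train()
        train_loss = 0.0
        for images, labels in train_loader:          # images: (B, 3, 224, 224), labels: 0=real, 1=fake
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item() * images.size(0)

        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)
                val_loss += criterion(model(images), labels).item() * images.size(0)

        print(f"epoch {epoch + 1}: train {train_loss / len(train_loader.dataset):.4f}, "
              f"val {val_loss / len(val_loader.dataset):.4f}")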

Learning curves of CoAtNet-0 model:

  • Convergence: It’s difficult to determine convergence definitively from the plot. Both the training and validation loss curves flatten out towards the end, which suggests some degree of convergence, but a few more epochs would be needed to confirm whether the flattening continues or minor fluctuations remain.
  • Gap between losses: There is a persistent gap between training and validation loss, with validation loss higher. A modest gap like this indicates some generalization error while the model still generalizes reasonably well; a much larger gap would indicate overfitting.
  • Decreasing trend: The overall decreasing trend in both loss curves is very positive. It confirms the model is actively learning and improving throughout training. The rate of decrease is also informative: a sharp initial drop followed by a plateau suggests the model captured the key patterns quickly.

[Figure: CoAtNet-0 training and validation loss curves]

Learning curves of CoAtNet-2 model:

  • Convergence: Both the training and validation loss curves converge towards the end, indicating the model has learned and is not overfitting or underfitting significantly.
  • Gap between losses: There is a noticeable gap between the training and validation loss, with validation loss being higher. This is expected and suggests some generalization error, as the model performs slightly better on training data.
  • Decreasing trend: The overall decreasing trend of both loss curves over epochs is a positive sign, indicating the model is learning and improving its performance as training progresses.

[Figure: CoAtNet-2 training and validation loss curves]

Results

Results of CoAtNet-0 model:

[Figure: CoAtNet-0 test results]

We achieved 79% accuracy with our CoAtNet-0 model, which is good but not satisfactory. Below is a plot of the confidence scores of predictions on the test set (a 0.1 split of the data).

[Figure: confidence scores of CoAtNet-0 predictions on the test set]

With CoAtNet-0, we were able to produce correct predictions for 8 out of 10 videos from the test set. An example prediction is shown below, with a confidence score of around 0.77.

Test-Video:

Our Prediction score: 0.7742524743080139

The prediction score is on a scale of 0–1, where a prediction score >0.5 is fake and a prediction score <0.5 is real.
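One simple way to produce such a video-level score is to average the per-face fake probabilities and threshold at 0.5. The sketch below assumes the model outputs two logits (real, fake) and reuses the normalize transform from the preprocessing sketch; the aggregation scheme is our assumption, not necessarily the exact project code.

import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_video(model, face_crops, device="cuda"):
    """Return a video-level fake score in [0, 1]; > 0.5 is predicted fake, < 0.5 real."""
    model.eval()
    batch = torch.stack([normalize(face) for face in face_crops]).to(device)  # (N, 3, 224, 224)
    probs = F.softmax(model(batch), dim=1)[:, 1]   # probability of the "fake" class per face
    return probs.mean().item()

# score = predict_video(model, faces)   # e.g. 0.7742 -> predicted fake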

Some Fun Test:

Recently, Reid Hoffman, the co-founder of LinkedIn, shared a video showcasing his AI twin, which looks remarkably realistic and could fool almost anyone. However, our small model classifies the AI twin as real, highlighting its limitations.

Results of CoAtNet-2 model:

[Figure: CoAtNet-2 test results]

We achieved 89% accuracy with our CoAtNet-2 model, which is larger than CoAtNet-0 thanks to its additional blocks and wider channels.

We held out 0.1 of our training data as the test set. The test set is not balanced: it contains 30 real videos and 70 fake videos.

[Figure: confidence scores of CoAtNet-2 predictions on the test set]

Here, we can clearly observe that the CoAtNet-2 model performs very well on real videos, with zero errors, and maintains good confidence scores: most real videos scatter below 0.2 and most fake videos above 0.7.

Some Fun Test:

This time we got the correct prediction for Reid Hoffman’s AI twin, classifying it as a fake video, which shows the larger model performing well.

Note: We used the face_recognition library to extract faces when making predictions and testing the model because of its effective face extraction, but as mentioned above, we used BlazeFace during the training stage because its lightweight architecture saves a lot of time and computation.
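For completeness, test-time face extraction with face_recognition looks roughly like this (a sketch, not the exact project code):

import cv2
import face_recognition

def extract_faces_for_inference(video_path, every_n_frames=10, out_size=224):
    """Grab face crops from sampled frames using face_recognition's detector."""
    faces, idx = [], 0
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            # face_locations returns (top, right, bottom, left) boxes in pixel coordinates
            for top, right, bottom, left in face_recognition.face_locations(rgb):
                faces.append(cv2.resize(rgb[top:bottom, left:right], (out_size, out_size)))
        idx += 1
    cap.release()
    return faces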

Challenges Faced

Initially, we attempted to train the model on individual data chunks, saving and loading the model between chunks. However, this approach proved suboptimal and failed to yield satisfactory results.

Subsequently, we experimented with training the model on the entire dataset using various batch sizes and epochs. Through rigorous testing, we discovered that the model performed best when the data was skewed to contain more real data than fake, with an optimal batch size of 32 and diminishing returns observed after 25 epochs.

To address these challenges, we considered scaling up to larger models like CoAtNet-2 or CoAtNet-3, which could offer improved performance but come with significantly larger file sizes (over 900 MB after training).

In our reference project by “The CVIT,” over 90% accuracy was achieved, but with a model size of around 1 GB. In contrast, our CoAtNet-0 model, at approximately 200 MB, correctly detects around 8 out of 10 videos in the DFDC test set and shows promising performance on real-world deepfake videos we encountered.

Despite our efforts, we reached a deadlock in further improving the small model’s performance. Currently, we are training CoAtNet-2 on three data chunks for 25 epochs with a batch size of 12, which takes approximately 30 hours on a limited GPU.

Check out the project code:
https://github.com/Nikhilreddy024/Deepfake_detection-using-Coatnet

