DEV Community

daud99
daud99

Posted on • Edited on

Understanding/Exploring dataset (Part 2)

Visualizing Dataset

In previous blog, we looked into the .describe() function of pandas and tried to understand it.

We will be using seaborn which is just an extended form of matlplotlib for visualization.

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
Enter fullscreen mode Exit fullscreen mode

As our dataset contain quite a number of features 84. It is difficult to visualize it in a single graph. So, for better understanding we divided the feature sets into subset and visualize each of them individually.

def visualizeBarplot(column):
    discriptive_stat = merge_df.describe()
    discriptive_stat = discriptive_stat.T
    discriptive_stat.reset_index(inplace=True)
    prev = 0
    for each in range(20,81,20):
      plt.figure(figsize=(15,6)) # (width, height)
      sns.barplot(data=discriptive_stat.iloc[prev:each], x='index', y=column)
      plt.xticks(rotation=45, horizontalalignment="right")
      prev = each
Enter fullscreen mode Exit fullscreen mode

Just like below we can visualize any interesting statistics coefficient by passing into function such as mean, 50% etc.

# For specific column 
visualizeBarplot('std')
Enter fullscreen mode Exit fullscreen mode

Let's visualize the BarPlot for std.

Image description

You can note 1e7 on top of the graph indicating that our units on Y-axis actually needs to mulitplied by 1e7. So, 3.0 means 3 x 10 raised to power 7.

Fwd Header Length has a value of approximately 2.5 x 10^7 which indicates that the most of the value lies in the range of mean-std to mean+std. We know from the .describe function that MEAN for Fwd Header Length is -2.600024e+04.

Another, thing which in worth noting is that most of features are showing std near to zero. But as the unit of measurement is exponenital. We can't have a correct idea. Let's find which features have exact zero STD.

discriptive_stat[discriptive_stat['std']==0]
Enter fullscreen mode Exit fullscreen mode

Image description

These features with ZERO standard deviation doesn't really contribute in learning of the machine learning model. We better of removing them before trainning the actual model. We will do this in the cleaning phase.

We want to further know how MEAN & MEDIAN (50%) compare to each other. Are they close to each other or not.

def visualizeSideBySideBarplot(columns):
    df1 = pd.melt(discriptive_stat, id_vars=['index'], value_vars=columns)
    plt.figure(figsize=(30,6)) # (width, height)
    sns.barplot(data=df1, x='index', y="value", hue="variable")
    plt.xticks(rotation=45, horizontalalignment="right")
Enter fullscreen mode Exit fullscreen mode
visualizeSideBySideBarplot(['mean','50%'])
Enter fullscreen mode Exit fullscreen mode

Image description

We can clearly see from the graph that all features have a larger MEAN than the Median which is the strong indication of outliers.

To get more insight that outliers are in the first quartile or the last quartile. We will call the following

visualizeSideBySideBarplot(['25%','75%'])
Enter fullscreen mode Exit fullscreen mode

Image description

Correlation

One important aspect of understanding our dataset is seeing how different features corelate to each other.

Correlation measures the degree to which two phenomena are related to one another. For example, there is a correlation between summer temperatures and ice cream sales. When one goes up, so does the other. Two variables are positively correlated if a change in one is associated with a change in the other in the same direction, such as the relationship between height and weight. Taller people weigh more (on average); shorter people weigh less. A correlation is negative if a positive change in one variable is associated with a negative change in the other, such as the relationship between exercise and weight.

def corelationHeatMap(col_name):
  corr_matrix_cols = corr_matrix.columns
  plt.figure(figsize=(20,60)) # (width, height)
  index = [i for (i,each) in enumerate(corr_matrix.columns) if each == col_name][0]
  sns.heatmap(corr_matrix.iloc[:, [index]], annot=True)
Enter fullscreen mode Exit fullscreen mode

We can pass in the name of the feature to the function. We want to see correlation w.r.t to all other features just like we are doing with Flow Duration below.

corelationHeatMap(" Flow Duration")
Enter fullscreen mode Exit fullscreen mode

Image description

The power of correlation as a statistical tool is that we can encapsulate (to express or show the most important facts about something ) an association between two variables in a single descriptive statistic: the correlation coefficient. The correlation coefficient has two fabulously attractive characteristics. First, for math reasons that have been relegated to the appendix, it is a single number ranging from โ€“1 to 1. A correlation of 1, often described as perfect correlation, means that every change in one variable is associated with an equivalent change in the other variable in the same direction. A correlation of โ€“1, or perfect negative correlation, means that every change in one variable is associated with an equivalent change in the other variable in the opposite direction. The closer the correlation is to 1 or โ€“1, the stronger the association.

Numerical VS Categorical/String columns

The machine learing models work with numerical data. But often we have the categorical/string data. So, in order to make use of these features in our dataset. We need to make sure that these string features are represented in the numerical form which can be done in different way. We will see in the future blog. In order to do this first, we need to know which features are string features.
We can find them using the following way

def getCategoricalAndNumericalColumns():
  discriptive_stat = merge_df.describe()
  print("The Numerical columns are below")
  print(discriptive_stat.columns)
  print("The Categorical column are below")
  print(list(set(merge_df.columns)-set(discriptive_stat.columns)))
getCategoricalAndNumericalColumns()
Enter fullscreen mode Exit fullscreen mode

The Numerical columns are below
Index([' Source Port', ' Destination Port', ' Protocol', ' Flow Duration',
' Total Fwd Packets', ' Total Backward Packets',
'Total Length of Fwd Packets', ' Total Length of Bwd Packets',
' Fwd Packet Length Max', ' Fwd Packet Length Min',
' Fwd Packet Length Mean', ' Fwd Packet Length Std',
'Bwd Packet Length Max', ' Bwd Packet Length Min',
' Bwd Packet Length Mean', ' Bwd Packet Length Std', 'Flow Bytes/s',
' Flow Packets/s', ' Flow IAT Mean', ' Flow IAT Std', ' Flow IAT Max',
' Flow IAT Min', 'Fwd IAT Total', ' Fwd IAT Mean', ' Fwd IAT Std',
' Fwd IAT Max', ' Fwd IAT Min', 'Bwd IAT Total', ' Bwd IAT Mean',
' Bwd IAT Std', ' Bwd IAT Max', ' Bwd IAT Min', 'Fwd PSH Flags',
' Bwd PSH Flags', ' Fwd URG Flags', ' Bwd URG Flags',
' Fwd Header Length', ' Bwd Header Length', 'Fwd Packets/s',
' Bwd Packets/s', ' Min Packet Length', ' Max Packet Length',
' Packet Length Mean', ' Packet Length Std', ' Packet Length Variance',
'FIN Flag Count', ' SYN Flag Count', ' RST Flag Count',
' PSH Flag Count', ' ACK Flag Count', ' URG Flag Count',
' CWE Flag Count', ' ECE Flag Count', ' Down/Up Ratio',
' Average Packet Size', ' Avg Fwd Segment Size',
' Avg Bwd Segment Size', ' Fwd Header Length.1', 'Fwd Avg Bytes/Bulk',
' Fwd Avg Packets/Bulk', ' Fwd Avg Bulk Rate', ' Bwd Avg Bytes/Bulk',
' Bwd Avg Packets/Bulk', 'Bwd Avg Bulk Rate', 'Subflow Fwd Packets',
' Subflow Fwd Bytes', ' Subflow Bwd Packets', ' Subflow Bwd Bytes',
'Init_Win_bytes_forward', ' Init_Win_bytes_backward',
' act_data_pkt_fwd', ' min_seg_size_forward', 'Active Mean',
' Active Std', ' Active Max', ' Active Min', 'Idle Mean', ' Idle Std',
' Idle Max', ' Idle Min'],
dtype='object')
The Categorical column are below
[' Timestamp', ' Destination IP', 'Flow ID', ' Label', ' Source IP']

Finding the feature/Column with Null Values

We need to deal with the Null/empty values as it effect the learning of our models. So, if one features has most of it's value as Null. Then, we may be better of excluding it. However, there are other ways of dealing with empty values.

# Visualizing missing data
def visualizeNullHeatMaps():
    prev = 0
    for each in range(5,81,5):
      plt.figure(figsize=(15,200)) # (width, height)
      sns.heatmap(merge_df.iloc[:,prev:each].isnull(),yticklabels=False,cbar=False,cmap='viridis')
      prev = each
visualizeNullHeatMaps()
Enter fullscreen mode Exit fullscreen mode

Only FlowBytes/s feature contain the Null values. But to be exactly sure about how many of the values are NUll for FlowBytes/s.

If there is any Null/None value in the entire dataset.

merge_df.isnull().values.any()
Enter fullscreen mode Exit fullscreen mode

What is the name of the features/Cols with null value.

[col for col in merge_df if merge_df[col].isnull().values.any()]
Enter fullscreen mode Exit fullscreen mode

How many values in the given column/features are Null/None?

merge_df['Flow Bytes/s'].isnull().sum()
Enter fullscreen mode Exit fullscreen mode

Balance Vs InBalanced Dataset

Let's find out our dataset is balanced or not. Usually Balanced dataset is desired.

The balanced dataset is the one in which the number of instances/examples for all the features/columns are almost near to each other not too far from each other. For instance, you may have 1000 instances for class B but 50 instances for class A. This dataset will be the InBalanced Dataset.

sns.set_style('whitegrid')
plt.figure(figsize=(50,10)) # (width, height)
sns.countplot(x=' Label',data=merge_df,palette='RdBu_r')
Enter fullscreen mode Exit fullscreen mode

Image description

We can clearly see that our dataset is highly inbalanced which is a problem indeed.

But do not worry. We will see how can we balance this dataset in the future blog.

That's it for this blog. We would look into cleaning of data into next blog and will apply our understanding of the data into cleaning.

In case if there is some issue or I have gone wrong somewhere. Your feedback is welcomed.

David Out

Top comments (0)