Exploratory Data Analysis using Data Visualization Techniques.

Happy to be here again. In today's article, I'll define the two keywords in the title: Exploratory Data Analysis and Data Visualization. With an understanding of these two concepts and a sample project to illustrate them, everything will fall into place. I have discovered that Exploratory Data Analysis is a step that cannot be skipped in any Data Science project, whether one likes it or not.

Exploratory Data Analysis.

Exploratory Data Analysis is the process of investigating a dataset in order to form summaries and hypotheses based on our understanding of the data, discover patterns, detect outliers and gain insights through various techniques. Data visualization is one of them.

Data Visualization

Data visualization is the graphical representation of information and data.

Importance of Data Visualization

  • In the cleaning process, it helps identify incorrect data and missing values.
  • It makes results clear, so they can be interpreted and acted upon.
  • It enables us to visualize things that cannot be observed by looking at the raw data directly: phenomena like weather patterns and medical conditions, as well as mathematical relationships, e.g. in financial analysis.
  • It helps us construct and select variables; we can choose which features to discard and which to use.
  • It bridges the gap between technical and non-technical users by explaining figuratively what has been written in code.

A pictorial description of how visualized data is convenient.

Knowing the different types of analysis for data visualization is an important additional concept.
Univariate Analysis: we analyze the properties of one feature at a time.
Bivariate Analysis: we analyze the properties of exactly two features and how they relate to each other.
Multivariate Analysis: we compare more than two variables.
Let's get right to it.

Import the necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Reading the data

Step two is reading the data, which is mostly in CSV format, using the pandas library. I used this dataset.
df = pd.read_csv('StudentsPerformance.csv')
df.head()

Our dataset looks like this after running df.head(), which outputs the first 5 rows.

Output after loading the dataset using pandas.
You can easily tell just by looking at the dataset that it contains data about different students at a school/college, and their scores in 3 subjects.

Describe the Data

After loading the dataset, the next step is to summarize the info and its main characteristics. Consider it as a way to get summary statistics like the mean, the maximum and minimum values, the 25th percentile, etc. of the different columns in a data frame.
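The call itself isn't shown in the original post, but the standard pandas way to get these statistics is:

```python
# Summary statistics (count, mean, std, min, quartiles, max) for the numerical columns
df.describe()
```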
The output is something like this.

output of dataset description
Please also note: if you want to include categorical features (features that are not represented by numbers) in your output, run df.describe(include='all').
Now, in the output, count, unique and the most frequently appearing value (top) are also filled in. See below,

Output of df.describe(include='all')

Check for missing values

In case of any missing entries, it is advisable to fill them: categorical features with the mode, and numerical features with the median or mean. Run df.isnull().sum() to check.

Output for missing entries
Phew! We don't have any missing values.
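Had there been any, a minimal sketch of the fill strategy described above (mode for categorical columns, median for numerical ones) could look like this:

```python
# Hypothetical fill step -- only needed if isnull().sum() reports gaps
for col in df.columns:
    if df[col].dtype == 'object':
        # Categorical column: fill with the most frequent value (mode)
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        # Numerical column: fill with the median
        df[col] = df[col].fillna(df[col].median())
```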
We can now proceed to observe any underlying patterns, analyze the data and identify any outliers using visual representations. I love this part. Let's do it!

Graphs

Remember the three types of analysis we mentioned before? Let's look at them, starting with univariate analysis and a bar graph. Look at the distribution of the students across gender, race/ethnicity, lunch status and whether or not they took a test preparation course.

plt.subplot(221)
df['gender'].value_counts().plot(kind='bar', title='Gender of students', figsize=(16,9))

plt.subplot(222)
df['race/ethnicity'].value_counts().plot(kind='bar', title='Race/ethnicity of students')
plt.xticks(rotation=0)

plt.subplot(223)
df['lunch'].value_counts().plot(kind='bar', title='Lunch status of students')
plt.xticks(rotation=0)

plt.subplot(224)
df['test preparation course'].value_counts().plot(kind='bar', title='Test preparation course')
plt.xticks(rotation=0)

plt.show()

The output:

graph output

We can draw a lot of information from this. For instance,

  • There are more girls than boys.
  • The majority of students belong to race groups C and D.
  • More than 60% of the students have a standard lunch.
  • More than 60% of the students have not taken any test preparation course.

Next, let's continue with univariate analysis and use a boxplot. A boxplot helps us visualize the data in terms of quartiles, and numerical columns are visualized very well with boxplots. We use the function df.boxplot().
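A minimal version of that call, restricted to the three score columns (column names taken from the dataset shown earlier), could look like this:

```python
# Box plots of the three numerical score columns
df.boxplot(column=['math score', 'reading score', 'writing score'], figsize=(10, 6))
plt.show()
```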

Output boxplot

Boxplot

  • The horizontal green line in the middle represents the median of the data.
  • The hollow circles near the tails represent outliers in the dataset.
  • The middle portion represents the inter-quartile range (IQR).

From those points, we conclude that box plots show the distribution of the data: how far the values are dispersed or spread around the middle. So let's plot some distribution plots to see. We'll start with the math score.

sns.distplot(df['math score'])

Mathscore distribution plot
Well, the tip of the curve is at around 65 marks, which is the mean math score of the students in the dataset. We can do the same for the reading score and the writing score.
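The plots below come from the same sns.distplot call applied to the other two columns:

```python
sns.distplot(df['reading score'])
plt.show()

sns.distplot(df['writing score'])
plt.show()
```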

Reading score distribution plot

  • For our reading score curve, it's not a perfect bell curve. We conclude that the mean of the reading score is at around 72 marks.

Writing score distribution plot

  • For our writing score, it's also not a perfect bell curve. The mean of the writing score is at around 70 marks.

So far so good, right? One more thing: let's look at the correlation between the three scores by using a heatmap. Correlation basically means looking at the linear relationship between variables: if one variable changes, how does that affect the other?

corr = df.corr(numeric_only=True)  # numeric_only skips the categorical columns (needed on newer pandas)
sns.heatmap(corr, annot=True, square=True)
plt.yticks(rotation=0)
plt.show()

Heatmap of the three variables

  • The 3 scores are highly correlated.
  • Reading score has a correlation coefficient of 0.95 with the writing score. Math score has a correlation coefficient of 0.82 with the reading score.

Bivariate analysis: understand the relationship between two variables, possibly on different subsets of the dataset. We can try to understand the relationship between the math score and the writing score of students of different genders.

sns.relplot(x='math score', y='writing score', hue='gender', data=df)

Scatter plot of math score vs writing score, colored by gender
The graph shows a clear difference in scores between the male and female students. For a given math score, female students are likely to have a higher writing score than male students; for a given writing score, male students tend to have a higher math score than female students.
Finally, let's look at the impact of the test preparation course on students' performance using a horizontal bar graph.
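The code for this chart isn't included in the post; a minimal sketch, assuming we simply compare the mean of each score grouped by course status, could look like this:

```python
# Hypothetical sketch: average score per test-preparation status as a horizontal bar chart
scores = ['math score', 'reading score', 'writing score']
df.groupby('test preparation course')[scores].mean().plot(
    kind='barh', figsize=(10, 6), title='Average scores by test preparation course')
plt.show()
```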

Horizontal bar graph of average scores by test preparation course
It is very evident that students who completed the test preparation course performed better than those who didn't.
That's the end, guys.
Thank you for following through.
YOU CAN DO IT!
