Simon Ngotho

Article 2: Exploratory-Data-Analysis-using-Data-Visualization-Techniques

Exploratory Data Analysis using Data Visualization Techniques

Preliminaries

Exploratory Data Analysis (EDA) is a crucial step in data analysis because it helps analysts and data scientists understand a dataset by examining its structure and characteristics. It involves calculating summary statistics, identifying outliers, visualizing data distributions, exploring relationships between variables and performing hypothesis testing. These steps uncover insights that help businesses make sound decisions and guide further analysis. Data visualization tools present complex data in a simpler, more digestible form, even for professionals who are not tech-savvy.
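
The summary-statistics step can be sketched in a few lines of pandas. This is only an illustration on made-up data: the column names `age` and `income` and their values are hypothetical, not taken from any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# Hypothetical dataset: two numeric columns with made-up values
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=200),
    "income": rng.normal(50_000, 12_000, size=200),
})

# Count, mean, standard deviation, min/max and quartiles per column
summary = df.describe()
print(summary)

# Missing values per column, a common first data-quality check
missing = df.isna().sum()
print(missing)
```

On a real dataset, `describe()` and the missing-value count are usually a reasonable first look before any plotting.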

This article explores different data visualization techniques, with a focus on their application and significance.

Common Data Visualization Techniques for EDA

1. Scatter Plots

Scatter plots help visualize the relationship between two numerical variables. They are key tools for identifying correlations, outliers, clusters and trends. A scatter plot with a regression line can also reveal linear relationships.

# Using matplotlib

import matplotlib.pyplot as plt
import numpy as np

x = np.random.rand(50)  # random data
y = x + 2 + np.random.randn(50)  # simulated linear relationship plus noise

# Create the scatter plot

plt.scatter(x, y, c='yellow', edgecolor='red', marker='.', label='Points')

# Label the axes, add a title and show the legend

plt.xlabel('X-Axis')
plt.ylabel('Y-Axis')
plt.title('Lux Tech Practice Scatter Plot1')
plt.legend()
plt.show()


# Using seaborn

import matplotlib.pyplot as plt  # needed for the labels and title below
import seaborn as sns
import numpy as np

x = np.random.rand(50)  # random data
y = x + 2 + np.random.randn(50)  # simulated linear relationship plus noise

# Create the scatter plot (x and y are passed as keyword arguments)

sns.scatterplot(x=x, y=y, color='yellow', edgecolor='red', marker='.', label='Points')

# Label the axes and add a title

plt.xlabel('X-Axis')
plt.ylabel('Y-Axis')
plt.title('Lux Tech Practice Scatter Plot2')
plt.show()
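
The regression line mentioned above can be added with a least-squares fit. The following is a minimal sketch on simulated data, assuming a straight-line model; `np.polyfit` is one of several ways to fit it, and the slope, intercept and noise level used to generate the data are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility
x = rng.random(50)
y = 2 * x + 1 + rng.normal(0, 0.2, 50)  # simulated linear relationship with noise

# Fit a straight line y = slope * x + intercept by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

plt.scatter(x, y, color='steelblue', label='Points')
x_line = np.linspace(x.min(), x.max(), 100)
plt.plot(x_line, slope * x_line + intercept, color='red',
         label=f'Fit: y = {slope:.2f}x + {intercept:.2f} (r = {r:.2f})')
plt.xlabel('X-Axis')
plt.ylabel('Y-Axis')
plt.title('Scatter Plot with Regression Line')
plt.legend()
plt.show()
```

A fitted slope close to the value used in the simulation, together with a correlation coefficient near 1, is what a strong linear relationship looks like numerically.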

2. Histograms

Histograms help you understand the distribution of numerical data. They display the frequency or count of data points within specified ranges, or bins. Histograms help in checking whether the data is approximately normally distributed or how skewed it is.
In Python, one can use Matplotlib or Seaborn as shown below:

# Using matplotlib

import matplotlib.pyplot as plt
import numpy as np

data = np.random.randn(1000)  # sample data: 1000 draws from a standard normal


# Create the histogram

plt.hist(data, bins=15, edgecolor='black', alpha=0.7, color='yellow')

# Label the axes and add a title

plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Lux Tech Practice Histogram1')
plt.show()


# Using seaborn

import matplotlib.pyplot as plt  # needed for the labels and title below
import seaborn as sns
import numpy as np

data = np.random.randn(500)  # sample data: 500 draws from a standard normal

# Create the histogram

sns.histplot(data, bins=15, color='yellow', edgecolor='red')

# Label the axes and add a title

plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Lux Tech Practice Histogram2')
plt.show()
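
To put a number on "how skewed" a distribution is, the histogram can be paired with a skewness statistic. The sketch below compares two simulated samples; the choice of a normal and an exponential distribution is an assumption made purely for illustration, and it uses the pandas `Series.skew()` method.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # fixed seed for reproducibility
symmetric = pd.Series(rng.normal(0, 1, 1000))       # roughly bell-shaped
right_skewed = pd.Series(rng.exponential(1, 1000))  # long right tail

# Sample skewness: near 0 for symmetric data, positive for a right tail
skew_symmetric = symmetric.skew()
skew_right = right_skewed.skew()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(symmetric, bins=15, edgecolor='black')
axes[0].set_title(f'Symmetric (skew = {skew_symmetric:.2f})')
axes[1].hist(right_skewed, bins=15, edgecolor='black')
axes[1].set_title(f'Right-skewed (skew = {skew_right:.2f})')
for ax in axes:
    ax.set_xlabel('Values')
    ax.set_ylabel('Frequency')
plt.tight_layout()
plt.show()
```

Reading the statistic next to the picture helps confirm what the shape of the histogram suggests.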

3. Bar Charts

Bar charts show the frequency or count of each category, or another value measured per category, making it easy to identify the most common and the rarest values. Stacked bar charts are also used to show the distribution of categories within subgroups.

# Using matplotlib
import matplotlib.pyplot as plt
categories = ['Nakuru', 'Kisumu', 'Mombasa', 'Nairobi']
values = [1.9, 2.4, 3.2, 5.5]  # sample data

# Create the bar chart

plt.bar(categories, values, color='yellow', edgecolor='red')

# Label the axes and add a title

plt.xlabel('Cities')
plt.ylabel('Population in Millions')
plt.title('Lux Tech Practice Bar Chart1')
plt.show()
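
The stacked variant mentioned above can be sketched with the `bottom` argument of `plt.bar`. The two age groups and their counts below are hypothetical, invented only to show the mechanics.

```python
import matplotlib.pyplot as plt
import numpy as np

cities = ['Nakuru', 'Kisumu', 'Mombasa', 'Nairobi']
# Hypothetical counts per subgroup, made up for illustration only
under_30 = np.array([12, 15, 20, 35])
over_30 = np.array([8, 10, 14, 25])

# The first series sits on the axis; the second starts where the first ends
bottom_bars = plt.bar(cities, under_30, label='Under 30')
top_bars = plt.bar(cities, over_30, bottom=under_30, label='30 and over')

totals = under_30 + over_30  # total bar height per city

plt.xlabel('Cities')
plt.ylabel('Respondents')
plt.title('Stacked Bar Chart Sketch')
plt.legend()
plt.show()
```

Each bar's full height is the category total, while the coloured segments show how the subgroups contribute to it.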

4. Box Plots

Box plots show the median, quartiles and potential outliers, and are useful for comparing distributions between different categories or groups within the data.

# Using matplotlib

import matplotlib.pyplot as plt
import numpy as np

# Sample data: three groups with means 0, 2 and 4
data = [np.random.normal(0, 1, 50), np.random.normal(2, 1, 50), np.random.normal(4, 1, 50)]

# Create the box plot (Matplotlib 3.9+ prefers tick_labels= over labels=)

plt.boxplot(data, labels=['Category A', 'Category B', 'Category C'])

# Label the axes and add a title

plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Lux Tech Practice Box Plot1')
plt.show()
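
The quantities a box plot draws can also be computed directly. The sketch below uses the 1.5 × IQR rule, which corresponds to Matplotlib's default `whis=1.5`, to flag potential outliers; the sample and the two planted extreme values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed for reproducibility
values = np.append(rng.normal(0, 1, 50), [6.0, -5.5])  # two planted extremes

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the box

# Points beyond 1.5 * IQR from the quartiles are drawn as individual fliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f'Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}')
print('Outliers:', np.sort(outliers))
```

Flagged points are only candidates for investigation; whether they are errors or genuine extremes depends on the data.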

5. Time Series Plots

Time series plots help analysts understand how data evolves over time, identify trends, seasonality and anomalies, and make predictions based on historical data.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# 30 month-end dates (pandas 2.2+ prefers the alias 'ME' over 'M')
time_period = pd.date_range(start='2000-01-01', periods=30, freq='M')
data = np.cumsum(np.random.randn(30))  # sample time series: a random walk


# Create the time series plot

plt.figure(figsize=(14, 7))
plt.plot(time_period, data, color='yellow', marker='d', linestyle='-')

# Add a grid, axis labels and a title

plt.grid(True)
plt.xlabel('Time')
plt.ylabel('Values')
plt.title('Time Series Plot Example')
plt.show()
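
One simple way to make a trend easier to see is to overlay a rolling mean on the raw series. This is a sketch on simulated daily data; the 14-day window is an arbitrary choice, not a recommendation.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)  # fixed seed for reproducibility
dates = pd.date_range(start='2000-01-01', periods=120, freq='D')
# Simulated series: upward trend plus noise (made-up data)
series = pd.Series(np.linspace(0, 10, 120) + rng.normal(0, 1.5, 120), index=dates)

# 14-day rolling mean; the first 13 entries are NaN until the window fills
trend = series.rolling(window=14).mean()

plt.figure(figsize=(14, 7))
plt.plot(series.index, series, color='lightgray', label='Daily values')
plt.plot(trend.index, trend, color='red', label='14-day rolling mean')
plt.grid(True)
plt.xlabel('Time')
plt.ylabel('Values')
plt.title('Time Series with Rolling Mean')
plt.legend()
plt.show()
```

A wider window gives a smoother line but reacts more slowly to real changes, so the window size is worth trying at a few values.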

6. Heatmaps

Heatmaps use colour intensity to represent the values in a matrix. A common use in EDA is a correlation heatmap, where colour shows the strength and direction of the correlation between each pair of variables. This makes it easier to identify patterns in large datasets.

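# Using seaborn

import matplotlib.pyplot as plt  # needed for the labels and title below
import seaborn as sns
import numpy as np

data = np.random.rand(10, 5)  # sample data: 10 observations of 5 variables

# Create the heatmap of the raw values

sns.heatmap(data, annot=True, cmap="YlGnBu")

# Label the axes and add a title

plt.xlabel('Variables')
plt.ylabel('Observations')
plt.title('Lux Tech Practice Heatmap1')
plt.show()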
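
To show correlations rather than raw values, the correlation matrix can be computed first and then passed to the heatmap. The sketch below assumes a hypothetical DataFrame with four simulated columns (`a` to `d`), constructed so that some pairs are related and others are not.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(4)  # fixed seed for reproducibility
a = rng.normal(0, 1, 200)
# Hypothetical variables: b rises with a, c falls with a, d is unrelated
df = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(0, 0.5, 200),
    'c': -a + rng.normal(0, 0.5, 200),
    'd': rng.normal(0, 1, 200),
})

corr = df.corr()  # pairwise Pearson correlations, values in [-1, 1]

# A diverging palette centred on 0 separates positive from negative correlation
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1, center=0)
plt.title('Correlation Heatmap Sketch')
plt.show()
```

Fixing the colour scale to the range -1 to 1 keeps the colours comparable between different datasets.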

Conclusion

Data visualization techniques make complex data more interpretable, enabling the identification of patterns, trends, outliers and data quality issues. Results become easier for users to interpret regardless of their background, which helps them make data-driven decisions to solve problems.
