DEV Community

Tomoyuki Aota


Visualizing the patterns of missing value occurrence with Python

(A Japanese translation is available here.)

During data analysis, we need to deal with missing values. Handling missing data is a deep enough subject to fill an entire book. However, before doing anything about missing values, we need to know the pattern in which they occur. This article describes easy techniques for visualizing missing value occurrence with Python. The techniques are useful in the early stages of exploratory data analysis.

I've uploaded a Jupyter notebook to my GitHub repo. You can run it using Binder by clicking the badge below.

Binder

Prerequisite

I'm using the Titanic train dataset from Kaggle as an example. The following code is assumed to have been executed first.



import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns




df = pd.read_csv('train.csv')




# Confirm the number of missing values in each column.
df.info()




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


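
df.info() reports non-null counts, so the number of missing values has to be worked out by subtraction. As a minimal sketch, the counts can also be computed directly. The DataFrame `toy` below is a made-up stand-in (not the Titanic data) so that the snippet runs without train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count missing values per column and express them as a share of all rows.
missing_count = toy.isnull().sum()
missing_ratio = toy.isnull().mean()

summary = pd.DataFrame({"missing": missing_count, "ratio": missing_ratio})
print(summary)
```

With the real data, replacing `toy` with `df` should give the same kind of table for all 12 columns.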

Method 1: seaborn.heatmap

The first method uses seaborn.heatmap. The following single line of code visualizes the locations of missing values.



sns.heatmap(df.isnull(), cbar=False)



seaborn_heatmap.png

Reading along the index, I can see that

  • the Age column has missing values scattered across the rows,
  • the Cabin column consists mostly of missing values, with the valid entries scattered across the rows, and
  • the Embarked column has only a few missing values (2 of 891 rows, according to df.info() above).

This is not the case for this Titanic dataset, but especially in time series data, we need to know whether missing values are sparsely scattered or concentrated in a big chunk. This heatmap visualization immediately reveals such tendencies. Also, if two or more columns are correlated in where their missing values occur, that correlation will be visible. (Again, this is not the case for this dataset, but it is important to know that no such correlation exists here.)
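
The "sparse or big chunk" question can also be checked numerically. The following is a minimal sketch, not part of the original notebook: the helper name `missing_run_lengths` and the example series are hypothetical, and the function simply measures the length of each run of consecutive missing values.

```python
import numpy as np
import pandas as pd

def missing_run_lengths(series):
    """Return the length of each run of consecutive missing values."""
    is_missing = series.isnull()
    # A new run id starts whenever the missing/present state changes.
    run_id = is_missing.astype(int).diff().ne(0).cumsum()
    # Keep only the missing rows and count how many fall into each run.
    return is_missing[is_missing].groupby(run_id[is_missing]).size().tolist()

# Hypothetical series: one isolated gap followed by a chunk of three.
s = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, np.nan, np.nan, 8.0])
runs = missing_run_lengths(s)
print(runs)  # → [1, 3]
```

Many short runs suggest sparsely located gaps, while a single long run corresponds to the big chunk that shows up as a solid band in the heatmap.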

This single line of code tells us a lot about missing value occurrence.
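
When the code runs as a plain script rather than in a notebook, the figure is not displayed automatically. Here is a minimal sketch that saves the heatmap to a file instead. The DataFrame `toy` and the file name `missing_values.png` are assumptions made for illustration, and the Agg backend is selected only so that the script needs no display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the Titanic data; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10],
})

fig, ax = plt.subplots(figsize=(4, 3))
sns.heatmap(toy.isnull(), cbar=False, ax=ax)
fig.tight_layout()

# Arbitrary output name chosen for this sketch.
out_path = Path("missing_values.png")
fig.savefig(out_path)
plt.close(fig)
```

With an interactive backend, calling plt.show() instead of fig.savefig() opens the figure in a window.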

Method 2: missingno module

If you want to go further, the missingno module will be useful.
To begin with, install and import it.



pip install missingno




import missingno as msno



If you want a result similar to the seaborn.heatmap described earlier, use missingno.matrix.



msno.matrix(df)



missingno_matrix

In addition to the heatmap-like matrix, there is a sparkline on the right side of this diagram. It is a line plot of each row's data completeness. In this dataset, all rows have 10 to 12 valid values and hence 0 to 2 missing values.
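
The per-row completeness can also be computed directly with pandas. This is a minimal sketch using a made-up DataFrame `toy` (not the Titanic data). It illustrates the quantity being plotted and is not missingno's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of non-missing values in each row.
completeness = toy.notnull().sum(axis=1)
print(completeness.min(), completeness.max())

# How many rows fall at each completeness level.
print(completeness.value_counts().sort_index())
```

Applied to `df`, the minimum and maximum should match the 10 and 12 read from the diagram.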

Also, missingno.heatmap visualizes the correlation matrix of missing value locations between columns.



msno.heatmap(df)



missingno_heatmap
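
As a rough sketch of the kind of quantity such a heatmap displays, the correlation between missing value indicators can be computed with pandas alone. This illustrates the idea under stated assumptions and is not missingno's actual implementation. The DataFrame `toy` and its column names are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data with deliberately related missing value patterns.
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "B": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],         # missing where A is missing
    "C": [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],   # missing where A is present
    "D": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],               # never missing
})

is_missing = toy.isnull()

# Columns that are never (or always) missing have no variance,
# so their correlation is undefined; drop them first.
varying = is_missing.loc[:, is_missing.nunique() > 1]

# Pearson correlation between the 0/1 missing indicators.
nullity_corr = varying.astype(int).corr()
print(nullity_corr)
```

A value near 1 means two columns tend to be missing together, a value near -1 means one tends to be present when the other is missing, and a value near 0 means no linear relationship between their missing value locations.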

The missingno module has more features, such as a bar chart of the number of valid (non-missing) values in each column and a dendrogram generated from the correlation of missing value locations. For more information, the README is a good primer.

Closing

Two easy visualization methods are described in this article. seaborn.heatmap is the first choice as it requires only seaborn, but if you need more, the missingno module will help you.

Top comments (5)

Rauan Akylzhanov

You still need to call plt.show(), right?

balaranga33

Actually no, if you use the magic function "%matplotlib inline" in a Jupyter notebook, you don't need to call plt.show().

radekpjanik

How would you plot the missingno package plots into 3 subplots? E.g. have 3 subplots, one with matrix, one with heatmap and one with dendrogram?

Thanks!

Prateek Srivastava

fig, ax = plt.subplots(figsize=(25, 15), nrows=1, ncols=2)

# Visualize the number of missing values as a bar chart
msno.bar(df, ax=ax[0])

# Visualize the correlation between the number of missing values in different columns as a heatmap
msno.heatmap(df, ax=ax[1])

Maybe you can try something like this.

Mohammad Saleh

On the seaborn.heatmap, is there a way to show only the index of null rows on the left side of the graph?