Aspersh Upadhyay

Posted on • Originally published at Medium on

Exploring and analyzing large datasets with Python and Pandas

Pandas makes our data analysis journey easier


Image source: realpython.com (https://realpython.com/lessons/sorting-data-python-pandas-overview/)

In today’s world, data is being generated at an unprecedented rate, and the ability to effectively analyze and understand this data is critical for making informed decisions.

In this blog post, we will explore how to use the Python Pandas library to analyze datasets. We will use the Titanic dataset, a well-known dataset describing the passengers aboard the Titanic, which sank in 1912. The dataset itself is small (under a thousand rows), but the same operations apply unchanged to much larger tables.

The Pandas library is a powerful tool for data manipulation and analysis. It provides a data structure called a DataFrame, which allows us to manipulate and analyze large datasets with ease. In this post, we will go over some basic operations that can be performed on a DataFrame and how they can be used to analyze the Titanic dataset.

First, we will start by importing the necessary libraries and loading the Titanic dataset into a DataFrame.

# load required library 
import pandas as pd

# load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
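
When a file really is large, it can help to read only the columns you need, give them compact data types, or process the file in chunks. The following is a minimal sketch of those read_csv options, not part of the original walkthrough; the inline CSV text and the column selection are assumptions made so the example runs without a download.

```python
import io
import pandas as pd

# Hypothetical inline sample standing in for a much larger CSV file
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.2833
3,1,3,female,26,7.925
4,1,1,female,35,53.1
5,0,3,male,,8.05
"""

# Read only the columns we need and give them compact dtypes
small = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Survived", "Pclass", "Sex", "Age"],
    dtype={"Survived": "int8", "Pclass": "int8", "Sex": "category"},
)

# Alternatively, stream the data in chunks of two rows and keep a running total
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)

print(small.dtypes)
print(total_rows)
```

With a real file, the io.StringIO(...) argument would simply be replaced by the file path or URL.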

Once we have the dataset loaded into a DataFrame, we can start exploring the data. One of the first things we might want is a general overview. The head() method displays the first rows of the data (five by default).

# Display the first few rows of the data
df.head()
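
A few related calls are handy alongside head(). The sketch below uses a small made-up DataFrame (the names and values are illustrative assumptions, not Titanic data) to show how to ask for a specific number of rows, look at the end of the table, and check its size.

```python
import pandas as pd

# Hypothetical toy frame used only for illustration
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Age": [22, 38, 26, 35, 35, 54],
})

first_three = toy.head(3)   # first 3 rows instead of the default 5
last_two = toy.tail(2)      # last 2 rows
n_rows, n_cols = toy.shape  # (rows, columns) tuple

print(first_three)
print(last_two)
print(n_rows, n_cols)
```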

This will give us a general idea of the structure of the data and the type of information that is available. We can also use the describe() method to get basic statistics about the numerical columns: the count of non-null values, the mean, the standard deviation, the minimum, the quartiles, and the maximum.

# Get some basic statistics about the data
df.describe()
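
By default describe() skips text columns. As a small sketch on a made-up DataFrame (the values are assumptions for illustration), passing include="all" adds the number of unique values, the most frequent value, and its frequency for non-numeric columns.

```python
import pandas as pd

# Hypothetical toy frame with one text column and one numeric column
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, None],
})

numeric_summary = toy.describe()            # numeric columns only
full_summary = toy.describe(include="all")  # every column

print(numeric_summary)
print(full_summary)
```

Note that the count for Age excludes the missing value, which is why it is lower than the number of rows.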

We can also use the info() method to get information about the columns in the data, such as the data type and number of non-null values.

# Get information about the columns in the data
df.info()
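
The non-null counts from info() hint at missing data. To get the missing values per column directly, one option is isna() combined with sum() or mean(). This is a sketch on a made-up DataFrame; the values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy frame with some missing entries
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

missing_counts = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()   # fraction of missing values per column

print(missing_counts)
print(missing_share)
```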

Next, we can start analyzing the data in more detail. One way to do this is by using the groupby() method to group the data by a certain column and then applying a function to the groups. For example, we can group the data by the 'Survived' column and then calculate the mean of the 'Age' column for each group.

# Group the data by the 'Survived' column and calculate the mean of the 'Age' column
df.groupby("Survived")["Age"].mean()
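
The same grouping can return several statistics at once through agg(). Here is a minimal sketch on a made-up DataFrame (values are assumptions for illustration); missing ages are skipped by each aggregation.

```python
import pandas as pd

# Hypothetical toy frame; one age is missing on purpose
toy = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 1],
    "Age": [22.0, 40.0, 26.0, 38.0, None],
})

# One row per group, one column per statistic
age_stats = toy.groupby("Survived")["Age"].agg(["count", "mean", "median"])

print(age_stats)
```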

We can also use the pivot_table() function to create a pivot table of the data. This is a quick way to summarize one column (here, the mean age) across the combinations of two other columns (passenger class and survival).

# Create a pivot table of the data
pd.pivot_table(df, values='Age', index='Pclass', columns='Survived', aggfunc='mean')
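
A pivot table with a mean aggregation is closely related to a two-column groupby followed by unstack(). The sketch below, on a made-up DataFrame with assumed values, builds the table both ways so the two results can be compared.

```python
import pandas as pd

# Hypothetical toy frame used only for illustration
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3],
    "Survived": [0, 1, 1, 0, 0, 1],
    "Age":      [50.0, 30.0, 40.0, 20.0, 30.0, 10.0],
})

# Mean age per (Pclass, Survived) cell via pivot_table
pivot = pd.pivot_table(toy, values="Age", index="Pclass",
                       columns="Survived", aggfunc="mean")

# The same table via groupby: group on both keys, then move
# the 'Survived' level from the index into the columns
grouped = toy.groupby(["Pclass", "Survived"])["Age"].mean().unstack()

print(pivot)
print(grouped)
```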

Another way to analyze and visualize the data is by using data visualization libraries like Matplotlib and Seaborn. For instance, we can use the matplotlib.pyplot module to create a histogram of the 'Age' column.

import matplotlib.pyplot as plt

# Create a histogram of the 'Age' column
# (drop missing ages first, since some Matplotlib versions fail on NaN values)
plt.hist(df["Age"].dropna())
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Histogram of Age')
plt.show()
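
The numbers behind a histogram can also be computed without plotting, which is useful for checking what the chart shows. This is a sketch using pd.cut on made-up ages; the bin edges are an assumption chosen for illustration, not the bins Matplotlib would pick.

```python
import pandas as pd

# Hypothetical ages, including one missing value
ages = pd.Series([2.0, 15.0, 22.0, 26.0, 35.0, 38.0, 54.0, None], name="Age")

# Assumed bin edges: (0, 18], (18, 40], (40, 80]
bins = [0, 18, 40, 80]
binned = pd.cut(ages.dropna(), bins=bins)

# Count how many ages fall in each bin, keeping the bins in order
counts = binned.value_counts(sort=False)

print(counts)
```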

Similarly, we can use the seaborn.countplot() function to create a bar plot of the number of passengers who survived and who did not.

import seaborn as sns

# Create a bar plot of the number of passengers who survived and did not survive
sns.countplot(x="Survived", data=df)
plt.xlabel('Survived')
plt.ylabel('Count')
plt.title('Bar Plot of Survivors')
plt.show()
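
The heights of the bars in a count plot correspond to what value_counts() returns. A minimal sketch on a made-up Series (the values are assumptions for illustration) shows both the raw counts and the proportions.

```python
import pandas as pd

# Hypothetical survival flags: 0 = did not survive, 1 = survived
survived = pd.Series([0, 0, 0, 1, 1], name="Survived")

counts = survived.value_counts().sort_index()                # rows per value
shares = survived.value_counts(normalize=True).sort_index()  # fraction per value

print(counts)
print(shares)
```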

We can also use the seaborn.boxplot() function to create a box plot of the 'Age' column grouped by the 'Survived' column. This can give us a better understanding of the distribution of ages among survivors and non-survivors.

# Create a box plot of the 'Age' column grouped by the 'Survived' column
sns.boxplot(x="Survived", y="Age", data=df)
plt.xlabel('Survived')
plt.ylabel('Age')
plt.title('Box Plot of Age by Survival')
plt.show()
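
A box plot draws the median as a line and the first and third quartiles as the edges of the box. As a sketch on a made-up DataFrame (values are assumptions for illustration), those same numbers can be computed per group with quantile().

```python
import pandas as pd

# Hypothetical toy frame with five ages per group
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "Age": [10.0, 20.0, 30.0, 40.0, 50.0, 4.0, 14.0, 24.0, 34.0, 44.0],
})

# Quartiles of 'Age' per group: one row per group, one column per quantile
quartiles = toy.groupby("Survived")["Age"].quantile([0.25, 0.5, 0.75]).unstack()

# Interquartile range, i.e. the height of the box
quartiles["IQR"] = quartiles[0.75] - quartiles[0.25]

print(quartiles)
```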

In this blog post, we have covered some basic operations that can be performed on a DataFrame using the Python Pandas library, and we have used Matplotlib and Seaborn to visualize the Titanic dataset. Together, these tools make it much easier to explore, summarize, and visualize tabular data, and the same calls carry over to larger datasets.

NOTE: This is just a small subset of the analysis that can be done with the Titanic dataset. There is much more to be discovered and visualized.
