Aspersh Upadhyay

Posted on Aug 29, 2023 • Originally published at Medium on Jan 21, 2023

Exploring and analyzing large datasets with Python and Pandas

#data #datascience #python #datavisualization

Pandas makes our data analysis journey easier

Image source: realpython.com (https://realpython.com/lessons/sorting-data-python-pandas-overview/)

Data is being generated at an unprecedented rate, and the ability to analyze and understand it is critical for making informed decisions.

In this blog post, we will explore how to use the Python Pandas library to analyze large datasets. As a running example we will use the Titanic dataset, a well-known dataset that contains information about the passengers aboard the Titanic, which sank in 1912.

The Pandas library is a powerful tool for data manipulation and analysis. It provides a data structure called a DataFrame, which allows us to manipulate and analyze large datasets with ease. In this post, we will go over some basic operations that can be performed on a DataFrame and how they can be used to analyze the Titanic dataset.

First, we will start by importing the necessary libraries and loading the Titanic dataset into a DataFrame.

# Load the required library
import pandas as pd

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
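When a file is genuinely large, it can help to read only the columns we need and to give them compact data types. The snippet below is a minimal sketch of that idea using the usecols and dtype parameters of read_csv(). The inline CSV text is a made-up stand-in for a real file (these are not real passenger records), and the chosen dtypes are an assumption about what fits the data.

```python
import io

import pandas as pd

# Made-up rows that mimic a few Titanic columns; not real passenger records.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.28
3,1,3,female,,7.92
4,0,2,male,35,13.00
"""

# Read only the columns we plan to analyze and store them compactly.
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Survived", "Pclass", "Sex", "Age"],
    dtype={"Survived": "int8", "Pclass": "int8", "Sex": "category"},
)

print(df.dtypes)
print(df.shape)
```

The same two parameters work when the first argument is a file path or URL instead of an in-memory buffer.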

Once we have the dataset loaded into a DataFrame, we can start exploring the data. One of the first things we might want to do is to get a general overview of the data. We can use the head() method to display the first few rows of the data.

# Display the first few rows of the data
df.head()

This will give us a general idea of the structure of the data and the type of information that is available. We can also use the describe() method to get some basic statistics about the numerical columns in the data.

# Get some basic statistics about the data
df.describe()
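By default describe() reports the quartiles, but we can ask for other percentiles through its percentiles parameter. Here is a small sketch on a made-up DataFrame (the values are illustrative, not taken from the Titanic file).

```python
import pandas as pd

# Made-up values for illustration; not real Titanic records.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Ask for the 10th, 50th and 90th percentiles instead of the default quartiles.
summary = df.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```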

We can also use the info() method to get information about the columns in the data, such as the data type and number of non-null values.

# Get information about the columns in the data
df.info()
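The non-null counts from info() tell us indirectly how much data is missing. To see it directly, we can count the missing values per column with isna(). The sketch below uses a small made-up DataFrame whose column names mirror the Titanic file; the rows themselves are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not real Titanic records.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "Cabin": [None, "C85", None, None, "E46"],
})

# Number of missing values in each column.
missing_count = df.isna().sum()

# Share of missing values in each column (between 0 and 1).
missing_share = df.isna().mean()

print(missing_count)
print(missing_share)
```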

Next, we can start analyzing the data in more detail. One way to do this is by using the groupby() method to group the data by a certain column and then applying a function to the groups. For example, we can group the data by the 'Survived' column and then calculate the mean of the 'Age' column for each group.

# Group the data by the 'Survived' column and calculate the mean of the 'Age' column
df.groupby("Survived")["Age"].mean()
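A single mean can hide how many values it is based on, because mean() skips missing ages. One way to check is to compute several statistics at once with agg(). This is a minimal sketch on made-up data, not the real Titanic records.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not real Titanic records.
df = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1],
    "Age": [20.0, 40.0, np.nan, 10.0, 30.0, 56.0],
})

# For each group: how many ages are present, their mean, and their median.
age_by_survival = df.groupby("Survived")["Age"].agg(["count", "mean", "median"])
print(age_by_survival)
```

Here the group with Survived equal to 0 has three rows but a count of 2, which shows that one age was missing and left out of the mean.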

We can also use the pivot_table() method to create a pivot table of the data. This is a useful way to quickly summarize the data and see the relationship between different columns.

# Create a pivot table of the data
pd.pivot_table(df, values='Age', index='Pclass', columns='Survived', aggfunc='mean')
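Because 'Survived' is coded as 0 and 1, its mean within a group is the survival rate of that group. The sketch below uses this to build a pivot table of survival rate by class and sex. The rows are made up for illustration, so the rates shown are not the real Titanic figures.

```python
import pandas as pd

# Made-up rows for illustration; not real Titanic records.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3, 3, 1],
    "Sex": ["female", "female", "male", "female", "male", "male", "female", "male"],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
rate = pd.pivot_table(df, values="Survived", index="Pclass", columns="Sex", aggfunc="mean")
print(rate)
```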

Another way to analyze and visualize the data is by using data visualization libraries like Matplotlib and Seaborn. For instance, we can use the matplotlib.pyplot module to create a histogram of the 'Age' column.

import matplotlib.pyplot as plt

# Create a histogram of the 'Age' column (missing ages are dropped first)
plt.hist(df["Age"].dropna())
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Histogram of Age')
plt.show()
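The counts that a histogram draws can also be computed as a table, which is handy when we want the numbers rather than the picture. Below is a minimal sketch using pd.cut(). The ages are made up, and the bin edges and labels are an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration; not real Titanic records.
ages = pd.Series([2.0, 15.0, 22.0, 28.0, 35.0, 41.0, 67.0, np.nan], name="Age")

# Bin edges are right-inclusive by default: (0, 18], (18, 40], (40, 80].
bins = [0, 18, 40, 80]
labels = ["0-18", "18-40", "40-80"]
age_bands = pd.cut(ages, bins=bins, labels=labels)

# Count passengers per band; the missing age falls into no band.
counts = age_bands.value_counts().sort_index()
print(counts)
```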

Similarly, we can use the seaborn.countplot() function to create a bar plot of the number of passengers who survived and who did not.

import seaborn as sns

# Create a bar plot of the number of passengers who survived and did not survive
sns.countplot(x="Survived", data=df)
plt.xlabel('Survived')
plt.ylabel('Count')
plt.title('Bar Plot of Survivors')
plt.show()
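The numbers behind such a bar plot are available directly from value_counts(), which can also return shares instead of raw counts. A small sketch on a made-up 'Survived' column:

```python
import pandas as pd

# Made-up values for illustration; not real Titanic records.
survived = pd.Series([0, 0, 0, 1, 1], name="Survived")

# Raw counts per category, ordered by category value.
counts = survived.value_counts().sort_index()

# The same information as fractions of the total.
shares = survived.value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```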

We can also use the seaborn.boxplot() function to create a box plot of the 'Age' column grouped by the 'Survived' column. This can give us a better understanding of the distribution of ages among survivors and non-survivors.

# Create a box plot of the 'Age' column grouped by the 'Survived' column
sns.boxplot(x="Survived", y="Age", data=df)
plt.xlabel('Survived')
plt.ylabel('Age')
plt.title('Box Plot of Age by Survival')
plt.show()
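The box in a box plot spans the first to the third quartile, with a line at the median. If we want those values as numbers, one option is to compute the quantiles per group. This is a sketch on made-up ages, so the results do not describe the real passengers.

```python
import pandas as pd

# Made-up rows for illustration; not real Titanic records.
df = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 1, 1, 1],
    "Age": [20.0, 30.0, 40.0, 50.0, 10.0, 20.0, 30.0],
})

# Quartiles of 'Age' for each 'Survived' group, one row per group.
quartiles = df.groupby("Survived")["Age"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```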

In this blog post, we have covered some basic operations that can be performed on a DataFrame with the Python Pandas library, and we have used Matplotlib and Seaborn to visualize the Titanic dataset. Together, these tools make it much easier to explore, manipulate, and visualize large datasets.

NOTE: This is just a small subset of the analysis that we can do with the Titanic dataset. There is much more to be discovered and visualized.
