DEV Community

teri


How to Use Pandas for Data Analysis

Pandas is an open-source library that helps you analyze and manipulate data.

Pandas is one of the first tools you will reach for if you are getting into machine learning and data science.

In this article, you will learn how to use the Pandas commands in Jupyter Notebook for data analysis and manipulation.

Everything I share in this blog post is my interpretation of my journey into being a data scientist and machine learning expert using the tools that will make me proficient.

Review the code for this project.

Why Pandas?

Here is why you should consider using Pandas:

  • Simple to use: built-in functions transform your data into a usable form
  • Integrated with many other data science and ML Python tools
  • Helps you get your data ready for machine learning

Installing and using Pandas

Using an environment like Conda will make Pandas and other packages available. Check this resource to get your computer ready with Conda.

In my previous post, I wrote about an introduction to machine learning.

Let’s begin.

How to Import Pandas

To get started using Pandas, first import it in your Jupyter Notebook using the command:

import pandas as pd

pd is the conventional alias for the pandas package, so you can reach all of its functionality through the shorter name, for example pd.Series().

To confirm Pandas availability, check its version:

# print the version
print(f'Pandas version: {pd.__version__}')

Pandas version: 2.1.1

To learn more about pandas and read its documentation from inside a Jupyter notebook, type:

# pandas documentation
pd?

pandas docs

Data Types

Pandas has two main data types:

  • Series - a 1-dimensional column of data
  • DataFrame - a 2-dimensional table of data with rows and columns, which is the most common

Let’s create some data using these data types.

Series

You can create a Series using pd.Series() and passing in a Python list.

# Creating a series of the primary colors
colors = pd.Series(["Red", "Yellow", "Blue"])
colors


0       Red
1    Yellow
2      Blue
dtype: object


# Creating a Series of branded cars
cars = pd.Series(["Mercedes", "Toyota", "Dodge"])
cars
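As a quick way to see what a Series holds, the sketch below inspects the colors Series created above. It recreates the Series so the snippet runs on its own. The variable names index_labels, first_color, and n_items are illustrative only and are not part of the original project.

```python
import pandas as pd

# Recreate the Series from above so this snippet runs on its own
colors = pd.Series(["Red", "Yellow", "Blue"])

# The default index labels start at 0 and count upward
index_labels = list(colors.index)

# .iloc selects by position; position 0 is the first item
first_color = colors.iloc[0]

# len() gives the number of items in the Series
n_items = len(colors)

print(index_labels)
print(first_color)
print(n_items)
```

Selecting by position with .iloc keeps working even if you later give the Series a custom index.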

DataFrame

You can create a DataFrame using pd.DataFrame() and passing in a Python dictionary. Each key becomes a column name, and each value supplies that column's data.

# Creating a DataFrame of the cars and colors
car_data = pd.DataFrame({"Car type": cars, "Color": colors})
car_data

The command above combines the two Series into a single DataFrame, with each Series becoming a column:

        Car type        Color
0        Mercedes       Red
1        Toyota         Yellow
2        Dodge          Blue

Note: You are not limited to using only text; you can use any data type in your DataFrame, like integers, floats, dates, and more.
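To illustrate the note above, here is a minimal sketch of a DataFrame that mixes text, integers, floats, and dates. The inventory name, the column names, and all values are made up for illustration and do not come from the car sales data.

```python
import pandas as pd

# Hypothetical sample data: names and values are invented for this example
inventory = pd.DataFrame({
    "Car type": ["Mercedes", "Toyota", "Dodge"],   # text
    "Doors": [4, 4, 2],                            # integers
    "Engine size (L)": [3.0, 1.8, 5.7],            # floats
    "Listed on": pd.to_datetime(
        ["2023-01-15", "2023-02-01", "2023-03-10"]
    ),                                             # dates
})

# .dtypes reports the data type pandas chose for each column
print(inventory.dtypes)
```

Checking .dtypes early is a cheap way to catch a numeric column that was accidentally read as text.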

Importing Data

In a work environment, you will usually import data from a .csv (comma-separated values) file, a spreadsheet, or a similar source such as an SQL database.

Pandas imports data with functions such as pd.read_csv() for CSV files and pd.read_excel() for Microsoft Excel files.

Download the car sales CSV data and save it in the root directory of your working folder.

# import the car sales data
car_sales = pd.read_csv('car_sales.csv')
car_sales

Note: read_csv() can also read data from a URL.

car sales data frame
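If you do not have the CSV file at hand, the following sketch shows the same read_csv() call against an in-memory string. The CSV text, its column names, and the sample_sales variable are assumptions made for this example. They are not the contents of car_sales.csv.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Make,Colour,Doors
Toyota,White,4
Honda,Red,4
BMW,Blue,5
"""

# read_csv() accepts a file path, a URL, or a file-like object such as StringIO
sample_sales = pd.read_csv(io.StringIO(csv_text))

# .head() returns the first rows (5 by default) for a quick look
print(sample_sales.head())

# .shape is a (rows, columns) tuple
print(sample_sales.shape)
```

The first line of the CSV is treated as the header row, so Make, Colour, and Doors become the column names.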

Anatomy of a DataFrame

As shown above, the rows of a DataFrame are numbered from index 0 by default. Rows lie along axis 0 and columns along axis 1, which is instrumental when you want to drop a column from the table using .drop(), because you need to pass axis=1. Be careful when performing this action. The values stored in the table are known as the data.

anatomy of a dataframe
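The axis idea can be sketched with the small cars-and-colors table from earlier. The DataFrame is rebuilt here so the snippet runs on its own, and the names without_color and without_first_row are illustrative only.

```python
import pandas as pd

# Rebuild the small table from earlier so this snippet runs on its own
car_data = pd.DataFrame({
    "Car type": ["Mercedes", "Toyota", "Dodge"],
    "Color": ["Red", "Yellow", "Blue"],
})

# axis=1 targets columns: remove the "Color" column
without_color = car_data.drop("Color", axis=1)

# axis=0 targets rows: remove the row whose index label is 0
without_first_row = car_data.drop(0, axis=0)

# .drop() returns a new DataFrame; car_data itself is unchanged
print(without_color.columns.tolist())
print(without_first_row.index.tolist())
print(car_data.shape)
```

Because .drop() returns a new DataFrame, you need to assign the result to a variable if you want to keep the change.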

Alternatives to Pandas

These two tools are worth checking out.

  • Polars: a DataFrame library written in Rust that you can use from both Python and Rust
  • Ibis: a portable Python dataframe library

Conclusion

This article showed you the basic commands for using Pandas. To learn more about the possibilities of Pandas, check out this repository with the other code samples to get familiar with using this tool for your data.

Top comments (2)

Prayson Wilfred Daniel • Edited

If I were to initiate my Data Science endeavors today, I would opt for Polars rather than Pandas, given its superior speed and congruence with contemporary Python practices.

In the same vein, Ibis stands out as a preferable alternative to Pandas. Nonetheless, I persist with Pandas, as my comfort and proficiency with it are deeply ingrained.

teri

Hey Prayson,
I am learning so much from your contributions. This is duly noted, and I will research the tools you have mentioned.

Thank you so much for your support.