gerry leo nugroho
Sample Super Store Analysis Using Python & Pandas

1. Some Quick Intros

This tutorial provides a concise, foundational guide to exploring a dataset, specifically the Sample SuperStore dataset. The dataset, which appears to originate from a fictional e-commerce or online marketplace company's annual sales data, serves as an excellent example for learning how to work with real-world data. It includes a variety of data types, which demonstrate the full range of capabilities that the Pandas library for Python can offer.

Key data types in this dataset include:

  • Timestamp or Time Series: Enables the analysis of data over specific time intervals.
  • Categorical Data: Covers both category and subcategory elements, allowing for detailed segmentation and analysis.
  • Numerical Data: Supports various numerical analyses, such as summarization, aggregation, and statistical computations.
  • Feature Generation: Provides opportunities to create new features from existing data, enhancing the depth of analysis.

With these data types, we can fully utilize the various built-in functionalities of the Pandas library. In this tutorial, we will explore the dataset using Jupyter Notebook as our main integrated development environment (IDE). To get started, please download the CSV file used in this project from my GitHub repository, available here.


2. Requisite Installations

To begin this data exploration, you will need a functional installation of the Python programming language. Additionally, ensure that you have Jupyter Notebook and the Pandas library installed on your computer, as these tools will be essential throughout our exploration.

While this tutorial does not cover the installation process for these tools, you can find detailed instructions on their respective official websites. Please visit these sites to learn how to download, install, and configure Python, Jupyter Notebook, and Pandas for your specific operating system. Once you have completed the installation and setup, return to this guide to continue with the data exploration. If you're interested, I've already written an article on how to set up your local environment for data crunching, found here.

Feel free, if you're the adventurous type and wish to take the journey yourself; here are the links to get you started:


3. Importing Library

To start, we need to import the Pandas library into our Jupyter Notebook environment. This step allows us to utilize Pandas' powerful data manipulation functions directly within the notebook. Below is the code snippet to import Pandas into your Jupyter Notebook:

# Importing the Pandas library
import pandas as pd

With Pandas successfully imported, we can now proceed to the next step: reading datasets from a CSV file.


4. Reading Dataset

Now that we've settled our first challenge, let's move on to reading the dataset from a CSV file.

df_orders = pd.read_csv('data/superstore.csv')

As the above code implies, we need to put the superstore.csv dataset in a directory called data. So once you've downloaded the dataset, create a folder named data and place your superstore.csv file in that directory.

The remaining code implies these instructions:

  • df_orders = the name of the variable we'll be using throughout this tutorial.
  • pd = the conventional alias for Pandas used by the community.
  • .read_csv() = the Pandas method that reads a CSV file into a DataFrame.

Now, let's display the DataFrame we've just defined.

df_orders

gemika haziq nugroho - gerry leo nugroho - sample superstore - 001

By default, Pandas only shows 20 columns and 10 rows each time you display a dataset. If you look at the tabular output above, the dataset gets truncated with a triple-dot sign '...' for both the rows and the columns. Since this dataset has 10,800 rows and 21 columns, only 10 rows are shown (the first and last 5), along with 20 of the 21 columns. As a side note, you can scroll the output to the right and to the bottom to see more of the actual dataset content.
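To see this truncation behavior in isolation, here's a small sketch using a synthetic DataFrame (a stand-in for superstore.csv, so the column names are made up):

```python
import pandas as pd

# A synthetic stand-in for superstore.csv: 100 rows, 25 columns.
df = pd.DataFrame({f"col{i}": range(100) for i in range(25)})

# Within these display limits, the repr is truncated with a '...' marker.
with pd.option_context("display.max_rows", 10, "display.max_columns", 20):
    text = repr(df)

print("..." in text)  # the truncation marker shows up in the output
```
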


5. Dropping The "Row ID"

The "Row ID" column is not really that informative, so it's safe for us to simply delete it. That way, we get much more clarity over our dataset.

  • df_orders = the name of the variable we'll be using throughout this tutorial.
  • .drop() = the method used to drop a column.
  • axis=1 = indicates that the operation is performed on columns (axis=0 would refer to rows).
  • inplace=True = applies the change to the DataFrame itself instead of returning a modified copy.
df_orders.drop("Row ID", axis=1, inplace=True)
# Let's call the previously defined data variable, the 'df_orders'
df_orders

gemika haziq nugroho - gerry leo nugroho - sample superstore - 002

As you can see from the above table, we no longer have the 'Row ID' column; the table now starts with 'Order ID' instead.
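One small caveat worth knowing: if you re-run the drop cell, Pandas raises a KeyError because "Row ID" is already gone. Here's a sketch (on a tiny made-up frame) of how errors='ignore' makes the drop safe to re-run:

```python
import pandas as pd

# A tiny made-up frame with the same "Row ID" idea as superstore.csv.
df = pd.DataFrame({"Row ID": [1, 2], "Sales": [10.0, 20.0]})

# errors="ignore" makes the drop idempotent: re-running the cell is safe.
df = df.drop(columns=["Row ID"], errors="ignore")
df = df.drop(columns=["Row ID"], errors="ignore")  # no KeyError the second time

print(list(df.columns))  # → ['Sales']
```
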


6. Change "The Index" Column

Each time you use pd.read_csv('somedata.csv'), Pandas loads the dataset's actual rows and columns, and as displayed by the previous function, we certainly have quite an extensive set of records.

As you may notice, the first column isn't the actual "Row ID" column; rather, it's the default integer index Pandas adds to every dataset. Let's change that into something much more useful. We'll revisit our previous read_csv call, but this time add another parameter: index_col, to be exact.

Since we wish to redo this on a clean dataset, let's just load it one more time with the
'df_orders = pd.read_csv('data/superstore.csv', index_col='Order ID')' function.

Let's continue with the 'df_orders' variable again. Don't be surprised if you see the 'Row ID' column reappearing in the dataset, since we reloaded the raw CSV; this illustrates our objective best. This time, though, the index values come from the 'Order ID' column instead.

# Let's try to read again from the superstore.csv
df_orders = pd.read_csv('data/superstore.csv', index_col='Order ID')
# added the index_col='Order ID', parameter.
# Let's call the previously defined data variable, the 'df_orders'
df_orders

gemika haziq nugroho - gerry leo nugroho - sample superstore - 003

Once we add the additional parameter, as you may notice, the first column changes to the 'Order ID' column rather than Pandas' previous built-in index column. Another thing: the fine print below the table now reports 20 columns instead of 21.


7. Drop the Row ID & Change The Index

On to our next objective: what if we wish to combine both features, dropping the 'Row ID' column and changing the index at the same time, so that we get an even leaner dataset to work with? For that, we need to combine both pieces of syntax.

df_orders = pd.read_csv('data/superstore.csv', index_col='Order ID')
df_orders.drop("Row ID", axis=1, inplace=True)
  • pd.read_csv() = the Pandas function used to read a CSV file into a DataFrame.
  • 'data/superstore.csv' = the file path of the CSV file being read.
  • index_col='Order ID' = this parameter specifies that the "Order ID" column should be used as the index of the DataFrame.
# Let's call the dataset again.
df_orders

gemika haziq nugroho - gerry leo nugroho - sample superstore - 004

After this operation, the "Order ID" column will no longer be a regular column but will instead become the row labels (index) of the DataFrame.
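One practical payoff of the label index: rows can now be looked up by their Order ID via .loc. A minimal sketch with a tiny in-memory CSV (the IDs and values here are made up):

```python
import pandas as pd
from io import StringIO

# A tiny in-memory CSV standing in for data/superstore.csv (made-up values).
csv = StringIO("Order ID,Sales\nCA-001,10.5\nCA-002,20.0\n")
df = pd.read_csv(csv, index_col="Order ID")

# With "Order ID" as the index, .loc selects rows by label.
print(df.loc["CA-002", "Sales"])  # → 20.0
```
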


8. Default Number of Rows & Columns

Let's set the maximum columns and rows to display. By default, the Pandas library displays 10 rows in total for a truncated dataset: the first 5 from the top and the last 5 from the bottom.

But that's a bit too much information to digest at a short glimpse, so why don't we trim it down to 5 rows instead? The same goes for the column view: Pandas displays up to 20 columns, and since our current dataset only has 19 of them, that should be fine.

pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 5)
  • pd.set_option('display.max_columns', 20) = sets the maximum number of columns that Pandas will display when printing or showing a DataFrame.
  • pd.set_option('display.max_rows', 5) = sets the maximum number of rows that Pandas will display when printing or showing a DataFrame. If a DataFrame has more rows than this, the output is truncated and the remaining rows are hidden.
# Let's try to give it a go with the new setting.
df_orders

9. The Dataset First 5 Rows

Here's another Pandas built-in method that may come in handy. When you feel like taking a quick peek at the first 5 records from the top, the following code delivers that output.

  • df_orders = the name of the variable we'll be using throughout this tutorial.
  • .head() = the method that displays the first five records of the dataset.
df_orders.head()

gemika haziq nugroho - gerry leo nugroho - sample superstore - 005


10. The Dataset Last 5 Rows

Much like the previous syntax, the same can be applied to the bottom 5 records of your dataset. And you guessed it right: the syntax is .tail(), which gives you the last 5 records from the dataset.

  • df_orders = the name of the variable we'll be using throughout this tutorial.
  • .tail() = the method that displays the last five records of the dataset.
df_orders.tail(5)

gemika haziq nugroho - gerry leo nugroho - sample superstore - 006


11. The Dataset Structure

Now that you have a finer understanding of the previous aspects of importing the library, loading the dataset, and manipulating the row and column views, let's move on to the dataset's structure. This is also an important area to cover before continuing the journey of exploring the dataset further.

The datasets you get from the wild might not always have the proper structure and data types you need. Before you do further analysis and manipulation, let's make sure both the structure and the data types have been taken care of properly.


12. Rows & Columns

The following built-in attribute helps us achieve our next objective as we go deeper into the analysis of the dataset. Let's find out exactly how many rows and columns there are; we already know this from the previous part, but luckily for us, Pandas also provides a direct way to display this information.

  • df_orders = the name of the variable we'll be using throughout this tutorial.
  • .shape = the attribute that holds the number of rows and columns as a (rows, columns) tuple.
df_orders.shape

Aside from the various other ways to find out how many rows and columns are in your dataset, Pandas has this built-in attribute to display that information directly. So now we know that the dataset contains the following totals:

  • 10800 rows.
  • 19 columns.
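Since .shape is an attribute rather than a method, it returns a plain (rows, columns) tuple you can unpack; len() gives the row count alone. A quick sketch on a small synthetic frame:

```python
import pandas as pd

df = pd.DataFrame({"a": range(5), "b": range(5)})

rows, cols = df.shape  # .shape is an attribute, no parentheses needed
print(rows, cols)      # → 5 2
print(len(df))         # → 5, the row count alone
```
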

13. Dataset Columns

Imagine that you're working with a large dataset, and by large, not merely in the number of rows it holds, but also in the number of columns spread from left to right. Good thing we're only working with 19 columns (down from the original 21 in our dataset). Now, wouldn't it be nice to have a way to display all the columns available in our dataset? Well, the good news is that Pandas ships with a built-in attribute to achieve just that.

  • df_orders = the name of the variable we'll be using throughout this tutorial.
  • .columns = the attribute that lists all the columns available in the dataset.
  • .dtypes = similarly, the attribute that lists each column's data type.
# Let's print the columns (features) names.
df_orders.columns

And here is the output from the previous command:

Index(['Order Date', 'Ship Date', 'Ship Mode', 'Customer ID', 'Customer Name',
       'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region',
       'Product ID', 'Category', 'Sub-Category', 'Product Name', 'Sales',
       'Quantity', 'Discount', 'Profit'],
      dtype='object')

14. Renaming Columns

It's just my way of doing things in the EDA phase: eliminating any whitespace inside column names. You may benefit a lot from performing this step, and as we move on to more complicated parts of the data exploration, we'll benefit from the added clarity, too.

# Let's try to rename the column.
df_orders.columns = ['OrderDate', 'ShipDate', 'ShipMode', 'CustomerID', 'CustomerName', 'Segment' , 'Country', 'City', 'State', 'PostalCode', 'Region', 'ProductID', 'Category', 'SubCategory', 'ProductName' , 'Sales', 'Quantity', 'Discount', 'Profit']
df_orders
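Hand-typing all 19 names works, but as a sketch of an alternative, the same whitespace cleanup can be done programmatically with the .str accessor on the columns Index (shown here on a two-column made-up frame):

```python
import pandas as pd

# A made-up frame with two of the superstore-style column names.
df = pd.DataFrame({"Order Date": [1], "Sub-Category": [2]})

# Strip spaces and hyphens from every column name in one pass,
# instead of retyping the full list by hand.
df.columns = df.columns.str.replace(" ", "").str.replace("-", "")

print(list(df.columns))  # → ['OrderDate', 'SubCategory']
```
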

gemika haziq nugroho - gerry leo nugroho - sample superstore - 007

15. Columns Data Type

Often, before you explore a column further with some operation, you need to make sure you're working with the correct data type. For example, you wouldn't be able to perform a division on a timestamp, or multiply a string by another string for that matter.

  • df_orders = the name of the variable we'll be using throughout this tutorial.
  • .info() = the method that prints a summary of the DataFrame, including each column's data type and non-null count.
# Let's print the columns data types.
df_orders.info()

gemika haziq nugroho - gerry leo nugroho - sample superstore - 008


16. Fixing Columns Data Types

As we can see from the above snippet, a couple of columns had their data types set incorrectly. For instance, the OrderDate column was set to the object data type instead of datetime, and PostalCode was set to float even though you wouldn't do any calculation on top of it. Somewhere down the line, flaws like these will lead to even bigger problems if we don't fix them now. Let's patch those data types with the following methods.

  • df_orders = the name of the variable we'll be using throughout this tutorial.
  • .astype() = the method that changes a column's data type in the dataset.
# Let's try to change the datatypes of the following columns in the dataset.
df_orders['OrderDate'] = df_orders['OrderDate'].astype('datetime64[ns]')
df_orders['ShipDate'] = df_orders['ShipDate'].astype('datetime64[ns]')
df_orders['PostalCode'] = df_orders['PostalCode'].astype('object')

Now let's recheck them to see if the code worked as intended.

# Let's print the columns data types.
df_orders.info()

gemika haziq nugroho - gerry leo nugroho - sample superstore - 009
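As a hedged aside, pd.to_datetime is a common alternative to .astype('datetime64[ns]') for date columns; its errors='coerce' option turns unparseable values into NaT instead of raising. A minimal sketch with made-up dates:

```python
import pandas as pd

# Made-up date strings standing in for the OrderDate column.
df = pd.DataFrame({"OrderDate": ["2017-11-08", "2017-06-12"]})

# errors="coerce" converts anything unparseable to NaT rather than failing.
df["OrderDate"] = pd.to_datetime(df["OrderDate"], errors="coerce")

print(df["OrderDate"].dtype)  # → datetime64[ns]
```
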


17. Dataset Statistic Figures

If you're more of the statistician type, perhaps you'll also be interested in the following method, which generates summary figures with a single line: the .describe() method.

  • df_orders = the name of the variable we'll be using throughout this tutorial.
  • .describe() = the method that pulls statistical figures from the dataset.

A short note: by default, the .describe() method only summarizes numerical columns, not categorical ones. With (include='all'), it works on both numerical and categorical values.

# Describing statistical information on the dataset
df_orders.describe()

gemika haziq nugroho - gerry leo nugroho - sample superstore - 010

# Describing more statistical information on the dataset
df_orders.describe(include='all')

gemika haziq nugroho - gerry leo nugroho - sample superstore - 011


18. Dataset Statistical Figures

The following code implies these instructions:

  • df_orders = the name of the variable we'll be using throughout this tutorial.
  • .count() = the count of non-null values in a specific column.
  • .mean() = the mean value of a specific column.
  • .std() = the standard deviation of a specific column.
  • .min() = the minimum value of a specific column.
df_orders["Sales"].count()

Would give you the value of "9994". (Note that this is lower than the 10800 total rows, since .count() skips missing values.)

df_orders["Sales"].mean()

Would give you the value of "229.8580008304938".

df_orders["Sales"].std()

Would give you the value of "623.2451005086818".

df_orders["Sales"].min()

Would give you the value of "0.444".
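These one-off figures are the same numbers .describe() bundles together, so you can also read them off its output by label. A small sketch on a made-up Series:

```python
import pandas as pd

# A made-up Sales-like series.
s = pd.Series([0.444, 10.0, 229.0])

# .describe() returns a Series of the same figures that .count()/.mean()/
# .std()/.min() compute individually, indexed by label.
stats = s.describe()
print(stats["count"])  # → 3.0
print(stats["min"])    # → 0.444
```
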


19. Exporting Dataset

Once we're satisfied with our results, it's time to export them. So let's export a new CSV dataset, so we can work with it in the next notebook tutorial.

df_orders.to_csv('data/df_orders_exported.csv', index=False)
  • df_orders = the name of the variable we'll be using throughout this tutorial.
  • .to_csv() = the method that exports the DataFrame to a CSV file.
  • index=False = we set this to False since we don't want the index column in the exported file. (Keep in mind that 'Order ID' currently serves as the index, so it will be left out of the export.)

Now go ahead and check your data directory; you should find df_orders_exported.csv there.
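A quick way to sanity-check the export is to read the file back and compare it to the original. The sketch below does the round trip through an in-memory buffer instead of data/df_orders_exported.csv, so it runs anywhere:

```python
import pandas as pd
from io import StringIO

# A small made-up frame standing in for df_orders.
df = pd.DataFrame({"Sales": [1.5, 2.5], "Quantity": [3, 4]})

# Write with to_csv, read back with read_csv, and compare.
buf = StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df2 = pd.read_csv(buf)

print(df.equals(df2))  # → True
```
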


20. Bonus Stage

Now that we've come a long way in exploring our superstore.csv dataset, it's time to dive a little deeper into what both Python and Pandas are capable of delivering. Let's create a custom helper function in Python that leverages the built-in options available in the Pandas library.

# Let's create a function named `display_all`, which we'll call in the next command.
def display_all(df_orders):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        display(df_orders)

What it basically does is define a function named display_all, which takes a DataFrame such as the df_orders variable we defined earlier at the top of this Jupyter Notebook tutorial. Inside it, the pd.option_context method temporarily raises the display.max_rows and display.max_columns options, and finally the display() call renders the DataFrame within that context.

Now that we've combined it all together, let's put it into action and see what it delivers back to us this time.

display_all(df_orders.head(10).T)

gemika haziq nugroho - gerry leo nugroho - sample superstore - 012

Since we've defined display_all in the previous cell, we can now use it to explore further and combine it with the different methods available in Pandas, much like the .describe() method.

display_all(df_orders.describe(include='all').T)

gemika haziq nugroho - gerry leo nugroho - sample superstore - 012

You've made it this far; congratulations on completing your first essential data science project. But there's more: if you wish to explore further or perhaps want to experiment, fork this Jupyter Notebook or copy it to your working directory. I made everything available on my GitHub repository. Got any questions? Feel free to ask them down below in the comment section; I'll try to get back to any inquiries as soon as I can. I hope you enjoy this tutorial, and thank you for reading.
