In today’s data-driven world, the ability to efficiently clean and analyze large datasets is a key skill. This is where Pandas, one of Python’s most powerful libraries, comes into play. Whether you're handling time series data, numerical data, or categorical data, Pandas provides you with tools that make data manipulation easy and intuitive. Let's jump into Pandas and see how it can transform your approach to data analysis.
Installing Pandas
To start using Pandas, you’ll need to install it. Like any other Python library, Pandas can be installed via pip by running the following command:
pip install pandas
Pandas Data Structures
Pandas has two core data structures: Series and DataFrame. Together they provide a solid foundation for a wide variety of data tasks.
1. Series
According to the Pandas documentation, a Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
import pandas as pd
# General form (data and index are placeholders for your own values):
# s = pd.Series(data, index=index)
# Creating a Series from a list
data = pd.Series([10, 20, 30, 40])
# Creating a Series from a dictionary
data_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})
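To make the role of the index more concrete, here is a minimal sketch. The labels 'a' to 'd' and the variable names are illustrative choices, not something prescribed by Pandas.

```python
import pandas as pd

# Illustrative labels; the index argument is optional
scores = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Access by label looks the value up through the index
print(scores['c'])

# Without an explicit index, Pandas assigns integer labels 0..n-1
default_index = pd.Series([10, 20, 30, 40])
print(list(default_index.index))
```

When a Series is built from a dictionary, as in the example above, the dictionary keys become the index labels.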
2. DataFrame
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types (numeric, string, Boolean, etc.). You can think of it like a spreadsheet, a SQL table, or a dict of Series objects.
import pandas as pd
data = {
'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley', 'Draco Malfoy', 'Luna Lovegood'],
'House': ['Gryffindor', 'Gryffindor', 'Gryffindor', 'Slytherin', 'Ravenclaw'],
'Patronus': ['Stag', 'Otter', 'Jack Russell Terrier', 'None', 'Hare'],
'Favorite Subject': ['Defense Against the Dark Arts', 'Arithmancy', 'Divination', 'Potions', 'Charms'],
'Quidditch Position': ['Seeker', 'None', 'Keeper', 'None', 'None'],
'OWL Scores': [7, 11, 7, 8, 9]
}
df = pd.DataFrame(data)
print(df)
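The "dict of Series" view can be checked directly: selecting a single column gives back a Series. The sketch below uses a trimmed-down version of the table (three rows, three columns) purely to keep it short.

```python
import pandas as pd

# Trimmed-down copy of the example data, for illustration only
df = pd.DataFrame({
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley'],
    'House': ['Gryffindor', 'Gryffindor', 'Gryffindor'],
    'OWL Scores': [7, 11, 7],
})

house = df['House']              # selecting one column returns a Series
print(type(house).__name__)
print(df.shape)                  # (number of rows, number of columns)
print(df['OWL Scores'].sum())    # column-wise operations work on the Series
```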
Data Manipulation with Pandas
Once you have your data in a DataFrame, Pandas provides powerful methods to explore, clean, and transform it. Let’s start with some of the most commonly used methods for exploring data.
1. Exploring Data
- head()
The head() method returns the column headers and the first n rows of the DataFrame. By default n is five, but you can pass a different number.
>>> df.head(3)
               Name       House              Patronus               Favorite Subject  Quidditch Position  OWL Scores
0      Harry Potter  Gryffindor                  Stag  Defense Against the Dark Arts              Seeker           7
1  Hermione Granger  Gryffindor                 Otter                     Arithmancy                None          11
2       Ron Weasley  Gryffindor  Jack Russell Terrier                     Divination              Keeper           7
- tail()
The tail() method returns the column headers and the last n rows of the DataFrame (five by default).
>>> df.tail(2)
            Name      House  Patronus  Favorite Subject  Quidditch Position  OWL Scores
3   Draco Malfoy  Slytherin      None           Potions                None           8
4  Luna Lovegood  Ravenclaw      Hare            Charms                None           9
- info()
A DataFrame has a method called info() that summarizes the dataset: the index range, each column's name, non-null count, and dtype, plus the memory usage.
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Name                5 non-null      object
 1   House               5 non-null      object
 2   Patronus            5 non-null      object
 3   Favorite Subject    5 non-null      object
 4   Quidditch Position  5 non-null      object
 5   OWL Scores          5 non-null      int64
dtypes: int64(1), object(5)
memory usage: 368.0 bytes
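One detail in this output is worth a second look: every column reports 5 non-null values even though several entries read 'None'. That is because the sample data stores the text 'None', not an actual missing value. The sketch below, on a two-column subset of the data, shows the difference. Using mask() to convert the placeholder is just one possible approach, chosen here for illustration.

```python
import pandas as pd

# Two-column subset of the example data, for illustration only
df = pd.DataFrame({
    'Name': ['Harry Potter', 'Hermione Granger', 'Draco Malfoy'],
    'Patronus': ['Stag', 'Otter', 'None'],   # 'None' here is a plain string
})

# The string 'None' does not count as missing
missing_before = df['Patronus'].isna().sum()
print(missing_before)

# mask() replaces values where the condition is True with NaN by default
cleaned = df['Patronus'].mask(df['Patronus'] == 'None')
missing_after = cleaned.isna().sum()
print(missing_after)
```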
- describe()
The describe() method gives summary statistics for the numeric columns of the dataset: count, mean, standard deviation, minimum, quartiles, and maximum.
>>> df.describe()
       OWL Scores
count    5.000000
mean     8.400000
std      1.673320
min      7.000000
25%      7.000000
50%      8.000000
75%      9.000000
max     11.000000
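Only 'OWL Scores' appears above because, by default, describe() summarizes numeric columns when a DataFrame contains any. Called on a text column, it reports a different set of statistics. A small sketch using the House values from the example:

```python
import pandas as pd

houses = pd.Series(
    ['Gryffindor', 'Gryffindor', 'Gryffindor', 'Slytherin', 'Ravenclaw'],
    name='House',
)

# For text data, describe() reports count, unique, top (most frequent), and freq
summary = houses.describe()
print(summary)
```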
2. Filtering
In data analysis, filtering helps you narrow down the data you're interested in. Pandas has several ways to filter data. The simplest is direct Boolean indexing, which selects rows based on a condition (e.g., a column's values). Let’s look at a few examples. In the first example, we’re selecting rows where the House value is Gryffindor:
import pandas as pd
data = {
'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley', 'Draco Malfoy', 'Luna Lovegood'],
'House': ['Gryffindor', 'Gryffindor', 'Gryffindor', 'Slytherin', 'Ravenclaw'],
'Patronus': ['Stag', 'Otter', 'Jack Russell Terrier', 'None', 'Hare'],
'Favorite Subject': ['Defense Against the Dark Arts', 'Arithmancy', 'Divination', 'Potions', 'Charms'],
'Quidditch Position': ['Seeker', 'None', 'Keeper', 'None', 'None'],
'OWL Scores': [7, 11, 7, 8, 9]
}
df = pd.DataFrame(data)
# Filter rows where the House is Gryffindor
gryffindor_students = df[df['House'] == 'Gryffindor']
print(gryffindor_students)
output
               Name       House              Patronus               Favorite Subject  Quidditch Position  OWL Scores
0      Harry Potter  Gryffindor                  Stag  Defense Against the Dark Arts              Seeker           7
1  Hermione Granger  Gryffindor                 Otter                     Arithmancy                None          11
2       Ron Weasley  Gryffindor  Jack Russell Terrier                     Divination              Keeper           7
In the second example, we’re filtering data where the OWL score (think of it as a magical equivalent to the SAT in the Harry Potter world) is greater than 8:
# Filter students with OWL Scores greater than 8
high_scorers = df[df['OWL Scores'] > 8]
print(high_scorers)
output
               Name       House  Patronus  Favorite Subject  Quidditch Position  OWL Scores
1  Hermione Granger  Gryffindor     Otter        Arithmancy                None          11
4     Luna Lovegood   Ravenclaw      Hare            Charms                None           9
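Boolean conditions can also be combined. The sketch below, on a three-column subset of the data, uses & (and), | (or), and isin(). Each condition needs its own parentheses because & and | bind more tightly than the comparison operators. The variable names are illustrative.

```python
import pandas as pd

# Three-column subset of the example data, for illustration only
df = pd.DataFrame({
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley', 'Draco Malfoy', 'Luna Lovegood'],
    'House': ['Gryffindor', 'Gryffindor', 'Gryffindor', 'Slytherin', 'Ravenclaw'],
    'OWL Scores': [7, 11, 7, 8, 9],
})

# AND: Gryffindor students who scored more than 8
gryffindor_high = df[(df['House'] == 'Gryffindor') & (df['OWL Scores'] > 8)]

# OR: Ravenclaw students, or anyone with a score of at least 8
ravenclaw_or_high = df[(df['House'] == 'Ravenclaw') | (df['OWL Scores'] >= 8)]

# isin(): membership in a list of values
other_houses = df[df['House'].isin(['Slytherin', 'Ravenclaw'])]

print(gryffindor_high)
print(ravenclaw_or_high)
print(other_houses)
```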
Another way to filter data is with the .loc indexer. It lets you select by Boolean conditions and by labels, for both rows and columns. If a specified label doesn’t exist, it raises a KeyError:
# Use .loc[] to filter students who scored more than 8 OWLs
high_owl_scores_loc = df.loc[df['OWL Scores'] > 8]
print(high_owl_scores_loc)
output
               Name       House  Patronus  Favorite Subject  Quidditch Position  OWL Scores
1  Hermione Granger  Gryffindor     Otter        Arithmancy                None          11
4     Luna Lovegood   Ravenclaw      Hare            Charms                None           9
At first glance, this may look like direct Boolean indexing. Still, there’s a key difference: .loc provides finer control, letting you select both rows and columns simultaneously, while Boolean indexing primarily filters rows:
# Use .loc[] to filter and select specific columns
gryffindor_students = df.loc[df['House'] == 'Gryffindor', ['Name', 'OWL Scores']]
print(gryffindor_students)
output
               Name  OWL Scores
0      Harry Potter           7
1  Hermione Granger          11
2       Ron Weasley           7
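The KeyError behavior mentioned earlier for .loc can be seen by asking for a label that is not in the DataFrame. In this sketch, 'Wand' is a hypothetical column name chosen because it does not exist in the example data.

```python
import pandas as pd

# Two-column subset of the example data, for illustration only
df = pd.DataFrame({
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley'],
    'OWL Scores': [7, 11, 7],
})

caught = None
try:
    # 'Wand' is a hypothetical label that is not a column of df
    df.loc[:, ['Name', 'Wand']]
except KeyError as err:
    caught = err
    print('KeyError:', err)
```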
Finally, we have the .iloc indexer. It is used for position-based selection, meaning you select rows and columns by their integer positions (starting at 0) rather than their labels:
third_character = df.iloc[2]
print(third_character)
output
Name                           Ron Weasley
House                           Gryffindor
Patronus              Jack Russell Terrier
Favorite Subject                Divination
Quidditch Position                  Keeper
OWL Scores                               7
Name: 2, dtype: object
# Select the first and last rows (positions 0 and 4) and the "House" and "OWL Scores" columns (positions 1 and 5)
first_last_info = df.iloc[[0, 4], [1, 5]]
print(first_last_info)
output
        House  OWL Scores
0  Gryffindor           7
4   Ravenclaw           9
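Slices behave differently in the two indexers, which is an easy thing to trip over. A minimal sketch on a two-column subset of the data: .iloc slices exclude the end position, like regular Python slices, while .loc slices include the end label.

```python
import pandas as pd

# Two-column subset of the example data, for illustration only
df = pd.DataFrame({
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley', 'Draco Malfoy', 'Luna Lovegood'],
    'OWL Scores': [7, 11, 7, 8, 9],
})

by_position = df.iloc[0:2]   # positions 0 and 1 (end excluded)
by_label = df.loc[0:2]       # labels 0, 1, and 2 (end included)

print(len(by_position))
print(len(by_label))
```

With the default integer index, positions and labels happen to coincide, so the difference only shows up in how the end of the slice is treated.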
3. Sorting
Sorting data with pandas is straightforward and can be done using the sort_values() method. For example, you can sort a list of students by their OWL scores in ascending order:
# Sort by 'OWL Scores' in ascending order (default)
sorted_by_owl = df.sort_values(by='OWL Scores')
print(sorted_by_owl)
output:
               Name       House              Patronus               Favorite Subject  Quidditch Position  OWL Scores
0      Harry Potter  Gryffindor                  Stag  Defense Against the Dark Arts              Seeker           7
2       Ron Weasley  Gryffindor  Jack Russell Terrier                     Divination              Keeper           7
3      Draco Malfoy   Slytherin                  None                        Potions                None           8
4     Luna Lovegood   Ravenclaw                  Hare                         Charms                None           9
1  Hermione Granger  Gryffindor                 Otter                     Arithmancy                None          11
To sort in descending order, set the ascending parameter to False:
# Sort by 'OWL Scores' in descending order
sorted_by_owl_desc = df.sort_values(by='OWL Scores', ascending=False)
print(sorted_by_owl_desc)
output:
               Name       House              Patronus               Favorite Subject  Quidditch Position  OWL Scores
1  Hermione Granger  Gryffindor                 Otter                     Arithmancy                None          11
4     Luna Lovegood   Ravenclaw                  Hare                         Charms                None           9
3      Draco Malfoy   Slytherin                  None                        Potions                None           8
0      Harry Potter  Gryffindor                  Stag  Defense Against the Dark Arts              Seeker           7
2       Ron Weasley  Gryffindor  Jack Russell Terrier                     Divination              Keeper           7
One of the powerful features of sort_values() is that it allows you to sort by multiple columns. In the example below, students are sorted first by their OWL scores and then by their house:
# Sort by 'OWL Scores' first in descending order, then by 'House' in ascending order
sorted_by_owl_first = df.sort_values(by=['OWL Scores', 'House'], ascending=[False, True])
print(sorted_by_owl_first)
output:
               Name       House              Patronus               Favorite Subject  Quidditch Position  OWL Scores
1  Hermione Granger  Gryffindor                 Otter                     Arithmancy                None          11
4     Luna Lovegood   Ravenclaw                  Hare                         Charms                None           9
3      Draco Malfoy   Slytherin                  None                        Potions                None           8
0      Harry Potter  Gryffindor                  Stag  Defense Against the Dark Arts              Seeker           7
2       Ron Weasley  Gryffindor  Jack Russell Terrier                     Divination              Keeper           7
In this case, the OWL score is the primary sorting criterion, so Pandas applies it first. If two students have the same OWL score, the House value is used as the secondary criterion. Here the only tie, Harry and Ron, is between two Gryffindors, so the secondary key does not change their order.
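To see a secondary key actually change the result, the sketch below swaps the roles of the two columns on a three-column subset of the data: House becomes the primary key and the OWL score breaks ties within each house. The variable name is illustrative.

```python
import pandas as pd

# Three-column subset of the example data, for illustration only
df = pd.DataFrame({
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley', 'Draco Malfoy', 'Luna Lovegood'],
    'House': ['Gryffindor', 'Gryffindor', 'Gryffindor', 'Slytherin', 'Ravenclaw'],
    'OWL Scores': [7, 11, 7, 8, 9],
})

# House is the primary key (A to Z); OWL Scores orders rows within a house (high to low)
by_house_then_score = df.sort_values(by=['House', 'OWL Scores'], ascending=[True, False])
print(by_house_then_score)
```

Within Gryffindor, Hermione now comes first because of her higher score, followed by Harry and Ron, who are still tied at 7.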
Exploring, filtering, and sorting data are essential first steps before jumping into tasks like data cleaning or wrangling in the data analysis process. Pandas offers a range of built-in methods that help organize and accelerate these operations. Additionally, Pandas integrates seamlessly with other libraries, such as NumPy or SciPy for numerical computations, Matplotlib for data visualization, and analytical tools like Statsmodels and Scikit-learn. By learning Pandas, you can significantly boost your efficiency in handling and analyzing data, making it a valuable skill for any data professional. Happy coding!