Introduction
Data Cleaning is a very important part of data analysis. Clean Data ensures that your analysis is correct and your insights are reliable when used in decision making. It also improves quality of your data
Objectives:
- Ensure that all columns are clearly named
- Ensure that columns have the correct datatypes
- Remove all unnecessary information from the dataset Lets Get Started:
Import necessary python libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import the csv data set from the folder where its saved
df = pd.read_csv(r'...\fifa21 raw data v2.csv')
set your data frame such that it displays all columns since the dataset has many columns.
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)
The benefit is that not only all columns are well displayed, but also the printed rows can be larger than
the usual ~100 characters limit.
Display the first five rows
df.head()
Check how many rows and columns make up your dafaframe
df.shape
print('The number of rows are:', df.shape[0])
print('The number of columns are:', df.shape[1])
Display a list of all column names
for x in df.columns.tolist():
print(x)
Make a copy of your data set so as to retain an original copy
df1 = df.copy()
Remove unnecessary columns
df1 = df1.drop(['Name', 'photoUrl', 'playerUrl'], axis = 1)
Rename columns
df1 = df1.rename(columns = {
"LongName": "Name",
"↓OVA":"Overall Rating(%)",
"POT": "Potential(%)",
"BOV": "Best Overall(%)",
"BP": "Best Position",
"W/F": "Weak Foot",
"SM": "Skill Moves",
"A/W": "Attacking Work Rate",
"D/W": "Defensive Work Rate",
"IR": "International Reputation",
"PAC": "Pace",
"SHO": "Shooting",
"PAS": "Passing",
"DRI": "Dribbling",
"DEF": "Defense",
"PHY": "Physicality"
})
Remove whitespaces in the club column
#Remove whitespaces from the club column
df1['Club'] = df1['Club'].str.lstrip()
Clean contract column using a function
#define a function to change contract column values
def contract_status(value):
if 'On Loan' in value:
value = 'On Loan'
return value
elif '~' in value:
value = 'Active'
return value
else:
value = 'Free'
return value
#apply the function on contract column
df1['Contract'] = df1['Contract'].apply(contract_status).astype('category')
Rename Contract column
#Rename the contract column
df1 = df1.rename(columns = {'Contract': 'Contract Status'})
Process of cleaning contract column
- A function was defined to change the row values from '2004 ~ 2021', 'On Loan' and 'Free' to 'Active', 'On Loan' and 'Free' 'Contract' column was renamed to 'Contract Status
- Also, the data type was changed to category.
Check Height and Weight Columns
#Check for unique values
for column in df[['Height', 'Weight']]:
value = df1[column].unique()
print(f'{column}\n{value}.\n')
Create Function to convert height to cm
#Function to convert height to cm
def convert_height(value):
if 'cm' in value:
value = int(value[:-2])
return value
else:
feet, inches = value.split("'")
total_inches = int(feet) * 12 + int(inches[:-1])
height_cm = total_inches * 2.54
return round(height_cm, 2)
df1['Height'] = df1['Height'].apply(convert_height).astype('int64')
df1['Height'].unique()
Create function to convert weight to kg
def convert_weight(value):
if 'kg' in value:
value = value.strip('kg')
return value
else:
value = value.strip('lbs')
Weight = round((float(value) * 0.45359237), 2)
return Weight
df1['Weight'] = df1['Weight'].apply(convert_weight).astype('int64')
df1['Weight'].unique()
Process of Cleaning 'Height' and 'Weight' columns
- Create a function to convert Height
- values with cm remain the same except that we extract only value without 'CM'
- value in feet and inches we convert them to CM
- Lastly, convert Height column to int
Process of cleaning Weight column
- create a function to convert LBS to kgs
- If value is in "KGS" it remains the same
- If value is in "LBS" convert them to KGS by multiplying by 0.4535...
Conclusion
In this project, we embarked on a comprehensive data cleaning and transformation journey with the FIFA dataset. Our primary objective was to prepare the data for analysis by addressing inconsistencies, refining data types, and enhancing its usability.
In conclusion, data cleaning and transformation are foundational steps in any data analysis project.By addressing inconsistencies, refining data types, and enhancing data quality, we have set the stage
for more meaningful and insightful analyses. The clean and structured dataset is now well-equipped for advanced analytics, visualizations, and modeling.
Follow me on Github and Linked In for a more details on the project
Top comments (0)