allan-pg

Posted on Feb 19

From Messy to Masterpiece: Cleaning FIFA Data with Python ⚽🚀

#datascience #data #dataengineering #python

Introduction

Data Cleaning is a very important part of data analysis. Clean Data ensures that your analysis is correct and your insights are reliable when used in decision making. It also improves quality of your data

Objectives:

Ensure that all columns are clearly named
Ensure that columns have the correct datatypes
Remove all unnecessary information from the dataset Lets Get Started:

Import necessary python libraries

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

import the csv data set from the folder where its saved

df = pd.read_csv(r'...\fifa21 raw data v2.csv')

set your data frame such that it displays all columns since the dataset has many columns.

pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)

The benefit is that not only all columns are well displayed, but also the printed rows can be larger than
the usual ~100 characters limit.

Display the first five rows

df.head()

Check how many rows and columns make up your dafaframe

df.shape
print('The number of rows are:', df.shape[0])
print('The number of columns are:', df.shape[1])

Display a list of all column names

for x in df.columns.tolist():
   print(x)

Make a copy of your data set so as to retain an original copy

df1 = df.copy()

Remove unnecessary columns

df1 = df1.drop(['Name', 'photoUrl', 'playerUrl'], axis = 1)

Rename columns

df1 = df1.rename(columns = {
"LongName": "Name",
"↓OVA":"Overall Rating(%)",
"POT": "Potential(%)",
"BOV": "Best Overall(%)",
"BP": "Best Position",
"W/F": "Weak Foot",
"SM": "Skill Moves",
"A/W": "Attacking Work Rate",
"D/W": "Defensive Work Rate",
"IR": "International Reputation",
"PAC": "Pace",
"SHO": "Shooting",
"PAS": "Passing",
"DRI": "Dribbling",
"DEF": "Defense",
"PHY": "Physicality"
})

Remove whitespaces in the club column

#Remove whitespaces from the club column
df1['Club'] = df1['Club'].str.lstrip()

Clean contract column using a function

#define a function to change contract column values
def contract_status(value):
     if 'On Loan' in value:
       value = 'On Loan'
       return value
     elif '~' in value:
       value = 'Active'
       return value
     else:
       value = 'Free'
       return value

#apply the function on contract column
df1['Contract'] = df1['Contract'].apply(contract_status).astype('category')

Rename Contract column

#Rename the contract column
df1 = df1.rename(columns = {'Contract': 'Contract Status'})

Process of cleaning contract column

A function was defined to change the row values from '2004 ~ 2021', 'On Loan' and 'Free' to 'Active', 'On Loan' and 'Free' 'Contract' column was renamed to 'Contract Status
Also, the data type was changed to category.

Check Height and Weight Columns

#Check for unique values
for column in df[['Height', 'Weight']]:
     value = df1[column].unique()
     print(f'{column}\n{value}.\n')

Create Function to convert height to cm

#Function to convert height to cm
def convert_height(value):
   if 'cm' in value:
       value = int(value[:-2])
       return value
   else:
       feet, inches = value.split("'")
       total_inches = int(feet) * 12 + int(inches[:-1])
       height_cm = total_inches * 2.54
       return round(height_cm, 2)

df1['Height'] = df1['Height'].apply(convert_height).astype('int64')
df1['Height'].unique()

Create function to convert weight to kg

def convert_weight(value):
    if 'kg' in value:
       value = value.strip('kg')
       return value
    else:
       value = value.strip('lbs')
       Weight = round((float(value) * 0.45359237), 2)
       return Weight

df1['Weight'] = df1['Weight'].apply(convert_weight).astype('int64')
df1['Weight'].unique()

Process of Cleaning 'Height' and 'Weight' columns

Create a function to convert Height
values with cm remain the same except that we extract only value without 'CM'
value in feet and inches we convert them to CM
Lastly, convert Height column to int

Process of cleaning Weight column

create a function to convert LBS to kgs
If value is in "KGS" it remains the same
If value is in "LBS" convert them to KGS by multiplying by 0.4535...

Conclusion

In this project, we embarked on a comprehensive data cleaning and transformation journey with the FIFA dataset. Our primary objective was to prepare the data for analysis by addressing inconsistencies, refining data types, and enhancing its usability.

In conclusion, data cleaning and transformation are foundational steps in any data analysis project.By addressing inconsistencies, refining data types, and enhancing data quality, we have set the stage
for more meaningful and insightful analyses. The clean and structured dataset is now well-equipped for advanced analytics, visualizations, and modeling.

Follow me on Github and Linked In for a more details on the project

DEV Community