DEV Community

Cover image for From Messy to Masterpiece: Cleaning FIFA Data with Python ⚽🚀
allan-pg
allan-pg

Posted on

From Messy to Masterpiece: Cleaning FIFA Data with Python ⚽🚀

Introduction

Data Cleaning is a very important part of data analysis. Clean Data ensures that your analysis is correct and your insights are reliable when used in decision making. It also improves quality of your data

Objectives:

  • Ensure that all columns are clearly named
  • Ensure that columns have the correct datatypes
  • Remove all unnecessary information from the dataset Lets Get Started:

Import necessary python libraries

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
Enter fullscreen mode Exit fullscreen mode

import the csv data set from the folder where its saved

df = pd.read_csv(r'...\fifa21 raw data v2.csv')
Enter fullscreen mode Exit fullscreen mode

set your data frame such that it displays all columns since the dataset has many columns.

pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)
Enter fullscreen mode Exit fullscreen mode

The benefit is that not only all columns are well displayed, but also the printed rows can be larger than
the usual ~100 characters limit.

Display the first five rows

df.head()
Enter fullscreen mode Exit fullscreen mode

Check how many rows and columns make up your dafaframe

df.shape
print('The number of rows are:', df.shape[0])
print('The number of columns are:', df.shape[1])
Enter fullscreen mode Exit fullscreen mode

Display a list of all column names

for x in df.columns.tolist():
   print(x)
Enter fullscreen mode Exit fullscreen mode

Make a copy of your data set so as to retain an original copy

df1 = df.copy()
Enter fullscreen mode Exit fullscreen mode

Remove unnecessary columns

df1 = df1.drop(['Name', 'photoUrl', 'playerUrl'], axis = 1)
Enter fullscreen mode Exit fullscreen mode

Rename columns

df1 = df1.rename(columns = {
"LongName": "Name",
"↓OVA":"Overall Rating(%)",
"POT": "Potential(%)",
"BOV": "Best Overall(%)",
"BP": "Best Position",
"W/F": "Weak Foot",
"SM": "Skill Moves",
"A/W": "Attacking Work Rate",
"D/W": "Defensive Work Rate",
"IR": "International Reputation",
"PAC": "Pace",
"SHO": "Shooting",
"PAS": "Passing",
"DRI": "Dribbling",
"DEF": "Defense",
"PHY": "Physicality"
})
Enter fullscreen mode Exit fullscreen mode

Remove whitespaces in the club column

#Remove whitespaces from the club column
df1['Club'] = df1['Club'].str.lstrip()
Enter fullscreen mode Exit fullscreen mode

Clean contract column using a function

#define a function to change contract column values
def contract_status(value):
     if 'On Loan' in value:
       value = 'On Loan'
       return value
     elif '~' in value:
       value = 'Active'
       return value
     else:
       value = 'Free'
       return value

#apply the function on contract column
df1['Contract'] = df1['Contract'].apply(contract_status).astype('category')

Enter fullscreen mode Exit fullscreen mode

Rename Contract column

#Rename the contract column
df1 = df1.rename(columns = {'Contract': 'Contract Status'})
Enter fullscreen mode Exit fullscreen mode

Process of cleaning contract column

  • A function was defined to change the row values from '2004 ~ 2021', 'On Loan' and 'Free' to 'Active', 'On Loan' and 'Free' 'Contract' column was renamed to 'Contract Status
  • Also, the data type was changed to category.

Check Height and Weight Columns

#Check for unique values
for column in df[['Height', 'Weight']]:
     value = df1[column].unique()
     print(f'{column}\n{value}.\n')
Enter fullscreen mode Exit fullscreen mode

Create Function to convert height to cm

#Function to convert height to cm
def convert_height(value):
   if 'cm' in value:
       value = int(value[:-2])
       return value
   else:
       feet, inches = value.split("'")
       total_inches = int(feet) * 12 + int(inches[:-1])
       height_cm = total_inches * 2.54
       return round(height_cm, 2)

df1['Height'] = df1['Height'].apply(convert_height).astype('int64')
df1['Height'].unique()
Enter fullscreen mode Exit fullscreen mode

Create function to convert weight to kg

def convert_weight(value):
    if 'kg' in value:
       value = value.strip('kg')
       return value
    else:
       value = value.strip('lbs')
       Weight = round((float(value) * 0.45359237), 2)
       return Weight

df1['Weight'] = df1['Weight'].apply(convert_weight).astype('int64')
df1['Weight'].unique()
Enter fullscreen mode Exit fullscreen mode

Process of Cleaning 'Height' and 'Weight' columns

  1. Create a function to convert Height
  2. values with cm remain the same except that we extract only value without 'CM'
  3. value in feet and inches we convert them to CM
  4. Lastly, convert Height column to int

Process of cleaning Weight column

  1. create a function to convert LBS to kgs
  2. If value is in "KGS" it remains the same
  3. If value is in "LBS" convert them to KGS by multiplying by 0.4535...

Conclusion

In this project, we embarked on a comprehensive data cleaning and transformation journey with the FIFA dataset. Our primary objective was to prepare the data for analysis by addressing inconsistencies, refining data types, and enhancing its usability.

In conclusion, data cleaning and transformation are foundational steps in any data analysis project.By addressing inconsistencies, refining data types, and enhancing data quality, we have set the stage
for more meaningful and insightful analyses. The clean and structured dataset is now well-equipped for advanced analytics, visualizations, and modeling.

Follow me on Github and Linked In for a more details on the project

Top comments (0)