luthfisauqi17


How To Build a Machine Learning Model For Heart Failure Prediction From Scratch

Hi everyone! Today I will show you how to build a Machine Learning model for heart failure prediction from scratch. For this tutorial, we will use a dataset from kaggle.com called Heart Failure Prediction Dataset.

You can learn more about this dataset from the following link:
https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction

You can download the dataset to follow along with this tutorial.

Alright, open your Jupyter Notebook and let's get started!


Step 1: Data Loading

First of all, let's load the data from the file heart.csv using pandas, and check it with the df.head() function:

import pandas as pd

df = pd.read_csv('heart.csv')
df.head()

If you successfully load the data, the first five rows of data will be shown in your notebook.

Step 2: Data Inspection

Next, let's dig deeper into the data information using the function df.info().

df.info()

Run this code, and you will get more insight into the dataset. The following is the result:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB

From this output, we can see that the columns Sex, ChestPainType, RestingECG, ExerciseAngina, and ST_Slope have the object datatype. We have to convert them so that all columns have a numerical datatype.
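
If you would rather list the non-numeric columns programmatically than read them off the df.info() output, something like the following should work. This is a minimal sketch on a tiny made-up frame (the three rows are illustrative and are not taken from heart.csv); select_dtypes is a standard pandas method.

```python
import pandas as pd

# Tiny stand-in frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "ATA"],
    "Oldpeak": [0.0, 1.0, 0.0],
})

# Keep only the columns that are NOT numeric, i.e. the ones that still need encoding.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

Running the same line against the real df should give you the columns that need to be handled in the next step.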

Step 3: Data Cleaning

At this step, we will handle the columns with the object datatype. Keep in mind that the object dtype in pandas usually means the column holds Python strings. To confirm that, let's take the ChestPainType column and look at its unique values:

df['ChestPainType'].value_counts()

You will get this result:

               count
ChestPainType
ASY              496
NAP              203
ATA              173
TA                46

dtype: int64

Based on that result, all the values are indeed strings.

"Binary" column processor

Let's start with the columns that have only two unique values, in this case Sex and ExerciseAngina.

To process these, we can use the pandas map function:

# Binary: Sex, ExerciseAngina
df['Sex'] = df['Sex'].map({'F': 0, 'M': 1})
df['ExerciseAngina'] = df['ExerciseAngina'].map({'N': 0, 'Y': 1})
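
One thing to watch with map: any value that is missing from the dictionary becomes NaN without raising an error. A quick check after mapping can catch that. The sketch below uses made-up data, and the stray 'U' value is hypothetical (it is not claimed to occur in heart.csv):

```python
import pandas as pd

# Made-up values; 'U' is a hypothetical unexpected category.
sex = pd.Series(["M", "F", "U", "M"])

mapped = sex.map({"F": 0, "M": 1})

# map() leaves NaN wherever the value was not a key in the dictionary,
# so counting NaN tells us how many rows were not covered by the mapping.
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

If the count is zero for your real columns, the mapping covered every value.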

One-hot encoding

For the ChestPainType, RestingECG, and ST_Slope columns, which have more than two categories, we need a technique called one-hot encoding. It replaces a categorical column with one binary (0/1) column per category, so the model does not read a false ordering into arbitrary integer codes.

To do this, we can use the pandas function get_dummies() to generate the new columns, join them to the existing dataframe, and drop the original column:

# One-hot encoding: ChestPainType, RestingECG, ST_Slope
df = df.join(pd.get_dummies(df['ChestPainType'], prefix='ChestPainType', dtype=int)).drop(['ChestPainType'], axis=1)
df = df.join(pd.get_dummies(df['RestingECG'], prefix='RestingECG', dtype=int)).drop(['RestingECG'], axis=1)
df = df.join(pd.get_dummies(df['ST_Slope'], prefix='ST_Slope', dtype=int)).drop(['ST_Slope'], axis=1)
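
The same join-and-drop result can probably be had in a single call by passing columns= to get_dummies, which encodes the listed columns and removes the originals. This is an alternative sketch on made-up rows, not the code used in the rest of this tutorial:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Age": [40, 49, 37],
    "ChestPainType": ["ATA", "NAP", "ASY"],
    "ST_Slope": ["Up", "Flat", "Up"],
})

# Encode both columns at once; the original columns are dropped automatically
# and the new 0/1 columns are appended after the untouched ones.
encoded = pd.get_dummies(toy, columns=["ChestPainType", "ST_Slope"], dtype=int)
print(encoded.columns.tolist())
```

Each row has exactly one 1 among the ChestPainType_* columns and one among the ST_Slope_* columns.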

After that, let's look at our "cleaned" dataset:

df.info()

You can see that every column now has a numerical datatype and that several new columns have been added:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Age                918 non-null    int64  
 1   Sex                918 non-null    int64  
 2   RestingBP          918 non-null    int64  
 3   Cholesterol        918 non-null    int64  
 4   FastingBS          918 non-null    int64  
 5   MaxHR              918 non-null    int64  
 6   ExerciseAngina     918 non-null    int64  
 7   Oldpeak            918 non-null    float64
 8   HeartDisease       918 non-null    int64  
 9   ChestPainType_ASY  918 non-null    int64  
 10  ChestPainType_ATA  918 non-null    int64  
 11  ChestPainType_NAP  918 non-null    int64  
 12  ChestPainType_TA   918 non-null    int64  
 13  RestingECG_LVH     918 non-null    int64  
 14  RestingECG_Normal  918 non-null    int64  
 15  RestingECG_ST      918 non-null    int64  
 16  ST_Slope_Down      918 non-null    int64  
 17  ST_Slope_Flat      918 non-null    int64  
 18  ST_Slope_Up        918 non-null    int64  
dtypes: float64(1), int64(18)
memory usage: 136.4 KB

By the way, you can inspect the dataset visually by plotting a histogram of every column with the following code:

df.hist(figsize=(20, 15))


Step 4: Train Machine Learning Model

Finally, let's create the machine learning model!

Define Feature & Target Data

First of all, we have to separate the "feature data" (every column except HeartDisease) from the "target data" (the HeartDisease column):

X, y = df.drop(['HeartDisease'], axis=1), df['HeartDisease']

Train Test Split

Then, each feature & target data needs to be split into a "train" dataset and a "test" dataset. The training dataset will be used to train the model, and the test dataset will be used to evaluate the performance of the trained model.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
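
With test_size=0.2 on 918 rows, this split gives 734 training rows and 184 test rows. If you also want both sets to keep the same proportion of positive cases, train_test_split accepts a stratify argument. Here is a sketch with synthetic labels (the 505/413 class balance is an assumption for illustration, not the dataset's actual ratio):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 918 rows, one dummy feature, 505 positive and 413 negative labels.
n = 918
X_demo = np.arange(n).reshape(-1, 1)
y_demo = np.array([1] * 505 + [0] * 413)

# stratify=y_demo keeps the share of positives nearly equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

The tutorial's own split does not use stratify, so adding it would change which rows land in the test set and therefore the scores reported below.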

Train Model

Next, we will train the machine learning model using the training data. Since the objective of this model is to classify whether or not the patient has heart failure, this is a classification problem. There are many algorithms for classification problems, and two of them are "Logistic Regression" and "Random Forest Classifier". We will implement both and compare their performance!

Logistic Regression

Let's start with logistic regression. To train this model, you can use LogisticRegression from sklearn.linear_model:

from sklearn.linear_model import LogisticRegression

log_model = LogisticRegression()
log_model.fit(X_train, y_train)
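
LogisticRegression with default settings may emit a convergence warning when the features sit on very different scales (here, for example, Cholesterol versus Oldpeak). One common remedy is to standardize the features inside a pipeline. The sketch below uses synthetic data from make_classification as a stand-in for X_train and y_train, and the name scaled_log_model is only illustrative, so the printed score says nothing about the heart dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the heart dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Scale each feature to zero mean and unit variance, then fit logistic regression.
# The scaler is fitted on the training rows only, which avoids leaking test data.
scaled_log_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_model.fit(X_tr, y_tr)

accuracy = scaled_log_model.score(X_te, y_te)
print(f"Accuracy on synthetic data: {accuracy:.3f}")
```

The pipeline behaves like a single model, so it can be dropped in wherever log_model is used.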

Random Forest Classifier

Next, let's look at the random forest classifier implementation. We will use RandomForestClassifier from sklearn.ensemble:

from sklearn.ensemble import RandomForestClassifier

rfc_model = RandomForestClassifier()
rfc_model.fit(X_train, y_train)
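
A random forest is randomized, so two runs of the code above can give slightly different models. Passing random_state makes a run repeatable. Here is a sketch on synthetic data (make_classification is a stand-in for the heart dataset, and the seed value 0 is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (NOT the heart dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Two forests built with the same seed on the same data.
rfc_a = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)
rfc_b = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)

# With identical seeds, both forests make identical predictions.
same_predictions = bool((rfc_a.predict(X_demo) == rfc_b.predict(X_demo)).all())
print(same_predictions)

# Relative importance of each feature according to the fitted forest.
importances = rfc_a.feature_importances_
print(importances.round(3))
```

On the real data, feature_importances_ lines up with X_train.columns, which gives a rough idea of which columns the forest relies on most.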

Models Comparison

Finally, let's see how those models perform on the test data:

print(f"Logistic Regression Score: {log_model.score(X_test, y_test)}")
print(f"Random Forest Classifier Score: {rfc_model.score(X_test, y_test)}")

After I run the code above, I get the following output:

Logistic Regression Score: 0.8369565217391305
Random Forest Classifier Score: 0.8532608695652174

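For a classifier, score() returns the accuracy on the given data. From the result, it can be seen that the Random Forest Classifier scored slightly better, around 85% compared with about 84% for Logistic Regression.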
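
A single test set of 184 rows gives a fairly noisy estimate, so a gap of under two percentage points may not be meaningful. Cross-validation averages the score over several splits and is one way to make the comparison more robust. Here is a sketch with cross_val_score on synthetic data (on the real data you would pass X and y instead; the numbers printed here are not results for the heart dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (NOT the heart dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest Classifier": RandomForestClassifier(random_state=0),
}

# 5-fold cross-validation: each model is trained and scored five times,
# each time holding out a different fifth of the data.
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the difference between the two means is small relative to the standard deviations, the models are probably performing about equally well.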


Full code: https://github.com/luthfisauqi17/machine-learning-predictions/blob/main/heart_failure_prediction.ipynb


There you go, that is how you can make a machine learning model to predict heart failure. You can tweak the code, and let me know if you find a way to build a model with a better score!

Thanks for reading this article, and have a nice day!
