Hi everyone! Today I will show you how to build a machine learning model for heart failure prediction from scratch. For this tutorial, we will use a dataset from kaggle.com called the Heart Failure Prediction Dataset.
You can learn more about this dataset from the following link:
https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction
You can download the dataset to follow along with this tutorial.
Alright, open your Jupyter Notebook and let's get started!
Step 1: Data Loading
First of all, let's load the data from the file heart.csv using pandas, and check the data using the df.head() function:
import pandas as pd
df = pd.read_csv('heart.csv')
df.head()
If the data loads successfully, the first five rows will be shown in your notebook.
Step 2: Data Inspection
Next, let's dig deeper into the dataset using the df.info() function.
df.info()
Run this code, and you will get more insight into the dataset. The following is the result:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Age             918 non-null    int64
 1   Sex             918 non-null    object
 2   ChestPainType   918 non-null    object
 3   RestingBP       918 non-null    int64
 4   Cholesterol     918 non-null    int64
 5   FastingBS       918 non-null    int64
 6   RestingECG      918 non-null    object
 7   MaxHR           918 non-null    int64
 8   ExerciseAngina  918 non-null    object
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object
 11  HeartDisease    918 non-null    int64
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB
From this information, we can see that the columns Sex, ChestPainType, RestingECG, ExerciseAngina, and ST_Slope have the datatype object, and we have to handle them so that all columns end up with a numerical datatype.
Step 3: Data Cleaning
At this step, we will handle the columns with the object datatype. Keep in mind that object here usually means a Python string. To validate that, let's take the ChestPainType column and see all its unique values:
df['ChestPainType'].value_counts()
And you will get this result:
ChestPainType
ASY    496
NAP    203
ATA    173
TA      46
Name: count, dtype: int64
Based on that result, all the values are indeed strings.
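If you want to check the other object columns the same way, a quick loop works (a minimal sketch; select_dtypes picks out the columns that still hold strings):
# Inspect the unique values of every 'object' column
for col in df.select_dtypes(include='object').columns:
    print(df[col].value_counts(), end='\n\n')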
"Binary" column processor
Let's start processing from the column that has the binary amount of unique value, in this case, the column Sex
and ExerciseAngina
.
To process this, we can use pandas
map function
# Binary: Sex, ExerciseAngina
df['Sex'] = df['Sex'].map({'F': 0, 'M': 1})
df['ExerciseAngina'] = df['ExerciseAngina'].map({'N': 0, 'Y': 1})
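One thing to watch out for: map() returns NaN for any value that is not in the mapping dictionary, so it is worth verifying that nothing was silently dropped (an optional sanity check):
# Verify the mapping covered every value (both sums should be 0)
print(df[['Sex', 'ExerciseAngina']].isna().sum())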
One-hot encoding
For the columns ChestPainType, RestingECG, and ST_Slope, a technique called one-hot encoding is required. This technique transforms a categorical variable into a set of binary columns, one per category, so that the model can train on purely numerical input.
To process this, we can use the pandas function get_dummies() to generate the new columns, join them into the existing dataframe, and drop the original column:
# One-hot encoding: ChestPainType, RestingECG, ST_Slope
df = df.join(pd.get_dummies(df['ChestPainType'], prefix='ChestPainType', dtype=int)).drop(['ChestPainType'], axis=1)
df = df.join(pd.get_dummies(df['RestingECG'], prefix='RestingECG', dtype=int)).drop(['RestingECG'], axis=1)
df = df.join(pd.get_dummies(df['ST_Slope'], prefix='ST_Slope', dtype=int)).drop(['ST_Slope'], axis=1)
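As a side note, pd.get_dummies() can also take the whole dataframe and a columns argument, which encodes the listed columns and drops the originals in one call. This is an equivalent alternative to the three lines above:
# Alternative: encode all three columns at once (run instead of the three lines above)
df = pd.get_dummies(df, columns=['ChestPainType', 'RestingECG', 'ST_Slope'], dtype=int)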
After that, let's take a look at our "cleaned" dataset:
df.info()
You will see that all columns now have a numerical datatype, and that some new columns have been added.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Age                918 non-null    int64
 1   Sex                918 non-null    int64
 2   RestingBP          918 non-null    int64
 3   Cholesterol        918 non-null    int64
 4   FastingBS          918 non-null    int64
 5   MaxHR              918 non-null    int64
 6   ExerciseAngina     918 non-null    int64
 7   Oldpeak            918 non-null    float64
 8   HeartDisease       918 non-null    int64
 9   ChestPainType_ASY  918 non-null    int64
 10  ChestPainType_ATA  918 non-null    int64
 11  ChestPainType_NAP  918 non-null    int64
 12  ChestPainType_TA   918 non-null    int64
 13  RestingECG_LVH     918 non-null    int64
 14  RestingECG_Normal  918 non-null    int64
 15  RestingECG_ST      918 non-null    int64
 16  ST_Slope_Down      918 non-null    int64
 17  ST_Slope_Flat      918 non-null    int64
 18  ST_Slope_Up        918 non-null    int64
dtypes: float64(1), int64(18)
memory usage: 136.4 KB
By the way, you can inspect the dataset visually using the following code:
import matplotlib.pyplot as plt

# Plot a histogram for every numerical column
df.hist(figsize=(20, 15))
plt.show()
Step 4: Train Machine Learning Model
Finally, let's create the machine learning model!
Define Feature & Target Data
First of all, we have to separate the "feature data" and the "target data"
X, y = df.drop(['HeartDisease'], axis=1), df['HeartDisease']
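As a quick optional sanity check, you can confirm the shapes: X should hold the 18 feature columns and y the 918 labels.
# Quick sanity check on the feature/target split
print(X.shape, y.shape)  # expected: (918, 18) (918,)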
Train Test Split
Then, the feature and target data need to be split into a "train" set and a "test" set. The training set will be used to train the model, and the test set will be used to evaluate the performance of the trained model.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
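Optionally, you can also pass stratify=y so that the train and test sets keep the same proportion of positive and negative cases. This is a minor variation, not something the results below depend on:
# Optional: preserve the class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)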
Train Model
Next, we will train the machine learning model using the training data we have. Since the objective of this model is to classify whether a patient has heart failure or not, this is called a classification problem. There are several machine learning algorithms for classification problems, and two of them are "Logistic Regression" and "Random Forest Classifier". We will implement both algorithms and see how each one performs!
Logistic Regression
Let's start with logistic regression. To train this model you can use LogisticRegression from sklearn.linear_model:
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()
log_model.fit(X_train, y_train)
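Note: with default settings, LogisticRegression may print a convergence warning on this dataset, because the features are on very different scales. A common fix (a minimal sketch, optional for the rest of this tutorial) is to standardize the features in a pipeline:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale features to zero mean / unit variance, then fit the classifier
log_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
log_model.fit(X_train, y_train)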
Random Forest Classifier
Next, let's see the random forest classifier implementation. We will use RandomForestClassifier from sklearn.ensemble:
from sklearn.ensemble import RandomForestClassifier
rfc_model = RandomForestClassifier()
rfc_model.fit(X_train, y_train)
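Keep in mind that a random forest is a randomized algorithm, so your score may differ slightly from mine on each run. If you want reproducible results, you can fix the seed (optional):
# Optional: fix the random seed so results are reproducible
rfc_model = RandomForestClassifier(random_state=0)
rfc_model.fit(X_train, y_train)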
Models Comparison
Finally, let's see how those models perform
print(f"Logistic Regression Score: {log_model.score(X_test, y_test)}")
print(f"Random Forest Classifier Score: {rfc_model.score(X_test, y_test)}")
After I run the code above, I get the following output:
Logistic Regression Score: 0.8369565217391305
Random Forest Classifier Score: 0.8532608695652174
From the result, it can be seen that the Random Forest Classifier scored better, at around 85% accuracy.
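Accuracy is a good start, but for a medical prediction task you usually also want to look at precision and recall. As an optional follow-up (a minimal sketch), scikit-learn's classification_report breaks the score down per class:
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 for the random forest model
y_pred = rfc_model.predict(X_test)
print(classification_report(y_test, y_pred))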
There you go, that is how you can build a machine learning model to predict heart failure. You can tweak the code, and let me know if you find a way to build a model with a better score!
Thanks for reading this article, and have a nice day!