As part of my effort to complete the Machine Learning A-Z Udemy course, this series of posts, starting with this one, will contain the notes I gather from it.
Dependent vs independent variables
- Dependent – variable being tested and measured – predicted result
- Independent – variable being changed or controlled – features
Libraries used (Python):
- numpy – a library containing mathematical tools
- matplotlib.pyplot – a plotting library
- pandas – for importing datasets
- sklearn.preprocessing – a library for preprocessing data
Importing dataset with pandas:
import pandas as pd
dataset = pd.read_csv(FILE_NAME)
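Continuing from that snippet, a minimal sketch of separating the features from the dependent variable (the layout with the dependent variable in the last column is an assumption for illustration):
X = dataset.iloc[:, :-1].values  # independent variables (features): every column except the last
y = dataset.iloc[:, -1].values   # dependent variable: the last column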
Missing data
Option 1:
- remove rows with missing data
- dangerous because we might be losing valuable information
Option 2:
- set missing values to the mean of that feature
Library used:
sklearn.preprocessing.Imputer
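Note that sklearn.preprocessing.Imputer has been removed in newer scikit-learn releases in favour of sklearn.impute.SimpleImputer. A minimal sketch of filling missing values with the column mean, assuming the newer API (the toy array is made up for illustration):
import numpy as np
from sklearn.impute import SimpleImputer
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])  # toy data with missing values
imputer = SimpleImputer(strategy='mean')                  # replace NaN with the column mean
X = imputer.fit_transform(X)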
Categorical data
Labels need to be converted into numbers - Euclidean distance can’t be calculated on labels
Library:
sklearn.preprocessing.LabelEncoder
Problem with LabelEncoder: converting labels into numbers implies an ordering between them, but labels are not necessarily ordinal.
Solution: create one feature per label (one-hot encoding)
Library: sklearn.preprocessing.OneHotEncoder
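A minimal sketch of both encoders on a made-up country column (recent scikit-learn versions let OneHotEncoder handle string labels directly; older versions required a LabelEncoder pass first):
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
countries = np.array([['France'], ['Spain'], ['Germany'], ['France']])  # toy categorical column
labels = LabelEncoder().fit_transform(countries.ravel())                # France -> 0, Germany -> 1, Spain -> 2 (implies an order)
onehot = OneHotEncoder().fit_transform(countries).toarray()             # one binary column per label, no ordering implied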
Splitting data
To create a model, the data needs to be split into two sets: train and test. The train set is the one we use to build the model, and the test set is the one we use to evaluate that model's correctness.
Library: sklearn.model_selection.train_test_split
Usual ratio: 70-80% of the data for the train set
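A minimal sketch of the split (test_size=0.2 matches the 80/20 end of the ratio above; X and y are assumed to be the feature matrix and dependent variable from earlier):
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  # hold out 20% of rows for evaluation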
Feature scaling
A feature with large values can dominate features with smaller values. This is why all features should be scaled to the same range.
Option 1, standardization:
Each value is reduced by the mean and divided by the standard deviation.
Option 2, normalization:
Subtract the minimum value of x from each x. After that, divide by the difference between the maximum and minimum value of x.
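As a quick numpy sketch of both options (x is an arbitrary feature column, made up for illustration):
import numpy as np
x = np.array([1.0, 5.0, 10.0])
x_standardized = (x - x.mean()) / x.std()           # standardization
x_normalized = (x - x.min()) / (x.max() - x.min())  # normalization (min-max scaling)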
Library:
sklearn.preprocessing.StandardScaler
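A minimal usage sketch (X_train and X_test are assumed to come from the split above; the scaler is fit on the train set only, so the test set is scaled with the training mean and standard deviation):
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean and std from the train set, then scale it
X_test = scaler.transform(X_test)        # scale the test set with the same parameters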
For more, you can follow me on Twitter, LinkedIn, GitHub, or Instagram.