Scaling and Normalizing Data for Machine Learning Models 🐍🤖

In machine learning, scaling and normalizing your data are crucial preprocessing steps before feeding it into a model. Proper scaling ensures that each feature contributes equally to the result, while normalization often improves the performance of the algorithm. In this post, we'll explore these concepts in detail, focusing on the methods provided by the scikit-learn library, with code snippets and formulas for clarity.

Why Scale and Normalize ❓

  1. Improves Model Performance: Many machine learning algorithms perform better when features are on a similar scale. Distance-based algorithms like SVM and KNN are especially sensitive to feature scales, as the sketch after this list shows.
  2. Faster Convergence: Gradient descent converges faster with scaled features.
  3. Reduces Bias: Unscaled features can bias the model towards features with a larger range.
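
Here is a minimal sketch of the first point. The wine dataset and the exact model choices are illustrative assumptions on my part, not part of any benchmark, but the pattern is typical: KNN classifies the same data noticeably better once features are standardized, because raw distances are otherwise dominated by the large-range columns.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Wine features have very different ranges (e.g. proline vs. hue)
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# KNN on raw features: distances are dominated by large-range columns
knn_raw = KNeighborsClassifier().fit(X_train, y_train)
print("Unscaled accuracy:", knn_raw.score(X_test, y_test))

# KNN on standardized features (scaler fit on the training split only)
scaler = StandardScaler().fit(X_train)
knn_std = KNeighborsClassifier().fit(scaler.transform(X_train), y_train)
print("Scaled accuracy:", knn_std.score(scaler.transform(X_test), y_test))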

Scaling Techniques

Standardization (Z-score Normalization)

Standardization scales the data to have a mean of zero and a standard deviation of one.

The formula is: z = (x - μ) / σ

Where:

  • x is the original value
  • μ is the mean of the feature
  • σ is the standard deviation of the feature

Code Example

from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Standardizing the data
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)

print("Standardized Data:\n", standardized_data)

Output:

Standardized Data:
 [[-1.34164079 -1.34164079]
 [-0.4472136  -0.4472136 ]
 [ 0.4472136   0.4472136 ]
 [ 1.34164079  1.34164079]]
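
As a sanity check, applying z = (x - μ) / σ by hand with NumPy reproduces the same values. A small sketch, continuing with the data array from the snippet above:

# Manual standardization matches StandardScaler (which uses the
# population standard deviation, i.e. ddof=0, like np.std's default)
manual = (data - data.mean(axis=0)) / data.std(axis=0)
print(np.allclose(manual, standardized_data))  # True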

Min-Max Scaling (Normalization)

Min-Max scaling scales the data to a fixed range, usually [0, 1].

The formula is: x' = (x - x_min) / (x_max - x_min)

Where:

  • x is the original value
  • x_min is the minimum value of the feature
  • x_max is the maximum value of the feature

Code Example


from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Normalizing the data
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)

print("Normalized Data:\n", normalized_data)

Output:

Normalized Data:
 [[0.         0.        ]
 [0.33333333 0.33333333]
 [0.66666667 0.66666667]
 [1.         1.        ]]
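
If you need a range other than [0, 1], MinMaxScaler accepts a feature_range argument. A quick sketch, continuing with the same data:

# Scale to [-1, 1] instead of the default [0, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
print(scaler.fit_transform(data))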

Normalization Techniques

L2 Normalization

L2 normalization scales each data point such that the Euclidean norm (L2 norm) of its feature vector is 1. Unlike the scalers above, which operate column-wise on each feature, Normalizer works row-wise on each sample, so it is stateless: there is nothing to learn from the training data.

The formula is: x' = x / ||x||_2

Where ||x||_2 is the L2 norm of the feature vector.

Code Example


from sklearn.preprocessing import Normalizer
import numpy as np

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Normalizing the data using L2 norm
normalizer = Normalizer(norm='l2')
l2_normalized_data = normalizer.fit_transform(data)

print("L2 Normalized Data:\n", l2_normalized_data)

Output:

L2 Normalized Data:
 [[0.4472136  0.89442719]
 [0.6        0.8       ]
 [0.6401844  0.76822128]
 [0.65850461 0.75257669]]
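
You can confirm that each row now has unit length (continuing from the snippet above):

# Each row's Euclidean norm should be exactly 1
print(np.linalg.norm(l2_normalized_data, axis=1))  # [1. 1. 1. 1.]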

L1 Normalization

L1 normalization scales each data point such that the Manhattan norm (L1 norm) of the feature vector is 1.

The formula is: x' = x / ||x||_1

Where ||x||_1 is the L1 norm of the feature vector.

Code Example


from sklearn.preprocessing import Normalizer
import numpy as np

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Normalizing the data using L1 norm
normalizer = Normalizer(norm='l1')
l1_normalized_data = normalizer.fit_transform(data)

print("L1 Normalized Data:\n", l1_normalized_data)

Output:

L1 Normalized Data:
 [[0.33333333 0.66666667]
 [0.42857143 0.57142857]
 [0.45454545 0.54545455]
 [0.46666667 0.53333333]]
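
Similarly, with L1 normalization the absolute values in each row sum to 1 (continuing from the snippet above):

# Each row's L1 norm (sum of absolute values) should be exactly 1
print(np.abs(l1_normalized_data).sum(axis=1))  # [1. 1. 1. 1.]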

Full Example: Min-Max Scaling with Logistic Regression

from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the iris dataset
data = load_iris()
X, y = data.data, data.target

# Normalize the features
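# Note: for simplicity this example fits the scaler on the full dataset
# before splitting, which leaks test-set information into training.
# The Pipeline sketch after this example shows the leakage-free pattern.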
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.3, random_state=42)

# Fit a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate the model
score = model.score(X_test, y_test)
print(f"Model Accuracy: {score:.2f}")

Output: Model Accuracy: 0.91
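
In practice, the scaler is usually fit on the training split only, typically inside a Pipeline so that the transform parameters never see the test data. A minimal sketch of the same workflow, continuing from the example above (same data and model; the pipeline structure is the only change):

from sklearn.pipeline import make_pipeline

# Split the raw features first, then let the pipeline fit the scaler
# on the training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

pipe = make_pipeline(MinMaxScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
print(f"Pipeline Accuracy: {pipe.score(X_test, y_test):.2f}")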

Conclusion

Scaling and normalizing your data are fundamental steps in preparing it for machine learning models. scikit-learn provides convenient and efficient tools for both scaling and normalization. Here's a quick summary of the methods discussed:

  • Standardization: Adjusts the data to have a mean of 0 and a standard deviation of 1.
  • Min-Max Scaling: Scales the data to a fixed range, usually [0, 1].
  • L2 Normalization: Scales the data so that the L2 norm of each row is 1.
  • L1 Normalization: Scales the data so that the L1 norm of each row is 1.

By correctly applying these techniques, you can improve the performance and convergence of your machine learning models.


About Me:
🖇️LinkedIn
🧑‍💻GitHub
