DEV Community

Harsh Mishra


Linear Regression, Regression: Supervised Machine Learning

What is Regression?

Definition and Purpose

Regression is a statistical method used in machine learning and data science to understand relationships between variables. It involves modeling the relationship between a dependent variable (target) and one or more independent variables (predictors). The main purpose of regression is to predict or estimate the value of the dependent variable based on the values of the independent variables.

Key Objectives:

  • Prediction: Forecasting future values based on historical data.
  • Estimation: Determining the strength and form of the relationship between variables.
  • Understanding Relationships: Identifying which independent variables are significant predictors of the dependent variable.

Types of Regression

1. Linear Regression

  • Simple Linear Regression: Models the relationship between two variables by fitting a linear equation to observed data.

    • Equation: y = mx + b
    • Purpose: Predicts the dependent variable y based on the independent variable x.
  • Multiple Linear Regression: Extends simple linear regression to include multiple independent variables.

    • Equation: y = b0 + b1x1 + b2x2 + ... + bnxn
    • Purpose: Predicts the dependent variable y based on several independent variables x1, x2, ..., xn.

2. Polynomial Regression

  • Description: Models the relationship between the dependent and independent variables as an nth degree polynomial.
    • Equation: y = b0 + b1x + b2x^2 + ... + bnx^n
    • Purpose: Captures the non-linear relationship between variables.

Ordinary Least Squares (OLS) Method

OLS is a method for estimating the unknown parameters in a linear regression model. It minimizes the sum of the squared differences between observed and predicted values.

Equation: The linear model for OLS can be represented as:
y = w0 + w1x1 + w2x2 + ... + wnxn

where:

  • y is the dependent variable
  • x1, x2, ..., xn are the independent variables
  • w0, w1, w2, ..., wn are the coefficients (parameters) to be estimated

Objective: Minimize the cost function:
Cost(OLS) = Σ(yi - ŷi)^2

where:

  • yi is the actual value
  • ŷi is the predicted value
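As a minimal sketch of the objective above, the OLS coefficients can be computed directly with NumPy's least-squares solver. The toy data and the names `X_design`, `w`, and `cost` are assumptions made for illustration and do not come from the article.

```python
import numpy as np

# Toy data (assumed for illustration): y = 1 + 2*x1 + 3*x2 exactly, no noise
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so that w[0] plays the role of the intercept w0
X_design = np.column_stack([np.ones(len(X)), X])

# Find the w that minimizes the sum of squared residuals ||X_design @ w - y||^2
w, *_ = np.linalg.lstsq(X_design, y, rcond=None)

y_hat = X_design @ w
cost = np.sum((y - y_hat) ** 2)  # Cost(OLS) = sum of (yi - y_hat_i)^2

print("coefficients [w0, w1, w2]:", np.round(w, 4))
print("sum of squared residuals:", round(float(cost), 8))
```

Because the toy target is an exact linear function of the inputs, the recovered coefficients are 1, 2, and 3 and the cost is zero up to floating-point error.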

Cost Function and Loss Minimization in Linear Regression

Cost Function

The cost function in linear regression quantifies the error between the predicted values and the actual values of the dependent variable. It measures how well the model's predictions align with the actual data. The most commonly used cost function in linear regression is the Mean Squared Error (MSE), but there are other cost functions that can also be applied.

1. Mean Squared Error (MSE):
The MSE is the average of the squared differences between the actual and predicted values. It is defined as:

MSE = (1/n) * Σ (yi - ŷi)^2

where:

  • n is the number of data points,
  • yi is the actual value,
  • ŷi is the predicted value.

The MSE penalizes larger errors more significantly due to the squaring of the differences, making it sensitive to outliers. The goal of linear regression is to find the model parameters (coefficients) that minimize this cost function.

2. Root Mean Squared Error (RMSE):
The RMSE is the square root of the MSE, providing an error metric in the same units as the dependent variable. It is defined as:

RMSE = √(MSE)

This metric is also sensitive to outliers and is commonly used for model evaluation.

3. Mean Absolute Error (MAE):
The MAE measures the average magnitude of the errors in a set of predictions, without considering their direction (i.e., whether the predictions are above or below the actual values). It is defined as:

MAE = (1/n) * Σ |yi - ŷi|

The MAE is less sensitive to outliers compared to MSE and RMSE, making it a robust alternative for certain datasets.
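The three formulas can be written out by hand to see how differently they react to a single large error. This is a small sketch on made-up numbers; the helper names `mse`, `rmse`, and `mae` are illustrative.

```python
import numpy as np

y_true = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y_pred = np.array([11.0, 11.0, 15.0, 15.0, 19.0])  # every error has magnitude 1

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def rmse(y, y_hat):
    return np.sqrt(mse(y, y_hat))

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

print(mse(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred))
# -> 1.0 1.0 1.0

# Replace the last prediction with a large miss (error of 10) to mimic an outlier
y_outlier = y_pred.copy()
y_outlier[-1] = 28.0

print(round(mse(y_true, y_outlier), 4),
      round(rmse(y_true, y_outlier), 4),
      round(mae(y_true, y_outlier), 4))
# -> 20.8 4.5607 2.8
```

One bad prediction multiplies the MSE by about 21 but the MAE by only 2.8, which is the outlier sensitivity described above.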

Loss Minimization (Optimization)

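Loss minimization involves finding the values of the model parameters that result in the lowest possible cost function value. This process is also known as optimization. For OLS the minimum can be computed directly in closed form (the normal equations), which is practical for small and medium-sized problems; the Gradient Descent algorithm is a common iterative alternative, particularly when the dataset or the number of features is large.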

Gradient Descent

Gradient Descent is an iterative optimization algorithm used to minimize the cost function. It adjusts the model parameters in the direction of the steepest descent of the cost function.

Steps of Gradient Descent:

  1. Initialize Parameters: Start with initial values for the model parameters (e.g., coefficients b0, b1, ..., bn).

  2. Calculate Gradient: Compute the gradient of the cost function with respect to each parameter. The gradient is the partial derivative of the cost function.

  3. Update Parameters: Adjust the parameters in the opposite direction of the gradient. The adjustment is controlled by the learning rate (α), which determines the size of the steps taken towards the minimum.

  4. Repeat: Iterate the process until the cost function converges to a minimum value (or a pre-defined number of iterations is reached).

Parameter Update Rule:
For each parameter bj:
bj = bj - α * (∂/∂bj) MSE

where:

  • α is the learning rate
  • (∂/∂bj) MSE is the partial derivative of the MSE with respect to bj

The partial derivative of the MSE with respect to bj is calculated as:
(∂/∂bj) MSE = -(2/n) * Σ (yi - ŷi) * xij

where:

  • xij is the value of the jth independent variable for the ith data point
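The update rule above can be sketched in a few lines of NumPy. The synthetic data, learning rate, and iteration count are assumptions chosen so that the loop converges; the intercept b0 is handled by adding a column of ones, so that xi0 = 1 for every data point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): y = 4 + 3*x plus a little noise
x = rng.uniform(-1.0, 1.0, size=100)
y = 4.0 + 3.0 * x + rng.normal(0.0, 0.1, size=100)

# Leading column of ones, so b[0] is the intercept and b[1] the slope
X = np.column_stack([np.ones_like(x), x])
n = len(y)

b = np.zeros(2)   # Step 1: initialize parameters
alpha = 0.1       # learning rate
n_iters = 2000    # Step 4: fixed number of iterations

for _ in range(n_iters):
    y_hat = X @ b
    grad = -(2.0 / n) * (X.T @ (y - y_hat))  # Step 2: gradient of the MSE
    b = b - alpha * grad                     # Step 3: move against the gradient

# Reference solution from a direct least-squares solve, for comparison
b_exact, *_ = np.linalg.lstsq(X, y, rcond=None)

print("gradient descent:", np.round(b, 4))
print("least squares:   ", np.round(b_exact, 4))
```

With features on a small, comparable scale the two results agree closely. On unscaled features (for example raw square footage) the same learning rate would likely diverge, which is why inputs are usually standardized before running gradient descent.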

Overfitting vs. Underfitting

Overfitting

  • Definition: Overfitting occurs when a model learns the training data too well, capturing noise and outliers rather than the underlying pattern. As a result, the model performs exceptionally well on training data but poorly on unseen or validation data.
  • Characteristics:
    • High accuracy on training data.
    • Poor generalization to new data.
    • Complexity of the model is too high (e.g., too many parameters or a very flexible model).
  • Causes:
    • Too many features relative to the number of observations.
    • Excessive model complexity (e.g., high-degree polynomial regression).
    • Insufficient training data.
  • Solutions:
    • Use simpler models (reduce complexity).
    • Employ regularization techniques (e.g., Lasso, Ridge).
    • Use cross-validation to tune hyperparameters.
    • Increase training data if possible.

Underfitting

  • Definition: Underfitting occurs when a model is too simple to capture the underlying trend in the data. This leads to poor performance on both training and validation datasets.
  • Characteristics:
    • Low accuracy on both training and validation data.
    • The model fails to learn the relationships in the data.
  • Causes:
    • Insufficient model complexity (e.g., linear model for a non-linear relationship).
    • Too few features used in the model.
    • Poor feature selection or engineering.
  • Solutions:
    • Increase model complexity (e.g., use a higher-degree polynomial or more sophisticated algorithms).
    • Add relevant features or perform feature engineering.
    • Remove overly simplistic assumptions in the model.
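The two failure modes can be illustrated by fitting polynomials of different degree to the same noisy data and comparing training and test error. This is a sketch on synthetic data; the sine-shaped target, sample sizes, and degrees are assumptions picked for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def make_data(n):
    # Assumed non-linear ground truth with additive Gaussian noise
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=n)
    return x.reshape(-1, 1), y

X_train, y_train = make_data(30)
X_test, y_test = make_data(200)

results = {}
for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The straight line (degree 1) has high error on both sets, which is underfitting. The degree-15 model typically reaches the lowest training error while its test error is noticeably worse than its training error, which is the overfitting pattern described above.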

Bias-Variance Trade-Off

The bias-variance trade-off is a fundamental concept in machine learning that describes the trade-off between two sources of error that affect the performance of predictive models: bias and variance.

Bias

  • Definition: Bias refers to the error due to overly simplistic assumptions in the learning algorithm. It represents the model's inability to capture the underlying patterns of the data.
  • Characteristics:
    • High bias can lead to underfitting, where the model is too simple to capture the complexity of the data.
    • Models with high bias tend to have consistent errors across different datasets.
  • Examples: Linear regression on non-linear data.

Variance

  • Definition: Variance refers to the error due to excessive sensitivity to fluctuations in the training data. It captures how much the model's predictions would vary if it were trained on different datasets.
  • Characteristics:
    • High variance can lead to overfitting, where the model learns noise and outliers in the training data instead of the underlying distribution.
    • Models with high variance perform well on training data but poorly on unseen data.
  • Examples: High-degree polynomial regression on a small dataset.

The Trade-Off

  • Balancing Act: The challenge in machine learning is to find a model that minimizes both bias and variance. A model with low bias and low variance is ideal but often hard to achieve.
  • Effect of Complexity:
    • As model complexity increases, bias decreases and variance increases.
    • Conversely, as model complexity decreases, bias increases and variance decreases.

The goal is to achieve a balance where the total expected error (squared bias plus variance plus the irreducible error due to noise in the data) is minimized. This often involves techniques such as cross-validation, regularization, and careful feature selection to tune the model appropriately for the given dataset.
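Bias and variance can be estimated numerically by refitting the same model on many training sets and looking at how its predictions behave on average. The sketch below assumes a known ground-truth function and fixed inputs, so that the training sets differ only in their noise; all names and settings are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)  # assumed ground-truth function

x_train = np.linspace(0.0, 1.0, 30).reshape(-1, 1)  # fixed training inputs
x_eval = np.linspace(0.0, 1.0, 50).reshape(-1, 1)   # points where predictions are compared
noise_sd, n_repeats = 0.2, 200

summary = {}
for degree in (1, 9):
    preds = np.empty((n_repeats, len(x_eval)))
    for r in range(n_repeats):
        # Each repeat is a fresh training set: same inputs, new noise
        y_train = true_f(x_train.ravel()) + rng.normal(0.0, noise_sd, size=len(x_train))
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds[r] = model.fit(x_train, y_train).predict(x_eval)
    # Squared bias: gap between the average prediction and the truth
    bias_sq = np.mean((preds.mean(axis=0) - true_f(x_eval.ravel())) ** 2)
    # Variance: spread of the predictions across training sets
    variance = np.mean(preds.var(axis=0))
    summary[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

The linear model shows the larger squared bias and the degree-9 model the larger variance, matching the effect of complexity described above.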

Simple Linear Regression

Simple linear regression is a statistical method that models the relationship between two variables by fitting a linear equation to observed data. This example uses a simulated dataset to represent the relationship between the size of a house (in square feet) and its price (in thousands of dollars), incorporating natural variations to reflect real-life scenarios.

Python Code Example

1. Import Libraries

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

This block imports the necessary libraries for data manipulation, plotting, and machine learning.

2. Generate Sample Data

np.random.seed(42)  # For reproducibility
square_footage = np.array([1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300, 2400, 2500, 2600, 2700])
price = np.array([300, 320, 340, 360, 380, 400, 420, 440, 460, 480, 500, 520, 540]) + np.random.normal(0, 20, 13)  # Adding noise

This block generates sample data for house sizes and prices, introducing random noise to simulate real-world pricing variations.

3. Prepare Features and Target Variables

X = square_footage.reshape(-1, 1)  # Square footage
y = price  # Price in thousands

This block prepares the features (square footage) and the target variable (house price).

4. Print Features and Target Variables

print("Square Footage (X):", X)
print("House Price (y):", y)

Output:

Square Footage (X): [[1500]
 [1600]
 [1700]
 [1800]
 [1900]
 [2000]
 [2100]
 [2200]
 [2300]
 [2400]
 [2500]
 [2600]
 [2700]]
House Price (y): [309.93428306 317.23471398 352.95377076 390.46059713 375.31693251
 395.31726086 451.58425631 455.34869458 450.61051228 490.85120087
 490.73164614 510.68540493 544.83924543]

5. Split the Dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

This block splits the dataset into training and testing sets for model evaluation.

6. Create and Train the Model

model = LinearRegression()
model.fit(X_train, y_train)

This block initializes the linear regression model and trains it using the training dataset.
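After fitting, the learned line can be read off the model and compared with the equation y = mx + b from the start of the article. The sketch below rebuilds the same simulated data so that it runs on its own; the names `m`, `b`, and `manual_pred` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same simulated data as in this example (arange reproduces the listed values)
np.random.seed(42)
square_footage = np.arange(1500, 2800, 100)
price = np.arange(300, 560, 20) + np.random.normal(0, 20, 13)

X = square_footage.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

m = model.coef_[0]    # slope: price change (in thousands) per extra square foot
b = model.intercept_  # intercept: predicted price at 0 sq ft (an extrapolation)
print(f"slope m = {m:.4f}, intercept b = {b:.2f}")

# model.predict is just the fitted equation applied to each input
manual_pred = m * X_test.ravel() + b
print(np.allclose(manual_pred, model.predict(X_test)))
```

The slope should land near 0.2, since the noise-free prices rise by 20 (thousand) for every 100 square feet.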

7. Make Predictions

y_pred = model.predict(X_test)

This block uses the trained model to make predictions on the test set.

8. Evaluate the Model

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')

Output:

Mean Squared Error: 57.99
R-squared: 0.99

9. Plot the Results

plt.scatter(X, y, color='blue', label='Actual Prices')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Fitted Line')
plt.title('Simple Linear Regression: House Price Prediction')
plt.xlabel('Square Footage (sq ft)')
plt.ylabel('Price (in thousands)')
plt.legend()
plt.grid()
plt.show()

This block plots the actual prices as a scatter and overlays the line through the model's test-set predictions to visualize the fit.

Output:

[Figure: Simple linear regression]

These steps show how to implement and evaluate simple linear regression on a simulated dataset in which house prices vary with square footage plus random noise.

Multiple Linear Regression

Multiple linear regression is a statistical technique that models the relationship between a dependent variable and multiple independent variables. This example incorporates two features: the size of a house (in square feet) and the number of bathrooms. We analyze how both factors influence house prices.

Python Code Example

1. Import Libraries

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

This block imports the necessary libraries for data manipulation and machine learning.

2. Generate Sample Data

np.random.seed(42)  # For reproducibility
square_footage = np.array([1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300, 2400, 2500, 2600, 2700])
num_bathrooms = np.array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5])
price = np.array([300, 320, 340, 360, 380, 400, 420, 440, 460, 480, 500, 520, 540]) + np.random.normal(0, 20, 13)  # Adding noise

This block generates sample data for house sizes, number of bathrooms, and prices, introducing random noise to simulate real-world pricing variations.

3. Prepare Features and Target Variables

X = np.column_stack((square_footage, num_bathrooms))  # Features: square footage and number of bathrooms
y = price  # Price in thousands

This block prepares the features (square footage and number of bathrooms) and the target variable (house price).

4. Print Features and Target Variables

print("Features (X):", X)
print("House Price (y):", y)

Output:

Features (X): [[1500  1]
 [1600  1]
 [1700  2]
 [1800  2]
 [1900  2]
 [2000  3]
 [2100  3]
 [2200  3]
 [2300  4]
 [2400  4]
 [2500  4]
 [2600  5]
 [2700  5]]
House Price (y): [309.93428306 317.23471398 352.95377076 390.46059713 375.31693251
 395.31726086 451.58425631 455.34869458 450.61051228 490.85120087
 490.73164614 510.68540493 544.83924543]

5. Split the Dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

This block splits the dataset into training and testing sets for model evaluation.

6. Create and Train the Model

model = LinearRegression()
model.fit(X_train, y_train)

This block initializes the multiple linear regression model and trains it using the training dataset.
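With more than one feature, the fitted model has one coefficient per column, matching y = b0 + b1x1 + b2x2. The sketch below rebuilds the same simulated data, prints the coefficients, and also checks how strongly the two features are correlated; that check is an addition to the article's walkthrough.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same simulated data as in this example (arange reproduces the listed values)
np.random.seed(42)
square_footage = np.arange(1500, 2800, 100)
num_bathrooms = np.array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5])
price = np.arange(300, 560, 20) + np.random.normal(0, 20, 13)

X = np.column_stack((square_footage, num_bathrooms))
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

print(f"intercept b0 = {model.intercept_:.2f}")
for name, coef in zip(("square_footage", "num_bathrooms"), model.coef_):
    print(f"coefficient for {name} = {coef:.4f}")

# A prediction is the intercept plus the weighted sum of the features
manual_pred = model.intercept_ + X_test @ model.coef_
print(np.allclose(manual_pred, model.predict(X_test)))

# The two features rise together in this dataset
corr = np.corrcoef(square_footage, num_bathrooms)[0, 1]
print(f"correlation between features = {corr:.3f}")
```

The correlation is about 0.98, so the two features carry almost the same information here. In that situation the individual coefficients are unstable and should be interpreted with caution, even when the predictions themselves are good.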

7. Make Predictions

y_pred = model.predict(X_test)

This block uses the trained model to make predictions on the test set.

8. Evaluate the Model

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')

Output:

Mean Squared Error: 64.16
R-squared: 0.99

These steps show how to implement and evaluate multiple linear regression on a simulated dataset in which house prices are modeled from both square footage and the number of bathrooms.

Polynomial Regression

Polynomial regression is a regression analysis technique where the relationship between the independent variable and the dependent variable is modeled as an nth degree polynomial. In this example, we will model the relationship between the size of a house (in square feet) and its price (in thousands of dollars) using a 3rd degree polynomial.

Python Code Example

1. Import Libraries

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

This block imports the necessary libraries for data manipulation, plotting, and machine learning.

2. Generate Sample Data

np.random.seed(42)  # For reproducibility
square_footage = np.array([1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300, 2400, 2500, 2600, 2700])
price = np.array([300, 320, 340, 360, 380, 400, 420, 440, 460, 480, 500, 520, 540]) + np.random.normal(0, 20, 13)  # Adding noise

This block generates sample data for house sizes and prices, introducing random noise to simulate real-world pricing variations.

3. Prepare Features and Target Variables

X = square_footage.reshape(-1, 1)  # Reshape for sklearn
y = price  # Price in thousands

This block prepares the features (square footage) and the target variable (house price).

4. Split the Dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

This block splits the dataset into training and testing sets for model evaluation.

5. Create Polynomial Features

poly = PolynomialFeatures(degree=3)
X_poly_train = poly.fit_transform(X_train)
X_poly_test = poly.transform(X_test)

print("Polynomial Features (X_poly_train):", X_poly_train)
print("Polynomial Features (X_poly_test):", X_poly_test)

Output:

Polynomial Features (X_poly_train): [[1.0000e+00 2.3000e+03 5.2900e+06 1.2167e+10]
 [1.0000e+00 2.0000e+03 4.0000e+06 8.0000e+09]
 [1.0000e+00 1.7000e+03 2.8900e+06 4.9130e+09]
 [1.0000e+00 1.6000e+03 2.5600e+06 4.0960e+09]
 [1.0000e+00 2.7000e+03 7.2900e+06 1.9683e+10]
 [1.0000e+00 1.9000e+03 3.6100e+06 6.8590e+09]
 [1.0000e+00 2.2000e+03 4.8400e+06 1.0648e+10]
 [1.0000e+00 2.5000e+03 6.2500e+06 1.5625e+10]
 [1.0000e+00 1.8000e+03 3.2400e+06 5.8320e+09]
 [1.0000e+00 2.1000e+03 4.4100e+06 9.2610e+09]]
Polynomial Features (X_poly_test): [[1.0000e+00 2.6000e+03 6.7600e+06 1.7576e+10]
 [1.0000e+00 2.4000e+03 5.7600e+06 1.3824e+10]
 [1.0000e+00 1.5000e+03 2.2500e+06 3.3750e+09]]

6. Create and Train the Model

model = LinearRegression()
model.fit(X_poly_train, y_train)

This block initializes the polynomial regression model and trains it using the transformed training dataset.
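The transform and the fit can also be packaged into a single scikit-learn pipeline. The sketch below is one possible arrangement rather than the article's method: it adds a `StandardScaler` step, which the original code does not have, because the raw cubic features printed above reach about 1.9e10 and features on such different scales can make the least-squares problem numerically delicate.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same simulated data as in this example (arange reproduces the listed values)
np.random.seed(42)
square_footage = np.arange(1500, 2800, 100)
price = np.arange(300, 560, 20) + np.random.normal(0, 20, 13)

X = square_footage.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

pipeline = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),  # LinearRegression fits its own intercept
    StandardScaler(),                                  # assumption: added for numerical stability
    LinearRegression(),
)
pipeline.fit(X_train, y_train)  # transformers are fitted on the training data only

y_pred = pipeline.predict(X_test)  # the same transforms are applied automatically
print(f"Test MSE: {mean_squared_error(y_test, y_pred):.2f}")
```

Because of the extra scaling step, the printed test MSE may differ somewhat from the value reported for the unscaled model. The practical benefit of the pipeline is that `fit` and `predict` always apply the same transformations, so test data cannot leak into the fitted transformers.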

7. Make Predictions

y_pred = model.predict(X_poly_test)

This block uses the trained model to make predictions on the test set.

8. Evaluate the Model

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')

Output:

Mean Squared Error: 300.06
R-squared: 0.96

9. Plot the Results

plt.scatter(X, y, color='blue', label='Actual Prices')
X_grid = np.arange(X.min(), X.max(), 1).reshape(-1, 1)
y_grid = model.predict(poly.transform(X_grid))
plt.plot(X_grid, y_grid, color='red', linewidth=2, label='Fitted Polynomial Curve')
plt.title('Polynomial Regression: House Price Prediction')
plt.xlabel('Square Footage (sq ft)')
plt.ylabel('Price (in thousands)')
plt.legend()
plt.grid()
plt.show()

This block creates a scatter plot of the actual prices versus the predicted prices and visualizes the fitted polynomial curve.

Output:

[Figure: Polynomial regression]

These steps show how to implement and evaluate polynomial regression. Note that the simulated prices here are a linear function of house size plus noise, so the degree-3 model does not improve on the simple linear fit: its test MSE (300.06) is higher than that of simple linear regression on the same split (57.99). Polynomial features improve predictions when the underlying relationship is actually non-linear; otherwise the extra flexibility mostly fits noise.

Combined Multiple Linear and Polynomial Regression

In this example, we combine the two approaches: square footage enters the model as a plain linear term, while the number of bathrooms is expanded into polynomial features up to degree 3. The model then relates both to price (in thousands of dollars).

Python Code Example

1. Import Libraries

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

This block imports the necessary libraries for data manipulation and machine learning.

2. Generate Sample Data

np.random.seed(42)  # For reproducibility
square_footage = np.array([1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300, 2400, 2500, 2600, 2700])
bathrooms = np.array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5])
price = np.array([300, 320, 340, 360, 380, 400, 420, 440, 460, 480, 500, 520, 540]) + np.random.normal(0, 20, 13)  # Adding noise

This block generates sample data for house sizes, number of bathrooms, and prices, introducing random noise to simulate real-world pricing variations.

3. Prepare Features and Target Variables

X = np.column_stack((square_footage, bathrooms))  # Combine features
y = price  # Price in thousands

print("Features (X):", X)

Output:

Features (X): [[1500    1]
 [1600    1]
 [1700    2]
 [1800    2]
 [1900    2]
 [2000    3]
 [2100    3]
 [2200    3]
 [2300    4]
 [2400    4]
 [2500    4]
 [2600    5]
 [2700    5]]

4. Split the Dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

This block splits the dataset into training and testing sets for model evaluation.

5. Create Polynomial Features for Bathrooms

poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly_bathrooms_train = poly.fit_transform(X_train[:, 1].reshape(-1, 1))  # Only bathrooms
X_poly_bathrooms_test = poly.transform(X_test[:, 1].reshape(-1, 1))

# Combine square footage with polynomial features of bathrooms
X_poly_train = np.column_stack((X_train[:, 0], X_poly_bathrooms_train))
X_poly_test = np.column_stack((X_test[:, 0], X_poly_bathrooms_test))

print("Polynomial Features (X_poly_train):", X_poly_train)
print("Polynomial Features (X_poly_test):", X_poly_test)

Output:

Polynomial Features (X_poly_train): [[2.30e+03 4.00e+00 1.60e+01 6.40e+01]
 [2.00e+03 3.00e+00 9.00e+00 2.70e+01]
 [1.70e+03 2.00e+00 4.00e+00 8.00e+00]
 [1.60e+03 1.00e+00 1.00e+00 1.00e+00]
 [2.70e+03 5.00e+00 2.50e+01 1.25e+02]
 [1.90e+03 2.00e+00 4.00e+00 8.00e+00]
 [2.20e+03 3.00e+00 9.00e+00 2.70e+01]
 [2.50e+03 4.00e+00 1.60e+01 6.40e+01]
 [1.80e+03 2.00e+00 4.00e+00 8.00e+00]
 [2.10e+03 3.00e+00 9.00e+00 2.70e+01]]
Polynomial Features (X_poly_test): [[2.60e+03 5.00e+00 2.50e+01 1.25e+02]
 [2.40e+03 4.00e+00 1.60e+01 6.40e+01]
 [1.50e+03 1.00e+00 1.00e+00 1.00e+00]]

6. Create and Train the Model

model = LinearRegression()
model.fit(X_poly_train, y_train)

This block initializes the combined regression model and trains it using the transformed training dataset.
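The manual slicing and stacking can alternatively be expressed with scikit-learn's `ColumnTransformer`, which applies a transformer to selected columns and passes the rest through. This is a sketch of an equivalent setup, not the article's code; note that `ColumnTransformer` places the transformed columns first and the passthrough columns last, so the column order differs from the manual version while the fitted predictions stay the same.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same simulated data as in this example (arange reproduces the listed values)
np.random.seed(42)
square_footage = np.arange(1500, 2800, 100)
bathrooms = np.array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5])
price = np.arange(300, 560, 20) + np.random.normal(0, 20, 13)

X = np.column_stack((square_footage, bathrooms))
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

# Expand column 1 (bathrooms) to degree 3; leave column 0 (square footage) as is
preprocess = ColumnTransformer(
    transformers=[("bath_poly", PolynomialFeatures(degree=3, include_bias=False), [1])],
    remainder="passthrough",
)
combined = make_pipeline(preprocess, LinearRegression())
combined.fit(X_train, y_train)
pred_pipeline = combined.predict(X_test)

# Manual version from the steps above, for comparison
poly = PolynomialFeatures(degree=3, include_bias=False)
manual_train = np.column_stack((X_train[:, 0], poly.fit_transform(X_train[:, [1]])))
manual_test = np.column_stack((X_test[:, 0], poly.transform(X_test[:, [1]])))
pred_manual = LinearRegression().fit(manual_train, y_train).predict(manual_test)

print(np.allclose(pred_pipeline, pred_manual))
```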

7. Make Predictions

y_pred = model.predict(X_poly_test)

This block uses the trained model to make predictions on the test set.

8. Evaluate the Model

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')

Output:

Mean Squared Error: 199.75
R-squared: 0.98

This structured approach effectively combines multiple features with polynomial transformations, providing a comprehensive understanding of how to implement and evaluate the model.

Evaluating Linear Regression Model

Evaluating a linear regression model involves assessing how well it predicts the dependent variable using various metrics and techniques. Here are some key methods for evaluation:

1. Performance Metrics

  • Mean Squared Error (MSE): Measures the average squared difference between predicted and actual values. Lower values indicate better model performance.
    • Formula: MSE = (1/n) * Σ (yi - ŷi)^2
  from sklearn.metrics import mean_squared_error

  mse = mean_squared_error(y_test, y_pred)
  print(f'Mean Squared Error: {mse}')
  • Root Mean Squared Error (RMSE): The square root of MSE, providing an error metric in the same units as the dependent variable. It is also sensitive to outliers.
    • Formula: RMSE = √(MSE)
  import numpy as np

  rmse = np.sqrt(mse)
  print(f'Root Mean Squared Error: {rmse}')
  • Mean Absolute Error (MAE): Measures the average absolute differences between predicted and actual values. It is less sensitive to outliers than MSE.
    • Formula: MAE = (1/n) * Σ |yi - ŷi|
  from sklearn.metrics import mean_absolute_error

  mae = mean_absolute_error(y_test, y_pred)
  print(f'Mean Absolute Error: {mae}')
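
The examples in this article also report R-squared through `r2_score`. As a short sketch on made-up numbers, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the actual values:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.5, 9.5])  # each prediction is off by 0.5

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares = 1.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares = 20.0
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual)                # -> 0.95
print(r2_score(y_true, y_hat))  # same value from scikit-learn
```

A value of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values are possible on test data when the model does worse than that baseline.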

2. Cross-Validation

Cross-validation is a robust technique for assessing the performance of a machine learning model by splitting the dataset into multiple parts and validating the model on different subsets of the data. Here are common cross-validation techniques:

  • K-Fold Cross-Validation: The dataset is split into k subsets. The model is trained on k-1 subsets and validated on the remaining subset. This process is repeated k times, each time with a different subset as the validation set. The average performance metric over the k folds provides a more reliable evaluation.
  from sklearn.model_selection import KFold, cross_val_score

  kf = KFold(n_splits=5, shuffle=True, random_state=42)
  scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')
  print(f'Cross-Validation MSE: {np.mean(-scores)}')
  • Leave-One-Out Cross-Validation (LOOCV): A special case of K-Fold cross-validation where k equals the number of data points. Each data point is used once as a validation set, and the remaining data points are used for training. This method is computationally intensive but useful for small datasets.
  from sklearn.model_selection import LeaveOneOut

  loo = LeaveOneOut()
  scores = cross_val_score(model, X, y, cv=loo, scoring='neg_mean_squared_error')
  print(f'Leave-One-Out Cross-Validation MSE: {np.mean(-scores)}')
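  • Stratified K-Fold Cross-Validation: Similar to K-Fold cross-validation but ensures that each fold is representative of the overall class distribution, which is particularly useful for imbalanced classification datasets. StratifiedKFold requires discrete class labels and raises an error on a continuous regression target, so a common workaround for regression is to bin the target and stratify on the bins.
  from sklearn.model_selection import StratifiedKFold

  # Bin the continuous target at its median so the folds can be stratified
  y_binned = (y >= np.median(y)).astype(int)
  skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
  scores = cross_val_score(model, X, y, cv=skf.split(X, y_binned), scoring='neg_mean_squared_error')
  print(f'Stratified K-Fold Cross-Validation MSE: {np.mean(-scores)}')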

By using these evaluation methods and cross-validation techniques, practitioners can assess the effectiveness of their linear regression model, ensuring it generalizes well to unseen data.
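
To make the K-Fold procedure concrete, the sketch below writes out the loop that `cross_val_score` runs internally and checks that both give the same per-fold errors. It reuses the simulated house-price data from the earlier examples; the variable names are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Simulated house-price data as in the examples above
np.random.seed(42)
X = np.arange(1500, 2800, 100).reshape(-1, 1)
y = np.arange(300, 560, 20) + np.random.normal(0, 20, 13)

model = LinearRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_mse = []
for train_idx, val_idx in kf.split(X):
    # Train a fresh copy of the model on k-1 folds, validate on the held-out fold
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    fold_mse.append(mean_squared_error(y[val_idx], fold_model.predict(X[val_idx])))

scores = cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error")

print("manual per-fold MSE:   ", np.round(fold_mse, 2))
print("cross_val_score MSE:   ", np.round(-scores, 2))
print("mean MSE over 5 folds: ", round(float(np.mean(fold_mse)), 2))
```

The scores from `cross_val_score` are negated because scikit-learn scorers follow a "higher is better" convention, which is why the article's snippets take `-scores` before averaging.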

Regularization in Regression

Regularization is a technique used in regression analysis to prevent overfitting and improve model generalization by adding a penalty term to the loss function. This penalty discourages overly complex models by constraining the size of the coefficients, which helps manage the bias-variance tradeoff. The two most common forms of regularization in regression are L1 regularization (Lasso) and L2 regularization (Ridge).

L2 Regularization (Ridge Regression)

Concept: L2 regularization adds a penalty proportional to the sum of the squared coefficients, that is, the squared L2 norm of the coefficient vector, to the loss function.

Loss Function: The modified loss function for Ridge regression can be represented as:

Loss = Σ(yi - ŷi)^2 + λ * Σ(wj^2)

Where:

  • yi is the actual value.
  • ŷi is the predicted value.
  • wj are the model coefficients.
  • λ is the regularization parameter that controls the strength of the penalty.

Effects:

  • Ridge regression shrinks the coefficients towards zero but does not set them exactly to zero. As a result, all features remain in the model, making it suitable for situations with many predictors, especially when multicollinearity is present.
  • The quadratic penalty means that larger coefficients are penalized more heavily, promoting stability in predictions.

Coefficient Plotting: When visualizing coefficients, Ridge regression shows a smooth decrease in coefficient values as the regularization parameter increases, resulting in more balanced coefficients without dropping any variables.
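
The shrinking behaviour can be checked numerically with scikit-learn's `Ridge`, whose `alpha` parameter plays the role of λ. The synthetic data and the chosen `alpha` values below are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic data (assumed): only the first three features influence y
X = rng.normal(size=(50, 5))
true_w = np.array([3.0, -2.0, 0.5, 0.0, 0.0])
y = X @ true_w + rng.normal(0.0, 0.5, size=50)

coefs = {}
for alpha in (0.1, 10.0, 1000.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    coefs[alpha] = ridge.coef_
    print(f"alpha = {alpha:>6}: ||w|| = {np.linalg.norm(ridge.coef_):.4f}, "
          f"zero coefficients = {int(np.sum(ridge.coef_ == 0))}")
```

The coefficient norm drops as `alpha` grows, yet no coefficient becomes exactly zero, so every feature stays in the model.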

L1 Regularization (Lasso Regression)

Concept: L1 regularization adds a penalty proportional to the sum of the absolute values of the coefficients, that is, the L1 norm of the coefficient vector, to the loss function.

Loss Function: The modified loss function for Lasso regression is expressed as:

Loss = Σ(yi - ŷi)^2 + λ * Σ|wj|

Where:

  • yi is the actual value.
  • ŷi is the predicted value.
  • wj are the model coefficients.
  • λ is the regularization parameter.

Effects:

  • Lasso regression can shrink some coefficients to exactly zero, effectively performing variable selection. This is beneficial in creating simpler, more interpretable models.
  • The linear penalty allows for certain coefficients to be excluded from the model, which can be especially useful when dealing with high-dimensional data.

Coefficient Plotting: In Lasso regression, as the regularization parameter increases, we typically observe that some coefficients drop to zero quickly, creating a sparse model where only the most significant features retain non-zero coefficients.
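
The sparsity effect can be seen with scikit-learn's `Lasso`. Note that scikit-learn scales the squared-error term by 1/(2n), so its `alpha` is not numerically identical to the λ in the loss above, although it plays the same role. The synthetic data and `alpha` values are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data (assumed): only the first three features influence y
X = rng.normal(size=(50, 5))
true_w = np.array([3.0, -2.0, 0.5, 0.0, 0.0])
y = X @ true_w + rng.normal(0.0, 0.5, size=50)

coefs = {}
for alpha in (0.01, 0.5, 10.0):
    lasso = Lasso(alpha=alpha).fit(X, y)
    coefs[alpha] = lasso.coef_
    n_zero = int(np.sum(lasso.coef_ == 0))
    print(f"alpha = {alpha:>5}: coefficients = {np.round(lasso.coef_, 3)}, "
          f"exact zeros = {n_zero}")
```

With a small penalty the strong features keep non-zero coefficients, while a sufficiently large penalty drives every coefficient to exactly zero. Intermediate values produce the sparse models described above.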
