*Supervised Learning*
- Introduction to Supervised Learning
Supervised learning involves training a model using labeled datasets to predict outcomes for new inputs. It is analogous to learning under supervision, where the model is given examples of inputs and correct outputs during training.
- Step 1: Data collection
- Step 2: Training
- Step 3: Testing
Key Characteristics:
• Inputs (Features): Independent variables like age, weight, or hours studied.
• Outputs (Labels): Dependent variables, either continuous (regression) or categorical (classification).
• Model Objective: Minimize the error between predicted outputs and actual outputs.
- Types of Supervised Learning Tasks
2.1 Regression
Regression is used for predicting continuous values. The output variable can take any real value.
Regression is a type of supervised learning technique used to predict a continuous outcome or value based on one or more input features (variables). The goal of regression is to model the relationship between the input variables (often called independent variables or features) and the output variable (often called the dependent variable or target).
- Regression is a statistical method that helps us understand and predict the relationship between variables. (A variable is a quantity whose values we measure.)
- It describes how one variable (the dependent variable, i.e., the data we want to predict) changes as another variable (the independent variable, i.e., the basis on which we predict) changes.
- Dependent variable: The variable we are trying to predict or explain (Y).
- Independent variable: The variables used to predict or explain changes in the dependent variable (X).
Key Concepts:
1. Prediction of Continuous Values:
The main purpose of regression is to predict a numerical value. For example, predicting a house price based on its size, location, and number of rooms. The output is continuous, meaning it can take any value within a range.
2. The Relationship Between Variables:
Regression assumes that there is a relationship between the input features and the target variable. For example, the price of a house might depend on its square footage and the number of bedrooms. The model tries to find the best way to connect these input features to the predicted price.
Types of regression
- Linear Regression
- Multiple linear regression
- Polynomial Regression
2.2 Classification
Classification assigns discrete class labels to inputs.
Classification is a type of supervised learning where the goal is to predict a discrete label or category for a given input. Unlike regression, which predicts continuous values, classification assigns inputs to one of several predefined classes. This is commonly used for problems where the output is a category, such as classifying an email as “spam” or “not spam,” predicting if a tumor is “malignant” or “benign,” or determining the type of animal in a photo (e.g., dog, cat, etc.).
Key Concepts in Classification
- Labels (Classes): In classification, each input data point is assigned a label, which is a category. The model’s task is to predict these labels based on the input features. For example:
• In a binary classification problem, there are two possible labels: “yes” or “no,” “spam” or “not spam.”
• In multi-class classification, there are more than two possible categories, e.g., classifying images of fruits as “apple,” “banana,” or “cherry.”
- Training Data: Classification algorithms are trained on a labeled dataset, where the input features and corresponding labels are known. The model uses this data to learn how to associate inputs with the correct labels.
- Prediction: After training, the model is used to classify new, unseen data based on the patterns it learned from the training data. For example, after training a model to classify emails as spam or not, you can input a new email into the model, and it will predict whether it’s spam or not.
Example: Classifying emails as spam or not spam.
- Key Algorithms in Supervised Learning
Linear models
3.1 Linear Regression
Linear Regression is a statistical method used to model the relationship between a dependent variable (also known as the target or output) and one or more independent variables (also known as predictors or features). The goal is to fit a linear equation to the observed data, so that we can predict the dependent variable based on the independent variables.
Equation of linear regression: Y=mX+b
- Y represents the dependent variable.
- X represents the independent variable.
- m is the slope of the line (how much Y changes for a unit change in X).
- b is the intercept (the value of Y when X is 0).
Key Idea: Fit a straight line to the data to predict a continuous outcome.
The general expression for Linear Regression is:
y = w_0 + w_1x_1 + w_2x_2 + … + w_nx_n + ε
Where w_0 is the intercept, and w_1, w_2, …, w_n are coefficients (slopes) learned during training.
Explanation:
1. y: The predicted output (dependent variable).
2. w_0: The intercept or bias term, representing the value of y when all x_i = 0.
3. w_1, w_2, …, w_n: The coefficients or weights for each feature x_1, x_2, …, x_n. These indicate the strength and direction of the relationship between each feature and the output.
4. x_1, x_2, …, x_n: The input features (independent variables).
5. ε: The error term, accounting for variability not captured by the model (assumed to be normally distributed).
Mathematical Objective:
Minimize the error between actual and predicted values, typically measured by the mean squared error (MSE).
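As a rough illustration, here is a minimal sketch of fitting a linear regression with scikit-learn (the house-size and price numbers below are made up for demonstration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: house size in square feet (feature) and price in dollars (target)
X = np.array([[600], [800], [1000], [1200], [1500]])       # independent variable
y = np.array([150000, 200000, 240000, 275000, 340000])     # dependent variable

model = LinearRegression()
model.fit(X, y)  # learns the intercept w_0 and slope w_1 by minimizing squared error

print("Intercept (w_0):", model.intercept_)
print("Slope (w_1):", model.coef_[0])
print("Predicted price for 1100 sq ft:", model.predict([[1100]])[0])
```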
3.2 Logistic Regression
Logistic Regression is a statistical method used for binary classification tasks, where the goal is to predict one of two possible outcomes based on one or more independent variables (features). Despite its name, logistic regression is used for classification, not regression, because its output is a probability that is transformed into a binary outcome (0 or 1).
Logistic regression is a powerful and widely-used classification algorithm for binary outcomes. By modeling the probability of an outcome using the logistic (sigmoid) function, logistic regression helps classify inputs into one of two categories based on their features. It’s particularly useful for problems where you need probabilistic predictions and can provide insights into the influence of each feature on the outcome.
Key Idea: Predict probabilities for binary classification using the sigmoid function.
Steps:
1. Compute the linear combination z = w_0 + w_1x_1 + …. + w_nx_n .
2. Apply the sigmoid function to map z into the range (0, 1).
3. Use a threshold (e.g., 0.5) to classify the input.
Sigmoid Function: The sigmoid function is a mathematical function that maps any real-valued number to a value between 0 and 1: σ(z) = 1 / (1 + e^(-z)). It is often used in machine learning, especially in logistic regression, to model probabilities. The function has an “S” shaped curve, which is why it is also known as the logistic function.
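As a minimal sketch, logistic regression can be tried out with scikit-learn on a made-up hours-studied vs. pass/fail dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied (feature) and exam outcome (label)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = fail, 1 = pass

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba applies the sigmoid to the linear combination z = w_0 + w_1*x_1
print("P(pass | 4.5 hours):", clf.predict_proba([[4.5]])[0, 1])
print("Predicted class:", clf.predict([[4.5]])[0])  # uses a 0.5 probability threshold
```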
3.3 k-NN Algorithm
K-Nearest Neighbors (KNN) Algorithm:
The K-Nearest Neighbors (KNN) algorithm is a simple, instance-based learning algorithm used for classification and regression tasks. It makes predictions based on the similarity between the input data point and its nearest neighbors in the feature space. KNN is a non-parametric method, meaning it makes no assumptions about the underlying data distribution.
Key Concepts of KNN:
- Instance-Based Learning: KNN does not explicitly learn a model during the training phase. Instead, it stores the entire dataset and makes decisions at the time of prediction based on the stored instances.
- Distance Metric: KNN uses a distance metric (typically Euclidean distance) to measure the similarity between data points. The algorithm calculates the distance between the input point and all the points in the training dataset, then selects the nearest ones.
- K: The number of neighbors to consider when making a prediction is defined by the parameter K. The choice of K affects the performance of the model:
• Small K: More sensitive to noise, prone to overfitting.
• Large K: More robust, but may lead to underfitting if too large.
- Voting (for Classification): In classification, KNN assigns the most frequent class label among the K nearest neighbors. This is called majority voting. For example, if K = 3 and two of the nearest neighbors belong to class 1 and one belongs to class 0, the input will be classified as class 1.
- Averaging (for Regression): In regression, KNN predicts the average of the target values of the K nearest neighbors.
How KNN Works (Steps):
1. Choose the number of neighbors K:
Select a value for K, the number of neighbors to look at.
2. Calculate the distance:
For a given data point (test point), calculate the distance between the test point and every other point in the training dataset. Common distance metrics include:
• Euclidean distance: d(p, q) = sqrt(Σ (p_i - q_i)^2)
• Manhattan distance (L1 norm), etc.
3. Identify the K nearest neighbors:
Sort all points in the training set by their distance to the test point and select the K closest points.
4. Assign a label (classification) or predict the output (regression):
• For classification, assign the most common class among the K neighbors.
• For regression, compute the average of the target values of the K neighbors.
5. Return the prediction:
Based on the majority class or average value, return the predicted output for the test point.
Example of KNN (Classification):
Let’s consider a simple example where we want to classify whether a fruit is an apple or an orange based on its weight and size.
| Fruit  | Weight (g) | Size (cm) | Label  |
|--------|------------|-----------|--------|
| Apple  | 150        | 7         | Apple  |
| Apple  | 160        | 7.5       | Apple  |
| Orange | 130        | 6.5       | Orange |
| Orange | 120        | 6         | Orange |
| Apple  | 170        | 7.2       | Apple  |
| Orange | 140        | 6.8       | Orange |
Now, suppose we have a new fruit with the following characteristics:
• Weight: 160g
• Size: 7.1cm
We want to classify it using KNN with K = 3.
1. Step 1: Calculate the distance between the new fruit and each of the training points using the Euclidean distance formula.
2. Step 2: Sort the distances and find the 3 nearest neighbors.
After calculating the distances, we find that the 3 nearest neighbors are:
• Nearest neighbor 1: Apple (160g, 7.5cm)
• Nearest neighbor 2: Apple (150g, 7cm)
• Nearest neighbor 3: Apple (170g, 7.2cm)
3. Step 3: Apply majority voting (for classification).
Since 3 out of the 3 nearest neighbors are labeled Apple, the new fruit will be classified as an Apple.
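The same example can be reproduced with scikit-learn's KNeighborsClassifier. This is only a sketch; in practice, features on very different scales (grams vs. centimeters) should usually be standardized before computing distances:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Training data from the table above: [weight (g), size (cm)]
X_train = np.array([
    [150, 7.0], [160, 7.5], [130, 6.5],
    [120, 6.0], [170, 7.2], [140, 6.8],
])
y_train = np.array(["Apple", "Apple", "Orange", "Orange", "Apple", "Orange"])

# K = 3 neighbors, Euclidean distance by default
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

new_fruit = np.array([[160, 7.1]])
print(knn.predict(new_fruit)[0])  # majority vote of the 3 nearest neighbors -> "Apple"
```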
Advantages of KNN:
- Simple to understand and implement.
- No training phase: KNN does not require a model to be trained, which makes it easy to use with minimal setup.
- Versatile: It can be used for both classification and regression tasks.
Disadvantages of KNN:
- Computationally expensive: KNN requires storing all training data and calculating distances for each prediction, which can be slow, especially for large datasets.
- Memory-intensive: The algorithm requires a lot of memory to store the entire training dataset.
- Sensitive to irrelevant features: If there are many irrelevant features, KNN’s performance can degrade.
- Performance degrades with high-dimensional data: KNN can suffer from the curse of dimensionality when there are many features.
Choosing the Best K:
The value of K plays a significant role in the performance of the model:
• Small K values (e.g., K = 1) might be overly sensitive to noise and outliers, leading to overfitting.
• Large K values might smooth out the decision boundaries too much, leading to underfitting.
One common way to select K is through cross-validation, where the model is trained and tested on various subsets of the dataset to find the optimal value of K.
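One possible way to do this is with scikit-learn's cross_val_score; the built-in iris dataset is used here purely as a stand-in for a real problem:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # small built-in dataset, for illustration only

# Try several values of K and keep the one with the best cross-validated accuracy
scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()  # 5-fold cross-validation

best_k = max(scores, key=scores.get)
print("Cross-validated accuracy per K:", scores)
print("Best K:", best_k)
```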
Conclusion:
The K-Nearest Neighbors (KNN) algorithm is a simple and effective method for classification and regression tasks. It works by predicting the class or output value based on the closest neighbors in the feature space. While it’s intuitive and versatile, KNN can be computationally expensive for large datasets and is sensitive to irrelevant or redundant features.
3.4 Naïve Bayes
Naïve Bayes is a probabilistic classification algorithm based on Bayes’ Theorem. It assumes that the features used to make predictions are independent of each other, given the target class, which is a “naïve” assumption in real-world scenarios.
Definition:
Naïve Bayes is a simple and efficient algorithm that predicts the class of a data point based on the likelihood of the features occurring within each class. It calculates the posterior probability of each class using Bayes’ Theorem and assigns the class with the highest probability to the data point.
Bayes’ Theorem:
P(C|X) = P(X|C) * P(C) / P(X)
Where:
• P(C|X): Posterior probability (probability of class C given the data X).
• P(X|C): Likelihood (probability of data X given class C).
• P(C): Prior probability of class C.
• P(X): Marginal probability of X (normalizing constant).
Naïve Bayes is widely used in text classification, spam detection, and sentiment analysis due to its simplicity and efficiency.
Here’s an example of Naïve Bayes applied to a spam email classification problem:
Problem Statement:
Classify whether an email is spam or not spam based on the occurrence of certain words.
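Since only the problem is stated here, the following is a minimal sketch of how it could be set up with scikit-learn's MultinomialNB; the tiny email corpus is invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus: word occurrences are the features, spam / not spam the classes
emails = [
    "win money now", "limited offer win prize", "cheap loans win cash",          # spam
    "meeting schedule tomorrow", "project report attached", "lunch next week",   # not spam
]
labels = ["spam", "spam", "spam", "not spam", "not spam", "not spam"]

# CountVectorizer turns each email into word counts; MultinomialNB applies Bayes' Theorem
# with the naive independence assumption over those counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["win a cash prize now"])[0])      # expected: "spam"
print(model.predict(["see the attached report"])[0])   # expected: "not spam"
```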
Decision Tree (Brief Explanation)
A Decision Tree is a supervised learning algorithm used for both classification and regression tasks. It models decisions and their possible consequences in the form of a tree-like structure.
Key Components of a Decision Tree:
1. Root Node: The topmost node representing the entire dataset. It is split into child nodes based on a feature that best separates the data.
2. Decision Nodes: Intermediate nodes where decisions are made based on feature values.
3. Leaf Nodes: Terminal nodes that represent the final output (class label in classification or a value in regression).
4. Splits: The decision points where the dataset is divided based on feature thresholds.
How it Works:
1. Splitting: The dataset is split recursively into subsets based on features that maximize the separation between classes (for classification) or minimize variance (for regression).
2. Stopping Criteria: The process continues until:
• A pre-defined depth is reached.
• Further splitting doesn’t improve the results.
• All data points belong to the same class (pure node).
3. Prediction:
• For classification, the tree predicts the majority class in the leaf node.
• For regression, it predicts the average value of data points in the leaf node.
Advantages:
• Simple to understand and interpret.
• Handles both numerical and categorical data.
• No need for scaling or normalization.
Disadvantages:
• Prone to overfitting, especially with deep trees.
• Sensitive to small changes in data, which can lead to different splits.
Example:
• Imagine a tree predicting whether someone will buy a product based on their age and income.
• Root Node: “Is age > 30?”
• Decision Node: “Is income > $50k?”
• Leaf Nodes: “Yes, they will buy” or “No, they won’t buy.”
This step-by-step structure makes decision trees intuitive and effective.
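A minimal sketch of the buy/don't-buy idea with scikit-learn's DecisionTreeClassifier (the ages, incomes, and labels below are made up):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: [age, income in $k] -> will they buy? (1 = yes, 0 = no)
X = [[25, 30], [35, 60], [45, 80], [22, 20], [40, 45], [50, 90], [28, 55], [60, 40]]
y = [0, 1, 1, 0, 0, 1, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)  # shallow tree to limit overfitting
tree.fit(X, y)

# Print the learned splits (root node, decision nodes, leaf nodes)
print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[33, 70]])[0])  # classify a new person
```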
Decision Tree with Entropy and Information Gain
A Decision Tree uses measures like entropy and information gain to decide where to split the data at each step. These concepts help the algorithm identify the feature that provides the most significant separation of the data.
Key Concepts
- Entropy
Entropy measures the impurity or uncertainty in a dataset.
• If all data points belong to a single class, entropy is 0 (pure node).
• If the data points are evenly distributed among classes, entropy is 1 (maximum impurity).
Formula for Entropy:
H(S) = -Σ p_i log2(p_i), summed over all classes i
Where:
• S: Dataset.
• p_i: Proportion of data points belonging to class i.
- Information Gain (IG)
Information Gain is the reduction in entropy after a dataset is split on a feature. It measures how well a feature separates the data into distinct classes. The goal is to maximize information gain at each split.
Formula for Information Gain:
IG(S, A) = H(S) - Σ (|S_v| / |S|) * H(S_v), summed over all values v of feature A
Where:
• S: Dataset.
• A: Feature used for splitting.
• H(S): Entropy of the dataset before splitting.
• H(S_v): Entropy of the subset S_v after splitting on value v of feature A.
• |S_v| / |S|: Proportion of data points in subset S_v.
Step-by-Step Process of Splitting Using Entropy and Information Gain
Example Dataset:
| Outlook  | Temperature | Humidity | Windy | Play? |
|----------|-------------|----------|-------|-------|
| Sunny    | Hot         | High     | False | No    |
| Sunny    | Hot         | High     | True  | No    |
| Overcast | Hot         | High     | False | Yes   |
| Rain     | Mild        | High     | False | Yes   |
| Rain     | Cool        | Normal   | False | Yes   |
Step 1: Calculate Initial Entropy
For the dataset above (3 Yes, 2 No): H(S) = -(3/5) log2(3/5) - (2/5) log2(2/5) ≈ 0.971.
Step 2: Calculate Entropy for Each Feature
Step 3: Calculate Information Gain
Step 4: Choose the Feature with the Highest IG
Repeat the process for all features and select the one with the highest information gain as the splitting criterion.
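To make the four steps concrete, here is a small self-contained sketch that computes the initial entropy and the information gain of the Outlook feature for the five-row dataset above:

```python
import math
from collections import Counter

# The play-tennis rows from the table above: (Outlook, Play?)
rows = [
    ("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"),
    ("Rain", "Yes"), ("Rain", "Yes"),
]

def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the class proportions."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

h_s = entropy([label for _, label in rows])  # Step 1: entropy before splitting

# Steps 2-3: weighted entropy of the subsets produced by splitting on Outlook,
# then IG(S, Outlook) = H(S) - sum(|S_v|/|S| * H(S_v))
weighted = 0.0
for value in {outlook for outlook, _ in rows}:
    subset = [label for outlook, label in rows if outlook == value]
    weighted += (len(subset) / len(rows)) * entropy(subset)

print(f"H(S) = {h_s:.3f}")                       # about 0.971
print(f"IG(S, Outlook) = {h_s - weighted:.3f}")  # 0.971 here, since each Outlook subset is pure
```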
Advantages of Using Entropy and Information Gain
1. Helps the tree identify the most informative features.
2. Makes splits that reduce uncertainty in the dataset.
Conclusion
Using entropy and information gain allows a decision tree to find the best splits, resulting in a structure that separates the data effectively and reduces prediction errors.
Random Forest: An Overview
Random Forest is a supervised learning algorithm that is used for both classification and regression tasks. It builds a collection (or “forest”) of decision trees during training and makes predictions by aggregating their outputs. It is a type of ensemble learning method, which combines multiple models to improve overall performance and reduce overfitting.
Key Characteristics of Random Forest
1. Ensemble of Trees:
Random Forest consists of multiple decision trees, each trained on a different subset of the dataset.
2. Bagging (Bootstrap Aggregation):
Each tree is trained on a random sample (with replacement) of the training data. This helps reduce variance by averaging predictions from multiple trees.
3. Random Feature Selection:
During training, each tree considers a random subset of features for splitting at each node. This introduces diversity among the trees, reducing the likelihood of overfitting.
4. Voting/Averaging for Predictions:
• Classification: The final output is the class with the majority vote from all trees.
• Regression: The final prediction is the average of all tree outputs.
How Random Forest Works
Step 1: Create Multiple Decision Trees
• Randomly sample the data (with replacement) to create multiple subsets (bootstrap samples).
• Train a decision tree on each subset. Each tree uses a random subset of features for splitting.
Step 2: Make Predictions
• For classification, each tree votes for a class, and the class with the most votes becomes the final prediction.
• For regression, the predictions of all trees are averaged to produce the final output.
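A minimal sketch of these two steps with scikit-learn's RandomForestClassifier, using synthetic data as a stand-in for a real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data for illustration
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 100 trees; each is trained on a bootstrap sample and considers a random
# subset of features ("sqrt" of the total) at every split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
forest.fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))        # majority vote over the trees
print("Feature importances:", forest.feature_importances_)   # relative importance per feature
```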
Advantages of Random Forest
1. Improved Accuracy: Combines the strengths of multiple decision trees to improve prediction accuracy.
2. Robustness: Reduces overfitting by averaging multiple trees.
3. Handles Missing Data: Can maintain performance even with incomplete datasets.
4. Works with Large Datasets: Efficient for high-dimensional data and large feature sets.
5. Feature Importance: Provides insights into the relative importance of different features.
Disadvantages of Random Forest
1. Computationally Intensive: Building and aggregating multiple trees can be resource-intensive.
2. Less Interpretability: Harder to interpret compared to a single decision tree.
3. Overfitting: While less prone to overfitting, it can still occur with excessively deep trees or a high number of trees.
Example Use Cases
1. Classification: Spam detection, fraud detection, image recognition.
2. Regression: Predicting house prices, stock market trends, or weather patterns.
Why Use Random Forest?
Random Forest is widely used due to its balance of simplicity, accuracy, and robustness. By combining multiple trees and introducing randomness, it overcomes the limitations of individual decision trees and is effective for a variety of real-world applications.