Dipsan Kadariya
Machine Learning

Supervised Learning

Supervised learning means learning from data that contains both inputs and outputs (labels), finding the relationship between them, and using that relationship to predict outputs for new inputs.

For example, suppose we have a dataset of students' IQ and CGPA (inputs) along with whether they were placed (output). In supervised learning, we train the machine on this data so that, for new IQ and CGPA values, it can predict whether there will be a placement or not.

Types of Supervised Learning:

  • Regression
  • Classification

Data Types

Data is generally of two types:

  • Numerical: Age, weight, CGPA, IQ
  • Categorical: Gender, nationality, brand, etc.

Regression (Supervised Learning)

  • If the output is numerical, it is called regression.
  • Example: For a certain IQ and CGPA, let's say we predict a salary package of 50,000. Since the output column (package) is numerical, this is regression.
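
Below is a minimal sketch of regression with scikit-learn, assuming a tiny made-up dataset of IQ, CGPA, and package values (the numbers are invented purely for illustration):

```python
# Minimal regression sketch (illustrative data, not from the article)
from sklearn.linear_model import LinearRegression

# Inputs: [IQ, CGPA]; output: salary package (numerical, so this is regression)
X = [[90, 6.5], [110, 8.0], [120, 9.1], [100, 7.2]]
y = [30000, 50000, 65000, 42000]

model = LinearRegression()
model.fit(X, y)

# Predict the package for a new student
print(model.predict([[105, 7.8]]))
```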

Classification (Supervised Learning)

  • If the output is categorical, it is called classification.
  • Example: Given certain CGPA and IQ values, we predict whether a student will get placed or not. Since the output is categorical (Yes/No), this is classification.
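
A similar sketch for classification, again with invented IQ and CGPA numbers, this time predicting a categorical Yes/No placement label (logistic regression is used here as an example classifier, not something prescribed by the article):

```python
# Minimal classification sketch (illustrative data, not from the article)
from sklearn.linear_model import LogisticRegression

# Inputs: [IQ, CGPA]; output: placement (categorical, so this is classification)
X = [[90, 6.5], [110, 8.0], [120, 9.1], [100, 7.2]]
y = ["No", "Yes", "Yes", "No"]

clf = LogisticRegression()
clf.fit(X, y)

# Predict placement for a new student
print(clf.predict([[115, 8.5]]))
```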

Unsupervised Learning

Unsupervised learning is used when we have only input data and no corresponding output.

Example:

Suppose we have only the students' IQ and CGPA columns and no placement column. We have inputs but no outputs, so there is no label to predict directly.

In unsupervised learning, we perform one of the following:

  1. Clustering
  2. Dimensionality Reduction
  3. Anomaly Detection
  4. Association Rule Learning

Clustering

  • Let's say we plot IQ and CGPA.
  • A clustering algorithm detects which students belong to the same group.
  • Example: We can categorize students into different groups such as:
    • High IQ, low CGPA
    • Low IQ, low CGPA
    • High IQ, high CGPA
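
As a rough sketch, k-means (one common clustering algorithm, chosen here as an assumption since no specific algorithm is named above) could group students by IQ and CGPA like this:

```python
# Minimal clustering sketch with k-means (illustrative data)
from sklearn.cluster import KMeans

# Only inputs: [IQ, CGPA] -- there is no output column
X = [[120, 9.0], [118, 8.8], [95, 5.5], [92, 5.8], [125, 6.0], [122, 6.2]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)  # cluster id assigned to each student, e.g. [0 0 1 1 2 2]
```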

Dimensionality Reduction

  • If there are a lot of input columns (e.g., 1000), the algorithm runs slowly, and beyond a certain point adding more columns does not improve results.
  • Dimensionality Reduction (DR) removes unnecessary columns to improve efficiency.
  • Example: If we need to predict house price based on the number of rooms and washrooms, DR combines these into a single feature, reducing the number of input columns.
  • DR also helps in visualizing high-dimensional data by reducing it to 2D or 3D.
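
A minimal sketch using PCA (one common dimensionality-reduction technique; the text above does not prescribe a specific one) to compress many columns down to two, which can then be plotted:

```python
# Minimal dimensionality-reduction sketch with PCA (random data for illustration)
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 50)            # 200 samples, 50 input columns

pca = PCA(n_components=2)              # keep only 2 combined features
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)  # (200, 50) -> (200, 2)
```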

Anomaly Detection

  • Used for finding errors and detecting outliers in data.
  • Helps in fraud detection, network security, etc.
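
As a sketch, an Isolation Forest (one of several possible anomaly detectors, used here as an assumption) can flag unusual points:

```python
# Minimal anomaly-detection sketch with Isolation Forest (illustrative data)
from sklearn.ensemble import IsolationForest

X = [[10], [11], [10], [12], [11], [300]]     # 300 is an obvious outlier

detector = IsolationForest(random_state=0)
print(detector.fit_predict(X))                # 1 = normal, -1 = anomaly
```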

Association Rule Learning

  • A technique in unsupervised learning used to find relationships or patterns between variables in large datasets.
  • Example: Market Basket Analysis (identifying which products are frequently bought together).
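
A toy sketch of the idea behind market basket analysis: simply counting which pairs of products appear together in the same basket (real association-rule libraries implement full Apriori-style mining; the baskets below are made up):

```python
# Toy market-basket sketch: count product pairs bought together (made-up baskets)
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))  # ('bread', 'butter') appears together most often
```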

Semi-Supervised Learning

  • Semi-supervised learning lies between supervised and unsupervised learning.
  • It involves a small amount of labeled data and a large amount of unlabeled data.
  • The labeled data helps guide the learning process of the unlabeled data.
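
A minimal sketch with scikit-learn's LabelPropagation, where unlabeled samples are marked with -1 (the tiny dataset is invented, just to show labeled and unlabeled data being mixed):

```python
# Minimal semi-supervised sketch: -1 marks unlabeled samples (illustrative data)
from sklearn.semi_supervised import LabelPropagation

X = [[90, 6.5], [110, 8.0], [120, 9.1], [100, 7.2], [118, 8.7], [95, 6.8]]
y = [0, 1, 1, -1, -1, -1]            # only the first three students are labeled

model = LabelPropagation(kernel="knn", n_neighbors=2)
model.fit(X, y)

print(model.transduction_)           # labels inferred for the unlabeled samples too
```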

Reinforcement Learning

  • Reinforcement learning is where an agent learns how to make decisions by interacting with an environment.
  • The agent performs actions and receives feedback in the form of rewards or penalties.
  • The goal is to maximize cumulative rewards over time.
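
A very small sketch of the agent/reward loop, using an epsilon-greedy multi-armed bandit as a deliberately simplified stand-in for a full reinforcement-learning environment (all reward numbers are invented):

```python
# Tiny reinforcement-learning sketch: epsilon-greedy bandit (illustrative rewards)
import random

true_rewards = [0.2, 0.5, 0.8]        # hidden average reward of each action
estimates = [0.0, 0.0, 0.0]
counts = [0, 0, 0]

for step in range(1000):
    # explore occasionally, otherwise exploit the best estimate so far
    if random.random() < 0.1:
        action = random.randrange(3)
    else:
        action = max(range(3), key=lambda a: estimates[a])

    reward = 1 if random.random() < true_rewards[action] else 0   # feedback
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)   # estimates roughly track the true rewards; action 2 wins
```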

Categories Based on Production

Batch Machine Learning

  • The model is trained offline and the trained model then runs on a server (offline learning).
  • The model is trained once using the entire dataset and then deployed.
  • This is the conventional way of training a model.
  • Process: data → model → train → test → server → run

Problems with Batch Learning:

  • The model is static and does not evolve with new data, since it is trained offline.
  • Requires periodic retraining with merged new and old data.
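
A rough sketch of the batch workflow: train once on the full dataset, serialize the model, and serve it without further learning (joblib is one common way to save scikit-learn models; the data below is invented):

```python
# Batch (offline) learning sketch: train once on all data, then ship the model
import joblib
from sklearn.linear_model import LogisticRegression

X_all = [[90, 6.5], [110, 8.0], [120, 9.1], [100, 7.2]]   # entire dataset
y_all = ["No", "Yes", "Yes", "No"]

model = LogisticRegression().fit(X_all, y_all)
joblib.dump(model, "placement_model.joblib")   # deploy this file to the server

# Later, on the server: load and predict; no further learning happens
served = joblib.load("placement_model.joblib")
print(served.predict([[115, 8.5]]))
```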

Online Machine Learning

  • Done incrementally (model learns continuously).
  • Data is fed in small batches (mini-batches) sequentially.
  • Model is trained online.
  • The model improves with interactions and new data.

Process: small data → model → train → test → server, with continuous new data flowing back into the model.

Predictions on New Data

  • The model continues learning from new data.
  • Examples: Chatbots like GPT, YouTube video recommendations.
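
A minimal sketch of incremental learning with scikit-learn's partial_fit, which updates an existing model with each new mini-batch instead of retraining from scratch (the streaming data here is randomly generated for illustration):

```python
# Online learning sketch: the model is updated one mini-batch at a time
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])             # all classes must be declared up front

for batch in range(5):                 # pretend each loop brings new incoming data
    X_batch = np.random.rand(20, 2)    # e.g. [IQ, CGPA] scaled to 0-1
    y_batch = (X_batch.sum(axis=1) > 1).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.predict([[0.9, 0.8]]))
```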

When to Use Online Learning?

  • Concept Drift: When the underlying patterns in the data change over time (e.g., e-commerce trends).
  • Cost-Effective: Continuous learning without expensive retraining.
  • Faster Solution: Adaptability in real-time applications.

Out-of-Core Learning

  • Used when the dataset is too large to fit in memory.
  • The dataset is split into batches and processed in chunks.
  • Though performed offline, it follows an online learning approach.

Disadvantage:

  • More complex and can be risky if not implemented correctly.
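
A rough out-of-core sketch: read a CSV that is too large for memory in chunks with pandas and update the model chunk by chunk. The file name and column names below are hypothetical, used only for illustration:

```python
# Out-of-core learning sketch: process a huge file in chunks
# ("big_dataset.csv" and its columns are hypothetical)
import pandas as pd
from sklearn.linear_model import SGDRegressor

model = SGDRegressor()

# chunksize controls how many rows are loaded into memory at a time
for chunk in pd.read_csv("big_dataset.csv", chunksize=10_000):
    X = chunk[["iq", "cgpa"]]
    y = chunk["package"]
    model.partial_fit(X, y)            # incremental update, chunk by chunk
```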

Based on Learning

Machine learning can be categorized based on how it learns:

  • By Memorizing (Instance-Based Learning)
  • By Generalizing (Understanding concepts, Model-Based Learning)

1. Instance-Based Learning

Instance-based learning does not actually learn patterns but stores training data and responds based on the nearest neighbors.

Example:

| IQ  | CGPA | Placement |
|-----|------|-----------|
| 8   | 8    | Yes       |
| 7.0 | 7.3  | No        |

  • In instance-based learning, the model does not learn anything. It just stores the data.
  • When a new query comes, it looks at the nearest data points to decide.
  • If the nearby points mostly have placements, the answer is "Yes"; otherwise, it's "No".
  • There is no training or learning—just pattern matching.

Key Point:

  • It focuses on simple pattern matching, where the model stores examples and instantly answers based on the nearest matching data point.
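
k-nearest neighbours is the classic instance-based method: it memorizes the rows and answers by looking at the closest stored points. A minimal sketch using the two rows above plus a couple of invented ones:

```python
# Instance-based learning sketch: k-NN stores the data and matches against it
from sklearn.neighbors import KNeighborsClassifier

X = [[8, 8], [7.0, 7.3], [8.5, 9.0], [6.5, 6.8]]   # [IQ, CGPA]
y = ["Yes", "No", "Yes", "No"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)                       # "fit" here mostly just stores the points

print(knn.predict([[7.8, 8.2]]))    # answer comes from the nearest stored neighbours
```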

2. Model-Based Learning

  • The model learns from data using algorithms.
  • It understands the pattern and draws a boundary.
  • The boundary helps predict answers for new inputs.
  • Unlike instance-based learning, model-based learning finds a mathematical relation between input and output.
  • Even if we don’t have training data points, we can predict using the learned boundary.

Key Point:

  • This refers to building an internal model of the relationships in the data, which can then be used to make predictions.
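
In contrast, a model-based learner compresses the data into parameters (for example, the coefficients of a line or boundary) and can discard the training rows. A minimal sketch with linear regression on invented data, where the learned coefficients are the mathematical relation:

```python
# Model-based learning sketch: the data is summarized into learned parameters
from sklearn.linear_model import LinearRegression

X = [[90, 6.5], [110, 8.0], [120, 9.1], [100, 7.2]]   # [IQ, CGPA]
y = [30000, 50000, 65000, 42000]                       # package

model = LinearRegression().fit(X, y)

# The model keeps only these numbers, not the training rows themselves
print(model.coef_, model.intercept_)
print(model.predict([[105, 7.8]]))
```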

Differences Between Instance-Based and Model-Based Learning

| Usual/Conventional Machine Learning | Instance-Based Learning |
|---|---|
| Prepare data for model training. | No model training. |
| Train models to generalize patterns. | No training, only stores data. |
| Can make predictions using learned models. | Predictions based on stored examples. |
| Results in a generalizable model. | No generalization, just stores past data. |
| Missing attributes are handled better. | Every new input needs complete data. |

Challenges in Machine Learning

  1. Data collection issues
  2. Insufficient labeled data
  3. Non-representative data
  4. Poor quality data
  5. Irrelevant features
  6. Overfitting
  7. Underfitting
  8. Software integration issues
  9. High costs involved

Machine Learning Development Lifecycle (MLDLC)

  1. Problem Definition: Identify the problem, customer needs, and expected outcomes.
  2. Data Gathering: Collect relevant data from sources such as APIs, sensors, or databases.
  3. Data Preprocessing: Remove duplicates, fill missing values, and convert data into a usable format.
  4. Exploratory Data Analysis (EDA): Understand data distribution, relationships, and visualizations.
  5. Feature Engineering & Selection: Create new features and remove unnecessary ones.
  6. Model Training, Evaluation, and Selection: Train models and tune hyperparameters for better performance.
  7. Model Deployment: Convert trained models into software for real-world use.
  8. Testing: Perform beta testing, optimize performance, and retrain if needed.

Tensor in Machine Learning

  • Tensors are data structures used to store numerical data.
  • They can be 0D, 1D, 2D, 3D, or ND tensors.

Types of Tensors

  • 0D Tensor (Scalar): A single number.
  • 1D Tensor (Vector): A single row of numbers.
  • 2D Tensor (Matrix): A table-like structure with rows and columns.
  • 3D Tensor: A collection of 2D matrices.
  • ND Tensor: Higher-dimensional tensor for complex data processing.
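
A quick sketch with NumPy showing tensors of different dimensionality (NumPy arrays are one common way to represent tensors; frameworks such as TensorFlow and PyTorch have their own tensor types):

```python
# Tensors of increasing dimensionality, represented as NumPy arrays
import numpy as np

scalar = np.array(5)                          # 0D tensor (scalar)
vector = np.array([1, 2, 3])                  # 1D tensor (vector)
matrix = np.array([[1, 2], [3, 4]])           # 2D tensor (matrix)
cube   = np.array([[[1], [2]], [[3], [4]]])   # 3D tensor: a stack of 2D matrices

for t in (scalar, vector, matrix, cube):
    print(t.ndim, t.shape)
```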
