Active Learning in Machine Learning
Active learning is a machine learning approach in which the algorithm can actively query a user or some other information source to label data points with the desired outputs. It’s particularly useful in scenarios where labeled data is scarce or expensive to obtain.
Active learning aims to improve the model's performance by selectively choosing the most "informative" data points to label rather than labeling the entire dataset.
Key Concepts in Active Learning
- Labeled vs. Unlabeled Data:
  - Labeled Data: Data with known outputs (e.g., categories, numbers).
  - Unlabeled Data: Data without any associated labels, which is cheaper and more abundant.
- Query Strategy:
  The core of active learning is deciding which data points are most useful to label next (see the uncertainty-sampling sketch after this list). Popular strategies include:
  - Uncertainty Sampling: Selecting samples where the model is least confident.
  - Query by Committee (QBC): A group of models (a committee) votes on the label; samples with the most disagreement are chosen.
  - Expected Model Change: Choosing samples that would most change the model if labeled.
  - Diversity Sampling: Selecting samples that represent diverse points in the feature space.
- Oracle:
  The human expert or automated system that provides the true label for queried data points.
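To make uncertainty sampling concrete, here’s a minimal sketch of a "least confidence" ranking; the helper name and the logistic-regression classifier are just illustrative choices, not tied to any particular library. Margin- and entropy-based scoring follow the same pattern with a different score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative helper: rank pool samples by "least confidence" uncertainty.
def least_confidence_ranking(model, X_pool):
    proba = model.predict_proba(X_pool)    # class probabilities per sample
    uncertainty = 1.0 - proba.max(axis=1)  # low top-class probability = high uncertainty
    return np.argsort(uncertainty)[::-1]   # indices, most uncertain first

# Toy data: a small labeled seed set and a larger unlabeled pool.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, y_train, X_pool = X[:30], y[:30], X[30:]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
ranking = least_confidence_ranking(model, X_pool)
print("Five most uncertain pool samples:", ranking[:5])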
Example of Active Learning Workflow
Let’s look at an example of implementing active learning for image classification:
Problem:
You want to classify images of cats and dogs, but labeling thousands of images manually is expensive.
Steps:
1. Start with a Small Labeled Dataset: Label a small number of images (e.g., 100).
2. Train an Initial Model: Train a classifier (e.g., a neural network) on this small labeled dataset.
3. Use the Model to Evaluate Unlabeled Data: Pass the unlabeled images through the model to predict their labels and measure its uncertainty.
4. Select Informative Samples: Use an active learning strategy (e.g., uncertainty sampling) to identify the 50 images where the model is least confident.
5. Label the Selected Images: Manually label these 50 images.
6. Retrain the Model: Add the newly labeled data to the training set and retrain the model.
7. Repeat Until Satisfied: Continue querying the most informative samples until the model achieves the desired performance.
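To tie these steps together, here’s a minimal, self-contained sketch of the loop using a synthetic dataset in place of real images; the batch size of 50, the RandomForestClassifier, and the simulated oracle (we simply reveal the known labels) are assumptions for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the cat/dog images: labels are known up front
# so the "oracle" can be simulated.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
labeled = np.arange(100)               # step 1: a small labeled seed set
pool = np.arange(100, len(X))          # everything else is the unlabeled pool

for round_num in range(3):             # steps 2-7, repeated a few times
    model = RandomForestClassifier(random_state=0)
    model.fit(X[labeled], y[labeled])              # train on the current labeled set

    proba = model.predict_proba(X[pool])           # evaluate the unlabeled pool
    uncertainty = 1.0 - proba.max(axis=1)          # least-confidence score
    query = pool[np.argsort(uncertainty)[-50:]]    # 50 least confident samples

    # Oracle step: a human would label X[query]; here we reveal y[query].
    labeled = np.concatenate([labeled, query])
    pool = np.setdiff1d(pool, query)

print("Labeled set grew to", len(labeled), "samples")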
An Easy-to-Understand Example
Imagine you're a teacher (the Oracle) with a class of students (the ML model). You have a large pool of questions (unlabeled data), but the students only need help with questions they find confusing. Instead of solving every question for them, you focus on the ones they struggle with (active learning). Over time, they get better with fewer examples because they're learning from their mistakes on challenging problems.
Advantages of Active Learning
- Cost-Effective: Reduces the need for large labeled datasets.
- Efficient: Focuses on the most useful data points.
- Improves Model Performance: Faster improvement with fewer labels.
Applications of Active Learning
- Medical Diagnosis: Labeling medical images like MRIs or X-rays.
- Natural Language Processing (NLP): Annotating text for tasks like sentiment analysis or entity recognition.
- Fraud Detection: Identifying suspicious transactions with limited labeled data.
- Autonomous Vehicles: Identifying rare objects or situations on the road.
Python Code Example
Here’s an example using the modAL library, a scikit-learn-compatible framework for active learning:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from modAL.models import ActiveLearner

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_pool = X[:50], X[50:]
y_train, y_pool = y[:50], y[50:]

# Create an ActiveLearner seeded with the small labeled set
learner = ActiveLearner(estimator=RandomForestClassifier(), X_training=X_train, y_training=y_train)

# Active learning loop: query one uncertain sample per iteration
for i in range(10):
    query_idx, query_inst = learner.query(X_pool)
    # Simulate labeling by the oracle using the known ground truth
    learner.teach(X_pool[query_idx].reshape(1, -1), y_pool[query_idx].reshape(1, ))
    # Remove the queried instance from the pool
    X_pool, y_pool = np.delete(X_pool, query_idx, axis=0), np.delete(y_pool, query_idx, axis=0)

print("Active learning completed!")
Active learning is a powerful technique, especially when labeling resources are limited, allowing us to make the most out of small labeled datasets.