Seenivasa Ramadurai
Unleash the Power of Bagging: A Practical Guide to Ensemble Learning - PART II

Introduction to Bagging in Ensemble Learning

Ensemble learning is a powerful machine learning technique that combines the predictions of multiple models to achieve better accuracy and robustness. The idea behind ensemble methods is simple: instead of relying on a single model to make predictions, we combine the strengths of several models to improve overall performance. Think of it like a team of experts with different perspectives working together to solve a problem—each brings something valuable to the table.

In this blog, we are going to implement one of the most popular ensemble methods called bagging (Bootstrap Aggregating). Bagging aims to improve the stability and accuracy of machine learning models by training multiple models on different subsets of data and then aggregating their predictions. A key concept in bagging is sampling with replacement, where subsets of the data are randomly selected with the possibility of selecting the same data point multiple times. This helps create diverse training sets, allowing the model to learn different patterns and reduce overfitting.
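To make sampling with replacement concrete, here is a minimal sketch; the ten-row dataset and random seed are made up purely for illustration:

import numpy as np

rng = np.random.default_rng(42)
row_indices = np.arange(10)  # pretend these are the indices of 10 training rows

# Bootstrap sample: same size as the original data, drawn with replacement,
# so some rows repeat while others are left out ("out-of-bag" rows).
bootstrap_sample = rng.choice(row_indices, size=len(row_indices), replace=True)
print("Bootstrap sample:", bootstrap_sample)
print("Out-of-bag rows: ", np.setdiff1d(row_indices, bootstrap_sample))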

By the end of this post, you'll understand how bagging works, why it's effective, and how we can use it to improve the performance of machine learning models. Let's dive in and explore the magic behind this ensemble technique!

What is Bagging? Improving Prediction Accuracy with Ensemble Learning

In the world of machine learning, ensemble learning is a powerful method that combines multiple models to improve the predictive performance of a system. One of the most popular ensemble methods is Bagging, short for Bootstrap Aggregating. This method combines multiple models to generate more accurate and stable predictions by reducing variance and enhancing overall model reliability. Let’s dive into the mechanics of how bagging works, explore popular algorithms that leverage it, and discuss its advantages in real-world applications.

How Does Bagging Work?

At its core, Bagging aims to improve the predictive accuracy of a model by training multiple base models on different subsets of the training data and combining their predictions. Here's how bagging works in a step-by-step breakdown:

1. Creating Multiple Base Models:

Bagging starts by creating multiple base models. These are typically high-variance learners (like fully grown decision trees) that tend to overfit when used on their own. The key to bagging lies in how these models are trained: each base model is trained on a different subset of the original training data. These subsets are created using bootstrap sampling, meaning the data is sampled randomly with replacement, so some data points may appear multiple times in a subset while others are left out entirely.

2. Training Each Model Independently:

Each model in the ensemble is trained on its own subset of the data. By training the models independently, we ensure that they learn different patterns and nuances in the data, which increases the diversity among the models.

3. Making Predictions:

After training, each model independently makes its own predictions on new, unseen data, giving the ensemble one prediction per base model for every sample.

4. Aggregating the Predictions:

The predictions from all models are aggregated to make the final prediction. In regression tasks, this is usually done by averaging the predictions, while in classification tasks, a voting system is employed where the class with the most votes from the base models is chosen as the final class.

By using this process of aggregation, bagging reduces the impact of any individual model’s errors, which results in a more stable and robust overall prediction.
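Before reaching for a library, it can help to see the four steps above written out by hand. The following sketch rebuilds bagging manually on the iris dataset (the number of models and the random seeds are illustrative choices); scikit-learn's BaggingClassifier, used later in this post, automates exactly this loop:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rng = np.random.default_rng(42)
n_models = 25
all_predictions = []

for _ in range(n_models):
    # Step 1: draw a bootstrap sample (with replacement) of the training rows
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    # Step 2: train an independent base model on that sample
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    # Step 3: let the model predict on unseen data
    all_predictions.append(tree.predict(X_test))

# Step 4: aggregate by majority vote (for regression we would average instead)
votes = np.stack(all_predictions)  # shape: (n_models, n_test_samples)
final_prediction = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("Manual bagging accuracy:", (final_prediction == y_test).mean())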

Popular Algorithms that Use Bagging

While bagging can be applied to many different base models, a few algorithms stand out in their widespread use and popularity:

1. Random Forest:

Random Forest is perhaps the most well-known algorithm that uses bagging. It’s essentially a collection of decision trees, each trained on a random subset of the data with bootstrap sampling. Additionally, Random Forest introduces an extra layer of randomness by selecting a random subset of features at each node split, ensuring more diversity among the trees. Once all trees have made their predictions, the results are aggregated to provide the final output.
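As a quick illustration, here is a minimal Random Forest on the iris dataset; 100 trees and 5-fold cross-validation are illustrative choices, not tuned values:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=42)
print(f"Random Forest CV accuracy: {cross_val_score(forest, X, y, cv=5).mean():.2f}")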

2. Bagged Decision Trees:

As the name suggests, Bagged Decision Trees use the decision tree model as the base learner and employ bagging to improve the model’s accuracy. It’s particularly useful for reducing variance and preventing overfitting when working with decision trees.

3. Bagged SVM and Neural Networks:

In addition to decision trees, bagging can also be applied to other base models like Support Vector Machines (SVM) and neural networks. These models benefit from the diversity introduced by bagging, resulting in improved performance and stability.
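As a rough sketch, swapping the base learner is a one-line change in scikit-learn. The SVC below uses default settings purely for illustration, and the estimator= argument assumes scikit-learn 1.2 or later (older versions call it base_estimator):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
bagged_svm = BaggingClassifier(estimator=SVC(), n_estimators=10, random_state=42)
print(f"Bagged SVM CV accuracy: {cross_val_score(bagged_svm, X, y, cv=5).mean():.2f}")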

Benefits and Advantages of Bagging

Bagging has several compelling advantages that make it a go-to technique in machine learning. Let’s explore some of its key benefits:

Variance Reduction:

By training multiple models on different subsets of data, bagging helps to reduce the variance of the final model. This leads to more reliable and stable predictions, as the errors of individual models tend to cancel each other out.

Improved Robustness:

Bagging enhances robustness by making the model less sensitive to outliers and noisy data points. Since the base models are trained on different subsets, the impact of noise in the data is minimized, leading to better generalization.

Ensemble Generalization:

Combining the predictions of multiple models helps bagging capture complex patterns in the data, resulting in more accurate predictions on unseen data.

Flexibility with Base Models:

Bagging is versatile and can be used with various base models, including decision trees, support vector machines, and neural networks. This makes it applicable to a wide range of problems and domains.

Scalability:

Bagging can be easily parallelized because each base model is trained independently. This makes it highly scalable, allowing it to be used with large datasets and distributed computing environments.

Interpretability:

Despite being an ensemble technique, bagging retains the interpretability of the base models, particularly when decision trees are used. This allows practitioners to gain insights into the patterns and relationships in the data.
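Here is a short sketch illustrating two of these benefits, parallel training (scalability) and inspecting an individual base tree (interpretability); the number of estimators is an illustrative choice:

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    n_jobs=-1,        # base models are independent, so fit them on all CPU cores
    random_state=42,
)
bagging.fit(iris.data, iris.target)

# Each fitted base tree is exposed via `estimators_`, so we can still read
# feature importances (or plot the tree) for any individual member.
first_tree = bagging.estimators_[0]
for name, importance in zip(iris.feature_names, first_tree.feature_importances_):
    print(f"{name}: {importance:.2f}")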

Real-world Use Cases of Bagging

Bagging is widely used across various industries for different tasks. Here are a few real-world applications:

  • Finance:

Bagging techniques are used in finance for risk assessment, credit scoring, and fraud detection. By combining multiple models, financial institutions can make more accurate predictions about creditworthiness and fraudulent activity.

  • Healthcare:

In healthcare, bagging helps with disease diagnosis and prognosis. By aggregating information from multiple sources, such as medical records and diagnostic tests, bagging models can predict the likelihood of diseases like cancer or heart conditions.

  • Marketing:

Bagging can be applied in marketing to predict customer churn and identify retention strategies. By analyzing customer data and behavior, businesses can tailor their marketing efforts to retain high-risk customers.

  • E-commerce:

Product recommendation systems often rely on bagging to improve prediction accuracy. By aggregating recommendations from multiple models, e-commerce platforms can better personalize recommendations for their users.

Python Code Example: Implementing Bagging with Scikit-Learn

Let’s see how we can implement bagging in Python using the Scikit-learn library.

"""
Iris Classification using Bagging Classifier

This script implements a machine learning model using the Bagging Classifier algorithm
to predict iris flower species. It includes both model training and a FastAPI web service
for making predictions.

The model is trained on the iris dataset and can predict three species:
- setosa
- versicolor
- virginica

Author: Sreeni Ramadurai
Date: 2025-March-04
"""

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import numpy as np
import uvicorn
from contextlib import contextmanager, asynccontextmanager
from typing import Generator
from fastapi import FastAPI
from pydantic import BaseModel, Field
import pandas as pd
#print the decision tree as a graph
from sklearn.tree import export_graphviz
import graphviz
import os

# Context manager for model loading
@contextmanager
def get_model() -> Generator:
    """
    Context manager for safely loading the trained model from disk.

    Yields:
        model: The loaded scikit-learn model object

    Example:
        with get_model() as model:
            prediction = model.predict(data)
    """
    try:
        with open('bagging_model.pkl', 'rb') as f:
            model = pickle.load(f)
            yield model
    finally:
        pass  # Clean up if needed

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Data preprocessing and visualization
print("Sample of first 5 rows of features:")
print(X[:5])
print("\nSample of first 5 target values:")
print(y[:5])

# Create a pandas DataFrame for better data visualization
df = pd.DataFrame(X, columns=iris.feature_names)
df['target'] = y
target_dict = dict(enumerate(iris.target_names))
df['target_names'] = df['target'].map(target_dict)
print("\nFirst 5 rows of the dataset:")
print(df.head())

# Model Training Pipeline
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the base Decision Tree classifier
tree = DecisionTreeClassifier(random_state=42)

# Initialize the Bagging classifier
bagging = BaggingClassifier(estimator=tree, n_estimators=100, random_state=42)

# Train the Bagging classifier
bagging.fit(X_train, y_train)

# Model Evaluation
y_pred = bagging.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"\nBagging classifier accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Visualize the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', 
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# Save the trained model
with open('bagging_model.pkl', 'wb') as f:
    pickle.dump(bagging, f)

# Test model loading and prediction
with open('bagging_model.pkl', 'rb') as f:
    bagging = pickle.load(f)

# Sample prediction
new_data = np.array([[5.1, 3.5, 1.4, 0.2]])
prediction = bagging.predict(new_data)
print(f"\nSample Prediction: {prediction}")
print(f"Predicted Species: {iris.target_names[prediction]}")

# Visualize decision tree (if Graphviz is installed)
try:
    # Configure Graphviz executable path
    os.environ["PATH"] += os.pathsep + r"C:\Program Files\Graphviz\bin"

    # Export and save the decision tree directly to a file
    dot_file = "iris_decision_tree.dot"
    export_graphviz(bagging.estimators_[0],
                   out_file=dot_file,
                   feature_names=iris.feature_names,  
                   class_names=iris.target_names,
                   filled=True,
                   rounded=True,
                   special_characters=True)

    # Convert dot file to PNG using graphviz
    graph = graphviz.Source.from_file(dot_file)
    graph.render(filename="iris_decision_tree", format="png", cleanup=True)
    print("\nDecision tree visualization saved as 'iris_decision_tree.png'")
except Exception as e:
    print("\nNote: To visualize the decision tree, please install Graphviz:")
    print("1. Download from: https://graphviz.org/download/")
    print("2. Add to system PATH")
    print("3. Run: pip install graphviz")
    print(f"\nError details: {str(e)}")

# FastAPI Implementation
class IrisInput(BaseModel):
    """
    Pydantic model for input data validation.

    Attributes:
        sepal_length (float): Length of sepal in cm
        sepal_width (float): Width of sepal in cm
        petal_length (float): Length of petal in cm
        petal_width (float): Width of petal in cm
    """
    sepal_length: float = Field(..., gt=0, description="Length of sepal in cm")
    sepal_width: float = Field(..., gt=0, description="Width of sepal in cm")
    petal_length: float = Field(..., gt=0, description="Length of petal in cm")
    petal_width: float = Field(..., gt=0, description="Width of petal in cm")

    class Config:
        json_schema_extra = {
            "example": {
                "sepal_length": 5.1,
                "sepal_width": 3.5,
                "petal_length": 1.4,
                "petal_width": 0.2
            }
        }

@asynccontextmanager
async def lifespan(app: FastAPI):
    """
    Lifecycle manager for the FastAPI application.
    Verifies model availability on startup.
    """
    # Verify the model can be loaded before serving requests
    with get_model() as model:
        pass
    yield
    # Cleanup code here if needed

app = FastAPI(
    title="Iris Species Prediction API",
    description="""
    This API predicts the species of iris flowers using a Bagging Classifier.

    The model accepts four measurements of an iris flower and returns the predicted species.
    Measurements required:
    - Sepal length (cm)
    - Sepal width (cm)
    - Petal length (cm)
    - Petal width (cm)
    """,
    version="1.0.0",
    lifespan=lifespan  # register the startup/shutdown handler
)

@app.get("/", tags=["Status"])
def read_root():
    """
    Root endpoint to check API status.

    Returns:
        dict: Status message indicating the API is running
    """
    return {"message": "Iris Species Prediction API is running successfully"}

@app.post("/predict", tags=["Prediction"])
def predict(data: IrisInput):
    """
    Predict iris species based on flower measurements.

    Args:
        data (IrisInput): Input measurements of the iris flower

    Returns:
        dict: Predicted species of the iris flower
    """
    with get_model() as model:
        new_data = np.array([[
            data.sepal_length,
            data.sepal_width,
            data.petal_length,
            data.petal_width
        ]])
        prediction = model.predict(new_data)
        return {
            "prediction": iris.target_names[prediction[0]],
            "input_data": data.model_dump()
        }

# Run the application
if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=5018)



[Image: iris_decision_tree.png, the decision tree exported by the script above]

Note: scikit-learn's DecisionTreeClassifier uses the Gini impurity criterion by default (criterion='gini').

Gini Impurity:

  • Measures how "impure" or mixed a dataset is.
  • The goal is to choose a split that results in the lowest Gini value, indicating more homogeneity in the subsets.
  • A lower Gini value means fewer class mixtures, leading to a cleaner split.

Entropy (Information Gain):

  • Measures the uncertainty or disorder in a dataset.
  • The goal is to maximize Information Gain, which is the reduction in entropy after a split.
  • A higher Information Gain indicates a better split that reduces uncertainty in the data.

In Decision Trees:

  • Gini minimizes impurity, and Entropy maximizes the reduction in uncertainty.
  • Both are used to select the best feature to split the data at each node, with the aim of creating the purest possible subsets.
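For reference, both criteria can be computed directly from a node's class proportions; the example proportions below are made up for illustration:

import numpy as np

def gini(proportions):
    """Gini impurity: 1 - sum(p_i^2); 0 means the node is pure."""
    p = np.asarray(proportions)
    return 1.0 - np.sum(p ** 2)

def entropy(proportions):
    """Entropy: -sum(p_i * log2(p_i)); 0 means the node is pure."""
    p = np.asarray(proportions)
    p = p[p > 0]  # skip empty classes to avoid log2(0)
    return -np.sum(p * np.log2(p))

print(gini([1/3, 1/3, 1/3]), entropy([1/3, 1/3, 1/3]))  # fully mixed node
print(gini([1.0, 0.0, 0.0]), entropy([1.0, 0.0, 0.0]))  # pure node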

Evaluating Model Prediction via Exposed REST Endpoint
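With the script above running, the endpoint can be exercised from any HTTP client. Here is a small sketch using the requests library; the host, port (5018), and sample measurements mirror the code above, so adjust them for your environment:

import requests

payload = {
    "sepal_length": 5.1,
    "sepal_width": 3.5,
    "petal_length": 1.4,
    "petal_width": 0.2,
}
response = requests.post("http://localhost:5018/predict", json=payload)
print(response.json())  # e.g. {"prediction": "setosa", "input_data": {...}}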

[Image: testing the API with the first 5 rows of the dataset]

[Image: example input and output from the /predict endpoint]

Conclusion

Bagging, or Bootstrap Aggregating, is a powerful ensemble learning technique that helps improve predictive performance by combining multiple models trained on different data subsets. With its ability to reduce variance, enhance robustness, and handle complex datasets, bagging has become a staple in machine learning. Whether you're working with decision trees, support vector machines, or neural networks, bagging can elevate your model's accuracy and stability, making it a go-to method for many real-world problems. Now you have the knowledge and code to unleash the power of Bagging in your own machine learning projects!

Thanks
Sreeni Ramadurai
