vfdev-5 for pytorch-ignite

Posted on Aug 10, 2021 • Originally published at labs.quansight.org

Introduction to PyTorch-Ignite

#deeplearning #machinelearning #pytorch #pytorchignite

This post is a general introduction of PyTorch-Ignite. It intends to give a brief but illustrative overview of what PyTorch-Ignite can offer for Deep Learning enthusiasts, professionals and researchers. Following the same philosophy as PyTorch, PyTorch-Ignite aims to keep it simple, flexible and extensible but performant and scalable.

Throughout this tutorial, we will introduce the basic concepts of PyTorch-Ignite with the training and evaluation of a MNIST classifier as a beginner application case. We also assume that the reader is familiar with PyTorch.

PyTorch-Ignite: What and Why ?

PyTorch-Ignite is a high-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.

PyTorch-Ignite is designed to be at the crossroads of high-level Plug & Play features and under-the-hood expansion possibilities.
PyTorch-Ignite aims to improve the deep learning community's technical skills by promoting best practices.
Things are not hidden behind a divine tool that does everything, but remain within the reach of users.

PyTorch-Ignite takes a "Do-It-Yourself" approach as research is unpredictable and it is important to capture its requirements without blocking things.

🔥 PyTorch + Ignite 🔥

PyTorch-Ignite wraps native PyTorch abstractions such as Modules, Optimizers, and DataLoaders in thin abstractions which allow your models to be separated from their training framework completely. This is achieved by a way of inverting control using an abstraction known as the Engine. The Engine is responsible for running an arbitrary function - typically a training or evaluation function - and emitting events along the way.

A built-in event system represented by the Events class ensures Engine's flexibility, thus facilitating interaction on each step of the run.
With this approach users can completely customize the flow of events during the run.

In summary, PyTorch-Ignite is

Extremely simple engine and event system = Training loop abstraction
Out-of-the-box metrics to easily evaluate models
Built-in handlers to compose training pipelines, save artifacts and log parameters and metrics

Additional benefits of using PyTorch-Ignite are

Less code than pure PyTorch while ensuring maximum control and simplicity
More modular code

PyTorch-Ignite	PyTorch

About the design of PyTorch-Ignite

PyTorch-Ignite allows you to compose your application without being focused on a super multi-purpose object, but rather on weakly coupled components allowing advanced customization.

The design of the library is guided by:

Anticipating new software or use-cases to come in in the future without centralizing everything in a single class.
Avoiding configurations with a ton of parameters that are complicated to manage and maintain.
Providing tools targeted to maximizing cohesion and minimizing coupling.
Keeping it simple.

Quick-start example

In this section we will use PyTorch-Ignite to build and train a classifier of the well-known MNIST dataset. This simple example will introduce the principal concepts behind PyTorch-Ignite.

For additional information and details about the API, please, refer to the project's documentation.

pip install pytorch-ignite

Common PyTorch code

First, we define our model, training and validation datasets, optimizer and loss function:

import os
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import SGD
from torch.utils.data import DataLoader

from torchvision.transforms import Compose, ToTensor, Normalize
from torchvision.datasets import MNIST

# transform to normalize the data
transform = Compose([ToTensor(), Normalize((0.1307,), (0.3081,))])

# Download and load the training data
trainset = MNIST("data", download=True, train=True, transform=transform)
train_loader = DataLoader(trainset, batch_size=128, shuffle=True)

# Download and load the test data
validationset = MNIST("data", train=False, transform=transform)
val_loader = DataLoader(validationset, batch_size=256, shuffle=False)

# Define a class of CNN model (as you want)
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=-1)

device = "cuda"

# Define a model on move it on CUDA device
model = Net().to(device)

# Define a NLL loss
criterion = nn.NLLLoss()

# Define a SGD optimizer
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.8)

The above code is pure PyTorch and is typically user-defined and is required for any pipeline.

Trainer and evaluator's setup

model's trainer is an engine that loops multiple times over the training dataset and updates model parameters. Let's see how we define such a trainer using PyTorch-Ignite. To do this, PyTorch-Ignite introduces the generic class Engine that is an abstraction that loops over the provided data, executes a processing function and returns a result. The only argument needed to construct the trainer is a train_step function.

from ignite.engine import Engine

def train_step(engine, batch):
    x, y = batch
    x = x.to(device)
    y = y.to(device)

    model.train()
    y_pred = model(x)
    loss = criterion(y_pred, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss

# Define a trainer engine
trainer = Engine(train_step)

Please note that train_step function must accept engine and batch arguments. In the example above, engine is not used inside train_step, but we can easily imagine a use-case where we would like to fetch certain information like current iteration, epoch or custom variables from the engine.

Similarly, model evaluation can be done with an engine that runs a single time over the validation dataset and computes metrics.

def validation_step(engine, batch):
    model.eval()
    with torch.no_grad():
        x, y = batch[0], batch[1]
        x = x.to("cuda")
        y = y.to("cuda")

        y_pred = model(x)

        return y_pred, y

evaluator = Engine(validation_step)

This allows the construction of training logic from the simplest to the most complicated scenarios.

The type of output of the process functions (i.e. loss or y_pred, y in the above examples) is not restricted. These functions can return everything the user wants. Output is set to an engine's internal object engine.state.output and can be used further for any type of processing.

Events and Handers

To improve the engine’s flexibility, a configurable event system is introduced to facilitate the interaction on each step of the run. Namely, Engine allows to add handlers on various Events that are triggered during the run. When an event is triggered, attached handlers (named functions, lambdas, class functions) are executed. Here is a schema for when built-in events are triggered by default:

fire_event(Events.STARTED)
while epoch < max_epochs:
    fire_event(Events.EPOCH_STARTED)
    # run once on data
    for batch in data:
        fire_event(Events.ITERATION_STARTED)

        output = process_function(batch)

        fire_event(Events.ITERATION_COMPLETED)
    fire_event(Events.EPOCH_COMPLETED)
fire_event(Events.COMPLETED)

Note that each engine (i.e. trainer and evaluator) has its own event system which allows to define its own engine's process logic.

Using Events and handlers, it is possible to completely customize the engine's runs in a very intuitive way:

from ignite.engine import Events

# Show a message when the training begins
@trainer.on(Events.STARTED)
def start_message():
    print("Start training!")

# Handler can be want you want, here a lambda !
trainer.add_event_handler(
    Events.COMPLETED,
    lambda _: print("Training completed!")
)

# Run evaluator on val_loader every trainer's epoch completed
@trainer.on(Events.EPOCH_COMPLETED)
def run_validation():
    evaluator.run(val_loader)

In the code above, the run_validation function is attached to the trainer and will be triggered at each completed epoch to launch model's validation with evaluator. This shows that engines can be embedded to create complex pipelines.

Handlers offer unparalleled flexibility compared to callbacks as they can be any function: e.g., a lambda, a simple function, a class method, etc. Thus, we do not require to inherit from an interface and override its abstract methods which could unnecessarily bulk up your code and its complexity.

The possibilities of customization are endless as PyTorch-Ignite allows you to get hold of your application workflow. As mentioned before, there is no magic nor fully automatated things in PyTorch-Ignite.

Model evaluation metrics

Metrics are another nice example of what the handlers for PyTorch-Ignite are and how to use them. In our example, we use the built-in metrics Accuracy and Loss.

from ignite.metrics import Accuracy, Loss

# Accuracy and loss metrics are defined
val_metrics = {
  "accuracy": Accuracy(),
  "loss": Loss(criterion)
}

# Attach metrics to the evaluator
for name, metric in val_metrics.items():
    metric.attach(evaluator, name)

PyTorch-Ignite metrics can be elegantly combined with each other.

from ignite.metrics import Precision, Recall

# Build F1 score
precision = Precision(average=False)
recall = Recall(average=False)
F1 = (precision * recall * 2 / (precision + recall)).mean()

# and attach it to evaluator
F1.attach(evaluator, "f1")

To make general things even easier, helper methods are available for the creation of a supervised Engine as above. Thus, let's define another evaluator applied to the training dataset in this way.

from ignite.engine import create_supervised_evaluator

# Define another evaluator with default validation function and attach metrics
train_evaluator = create_supervised_evaluator(model, metrics=val_metrics, device="cuda")

# Run train_evaluator on train_loader every trainer's epoch completed
@trainer.on(Events.EPOCH_COMPLETED)
def run_train_validation():
    train_evaluator.run(train_loader)

The reason why we want to have two separate evaluators (evaluator and train_evaluator) is that they can have different attached handlers and logic to perform. For example, if we would like store the best model defined by the validation metric value, this role is delegated to evaluator which computes metrics over the validation dataset.

Common training handlers

From now on, we have trainer which will call evaluators evaluator and train_evaluator at every completed epoch. Thus, each evaluator will run and compute corresponding metrics. In addition, it would be very helpful to have a display of the results that shows those metrics.

Using the customization potential of the engine's system, we can add simple handlers for this logging purpose:

@evaluator.on(Events.COMPLETED)
def log_validation_results():
    metrics = evaluator.state.metrics
    print("Validation Results - Epoch: {}  Avg accuracy: {:.2f} Avg loss: {:.2f} Avg F1: {:.2f}"
          .format(trainer.state.epoch, metrics["accuracy"], metrics["loss"], metrics["f1"]))

@train_evaluator.on(Events.COMPLETED)
def log_train_results():
    metrics = train_evaluator.state.metrics
    print("  Training Results - Epoch: {}  Avg accuracy: {:.2f} Avg loss: {:.2f}"
          .format(trainer.state.epoch, metrics["accuracy"], metrics["loss"]))

Here we attached log_validation_results and log_train_results handlers on Events.COMPLETED since evaluator and train_evaluator will run a single epoch over the validation datasets.

Let's see how to add some others helpful features to our application.

PyTorch-Ignite provides a ProgressBar handler to show an engine's progression.

from ignite.contrib.handlers import ProgressBar

ProgressBar().attach(trainer, output_transform=lambda x: {'batch loss': x})

ModelCheckpoint handler can be used to periodically save objects which have an attribute state_dict.

from ignite.handlers import ModelCheckpoint, global_step_from_engine

# Score function to select relevant metric, here f1
def score_function(engine):
    return engine.state.metrics["f1"]

# Checkpoint to store n_saved best models wrt score function
model_checkpoint = ModelCheckpoint(
    "quick-start-mnist-output",
    n_saved=2,
    filename_prefix="best",
    score_function=score_function,
    score_name="f1",
    global_step_transform=global_step_from_engine(trainer),
)

# Save the model (if relevant) every epoch completed of evaluator
evaluator.add_event_handler(Events.COMPLETED, model_checkpoint, {"model": model})

PyTorch-Ignite provides wrappers to modern tools to track experiments. For example, TensorBoardLogger handler allows to log metric results, model's and optimizer's parameters, gradients, and more during the training and validation for TensorBoard.

from ignite.contrib.handlers import TensorboardLogger

# Define a Tensorboard logger
tb_logger = TensorboardLogger(log_dir="quick-start-mnist-output")

# Attach handler to plot trainer's loss every 100 iterations
tb_logger.attach_output_handler(
    trainer,
    event_name=Events.ITERATION_COMPLETED(every=100),
    tag="training",
    output_transform=lambda loss: {"batchloss": loss},
)

# Attach handler to dump evaluator's metrics every epoch completed
for tag, evaluator in [("training", train_evaluator), ("validation", evaluator)]:
    tb_logger.attach_output_handler(
        evaluator,
        event_name=Events.EPOCH_COMPLETED,
        tag=tag,
        metric_names="all",
        global_step_transform=global_step_from_engine(trainer),
    )

It is possible to extend the use of the TensorBoard logger very simply by integrating user-defined functions. For example, here is how to display images and predictions during training:

import matplotlib.pyplot as plt

# Store predictions and scores using matplotlib
def predictions_gt_images_handler(engine, logger, *args, **kwargs):
    x, _ = engine.state.batch
    y_pred, y = engine.state.output
    # y_pred is log softmax value
    num_x = num_y = 8
    le = num_x * num_y
    probs, preds = torch.max(torch.exp(y_pred[:le]), dim=1)
    fig = plt.figure(figsize=(20, 20))
    for idx in range(le):
        ax = fig.add_subplot(num_x, num_y, idx + 1, xticks=[], yticks=[])
        ax.imshow(x[idx].squeeze(), cmap="Greys")
        ax.set_title("{0} {1:.1f}% (label: {2})".format(
            preds[idx],
            probs[idx] * 100.0,
            y[idx]),
            color=("green" if preds[idx] == y[idx] else "red")
        )
    logger.writer.add_figure('predictions vs actuals', figure=fig, global_step=trainer.state.epoch)

# Attach custom function to evaluator at first iteration
tb_logger.attach(
    evaluator,
    log_handler=predictions_gt_images_handler,
    event_name=Events.ITERATION_COMPLETED(once=1),
)

All that is left to do now is to run the trainer on data from train_loader for a number of epochs.

trainer.run(train_loader, max_epochs=5)

# Once everything is done, let's close the logger
tb_logger.close()

Start training!

Validation Results - Epoch: 1  Avg accuracy: 0.94 Avg loss: 0.20 Avg F1: 0.94
  Training Results - Epoch: 1  Avg accuracy: 0.94 Avg loss: 0.21

Validation Results - Epoch: 2  Avg accuracy: 0.96 Avg loss: 0.12 Avg F1: 0.96
  Training Results - Epoch: 2  Avg accuracy: 0.96 Avg loss: 0.13

Validation Results - Epoch: 3  Avg accuracy: 0.97 Avg loss: 0.10 Avg F1: 0.97
  Training Results - Epoch: 3  Avg accuracy: 0.97 Avg loss: 0.10

Validation Results - Epoch: 4  Avg accuracy: 0.98 Avg loss: 0.07 Avg F1: 0.98
  Training Results - Epoch: 4  Avg accuracy: 0.97 Avg loss: 0.09

Validation Results - Epoch: 5  Avg accuracy: 0.98 Avg loss: 0.07 Avg F1: 0.98
  Training Results - Epoch: 5  Avg accuracy: 0.98 Avg loss: 0.08
Training completed!

We can inspect results using tensorboard. We can observe two tabs "Scalars" and "Images".

%load_ext tensorboard

%tensorboard --logdir=.

5 takeaways

Almost any training logic can be coded as a train_step method and a trainer built using this method.
The essence of the library is the Engine class that loops a given number of times over a dataset and executes a processing function.
A highly customizable event system simplifies interaction with the engine on each step of the run.
PyTorch-Ignite provides a set of built-in handlers and metrics for common tasks.
PyTorch-Ignite is easy to extend.

Advanced features

In this section we would like to present some advanced features of PyTorch-Ignite for experienced users. We will cover events, handlers and metrics in more detail, as well as distributed computations on GPUs and TPUs. Feel free to skip this section now and come back later if you are a beginner.

Power of Events & Handlers

We have seen throughout the quick-start example that events and handlers are perfect to execute any number of functions whenever you wish. In addition to that we provide several ways to extend it even more by

Built-in events filtering
Stacking events to share the action
Adding custom events to go beyond built-in standard events

Let's look at these features in more detail.

Built-in events filtering

Users can simply filter out events to skip triggering the handler. Let's create a dummy trainer:

from ignite.engine import Engine, Events

trainer = Engine(lambda e, batch: None)

Let's consider a use-case where we would like to train a model and periodically run its validation on several development datasets, e.g. devset1 and devset2:

# We run the validation on devset1 every 5 epochs
@trainer.on(Events.EPOCH_COMPLETED(every=5))
def run_validation1():
    print("Epoch {}: Validation on devset 1".format(trainer.state.epoch))
    # evaluator.run(devset1)  # commented out for demo purposes

# We run another validation on devset2 every 10 epochs
@trainer.on(Events.EPOCH_COMPLETED(every=10))
def run_validation2():
    print("Epoch {}: Validation on devset 2".format(trainer.state.epoch))
    # evaluator.run(devset2)  # commented out for demo purposes

train_data = [0, 1, 2, 3, 4]
trainer.run(train_data, max_epochs=50)

Epoch 5: Validation on devset 1
Epoch 10: Validation on devset 1
Epoch 10: Validation on devset 2
Epoch 15: Validation on devset 1
Epoch 20: Validation on devset 1
Epoch 20: Validation on devset 2
Epoch 25: Validation on devset 1
Epoch 30: Validation on devset 1
Epoch 30: Validation on devset 2
Epoch 35: Validation on devset 1
Epoch 40: Validation on devset 1
Epoch 40: Validation on devset 2
Epoch 45: Validation on devset 1
Epoch 50: Validation on devset 1
Epoch 50: Validation on devset 2

Let's now consider another situation where we would like to make a single change once we reach a certain epoch or iteration. For example, let's change the training dataset on the 5-th epoch from low resolution images to high resolution images:

def train_step(e, batch):
    print("Epoch {} - {} : batch={}".format(e.state.epoch, e.state.iteration, batch))

trainer = Engine(train_step)

small_res_data = [0, 1, 2, ]
high_res_data = [10, 11, 12]

# We run the following handler once on 5-th epoch started
@trainer.on(Events.EPOCH_STARTED(once=5))
def change_train_dataset():
    print("Epoch {}: Change training dataset".format(trainer.state.epoch))
    trainer.set_data(high_res_data)

trainer.run(small_res_data, max_epochs=10)

Epoch 1 - 1 : batch=0
Epoch 1 - 2 : batch=1
Epoch 1 - 3 : batch=2
Epoch 2 - 4 : batch=0
Epoch 2 - 5 : batch=1
Epoch 2 - 6 : batch=2
Epoch 3 - 7 : batch=0
Epoch 3 - 8 : batch=1
Epoch 3 - 9 : batch=2
Epoch 4 - 10 : batch=0
Epoch 4 - 11 : batch=1
Epoch 4 - 12 : batch=2
Epoch 5: Change training dataset
Epoch 5 - 13 : batch=10
Epoch 5 - 14 : batch=11
Epoch 5 - 15 : batch=12
Epoch 6 - 16 : batch=10
Epoch 6 - 17 : batch=11
Epoch 6 - 18 : batch=12
Epoch 7 - 19 : batch=10
Epoch 7 - 20 : batch=11
Epoch 7 - 21 : batch=12
Epoch 8 - 22 : batch=10
Epoch 8 - 23 : batch=11
Epoch 8 - 24 : batch=12
Epoch 9 - 25 : batch=10
Epoch 9 - 26 : batch=11
Epoch 9 - 27 : batch=12
Epoch 10 - 28 : batch=10
Epoch 10 - 29 : batch=11
Epoch 10 - 30 : batch=12

Let's now consider another situation where we would like to trigger a handler with completely custom logic. For example, we would like to dump model gradients if the training loss satisfies a certain condition:

# Let's predefine for simplicity training losses
train_losses = [2.0, 1.9, 1.7, 1.5, 1.6, 1.2, 0.9, 0.8, 1.0, 0.8, 0.7, 0.4, 0.2, 0.1, 0.1, 0.01]

trainer = Engine(lambda e, batch: train_losses[e.state.iteration - 1])

# We define our custom logic when to execute a handler
def custom_event_filter(trainer, event):
    if 0.1 < trainer.state.output < 1.0:
        return True
    return False

# We run the following handler every iteration completed under our custom_event_filter condition:
@trainer.on(Events.ITERATION_COMPLETED(event_filter=custom_event_filter))
def dump_model_grads():
    print("{} - loss={}: dump model grads".format(trainer.state.iteration, trainer.state.output))

train_data = [0, ]
trainer.run(train_data, max_epochs=len(train_losses))

7 - loss=0.9: dump model grads
8 - loss=0.8: dump model grads
10 - loss=0.8: dump model grads
11 - loss=0.7: dump model grads
12 - loss=0.4: dump model grads
13 - loss=0.2: dump model grads

Stack events to share the action

A user can trigger the same handler on events of differen types. For example, let's run a handler for model's validation every 3 epochs and when the training is completed:

trainer = Engine(lambda e, batch: None)

@trainer.on(Events.EPOCH_COMPLETED(every=3) | Events.COMPLETED)
def run_validation():
    print("Epoch {} - event={}: Validation".format(trainer.state.epoch, trainer.last_event_name))
    # evaluator.run(devset)

train_data = [0, 1, 2, 3, 4]
trainer.run(train_data, max_epochs=20)

Epoch 3 - event=epoch_completed: Validation
Epoch 6 - event=epoch_completed: Validation
Epoch 9 - event=epoch_completed: Validation
Epoch 12 - event=epoch_completed: Validation
Epoch 15 - event=epoch_completed: Validation
Epoch 18 - event=epoch_completed: Validation
Epoch 20 - event=completed: Validation

Add custom events

A user can add their own events to go beyond built-in standard events. For example, let's define new events related to backward and optimizer step calls. This can help us to attach specific handlers on these events in a configurable manner.

from ignite.engine import EventEnum


class BackpropEvents(EventEnum):
    BACKWARD_STARTED = 'backward_started'
    BACKWARD_COMPLETED = 'backward_completed'
    OPTIM_STEP_COMPLETED = 'optim_step_completed'


def update(engine, batch):
    # ...
    # loss = criterion(y_pred, y)
    engine.fire_event(BackpropEvents.BACKWARD_STARTED)
    # loss.backward()
    engine.fire_event(BackpropEvents.BACKWARD_COMPLETED)
    # optimizer.step()
    engine.fire_event(BackpropEvents.OPTIM_STEP_COMPLETED)
    # ...

trainer = Engine(update)
trainer.register_events(*BackpropEvents)

def function_before_backprop():
    print("{} - before backprop".format(trainer.state.iteration))

trainer.add_event_handler(BackpropEvents.BACKWARD_STARTED, function_before_backprop)

def function_after_backprop():
    print("{} - after backprop".format(trainer.state.iteration))

trainer.add_event_handler(BackpropEvents.BACKWARD_COMPLETED, function_after_backprop)

train_data = [0, 1, 2, 3, 4]
trainer.run(train_data, max_epochs=2)

1 - before backprop
1 - after backprop
2 - before backprop
2 - after backprop
3 - before backprop
3 - after backprop
4 - before backprop
4 - after backprop
5 - before backprop
5 - after backprop
6 - before backprop
6 - after backprop
7 - before backprop
7 - after backprop
8 - before backprop
8 - after backprop
9 - before backprop
9 - after backprop
10 - before backprop
10 - after backprop

Out-of-the-box metrics

PyTorch-Ignite provides an ensemble of metrics dedicated to many Deep Learning tasks (classification, regression, segmentation, etc.). Most of these metrics provide a way to compute various quantities of interest in an online fashion without having to store the entire output history of a model.

For classification : Precision, Recall, Accuracy, ConfusionMatrix and more!
For segmentation : DiceCoefficient, IoU, mIOU and more!
~20 regression metrics, e.g. MSE, MAE, MedianAbsoluteError, etc
Metrics that store the entire output history per epoch
- Possible to use with scikit-learn metrics, e.g. EpochMetric, AveragePrecision, ROC_AUC, etc
Easily composable to assemble a custom metric
Easily extendable to create custom metrics

Complete lists of metrics provided by PyTorch-Ignite can be found here for ignite.metrics and here for ignite.contrib.metrics.

Two kinds of public APIs are provided:

metric is attached to Engine
metric's reset, update, compute methods

More on the `reset`, `update`, `compute` public API

Let's demonstrate this API on a simple example using the Accuracy metric. The idea behind this API is that we accumulate internally certain counters on each update call. The metric's value is computed on each compute call and counters are reset on each reset call.

import torch
from ignite.metrics import Accuracy

acc = Accuracy()

# Start accumulation
acc.reset()

y_target = torch.tensor([0, 1, 2, 1,])
# y_pred is logits computed by the model
y_pred = torch.tensor([
    [10.0, 0.1, -1.0],  # correct
    [2.0, -1.0, -2.0],  # incorrect
    [1.0, -1.0, 4.0],   # correct
    [0.0, 5.0, -1.0],   # correct
])
acc.update((y_pred, y_target))

# Compute accuracy on 4 samples
print("After 1st update, accuracy=", acc.compute())

y_target = torch.tensor([1, 2, 0, 2])
# y_pred is logits computed by the model
y_pred = torch.tensor([
    [2.0, 1.0, -1.0],   # incorrect
    [0.0, 1.0, -2.0],   # incorrect
    [2.6, 1.0, -4.0],   # correct
    [1.0, -3.0, 2.0],   # correct
])
acc.update((y_pred, y_target))

# Compute accuracy on 8 samples
print("After 2nd update, accuracy=", acc.compute())

After 1st update, accuracy= 0.75
After 2nd update, accuracy= 0.625

Composable metrics

Users can compose their own metrics with ease from existing ones using arithmetic operations or PyTorch methods. For example, an error metric defined as 100 * (1.0 - accuracy) can be coded in a straightforward manner:

import torch
from ignite.metrics import Accuracy

acc = Accuracy()
error = 100.0 * (1.0 - acc)

# Start accumulation
acc.reset()

y_target = torch.tensor([0, 1, 2, 1,])
# y_pred is logits computed by the model
y_pred = torch.tensor([
    [10.0, 0.1, -1.0],  # correct
    [2.0, -1.0, -2.0],  # incorrect
    [1.0, -1.0, 4.0],   # correct
    [0.0, 5.0, -1.0],   # correct
])
acc.update((y_pred, y_target))

# Compute error on 4 samples
print("After 1st update, error=", error.compute())

y_target = torch.tensor([1, 2, 0, 2])
# y_pred is logits computed by the model
y_pred = torch.tensor([
    [2.0, 1.0, -1.0],   # incorrect
    [0.0, 1.0, -2.0],   # incorrect
    [2.6, 1.0, -4.0],   # correct
    [1.0, -3.0, 2.0],   # correct
])
acc.update((y_pred, y_target))

# Compute err on 8 samples
print("After 2nd update, error=", error.compute())

After 1st update, error= 25.0
After 2nd update, error= 37.5

In case a custom metric can not be expressed as arithmetic operations of base metrics, please follow this guide to implement the custom metric.

Out-of-the-box handlers

PyTorch-Ignite provides various commonly used handlers to simplify application code:

Common training handlers: Checkpoint, EarlyStopping, Timer, TerminateOnNan
Optimizer's parameter scheduling (learning rate, momentum, etc.)
- concatenate schedulers, add warm-up, cyclical scheduling, piecewise-linear scheduling, and more! See examples.
Time profiling
Logging to experiment tracking systems:
- Tensorboard, Visdom, MLflow, Polyaxon, Neptune, Trains, etc.

Complete lists of handlers provided by PyTorch-Ignite can be found here for ignite.handlers and here for ignite.contrib.handlers.

Common training handlers

With the out-of-the-box Checkpoint handler, a user can easily save the training state or best models to the filesystem or a cloud.

EarlyStopping and TerminateOnNan helps to stop the training if overfitting or diverging.

All those things can be easily added to the trainer one by one or with helper methods.

Let's consider an example of using helper methods.

import torch
import torch.nn as nn
import torch.optim as optim

from ignite.engine import create_supervised_trainer, create_supervised_evaluator, Events
from ignite.metrics import Accuracy
import ignite.contrib.engines.common as common

train_data = [[torch.rand(2, 4), torch.randint(0, 5, size=(2, ))] for _ in range(10)]
val_data = [[torch.rand(2, 4), torch.randint(0, 5, size=(2, ))] for _ in range(10)]
epoch_length = len(train_data)

model = nn.Linear(4, 5)
optimizer = optim.SGD(model.parameters(), lr=0.01)
# step_size is expressed in iterations
lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=epoch_length, gamma=0.88)

# Let's define some dummy trainer and evaluator
trainer = create_supervised_trainer(model, optimizer, nn.CrossEntropyLoss())
evaluator = create_supervised_evaluator(model, metrics={"accuracy": Accuracy()})


@trainer.on(Events.EPOCH_COMPLETED)
def run_validation():
    evaluator.run(val_data)

# training state to save
to_save = {
    "trainer": trainer, "model": model,
    "optimizer": optimizer, "lr_scheduler": lr_scheduler
}
metric_names = ["batch loss", ]

common.setup_common_training_handlers(
    trainer=trainer,
    to_save=to_save,
    output_path="checkpoints",
    save_every_iters=epoch_length,
    lr_scheduler=lr_scheduler,
    output_names=metric_names,
    with_pbars=True,
)

tb_logger = common.setup_tb_logging("tb_logs", trainer, optimizer, evaluators=evaluator)

common.save_best_model_by_val_score(
    "best_models",
    evaluator=evaluator,
    model=model,
    metric_name="accuracy",
    n_saved=2,
    trainer=trainer,
    tag="val",
)

trainer.run(train_data, max_epochs=5)

tb_logger.close()

HBox(children=(FloatProgress(value=0.0, max=5.0), HTML(value='')))

ls -all "checkpoints"
ls -all "best_models"
ls -all "tb_logs"

total 12
drwxr-xr-x 2 root root 4096 Aug 31 11:27 .
drwxr-xr-x 1 root root 4096 Aug 31 11:27 ..
-rw------- 1 root root 1657 Aug 31 11:27 training_checkpoint_50.pt
total 16
drwxr-xr-x 2 root root 4096 Aug 31 11:27  .
drwxr-xr-x 1 root root 4096 Aug 31 11:27  ..
-rw------- 1 root root 1145 Aug 31 11:27 'best_model_2_val_accuracy=0.3000.pt'
-rw------- 1 root root 1145 Aug 31 11:27 'best_model_3_val_accuracy=0.3000.pt'
total 12
drwxr-xr-x 2 root root 4096 Aug 31 11:27 .
drwxr-xr-x 1 root root 4096 Aug 31 11:27 ..
-rw-r--r-- 1 root root  325 Aug 31 11:27 events.out.tfevents.1598873224.3aa7adc24d3d.115.1

In the above code, the common.setup_common_training_handlers method adds TerminateOnNan, adds a handler to use lr_scheduler (expressed in iterations), adds training state checkpointing, exposes batch loss output as exponential moving averaged metric for logging, and adds a progress bar to the trainer.

Next, the common.setup_tb_logging method returns a TensorBoard logger which is automatically configured to log trainer's metrics (i.e. batch loss), optimizer's learning rate and evaluator's metrics.

Finally, common.save_best_model_by_val_score sets up a handler to save the best two models according to the validation accuracy metric.

Distributed and XLA device support

PyTorch offers a distributed communication package for writing and running parallel applications on multiple devices and machines.
The native interface provides commonly used collective operations and allows to address multi-CPU and multi-GPU computations seamlessly using the torch DistributedDataParallel module and the well-known mpi, gloo and nccl backends.
Recently, users can also run PyTorch on XLA devices, like TPUs, with the torch_xla package.

However, writing distributed training code working on GPUs and TPUs is not a trivial task due to some API specificities.
The purpose of the PyTorch-Ignite ignite.distributed package introduced in version 0.4 is to unify the code for native torch.distributed API, torch_xla API on XLA devices and also supporting other distributed frameworks (e.g. Horovod).
To make distributed configuration setup easier, the Parallel context manager has been introduced:

import ignite.distributed as idist

def training(local_rank, config, **kwargs):
    print(idist.get_rank(), ': run with config:', config, '- backend=', idist.backend())
    # do the training ...

backend = 'gloo' # or "nccl" or "xla-tpu"
dist_configs = {'nproc_per_node': 2}
# dist_configs["start_method"] = "fork"  # If using Jupyter Notebook
config = {'c': 12345}

with idist.Parallel(backend=backend, **dist_configs) as parallel:
    parallel.run(training, config, a=1, b=2)

2020-08-31 11:27:07,128 ignite.distributed.launcher.Parallel INFO: Initialized distributed launcher with backend: 'gloo'
2020-08-31 11:27:07,128 ignite.distributed.launcher.Parallel INFO: - Parameters to spawn processes: 
    nproc_per_node: 2
    nnodes: 1
    node_rank: 0
2020-08-31 11:27:07,128 ignite.distributed.launcher.Parallel INFO: Spawn function '<function training at 0x7f32b8ac9d08>' in 2 processes
0 : run with config: {'c': 12345} - backend= gloo
1 : run with config: {'c': 12345} - backend= gloo
2020-08-31 11:27:09,959 ignite.distributed.launcher.Parallel INFO: End of run

The above code with a single modification can run on a GPU, single-node multiple GPUs, single or multiple TPUs etc. It can be executed with the torch.distributed.launch tool or by Python and spawning the required number of processes. For more details, see the documentation.

In addition, methods like auto_model(), auto_optim() and auto_dataloader() help to adapt in a transparent way the provided model, optimizer and data loaders to an existing configuration:

# main.py

import ignite.distributed as idist

def training(local_rank, config, **kwargs):

    print(idist.get_rank(), ": run with config:", config, "- backend=", idist.backend())

    train_loader = idist.auto_dataloader(dataset, batch_size=32, num_workers=12, shuffle=True, **kwargs)
    # batch size, num_workers and sampler are automatically adapted to existing configuration
    # ...
    model = resnet50()
    optimizer = optim.SGD(model.parameters(), lr=0.01)

    # if training with Nvidia/Apex for Automatic Mixed Precision (AMP)
    # model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)

    model = idist.auto_model(model)
    # model is DDP or DP or just itself according to existing configuration
    # ...
    optimizer = idist.auto_optim(optimizer)
    # optimizer is itself, except XLA configuration and overrides `step()` method.
    # User can safely call `optimizer.step()` (behind `xm.optimizer_step(optimizier)` is performed)

backend = "nccl"  # torch native distributed configuration on multiple GPUs
# backend = "xla-tpu"  # XLA TPUs distributed configuration
# backend = None  # no distributed configuration
with idist.Parallel(backend=backend, **dist_configs) as parallel:
    parallel.run(training, config, a=1, b=2)

Please note that these auto_* methods are optional; a user is free use some of them and manually set up certain parts of the code if required. The advantage of this approach is that there is no under the hood inevitable objects' patching and overriding.

More details about distributed helpers provided by PyTorch-Ignite can be found in the documentation.
A complete example of training on CIFAR10 can be found here.

A detailed tutorial with distributed helpers is published here.

Next steps

To learn more about PyTorch-Ignite, please check out our website: https://pytorch-ignite.ai and our tutorials and how-to guides.

We also provide PyTorch-Ignite code-generator application: https://code-generator.pytorch-ignite.ai/ to start working on tasks without rewriting everything from scratch.

PyTorch-Ignite's code is available on the GitHub: https://github.com/pytorch/ignite . The project is a community effort, and everyone is welcome to contribute and join the contributors community no matter your background and skills !

Keep updated with all PyTorch-Ignite news by following us on Twitter and Facebook.

DEV Community

Introduction to PyTorch-Ignite

PyTorch-Ignite: What and Why ?

🔥 PyTorch + Ignite 🔥

About the design of PyTorch-Ignite

Quick-start example

Common PyTorch code

Trainer and evaluator's setup

Events and Handers

Model evaluation metrics

Common training handlers

5 takeaways

Advanced features

Power of Events & Handlers

Built-in events filtering

Stack events to share the action

Add custom events

Out-of-the-box metrics

More on the `reset`, `update`, `compute` public API

Composable metrics

Out-of-the-box handlers

Common training handlers

Distributed and XLA device support

Next steps

Top comments (0)

Read next

Review: The New NVIDIA Jetson Orin Nano

The Limitations of Machine Learning: What We Still Can't Teach Machines

Predicting House Rent with Linear Regression in Python

Enhancing Generative AI with Persistent Memory

PyTorch-Ignite: What and Why ?

🔥 PyTorch + Ignite 🔥

About the design of PyTorch-Ignite

Quick-start example

Common PyTorch code

Trainer and evaluator's setup

Events and Handers

Model evaluation metrics

Common training handlers

5 takeaways

Advanced features

Power of Events & Handlers

Built-in events filtering

Stack events to share the action

Add custom events

Out-of-the-box metrics

More on the reset, update, compute public API

Composable metrics

Out-of-the-box handlers

Common training handlers

Distributed and XLA device support

Next steps

Read next

Review: The New NVIDIA Jetson Orin Nano

The Limitations of Machine Learning: What We Still Can't Teach Machines

Predicting House Rent with Linear Regression in Python

Enhancing Generative AI with Persistent Memory

More on the `reset`, `update`, `compute` public API