FAQ#

This section collects answers to frequently asked questions and some best practices for using Ignite.

Each engine has its own Events#

It is important to understand that engines have their own events. For example, we defined a trainer and an evaluator:

@trainer.on(Events.EPOCH_COMPLETED)
def in_training_loop_on_epoch_completed(engine):
    evaluator.run(val_loader) # this starts another loop on validation dataset to compute metrics

@evaluator.on(Events.COMPLETED)
def when_validation_loop_is_done(engine):
    # do something with computed metrics etc
    # -> early stopping or reduce LR on plateau
    # or just log them
    pass

The trainer engine has its own loop and runs multiple times over the training dataset. When a training epoch is over, we launch the evaluator engine, which runs a single pass over the validation dataset. The evaluator has its own loop, but because it runs only one epoch, Events.EPOCH_COMPLETED is equivalent to Events.COMPLETED for it. As a consequence, the following code is correct too:

handler = EarlyStopping(patience=10, score_function=score_function, trainer=trainer)
evaluator.add_event_handler(Events.COMPLETED, handler)

best_model_saver = ModelCheckpoint('/tmp/models', 'best', score_function=score_function)
evaluator.add_event_handler(Events.COMPLETED, best_model_saver, {'mymodel': model})

For more details, see Events and Handlers.
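
The separation between the two engines can be illustrated with a small plain-Python model. This is only a sketch: ToyEngine and its fire method are hypothetical stand-ins written for this example and are not Ignite's implementation. It shows that each engine keeps its own handler registry, and that an engine run for a single epoch fires EPOCH_COMPLETED and COMPLETED once each.

```python
from collections import defaultdict


class ToyEngine:
    """Hypothetical stand-in for an engine; NOT Ignite's implementation."""

    def __init__(self, name):
        self.name = name
        self.handlers = defaultdict(list)  # each engine has its own registry
        self.log = []                      # events fired by this engine only

    def on(self, event):
        def decorator(fn):
            self.handlers[event].append(fn)
            return fn
        return decorator

    def fire(self, event):
        self.log.append(event)
        for fn in self.handlers[event]:
            fn(self)

    def run(self, data, max_epochs=1):
        for _ in range(max_epochs):
            for _batch in data:
                pass  # a real engine would call its process function here
            self.fire("EPOCH_COMPLETED")
        self.fire("COMPLETED")


trainer = ToyEngine("trainer")
evaluator = ToyEngine("evaluator")


@trainer.on("EPOCH_COMPLETED")
def run_validation(engine):
    # Starts a separate loop with its own events
    evaluator.run([0, 1, 2])


trainer.run([0, 1, 2, 3], max_epochs=2)
print(trainer.log)
print(evaluator.log)
```

In this toy model the evaluator fires EPOCH_COMPLETED immediately followed by COMPLETED on every validation run, which is why a handler attached to either event sees the same state.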

Creating Custom Events based on Forward/Backward Pass#

There are cases where the user might want to add events based on the loss calculation and backward pass. Ignite provides flexibility to the user to allow for this:

from ignite.engine import Engine, EventEnum

class BackpropEvents(EventEnum):
    """
    Events based on back propagation
    """
    BACKWARD_STARTED = 'backward_started'
    BACKWARD_COMPLETED = 'backward_completed'
    OPTIM_STEP_COMPLETED = 'optim_step_completed'

def update(engine, batch):
    model.train()
    optimizer.zero_grad()
    x, y = process_batch(batch)
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    engine.fire_event(BackpropEvents.BACKWARD_STARTED)
    loss.backward()
    engine.fire_event(BackpropEvents.BACKWARD_COMPLETED)
    optimizer.step()
    engine.fire_event(BackpropEvents.OPTIM_STEP_COMPLETED)

    return loss.item()

trainer = Engine(update)
trainer.register_events(*BackpropEvents)

@trainer.on(BackpropEvents.BACKWARD_STARTED)
def function_before_backprop(engine):
    # insert custom function here
    pass

A more detailed implementation can be found in TBPTT Trainer.

Gradient accumulation#

Gradient accumulation is a best practice when we need to increase the effective batch size on limited GPU resources. There are several ways to do this; the simplest is the following:

accumulation_steps = 4

def update_fn(engine, batch):
    model.train()

    x, y = prepare_batch(batch, device=device, non_blocking=non_blocking)
    y_pred = model(x)
    loss = criterion(y_pred, y) / accumulation_steps
    loss.backward()

    if engine.state.iteration % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

    return loss.item()

trainer = Engine(update_fn)

Based on this blog article and this code.
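
The reason for dividing the loss by accumulation_steps can be checked with a small plain-Python calculation. This is a sketch under stated assumptions: a one-parameter linear model with a mean-squared-error loss, equally sized micro-batches, and a hypothetical helper mse_grad written for this example. Under these assumptions, summing the scaled micro-batch gradients gives the same gradient as one large batch.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w (hypothetical helper)."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]

accumulation_steps = 4
micro_batch_size = len(xs) // accumulation_steps

# Accumulate the gradients of (loss / accumulation_steps) over the micro-batches,
# which is what repeated loss.backward() calls do before optimizer.step()
accumulated_grad = 0.0
for step in range(accumulation_steps):
    lo = step * micro_batch_size
    hi = lo + micro_batch_size
    accumulated_grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

# Gradient computed on the full batch in one go
full_batch_grad = mse_grad(w, xs, ys)

print(abs(accumulated_grad - full_batch_grad) < 1e-9)
```

Without the division, the accumulated gradient would be accumulation_steps times larger, which is equivalent to silently scaling the learning rate.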

Working with iterators#

If the data provider for training or validation is an iterator (infinite, or finite with known or unknown size), here are basic examples of how to set up a trainer or an evaluator.

Infinite iterator for training#

Let’s use an infinite data iterator as the training dataflow:

import torch
from ignite.engine import Engine, Events

torch.manual_seed(12)

def infinite_iterator(batch_size):
    while True:
        batch = torch.rand(batch_size, 3, 32, 32)
        yield batch

def train_step(trainer, batch):
    # ...
    s = trainer.state
    print(
        "{}/{} : {} - {:.3f}".format(s.epoch, s.max_epochs, s.iteration, batch.norm())
    )

trainer = Engine(train_step)
# We need to specify epoch_length to define the epoch
trainer.run(infinite_iterator(4), epoch_length=5, max_epochs=3)

In this case we will obtain the following output:

1/3 : 1 - 63.862
1/3 : 2 - 64.042
1/3 : 3 - 63.936
1/3 : 4 - 64.141
1/3 : 5 - 64.767
2/3 : 6 - 63.791
2/3 : 7 - 64.565
2/3 : 8 - 63.602
2/3 : 9 - 63.995
2/3 : 10 - 63.943
3/3 : 11 - 63.831
3/3 : 12 - 64.276
3/3 : 13 - 64.148
3/3 : 14 - 63.920
3/3 : 15 - 64.226
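
The epoch and iteration columns in the output above follow directly from epoch_length. As a minimal sketch, assuming a global iteration counter that starts at 1 (as the output suggests), the epoch of each iteration can be reproduced in plain Python; the name pairs is introduced only for this example.

```python
epoch_length = 5
max_epochs = 3

pairs = []
for iteration in range(1, epoch_length * max_epochs + 1):
    # Iterations 1..5 belong to epoch 1, 6..10 to epoch 2, and so on
    epoch = (iteration - 1) // epoch_length + 1
    pairs.append((epoch, iteration))
    print("{}/{} : {}".format(epoch, max_epochs, iteration))
```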

If we do not specify epoch_length, we can stop the training explicitly by calling terminate(). In this case, only a single epoch is defined.

import torch
from ignite.engine import Engine, Events

torch.manual_seed(12)

def infinite_iterator(batch_size):
    while True:
        batch = torch.rand(batch_size, 3, 32, 32)
        yield batch

def train_step(trainer, batch):
    # ...
    s = trainer.state
    print(
        "{}/{} : {} - {:.3f}".format(s.epoch, s.max_epochs, s.iteration, batch.norm())
    )

trainer = Engine(train_step)

@trainer.on(Events.ITERATION_COMPLETED(once=15))
def stop_training():
    trainer.terminate()

trainer.run(infinite_iterator(4))

We obtain the following output:

1/1 : 1 - 63.862
1/1 : 2 - 64.042
1/1 : 3 - 63.936
1/1 : 4 - 64.141
1/1 : 5 - 64.767
1/1 : 6 - 63.791
1/1 : 7 - 64.565
1/1 : 8 - 63.602
1/1 : 9 - 63.995
1/1 : 10 - 63.943
1/1 : 11 - 63.831
1/1 : 12 - 64.276
1/1 : 13 - 64.148
1/1 : 14 - 63.920
1/1 : 15 - 64.226

The same code can be used for validating models.

Finite iterator with unknown length#

Let’s use a finite data iterator whose length is unknown to the user. For training, we would like to perform several passes over the dataflow, so we need to restart the data iterator when it is exhausted. In the code below we do not specify epoch_length; it is determined automatically.

import torch
from ignite.engine import Engine, Events

torch.manual_seed(12)

def finite_unk_size_data_iter():
    for i in range(11):
        yield i

def train_step(trainer, batch):
    # ...
    s = trainer.state
    print(
        "{}/{} : {} - {:.3f}".format(s.epoch, s.max_epochs, s.iteration, batch)
    )

trainer = Engine(train_step)

@trainer.on(Events.DATALOADER_STOP_ITERATION)
def restart_iter():
    trainer.state.dataloader = finite_unk_size_data_iter()

data_iter = finite_unk_size_data_iter()
trainer.run(data_iter, max_epochs=5)
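
The restart pattern can also be written as a plain-Python loop, which makes it easier to see how an epoch length can be inferred from the first pass. This is only a sketch of the general idea: the run helper and its history list are hypothetical and do not reproduce Ignite's internals.

```python
def finite_unk_size_data_iter():
    for i in range(11):
        yield i


def run(make_iter, max_epochs):
    """Hypothetical plain-Python loop; NOT Ignite's internal implementation."""
    epoch_length = None  # unknown until the iterator is exhausted once
    history = []         # (epoch, batch) pairs, recorded for inspection
    data_iter = make_iter()
    for epoch in range(1, max_epochs + 1):
        count = 0
        while epoch_length is None or count < epoch_length:
            try:
                batch = next(data_iter)
            except StopIteration:
                # Iterator exhausted: restart it for the next pass
                data_iter = make_iter()
                if epoch_length is None:
                    # The first full pass defines the epoch length
                    epoch_length = count
                continue
            count += 1
            history.append((epoch, batch))
    return epoch_length, history


epoch_length, history = run(finite_unk_size_data_iter, max_epochs=3)
print(epoch_length, len(history))
```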

For validation, the code is simply:

import torch
from ignite.engine import Engine, Events

torch.manual_seed(12)

def finite_unk_size_data_iter():
    for i in range(11):
        yield i

def val_step(evaluator, batch):
    # ...
    s = evaluator.state
    print(
        "{}/{} : {} - {:.3f}".format(s.epoch, s.max_epochs, s.iteration, batch)
    )

evaluator = Engine(val_step)

data_iter = finite_unk_size_data_iter()
evaluator.run(data_iter)

Finite iterator with known length#

Let’s use a finite data iterator with known size for training or validation. If we need to restart the data iterator, we can attach a restart handler to Events.DATALOADER_STOP_ITERATION, as in the unknown-size case. Here, however, we restart it explicitly every size iterations:

import torch
from ignite.engine import Engine, Events

torch.manual_seed(12)

size = 11

def finite_size_data_iter(size):
    for i in range(size):
        yield i

def train_step(trainer, batch):
    # ...
    s = trainer.state
    print(
        "{}/{} : {} - {:.3f}".format(s.epoch, s.max_epochs, s.iteration, batch)
    )

trainer = Engine(train_step)

@trainer.on(Events.ITERATION_COMPLETED(every=size))
def restart_iter():
    trainer.state.dataloader = finite_size_data_iter(size)

data_iter = finite_size_data_iter(size)
trainer.run(data_iter, max_epochs=5)

For validation, the code is simply:

import torch
from ignite.engine import Engine, Events

torch.manual_seed(12)

size = 11

def finite_size_data_iter(size):
    for i in range(size):
        yield i

def val_step(evaluator, batch):
    # ...
    s = evaluator.state
    print(
        "{}/{} : {} - {:.3f}".format(s.epoch, s.max_epochs, s.iteration, batch)
    )

evaluator = Engine(val_step)

data_iter = finite_size_data_iter(size)
evaluator.run(data_iter)

Switching data provider during the training#

Users can easily switch the data provider during training using set_data(). See the example in the documentation of that method.

Time profiling during training#

Times can be fetched in several ways, depending on how detailed the required time profiling is:

Single epoch and total time#

The simplest way to fetch the time of a single epoch and of the complete training is to use engine.state.times["EPOCH_COMPLETED"] and engine.state.times["COMPLETED"]:

trainer = ...

@trainer.on(Events.EPOCH_COMPLETED)
def log_epoch_time():
    print("{}: {}".format(trainer.state.epoch, trainer.state.times["EPOCH_COMPLETED"]))

@trainer.on(Events.COMPLETED)
def log_total_time():
    print("Total: {}".format(trainer.state.times["COMPLETED"]))

For details, see State.

Detailed profiling#

Users can set up BasicTimeProfiler to fetch the times spent in data processing, the training step, and event handlers:

from ignite.contrib.handlers import BasicTimeProfiler

trainer = ...

# Create an object of the profiler and attach an engine to it
profiler = BasicTimeProfiler()
profiler.attach(trainer)

@trainer.on(Events.EPOCH_COMPLETED(every=10))
def log_intermediate_results():
    profiler.print_results(profiler.get_results())

trainer.run(dataloader, max_epochs=3)

Typical output:

 ----------------------------------------------------
| Time profiling stats (in seconds):                 |
 ----------------------------------------------------
total  |  min/index  |  max/index  |  mean  |  std

Processing function:
157.46292 | 0.01452/1501 | 0.26905/0 | 0.07730 | 0.01258

Dataflow:
6.11384 | 0.00008/1935 | 0.28461/1551 | 0.00300 | 0.02693

Event handlers:
2.82721

- Events.STARTED: []
0.00000

- Events.EPOCH_STARTED: []
0.00006 | 0.00000/0 | 0.00000/17 | 0.00000 | 0.00000

- Events.ITERATION_STARTED: ['PiecewiseLinear']
0.03482 | 0.00001/188 | 0.00018/679 | 0.00002 | 0.00001

- Events.ITERATION_COMPLETED: ['TerminateOnNan']
0.20037 | 0.00006/866 | 0.00089/1943 | 0.00010 | 0.00003

- Events.EPOCH_COMPLETED: ['empty_cuda_cache', 'training.<locals>.log_elapsed_time', ]
2.57860 | 0.11529/0 | 0.14977/13 | 0.12893 | 0.00790

- Events.COMPLETED: []
not yet triggered

For details, see BasicTimeProfiler.
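
To help read the columns of the table above (total, min/index, max/index, mean, std), here is a minimal plain-Python sketch that computes the same kind of summary from a list of per-call durations. The summarize helper is hypothetical, and the use of the sample standard deviation is an assumption; the profiler may compute std differently.

```python
import statistics


def summarize(durations):
    """Hypothetical helper: summary stats for a list of durations in seconds."""
    indices = range(len(durations))
    min_idx = min(indices, key=durations.__getitem__)
    max_idx = max(indices, key=durations.__getitem__)
    return {
        "total": sum(durations),
        # value and the index (e.g. iteration number) where it occurred
        "min/index": (durations[min_idx], min_idx),
        "max/index": (durations[max_idx], max_idx),
        "mean": statistics.mean(durations),
        # Assumption: sample standard deviation
        "std": statistics.stdev(durations),
    }


stats = summarize([0.25, 0.10, 0.15, 0.30])
print(stats)
```

Read this way, a row such as "0.01452/1501" in the min/index column means the fastest call took 0.01452 seconds and occurred at index 1501.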

Custom time measures#

Custom time measures can be performed using Timer. See its docstring for details.

Other questions#

Other answers can be found on GitHub among the issues labeled "question".