GarbageCollector

class torchtnt.framework.callbacks.GarbageCollector(step_interval: int)

A callback that performs periodic synchronous garbage collection.

In fully synchronous distributed training, the same program runs across multiple processes. These processes must communicate with each other, in particular to exchange the gradients used to update model parameters, so overall execution is gated by the slowest process. It is therefore important that each process takes roughly the same amount of time to execute its code; otherwise some processes become stragglers. By default, Python’s automatic garbage collection can be triggered at different points in each process, which can create stragglers. This callback makes it convenient to have all processes perform garbage collection at the same point in the loop.
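The idea described above can be sketched with only the standard `gc` module. This is a minimal illustration, not torchtnt's implementation: the class `PeriodicGC`, its `collected_at` attribute, and the helper `run_loop` are hypothetical names introduced here, and the hook methods take a plain step count instead of the `state` and `unit` arguments used by the real callback. The point it shows is that once automatic collection is disabled and collection is tied to the step count, every process that runs the same loop collects on the same steps.

```python
import gc


class PeriodicGC:
    """Hypothetical helper for illustration; not the torchtnt callback."""

    def __init__(self, step_interval: int) -> None:
        if step_interval < 1:
            raise ValueError("step_interval must be >= 1")
        self.step_interval = step_interval
        self.collected_at = []  # steps on which a collection ran

    def on_train_start(self) -> None:
        # Turn off automatic collection so it cannot fire at a
        # different moment in each process.
        gc.disable()

    def on_train_step_end(self, num_steps_completed: int) -> None:
        # Every process sees the same step count, so every process
        # collects on the same steps.
        if num_steps_completed % self.step_interval == 0:
            gc.collect()
            self.collected_at.append(num_steps_completed)

    def on_train_end(self) -> None:
        # Restore Python's automatic collection.
        gc.enable()


def run_loop(num_steps: int, step_interval: int) -> list:
    """Drive the hooks through a stand-in training loop."""
    hook = PeriodicGC(step_interval)
    hook.on_train_start()
    try:
        for step in range(1, num_steps + 1):
            batch = [object() for _ in range(100)]  # stand-in for real work
            del batch
            hook.on_train_step_end(step)
    finally:
        hook.on_train_end()
    return hook.collected_at
```

Because the trigger depends only on the step counter, the collection schedule is deterministic and identical across processes, which is the property that avoids collection-induced stragglers.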

Synchronizing garbage collection can improve performance. The collection frequency must be tuned for the application at hand.

By default, this callback runs a generation 1 collection every step. This frees some objects with minimal overhead compared to a full garbage collection.
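To make "generation 1 collection" concrete, the sketch below uses CPython's standard `gc` and `weakref` modules. It assumes CPython's generational collector, where `gc.collect(generation=1)` examines the younger generations without doing a full collection. The names `Node` and `young_cycle_is_reclaimed` are hypothetical and exist only for this illustration. The example shows that a freshly created reference cycle, which reference counting alone cannot free, is reclaimed by this cheaper collection.

```python
import gc
import weakref


class Node:
    """Small object that takes part in a reference cycle."""

    def __init__(self) -> None:
        self.ref = self  # self-reference: refcounting alone cannot free it


def young_cycle_is_reclaimed() -> bool:
    """Return True if a young cyclic object is freed by a generation 1 pass."""
    was_enabled = gc.isenabled()
    gc.disable()  # keep automatic collection out of the demonstration
    try:
        node = Node()
        probe = weakref.ref(node)  # lets us observe when the object dies
        del node                   # now only the cycle keeps it alive
        if probe() is None:
            return False           # unexpected: freed before any collection
        gc.collect(generation=1)   # younger generations only, not a full pass
        return probe() is None
    finally:
        if was_enabled:
            gc.enable()
```

Long-lived objects that have already been promoted to the oldest generation are not guaranteed to be examined by such a pass, which is why a periodic full collection is still useful.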

Parameters: step_interval – number of steps to run before each collection

on_eval_end(state: State, unit: EvalUnit[TEvalData]) -> None

Hook called after evaluation ends.

on_eval_start(state: State, unit: EvalUnit[TEvalData]) -> None

Hook called before evaluation starts.

on_eval_step_end(state: State, unit: EvalUnit[TEvalData]) -> None

Hook called after an eval step ends.

on_predict_end(state: State, unit: PredictUnit[TPredictData]) -> None

Hook called after prediction ends.

on_predict_start(state: State, unit: PredictUnit[TPredictData]) -> None

Hook called before prediction starts.

on_predict_step_end(state: State, unit: PredictUnit[TPredictData]) -> None

Hook called after a predict step ends.

on_train_end(state: State, unit: TrainUnit[TTrainData]) -> None

Hook called after training ends.

on_train_start(state: State, unit: TrainUnit[TTrainData]) -> None

Hook called before training starts.

on_train_step_end(state: State, unit: TrainUnit[TTrainData]) -> None

Hook called after a train step ends.
