GarbageCollector

class torchtnt.framework.callbacks.GarbageCollector(step_interval: int)

A callback that performs periodic synchronous garbage collection.

In fully synchronous distributed training, the same program runs across multiple processes. These processes must communicate with each other, in particular to exchange gradients for updating model parameters, so overall execution is gated by the slowest process. It is therefore important that each process takes roughly the same amount of time to execute its code; otherwise some processes become stragglers. By default, Python’s automatic garbage collection can be triggered at different points in each process, which can create stragglers. This callback makes it convenient to have all processes perform garbage collection at the same point in the loop.

Synchronizing garbage collection can improve performance. The collection frequency must be tuned for the application at hand.

By default, this callback runs a generation 1 collection at every step. This can free some objects with minimal overhead compared to a full garbage collection.
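The schedule described above can be sketched with the standard library `gc` module alone. This is a minimal illustration, not torchtnt's implementation: the helper name `collect_on_schedule` is hypothetical, and it assumes that `step_interval` controls how often the full collection runs while the generation 1 collection runs at every step.

```python
import gc


def collect_on_schedule(step: int, step_interval: int) -> str:
    """Run a generation 1 pass every step and a full pass every `step_interval` steps.

    Hypothetical helper for illustration only; not part of torchtnt.
    Returns which kind of collection was the last one performed.
    """
    if step_interval <= 0:
        raise ValueError("step_interval must be positive")

    # A generation 1 collection examines only generations 0 and 1,
    # so it is cheaper than a full collection.
    gc.collect(generation=1)

    if step % step_interval == 0:
        # Assumption: the full collection (all generations) runs on this interval.
        gc.collect()
        return "full"
    return "gen1"


# Every process calling this with the same step counter collects at the same steps.
schedule = [collect_on_schedule(step, step_interval=4) for step in range(1, 9)]
```

Because the decision depends only on the step counter, which is identical across ranks in synchronous training, every process pays the collection cost at the same point rather than at arbitrary moments.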

Parameters:	step_interval – number of steps to run before each collection

on_eval_end(state: State, unit: EvalUnit[TEvalData]) → None: Hook called after evaluation ends.

on_eval_start(state: State, unit: EvalUnit[TEvalData]) → None: Hook called before evaluation starts.

on_eval_step_end(state: State, unit: EvalUnit[TEvalData]) → None: Hook called after an eval step ends.

on_predict_end(state: State, unit: PredictUnit[TPredictData]) → None: Hook called after prediction ends.

on_predict_start(state: State, unit: PredictUnit[TPredictData]) → None: Hook called before prediction starts.

on_predict_step_end(state: State, unit: PredictUnit[TPredictData]) → None: Hook called after a predict step ends.

on_train_end(state: State, unit: TrainUnit[TTrainData]) → None: Hook called after training ends.

on_train_start(state: State, unit: TrainUnit[TTrainData]) → None: Hook called before training starts.

on_train_step_end(state: State, unit: TrainUnit[TTrainData]) → None: Hook called after a train step ends.
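The start, step-end, and end hooks listed above suggest a three-phase lifecycle. The sketch below shows one plausible way such hooks could cooperate, using only the standard library. It is an assumption-laden illustration rather than the library's code: the class `SyncGCSketch` and the function `run_loop` are hypothetical, the hook signatures are simplified (the real hooks receive `state` and `unit`), and the choice to disable automatic collection at start and re-enable it at end is assumed, not confirmed by this page.

```python
import gc


class SyncGCSketch:
    """Hypothetical stand-in whose hook names mirror the train hooks above."""

    def __init__(self, step_interval: int) -> None:
        self.step_interval = step_interval
        self.full_collections = 0  # bookkeeping for the example only

    def on_train_start(self) -> None:
        # Assumption: switch off automatic collection so it cannot fire
        # at a different moment in each process.
        gc.disable()

    def on_train_step_end(self, steps_completed: int) -> None:
        # Cheap generation 1 pass at every step.
        gc.collect(generation=1)
        if steps_completed % self.step_interval == 0:
            # Assumption: full pass every `step_interval` steps.
            gc.collect()
            self.full_collections += 1

    def on_train_end(self) -> None:
        # Restore Python's default automatic collection.
        gc.enable()


def run_loop(callback: SyncGCSketch, num_steps: int) -> bool:
    """Drive the hooks like a training loop; report whether GC was off during it."""
    callback.on_train_start()
    try:
        disabled_during_loop = not gc.isenabled()
        for step in range(1, num_steps + 1):
            # ... forward pass, backward pass, optimizer step would go here ...
            callback.on_train_step_end(step)
    finally:
        callback.on_train_end()
    return disabled_during_loop


sketch = SyncGCSketch(step_interval=5)
was_disabled = run_loop(sketch, num_steps=10)
```

With torchtnt installed, the real callback would instead be constructed as `GarbageCollector(step_interval=...)` and supplied in the list of callbacks given to the training, evaluation, or prediction entry point.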
