.. role:: hidden
    :class: hidden-section

Distributed Checkpoint - torch.distributed.checkpoint
=====================================================

Distributed Checkpoint (DCP) supports loading and saving models from multiple
ranks in parallel. It handles load-time resharding, which enables saving in one
cluster topology and loading into another.

DCP is different from `torch.save` and `torch.load` in a few significant ways:

* It produces multiple files per checkpoint, with at least one per rank.
* It operates in place, meaning that the model should allocate its data first
  and DCP uses that storage instead.

The entrypoints to load and save a checkpoint are the following:

.. automodule:: torch.distributed.checkpoint

.. currentmodule:: torch.distributed.checkpoint

.. autofunction:: load
.. autofunction:: save
.. autofunction:: load_state_dict
.. autofunction:: save_state_dict

A minimal end-to-end sketch of these entrypoints appears at the end of this
page.

In addition to the above entrypoints, `Stateful` objects, as described below,
provide additional customization during saving/loading.

.. automodule:: torch.distributed.checkpoint.stateful

.. autoclass:: torch.distributed.checkpoint.stateful.Stateful
    :members:

This `example <https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/examples/fsdp_checkpoint_example.py>`_
shows how to use PyTorch Distributed Checkpoint to save an FSDP model.

The following types define the IO interface used during checkpoint:

.. autoclass:: torch.distributed.checkpoint.StorageReader
    :members:

.. autoclass:: torch.distributed.checkpoint.StorageWriter
    :members:

The following types define the planner interface used during checkpoint:

.. autoclass:: torch.distributed.checkpoint.LoadPlanner
    :members:

.. autoclass:: torch.distributed.checkpoint.LoadPlan
    :members:

.. autoclass:: torch.distributed.checkpoint.ReadItem
    :members:

.. autoclass:: torch.distributed.checkpoint.SavePlanner
    :members:

.. autoclass:: torch.distributed.checkpoint.SavePlan
    :members:

.. autoclass:: torch.distributed.checkpoint.WriteItem
    :members:

We provide a filesystem-based storage layer:

.. autoclass:: torch.distributed.checkpoint.FileSystemReader
    :members:

.. autoclass:: torch.distributed.checkpoint.FileSystemWriter
    :members:

We provide default implementations of `LoadPlanner` and `SavePlanner` that
can handle all of the torch.distributed constructs such as FSDP, DDP,
ShardedTensor and DistributedTensor.

.. autoclass:: torch.distributed.checkpoint.DefaultSavePlanner
    :members:

.. autoclass:: torch.distributed.checkpoint.DefaultLoadPlanner
    :members:

We also provide a set of APIs to help users get and set a state_dict easily.
This is an experimental feature and is subject to change.

.. autofunction:: torch.distributed.checkpoint.state_dict.get_state_dict

.. autofunction:: torch.distributed.checkpoint.state_dict.get_model_state_dict

.. autofunction:: torch.distributed.checkpoint.state_dict.get_optimizer_state_dict

.. autofunction:: torch.distributed.checkpoint.state_dict.set_state_dict

.. autofunction:: torch.distributed.checkpoint.state_dict.set_model_state_dict

.. autofunction:: torch.distributed.checkpoint.state_dict.set_optimizer_state_dict

.. autoclass:: torch.distributed.checkpoint.state_dict.StateDictOptions
    :members:
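The following is a minimal, single-process sketch of the ``save``/``load``
entrypoints combined with the filesystem storage layer. The directory name
``CHECKPOINT_DIR`` is an illustrative placeholder, not a DCP constant, and a
real job would typically launch one process per rank (e.g. via ``torchrun``)
so that all ranks participate in the checkpoint.

.. code-block:: python

    import os

    import torch.distributed as dist
    import torch.distributed.checkpoint as dcp
    import torch.nn as nn

    CHECKPOINT_DIR = "checkpoint_dir"  # placeholder path

    # DCP coordinates the checkpoint across ranks, so initialize a
    # (here: single-process) default process group first.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = nn.Linear(8, 8)
    state_dict = {"model": model.state_dict()}

    # save() writes at least one file per rank into CHECKPOINT_DIR.
    dcp.save(state_dict, storage_writer=dcp.FileSystemWriter(CHECKPOINT_DIR))

    # Loading operates in place: allocate the model first, then let DCP
    # read the checkpoint into the already-allocated tensors.
    model = nn.Linear(8, 8)
    state_dict = {"model": model.state_dict()}
    dcp.load(state_dict, storage_reader=dcp.FileSystemReader(CHECKPOINT_DIR))
    model.load_state_dict(state_dict["model"])

    dist.destroy_process_group()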
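``Stateful`` objects can be checkpointed alongside raw tensors. The sketch
below is an illustrative pattern rather than part of the DCP API: the class
name ``AppState`` and its layout are assumptions made here. When an object in
the state dict passed to ``save``/``load`` implements ``Stateful``, DCP calls
its ``state_dict()`` while saving and its ``load_state_dict()`` after reading
the checkpoint back.

.. code-block:: python

    from torch.distributed.checkpoint.stateful import Stateful


    class AppState(Stateful):
        """Bundles a model and optimizer so DCP snapshots them together."""

        def __init__(self, model, optimizer):
            self.model = model
            self.optimizer = optimizer

        def state_dict(self):
            # Invoked by dcp.save() to collect the data to persist.
            return {
                "model": self.model.state_dict(),
                "optimizer": self.optimizer.state_dict(),
            }

        def load_state_dict(self, state_dict):
            # Invoked by dcp.load() once the checkpoint has been read.
            self.model.load_state_dict(state_dict["model"])
            self.optimizer.load_state_dict(state_dict["optimizer"])

With this wrapper, ``dcp.save({"app": AppState(model, optimizer)}, ...)``
captures both objects in one call, and a matching ``dcp.load`` restores them
without manual unpacking.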
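Finally, a sketch of the experimental state_dict helpers on a plain module;
since the feature is subject to change, treat the exact call pattern as a
best-effort illustration. ``get_state_dict`` returns matching model and
optimizer state_dicts, and ``set_state_dict`` restores them, hiding the
differences between FSDP, DDP, and unwrapped modules.

.. code-block:: python

    import torch
    import torch.nn as nn
    from torch.distributed.checkpoint.state_dict import (
        get_state_dict,
        set_state_dict,
    )

    model = nn.Linear(8, 8)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Fetch DCP-friendly state_dicts for the model and its optimizer.
    model_sd, optim_sd = get_state_dict(model, optimizer)

    # ... typically saved with dcp.save() and reloaded with dcp.load() ...

    # Restore both objects from the (possibly reloaded) state_dicts.
    set_state_dict(
        model,
        optimizer,
        model_state_dict=model_sd,
        optim_state_dict=optim_sd,
    )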