
FullModelHFCheckpointer

class torchtune.utils.FullModelHFCheckpointer(checkpoint_dir: str, checkpoint_files: List[str], model_type: ModelType, output_dir: str, adapter_checkpoint: Optional[str] = None, recipe_checkpoint: Optional[str] = None, resume_from_checkpoint: bool = False)[source]

Checkpointer which reads and writes checkpoints in HF’s format. An example is the Llama-2-7b-hf model from the meta-llama repo (https://huggingface.co/meta-llama/Llama-2-7b-hf).

A few notes about the checkpoint reading logic:
  • HF checkpoint names are usually ordered by ID (e.g. 0001_of_0003, 0002_of_0003, etc.). To ensure we read the files in the right order, we sort the checkpoint file names before reading.

  • Checkpoint conversion to and from HF’s format requires access to model params which are read directly from the “config.json” file. This helps ensure we either load the weights correctly or error out in case of a discrepancy between the HF checkpoint file and TorchTune’s model implementation.
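Because HF shard IDs are zero-padded, a plain lexicographic sort already yields numeric order. A minimal sketch with hypothetical file names (not torchtune's actual code):

```python
# Hypothetical HF-style shard names, deliberately out of order.
checkpoint_files = [
    "pytorch_model-0003-of-0003.bin",
    "pytorch_model-0001-of-0003.bin",
    "pytorch_model-0002-of-0003.bin",
]

# Zero-padded IDs make lexicographic order equal to numeric order,
# so a plain sort reads the shards in the right sequence.
sorted_files = sorted(checkpoint_files)
# → 0001-of-0003, 0002-of-0003, 0003-of-0003
```

This is why the order of `checkpoint_files` passed to the checkpointer does not matter.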

Parameters:
  • checkpoint_dir (str) – Directory containing the checkpoint files

  • checkpoint_files (List[str]) – List of checkpoint files to load. Since the checkpointer takes care of sorting by file ID, the order in this list does not matter

  • model_type (ModelType) – Model type of the model for which the checkpointer is being loaded

  • output_dir (str) – Directory to save the checkpoint files

  • adapter_checkpoint (Optional[str]) – Path to the adapter weights. Default is None

  • recipe_checkpoint (Optional[str]) – Path to the recipe state checkpoint file. Default is None

  • resume_from_checkpoint (bool) – If True, the checkpointer will load the additional checkpoint files to resume training from a previous run. Default is False

Raises:

ValueError – If resume_from_checkpoint is True but recipe_checkpoint is None

load_checkpoint() → Dict[str, Any][source]

Load TorchTune checkpoint from file.

The keys and weights from across all checkpoint files are merged into a single state_dict. We preserve the “state_dict key” <-> “checkpoint file” mapping in weight_map so we can write the state dict correctly in save_checkpoint.
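The merge-and-record step can be sketched as follows. This is a minimal illustration with hypothetical file names and toy values, not torchtune's implementation:

```python
# Toy per-file state dicts standing in for loaded checkpoint shards.
shards = {
    "model-0001-of-0002.bin": {"tok_embeddings.weight": 0.1, "layers.0.w": 0.2},
    "model-0002-of-0002.bin": {"layers.1.w": 0.3, "output.weight": 0.4},
}

merged_state_dict = {}
weight_map = {}  # "state_dict key" -> "checkpoint file" mapping
for filename in sorted(shards):
    for key, value in shards[filename].items():
        merged_state_dict[key] = value
        weight_map[key] = filename  # remember which file each key came from
```

`weight_map` is what later lets save_checkpoint write each key back to the file it originally came from.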

Before returning, the model state dict is converted to a TorchTune-compatible format.

Returns:

TorchTune checkpoint state dict

Return type:

state_dict (Dict[str, Any])

Raises:

ValueError – If the values in the input state_dict are not Tensors

save_checkpoint(state_dict: Dict[str, Any], epoch: int, intermediate_checkpoint: bool = False) → None[source]

Save TorchTune checkpoint to file. If intermediate_checkpoint is True, an additional checkpoint file recipe_state.pt is created in _output_dir which contains the recipe state.

The state_dict is first converted back to the HF format and then partitioned based on the _weight_map into separate checkpoint files.

Parameters:
  • state_dict (Dict[str, Any]) – Checkpoint state dict to be written out to file

  • epoch (int) – Epoch number. Used to create the checkpoint file name

  • intermediate_checkpoint (bool) – If True, additional checkpoint files for the recipe state and (if applicable) adapter weights are created. Default is False
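The partitioning step is the inverse of the merge done in load_checkpoint: each key is routed back to its original file using the recorded weight map. A minimal sketch with toy values (not torchtune's implementation):

```python
# Merged state dict and the weight map recorded at load time (toy values).
merged_state_dict = {
    "tok_embeddings.weight": 0.1,
    "layers.0.w": 0.2,
    "layers.1.w": 0.3,
    "output.weight": 0.4,
}
weight_map = {
    "tok_embeddings.weight": "model-0001-of-0002.bin",
    "layers.0.w": "model-0001-of-0002.bin",
    "layers.1.w": "model-0002-of-0002.bin",
    "output.weight": "model-0002-of-0002.bin",
}

# Route each key back to the file it originally came from.
shards = {}
for key, value in merged_state_dict.items():
    shards.setdefault(weight_map[key], {})[key] = value
# Each shard would then be written out as its own checkpoint file.
```

Keeping the original file layout means the saved checkpoint remains loadable by HF tooling that expects the same sharding.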
