PackedDataset
- class torchtune.datasets.PackedDataset(ds: Dataset, *, max_seq_len: int, padding_idx: int = 0, max_packs: Optional[int] = None, split_across_pack: bool = False)
Performs greedy sample packing on a provided dataset. This is done as a single preprocessing step before training begins. Shuffling is done outside of this class on packed samples with a `Sampler` as part of the dataloader. Currently, this only supports in-memory map-style datasets.

The class loads, tokenizes, and packs examples on initialization - no tokenization is done during training.
The general flow on initialization is: load tokenized sample -> add to buffer -> when buffer is long enough, add to `self.packs`.

During training, `self.packs[idx]` is returned as input, label, attention mask, and position ids. The attention mask is a lower triangular block mask to prevent samples from cross-attending within a pack. The position ids indicate the position of each token relative to its sample within a pack. These are all padded to max sequence length, so a batch-wise collator is not needed.
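For example, a minimal usage sketch, where `my_tokenized_ds` is a hypothetical in-memory map-style dataset returning dictionaries with "tokens" and "labels" fields:

```python
from torchtune.datasets import PackedDataset

# `my_tokenized_ds` is a placeholder: any map-style dataset whose
# __getitem__ returns {"tokens": [...], "labels": [...]}.
packed_ds = PackedDataset(
    my_tokenized_ds,
    max_seq_len=2048,
    padding_idx=0,
    split_across_pack=False,
)

# Each entry is one pack, already padded to max_seq_len, containing the
# input tokens, labels, attention mask, and position ids described below.
pack = packed_ds[0]
```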
A packed sample is made up of individual smaller sequence length samples jammed together within `max_seq_len`. For example, if `max_seq_len` is 6 and there are varied length samples:

```python
tokens = [
    [S1, S1, S1, S2, S2, pad],
    [S3, S3, S4, S4, pad, pad],
    ...,
]
```
To prevent cross-contamination, the following mask would be returned for the first pack in the example:
```python
mask = [
    [1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 1],
]
```
The position ids would be:
```python
input_pos = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 0, 1, 2, 3],
    ...,
]
```
Pad tokens use an identity mask (each pad token attends only to itself) instead of a causal mask. Their position ids simply continue incrementing from the previous sample.
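To make the mask and position id construction concrete, here is an illustrative sketch (not the library's internal implementation) that reproduces the example above from a list of per-sample lengths:

```python
import torch

def pack_mask_and_pos(seq_lens, max_seq_len):
    # `seq_lens` holds the lengths of the samples in one pack,
    # e.g. [3, 2] for the first pack in the example above.
    mask = torch.zeros(max_seq_len, max_seq_len, dtype=torch.bool)
    input_pos = torch.zeros(max_seq_len, dtype=torch.long)
    start = 0
    for n in seq_lens:
        end = start + n
        # Causal (lower triangular) block restricted to this sample.
        mask[start:end, start:end] = torch.tril(torch.ones(n, n, dtype=torch.bool))
        # Positions restart at 0 for each sample.
        input_pos[start:end] = torch.arange(n)
        start = end
    for i in range(start, max_seq_len):
        # Pad tokens: identity mask; position ids keep incrementing.
        mask[i, i] = True
        input_pos[i] = input_pos[i - 1] + 1
    return mask, input_pos

mask, input_pos = pack_mask_and_pos([3, 2], max_seq_len=6)
# mask matches the 6x6 example above; input_pos is [0, 1, 2, 0, 1, 2]
```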
- Parameters:
ds (Dataset) – dataset to sample pack. This should return a dictionary with fields “tokens” and “labels” containing the tokenized and label samples.
max_seq_len (int) – Maximum number of tokens to pack.
padding_idx (int) – padding index for the tokenizer. Default is 0.
max_packs (Optional[int]) – Maximum number of packs. Default is None, which will create as many packs as possible.
split_across_pack (bool) – if the last sample in a pack does not fit in max_seq_len, whether to split the sample across the pack boundary (True) or move it entirely to the beginning of the next pack (False). For pre-training, this is typically set to True for general text completion; for fine-tuning, it is typically set to False to avoid truncating sentences in instruct tuning. Default is False.
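Since shuffling happens outside of this class, a typical setup (a sketch, assuming `packed_ds` was built as in the example above, with field names following the "tokens"/"labels" naming used here) wraps the packed dataset in a dataloader with a sampler:

```python
from torch.utils.data import DataLoader, RandomSampler

# Packs are already padded to max_seq_len, so the default collate
# can batch them directly; no custom collator is needed.
loader = DataLoader(
    packed_ds,
    batch_size=8,
    sampler=RandomSampler(packed_ds),  # or a DistributedSampler for multi-GPU
)

for batch in loader:
    ...  # e.g. batch["tokens"] would be [batch_size, max_seq_len]
```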