slimorca_dataset

torchtune.datasets.slimorca_dataset(tokenizer: Tokenizer, source: str = 'Open-Orca/SlimOrca-Dedup', max_seq_len: int = 1024, train_on_input: bool = False) → ChatDataset

Support for the SlimOrca-style family of conversational datasets.

The data is formatted to adhere to the Llama2 chat format. This format is required if the base model is a Llama2 chat model; the base Llama2 model does not prescribe a particular format.

Each returned sample is a tuple of an input token ID list and a label token ID list. If the max_seq_len keyword argument is provided, the input token ID list is truncated as needed so that its length does not exceed that value.
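To make the relationship between inputs and labels concrete, here is a minimal, purely illustrative sketch (not the actual implementation): when train_on_input is False, prompt positions in the labels are typically masked with an ignore index such as -100, which is what produces the -100 entries in the sample output further down. The helper function and token IDs below are made up for illustration only.

    def build_labels(prompt_ids, response_ids, train_on_input=False, ignore_idx=-100):
        # Concatenate prompt and response to form the model input.
        input_ids = prompt_ids + response_ids
        if train_on_input:
            # Compute loss on every position, prompt included.
            labels = list(input_ids)
        else:
            # Mask prompt positions so the loss only covers the response.
            labels = [ignore_idx] * len(prompt_ids) + list(response_ids)
        return input_ids, labels

    # A 6-token prompt followed by a 3-token response (made-up ids):
    tokens, labels = build_labels([1, 351, 82, 391, 221, 220], [471, 843, 2])
    # tokens -> [1, 351, 82, 391, 221, 220, 471, 843, 2]
    # labels -> [-100, -100, -100, -100, -100, -100, 471, 843, 2]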

Parameters:
  • tokenizer (Tokenizer) – Tokenizer used to encode the data. The tokenizer must implement encode and decode methods.

  • source (str) – Path string of the dataset; anything supported by Hugging Face’s load_dataset. Default is 'Open-Orca/SlimOrca-Dedup'.

  • max_seq_len (int) – Maximum number of tokens in the returned input and label token ID lists. This value must be at least 4, though it is generally set to the maximum sequence length accepted by the model. Default is 1024.

  • train_on_input (bool) – Whether the model is trained on the prompt or not. Default is False.

Raises:

ValueError – If max_seq_len is less than 4.

Returns:

Dataset configured with SlimOrca source data and the Llama2 chat template

Return type:

ChatDataset

Example

>>> ds = slimorca_dataset(tokenizer=tokenizer, max_seq_len=10)
>>> for input, label in ds:
...     print(input)
...     print(label)

Sample output:

[1, 351, 82, 391, 221, 220, 193, 12, 471, ..., 2]
[-100, -100, -100, -100, -100, -100, -100, -100, 471, ..., 2]
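
A fuller usage sketch, wiring the dataset into a standard PyTorch DataLoader with a simple hand-rolled padding collate, is shown below. The llama2_tokenizer import, the tokenizer path, the pad ID of 0, and the collate function are illustrative assumptions rather than part of this API.

>>> import torch
>>> from torch.nn.utils.rnn import pad_sequence
>>> from torch.utils.data import DataLoader
>>> from torchtune.datasets import slimorca_dataset
>>> from torchtune.models.llama2 import llama2_tokenizer
>>>
>>> tokenizer = llama2_tokenizer("/path/to/tokenizer.model")  # illustrative path
>>> ds = slimorca_dataset(tokenizer=tokenizer, max_seq_len=1024)
>>>
>>> def collate(batch):
...     # Pad token ids with 0 (assumed pad id) and labels with -100 so that
...     # padded label positions are ignored by the cross-entropy loss.
...     tokens = pad_sequence([torch.tensor(x) for x, _ in batch], batch_first=True, padding_value=0)
...     labels = pad_sequence([torch.tensor(y) for _, y in batch], batch_first=True, padding_value=-100)
...     return tokens, labels
...
>>> loader = DataLoader(ds, batch_size=8, collate_fn=collate)
>>> tokens, labels = next(iter(loader))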
