slimorca_dataset¶
- torchtune.datasets.slimorca_dataset(tokenizer: ModelTokenizer, *, source: str = 'Open-Orca/SlimOrca-Dedup', chat_format: Optional[str] = None, max_seq_len: int = 1024, train_on_input: bool = False, packed: bool = False) ChatDataset [source]¶
Support for SlimOrca-style family of conversational datasets.
Use a chat format if the base model requires it, such as Llama2 and Mistral. The Llama3 models do not prescribe a particular format.
The returned data is a tuple of input token id list and label token id list. If
max_seq_len
keyword argument is provided, the returned input token id list is ensured (by truncation if necessary) to be within that length.- Parameters:
tokenizer (ModelTokenizer) – Tokenizer used by the model that implements the
tokenize_messages
method.source (str) – path string of dataset, anything supported by Hugging Face’s load_dataset.
chat_format (Optional[str]) – name of template used to format the chat. See the description in
ChatDataset
for more details. Default: Nonemax_seq_len (int) – Maximum number of tokens in the returned input and label token id lists. This value needs to be at least 4 though it is generally set to max sequence length accepted by the model. Default is 1024.
train_on_input (bool) – Whether the model is trained on the prompt or not. Default is False.
packed (bool) – Whether or not to pack the dataset to
max_seq_len
prior to training. Default is False.
- Raises:
ValueError – If max_seq_len is less than 4.
- Returns:
dataset configured with SlimOrca source data and Llama2 chat template
- Return type:
Example
>>> ds = slimorca_dataset(tokenizer=tokenizer, max_seq_len=10) >>> for input, label in ds: >>> print(input) >>> print(label) >>> >>> Sample Output: >>> [1, 351, 82, 391, 221, 220, 193, 12, 471, ..., 2] >>> [-100, -100, -100, -100, -100, -100, -100, -100, 471, ..., 2]