InputOutputToMessages

class torchtune.data.InputOutputToMessages(train_on_input: Optional[bool] = None, column_map: Optional[Dict[str, str]] = None, new_system_prompt: Optional[str] = None, image_dir: Optional[Path] = None, masking_strategy: Optional[str] = 'train_on_assistant')[source]

Message transform class that converts a single sample with “input” and “output” fields, (or equivalent fields specified in column_map) to user and assistant messages, respectively. This is useful for datasets that have two columns, one containing the user prompt string and the other containing the model response string:

|  input          |  output          |
|-----------------|------------------|
| "user prompt"   | "model response" |

Parameters:

train_on_input (Optional[bool]) – whether the model is trained on the user prompt or not. Deprecated parameter and will be removed in a future release. Default is None.
column_map (Optional[Dict[str, str]]) – a mapping to change the expected “input” and “output” column names to the actual column names in the dataset. Keys should be “input” and “output” and values should be the actual column names. Default is None, keeping the default “input” and “output” column names.
new_system_prompt (Optional[str]) – if specified, prepend a system message. This can serve as instructions to guide the model response. Default is None.
image_dir (Optional[Path]) – path to the directory containing the images that is prepended to all image paths in the dataset. For example, if image_dir="/home/user/dataset/"` and the sample image path was ``"images/1.jpg", the final image path that will be loaded is "/home/user/dataset/images/1.jpg". If None, assume images are available in current working directory or are located on a remote url. For text-only, leave as None. Default is None.
masking_strategy (Optional[str]) –
masking strategy to use for model training. Must be one of: train_on_all, train_on_assistant, train_on_last. Default is “train_on_assistant”.
- train_on_all: both user and assistant messages are unmasked
- train_on_assistant: user messages are masked, only assistant messages are unmasked
- train_on_last: only the last assistant message is unmasked
Note: Multimodal user messages are always masked.

Raises:

ValueError – If column_map is provided and input not in column_map, or output not in column_map, or if image_dir is provided but image not in column_map.

InputOutputToMessages

Docs

Tutorials

Resources