ShareGPTToMessages
- class torchtune.data.ShareGPTToMessages(train_on_input: bool = False, column_map: Optional[Dict[str, str]] = None, new_system_prompt: Optional[str] = None, image_dir: Optional[Path] = None, image_tag: Optional[str] = '<image>')
Convert a single chat sample adhering to the ShareGPT JSON structure to torchtune’s Message structure.
A single sample typically consists of one optional system prompt and one or more turns of user and assistant messages.
ShareGPT follows:
    {
        "conversations": [
            {
                "from": <system|human|gpt>,
                "value": <message>,
            },
            ...
        ]
    }
Message follows:
    [
        {
            "role": <system|user|assistant>,
            "content": <message>,
        },
        ...
    ]
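The mapping between the two layouts can be sketched in plain Python. This is only an illustration of the role renaming described above, using ordinary dicts rather than torchtune’s Message class; the helper name sharegpt_to_role_content and the ROLE_MAP table are hypothetical and not part of the library.

```python
from typing import Any, Dict, List

# ShareGPT "from" values mapped to Message "role" values, as listed above.
# ROLE_MAP is an illustrative name, not a torchtune identifier.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_role_content(sample: Dict[str, Any]) -> List[Dict[str, str]]:
    """Hypothetical helper: turn one ShareGPT sample into role/content dicts."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in sample["conversations"]
    ]


sample = {
    "conversations": [
        {"from": "system", "value": "You are helpful."},
        {"from": "human", "value": "What is 2 + 2?"},
        {"from": "gpt", "value": "4"},
    ]
}
messages = sharegpt_to_role_content(sample)
```

The real transform additionally handles masking, images, and the options listed under Parameters, none of which this sketch attempts.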
- Parameters:
train_on_input (bool) – whether the prompt should remain unmasked. For multimodal datasets, train_on_input is always False and this value is ignored. Default: False
column_map (Optional[Dict[str, str]]) – a mapping from the expected columns (“conversations”) to the new column names in the dataset. Key should be “conversations” and value should be the new column name. If None, keep the default “conversations”. Default is None.
new_system_prompt (Optional[str]) – if specified, prepend a system message. This can serve as instructions to guide the model response. Setting this will OVERRIDE any system messages already present in the dataset. Default is None.
image_dir (Optional[Path]) – path to the directory containing the images that is prepended to all image paths in the dataset. For example, if image_dir="/home/user/dataset/" and the sample image path was "images/1.jpg", the final image path that will be loaded is "/home/user/dataset/images/1.jpg". If None, assume images are available in the current working directory or are located at a remote URL. For text-only, leave as None. Default is None.
image_tag (Optional[str]) – placeholder tags in the text content of each message to be replaced by image special tokens. If images are present and this is None, image tokens are prepended to the first user message in the sample by default. If text-only, this field is ignored. Default is "<image>".
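Two of the parameters above, new_system_prompt and image_dir, describe behavior that is easy to show with a small stand-alone sketch. The functions apply_system_prompt and resolve_image_path below are hypothetical names written for illustration; they mimic the documented behavior under the assumption that system turns are replaced wholesale and that image_dir is simply joined in front of the sample path. They are not torchtune’s implementation.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Optional


def apply_system_prompt(
    turns: List[Dict[str, str]], new_system_prompt: Optional[str] = None
) -> List[Dict[str, str]]:
    """Illustrative only: a new system prompt overrides any existing system turns."""
    if new_system_prompt is None:
        return list(turns)
    kept = [t for t in turns if t["from"] != "system"]
    return [{"from": "system", "value": new_system_prompt}] + kept


def resolve_image_path(image_path: str, image_dir: Optional[str] = None) -> str:
    """Illustrative only: prepend image_dir to the sample's image path if given."""
    if image_dir is None:
        # Path is used as-is: relative to the working directory, or a remote URL.
        return image_path
    return (PurePosixPath(image_dir) / image_path).as_posix()


turns = [
    {"from": "system", "value": "Old instructions."},
    {"from": "human", "value": "Hi"},
]
overridden = apply_system_prompt(turns, new_system_prompt="Answer briefly.")
resolved = resolve_image_path("images/1.jpg", image_dir="/home/user/dataset/")
```

PurePosixPath is used here so the joined path looks the same on every platform; real code loading local files would more likely use pathlib.Path.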
- Raises:
ValueError – If column_map is provided and “conversations” is not in column_map.
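The column_map check described under Raises could look roughly like the following. This is a minimal sketch of the documented rule, with a hypothetical function name and error message; the library’s actual wording and code may differ.

```python
from typing import Dict, Optional


def resolve_conversations_column(column_map: Optional[Dict[str, str]] = None) -> str:
    """Hypothetical helper: return the dataset column that holds the conversations."""
    if column_map is None:
        # No mapping given: keep the default column name.
        return "conversations"
    if "conversations" not in column_map:
        # Error text is illustrative, not copied from torchtune.
        raise ValueError(
            f"Expected a key of 'conversations' in column_map but found {list(column_map)}."
        )
    return column_map["conversations"]


default_column = resolve_conversations_column()
renamed_column = resolve_conversations_column({"conversations": "dialogue"})
```

A mapping such as {"chat": "dialogue"} would raise ValueError here because the expected key “conversations” is missing.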