TokenizedDatasetLoader
- class torchrl.data.TokenizedDatasetLoader(split, max_length, dataset_name, tokenizer_fn: Type[TensorDictTokenizer], pre_tokenization_hook=None, root_dir=None, from_disk=False, valid_size: int = 2000, num_workers: Optional[int] = None, tokenizer_class=None, tokenizer_model_name=None)[source]
Loads a tokenized dataset and caches a memory-mapped copy of it.
- Parameters:
  - split (str) – One of "train" or "valid".
  - max_length (int) – the maximum sequence length.
  - dataset_name (str) – the name of the dataset.
  - tokenizer_fn (callable) – the tokenizing method constructor, such as torchrl.data.rlhf.TensorDictTokenizer. When called, it should return a tensordict.TensorDict instance or a dictionary-like structure with the tokenized data.
  - pre_tokenization_hook (callable, optional) – called on the Dataset before tokenization. It should return a modified Dataset object. The intended use is for carrying out tasks that require modifying the dataset as a whole, as opposed to modifying individual datapoints, for example discarding certain datapoints based on a particular condition. Tokenization and other "elementwise" operations on the data are performed by the process function, which is mapped over the dataset. See the sketch after the example below.
  - root_dir (path, optional) – the path where the datasets are stored. Defaults to "$HOME/.cache/torchrl/data".
  - from_disk (bool, optional) – if True, datasets.load_from_disk() will be used. Otherwise, datasets.load_dataset() will be used. Defaults to False.
  - valid_size (int, optional) – the size of the validation dataset (if split starts with "valid") will be truncated to this value. Defaults to 2000 items.
  - num_workers (int, optional) – number of workers for datasets.dataset.map(), which is called during tokenization. Defaults to max(os.cpu_count() // 2, 1).
  - tokenizer_class (type, optional) – a tokenizer class, such as AutoTokenizer (default).
  - tokenizer_model_name (str, optional) – the model from which the vocabulary should be gathered. Defaults to "gpt2".
The dataset will be stored in <root_dir>/<split>/<max_length>/.

Examples
>>> from torchrl.data.rlhf import TensorDictTokenizer
>>> from torchrl.data.rlhf.reward import pre_tokenization_hook
>>> split = "train"
>>> max_length = 550
>>> dataset_name = "CarperAI/openai_summarize_comparisons"
>>> loader = TokenizedDatasetLoader(
...     split,
...     max_length,
...     dataset_name,
...     TensorDictTokenizer,
...     pre_tokenization_hook=pre_tokenization_hook,
... )
>>> dataset = loader.load()
>>> print(dataset)
TensorDict(
    fields={
        attention_mask: MemoryMappedTensor(shape=torch.Size([185068, 550]), device=cpu, dtype=torch.int64, is_shared=False),
        input_ids: MemoryMappedTensor(shape=torch.Size([185068, 550]), device=cpu, dtype=torch.int64, is_shared=False)},
    batch_size=torch.Size([185068]),
    device=None,
    is_shared=False)
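A custom pre_tokenization_hook can be written as a plain function that takes the whole Dataset and returns a modified Dataset. The following is a minimal sketch, not a torchrl-provided helper: the drop_empty_prompts name and the "prompt" column are hypothetical and should be adapted to the schema of the dataset being loaded.

>>> def drop_empty_prompts(dataset):
...     # Operates on the whole Dataset before tokenization and must return a Dataset.
...     # The "prompt" column name is hypothetical; use a column present in your dataset.
...     return dataset.filter(lambda example: len(example["prompt"]) > 0)
>>> loader = TokenizedDatasetLoader(
...     split,
...     max_length,
...     dataset_name,
...     TensorDictTokenizer,
...     pre_tokenization_hook=drop_empty_prompts,
... )
>>> dataset = loader.load()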
- static dataset_to_tensordict(dataset: 'datasets.Dataset' | TensorDict, data_dir: Path, prefix: NestedKey = None, features: Sequence[str] = None, batch_dims=1, valid_mask_key=None)[source]
Converts a dataset to a memory-mapped TensorDict.
If the dataset is already a TensorDict instance, it is simply converted to a memory-mapped TensorDict. Otherwise, the dataset is expected to have a features attribute which is a sequence of strings indicating the features that can be found in the dataset. If it does not, the features must be passed explicitly to this function.
- Parameters:
  - dataset (datasets.Dataset, TensorDict or equivalent) – a dataset to convert to a memory-mapped TensorDict. If features is None, it must have a features attribute with the list of keys to write in the tensordict.
  - data_dir (Path or equivalent) – directory where the data should be written.
  - prefix (NestedKey, optional) – the prefix of the dataset location. This can be used to differentiate several copies of the same dataset that have undergone different preprocessings.
  - features (sequence of str, optional) – a sequence of str indicating the features that can be found in the dataset.
  - batch_dims (int, optional) – the number of batch dimensions of the data (i.e. the number of dimensions along which the tensordict can be indexed). Defaults to 1.
  - valid_mask_key (NestedKey, optional) – if provided, this entry will be tentatively gathered and used to filter the data. Defaults to None (i.e. no filter key).
Returns: a TensorDict containing memory-mapped tensors with the dataset.
Examples
>>> from datasets import Dataset
>>> import tempfile
>>> import torch
>>> data = Dataset.from_dict({"tokens": torch.randint(20, (10, 11)), "labels": torch.zeros(10, 11)})
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     data_memmap = TokenizedDatasetLoader.dataset_to_tensordict(
...         data, data_dir=tmpdir, prefix=("some", "prefix"), features=["tokens", "labels"]
...     )
...     print(data_memmap)
TensorDict(
    fields={
        some: TensorDict(
            fields={
                prefix: TensorDict(
                    fields={
                        labels: MemoryMappedTensor(shape=torch.Size([10, 11]), device=cpu, dtype=torch.float32, is_shared=False),
                        tokens: MemoryMappedTensor(shape=torch.Size([10, 11]), device=cpu, dtype=torch.int64, is_shared=False)},
                    batch_size=torch.Size([10]),
                    device=None,
                    is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False)},
    batch_size=torch.Size([]),
    device=None,
    is_shared=False)
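When the input is already a TensorDict, no features argument is needed. The sketch below illustrates the valid_mask_key argument under the assumption that the named boolean entry is used to keep only the valid rows along the first batch dimension; the "valid_sample" key is hypothetical and can be any NestedKey present in the data.

>>> from tensordict import TensorDict
>>> td = TensorDict(
...     {
...         "tokens": torch.randint(20, (10, 11)),
...         # Hypothetical boolean mask over the batch dimension (8 valid, 2 invalid rows).
...         "valid_sample": torch.tensor([True] * 8 + [False] * 2),
...     },
...     batch_size=[10],
... )
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     filtered = TokenizedDatasetLoader.dataset_to_tensordict(
...         td, data_dir=tmpdir, valid_mask_key="valid_sample"
...     )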