
TokenizedDatasetLoader

class torchrl.data.TokenizedDatasetLoader(split, max_length, dataset_name, tokenizer_fn: Type[TensorDictTokenizer], pre_tokenization_hook=None, root_dir=None, from_disk=False, valid_size: int = 2000, num_workers: Optional[int] = None, tokenizer_class=None, tokenizer_model_name=None)[source]

Loads a tokenized dataset and caches a memory-mapped copy of it.

Parameters:
  • split (str) – One of "train" or "valid".

  • max_length (int) – the maximum sequence length.

  • dataset_name (str) – the name of the dataset.

  • tokenizer_fn (callable) – the tokenizing method constructor, such as torchrl.data.rlhf.TensorDictTokenizer. When called, it should return a tensordict.TensorDict instance or a dictionary-like structure with the tokenized data.

  • pre_tokenization_hook (callable, optional) – called on the Dataset before tokenization. It should return a modified Dataset object. The intended use is for carrying out tasks that require modifying the dataset as a whole, as opposed to modifying individual datapoints, for example discarding certain datapoints based on a particular condition. Tokenization and other “elementwise” operations on the data are performed by the process function, which is mapped over the dataset. A short sketch is given below, after this parameter list.

  • root_dir (path, optional) – the path where the datasets are stored. Defaults to "$HOME/.cache/torchrl/data".

  • from_disk (bool, optional) – if True, datasets.load_from_disk() will be used. Otherwise, datasets.load_dataset() will be used. Defaults to False.

  • valid_size (int, optional) – the size to which the validation dataset will be truncated (if split starts with "valid"). Defaults to 2000 items.

  • num_workers (int, optional) – number of workers for datasets.dataset.map() which is called during tokenization. Defaults to max(os.cpu_count() // 2, 1).

  • tokenizer_class (type, optional) – A tokenizer class, such as AutoTokenizer (default).

  • tokenizer_model_name (str, optional) – The model from which the vocabulary should be gathered. Defaults to "gpt2".

The dataset will be stored in <root_dir>/<split>/<max_length>/.
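
As an illustration of pre_tokenization_hook, the sketch below drops datapoints with an empty completion before tokenization. This is a minimal, hypothetical hook: the function name and the "chosen" column are assumptions about the raw dataset, not part of the torchrl API.

>>> def drop_empty_rows(dataset):  # hypothetical hook, not part of torchrl
...     # datasets.Dataset.filter applies the predicate to every row and
...     # returns a new Dataset containing only the rows that pass it
...     return dataset.filter(lambda example: len(example["chosen"]) > 0)
>>> # the hook is then passed to the loader:
>>> # TokenizedDatasetLoader(..., pre_tokenization_hook=drop_empty_rows)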

Examples

>>> from torchrl.data.rlhf import TensorDictTokenizer
>>> from torchrl.data.rlhf.reward import pre_tokenization_hook
>>> split = "train"
>>> max_length = 550
>>> dataset_name = "CarperAI/openai_summarize_comparisons"
>>> loader = TokenizedDatasetLoader(
...     split,
...     max_length,
...     dataset_name,
...     TensorDictTokenizer,
...     pre_tokenization_hook=pre_tokenization_hook,
... )
>>> dataset = loader.load()
>>> print(dataset)
TensorDict(
    fields={
        attention_mask: MemoryMappedTensor(shape=torch.Size([185068, 550]), device=cpu, dtype=torch.int64, is_shared=False),
        input_ids: MemoryMappedTensor(shape=torch.Size([185068, 550]), device=cpu, dtype=torch.int64, is_shared=False)},
    batch_size=torch.Size([185068]),
    device=None,
    is_shared=False)
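
Because the returned object is a memory-mapped TensorDict, batches can be sliced out of it without loading the whole dataset into RAM. A minimal sketch continuing the example above (the batch size is arbitrary):

>>> batch = dataset[:256]  # contiguous slice read from the memory map
>>> batch["input_ids"].shape, batch["attention_mask"].shape
(torch.Size([256, 550]), torch.Size([256, 550]))
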
static dataset_to_tensordict(dataset: 'datasets.Dataset' | TensorDict, data_dir: Path, prefix: NestedKey = None, features: Sequence[str] = None, batch_dims=1, valid_mask_key=None)[source]

Converts a dataset to a memory-mapped TensorDict.

If the dataset is already a TensorDict instance, it is simply converted to a memory-mapped TensorDict. Otherwise, the dataset is expected to have a features attribute which is a sequence of strings indicating the features that can be found in the dataset. If it does not, the features must be passed explicitly to this function.

Parameters:
  • dataset (datasets.Dataset, TensorDict or equivalent) – a dataset to convert to a memory-mapped TensorDict. If features is None, it must have a features attribute with the list of keys to write in the tensordict.

  • data_dir (Path or equivalent) – directory where the data should be written.

  • prefix (NestedKey, optional) – the prefix of the dataset location. This can be used to differentiate several copies of the same dataset that have undergone different preprocessing.

  • features (sequence of str, optional) – a sequence of str indicating the features that can be found in the dataset.

  • batch_dims (int, optional) – the number of batch dimensions of the data (i.e., the number of dimensions along which the tensordict can be indexed). Defaults to 1.

  • valid_mask_key (NestedKey, optional) – if provided, this entry will be tentatively gathered and used to filter the data. Defaults to None (i.e., no filter key).

Returns: a TensorDict containing memory-mapped tensors with the dataset.

Examples

>>> from datasets import Dataset
>>> import tempfile
>>> import torch
>>> data = Dataset.from_dict({"tokens": torch.randint(20, (10, 11)), "labels": torch.zeros(10, 11)})
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     data_memmap = TokenizedDatasetLoader.dataset_to_tensordict(
...         data, data_dir=tmpdir, prefix=("some", "prefix"), features=["tokens", "labels"]
...     )
...     print(data_memmap)
TensorDict(
    fields={
        some: TensorDict(
            fields={
                prefix: TensorDict(
                    fields={
                        labels: MemoryMappedTensor(shape=torch.Size([10, 11]), device=cpu, dtype=torch.float32, is_shared=False),
                        tokens: MemoryMappedTensor(shape=torch.Size([10, 11]), device=cpu, dtype=torch.int64, is_shared=False)},
                    batch_size=torch.Size([10]),
                    device=None,
                    is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False)},
    batch_size=torch.Size([]),
    device=None,
    is_shared=False)
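
If a validity mask has been stored alongside the data, valid_mask_key can be used to drop the masked-out rows at conversion time. The sketch below is a hedged illustration: it assumes the mask is a boolean entry aligned with the first batch dimension and that rows where it is False are filtered out; the "valid_sample" key is illustrative.

>>> import tempfile
>>> import torch
>>> from tensordict import TensorDict
>>> data = TensorDict(
...     {"tokens": torch.randint(20, (10, 11)),
...      "valid_sample": torch.tensor([True] * 8 + [False] * 2)},
...     batch_size=[10],
... )
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     filtered = TokenizedDatasetLoader.dataset_to_tensordict(
...         data, data_dir=tmpdir, valid_mask_key="valid_sample"
...     )
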
load()[source]

Loads a pre-processed, memory-mapped dataset if it exists, and creates it otherwise.
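
In practice this means the tokenization cost is paid only once per (root_dir, split, max_length) combination; later calls simply reload the cached memory map. A minimal sketch reusing the loader from the class-level example above:

>>> dataset = loader.load()  # first call: tokenizes the raw data and writes the memory map to disk
>>> dataset = loader.load()  # later calls: reload the cached memory-mapped copy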
