
TokenizedDatasetLoader

class torchrl.data.TokenizedDatasetLoader(split, max_length, dataset_name, tokenizer_fn: Type[TensorDictTokenizer], pre_tokenization_hook=None, root_dir=None, from_disk=False, valid_size: int = 2000, num_workers: Optional[int] = None, tokenizer_class=None, tokenizer_model_name=None)[source]

Loads a tokenized dataset and caches a memory-mapped copy of it.

Parameters:
  • split (str) – One of "train" or "valid".

  • max_length (int) – the maximum sequence length.

  • dataset_name (str) – the name of the dataset.

  • tokenizer_fn (callable) – the tokenizing method constructor, such as torchrl.data.rlhf.TensorDictTokenizer. When called, it should return a tensordict.TensorDict instance or a dictionary-like structure with the tokenized data.

  • pre_tokenization_hook (callable, optional) – called on the Dataset before tokenization. It should return a modified Dataset object. The intended use is for carrying out tasks that require modifying the dataset as a whole, as opposed to modifying individual datapoints, for example discarding certain datapoints based on a particular condition. Tokenization and other “elementwise” operations on the data are performed by the process function, which is mapped over the dataset. A short sketch is given below, after this parameter list.

  • root_dir (path, optional) – the path where the datasets are stored. Defaults to "$HOME/.cache/torchrl/data".

  • from_disk (bool, optional) – if True, datasets.load_from_disk() will be used. Otherwise, datasets.load_dataset() will be used. Defaults to False.

  • valid_size (int, optional) – the size to which the validation dataset will be truncated (if split starts with "valid"). Defaults to 2000 items.

  • num_workers (int, optional) – number of workers for datasets.dataset.map() which is called during tokenization. Defaults to max(os.cpu_count() // 2, 1).

  • tokenizer_class (type, optional) – A tokenizer class, such as AutoTokenizer (default).

  • tokenizer_model_name (str, optional) – The model from which the vocabulary should be gathered. Defaults to "gpt2".

The dataset will be stored in <root_dir>/<split>/<max_length>/.
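
As an illustration of pre_tokenization_hook, the sketch below drops datapoints with an empty completion before tokenization. This is a minimal, hypothetical hook: the function name and the "chosen" column are assumptions about the raw dataset, not part of the torchrl API.

>>> def drop_empty_rows(dataset):  # hypothetical hook, not part of torchrl
...     # datasets.Dataset.filter applies the predicate to every row and
...     # returns a new Dataset containing only the rows that pass it
...     return dataset.filter(lambda example: len(example["chosen"]) > 0)
>>> # the hook is then passed to the loader:
>>> # TokenizedDatasetLoader(..., pre_tokenization_hook=drop_empty_rows)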

Examples

>>> from torchrl.data.rlhf import TensorDictTokenizer
>>> from torchrl.data.rlhf.reward import pre_tokenization_hook
>>> split = "train"
>>> max_length = 550
>>> dataset_name = "CarperAI/openai_summarize_comparisons"
>>> loader = TokenizedDatasetLoader(
...     split,
...     max_length,
...     dataset_name,
...     TensorDictTokenizer,
...     pre_tokenization_hook=pre_tokenization_hook,
... )
>>> dataset = loader.load()
>>> print(dataset)
TensorDict(
    fields={
        attention_mask: MemoryMappedTensor(shape=torch.Size([185068, 550]), device=cpu, dtype=torch.int64, is_shared=False),
        input_ids: MemoryMappedTensor(shape=torch.Size([185068, 550]), device=cpu, dtype=torch.int64, is_shared=False)},
    batch_size=torch.Size([185068]),
    device=None,
    is_shared=False)
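
Because the returned object is a memory-mapped TensorDict, batches can be sliced out of it without loading the whole dataset into RAM. A minimal sketch continuing the example above (the batch size is arbitrary):

>>> batch = dataset[:256]  # contiguous slice read from the memory map
>>> batch["input_ids"].shape, batch["attention_mask"].shape
(torch.Size([256, 550]), torch.Size([256, 550]))
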
static dataset_to_tensordict(dataset: 'datasets.Dataset' | TensorDict, data_dir: Path, prefix: NestedKey = None, features: Sequence[str] = None, batch_dims=1, valid_mask_key=None)[source]

Converts a dataset to a memory-mapped TensorDict.

If the dataset is already a TensorDict instance, it is simply converted to a memory-mapped TensorDict. Otherwise, the dataset is expected to have a features attribute which is a sequence of strings indicating the features that can be found in the dataset. If it does not, the features must be passed explicitly to this function.

Parameters:
  • dataset (datasets.Dataset, TensorDict or equivalent) – a dataset to convert to a memory-mapped TensorDict. If features is None, it must have a features attribute with the list of keys to write in the tensordict.

  • data_dir (Path or equivalent) – directory where the data should be written.

  • prefix (NestedKey, optional) – the prefix of the dataset location. This can be used to differentiate several copies of the same dataset that have undergone different preprocessing.

  • features (sequence of str, optional) – a sequence of str indicating the features that can be found in the dataset.

  • batch_dims (int, optional) – the number of batch dimensions of the data (i.e., the number of dimensions along which the tensordict can be indexed). Defaults to 1.

  • valid_mask_key (NestedKey, optional) – if provided, this entry will be tentatively gathered and used to filter the data. Defaults to None (i.e., no filter key).

Returns: a TensorDict containing memory-mapped tensors with the dataset.

Examples

>>> from datasets import Dataset
>>> import tempfile
>>> import torch
>>> data = Dataset.from_dict({"tokens": torch.randint(20, (10, 11)), "labels": torch.zeros(10, 11)})
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     data_memmap = TokenizedDatasetLoader.dataset_to_tensordict(
...         data, data_dir=tmpdir, prefix=("some", "prefix"), features=["tokens", "labels"]
...     )
...     print(data_memmap)
TensorDict(
    fields={
        some: TensorDict(
            fields={
                prefix: TensorDict(
                    fields={
                        labels: MemoryMappedTensor(shape=torch.Size([10, 11]), device=cpu, dtype=torch.float32, is_shared=False),
                        tokens: MemoryMappedTensor(shape=torch.Size([10, 11]), device=cpu, dtype=torch.int64, is_shared=False)},
                    batch_size=torch.Size([10]),
                    device=None,
                    is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False)},
    batch_size=torch.Size([]),
    device=None,
    is_shared=False)
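
If a validity mask has been stored alongside the data, valid_mask_key can be used to drop the masked-out rows at conversion time. The sketch below is a hedged illustration: it assumes the mask is a boolean entry aligned with the first batch dimension and that rows where it is False are filtered out; the "valid_sample" key is illustrative.

>>> import tempfile
>>> import torch
>>> from tensordict import TensorDict
>>> data = TensorDict(
...     {"tokens": torch.randint(20, (10, 11)),
...      "valid_sample": torch.tensor([True] * 8 + [False] * 2)},
...     batch_size=[10],
... )
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     filtered = TokenizedDatasetLoader.dataset_to_tensordict(
...         data, data_dir=tmpdir, valid_mask_key="valid_sample"
...     )
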
load()[source]

Loads a pre-processed, memory-mapped dataset if it exists, and creates it otherwise.
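
In practice this means the tokenization cost is paid only once per (root_dir, split, max_length) combination; later calls simply reload the cached memory map. A minimal sketch reusing the loader from the class-level example above:

>>> dataset = loader.load()  # first call: tokenizes the raw data and writes the memory map to disk
>>> dataset = loader.load()  # later calls: reload the cached memory-mapped copy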
