WebDataset

class torchdata.datapipes.iter.WebDataset(source_datapipe: IterDataPipe[List[Union[Dict, List]]])

Iterable DataPipe that accepts a stream of (path, data) tuples, usually representing the pathnames and files of a tar archive (functional name: webdataset). It aggregates consecutive items with the same basename into a single dictionary, using the extensions as keys (the WebDataset file convention). Any text after the first “.” in the filename is used as the key/extension.

File names that do not have an extension are ignored.
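
The grouping rule described above can be sketched in plain Python, without torchdata. This is an illustrative approximation, not the library's implementation: the helper names `split_key` and `group_samples` are hypothetical, and the `__key__` entry, the leading dot kept in each extension key, and the directory part kept in the shared prefix are assumptions of this sketch that may differ from what the DataPipe actually emits.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple


def split_key(path: str) -> Tuple[str, str]:
    """Split a path at the first "." of its final component.

    Hypothetical helper: returns (prefix, extension), where the extension
    keeps its leading dot and is "" when the filename has no dot.
    """
    # Tar member names always use forward slashes.
    directory, sep, filename = path.rpartition("/")
    stem, dot, extension = filename.partition(".")
    return directory + sep + stem, dot + extension


def group_samples(items: Iterable[Tuple[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Merge consecutive (path, data) pairs that share a prefix into one dict."""
    sample: Dict[str, Any] = {}
    current = None
    for path, data in items:
        prefix, extension = split_key(path)
        if not extension:
            # Files without an extension have no key, so they are skipped.
            continue
        if prefix != current:
            if sample:
                yield sample
            # Assumption: record the shared prefix under "__key__".
            sample = {"__key__": prefix}
            current = prefix
        sample[extension] = data
    if sample:
        yield sample


items = [
    ("data/0001.txt", "hello"),
    ("data/0001.bin", b"\x00"),
    ("data/README", "skipped: no extension"),
    ("data/0002.txt", "world"),
    ("data/0002.seg.png", b"png"),
]

for sample in group_samples(items):
    print(sample)
```

Because only consecutive items are merged, the files belonging to one sample need to sit next to each other in the archive. Splitting at the first dot means a name such as `0002.seg.png` is stored under the compound key `.seg.png` in this sketch.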

Parameters:

source_datapipe – a DataPipe yielding a stream of (path, data) pairs

Returns:

a DataPipe yielding a stream of dictionaries

Examples

>>> from torchdata.datapipes.iter import FileLister, FileOpener
>>>
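>>> def decode(item):
>>>     # item is a (path, stream) pair produced by load_from_tar()
>>>     key, value = item
>>>     if key.endswith(".txt"):
>>>         return key, value.read().decode("utf-8")
>>>     if key.endswith(".bin"):
>>>         return key, value.read().decode("utf-8")
>>>     # Any other file: pass the raw bytes through instead of returning None
>>>     return key, value.read()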
>>>
>>> datapipe1 = FileLister("test/_fakedata", "wds*.tar")
>>> datapipe2 = FileOpener(datapipe1, mode="b")
>>> dataset = datapipe2.load_from_tar().map(decode).webdataset()
>>> for obj in dataset:
>>>     print(obj)
