WebDataset
- class torchdata.datapipes.iter.WebDataset(source_datapipe: IterDataPipe[List[Union[Dict, List]]])
Iterable DataPipe that accepts a stream of (path, data) tuples, usually representing the pathnames and files of a tar archive (functional name: webdataset). It aggregates consecutive items with the same basename into a single dictionary, using the extensions as keys (the WebDataset file convention). Any text after the first “.” in the filename is used as the key/extension. File names that do not have an extension are ignored.
- Parameters:
source_datapipe – a DataPipe yielding a stream of (path, data) pairs
- Returns:
a DataPipe yielding a stream of dictionaries
Examples
>>> from torchdata.datapipes.iter import FileLister, FileOpener
>>>
>>> def decode(item):
>>>     key, value = item
>>>     if key.endswith(".txt"):
>>>         return key, value.read().decode("utf-8")
>>>     if key.endswith(".bin"):
>>>         return key, value.read().decode("utf-8")
>>>
>>> datapipe1 = FileLister("test/_fakedata", "wds*.tar")
>>> datapipe2 = FileOpener(datapipe1, mode="b")
>>> dataset = datapipe2.load_from_tar().map(decode).webdataset()
>>> for obj in dataset:
>>>     print(obj)
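The grouping rule described above can be illustrated without torchdata. The following is a minimal pure-Python sketch of the convention, not the library's implementation. The helper names `split_key` and `group_by_basename` are hypothetical, and two details are assumptions made for illustration: the extension key keeps its leading dot, and the shared basename is stored under a `__key__` entry.

```python
import posixpath


def split_key(path):
    """Split a path into (prefix, extension) at the first "." in the file name.

    Hypothetical helper: returns an empty extension when the file name has no ".".
    """
    directory, filename = posixpath.split(path)
    stem, dot, ext = filename.partition(".")
    if not dot:
        return path, ""
    # Assumption: the key keeps the leading dot, e.g. ".seg.png".
    return posixpath.join(directory, stem), dot + ext


def group_by_basename(items):
    """Group consecutive (path, data) pairs that share a basename into dicts."""
    sample, current = {}, None
    for path, data in items:
        prefix, ext = split_key(path)
        if not ext:
            # File names without an extension are ignored.
            continue
        if prefix != current:
            # A new basename starts a new sample; emit the finished one first.
            if sample:
                yield sample
            # Assumption: the basename is recorded under "__key__".
            sample, current = {"__key__": prefix}, prefix
        sample[ext] = data
    if sample:
        yield sample


items = [
    ("data/a.txt", "hello"),
    ("data/a.bin", b"\x00"),
    ("data/README", "no extension, skipped"),
    ("data/b.seg.png", b"p"),
]
samples = list(group_by_basename(items))
for s in samples:
    print(s)
```

Because only consecutive items are merged, files belonging to one sample need to be stored next to each other in the archive. In this sketch, a basename that reappears later in the stream starts a new dictionary.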