ShardExpander
- class torchdata.datapipes.iter.ShardExpander(source_datapipe: IterDataPipe[str])
Expands incoming shard specification strings into individual shard names (functional name: shard_expand).
Sharded data files are named using shell-like brace notation. For example, an ImageNet dataset sharded into 1200 shards and stored on a web server might be named imagenet-{000000..001199}.tar.
Note that shard names are expanded without any server transactions. This makes shard_expand reproducible and independent of the storage system (unlike FileLister and similar DataPipes that list files from storage).
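To illustrate why no server access is needed, the brace notation described above can be expanded with plain string handling. The following is a minimal standalone sketch, not torchdata's implementation: the helper name `expand_shards` is hypothetical, and padding each number to the width of the lower bound is an assumption made for this illustration.

```python
import re
from typing import Iterator

# Matches the first numeric brace range in a pattern, e.g. "{00..05}".
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_shards(pattern: str) -> Iterator[str]:
    """Yield every name described by a pattern such as 'ds-{00..05}.tar'.

    Hypothetical helper for illustration only; not part of torchdata.
    """
    match = _RANGE.search(pattern)
    if match is None:
        # No range left to expand: the pattern is already a plain name.
        yield pattern
        return
    low, high = match.group(1), match.group(2)
    width = len(low)  # assumption: zero-pad to the width of the lower bound
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    for index in range(int(low), int(high) + 1):
        # Recurse so that patterns containing several ranges are expanded too.
        yield from expand_shards(f"{prefix}{index:0{width}d}{suffix}")


names = list(expand_shards("imagenet-{000000..001199}.tar"))
print(len(names), names[0], names[-1])
# prints: 1200 imagenet-000000.tar imagenet-001199.tar
```

Because the expansion is pure string manipulation, the same pattern always yields the same list of names in the same order, regardless of where the shards are stored.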
- Parameters:
source_datapipe – a DataPipe yielding a stream of shard specification strings, e.g. "ds-{00..05}.tar"
- Returns:
a DataPipe yielding a stream of expanded pathnames.
Example
>>> from torchdata.datapipes.iter import IterableWrapper
>>> source_dp = IterableWrapper(["ds-{00..05}.tar"])
>>> expand_dp = source_dp.shard_expand()
>>> list(expand_dp)
['ds-00.tar', 'ds-01.tar', 'ds-02.tar', 'ds-03.tar', 'ds-04.tar', 'ds-05.tar']
>>> source_dp = IterableWrapper(["imgs_{00..05}.tar", "labels_{00..05}.tar"])
>>> expand_dp = source_dp.shard_expand()
>>> list(expand_dp)
['imgs_00.tar', 'imgs_01.tar', 'imgs_02.tar', 'imgs_03.tar', 'imgs_04.tar', 'imgs_05.tar', 'labels_00.tar', 'labels_01.tar', 'labels_02.tar', 'labels_03.tar', 'labels_04.tar', 'labels_05.tar']