A new, light-weight DataLoader2 is introduced to decouple the overloaded data-manipulation functionalities from torch.utils.data.DataLoader to DataPipe operations. In addition, certain features can only be achieved with DataLoader2, such as snapshotting and switching backend services to perform high-performance operations.
- class torchdata.dataloader2.DataLoader2(datapipe: Optional[Union[IterDataPipe, MapDataPipe]], datapipe_adapter_fn: Optional[Union[Iterable[Adapter], Adapter]] = None, reading_service: Optional[ReadingServiceInterface] = None)
DataLoader2 is used to optimize and execute the given DataPipe graph based on ReadingService and Adapter functions, with support for:
- Dynamic sharding for multiprocess and distributed data loading
- DataPipe graph in-place modification, like shuffle control, memory pinning, etc.
- Snapshotting the state of the data-preprocessing pipeline (WIP)
Parameters:
- datapipe (IterDataPipe or MapDataPipe, optional) – DataPipe from which to load the data. A deepcopy of this datapipe will be made during initialization, allowing the input to be re-used in a different DataLoader2 without sharing states. Input None can only be used if load_state_dict is called right after the creation of the DataLoader.
- datapipe_adapter_fn (Iterable[Adapter] or Adapter, optional) – Adapter function(s) that will be applied to the DataPipe (default: None).
- reading_service (ReadingServiceInterface, optional) – defines how DataLoader2 should execute operations over the DataPipe, e.g. multiprocessing/distributed (default: None). A deepcopy of this will be created during initialization, allowing the ReadingService to be re-used in a different DataLoader2 without sharing states.
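For illustration, a minimal construction sketch that exercises all three parameters; the pipeline itself, Shuffle(True), and the choice of two workers are arbitrary:

```python
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torchdata.dataloader2.adapter import Shuffle
from torchdata.datapipes.iter import IterableWrapper

# A small DataPipe graph; DataLoader2 deepcopies it during initialization.
datapipe = IterableWrapper(range(10)).shuffle().sharding_filter()

dl = DataLoader2(
    datapipe,
    datapipe_adapter_fn=Shuffle(True),  # keep Shufflers in the graph enabled
    reading_service=MultiProcessingReadingService(num_workers=2),
)
```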
- __iter__() → DataLoader2Iterator[T_co]
Return a singleton iterator from the DataPipe graph adapted by the ReadingService. The DataPipe will be restored if a serialized state was provided to construct this DataLoader2, and initialize_iteration and finalize_iterator will be invoked at the beginning and end of the iteration, respectively.
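A small usage sketch; pairing the iteration with shutdown() (documented below) is our assumption about good hygiene rather than a requirement stated here:

```python
from torchdata.dataloader2 import DataLoader2
from torchdata.datapipes.iter import IterableWrapper

dl = DataLoader2(IterableWrapper(range(4)))
try:
    for x in dl:  # the for-loop calls __iter__, which returns the singleton iterator
        print(x)
finally:
    dl.shutdown()  # release any ReadingService resources
```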
- classmethod from_state(state: Dict[str, Any], reading_service: CheckpointableReadingServiceInterface) → DataLoader2[T_co]
Create a new DataLoader2 with the DataPipe graph and ReadingService restored from the serialized state.
- load_state_dict(state_dict: Dict[str, Any]) → None
For an existing DataLoader2, load the serialized state to restore the DataPipe graph and reset the internal state of the ReadingService.
- seed(seed: int) → None
Set random seed for DataLoader2 to control determinism.
Parameters: seed (int) – Random uint64 seed
- shutdown() → None
Shuts down the ReadingService and cleans up the iterator.
- state_dict() → Dict[str, Any]
Return a dictionary to represent the state of the data-processing pipeline with keys:
- serialized_datapipe: serialized DataPipe before ReadingService adaption
- reading_service_state: the state of ReadingService and the adapted DataPipe
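Putting state_dict and load_state_dict together, a hypothetical snapshot/restore round trip (note that snapshotting is still marked WIP, and whether mid-epoch progress is captured depends on the ReadingService in use):

```python
from torchdata.dataloader2 import DataLoader2
from torchdata.datapipes.iter import IterableWrapper

dl = DataLoader2(IterableWrapper(range(100)))
it = iter(dl)
for _ in range(10):  # consume part of an epoch
    next(it)
state = dl.state_dict()  # {'serialized_datapipe': ..., 'reading_service_state': ...}
dl.shutdown()

# datapipe=None is only valid when load_state_dict is called right after
# creation (see the `datapipe` parameter above).
restored = DataLoader2(datapipe=None)
restored.load_state_dict(state)
```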
Note: DataLoader2 doesn’t support torch.utils.data.Dataset or torch.utils.data.IterableDataset. Please wrap each of them with the corresponding DataPipe below:
- torchdata.datapipes.map.SequenceWrapper: torch.utils.data.Dataset
- torchdata.datapipes.iter.IterableWrapper: torch.utils.data.IterableDataset
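A sketch of the wrapping this note describes, using two toy datasets of our own:

```python
from torch.utils.data import Dataset, IterableDataset
from torchdata.datapipes.iter import IterableWrapper
from torchdata.datapipes.map import SequenceWrapper

class SquaresDataset(Dataset):
    def __len__(self):
        return 10
    def __getitem__(self, idx):
        return idx ** 2

class CountStream(IterableDataset):
    def __iter__(self):
        return iter(range(10))

map_dp = SequenceWrapper(SquaresDataset())  # Dataset -> MapDataPipe
iter_dp = IterableWrapper(CountStream())    # IterableDataset -> IterDataPipe
```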
ReadingService specifies the execution backend for the data-processing graph. There are three types of ReadingServices provided in TorchData:
- DistributedReadingService
- MultiProcessingReadingService: spawns multiple worker processes to load data from the DataPipe graph
- SequentialReadingService
Each ReadingService takes the DataPipe graph and rewrites it to achieve a few features like dynamic sharding, sharing random seeds, and snapshotting for multi-/distributed processes. For more detail about those features, please refer to the documentation.
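For example, a sketch with MultiProcessingReadingService; the sharding_filter() call marks the point where the rewritten graph is sharded so that each element is emitted by exactly one worker:

```python
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torchdata.datapipes.iter import IterableWrapper

datapipe = IterableWrapper(range(16)).shuffle().sharding_filter()
rs = MultiProcessingReadingService(num_workers=2)

dl = DataLoader2(datapipe, reading_service=rs)
print(sorted(dl))  # every element appears exactly once across the two workers
dl.shutdown()
```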
Adapter is used to configure, modify and extend the DataPipe graph in DataLoader2. It allows in-place modification or replacement of the pre-assembled DataPipe graph provided by PyTorch domains. For example, Shuffle(False) can be provided to DataLoader2, which would disable any shuffle operations in the DataPipe graph.
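A sketch of that Shuffle(False) example:

```python
from torchdata.dataloader2 import DataLoader2
from torchdata.dataloader2.adapter import Shuffle
from torchdata.datapipes.iter import IterableWrapper

datapipe = IterableWrapper(range(8)).shuffle()

# Shuffle(False) disables every Shuffler in the graph without rebuilding it.
dl = DataLoader2(datapipe, datapipe_adapter_fn=Shuffle(False))
print(list(dl))  # deterministic pass-through order: [0, 1, ..., 7]
dl.shutdown()
```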
- class torchdata.dataloader2.adapter.Adapter
Adapter base class that follows the Python Callable protocol.
- abstract __call__(datapipe: Union[IterDataPipe, MapDataPipe]) → Union[IterDataPipe, MapDataPipe]
Callable function that either runs an in-place modification of the DataPipe graph, or returns a new DataPipe graph.
Here is the list of Adapters provided by TorchData in torchdata.dataloader2.adapter:
- Shuffle: Shuffle DataPipes adapter allows control over all existing Shuffler (shuffle) DataPipes in the graph.
- CacheTimeout: CacheTimeout DataPipes adapter allows control over timeouts of all existing EndOnDiskCacheHolder (end_caching) DataPipes in the graph.
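For instance, a hedged sketch of applying CacheTimeout; here `cached_datapipe` is a placeholder for a graph that contains an .end_caching(...) stage, and the timeout value is illustrative:

```python
from torchdata.dataloader2 import DataLoader2
from torchdata.dataloader2.adapter import CacheTimeout

# `cached_datapipe` stands in for a DataPipe graph containing an
# EndOnDiskCacheHolder created via .end_caching(...).
dl = DataLoader2(cached_datapipe, datapipe_adapter_fn=CacheTimeout(timeout=300))
```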
We also plan to provide more Adapters to cover additional data-processing options:
- PinMemory: attach a DataPipe at the end of the data-processing graph that converts output data to torch.Tensor in pinned memory.
- FullSync: attach a DataPipe to make sure the data-processing graph is synchronized between distributed processes to prevent hanging.
- ShardingPolicy: modify the sharding policy if sharding_filter is present in the DataPipe graph.
If you have feature requests about Adapters you’d like to see provided, please open a GitHub issue. For specific needs, DataLoader2 also accepts any custom Adapter as long as it inherits from the Adapter class, as sketched below.
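For instance, a hypothetical FilterAdapter (the name and predicate are ours, not part of TorchData) that returns a new graph rather than modifying it in place:

```python
from torchdata.dataloader2 import DataLoader2
from torchdata.dataloader2.adapter import Adapter
from torchdata.datapipes.iter import IterableWrapper

class FilterAdapter(Adapter):
    """Hypothetical custom adapter: appends a filter stage to the graph."""
    def __init__(self, predicate):
        self.predicate = predicate

    def __call__(self, datapipe):
        # Returns a new DataPipe graph instead of modifying it in place.
        return datapipe.filter(self.predicate)

dl = DataLoader2(
    IterableWrapper(range(10)),
    datapipe_adapter_fn=FilterAdapter(lambda x: x % 2 == 0),
)
print(list(dl))  # [0, 2, 4, 6, 8]
dl.shutdown()
```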