A new, light-weight DataLoader2 is introduced to move the overloaded data-manipulation functionalities out of torch.utils.data.DataLoader and into DataPipe operations. In addition, certain features can only be achieved with DataLoader2, such as snapshotting and switching backend services to perform high-performance operations.
- class torchdata.dataloader2.DataLoader2(datapipe: Optional[Union[IterDataPipe, MapDataPipe]], datapipe_adapter_fn: Optional[Union[Iterable[Adapter], Adapter]] = None, reading_service: Optional[ReadingServiceInterface] = None)
DataLoader2 is used to optimize and execute the given DataPipe graph based on ReadingService and Adapter functions, with support for:
- Dynamic sharding for multiprocess and distributed data loading
- Multiple backend ReadingServices
- DataPipe graph in-place modification, e.g. shuffle control, memory pinning, etc.
- Snapshotting the state of the data-preprocessing pipeline (WIP)
Parameters:
- datapipe (IterDataPipe or MapDataPipe, optional) – DataPipe from which to load the data. A deepcopy of this datapipe will be made during initialization, allowing the input to be re-used in a different DataLoader2 without sharing states. Input None can only be used if load_state_dict is called right after the creation of the DataLoader.
- datapipe_adapter_fn (Iterable[Adapter] or Adapter, optional) – Adapter function(s) that will be applied to the DataPipe (default: None).
- reading_service (ReadingServiceInterface, optional) – defines how DataLoader2 should execute operations over the DataPipe, e.g. multiprocessing/distributed (default: None). A deepcopy of this will be created during initialization, allowing the ReadingService to be re-used in a different DataLoader2 without sharing states.
Note: When a MapDataPipe is passed into DataLoader2, DataLoader2 will attempt to create an iterator via iter(datapipe) in order to iterate through the data. If the datapipe's indices are not zero-indexed integers, this may fail. Consider using .shuffle() (which converts a MapDataPipe to an IterDataPipe) or datapipe.to_iter_datapipe(custom_indices).
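As a quick illustration, here is a minimal sketch of assembling a small DataPipe graph and feeding it to DataLoader2; the toy pipeline and worker count are arbitrary choices for this example:

```python
from torchdata.datapipes.iter import IterableWrapper
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService

# Build a toy DataPipe graph: wrap a range, shuffle it, then batch.
datapipe = IterableWrapper(range(10)).shuffle().batch(2)

# Execute the graph with two worker processes.
rs = MultiProcessingReadingService(num_workers=2)
dl = DataLoader2(datapipe, reading_service=rs)

for batch in dl:
    print(batch)

dl.shutdown()  # release worker processes
```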
- __iter__() → DataLoader2Iterator[T_co]
Return a singleton iterator from the DataPipe graph adapted by the ReadingService. The DataPipe will be restored if a serialized state was provided to construct this DataLoader2, and initialize_iteration and finalize_iterator will be invoked at the beginning and end of the iteration, respectively.
- classmethod from_state(state: Dict[str, Any], reading_service: CheckpointableReadingServiceInterface) → DataLoader2[T_co]
Create a new DataLoader2 with the DataPipe graph and ReadingService restored from the serialized state.
- load_state_dict(state_dict: Dict[str, Any]) → None
For an existing DataLoader2, load the serialized state to restore the DataPipe graph and reset the internal state of the ReadingService.
- seed(seed: int) → None
Set random seed for DataLoader2 to control determinism.
Parameters: seed – Random uint64 seed
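For example, a fixed seed can be set before iterating to make shuffling and other random operations reproducible (a sketch; dl is assumed to be an existing DataLoader2 with a Shuffler in its graph):

```python
dl.seed(2023)       # any uint64 value
epoch_a = list(dl)  # iterate once
dl.seed(2023)       # reset to the same seed
epoch_b = list(dl)  # should replay the same shuffled order
```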
- shutdown() → None
Shuts down the ReadingService and cleans up the iterator.
- state_dict() → Dict[str, Any]
Return a dictionary to represent the state of the data-processing pipeline with keys:
- serialized_datapipe: serialized DataPipe before ReadingService adaption.
- reading_service_state: the state of the ReadingService and the adapted DataPipe.
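Putting state_dict, from_state, and load_state_dict together, a checkpoint/restore flow might look like the sketch below; dl is a DataLoader2 as in the earlier sketch, and checkpointable_rs stands in for any reading service implementing CheckpointableReadingServiceInterface (an assumption of this example):

```python
# Snapshot the pipeline (DataPipe graph + ReadingService state).
state = dl.state_dict()

# Option 1: restore into a brand-new DataLoader2.
restored = DataLoader2.from_state(state, checkpointable_rs)

# Option 2: restore into a DataLoader2 created with datapipe=None,
# calling load_state_dict right after creation as required.
blank = DataLoader2(datapipe=None, reading_service=checkpointable_rs)
blank.load_state_dict(state)
```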
Note: DataLoader2 doesn't support torch.utils.data.Dataset or torch.utils.data.IterableDataset. Please wrap each of them with the corresponding DataPipe:
- torchdata.datapipes.map.SequenceWrapper for torch.utils.data.Dataset
- torchdata.datapipes.iter.IterableWrapper for torch.utils.data.IterableDataset
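For instance, existing datasets can be wrapped before being handed to DataLoader2 (a sketch; my_map_dataset and my_iterable_dataset are placeholder objects):

```python
from torchdata.datapipes.map import SequenceWrapper
from torchdata.datapipes.iter import IterableWrapper

map_dp = SequenceWrapper(my_map_dataset)        # wraps a torch.utils.data.Dataset
iter_dp = IterableWrapper(my_iterable_dataset)  # wraps a torch.utils.data.IterableDataset
```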
ReadingService specifies the execution backend for the data-processing graph. There are three types of ReadingService provided in TorchData:
- DistributedReadingService: handles distributed sharding of the DataPipe graph and shares random seeds across distributed processes.
- InProcessReadingService: default ReadingService that serves the DataPipe graph in the main process and applies graph settings, such as determinism control, to the graph.
- MultiProcessingReadingService: spawns multiple worker processes to load data from the DataPipe graph.

Each ReadingService takes the DataPipe graph and rewrites it to achieve features like dynamic sharding, shared random seeds, and snapshotting for multiprocess/distributed data loading. For more detail about those features, please refer to the documentation.
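As a sketch of switching backends, the same DataPipe graph can be served by different reading services without changing the graph itself (datapipe is a placeholder graph; class names are from torchdata.dataloader2):

```python
from torchdata.dataloader2 import (
    DataLoader2,
    InProcessReadingService,
    MultiProcessingReadingService,
)

# Same graph, two execution backends.
dl_main = DataLoader2(datapipe, reading_service=InProcessReadingService())
dl_workers = DataLoader2(datapipe, reading_service=MultiProcessingReadingService(num_workers=4))
```

Because DataLoader2 deepcopies its input during initialization, the same datapipe object can be reused across both loaders without sharing state.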
Adapter is used to configure, modify, and extend the DataPipe graph in DataLoader2. It allows in-place modification or replacement of the pre-assembled DataPipe graph provided by PyTorch domains. For example, Shuffle(False) can be provided to DataLoader2, which would disable any shuffle operations in the DataPipe graph.
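A minimal sketch of that example, assuming the graph contains a Shuffler created via .shuffle():

```python
from torchdata.datapipes.iter import IterableWrapper
from torchdata.dataloader2 import DataLoader2
from torchdata.dataloader2.adapter import Shuffle

datapipe = IterableWrapper(range(10)).shuffle().batch(2)

# Shuffle(False) rewrites the graph so the Shuffler is disabled.
dl = DataLoader2(datapipe, datapipe_adapter_fn=Shuffle(False))
```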
- class torchdata.dataloader2.adapter.Adapter
Adapter base class that follows the Python Callable protocol.
- abstract __call__(datapipe: Union[IterDataPipe, MapDataPipe]) → Union[IterDataPipe, MapDataPipe]
Callable function that either modifies the DataPipe graph in place, or returns a new DataPipe graph.
Parameters: datapipe – DataPipe that needs to be adapted.
Here is the list of Adapters provided by TorchData in torchdata.dataloader2.adapter:
- Shuffle: allows control over all existing Shuffler (shuffle) DataPipes in the graph.
- CacheTimeout: allows control over the timeouts of all existing EndOnDiskCacheHolder (end_caching) DataPipes in the graph.
And, more Adapters will be provided to cover further data-processing options:
- PinMemory: attach a DataPipe at the end of the data-processing graph that converts output data to torch.Tensor in pinned memory.
- FullSync: attach a DataPipe to make sure the data-processing graph is synchronized between distributed processes to prevent hanging.
- ShardingPolicy: modify the sharding policy if a sharding_filter is present in the DataPipe graph.
If you have feature requests about Adapters you'd like to see provided, please open a GitHub issue. For specific needs, DataLoader2 also accepts any custom Adapter as long as it inherits from the Adapter class.
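A minimal sketch of such a custom Adapter; TakeFirstN is a hypothetical name, and it assumes the incoming graph is an IterDataPipe supporting the header operation:

```python
from typing import Union
from torch.utils.data import IterDataPipe, MapDataPipe
from torchdata.dataloader2.adapter import Adapter

class TakeFirstN(Adapter):
    """Hypothetical adapter: keep only the first n samples of the graph."""

    def __init__(self, n: int) -> None:
        self.n = n

    def __call__(self, datapipe: Union[IterDataPipe, MapDataPipe]) -> IterDataPipe:
        # Returns a new DataPipe graph rather than modifying in place.
        return datapipe.header(self.n)

# Usage sketch: dl = DataLoader2(datapipe, datapipe_adapter_fn=TakeFirstN(100))
```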