Attention

June 2024 Status Update: Removing DataPipes and DataLoader V2

We are re-focusing the torchdata repo to be an iterative enhancement of torch.utils.data.DataLoader. We do not plan on continuing development or maintaining the DataPipes and DataLoaderV2 solutions, and they will be removed from the torchdata repo. We’ll also be revisiting the DataPipes references in pytorch/pytorch. In release torchdata==0.8.0 (July 2024) they will be marked as deprecated, and in 0.9.0 (Oct 2024) they will be deleted. Existing users are advised to pin to torchdata==0.8.0 or an older version until they are able to migrate away. Subsequent releases will not include DataPipes or DataLoaderV2. Please reach out if you have suggestions or comments (please use this issue for feedback).

EndOnDiskCacheHolder

class torchdata.datapipes.iter.EndOnDiskCacheHolder(datapipe, mode='wb', filepath_fn=None, *, same_filepath_fn=False, skip_read=False, timeout=300)

Marks the point at which the result of the prior DataPipe is saved to the local files specified by filepath_fn (functional name: end_caching). The source DataPipe must yield tuples of either metadata and data, or metadata and a file handle.
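The (metadata, data) contract described above can be illustrated without torchdata. The following is a minimal plain-Python sketch, not the library's implementation: end_caching_sketch, to_cache_path, and the example.com URLs are hypothetical names introduced for illustration, and yielding the written file path is an assumption of this sketch.

```python
import os
import tempfile
from typing import Callable, Iterable, Iterator, Optional, Tuple, Union


def end_caching_sketch(
    source: Iterable[Tuple[str, Union[bytes, str]]],
    mode: str = "wb",
    filepath_fn: Optional[Callable[[str], str]] = None,
) -> Iterator[str]:
    """Write each (metadata, data) pair to disk and yield the file path."""
    for metadata, data in source:
        # Without filepath_fn, the metadata itself is treated as the file path.
        path = filepath_fn(metadata) if filepath_fn is not None else metadata
        # mode must match the data type: "wb" for bytes, "w" for str.
        with open(path, mode) as f:
            f.write(data)
        yield path


# Hypothetical cache directory and path mapping, for illustration only.
cache_dir = tempfile.mkdtemp()


def to_cache_path(url: str) -> str:
    return os.path.join(cache_dir, os.path.basename(url))


pairs = [
    ("https://example.com/a.bin", b"\x00\x01"),
    ("https://example.com/b.bin", b"hello"),
]
cached_paths = list(end_caching_sketch(pairs, mode="wb", filepath_fn=to_cache_path))
```

Here the metadata is a URL and filepath_fn maps it to a location in a temporary directory, mirroring the role of _filepath_fn in the Example section.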

Parameters:

datapipe – IterDataPipe with at least one OnDiskCacheHolder in the graph.
mode – Mode in which the cached files are opened to write the data to disk. It must match the type of data or file handle yielded by datapipe. "wb" is used by default.
filepath_fn – Optional function to extract the file path from the metadata yielded by datapipe. By default, the metadata is used directly as the file path.
same_filepath_fn – Set to True to use the same filepath_fn as the OnDiskCacheHolder.
skip_read – Boolean value to skip reading the file handle from datapipe. By default, reading is enabled and the reading function is created based on the mode.
timeout – Integer number of seconds to wait for an uncached item to be written to disk.
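To make the timeout parameter concrete, the sketch below shows one way a consumer could wait for a file that another worker is still writing. It is an illustration under stated assumptions, not torchdata's code: wait_for_cached_file, poll_interval, and the file names are hypothetical, and what the library does when the timeout expires is not specified here.

```python
import os
import tempfile
import threading
import time


def wait_for_cached_file(path: str, timeout: float = 300, poll_interval: float = 0.01) -> bool:
    """Return True once path exists, or False if timeout seconds pass first."""
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True


# Hypothetical cache location, for illustration only.
cache_dir = tempfile.mkdtemp()
target = os.path.join(cache_dir, "item.bin")


def slow_writer() -> None:
    # Simulates another worker that finishes writing the item a little later.
    time.sleep(0.05)
    with open(target, "wb") as f:
        f.write(b"payload")


writer = threading.Thread(target=slow_writer)
writer.start()
found = wait_for_cached_file(target, timeout=5)
writer.join()

# A file that is never written is reported as missing once the timeout expires.
missing = wait_for_cached_file(os.path.join(cache_dir, "never.bin"), timeout=0.1)
```

A longer timeout tolerates slow downloads by other workers, while a shorter one surfaces stalled writers sooner.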

Example

>>> import os
>>> import tempfile
>>> from torchdata.datapipes.iter import IterableWrapper, HttpReader
>>> url = IterableWrapper(["https://path/to/filename", ])
>>> def _filepath_fn(url):
...     temp_dir = tempfile.gettempdir()
...     return os.path.join(temp_dir, os.path.basename(url))
>>> # Maps each expected file path to its expected MD5 hash (both are placeholders to fill in)
>>> _hash_dict = {"expected_filepath": expected_MD5_hash}
>>> # You must call ``.on_disk_cache`` at some point before ``.end_caching``
>>> cache_dp = url.on_disk_cache(filepath_fn=_filepath_fn, hash_dict=_hash_dict, hash_type="md5")
>>> # You must call ``.end_caching`` at a later point to stop tracing and save the results to local files.
>>> cache_dp = HttpReader(cache_dp).end_caching(mode="wb", filepath_fn=_filepath_fn)
