Attention

June 2024 Status Update: Removing DataPipes and DataLoader V2

We are re-focusing the torchdata repo to be an iterative enhancement of torch.utils.data.DataLoader. We do not plan to continue developing or maintaining the DataPipes and DataLoaderV2 solutions, and they will be removed from the torchdata repo. We’ll also be revisiting the DataPipes references in pytorch/pytorch. In release torchdata==0.8.0 (July 2024) they will be marked as deprecated, and in 0.9.0 (Oct 2024) they will be deleted. Existing users are advised to pin to torchdata==0.8.0 or an older version until they are able to migrate away. Subsequent releases will not include DataPipes or DataLoaderV2. Please reach out if you have suggestions or comments (please use this issue for feedback).

OnDiskCacheHolder

class torchdata.datapipes.iter.OnDiskCacheHolder(source_datapipe: IterDataPipe, filepath_fn: Optional[Callable] = None, hash_dict: Optional[Dict[str, str]] = None, hash_type: str = 'sha256', extra_check_fn: Optional[Callable[[str], bool]] = None)

Caches the outputs of multiple DataPipe operations to local files. These operations are typically performance bottlenecks, such as downloading and decompressing (functional name: on_disk_cache).

You must call .end_caching() to stop tracing the sequence of DataPipe operations and save the results to local files.
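
The idea of caching an expensive step to a local file can be sketched in plain Python. This is a conceptual illustration only, not how OnDiskCacheHolder is implemented: cached_bytes, fake_download, and path_for are hypothetical names, and the sketch does not trace DataPipe operations or need .end_caching().

```python
import os
import tempfile
from typing import Callable, Tuple


def cached_bytes(
    key: str,
    filepath_fn: Callable[[str], str],
    produce_fn: Callable[[str], bytes],
) -> Tuple[bytes, bool]:
    """Return (data, was_cache_hit) for `key`, storing results on disk."""
    path = filepath_fn(key)
    if os.path.exists(path):
        # Cache hit: skip the expensive step and read the stored result.
        with open(path, "rb") as f:
            return f.read(), True
    # Cache miss: run the expensive step once and persist its output.
    data = produce_fn(key)
    with open(path, "wb") as f:
        f.write(data)
    return data, False


# Stand-in for a slow download (hypothetical; no network access happens).
cache_dir = tempfile.mkdtemp()
calls = []


def fake_download(url: str) -> bytes:
    calls.append(url)
    return b"payload for " + url.encode()


def path_for(url: str) -> str:
    # Plays the role of filepath_fn: one input item maps to one local path.
    return os.path.join(cache_dir, os.path.basename(url))


first = cached_bytes("https://path/to/filename", path_for, fake_download)
second = cached_bytes("https://path/to/filename", path_for, fake_download)
```

The second call finds the file written by the first, so fake_download runs only once.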

Parameters:
  • source_datapipe – IterDataPipe

  • filepath_fn – Given data from source_datapipe, returns the file path on the local file system. The function may return only a single file path. If the resulting file name differs from the one generated by the filepath function passed to end_caching, the original file name is used to store the list of yielded files (and to check whether cached items are available).

  • hash_dict – A dictionary mapping file names to their corresponding hashes. If hash_dict is specified, an extra hash check runs before data is saved to the local file system. If the data does not match the expected hash, the pipeline raises an error.

  • hash_type – The type of hash function to apply

  • extra_check_fn – Optional function to carry out extra validation on the given file path from filepath_fn.
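
The kind of check that hash_dict, hash_type, and extra_check_fn describe can be approximated with the standard library. The sketch below is an assumption-laden illustration, not torchdata's internal code: file_digest, matches_hash_dict, and non_empty are hypothetical helpers, and keying the dictionary by full file path is an assumption made for the demo.

```python
import hashlib
import os
import tempfile
from typing import Dict


def file_digest(path: str, hash_type: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of the file at `path`, read in chunks."""
    hasher = hashlib.new(hash_type)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_hash_dict(path: str, hash_dict: Dict[str, str], hash_type: str = "sha256") -> bool:
    """True when `path` is listed in `hash_dict` and its digest matches."""
    expected = hash_dict.get(path)
    return expected is not None and file_digest(path, hash_type) == expected


def non_empty(path: str) -> bool:
    """A callable with the shape of extra_check_fn: str -> bool."""
    return os.path.getsize(path) > 0


# Demo on a temporary file with known contents.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

good_hashes = {demo_path: hashlib.sha256(b"hello").hexdigest()}
bad_hashes = {demo_path: hashlib.sha256(b"other").hexdigest()}

hash_ok = matches_hash_dict(demo_path, good_hashes)
hash_mismatch = matches_hash_dict(demo_path, bad_hashes)
size_ok = non_empty(demo_path)
os.remove(demo_path)
```

Reading in chunks keeps memory use flat for large downloads, which is the typical case for cached files.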

Example

>>> import os
>>> import tempfile
>>> from torchdata.datapipes.iter import IterableWrapper, HttpReader
>>> url = IterableWrapper(["https://path/to/filename", ])
>>> def _filepath_fn(url):
...     # Map each URL to a file of the same base name in the temp directory.
...     temp_dir = tempfile.gettempdir()
...     return os.path.join(temp_dir, os.path.basename(url))
>>> # ``expected_MD5_hash`` is a placeholder for the file's known MD5 digest.
>>> hash_dict = {"expected_filepath": expected_MD5_hash}
>>> cache_dp = url.on_disk_cache(filepath_fn=_filepath_fn, hash_dict=hash_dict, hash_type="md5")
>>> # You must call ``.end_caching`` at a later point to stop tracing and save the results to local files.
>>> cache_dp = HttpReader(cache_dp).end_caching(mode="wb", filepath_fn=_filepath_fn)
