OnDiskCacheHolder
- class torchdata.datapipes.iter.OnDiskCacheHolder(source_datapipe: IterDataPipe, filepath_fn: Optional[Callable] = None, hash_dict: Optional[Dict[str, str]] = None, hash_type: str = 'sha256', extra_check_fn: Optional[Callable[[str], bool]] = None)
Caches the outputs of multiple DataPipe operations to local files; these operations, such as downloading and decompressing, are typically performance bottlenecks (functional name: on_disk_cache). You must call .end_caching() to stop tracing the sequence of DataPipe operations and save the results to local files.
- Parameters:
  - source_datapipe – IterDataPipe
  - filepath_fn – Given data from source_datapipe, returns file path(s) on the local file system. A single file path, a tuple or list of file paths, or a generator function that yields file paths are all accepted return types (see the sketch after this list). By default, the data from source_datapipe is used directly to check for the existence of cached files.
  - hash_dict – A dictionary mapping file names to their corresponding hashes. If hash_dict is specified, an extra hash check is attached before saving data to the local file system. If the data does not match the hash, the pipeline raises an error.
  - hash_type – The type of hash function to apply, e.g. "sha256" (the default) or "md5".
  - extra_check_fn – Optional function to carry out extra validation on the given file path from filepath_fn.
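The sketch below illustrates the two hooks together; the URL, file names, and helper names (_members_fn, _non_empty) are hypothetical. A generator-style filepath_fn maps one source item to several expected cache files, while extra_check_fn returns False for files that should be treated as cache misses (here, empty files left behind by truncated downloads).
>>> import os
>>> import tempfile
>>> from torchdata.datapipes.iter import IterableWrapper
>>> url_dp = IterableWrapper(["https://path/to/archive.tar"])
>>> def _members_fn(url):
>>>     # generator return type: one source item -> several expected cache files
>>>     for member in ("train.csv", "test.csv"):  # assumed archive contents
>>>         yield os.path.join(tempfile.gettempdir(), member)
>>> def _non_empty(filepath):
>>>     # extra_check_fn: reject empty files so they are re-fetched
>>>     return os.path.exists(filepath) and os.path.getsize(filepath) > 0
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=_members_fn, extra_check_fn=_non_empty)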
Example
>>> import os
>>> import tempfile
>>> from torchdata.datapipes.iter import IterableWrapper, HttpReader
>>> url = IterableWrapper(["https://path/to/filename", ])
>>> def _filepath_fn(url):
>>>     temp_dir = tempfile.gettempdir()
>>>     return os.path.join(temp_dir, os.path.basename(url))
>>> _hash_dict = {"expected_filepath": expected_MD5_hash}
>>> cache_dp = url.on_disk_cache(filepath_fn=_filepath_fn, hash_dict=_hash_dict, hash_type="md5")
>>> # You must call ``.end_caching`` at a later point to stop tracing and save the results to local files.
>>> cache_dp = HttpReader(cache_dp).end_caching(mode="wb", filepath_fn=_filepath_fn)
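Because tracing continues until .end_caching() is called, cache holders can also be layered, e.g. one holder caching a downloaded archive and a second caching the files extracted from it. The following is a sketch under assumptions: the URL points to a tar archive, _archive_fn and _extracted_fn are hypothetical path helpers, and same_filepath_fn=True tells end_caching to reuse the filepath_fn from the matching on_disk_cache call.
>>> import os
>>> import tempfile
>>> from torchdata.datapipes.iter import FileOpener, HttpReader, IterableWrapper
>>> url_dp = IterableWrapper(["https://path/to/archive.tar"])
>>> def _archive_fn(url):
>>>     # hypothetical helper: cache the raw archive under the temp directory
>>>     return os.path.join(tempfile.gettempdir(), os.path.basename(url))
>>> def _extracted_fn(url):
>>>     # hypothetical helper: the single file expected after extraction
>>>     return os.path.join(tempfile.gettempdir(), "data.txt")
>>> # First cache layer: the downloaded archive itself
>>> archive_dp = url_dp.on_disk_cache(filepath_fn=_archive_fn)
>>> archive_dp = HttpReader(archive_dp).end_caching(mode="wb", same_filepath_fn=True)
>>> # Second cache layer: the file extracted from the archive
>>> extracted_dp = archive_dp.on_disk_cache(filepath_fn=_extracted_fn)
>>> extracted_dp = FileOpener(extracted_dp, mode="b").load_from_tar()
>>> extracted_dp = extracted_dp.filter(lambda t: t[0].endswith("data.txt"))
>>> extracted_dp = extracted_dp.end_caching(mode="wb", same_filepath_fn=True)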