S3FileLoader
- class torchdata.datapipes.iter.S3FileLoader(source_datapipe: IterDataPipe[str], request_timeout_ms=-1, region='', buffer_size=None, multi_part_download=None)
Iterable DataPipe that loads Amazon S3 files from the given S3 URLs (functional name: load_files_by_s3). S3FileLoader iterates all given S3 URLs in BytesIO format, yielding (url, BytesIO) tuples.

Note

source_datapipe must contain a list of valid S3 URLs.

request_timeout_ms and region will overwrite settings in the configuration file or environment variables.
- Parameters:
source_datapipe – a DataPipe that contains URLs to S3 files
request_timeout_ms – timeout setting for each request (3,000 ms by default)
region – region for accessing files (inferred from credentials by default)
buffer_size – buffer size of each chunk to download large files progressively (128 MB by default)
multi_part_download – flag to split each chunk into small packets and download those packets in parallel (enabled by default)
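The snippet below is a minimal sketch (the bucket name, key, region, and timeout value are placeholders) of how these parameters can be passed explicitly to overwrite the configuration-file or environment-variable settings mentioned in the Note above.

>>> from torchdata.datapipes.iter import IterableWrapper, S3FileLoader
>>> urls = IterableWrapper(['s3://bucket-name/folder/file.tar'])
>>> # Explicit per-request timeout and region take precedence over config/env settings
>>> dp = S3FileLoader(urls, request_timeout_ms=5000, region='us-west-2', multi_part_download=True)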
Example

>>> from torchdata.datapipes.iter import IterableWrapper, S3FileLoader
>>> dp_s3_urls = IterableWrapper(['s3://bucket-name/folder/', ...]).list_files_by_s3()
# In order to make sure data are shuffled and sharded in the
# distributed environment, `shuffle` and `sharding_filter`
# are required. For details, please check our tutorial at:
# https://pytorch.org/data/main/tutorial.html#working-with-dataloader
>>> sharded_s3_urls = dp_s3_urls.shuffle().sharding_filter()
>>> dp_s3_files = S3FileLoader(sharded_s3_urls)
>>> for url, fd in dp_s3_files:  # Start loading data
...     data = fd.read()
# Functional API
>>> dp_s3_files = sharded_s3_urls.load_files_by_s3(buffer_size=256)
>>> for url, fd in dp_s3_files:
...     data = fd.read()
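As a follow-up, the sketch below (the batch size is illustrative, and dp_s3_files refers to the pipe built in the example above) shows one way to feed the loaded files into torch.utils.data.DataLoader, along the lines of the linked tutorial.

>>> from torch.utils.data import DataLoader
>>> def to_bytes(item):
...     url, fd = item  # each element is a (url, BytesIO) tuple
...     return fd.read()
>>> dp_bytes = dp_s3_files.map(to_bytes)
>>> loader = DataLoader(dp_bytes, batch_size=4)
>>> for batch in loader:
...     pass  # each batch is a list of raw byte strings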