S3FileLoader

class torchdata.datapipes.iter.S3FileLoader(source_datapipe: IterDataPipe[str], request_timeout_ms=-1, region='', buffer_size=None, multi_part_download=None)

Iterable DataPipe that loads Amazon S3 files from the given S3 URLs (functional name: load_files_by_s3). S3FileLoader iterates over all given S3 URLs and yields each file's contents as a (url, BytesIO) tuple.

Note

  1. source_datapipe must contain a list of valid S3 URLs.

  2. request_timeout_ms and region will override settings in the configuration file or environment variables (see the sketch after the parameter list).

Parameters:
  • source_datapipe – a DataPipe that contains URLs to s3 files

  • request_timeout_ms – timeout for each request (3,000 ms by default)

  • region – AWS region used to access files (inferred from credentials by default)

  • buffer_size – buffer size of each chunk used to download large files progressively (128 MB by default)

  • multi_part_download – flag to split each chunk into small packets and download those packets in parallel (enabled by default)
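A minimal sketch of constructing the DataPipe with explicit settings; the bucket name, region, and timeout below are illustrative placeholder values, not defaults:

>>> from torchdata.datapipes.iter import IterableWrapper, S3FileLoader
>>> urls = IterableWrapper(['s3://bucket-name/folder/']).list_files_by_s3()
# Explicit arguments take precedence over the configuration file and
# environment variables.
>>> dp = S3FileLoader(urls, request_timeout_ms=5000, region='us-west-2')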

Example

>>> from torchdata.datapipes.iter import IterableWrapper, S3FileLoader
>>> dp_s3_urls = IterableWrapper(['s3://bucket-name/folder/', ...]).list_files_by_s3()
# In order to make sure data are shuffled and sharded in the
# distributed environment, `shuffle` and `sharding_filter`
# are required. For details, please check our tutorial at:
# https://pytorch.org/data/main/tutorial.html#working-with-dataloader
>>> sharded_s3_urls = dp_s3_urls.shuffle().sharding_filter()
>>> dp_s3_files = S3FileLoader(sharded_s3_urls)
>>> for url, fd in dp_s3_files: # Start loading data
...     data = fd.read()
# Functional API
>>> dp_s3_files = sharded_s3_urls.load_files_by_s3(buffer_size=256)
>>> for url, fd in dp_s3_files:
...     data = fd.read()
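
Since S3FileLoader is an IterDataPipe, its output can also be consumed through torch.utils.data.DataLoader, as covered in the tutorial linked above; a minimal sketch (batch_size=None yields each (url, BytesIO) pair unbatched):

>>> from torch.utils.data import DataLoader
>>> dl = DataLoader(dp_s3_files, batch_size=None)
>>> for url, fd in dl:
...     data = fd.read()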
