S3FileLister

class torchdata.datapipes.iter.S3FileLister(source_datapipe: IterDataPipe[str], length: int = -1, request_timeout_ms=-1, region='')

Iterable DataPipe that lists Amazon S3 file URLs with the given prefixes (functional name: list_files_by_s3). Acceptable prefixes include s3://bucket-name, s3://bucket-name/, s3://bucket-name/folder.
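To make the three accepted prefix forms concrete, the sketch below splits each one into a bucket name and a key prefix using only the standard library. The helper name split_s3_prefix is hypothetical and is not part of torchdata; it only illustrates how such URLs decompose, and does not reflect how S3FileLister parses them internally.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_prefix(url: str) -> Tuple[str, str]:
    """Split an S3 URL prefix into (bucket, key_prefix).

    Hypothetical helper for illustration only; not a torchdata API.
    """
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not a valid S3 URL: {url!r}")
    # The bucket is the URL authority; the rest, minus the leading slash,
    # is the key prefix (empty when the whole bucket is meant).
    return parsed.netloc, parsed.path.lstrip("/")


for prefix in ("s3://bucket-name", "s3://bucket-name/", "s3://bucket-name/folder"):
    print(prefix, "->", split_s3_prefix(prefix))
```

The first two forms both resolve to the whole bucket with an empty key prefix, while the third restricts the listing to keys that start with folder.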

Note

  1. source_datapipe must contain a list of valid S3 URLs.

  2. length is -1 by default. In that case any call to __len__() is invalid, because the length is unknown until all files have been iterated.

  3. request_timeout_ms and region override the settings in the configuration file or environment variables.
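The length behaviour in note 2 can be illustrated with a small pure-Python stand-in. NominalLengthLister and safe_len are hypothetical names invented for this sketch; the class mimics only the length handling described above, assumes that an invalid __len__() call surfaces as a TypeError, and does not contact S3.

```python
from typing import Iterable, Iterator, Optional


class NominalLengthLister:
    """Hypothetical stand-in that mimics only the nominal-length handling."""

    def __init__(self, urls: Iterable[str], length: int = -1) -> None:
        self.urls = list(urls)
        self.length = length  # nominal length supplied by the caller

    def __iter__(self) -> Iterator[str]:
        return iter(self.urls)

    def __len__(self) -> int:
        # Assumption: with the default of -1 the length is treated as unknown.
        if self.length == -1:
            raise TypeError(f"{type(self).__name__} instance doesn't have valid length")
        return self.length


def safe_len(dp) -> Optional[int]:
    """Return len(dp), or None when the pipe does not define a valid length."""
    try:
        return len(dp)
    except TypeError:
        return None


urls = ["s3://bucket-name/folder/a", "s3://bucket-name/folder/b"]
print(safe_len(NominalLengthLister(urls)))            # length unknown
print(safe_len(NominalLengthLister(urls, length=2)))  # nominal length given
print(sum(1 for _ in NominalLengthLister(urls)))      # counting by iteration
```

Because the length is only nominal, the stand-in returns whatever value the caller passed and never checks it against the number of items actually yielded. Code that needs a count without a supplied length has to iterate.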

Parameters:
  • source_datapipe – a DataPipe that contains URLs or URL prefixes pointing to S3 files

  • length – Nominal length of the datapipe

  • request_timeout_ms – timeout setting for each request (3,000 ms by default)

  • region – region used to access the files (inferred from credentials by default)

Example

>>> from torchdata.datapipes.iter import IterableWrapper, S3FileLister
>>> s3_prefixes = IterableWrapper(['s3://bucket-name/folder/', ...])
>>> dp_s3_urls = S3FileLister(s3_prefixes)
>>> for d in dp_s3_urls:
...     pass
# Functional API
>>> dp_s3_urls = s3_prefixes.list_files_by_s3(request_timeout_ms=100)
>>> for d in dp_s3_urls:
...     pass
