S3FileLister
- class torchdata.datapipes.iter.S3FileLister(source_datapipe: IterDataPipe[str], length: int = -1, request_timeout_ms=-1, region='')
Iterable DataPipe that lists Amazon S3 file URLs with the given prefixes (functional name: list_files_by_s3). Acceptable prefixes include s3://bucket-name, s3://bucket-name/, and s3://bucket-name/folder.
Note
- source_datapipe must contain a list of valid S3 URLs.
- length is -1 by default, and any call to __len__() is invalid, because the length is unknown until all files are iterated.
- request_timeout_ms and region will override settings in the configuration file or environment variables.
- Parameters:
source_datapipe – a DataPipe that contains URLs/URL prefixes to S3 files
length – nominal length of the datapipe
request_timeout_ms – timeout setting for each request (3,000 ms by default)
region – region used for accessing files (inferred from credentials by default)
Example
>>> from torchdata.datapipes.iter import IterableWrapper, S3FileLister
>>> s3_prefixes = IterableWrapper(['s3://bucket-name/folder/', ...])
>>> dp_s3_urls = S3FileLister(s3_prefixes)
>>> for d in dp_s3_urls:
...     pass
>>> # Functional API
>>> dp_s3_urls = s3_prefixes.list_files_by_s3(request_timeout_ms=100)
>>> for d in dp_s3_urls:
...     pass
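Because source_datapipe must contain valid S3 URLs, it can help to check prefixes before wrapping them. The following is a minimal sketch using only the Python standard library. The helper name split_s3_prefix is hypothetical and not part of torchdata, and the check is purely syntactic: it does not contact S3 or verify that the bucket exists.

```python
from urllib.parse import urlparse


def split_s3_prefix(url: str) -> tuple[str, str]:
    """Split an s3:// URL into (bucket, key_prefix).

    Hypothetical helper for illustration; not part of torchdata.
    Raises ValueError if the URL has no s3 scheme or no bucket name.
    """
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not a valid S3 URL prefix: {url!r}")
    # The key prefix is the path without its leading slash; it is empty
    # for bucket-level prefixes such as s3://bucket-name or s3://bucket-name/.
    return parsed.netloc, parsed.path.lstrip("/")


# The three prefix forms listed as acceptable in the description above.
candidates = ["s3://bucket-name", "s3://bucket-name/", "s3://bucket-name/folder"]
parsed_prefixes = [split_s3_prefix(u) for u in candidates]
```

A list that passes this check could then be wrapped with IterableWrapper and passed to S3FileLister as in the example above.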