torchaudio.datasets

All datasets are subclasses of torch.utils.data.Dataset, i.e., they implement the __getitem__ and __len__ methods. Hence, they can all be passed to a torch.utils.data.DataLoader, which can load multiple samples in parallel using torch.multiprocessing workers. For example:

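import torch
import torchaudio

# Download YESNO into the current directory if it is not already there.
yesno_data = torchaudio.datasets.YESNO('.', download=True)
data_loader = torch.utils.data.DataLoader(yesno_data,
                                          batch_size=1,
                                          shuffle=True,
                                          num_workers=2)  # number of worker processes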

The following datasets are available:

CMUARCTIC
COMMONVOICE
GTZAN
LIBRISPEECH
LIBRITTS
LJSPEECH
SPEECHCOMMANDS
TEDLIUM
VCTK
VCTK_092
YESNO

All the datasets have a similar API. Some of them (VCTK and YESNO on this page) also accept two optional arguments, transform and target_transform, which transform the input and the target respectively.
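As a rough illustration of how transform and target_transform hooks are typically applied inside __getitem__, here is a minimal sketch. ToyDataset and halve are hypothetical names invented for this example, plain lists stand in for waveform tensors, and none of this is torchaudio's actual implementation.

```python
from typing import Any, Callable, List, Optional, Tuple


class ToyDataset:
    """Hypothetical in-memory dataset; NOT a torchaudio class."""

    def __init__(
        self,
        samples: List[Tuple[List[float], int, str]],
        transform: Optional[Callable[[Any], Any]] = None,
        target_transform: Optional[Callable[[Any], Any]] = None,
    ) -> None:
        self._samples = samples
        self.transform = transform
        self.target_transform = target_transform

    def __getitem__(self, n: int) -> Tuple[Any, int, Any]:
        waveform, sample_rate, target = self._samples[n]
        # Apply the optional hooks only when the caller supplied them.
        if self.transform is not None:
            waveform = self.transform(waveform)
        if self.target_transform is not None:
            target = self.target_transform(target)
        return waveform, sample_rate, target

    def __len__(self) -> int:
        return len(self._samples)


def halve(waveform: List[float]) -> List[float]:
    """Example input transform: scale every sample by 0.5."""
    return [0.5 * x for x in waveform]


# Made-up samples: (waveform, sample_rate, target).
toy = ToyDataset(
    [([0.2, -0.4], 8000, "YES"), ([1.0], 8000, "NO")],
    transform=halve,
    target_transform=str.lower,
)
```

Indexing toy[0] returns the halved waveform together with the lower-cased target, while the stored samples stay unchanged.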

CMUARCTIC

class torchaudio.datasets.CMUARCTIC(root: str, url: str = 'aew', folder_in_archive: str = 'ARCTIC', download: bool = False)

Create a Dataset for CMU_ARCTIC.

Parameters

root (str) – Path to the directory where the dataset is found or downloaded.
url (str, optional) – The URL to download the dataset from, or the type of the dataset to download. (default: "aew") Allowed type values are "aew", "ahw", "aup", "awb", "axb", "bdl", "clb", "eey", "fem", "gka", "jmk", "ksp", "ljm", "lnh", "rms", "rxr", "slp" or "slt".
folder_in_archive (str, optional) – The top-level directory of the dataset. (default: "ARCTIC")
download (bool, optional) – Whether to download the dataset if it is not found at root path. (default: False).

__getitem__(n: int) → Tuple[torch.Tensor, int, str, str]

Load the n-th sample from the dataset.

Parameters: n (int) – The index of the sample to be loaded
Returns: (waveform, sample_rate, utterance, utterance_id)
Return type: tuple

COMMONVOICE

class torchaudio.datasets.COMMONVOICE(root: str, tsv: str = 'train.tsv', url: str = 'english', folder_in_archive: str = 'CommonVoice', version: str = 'cv-corpus-4-2019-12-10', download: bool = False)

Create a Dataset for CommonVoice.

Parameters

root (str) – Path to the directory where the dataset is found or downloaded.
tsv (str, optional) – The name of the tsv file used to construct the metadata. (default: "train.tsv")
url (str, optional) – The URL to download the dataset from, or the language of the dataset to download. (default: "english"). Allowed language values are "tatar", "english", "german", "french", "welsh", "breton", "chuvash", "turkish", "kyrgyz", "irish", "kabyle", "catalan", "taiwanese", "slovenian", "italian", "dutch", "hakha chin", "esperanto", "estonian", "persian", "portuguese", "basque", "spanish", "chinese", "mongolian", "sakha", "dhivehi", "kinyarwanda", "swedish", "russian", "indonesian", "arabic", "tamil", "interlingua", "latvian", "japanese", "votic", "abkhaz", "cantonese" and "romansh sursilvan".
folder_in_archive (str, optional) – The top-level directory of the dataset.
version (str) – Version string. (default: "cv-corpus-4-2019-12-10") For other allowed values, please check out https://commonvoice.mozilla.org/en/datasets.
download (bool, optional) – Whether to download the dataset if it is not found at root path. (default: False).

__getitem__(n: int) → Tuple[torch.Tensor, int, Dict[str, str]]

Load the n-th sample from the dataset.

Parameters: n (int) – The index of the sample to be loaded
Returns: (waveform, sample_rate, dictionary), where dictionary is built from the TSV file with the following keys: client_id, path, sentence, up_votes, down_votes, age, gender and accent.
Return type: tuple
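To make the shape of that dictionary concrete, the following sketch parses a small tab-separated snippet into one dictionary per row using the keys listed above. The rows are made-up values rather than real Common Voice data, read_metadata is a hypothetical helper, and this is not necessarily how torchaudio reads the file internally.

```python
import csv
import io
from typing import Dict, List

# Made-up TSV content using the keys listed above; not real Common Voice data.
EXAMPLE_TSV = (
    "client_id\tpath\tsentence\tup_votes\tdown_votes\tage\tgender\taccent\n"
    "abc123\tclip_0001.mp3\tHello world\t2\t0\ttwenties\tfemale\tus\n"
    "def456\tclip_0002.mp3\tGood morning\t3\t1\t\t\t\n"
)


def read_metadata(tsv_text: str) -> List[Dict[str, str]]:
    """Parse tab-separated text into one dictionary per row, keyed by header."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


rows = read_metadata(EXAMPLE_TSV)
```

Every value stays a string, which matches the Dict[str, str] annotation in the return type; missing fields such as age come back as empty strings.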

GTZAN

class torchaudio.datasets.GTZAN(root: str, url: str = 'http://opihi.cs.uvic.ca/sound/genres.tar.gz', folder_in_archive: str = 'genres', download: bool = False, subset: Optional[str] = None)

Create a Dataset for GTZAN.

Note

Please see http://marsyas.info/downloads/datasets.html if you are planning to use this dataset to publish results.

Parameters

root (str) – Path to the directory where the dataset is found or downloaded.
url (str, optional) – The URL to download the dataset from. (default: "http://opihi.cs.uvic.ca/sound/genres.tar.gz")
folder_in_archive (str, optional) – The top-level directory of the dataset.
download (bool, optional) – Whether to download the dataset if it is not found at root path. (default: False).
subset (str, optional) – Which subset of the dataset to use. One of "training", "validation", "testing" or None. If None, the entire dataset is used. (default: None).

__getitem__(n: int) → Tuple[torch.Tensor, int, str]

Load the n-th sample from the dataset.

Parameters: n (int) – The index of the sample to be loaded
Returns: (waveform, sample_rate, label)
Return type: tuple

LIBRISPEECH

class torchaudio.datasets.LIBRISPEECH(root: str, url: str = 'train-clean-100', folder_in_archive: str = 'LibriSpeech', download: bool = False)

Create a Dataset for LibriSpeech.

Parameters

root (str) – Path to the directory where the dataset is found or downloaded.
url (str, optional) – The URL to download the dataset from, or the type of the dataset to download. Allowed type values are "dev-clean", "dev-other", "test-clean", "test-other", "train-clean-100", "train-clean-360" and "train-other-500". (default: "train-clean-100")
folder_in_archive (str, optional) – The top-level directory of the dataset. (default: "LibriSpeech")
download (bool, optional) – Whether to download the dataset if it is not found at root path. (default: False).

__getitem__(n: int) → Tuple[torch.Tensor, int, str, int, int, int]

Load the n-th sample from the dataset.

Parameters: n (int) – The index of the sample to be loaded
Returns: (waveform, sample_rate, utterance, speaker_id, chapter_id, utterance_id)
Return type: tuple
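A six-element tuple is easy to unpack in the wrong order, so one option is to wrap it in a named tuple. The sketch below uses a hypothetical LibriSpeechSample class and made-up values, with a nested list standing in for the waveform tensor. The file_id helper assumes the usual LibriSpeech "speaker-chapter-utterance" naming with a four-digit utterance number.

```python
from typing import Any, NamedTuple, Tuple


class LibriSpeechSample(NamedTuple):
    """Hypothetical named view of the 6-tuple; not part of torchaudio."""

    waveform: Any
    sample_rate: int
    utterance: str
    speaker_id: int
    chapter_id: int
    utterance_id: int


def file_id(sample: LibriSpeechSample) -> str:
    # Assumption: IDs follow "speaker-chapter-utterance" with the utterance
    # number zero-padded to four digits.
    return f"{sample.speaker_id}-{sample.chapter_id}-{sample.utterance_id:04d}"


# Stand-in tuple with made-up values; a real waveform would be a torch.Tensor.
raw: Tuple[Any, int, str, int, int, int] = (
    [[0.0, 0.1]], 16000, "HELLO WORLD", 84, 121123, 0,
)
sample = LibriSpeechSample(*raw)
```

With a real dataset the same wrapper would be applied as LibriSpeechSample(*dataset[n]), after which fields are available by name, for example sample.speaker_id.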

LIBRITTS

class torchaudio.datasets.LIBRITTS(root: str, url: str = 'train-clean-100', folder_in_archive: str = 'LibriTTS', download: bool = False)

Create a Dataset for LibriTTS.

Parameters

root (str) – Path to the directory where the dataset is found or downloaded.
url (str, optional) – The URL to download the dataset from, or the type of the dataset to download. Allowed type values are "dev-clean", "dev-other", "test-clean", "test-other", "train-clean-100", "train-clean-360" and "train-other-500". (default: "train-clean-100")
folder_in_archive (str, optional) – The top-level directory of the dataset. (default: "LibriTTS")
download (bool, optional) – Whether to download the dataset if it is not found at root path. (default: False).

__getitem__(n: int) → Tuple[torch.Tensor, int, str, str, int, int, str]

Load the n-th sample from the dataset.

Parameters: n (int) – The index of the sample to be loaded
Returns: (waveform, sample_rate, original_text, normalized_text, speaker_id, chapter_id, utterance_id)
Return type: tuple

LJSPEECH

class torchaudio.datasets.LJSPEECH(root: str, url: str = 'https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2', folder_in_archive: str = 'wavs', download: bool = False)

Create a Dataset for LJSpeech-1.1.

Parameters

root (str) – Path to the directory where the dataset is found or downloaded.
url (str, optional) – The URL to download the dataset from. (default: "https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2")
folder_in_archive (str, optional) – The top-level directory of the dataset. (default: "wavs")
download (bool, optional) – Whether to download the dataset if it is not found at root path. (default: False).

__getitem__(n: int) → Tuple[torch.Tensor, int, str, str]

Load the n-th sample from the dataset.

Parameters: n (int) – The index of the sample to be loaded
Returns: (waveform, sample_rate, transcript, normalized_transcript)
Return type: tuple

SPEECHCOMMANDS

class torchaudio.datasets.SPEECHCOMMANDS(root: str, url: str = 'speech_commands_v0.02', folder_in_archive: str = 'SpeechCommands', download: bool = False)

Create a Dataset for Speech Commands.

Parameters

root (str) – Path to the directory where the dataset is found or downloaded.
url (str, optional) – The URL to download the dataset from, or the type of the dataset to download. Allowed type values are "speech_commands_v0.01" and "speech_commands_v0.02". (default: "speech_commands_v0.02")
folder_in_archive (str, optional) – The top-level directory of the dataset. (default: "SpeechCommands")
download (bool, optional) – Whether to download the dataset if it is not found at root path. (default: False).

__getitem__(n: int) → Tuple[torch.Tensor, int, str, str, int]

Load the n-th sample from the dataset.

Parameters: n (int) – The index of the sample to be loaded
Returns: (waveform, sample_rate, label, speaker_id, utterance_number)
Return type: tuple
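Because the label comes back as a string, a classifier usually needs a mapping from label to integer index. The sketch below builds one from tuples shaped like the return value above. The sample values are made up, None stands in for the waveform, and build_label_index and count_per_label are hypothetical helpers rather than torchaudio functions.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple

# (waveform, sample_rate, label, speaker_id, utterance_number)
Sample = Tuple[Any, int, str, str, int]


def build_label_index(samples: Iterable[Sample]) -> Dict[str, int]:
    """Map each distinct label to an integer, in sorted label order."""
    labels = sorted({label for _, _, label, _, _ in samples})
    return {label: index for index, label in enumerate(labels)}


def count_per_label(samples: Iterable[Sample]) -> Counter:
    """Count how many samples carry each label."""
    return Counter(label for _, _, label, _, _ in samples)


# Made-up stand-ins; None replaces the waveform tensor.
fake_samples: List[Sample] = [
    (None, 16000, "yes", "0a7c2a8d", 0),
    (None, 16000, "no", "0a7c2a8d", 1),
    (None, 16000, "yes", "1b4c9b89", 0),
]
label_to_index = build_label_index(fake_samples)
label_counts = count_per_label(fake_samples)
```

Sorting the labels before enumerating keeps the mapping stable across runs, which matters when a trained model is reloaded later.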

TEDLIUM

class torchaudio.datasets.TEDLIUM(root: str, release: str = 'release1', subset: str = None, download: bool = False, audio_ext='.sph')

Create a Dataset for Tedlium. It supports releases 1, 2 and 3.

Parameters

root (str) – Path to the directory where the dataset is found or downloaded.
release (str, optional) – Release version. Allowed values are "release1", "release2" or "release3". (default: "release1").
subset (str, optional) – The subset of the dataset to use. Valid options are "train", "dev" and "test" for releases 1 and 2, and None for release 3. Defaults to "train" or None.
download (bool, optional) – Whether to download the dataset if it is not found at root path. (default: False).

__getitem__(n: int) → Tuple[torch.Tensor, int, str, int, int, int]

Load the n-th sample from the dataset.

Parameters: n (int) – The index of the sample to be loaded
Returns: (waveform, sample_rate, transcript, talk_id, speaker_id, identifier)
Return type: tuple

property phoneme_dict

Phonemes. Mapping from word to tuple of phonemes. Note that some words have empty phonemes.

Type: dict[str, tuple[str]]
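The note about empty phonemes matters when converting a transcript to a phoneme sequence. The following sketch uses a small made-up dictionary in place of the real phoneme_dict and collects words that are missing or map to an empty tuple instead of silently dropping them. The helper name transcript_to_phonemes and the phoneme symbols are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# Made-up stand-in for dataset.phoneme_dict; "uh" maps to empty phonemes.
toy_phoneme_dict: Dict[str, Tuple[str, ...]] = {
    "hello": ("HH", "AH", "L", "OW"),
    "world": ("W", "ER", "L", "D"),
    "uh": (),
}


def transcript_to_phonemes(
    transcript: str, phoneme_dict: Dict[str, Tuple[str, ...]]
) -> Tuple[List[str], List[str]]:
    """Return (phonemes, skipped_words) for a whitespace-separated transcript."""
    phonemes: List[str] = []
    skipped: List[str] = []
    for word in transcript.split():
        entry = phoneme_dict.get(word, ())
        if entry:
            phonemes.extend(entry)
        else:
            # Either the word is unknown or its phoneme tuple is empty.
            skipped.append(word)
    return phonemes, skipped


phonemes, skipped = transcript_to_phonemes("hello uh world foo", toy_phoneme_dict)
```

Returning the skipped words alongside the phonemes makes it easy to decide later whether such utterances should be filtered out.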

VCTK

class torchaudio.datasets.VCTK(root: str, url: str = 'https://datashare.is.ed.ac.uk/bitstream/handle/10283/3443/VCTK-Corpus-0.92.zip', folder_in_archive: str = 'VCTK-Corpus', download: bool = False, downsample: bool = False, transform: Any = None, target_transform: Any = None)

Create a Dataset for VCTK.

Note

This dataset is no longer publicly available. Please use VCTK_092 instead.
Directory p315 is ignored because there are no corresponding text files. For more information about the dataset, visit: https://datashare.is.ed.ac.uk/handle/10283/3443

Parameters

root (str) – Path to the directory where the dataset is found or downloaded.
url (str, optional) – Not used as the dataset is no longer publicly available.
folder_in_archive (str, optional) – The top-level directory of the dataset. (default: "VCTK-Corpus")
download (bool, optional) – Whether to download the dataset if it is not found at root path. (default: False). Passing download=True results in an error because the dataset is no longer publicly available.
downsample (bool, optional) – Not used.
transform (callable, optional) – Optional transform applied on waveform. (default: None)
target_transform (callable, optional) – Optional transform applied on utterance. (default: None)

__getitem__(n: int) → Tuple[torch.Tensor, int, str, str, str]

Load the n-th sample from the dataset.

Parameters: n (int) – The index of the sample to be loaded
Returns: (waveform, sample_rate, utterance, speaker_id, utterance_id)
Return type: tuple

VCTK_092

class torchaudio.datasets.VCTK_092(root: str, mic_id: str = 'mic2', download: bool = False, url: str = 'https://datashare.is.ed.ac.uk/bitstream/handle/10283/3443/VCTK-Corpus-0.92.zip', audio_ext='.flac')

Create a Dataset for VCTK 0.92.

Parameters

root (str) – Root directory where the dataset’s top level directory is found.
mic_id (str) – Microphone ID. Either "mic1" or "mic2". (default: "mic2")
download (bool, optional) – Whether to download the dataset if it is not found at root path. (default: False).
url (str, optional) – The URL to download the dataset from. (default: "https://datashare.is.ed.ac.uk/bitstream/handle/10283/3443/VCTK-Corpus-0.92.zip")
audio_ext (str, optional) – Custom audio extension if dataset is converted to non-default audio format.

Note

All the speeches from speaker p315 will be skipped due to the lack of the corresponding text files.
All the speeches from p280 will be skipped for mic_id="mic2" due to the lack of the audio files.
Some of the speeches from speaker p362 will be skipped due to the lack of the audio files.
See Also: https://datashare.is.ed.ac.uk/handle/10283/3443

__getitem__(n: int) → Tuple[torch.Tensor, int, str, str, str]

Load the n-th sample from the dataset.

Parameters: n (int) – The index of the sample to be loaded
Returns: (waveform, sample_rate, utterance, speaker_id, utterance_id)
Return type: tuple

YESNO

class torchaudio.datasets.YESNO(root: str, url: str = 'http://www.openslr.org/resources/1/waves_yesno.tar.gz', folder_in_archive: str = 'waves_yesno', download: bool = False, transform: Any = None, target_transform: Any = None)

Create a Dataset for YesNo.

Parameters

root (str) – Path to the directory where the dataset is found or downloaded.
url (str, optional) – The URL to download the dataset from. (default: "http://www.openslr.org/resources/1/waves_yesno.tar.gz")
folder_in_archive (str, optional) – The top-level directory of the dataset. (default: "waves_yesno")
download (bool, optional) – Whether to download the dataset if it is not found at root path. (default: False).
transform (callable, optional) – Optional transform applied on waveform. (default: None)
target_transform (callable, optional) – Optional transform applied on utterance. (default: None)

__getitem__(n: int) → Tuple[torch.Tensor, int, List[int]]

Load the n-th sample from the dataset.

Parameters: n (int) – The index of the sample to be loaded
Returns: (waveform, sample_rate, labels)
Return type: tuple
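The labels element is a list of integers. Assuming it follows the YesNo corpus convention, where each recording is named by an underscore-separated sequence of 0s and 1s with 1 meaning "yes" and 0 meaning "no", the list can be decoded as sketched below. The file ID and both helper functions are hypothetical and serve only as illustration.

```python
from typing import List


def labels_from_file_id(file_id: str) -> List[int]:
    """Split an underscore-separated ID such as '0_1_0_0_1_1_0_1' into ints."""
    return [int(token) for token in file_id.split("_")]


def labels_to_words(labels: List[int]) -> List[str]:
    """Spell out each label, assuming 1 means 'yes' and 0 means 'no'."""
    words = {0: "no", 1: "yes"}
    return [words[label] for label in labels]


# Hypothetical file ID used only for illustration.
example_labels = labels_from_file_id("0_1_0_0_1_1_0_1")
example_words = labels_to_words(example_labels)
```

In practice labels_to_words would be applied to the third element of a sample returned by the dataset, for example as a target_transform.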
