wav2vec 2.0 model (“base” architecture), pre-trained on 10k hours of unlabeled audio from VoxPopuli dataset [Wang et al., 2021] (“10k” subset, consisting of 23 languages), and fine-tuned for ASR on 166 hours of transcribed audio from “es” subset.
Originally published by the authors of VoxPopuli [Wang et al., 2021] under CC BY-NC 4.0 and redistributed with the same license. [License, Source]
Please refer to
torchaudio.pipelines.Wav2Vec2ASRBundlefor the usage.