by Team PyTorch

Today, we are announcing a number of new features and improvements to PyTorch libraries, alongside the PyTorch 1.10 release. Some highlights include:

Some highlights include:

  • TorchX - a new SDK for quickly building and deploying ML applications from research & development to production.
  • TorchAudio - Added text-to-speech pipeline, self-supervised model support, multi-channel support and MVDR beamforming module, RNN transducer (RNNT) loss function, and batch and filterbank support to lfilter function. See the TorchAudio release notes here.
  • TorchVision - Added new RegNet and EfficientNet models, FX based feature extraction added to utilities, two new Automatic Augmentation techniques: Rand Augment and Trivial Augment, and updated training recipes. See the TorchVision release notes here.

Introducing TorchX

TorchX is a new SDK for quickly building and deploying ML applications from research & development to production. It offers various builtin components that encode MLOps best practices and make advanced features like distributed training and hyperparameter optimization accessible to all.

Users can get started with TorchX 0.1 with no added setup cost since it supports popular ML schedulers and pipeline orchestrators that are already widely adopted and deployed in production. No two production environments are the same. To comply with various use cases, TorchX’s core APIs allow tons of customization at well-defined extension points so that even the most unique applications can be serviced without customizing the whole vertical stack.

Read the documentation for more details and try out this feature using this quickstart tutorial.

TorchAudio 0.10

[Beta] Text-to-speech pipeline

TorchAudio now adds the Tacotron2 model and pretrained weights. It is now possible to build a text-to-speech pipeline with existing vocoder implementations like WaveRNN and Griffin-Lim. Building a TTS pipeline requires matching data processing and pretrained weights, which are often non-trivial to users. So TorchAudio introduces a bundle API so that constructing pipelines for specific pretrained weights is easy. The following example illustrates this.

>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
>>>
>>> # Build text processor, Tacotron2 and vocoder (WaveRNN) model
>>> processor = bundle.get_text_processor()
>>> tacotron2 = bundle.get_tacotron2()
Downloading:
100%|███████████████████████████████| 107M/107M [00:01<00:00, 87.9MB/s]
>>> vocoder = bundle.get_vocoder()
Downloading:
100%|███████████████████████████████| 16.7M/16.7M [00:00<00:00, 78.1MB/s]
>>>
>>> text = "Hello World!"
>>>
>>> # Encode text
>>> input, lengths = processor(text)
>>>
>>> # Generate (mel-scale) spectrogram
>>> specgram, lengths, _ = tacotron2.infer(input, lengths)
>>>
>>> # Convert spectrogram to waveform
>>> waveforms, lengths = vocoder(specgram, lengths)
>>>
>>> # Save audio
>>> torchaudio.save('hello-world.wav', waveforms, vocoder.sample_rate)

For the details of this API please refer to the documentation. You can also try this from the tutorial.

(Beta) Self-Supervised Model Support

TorchAudio added HuBERT model architecture and pre-trained weight support for wav2vec 2.0 and HuBERT. HuBERT and wav2vec 2.0 are novel ways for audio representation learning and they yield high accuracy when fine-tuned on downstream tasks. These models can serve as baseline in future research, therefore, TorchAudio is providing a simple way to run the model. Similar to the TTS pipeline, the pretrained weights and associated information, such as expected sample rates and output class labels (for fine-tuned weights) are put together as a bundle, so that they can be used to build pipelines. The following example illustrates this.

>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.HUBERT_ASR_LARGE
>>>
>>> # Build the model and load pretrained weight.
>>> model = bundle.get_model()
Downloading:
100%|███████████████████████████████| 1.18G/1.18G [00:17<00:00, 73.8MB/s]
>>> # Check the corresponding labels of the output.
>>> labels = bundle.get_labels()
>>> print(labels)
('<s>', '<pad>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I', 'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P', 'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z')
>>>
>>> # Infer the label probability distribution
>>> waveform, sample_rate = torchaudio.load(hello-world.wav')
>>>
>>> emissions, _ = model(waveform)
>>>
>>> # Pass emission to (hypothetical) decoder
>>> transcripts = ctc_decode(emissions, labels)
>>> print(transcripts[0])
HELLO WORLD

Please refer to the documentation for more details and try out this feature using this tutorial.

(Beta) Multi-channel support and MVDR beamforming

Far-field speech recognition is a more challenging task compared to near-field recognition. Multi-channel methods such as beamforming help reduce the noises and enhance the target speech.

TorchAudio now adds support for differentiable Minimum Variance Distortionless Response (MVDR) beamforming on multi-channel audio using Time-Frequency masks. Researchers can easily assemble it with any multi-channel ASR pipeline. There are three solutions (ref_channel, stv_evd, stv_power) and it supports single-channel and multi-channel (perform average in the method) masks. It provides an online option that recursively updates the parameters for streaming audio. We also provide a tutorial on how to apply MVDR beamforming to the multi-channel audio in the example directory.

>>> from torchaudio.transforms import MVDR, Spectrogram, InverseSpectrogram
>>>
>>> # Load the multi-channel noisy audio
>>> waveform_mix, sr = torchaudio.load('mix.wav')
>>> # Initialize the stft and istft modules
>>> stft = Spectrogram(n_fft=1024, hop_length=256, return_complex=True, power=None)
>>> istft = InverseSpectrogram(n_fft=1024, hop_length=256)
>>> # Get the noisy spectrogram
>>> specgram_mix = stft(waveform_mix)
>>> # Get the Time-Frequency mask via machine learning models
>>> mask = model(waveform)
>>> # Initialize the MVDR module 
>>> mvdr = MVDR(ref_channel=0, solution=ref_channel, multi_mask=False)
>>> # Apply MVDR beamforming
>>> specgram_enhanced = mvdr(specgram_mix, mask)
>>> # Get the enhanced waveform via iSTFT
>>> waveform_enhanced = istft(specgram_enhanced, length=waveform.shape[-1])

Please refer to the documentation for more details and try out this feature using the MVDR tutorial.

(Beta) RNN Transducer Loss

The RNN transducer (RNNT) loss is part of the RNN transducer pipeline, which is a popular architecture for speech recognition tasks. Recently it has gotten attention for being used in a streaming setting, and has also achieved state-of-the-art WER for the LibriSpeech benchmark.

TorchAudio’s loss function supports float16 and float32 logits, has autograd and torchscript support, and can be run on both CPU and GPU, which has a custom CUDA kernel implementation for improved performance. The implementation is consistent with the original loss function in Sequence Transduction with Recurrent Neural Networks, but relies on code from Alignment Restricted Streaming Recurrent Neural Network Transducer. Special thanks to Jay Mahadeokar and Ching-Feng Yeh for their code contributions and guidance.

Please refer to the documentation for more details.

(Beta) Batch support and filter bank support

torchaudio.functional.lfilter now supports batch processing and multiple filters.

(Prototype) Emformer Module

Automatic speech recognition (ASR) research and productization have increasingly focused on on-device applications. Towards supporting such efforts, TorchAudio now includes Emformer, a memory-efficient transformer architecture that has achieved state-of-the-art results on LibriSpeech in low-latency streaming scenarios, as a prototype feature.

Please refer to the documentation for more details.

GPU Build

GPU builds that support custom CUDA kernels in TorchAudio, like the one being used for RNN transducer loss, have been added. Following this change, TorchAudio’s binary distribution now includes CPU-only versions and CUDA-enabled versions. To use CUDA-enabled binaries, PyTorch also needs to be compatible with CUDA.

TorchVision 0.11

(Stable) New Models

RegNet and EfficientNet are two popular architectures that can be scaled to different computational budgets. In this release we include 22 pre-trained weights for their classification variants. The models were trained on ImageNet and the accuracies of the pre-trained models obtained on ImageNet val can be found below (see #4403, #4530 and #4293 for more details).

The models can be used as follows:

import torch
from torchvision import models

x = torch.rand(1, 3, 224, 224)

regnet = models.regnet_y_400mf(pretrained=True)
regnet.eval()
predictions = regnet(x)

efficientnet = models.efficientnet_b0(pretrained=True)
efficientnet.eval()
predictions = efficientnet(x)

See the full list of new models on the torchvision.models documentation page.

We would like to thank Ross Wightman and Luke Melas-Kyriazi for contributing the weights of the EfficientNet variants.

(Beta) FX-based Feature Extraction

A new Feature Extraction method has been added to our utilities. It uses torch.fx and enables us to retrieve the outputs of intermediate layers of a network which is useful for feature extraction and visualization.

Here is an example of how to use the new utility:

import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor


x = torch.rand(1, 3, 224, 224)

model = resnet50()

return_nodes = {
"layer4.2.relu_2": "layer4"
}
model2 = create_feature_extractor(model, return_nodes=return_nodes)
intermediate_outputs = model2(x)

print(intermediate_outputs['layer4'].shape)

We would like to thank Alexander Soare for developing this utility.

(Stable) New Data Augmentations

Two new Automatic Augmentation techniques were added: RandAugment and Trivial Augment. They apply a series of transformations on the original data to enhance them and to boost the performance of the models. The new techniques build on top of the previously added AutoAugment and focus on simplifying the approach, reducing the search space for the optimal policy and improving the performance gain in terms of accuracy. These techniques enable users to reproduce recipes to achieve state-of-the-art performance on the offered models. Additionally, it enables users to apply these techniques in order to do transfer learning and achieve optimal accuracy on new datasets.

Both methods can be used as drop-in replacement of the AutoAugment technique as seen below:

from torchvision import transforms

t = transforms.RandAugment()
# t = transforms.TrivialAugmentWide()
transformed = t(image)

transform = transforms.Compose([
transforms.Resize(256),
transforms.RandAugment(), # transforms.TrivialAugmentWide()
transforms.ToTensor()])

Read the automatic augmentation transforms for more details.

We would like to thank Samuel G. Müller for contributing to Trivial Augment and for his help on refactoring the AA package.

Updated Training Recipes

We have updated our training reference scripts to add support for Exponential Moving Average, Label Smoothing, Learning-Rate Warmup, Mixup, Cutmix and other SOTA primitives. The above enabled us to improve the classification Acc@1 of some pre-trained models by over 4 points. A major update of the existing pre-trained weights is expected in the next release.

Thanks for reading. If you’re interested in these updates and want to join the PyTorch community, we encourage you to join the discussion forums and open GitHub issues. To get the latest news from PyTorch, follow us on Twitter, Medium, YouTube and LinkedIn.

Cheers! Team PyTorch