by Team PyTorch

We are introducing the beta release of TorchRec and a number of improvements to the current PyTorch domain libraries, alongside the PyTorch 1.11 release. These updates demonstrate our focus on developing common and extensible APIs across all domains to make it easier for our community to build ecosystem projects on PyTorch. Highlights include:

  • TorchRec, a PyTorch domain library for Recommendation Systems, is available in beta. View it on GitHub.
  • TorchAudio - Added Enformer- and RNN-T-based models and recipes to support the full development lifecycle of a streaming ASR model. See the release notes here.
  • TorchText - Added beta support for RoBERTa and XLM-R models, byte-level BPE tokenizer, and text datasets backed by TorchData. See the release notes here.
  • TorchVision - Added 4 new model families and 14 new classification datasets such as CLEVR, GTSRB, FER2013. See the release notes here.

TorchRec 0.1

We announced TorchRec a few weeks ago and we are excited to release the beta version today. To recap, TorchRec is a PyTorch domain library for Recommendation Systems. This new library provides common sparsity and parallelism primitives, enabling researchers to build state-of-the-art personalization models and deploy them in production. TorchRec was used to train a 1.25 trillion parameter model, pushed to production in January 2022.

In particular, the library includes:

  • Modeling primitives, such as embedding bags and jagged tensors, that enable easy authoring of large, performant multi-device/multi-node models using hybrid data-parallelism and model-parallelism.
  • Optimized RecSys kernels powered by FBGEMM, including support for sparse and quantized operations.
  • A sharder which can partition embedding tables with a variety of different strategies including data-parallel, table-wise, row-wise, table-wise-row-wise, and column-wise sharding.
  • A planner which can automatically generate optimized sharding plans for models.
  • Pipelining to overlap dataloading device transfer (copy to GPU), inter-device communications (input_dist), and computation (forward, backward) for increased performance.
  • GPU inference support.
  • Common modules for RecSys, such as models and public datasets (Criteo & Movielens).

Please check the TorchRec announcement post here, video tutorial, install instructions here, test drive the feature through this tutorial here, and refer to the reference document here.

TorchAudio 0.11

TorchAudio: Building Blocks for Audio and Speech Processing

We published a paper, TorchAudio: Building Blocks for Audio and Speech Processing, describing the overview of the TorchAudio library. If you find TorchAudio useful for your research, please help us share with the community by citing our paper.

(Beta) RNN-T & (Prototype) Emformer Models and Recipes

Emformer is an efficient memory-transformer-based streaming acoustic model that has demonstrated state-of-the-art streaming automatic speech recognition (ASR) performance in low-latency, resource-constrained scenarios, such as on-device applications (citation:

The TorchAudio v0.11 release includes the following beta features:

  • Implementation of Emformer (docs)
  • Recurrent neural network transducer (RNN-T) streaming ASR model that uses Emformer for its transcription network (docs)
  • RNN-T beam search decoder with TorchScript support (docs)
  • LibriSpeech Emformer RNN-T training recipe (GitHub) and corresponding pre-trained streaming ASR inference pipeline (docs)

Also there are prototype features that are available from nightly builds or the main branch.

  • Training recipes trained on MuST-C and TED-LIUM3 datasets. (GitHub)
  • Pre-trained pipelines corresponding to the recipes. (docs)
  • Tutorial that steps through performing online speech recognition with RNN-T Emformer model. (docs)

Collectively, these features cover the full development lifecycle of a streaming ASR model, from definition through training and inference, and enable users to easily develop their own Emformer- and RNN-T-based models.

Special thanks to Yangyang Shi, Jay Mahadeokar, and Gil Keren for their code contributions and guidance.

(Beta) HuBERT Pretrain Model

The masked prediction training of HuBERT model requires the masked logits, unmasked logits, and feature norm as the outputs. The logits are for cross-entropy losses and the feature norm is for penalty loss. The release adds HuBERTPretrainModel and corresponding factory functions (hubert_pretrain_base, hubert_pretrain_large, and hubert_pretrain_xlarge) to enable training from scratch.

(Prototype) CTC Beam Search Decoder

In recent releases, TorchAudio has added support for ASR models fine-tuned on CTC loss. The addition of an inference time CTC beam search decoder enables running end-to-end ASR evaluation using TorchAudio utils.

The CTC decoder in TorchAudio supports customizable beam search decoding with lexicon constraint. It also has optional KenLM language model support.

For more details, please check out the API tutorial. This prototype feature is available through nightly builds.

(Prototype) Streaming API

TorchAudio started as simple audio I/O APIs that supplement PyTorch. With the recent addition of ASR models and training recipes, the project has received requests to support high-level application development.

Streaming API makes it easy to develop and test the model in online inference. It utilizes ffmpeg under the hood, and enables reading media from online services and hardware devices, decoding media in an incremental manner, and applying filters and preprocessing.

Please checkout the API tutorial and the documentation. There are also the streaming ASR tutorial and the device streaming ASR tutorial. This feature is available from nightly releases. Please refer to for how to install nightly builds.

TorchText 0.12

(Beta) RoBERTa and XLM-R Models

TorchText has added support for pre-trained RoBERTa and XLM-R models. It would allow users to train end-2-end Transformer Encoder based models on standard NLP tasks using TorchText.

More specifically:

  • The models are torchscriptable and hence can be employed for production use-cases.
  • The model APIs let users to easily attach custom task-specific heads with pre-trained encoders.
  • The API also comes equipped with data pre-processing transforms to match the pre-trained weights and model configuration.

We have added a tutorial to demonstrate SST-2 binary text classification task with pre-trained XLM-R base architecture.

For additional details on model APIs and usage examples, please refer to the documentation.

(Beta) byte-level BPE tokenizer

TorchText has added support for a Byte-Level BPE tokenizer, as used in GPT-2. This tokenizer is also used for tokenizing inputs to the pre-trained RoBERTa models described previously. In addition to the RoBERTa vocab, users can also load their own custom BPE vocab to use the tokenizer. Furthermore, the tokenizer is fully torchscriptable and hence can be employed for production use-cases. For additional details on model APIs and usage examples, please refer to the documentation.

(Beta) Text datasets backed by TorchData

TorchText has modernized its datasets by migrating from older-style Iterable Datasets to TorchData’s DataPipes. TorchData is a library that provides modular/composable primitives, allowing users to load and transform data in performant data pipelines.

These DataPipes work out-of-the-box with PyTorch DataLoader and would enable new functionalities like auto-sharding. Users can now easily do data manipulation and pre-processing using user-defined functions and transformations in a functional style programming. Datasets backed by DataPipes also enable standard flow-control like batching, collation, shuffling and bucketizing.

Collectively, DataPipes provides a comprehensive experience for data preprocessing and tensorization needs in a pythonic and flexible way for model training. We have added a tutorial to demonstrate data-processing pipelining using the modernized dataset for binary text-classification.

You can learn more about TorchData DataPipe APIs in its official documentation.

TorchVision 0.12

New Models

Four new model families have been released in the latest version along with pre-trained weights for their variants.

#1 Object Detection

FCOS is a popular, fully convolutional, anchor-free model for object detection. In this release we include a community-contributed model implementation as well as pre-trained weights. The model was trained on COCO train2017 and can be used as follows:

import torch
from torchvision import models

x = [torch.rand(3, 224, 224)]
fcos = models.detection.fcos_resnet50_fpn(pretrained=True).eval()
predictions =  fcos(x)

The box AP of the pre-trained model on COCO val2017 is 39.2 (see #4961 for more details).

We would like to thank Hu Ye and Zhiqiang Wang for contributing to the model implementation and initial training. This was the first community-contributed model in a long while, and given its success, we decided to use the learnings from this process and create a new model contribution guidelines.

#2 Optical Flow support and RAFT model

TorchVision now supports optical flow! Optical Flow models try to predict movement in a video: given two consecutive frames, the model predicts where each pixel of the first frame ends up in the second frame. Check out our new tutorial on Optical Flow!

We implemented a torchscript-compatible RAFT model with pre-trained weights (both normal and “small” versions), and added support for training and evaluating optical flow models. Our training scripts support distributed training across processes and nodes, leading to much faster training time than the original implementation. We also added 5 new optical flow datasets: Flying Chairs, Flying Things, Sintel, Kitti, and HD1K.

#3. Image Classification

Vision Transformer (ViT) and ConvNeXt are two popular architectures which can be used as image classifiers or as backbones for downstream vision tasks. In this release we include 8 pre-trained weights for their classification variants. The models were trained on ImageNet and can be used as follows:

import torch
from torchvision import models

x = torch.rand(1, 3, 224, 224)
vit = models.vit_b_16(pretrained=True).eval()
convnext = models.convnext_tiny(pretrained=True).eval()
predictions1 = vit(x)
predictions2 = convnext(x)

The accuracies of the pre-trained models obtained on ImageNet val are seen below:

Model Acc@1 Acc@5
vit_b_16 81.072 95.318
vit_b_32 75.912 92.466
vit_l_16 79.662 94.638
vit_l_32 76.972 93.07
convnext_tiny 82.52 96.146
convnext_small 83.616 96.65
convnext_base 84.062 96.87
convnext_large 84.414 96.976

The above models have been trained using an adjusted version of our new training recipe and this allows us to offer models with accuracies significantly higher than the ones on the original papers.

#4. GPU Video Decoding

In this release, we add support for GPU video decoding in the video reading API. To use hardware-accelerated decoding, we just need to pass a cuda device to the video reading API as shown below:

import torchvision

reader =, device="cuda:0")
for frame in reader:

We also support seeking to anyframe or a keyframe in the video before reading, as shown below:

New Datasets

We have implemented 14 new classification datasets: CLEVR, GTSRB, FER2013, SUN397, Country211, Flowers102, fvgc_aircraft, OxfordIIITPet, DTD, Food 101, Rendered SST2, Stanford cars, PCAM, and EuroSAT.

As part of our work on Optical Flow support (see above for more details), we also added 5 new optical flow datasets: Flying Chairs, Flying Things, Sintel, Kitti, and HD1K.

Other Updates

  • New documentation layout: Each function / class is now documented in a separate page, clearing up some space in the per-module pages, and easing the discovery of the proposed APIs. Compare e.g. our previous docs vs the new ones. Please let us know if you have any feedback!
  • New model contribution guidelines have been published following the success of the FCOS model which was contributed by the community. These guidelines aim to be an overview of the model contribution process for anyone who would like to suggest, implement and train a new model.
  • Upcoming Prototype API - We are currently working on a prototype API which adds Multi-weight support on all of our model builder methods. This will enable us to offer multiple pre-trained weights, associated with their meta-data and inference transforms. The API is still under review and thus was not included in the release but you can read more about it on our blogpost and provide your feedback on the dedicated Github issue.
  • Changes in our deprecation policy - Up until now, torchvision would almost never remove deprecated APIs. In order to be more aligned and consistent with pytorch core, we are updating our deprecation policy. We are now following a 2-release deprecation cycle: deprecated APIs will raise a warning for 2 versions, and will be removed after that. To reflect these changes and to smooth the transition, we have decided to:
    • Remove all APIs that had been deprecated before or on v0.8, released 1.5 years ago.
    • Update the removal timeline of all other deprecated APIs to v0.14, to reflect the new 2-cycle policy starting now in v0.12.

Captum 0.5

Captum is a PyTorch library for model interpretability. For this release, we expanded Captum with influential instances and added support for both similarity based influences and novel algorithms, TracIn and its variants. TracIn variants offer faster approximation of influence scores based on random projections for fully connected layers.

More specifically the new, influence, subsection of Captum includes:

  • SimilarityInfluence computes similarity scores between test and training examples using default (cosine or euclidean) or custom user definite metrics w.r.t. given input model layers.
  • TracInCP approximates the influential score of each training example on a given test example based on the dot-product similarity between loss gradients w.r.t. model parameters for test and training examples. Note that if we use training examples as test examples then we compute self influence. This method and its variants described below also return top-k proponents and opponents which are the top-k largest positive and negative influential examples respectively.
  • TracInCPFast is an approximation of TracInCP that avoids computing the gradients w.r.t. large parameter matrices. It approximates influence score based on the dot products between last fully connected layer activations and loss gradients w.r.t. that layer for training and test examples.
  • TracInCPFastRandProj uses a nearest neighbor approximation library such as annoy to compute the dot product between the training and test quantities. In order to reduce the dimensionality of layer activations and corresponding gradients this method, in addition, allows to project those vectors into a lower dimensional space using random projection matrices.

More about the implementation of influential instances can be found on our GitHub page and tutorials.

Thanks for reading, If you’re interested in these updates and want to join the PyTorch community, we encourage you to join the discussion forums and open GitHub issues. To get the latest news from PyTorch, follow us on Twitter, Medium, YouTube, and LinkedIn.


Team PyTorch