Today, we’re announcing the availability of PyTorch 1.6, along with updated domain libraries. We are also excited to announce the team at Microsoft is now maintaining Windows builds and binaries and will also be supporting the community on GitHub as well as the PyTorch Windows discussion forums.
The PyTorch 1.6 release includes a number of new APIs, tools for performance improvement and profiling, as well as major updates to both distributed data parallel (DDP) and remote procedure call (RPC) based distributed training. A few of the highlights include:
- Automatic mixed precision (AMP) training is now natively supported and a stable feature (see here for more details) - thanks to NVIDIA’s contributions;
- Native TensorPipe support now added for tensor-aware, point-to-point communication primitives built specifically for machine learning;
- Added support for complex tensors to the frontend API surface;
- New profiling tools providing tensor-level memory consumption information;
- Numerous improvements and new features for both distributed data parallel (DDP) training and the remote procedure call (RPC) packages.
Additionally, from this release onward, features will be classified as Stable, Beta and Prototype. Prototype features are not included as part of the binary distribution and are instead available through either building from source, using nightlies or via compiler flag. You can learn more about what this change means in the post here. You can also find the full release notes here.
Performance & Profiling
[Stable] Automatic Mixed Precision (AMP) Training
AMP allows users to easily enable mixed precision training, delivering higher performance and memory savings of up to 50% on Tensor Core GPUs. Using the natively supported `torch.cuda.amp` API, AMP provides convenience methods for mixed precision, where some operations use the `torch.float32` (float) datatype and other operations use `torch.float16` (half). Some ops, like linear layers and convolutions, are much faster in `float16`. Other ops, like reductions, often require the dynamic range of `float32`. Mixed precision tries to match each op to its appropriate datatype.
[Beta] Fork/Join Parallelism
This release adds support for a language-level construct as well as runtime support for coarse-grained parallelism in TorchScript code. This support is useful for situations such as running models in an ensemble in parallel or running bidirectional components of recurrent nets in parallel, and it unlocks the computational power of parallel architectures (e.g., many-core CPUs) for task-level parallelism.
Parallel execution of TorchScript programs is enabled through two primitives: `torch.jit.fork` and `torch.jit.wait`. In the below example, we parallelize execution of `foo`:

```python
import torch
from typing import List

def foo(x):
    return torch.neg(x)

@torch.jit.script
def example(x):
    futures = [torch.jit.fork(foo, x) for _ in range(100)]
    results = [torch.jit.wait(future) for future in futures]
    return torch.sum(torch.stack(results))

print(example(torch.ones([])))
```
- Documentation (Link)
[Beta] Memory Profiler
The `torch.autograd.profiler` API now includes a memory profiler that lets you inspect the tensor memory cost of different operators inside your CPU and GPU models.
Here is an example usage of the API:
```python
import torch
import torchvision.models as models
import torch.autograd.profiler as profiler

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)

with profiler.profile(profile_memory=True, record_shapes=True) as prof:
    model(inputs)

# NOTE: some columns were removed for brevity
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
# ---------------------------  ---------------  ---------------  ---------------
# Name                         CPU Mem          Self CPU Mem     Number of Calls
# ---------------------------  ---------------  ---------------  ---------------
# empty                        94.79 Mb         94.79 Mb         123
# resize_                      11.48 Mb         11.48 Mb         2
# addmm                        19.53 Kb         19.53 Kb         1
# empty_strided                4 b              4 b              1
# conv2d                       47.37 Mb         0 b              20
# ---------------------------  ---------------  ---------------  ---------------
```
Distributed Training & RPC
[Beta] TensorPipe backend for RPC
PyTorch 1.6 introduces a new backend for the RPC module which leverages the TensorPipe library, a tensor-aware point-to-point communication primitive targeted at machine learning, intended to complement the current primitives for distributed training in PyTorch (Gloo, MPI, …) which are collective and blocking. The pairwise and asynchronous nature of TensorPipe lends itself to new networking paradigms that go beyond data parallel: client-server approaches (e.g., parameter server for embeddings, actor-learner separation in Impala-style RL, …) and model and pipeline parallel training (think GPipe), gossip SGD, etc.
```python
# One-line change needed to opt in
torch.distributed.rpc.init_rpc(
    ...
    backend=torch.distributed.rpc.BackendType.TENSORPIPE,
)

# No changes to the rest of the RPC API
torch.distributed.rpc.rpc_sync(...)
```
[Beta] DDP+RPC
PyTorch Distributed supports two powerful paradigms: DDP for full sync data parallel training of models and the RPC framework which allows for distributed model parallelism. Previously, these two features worked independently and users couldn’t mix and match these to try out hybrid parallelism paradigms.
Starting in PyTorch 1.6, we’ve enabled DDP and RPC to work together seamlessly so that users can combine these two techniques to achieve both data parallelism and model parallelism. An example is where users would like to place large embedding tables on parameter servers and use the RPC framework for embedding lookups, but store smaller dense parameters on trainers and use DDP to synchronize the dense parameters. Below is a simple code snippet.
```python
# On each trainer
remote_emb = create_emb(on="ps", ...)
ddp_model = DDP(dense_model)

for data in batch:
    with torch.distributed.autograd.context():
        res = remote_emb(data)
        loss = ddp_model(res)
        torch.distributed.autograd.backward([loss])
```
[Beta] RPC - Asynchronous User Functions
RPC asynchronous user functions support the ability to yield and resume on the server side when executing a user-defined function. Prior to this feature, when a callee processed a request, one RPC thread waited until the user function returned. If the user function contained IO (e.g., a nested RPC) or signaling (e.g., waiting for another request to unblock), the corresponding RPC thread would sit idle waiting for these events. As a result, some applications had to use a very large number of threads and send additional RPC requests, which can potentially lead to performance degradation. To make a user function yield on such events, applications need to: 1) decorate the function with the `@rpc.functions.async_execution` decorator; and 2) let the function return a `torch.futures.Future` and install the resume logic as callbacks on the `Future` object. See below for an example:
```python
@rpc.functions.async_execution
def async_add_chained(to, x, y, z):
    return rpc.rpc_async(to, torch.add, args=(x, y)).then(
        lambda fut: fut.wait() + z
    )

ret = rpc.rpc_sync(
    "worker1",
    async_add_chained,
    args=("worker2", torch.ones(2), 1, 1)
)
print(ret)  # prints tensor([3., 3.])
```
- Tutorial for performant batch RPC using Asynchronous User Functions (Link)
- Documentation (Link)
- Usage examples (Link)
Frontend API Updates
[Beta] Complex Numbers
The PyTorch 1.6 release brings beta-level support for complex tensors, including the `torch.complex64` and `torch.complex128` dtypes. A complex number is a number that can be expressed in the form a + bj, where a and b are real numbers and j is a solution of the equation x^2 = −1. Complex numbers frequently occur in mathematics and engineering, especially in signal processing, and complex neural networks are an active area of research. The beta release of complex tensors will support common PyTorch and complex tensor functionality, plus functions needed by Torchaudio, ESPnet and others. While this is an early version of the feature and we expect it to improve over time, the overall goal is to provide a NumPy-compatible user experience that leverages PyTorch’s ability to run on accelerators and work with autograd, to better support the scientific community.
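A minimal sketch of the new dtypes in action:

```python
import torch

# Construct a complex tensor directly from Python complex literals
z = torch.tensor([1 + 2j, 3 - 4j], dtype=torch.complex64)

print(z.dtype)       # torch.complex64
print(z.real)        # real parts: tensor([1., 3.])
print(z.imag)        # imaginary parts: tensor([ 2., -4.])
print(torch.abs(z))  # elementwise magnitudes: tensor([2.2361, 5.0000])
```

The real and imaginary components are exposed as float tensors, mirroring NumPy's `.real`/`.imag` attributes.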
Mobile Updates
PyTorch 1.6 brings increased performance and general stability for mobile on-device inference. We squashed a few bugs, continued maintenance, and added a few new features while improving fp32 and int8 performance for a wide variety of ML model inference on the CPU backend.
[Beta] Mobile Features and Performance
- Stateless and stateful XNNPACK Conv and Linear operators
- Stateless MaxPool2d + JIT optimization passes
- JIT pass optimizations: Conv + BatchNorm fusion, graph rewrite to replace conv2d/linear with xnnpack ops, relu/hardtanh fusion, dropout removal
- QNNPACK integration removes requantization scale constraint
- Per-channel quantization for conv, linear and dynamic linear
- Disable tracing for mobile client to save ~600 KB on full-jit builds
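As an illustration of the quantization support mentioned above, here is dynamic quantization of Linear layers (a hedged sketch using the general PyTorch API; actual mobile deployment additionally involves scripting and saving the model for the mobile runtime):

```python
import torch

# A small float model with a Linear layer
model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU())

# Dynamically quantize Linear weights to int8; activations stay float
# and are quantized on the fly at inference time
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

out = qmodel(torch.randn(1, 16))
print(out.shape)  # torch.Size([1, 8])
```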
Updated Domain Libraries
torchvision 0.7 introduces two new pretrained semantic segmentation models, FCN ResNet50 and DeepLabV3 ResNet50, both trained on COCO and using smaller memory footprints than the ResNet101 backbone. We also introduced support for AMP (Automatic Mixed Precision) autocasting for torchvision models and operators, which automatically selects the floating point precision for different GPU operations to improve performance while maintaining accuracy.
- Release notes (Link)
torchaudio now officially supports Windows. This release also introduces a new model module (with wav2letter included), new functionals (contrast, cvm, dcshift, overdrive, vad, phaser, flanger, biquad), datasets (GTZAN, CMU), and a new optional sox backend with support for TorchScript.
- Release notes (Link)
The Global PyTorch Summer Hackathon is back! This year, teams can compete in three categories virtually:
- PyTorch Developer Tools: Tools or libraries designed to improve productivity and efficiency of PyTorch for researchers and developers
- Web/Mobile Applications powered by PyTorch: Applications with web/mobile interfaces and/or embedded devices powered by PyTorch
- PyTorch Responsible AI Development Tools: Tools, libraries, or web/mobile apps for responsible AI development
This is a great opportunity to connect with the community and practice your machine learning skills.
The 2020 CVPR Low-Power Vision Challenge (LPCV) - Online Track for UAV video submission deadline is coming up shortly. You have until July 31, 2020 to build a system that can accurately discover and recognize characters in video captured by an unmanned aerial vehicle (UAV), using PyTorch and a Raspberry Pi 3B+.
To reiterate, Prototype features in PyTorch are early features that we are looking to gather feedback on, gauge the usefulness of and improve ahead of graduating them to Beta or Stable. The following features are not part of the PyTorch 1.6 release and instead are available in nightlies with separate docs/tutorials to help facilitate early usage and feedback.
Distributed RPC Profiling
Allow users to profile training jobs that use `torch.distributed.rpc` with the autograd profiler, and remotely invoke the profiler in order to collect profiling information across different nodes. The RFC can be found here and a short recipe on how to use this feature can be found here.
TorchScript Module Freezing
Module Freezing is the process of inlining module parameter and attribute values into the TorchScript internal representation. Parameter and attribute values are treated as final values and cannot be modified in the frozen module. The PR for this feature can be found here and a short tutorial on how to use this feature can be found here.
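A minimal sketch of the flow (the module here is a made-up example; `torch.jit.freeze` is a prototype API that operates on a scripted module in eval mode):

```python
import torch

class Scale(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.full((2, 2), 3.0))

    def forward(self, x):
        return x * self.weight

# Script the module, put it in eval mode, then freeze it:
# the parameter is inlined into the graph as a constant
scripted = torch.jit.script(Scale().eval())
frozen = torch.jit.freeze(scripted)

x = torch.ones(2, 2)
print(frozen(x))  # a 2x2 tensor of 3s; the weight is baked into the graph
```

Because the values are constants after freezing, downstream optimization passes can specialize the graph more aggressively.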
Graph Mode Quantization
Eager mode quantization requires users to make changes to their model, including explicitly quantizing activations, fusing modules, and rewriting uses of torch ops with Functional modules; quantization of functionals is not supported. If the model can be traced or scripted, quantization can instead be done automatically with graph mode quantization, without any of the complexities of eager mode, and it is configurable through a `qconfig_dict`. A tutorial on how to use this feature can be found here.
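For example, a `qconfig_dict` might look like the following (a hedged sketch: the empty-string key sets the default qconfig for the whole model, per-module-name entries can override it, and the graph mode quantization entry points themselves are prototype APIs):

```python
import torch

# Default qconfig for the whole model via the '' key;
# keyed entries can override the qconfig for specific submodules
qconfig_dict = {
    '': torch.quantization.get_default_qconfig('fbgemm'),
}
print(qconfig_dict[''])
```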
Quantization Numerical Suite
Quantization is good when it works, but it’s difficult to know what’s wrong when it doesn’t satisfy the expected accuracy. A prototype is now available for a Numerical Suite that measures comparison statistics between quantized modules and float modules. This is available to test using eager mode and on CPU only with more support coming. A tutorial on how to use this feature can be found here.