Benchmark Utils - torch.utils.benchmark
Timer(stmt='pass', setup='pass', timer=<built-in function perf_counter>, globals=None, label=None, sub_label=None, description=None, env=None, num_threads=1, language=<Language.PYTHON: 0>)
Helper class for measuring execution time of PyTorch statements.
For a full tutorial on how to use this class, see: https://pytorch.org/tutorials/recipes/recipes/benchmark.html
The PyTorch Timer is based on timeit.Timer (and in fact uses timeit.Timer internally), but with several key differences:
- Runtime aware:
Timer will perform warmups (important as some elements of PyTorch are lazily initialized), set threadpool size so that comparisons are apples-to-apples, and synchronize asynchronous CUDA functions when necessary.
- Focus on replicates:
When measuring code, and particularly complex kernels / models, run-to-run variation is a significant confounding factor. It is expected that all measurements should include replicates to quantify noise and allow median computation, which is more robust than mean. To that effect, this class deviates from the timeit API by conceptually merging timeit.Timer.repeat and timeit.Timer.autorange. (Exact algorithms are discussed in method docstrings.) The timeit method is replicated for cases where an adaptive strategy is not desired.
- Optional metadata:
When defining a Timer, one can optionally specify label, sub_label, description, and env (defined later). These fields are included in the representation of the result object and are used by the Compare class to group and display results for comparison.
- Instruction counts:
In addition to wall times, Timer can run a statement under Callgrind and report instructions executed.
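The "Focus on replicates" point above can be illustrated with the standard library alone. The sketch below uses made-up replicate timings (hypothetical numbers, not real measurements) in which one run was disturbed by an unrelated event, and compares how the mean and the median respond.

```python
import statistics

# Hypothetical per-run times in seconds; the last replicate is an outlier
# (for example, the process was descheduled during that run).
replicate_times = [1.00e-3, 1.02e-3, 0.99e-3, 1.01e-3, 5.00e-3]

mean_time = statistics.mean(replicate_times)
median_time = statistics.median(replicate_times)

# The mean is pulled toward the outlier; the median stays with the bulk.
print(f"mean:   {mean_time * 1e3:.3f} ms")
print(f"median: {median_time * 1e3:.3f} ms")
```

With a single measurement there would be no way to tell whether the slow run was typical, which is why replicates are treated as the default.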
Directly analogous to timeit.Timer constructor arguments:
stmt, setup, timer, globals
PyTorch Timer specific constructor arguments:
label, sub_label, description, env, num_threads
stmt – Code snippet to be run in a loop and timed.
setup – Optional setup code. Used to define variables used in stmt
timer – Callable which returns the current time. If PyTorch was built without CUDA or there is no GPU present, this defaults to timeit.default_timer; otherwise it will synchronize CUDA before measuring the time.
globals – A dict which defines the global variables when stmt is being executed. This is the other method for providing variables which stmt needs.
label – String which summarizes stmt. For instance, if stmt is “torch.nn.functional.relu(torch.add(x, 1, out=out))” one might set label to “ReLU(x + 1)” to improve readability.
sub_label – Provide supplemental information to disambiguate measurements with identical stmt or label. For instance, in our example above sub_label might be “float” or “int”, so that it is easy to differentiate “ReLU(x + 1): (float)” from “ReLU(x + 1): (int)” when printing Measurements or summarizing using Compare.
description – String to distinguish measurements with identical label and sub_label. The principal use of description is to signal to Compare the columns of data. For instance, one might set it based on the input size to create a table of the form:
                            | n=1 | n=4 | ...
                            ------------- ...
    ReLU(x + 1): (float)    | ... | ... | ...
    ReLU(x + 1): (int)      | ... | ... | ...
using Compare. It is also included when printing a Measurement.
env – This tag indicates that otherwise identical tasks were run in different environments and are therefore not equivalent, for instance when A/B testing a change to a kernel. Compare will treat Measurements with different env specifications as distinct when merging replicate runs.
num_threads – The size of the PyTorch threadpool when executing stmt. Single-threaded performance is important as both a key inference workload and a good indicator of intrinsic algorithmic efficiency, so the default is set to one. This is in contrast to the default PyTorch threadpool size, which tries to utilize all cores.
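A minimal sketch of how the metadata arguments above might be combined, assuming a CPU-only run. The tensor shapes, the `out` buffer created in setup, and the choice of 100 iterations are illustrative assumptions, not recommendations from the documentation.

```python
import torch
from torch.utils.benchmark import Compare, Timer

results = []
for dtype, sub_label in [(torch.float32, "float"), (torch.int64, "int")]:
    for n in (1, 4):
        x = torch.ones((n, n), dtype=dtype)
        timer = Timer(
            stmt="torch.nn.functional.relu(torch.add(x, 1, out=out))",
            setup="out = torch.empty_like(x)",  # runs once, before timing
            globals={"torch": torch, "x": x},   # names visible to stmt/setup
            label="ReLU(x + 1)",                # summarizes stmt
            sub_label=sub_label,                # one table row per sub_label
            description=f"n={n}",               # one table column per size
            num_threads=1,
        )
        results.append(timer.timeit(100))

compare = Compare(results)
compare.print()
```

Here label titles the table, sub_label distinguishes the rows, and description distinguishes the columns, mirroring the table layout described for description.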
Measure many replicates while keeping timer overhead to a minimum.
At a high level, blocked_autorange executes the following pseudo-code:
    `setup`

    total_time = 0
    while total_time < min_run_time:
        start = timer()
        for _ in range(block_size):
            `stmt`
        total_time += (timer() - start)
Note the variable block_size in the inner loop. The choice of block size is important to measurement quality, and must balance two competing objectives:
A small block size results in more replicates and generally better statistics.
A large block size better amortizes the cost of timer invocation, and results in a less biased measurement. This is important because CUDA synchronization time is non-trivial (on the order of single to low double-digit microseconds) and would otherwise bias the measurement.
blocked_autorange sets block_size by running a warmup period, increasing block size until timer overhead is less than 0.1% of the overall computation. This value is then used for the main measurement loop.
A Measurement object that contains measured runtimes and repetition counts, and can be used to compute statistics (mean, median, etc.).
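The block-size trade-off described for blocked_autorange can be sketched in plain Python. This is not the PyTorch implementation: the helper names (`time_block`, `choose_block_size`, `blocked_autorange_sketch`), the tenfold growth step, and the way overhead is estimated from empty blocks are all assumptions made for illustration. Only the 0.1% overhead target and the outer loop come from the description above.

```python
import statistics
import time


def time_block(fn, block_size, timer=time.perf_counter):
    """Time `block_size` back-to-back calls of `fn` with one timer pair."""
    start = timer()
    for _ in range(block_size):
        fn()
    return timer() - start


def choose_block_size(fn, min_run_time, max_overhead=1e-3,
                      timer=time.perf_counter):
    """Grow the block until timer overhead is under `max_overhead` of it."""
    # An empty block measures only the cost of the timer calls themselves.
    overhead = statistics.median([time_block(fn, 0, timer) for _ in range(5)])
    block_size = 1
    while True:
        elapsed = time_block(fn, block_size, timer)
        if elapsed > 0 and overhead / elapsed <= max_overhead:
            return block_size
        if elapsed > min_run_time:
            return block_size  # a single block already fills the time budget
        block_size *= 10


def blocked_autorange_sketch(fn, min_run_time=0.05, timer=time.perf_counter):
    """Warm up to pick a block size, then collect replicates of that size."""
    block_size = choose_block_size(fn, min_run_time, timer=timer)
    total_time = 0.0
    block_times = []
    while total_time < min_run_time:
        elapsed = time_block(fn, block_size, timer)
        block_times.append(elapsed)
        total_time += elapsed
    # Each replicate is the average time of one call within a block.
    return block_size, [t / block_size for t in block_times]


block_size, per_call = blocked_autorange_sketch(lambda: sum(range(100)))
print(f"block_size={block_size}, replicates={len(per_call)}, "
      f"median={statistics.median(per_call) * 1e6:.2f} us per call")
```

A cheaper statement ends up with a larger block and fewer replicates for the same time budget, which is the balance the two objectives above describe.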
Collect instruction counts using Callgrind.
Unlike wall times, instruction counts are deterministic (modulo non-determinism in the program itself and small amounts of jitter from the Python interpreter). This makes them ideal for detailed performance analysis. This method runs stmt in a separate process so that Valgrind can instrument the program. Performance is severely degraded by the instrumentation; however, this is ameliorated by the fact that a small number of iterations is generally sufficient to obtain good measurements.
To use this method, valgrind, callgrind_control, and callgrind_annotate must be installed.
Because there is a process boundary between the caller (this process) and the stmt execution, globals cannot contain arbitrary in-memory data structures (unlike the timing methods). Instead, globals are restricted to builtins, nn.Modules, and TorchScripted functions/modules to reduce the surprise factor from serialization and subsequent deserialization. The GlobalsBridge class provides more detail on this subject. Take particular care with nn.Modules: they rely on pickle and you may need to add an import to setup for them to transfer properly.
By default, a profile for an empty statement will be collected and cached to indicate how many instructions are from the Python loop which drives stmt.
A CallgrindStats object which provides instruction counts and some basic facilities for analyzing and manipulating results.
Mirrors the semantics of timeit.Timer.timeit().
Execute the main statement (stmt) number times. See https://docs.python.org/3/library/timeit.html#timeit.Timer.timeit
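A short usage sketch of the fixed-count path, assuming a CPU-only run. The matrix size, the setup line, and the choice of 200 iterations are arbitrary values picked for illustration.

```python
import torch
from torch.utils.benchmark import Timer

timer = Timer(
    stmt="x @ y",
    setup="y = x.t().contiguous()",     # defines y once, outside the timed loop
    globals={"x": torch.rand(32, 32)},  # makes x visible to setup and stmt
    num_threads=1,
)

# Run stmt a fixed number of times instead of using an adaptive strategy.
measurement = timer.timeit(number=200)

print(measurement)
print(f"runs per measurement: {measurement.number_per_run}")
print(f"median time per run:  {measurement.median * 1e6:.2f} us")
```

Because timeit produces a single measurement of number runs, statistics such as the median are computed from one sample; the adaptive method is the better fit when run-to-run noise needs to be quantified.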