.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "recipes/recipes/timer_quick_start.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_recipes_recipes_timer_quick_start.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_recipes_recipes_timer_quick_start.py:


Timer quick start
=================

In this tutorial, we're going to cover the primary APIs of
`torch.utils.benchmark.Timer`. The PyTorch Timer is based on the
`timeit.Timer <https://docs.python.org/3/library/timeit.html#timeit.Timer>`__
API, with several PyTorch-specific modifications. Familiarity with the
builtin `Timer` class is not required for this tutorial; however, we assume
that the reader is familiar with the fundamentals of performance work.

For a more comprehensive performance tuning tutorial, see
`PyTorch Benchmark <https://pytorch.org/tutorials/recipes/recipes/benchmark.html>`__.


**Contents:**
    1. `Defining a Timer <#defining-a-timer>`__
    2. `Wall time: Timer.blocked_autorange(...) <#wall-time-timer-blocked-autorange>`__
    3. `C++ snippets <#c-snippets>`__
    4. `Instruction counts: Timer.collect_callgrind(...) <#instruction-counts-timer-collect-callgrind>`__
    5. `Instruction counts: Delving deeper <#instruction-counts-delving-deeper>`__
    6. `A/B testing with Callgrind <#a-b-testing-with-callgrind>`__
    7. `Wrapping up <#wrapping-up>`__
    8. `Footnotes <#footnotes>`__

.. GENERATED FROM PYTHON SOURCE LINES 29-34

1. Defining a Timer
~~~~~~~~~~~~~~~~~~~

A `Timer` serves as a task definition.

.. GENERATED FROM PYTHON SOURCE LINES 34-59

.. code-block:: default


    from torch.utils.benchmark import Timer

    timer = Timer(
        # The computation which will be run in a loop and timed.
        stmt="x * y",

        # `setup` will be run before calling the measurement loop, and is used to
        # populate any state which is needed by `stmt`.
        setup="""
            x = torch.ones((128,))
            y = torch.ones((128,))
        """,

        # Alternatively, ``globals`` can be used to pass variables from the outer scope.
        #
        # globals={
        #     "x": torch.ones((128,)),
        #     "y": torch.ones((128,)),
        # },

        # Control the number of threads that PyTorch uses. (Default: 1)
        num_threads=1,
    )

.. GENERATED FROM PYTHON SOURCE LINES 60-67

2. Wall time: ``Timer.blocked_autorange(...)``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This method will handle details such as picking a suitable number of
repeats, fixing the number of threads, and providing a convenient
representation of the results.

.. GENERATED FROM PYTHON SOURCE LINES 67-75

.. code-block:: default


    # Measurement objects store the results of multiple repeats, and provide
    # various utility features.
    from torch.utils.benchmark import Measurement

    m: Measurement = timer.blocked_autorange(min_run_time=1)
    print(m)

.. GENERATED FROM PYTHON SOURCE LINES 76-89

.. code-block:: none
   :caption: **Snippet wall time.**

    x * y
    setup:
      x = torch.ones((128,))
      y = torch.ones((128,))

      Median: 2.34 us
      IQR:    0.07 us (2.31 to 2.38)
      424 measurements, 1000 runs per measurement, 1 thread
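
The ``Measurement`` returned by ``blocked_autorange`` can also be consumed
programmatically rather than printed. A minimal sketch, reusing ``m`` from
above (``median``, ``mean``, ``iqr``, and ``times`` are ``Measurement``
attributes in current PyTorch; the 5 microsecond budget below is an
arbitrary illustration, not part of the API):

.. code-block:: python

    # All values are reported in seconds.
    print(f"Median: {m.median * 1e6:.2f} us")
    print(f"Mean:   {m.mean * 1e6:.2f} us")
    print(f"IQR:    {m.iqr * 1e6:.2f} us")

    # `m.times` holds the individual per-run averages, e.g. for plotting.
    # A simple (hypothetical) regression check:
    if m.median > 5e-6:
        print("Snippet exceeded its 5 us budget.")
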
.. GENERATED FROM PYTHON SOURCE LINES 91-94

3. C++ snippets
~~~~~~~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 94-108

.. code-block:: default


    from torch.utils.benchmark import Language

    cpp_timer = Timer(
        "x * y;",
        """
            auto x = torch::ones({128});
            auto y = torch::ones({128});
        """,
        language=Language.CPP,
    )

    print(cpp_timer.blocked_autorange(min_run_time=1))

.. GENERATED FROM PYTHON SOURCE LINES 109-122

.. code-block:: none
   :caption: **C++ snippet wall time.**

    x * y;
    setup:
      auto x = torch::ones({128});
      auto y = torch::ones({128});

      Median: 1.21 us
      IQR:    0.03 us (1.20 to 1.23)
      83 measurements, 10000 runs per measurement, 1 thread

.. GENERATED FROM PYTHON SOURCE LINES 124-126

Unsurprisingly, the C++ snippet is both faster and has lower variation.

.. GENERATED FROM PYTHON SOURCE LINES 128-137

4. Instruction counts: ``Timer.collect_callgrind(...)``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For deep dive investigations, ``Timer.collect_callgrind`` wraps
`Callgrind <https://valgrind.org/docs/manual/cl-manual.html>`__ in order to
collect instruction counts. These are useful as they offer fine-grained and
deterministic (or very low noise, in the case of Python) insights into how a
snippet is run.

.. GENERATED FROM PYTHON SOURCE LINES 137-143

.. code-block:: default


    from torch.utils.benchmark import CallgrindStats, FunctionCounts

    stats: CallgrindStats = cpp_timer.collect_callgrind()
    print(stats)

.. GENERATED FROM PYTHON SOURCE LINES 144-158

.. code-block:: none
   :caption: **C++ Callgrind stats (summary)**

    x * y;
    setup:
      auto x = torch::ones({128});
      auto y = torch::ones({128});

                            All          Noisy symbols removed
        Instructions:       563600                 563600
        Baseline:                0                      0
    100 runs per measurement, 1 thread

.. GENERATED FROM PYTHON SOURCE LINES 160-192

5. Instruction counts: Delving deeper
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The string representation of ``CallgrindStats`` is similar to that of
``Measurement``. `Noisy symbols` are a Python concept (removing calls in the
CPython interpreter which are known to be noisy).

For more detailed analysis, however, we will want to look at specific calls.
``CallgrindStats.stats()`` returns a ``FunctionCounts`` object to make this
easier. Conceptually, ``FunctionCounts`` can be thought of as a tuple of
pairs with some utility methods, where each pair is
`(number of instructions, file path and function name)`.

A note on paths: one generally doesn't care about the absolute path. For
instance, the full path and function name for a multiply call is something
like:

.. code-block:: sh

    /the/prefix/to/your/pytorch/install/dir/pytorch/build/aten/src/ATen/core/TensorMethods.cpp:at::Tensor::mul(at::Tensor const&) const [/the/path/to/your/conda/install/miniconda3/envs/ab_ref/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so]

when in reality, all of the information that we're interested in can be
represented as:

.. code-block:: sh

    build/aten/src/ATen/core/TensorMethods.cpp:at::Tensor::mul(at::Tensor const&) const

``CallgrindStats.as_standardized()`` makes a best effort to strip low signal
portions of the file path, as well as the shared object, and is generally
recommended.

.. GENERATED FROM PYTHON SOURCE LINES 192-196

.. code-block:: default


    inclusive_stats = stats.as_standardized().stats(inclusive=False)
    print(inclusive_stats[:10])

.. GENERATED FROM PYTHON SOURCE LINES 197-214

.. code-block:: none
   :caption: **C++ Callgrind stats (detailed)**

    <torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.FunctionCounts object at 0x7f192a6dfd90>
      47264  ???:_int_free
      25963  ???:_int_malloc
      19900  build/../aten/src/ATen/TensorIter ... (at::TensorIteratorConfig const&)
      18000  ???:__tls_get_addr
      13500  ???:malloc
      11300  build/../c10/util/SmallVector.h:a ... (at::TensorIteratorConfig const&)
      10345  ???:_int_memalign
      10000  build/../aten/src/ATen/TensorIter ... (at::TensorIteratorConfig const&)
       9200  ???:free
       8000  build/../c10/util/SmallVector.h:a ... IteratorBase::get_strides() const

    Total: 173472
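
``FunctionCounts`` also supports ``len``, integer indexing, and iteration,
so individual entries can be inspected directly. A small sketch, assuming
(as in current PyTorch) that each entry behaves like a named tuple with
``count`` and ``function`` fields:

.. code-block:: python

    # Each entry pairs an instruction count with a function identifier.
    for count, function in inclusive_stats[:3]:
        print(f"{count:>8}  {function[:60]}")

    # Integer indexing returns a single entry whose fields can be
    # accessed by name; slicing returns another FunctionCounts.
    top = inclusive_stats[0]
    print(top.count, top.function[:60])
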
.. GENERATED FROM PYTHON SOURCE LINES 216-221

That's still quite a lot to digest. Let's use the ``FunctionCounts.transform``
method to trim some of the function path, and discard the called function.
When we do, the counts of any collisions (e.g. ``foo.h:a()`` and ``foo.h:b()``
will both map to ``foo.h``) will be added together.

.. GENERATED FROM PYTHON SOURCE LINES 221-237

.. code-block:: default


    import os
    import re

    def group_by_file(fn_name: str):
        if fn_name.startswith("???"):
            fn_dir, fn_file = fn_name.split(":")[:2]
        else:
            fn_dir, fn_file = os.path.split(fn_name.split(":")[0])
            fn_dir = re.sub("^.*build/../", "", fn_dir)
            fn_dir = re.sub("^.*torch/", "torch/", fn_dir)

        return f"{fn_dir:<15} {fn_file}"

    print(inclusive_stats.transform(group_by_file)[:10])

.. GENERATED FROM PYTHON SOURCE LINES 238-255

.. code-block:: none
   :caption: **Callgrind stats (condensed)**

     118200  aten/src/ATen    TensorIterator.cpp
      65000  c10/util         SmallVector.h
      47264  ???              _int_free
      25963  ???              _int_malloc
      20900  c10/util         intrusive_ptr.h
      18000  ???              __tls_get_addr
      15900  c10/core         TensorImpl.h
      15100  c10/core         CPUAllocator.cpp
      13500  ???              malloc
      12500  c10/core         TensorImpl.cpp

    Total: 352327

.. GENERATED FROM PYTHON SOURCE LINES 257-269

6. A/B testing with ``Callgrind``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One of the most useful features of instruction counts is that they allow
fine-grained comparison of computation, which is critical when analyzing
performance.

To see this in action, let's compare our multiplication of two size 128
Tensors with a {128} x {1} multiplication, which will broadcast the second
Tensor:

    result = {a0 * b0, a1 * b0, ..., a127 * b0}

.. GENERATED FROM PYTHON SOURCE LINES 269-279

.. code-block:: default


    broadcasting_stats = Timer(
        "x * y;",
        """
            auto x = torch::ones({128});
            auto y = torch::ones({1});
        """,
        language=Language.CPP,
    ).collect_callgrind().as_standardized().stats(inclusive=False)

.. GENERATED FROM PYTHON SOURCE LINES 280-285

Often we want to A/B test two different environments (e.g. to test a PR, or
to experiment with compile flags). This is quite simple, as
``CallgrindStats``, ``FunctionCounts``, and ``Measurement`` are all
pickleable. Simply save measurements from each environment, and load them in
a single process for analysis.
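
For instance, each environment can dump its counts to disk for later
comparison. A minimal sketch of that workflow (the helper functions and
file names here are hypothetical illustrations, not part of the API):

.. code-block:: python

    import pickle

    def save_counts(counts, path: str) -> None:
        # Persist a pickleable FunctionCounts object to disk.
        with open(path, "wb") as f:
            pickle.dump(counts, f)

    def load_counts(path: str):
        with open(path, "rb") as f:
            return pickle.load(f)

    # In environment A:  save_counts(inclusive_stats, "env_a_counts.pkl")
    # In environment B:  save_counts(broadcasting_stats, "env_b_counts.pkl")
    # Then, in a single analysis process:
    #     print(load_counts("env_b_counts.pkl") - load_counts("env_a_counts.pkl"))
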
.. GENERATED FROM PYTHON SOURCE LINES 285-304

.. code-block:: default


    import pickle

    # Let's round trip `broadcasting_stats` just to show that we can.
    broadcasting_stats = pickle.loads(pickle.dumps(broadcasting_stats))


    # And now to diff the two tasks:
    delta = broadcasting_stats - inclusive_stats

    def extract_fn_name(fn: str):
        """Trim everything except the function name."""
        fn = ":".join(fn.split(":")[1:])
        return re.sub(r"\(.+\)", "(...)", fn)

    # We use `.transform` to make the diff readable:
    print(delta.transform(extract_fn_name))

.. GENERATED FROM PYTHON SOURCE LINES 305-331

.. code-block:: none
   :caption: **Instruction count delta**

     17600  at::TensorIteratorBase::compute_strides(...)
     12700  at::TensorIteratorBase::allocate_or_resize_outputs()
     10200  c10::SmallVectorImpl::operator=(...)
      7400  at::infer_size(...)
      6200  at::TensorIteratorBase::invert_perm(...) const
      6064  _int_free
      5100  at::TensorIteratorBase::reorder_dimensions()
      4300  malloc
      4300  at::TensorIteratorBase::compatible_stride(...) const
       ...
       -28  _int_memalign
      -100  c10::impl::check_tensor_options_and_extract_memory_format(...)
      -300  __memcmp_avx2_movbe
      -400  at::detail::empty_cpu(...)
     -1100  at::TensorIteratorBase::numel() const
     -1300  void at::native::(...)
     -2400  c10::TensorImpl::is_contiguous(...) const
     -6100  at::TensorIteratorBase::compute_fast_setup_type(...)
    -22600  at::TensorIteratorBase::fast_set_up(...)

    Total: 58091

.. GENERATED FROM PYTHON SOURCE LINES 333-338

So the broadcasting version takes an extra 580 instructions per call (recall
that we're collecting 100 runs per sample), or about 10%. There are quite a
few ``TensorIterator`` calls, so let's drill down to those.
``FunctionCounts.filter`` makes this easy.

.. GENERATED FROM PYTHON SOURCE LINES 338-341

.. code-block:: default


    print(delta.transform(extract_fn_name).filter(lambda fn: "TensorIterator" in fn))

.. GENERATED FROM PYTHON SOURCE LINES 342-360

.. code-block:: none
   :caption: **Instruction count delta (filter)**

     17600  at::TensorIteratorBase::compute_strides(...)
     12700  at::TensorIteratorBase::allocate_or_resize_outputs()
      6200  at::TensorIteratorBase::invert_perm(...) const
      5100  at::TensorIteratorBase::reorder_dimensions()
      4300  at::TensorIteratorBase::compatible_stride(...) const
      4000  at::TensorIteratorBase::compute_shape(...)
      2300  at::TensorIteratorBase::coalesce_dimensions()
      1600  at::TensorIteratorBase::build(...)
     -1100  at::TensorIteratorBase::numel() const
     -6100  at::TensorIteratorBase::compute_fast_setup_type(...)
    -22600  at::TensorIteratorBase::fast_set_up(...)

    Total: 24000

.. GENERATED FROM PYTHON SOURCE LINES 362-368

This makes plain what is going on: there is a fast path in ``TensorIterator``
setup, but in the {128} x {1} case we miss it and have to do a more general
analysis, which is more expensive. The most prominent call omitted by the
filter is ``c10::SmallVectorImpl::operator=(...)``, which is also part of the
more general setup.

.. GENERATED FROM PYTHON SOURCE LINES 370-381

7. Wrapping up
~~~~~~~~~~~~~~

In summary, use `Timer.blocked_autorange` to collect wall times. If timing
variation is too high, increase `min_run_time`, or move to C++ snippets if
convenient.

For fine-grained analysis, use `Timer.collect_callgrind` to measure
instruction counts and `FunctionCounts.(__add__ / __sub__ / transform / filter)`
to slice-and-dice them.

.. GENERATED FROM PYTHON SOURCE LINES 383-398

8. Footnotes
~~~~~~~~~~~~

- Implied ``import torch``
    If `globals` does not contain "torch", Timer will automatically
    populate it. This means that ``Timer("torch.empty(())")`` will work.
    (Though other imports should be placed in `setup`, e.g.
    ``Timer("np.zeros(())", "import numpy as np")``.)

- ``REL_WITH_DEB_INFO``
    In order to provide full information about the PyTorch internals which
    are executed, ``Callgrind`` needs access to C++ debug symbols. This is
    accomplished by setting ``REL_WITH_DEB_INFO=1`` when building PyTorch.
    Otherwise function calls will be opaque. (The resultant
    ``CallgrindStats`` will warn if debug symbols are missing.)


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 0.000 seconds)


.. _sphx_glr_download_recipes_recipes_timer_quick_start.py:


.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: timer_quick_start.py <timer_quick_start.py>`


    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: timer_quick_start.ipynb <timer_quick_start.ipynb>`


.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_