Metrics
=======

The metrics API in torchelastic enables users to publish telemetry
metrics of their jobs. torchelastic also publishes platform-level
metrics, such as latencies for certain stages of work
(e.g. re-rendezvous).

A ``metric`` can be thought of as time series data and is uniquely
identified by the string-valued tuple ``(metric_group, metric_name)``.

torchelastic makes no assumptions about what a ``metric_group`` is and
what relationship it has with ``metric_name``. It is entirely up to the
user to use these two fields to uniquely identify a metric.

.. note:: A sensible way to use metric groups is to map them to a stage
          or module in your job. You may also encode certain high-level
          properties of the job, such as the region or stage
          (dev vs prod).

The metric group ``torchelastic`` is used by torchelastic for the
platform-level metrics that it produces. For instance, torchelastic may
output the latency (in milliseconds) of a checkpoint operation by
creating the metric::

   (torchelastic, checkpoint.write_latency_ms)

Add Metric Data
---------------

Using torchelastic's metrics API is similar to using Python's logging
framework. You first get a handle to the metric stream and then add
metric values to the stream.

The example below measures the latency of the ``calculate()`` function.

.. code:: python

   import time

   import torchelastic.metrics as metrics


   def my_method():
       ms = metrics.getStream(group="my_app")
       start = time.time()
       calculate()
       end = time.time()
       # elapsed wall-clock time, truncated to whole seconds
       ms.add_value("calculate_latency", int(end - start))

Publish Metrics
---------------

The ``MetricHandler`` is responsible for emitting the added metric
values to a particular destination. Metric groups can be configured
with different metric handlers. By default torchelastic emits all
metrics to ``/dev/null``.

With the following configuration, metrics in the ``torchelastic`` and
``my_app`` metric groups will be printed to the console.
.. code:: python

   import torchelastic.metrics as metrics

   metrics.configure(metrics.ConsoleMetricHandler(), group="torchelastic")
   metrics.configure(metrics.ConsoleMetricHandler(), group="my_app")

Implementing a Custom Metric Handler
------------------------------------

If you want your metrics to be emitted to a custom location, implement
the ``MetricHandler`` interface and configure your job to use your
custom metric handler.

Below is a toy example that prints the metrics to ``stdout``:

.. code:: python

   import torchelastic.metrics as metrics


   class StdoutMetricHandler(metrics.MetricHandler):
       def emit(self, metric_data):
           ts = metric_data.timestamp
           group = metric_data.group_name
           name = metric_data.name
           value = metric_data.value
           print(f"[{ts}][{group}]: {name}={value}")


   metrics.configure(StdoutMetricHandler(), group="my_app")

Now all metrics in the group ``my_app`` will be printed to stdout as:

::

   [1574213883.4182858][my_app]: my_metric=<value>
   [1574213940.5237644][my_app]: my_metric=<value>
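To make the ``(metric_group, metric_name)`` keying concrete, here is a
minimal, self-contained sketch that mimics the handler pattern above
without depending on torchelastic. ``MetricData`` and
``InMemoryMetricHandler`` are hypothetical stand-ins for illustration
only; the real ``metric_data`` record passed to ``emit()`` is defined by
torchelastic.

```python
import time
from collections import namedtuple

# Hypothetical stand-in for the record a handler receives in emit().
MetricData = namedtuple("MetricData", ["timestamp", "group_name", "name", "value"])


class InMemoryMetricHandler:
    """Toy handler: buffers datapoints in memory, keyed by (group, name)."""

    def __init__(self):
        # (metric_group, metric_name) -> [(timestamp, value), ...]
        self.series = {}

    def emit(self, metric_data):
        key = (metric_data.group_name, metric_data.name)
        self.series.setdefault(key, []).append(
            (metric_data.timestamp, metric_data.value)
        )


handler = InMemoryMetricHandler()
handler.emit(MetricData(time.time(), "my_app", "calculate_latency", 3))
handler.emit(MetricData(time.time(), "my_app", "calculate_latency", 5))
handler.emit(MetricData(time.time(), "torchelastic", "checkpoint.write_latency_ms", 120))

# Two distinct time series: one per unique (group, name) tuple.
print(len(handler.series))  # -> 2
```

Because the key is the full tuple, the same ``metric_name`` used in two
different groups produces two independent time series.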