Metrics

The metrics API in torchelastic lets users publish telemetry metrics for their jobs. torchelastic also publishes platform-level metrics, such as latencies for certain stages of work (e.g. re-rendezvous). A metric can be thought of as time-series data and is uniquely identified by the string-valued tuple (metric_group, metric_name).

torchelastic makes no assumptions about what a metric_group is or how it relates to metric_name. It is entirely up to the user to choose these two fields so that together they uniquely identify a metric.

A sensible way to use metric groups is to map them to a stage or module in your job. You may also encode certain high-level properties of the job, such as the region or the deployment stage (dev vs. prod).

torchelastic uses the metric group torchelastic for the platform-level metrics that it produces. For instance, torchelastic may output the latency (in milliseconds) of a checkpoint operation by creating the metric:

(torchelastic, checkpoint.write_latency_ms)
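To make the identification scheme concrete, the sketch below models metrics as time series keyed by the (metric_group, metric_name) tuple. It does not use torchelastic at all; the `series` dictionary and the `record` helper are hypothetical names chosen for illustration, and the sample values are made up.

```python
from collections import defaultdict

# Hypothetical in-memory store: each (metric_group, metric_name) pair
# maps to its own list of (timestamp, value) samples.
series = defaultdict(list)


def record(group, name, value, timestamp):
    """Append one sample to the time series identified by (group, name)."""
    series[(group, name)].append((timestamp, value))


# The same metric name in two different groups yields two distinct series.
record("torchelastic", "checkpoint.write_latency_ms", 120, timestamp=1.0)
record("my_app", "checkpoint.write_latency_ms", 95, timestamp=1.5)
record("my_app", "checkpoint.write_latency_ms", 101, timestamp=2.0)

for (group, name), samples in sorted(series.items()):
    print(group, name, len(samples))
```

Because the key is the full tuple, a metric name only has to be unique within its group.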

Add Metric Data

Using torchelastic’s metrics API is similar to using Python’s logging framework. You first get a handle to a metric stream and then add metric values to that stream. The example below measures the latency of the calculate() function.

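import time
import torchelastic.metrics as metrics

def my_method():
    ms = metrics.getStream(group="my_app")
    start = time.time()
    calculate()  # the user-defined function being measured
    end = time.time()

    # time.time() returns seconds; convert to whole milliseconds
    ms.add_value("calculate_latency_ms", int((end - start) * 1000))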
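Timing code by hand becomes repetitive once several functions are measured. One possible approach is a small context manager that wraps the pattern. This is only a sketch: `timed` and `FakeStream` are hypothetical names, not part of torchelastic. `FakeStream` stands in for the object returned by metrics.getStream() so the example runs on its own, and it assumes only the add_value(name, value) method shown above.

```python
import time
from contextlib import contextmanager


class FakeStream:
    """Stand-in for a metric stream; records values instead of publishing them."""

    def __init__(self):
        self.values = []

    def add_value(self, name, value):
        self.values.append((name, value))


@contextmanager
def timed(stream, metric_name):
    """Record the wall-clock duration of the enclosed block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # The value is recorded even if the enclosed block raises.
        elapsed_ms = int((time.perf_counter() - start) * 1000)
        stream.add_value(metric_name, elapsed_ms)


ms = FakeStream()
with timed(ms, "calculate_latency_ms"):
    time.sleep(0.05)  # placeholder for calculate()

print(ms.values)
```

In a real job, the stream passed to `timed` would be the handle obtained from metrics.getStream(group="my_app").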

Publish Metrics

The MetricHandler is responsible for emitting the added metric values to a particular destination. Each metric group can be configured with its own metric handler. By default, torchelastic emits all metrics to /dev/null. With the following configuration, metrics in the torchelastic and my_app metric groups are printed to the console.

import torchelastic.metrics as metrics

metrics.configure(metrics.ConsoleMetricHandler(), group="torchelastic")
metrics.configure(metrics.ConsoleMetricHandler(), group="my_app")

Implementing a Custom Metric Handler

If you want your metrics to be emitted to a custom location, implement the MetricHandler interface and configure your job to use your custom metric handler.

Below is a toy example that prints the metrics to stdout:

import torchelastic.metrics as metrics

class StdoutMetricHandler(metrics.MetricHandler):
    def emit(self, metric_data):
        print(
            f"[{metric_data.timestamp}][{metric_data.group_name}]: {metric_data.name}={metric_data.value}"
        )

metrics.configure(StdoutMetricHandler(), group="my_app")

Now all metrics in the group my_app will be printed to stdout as:

[1574213883.4182858][my_app]: my_metric=<value>
[1574213940.5237644][my_app]: my_metric=<value>
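A handler closer to real use might serialize each metric instead of printing it. The sketch below writes one JSON object per line to a text stream. To stay runnable without torchelastic installed, it does not subclass metrics.MetricHandler; `JsonLinesMetricHandler` and `FakeMetricData` are hypothetical names, and `FakeMetricData` assumes only the four fields used in the toy example above (timestamp, group_name, name, value).

```python
import io
import json
from dataclasses import dataclass


@dataclass
class FakeMetricData:
    """Stand-in carrying only the four fields used in the toy example."""

    timestamp: float
    group_name: str
    name: str
    value: int


class JsonLinesMetricHandler:
    """Writes each metric as one JSON object per line to a text stream.

    In a real job this class would subclass metrics.MetricHandler.
    """

    def __init__(self, stream):
        self._stream = stream

    def emit(self, metric_data):
        record = {
            "timestamp": metric_data.timestamp,
            "group": metric_data.group_name,
            "name": metric_data.name,
            "value": metric_data.value,
        }
        self._stream.write(json.dumps(record) + "\n")


buffer = io.StringIO()
handler = JsonLinesMetricHandler(buffer)
handler.emit(FakeMetricData(1574213883.4, "my_app", "my_metric", 42))

print(buffer.getvalue(), end="")
```

Passing an open file or a socket wrapper instead of the in-memory buffer would send the same records to a custom location.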
