Metrics
The metrics API in torchelastic enables users to publish telemetry
metrics of their jobs. torchelastic also publishes platform level
metrics such as latencies for certain stages of work
(e.g. re-rendezvous). A metric can be thought of as timeseries data and is uniquely identified by the string-valued tuple (metric_group, metric_name).
torchelastic makes no assumptions about what a metric_group is or what relationship it has with metric_name. It is entirely up to the user to use these two fields to uniquely identify a metric.
A sensible way to use metric groups is to map them to a stage or module in your job. You may also encode certain high-level properties of the job in the group, such as the region or the deployment stage (dev vs. prod).
The metric group torchelastic is used by torchelastic for the platform-level metrics that it produces. For instance, torchelastic may output the latency (in milliseconds) of a checkpoint operation by creating the metric (torchelastic, checkpoint.write_latency_ms).
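To make the identification rule concrete, here is a minimal sketch of timeseries keyed by the (metric_group, metric_name) tuple. It does not use torchelastic at all: the series store, the record() helper, and the group and metric names are hypothetical and exist only for illustration.

```python
from collections import defaultdict
import time

# Hypothetical in-memory store, not part of torchelastic: each
# (metric_group, metric_name) pair keys its own list of (timestamp, value) points.
series = defaultdict(list)


def record(metric_group, metric_name, value, timestamp=None):
    """Append one data point to the timeseries for (metric_group, metric_name)."""
    ts = time.time() if timestamp is None else timestamp
    series[(metric_group, metric_name)].append((ts, value))


# The same metric name under two groups yields two distinct timeseries.
record("trainer.dev", "step_latency_ms", 120, timestamp=1.0)
record("trainer.prod", "step_latency_ms", 95, timestamp=2.0)
record("trainer.prod", "step_latency_ms", 101, timestamp=3.0)
```

Encoding the deployment stage in the group name, as done here, keeps dev and prod data points apart even though both use the same metric name.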
Add Metric Data
Using torchelastic's metrics API is similar to using Python's logging framework. You first get a handle to the metric stream and then add metric values to the stream. The example below measures the latency of the calculate() function.
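import time

import torchelastic.metrics as metrics


def my_method():
    # Get a handle to the metric stream for the "my_app" group.
    ms = metrics.getStream(group="my_app")
    start = time.time()
    calculate()  # stands in for the work being measured
    end = time.time()
    # int() truncates the elapsed time to whole seconds.
    ms.add_value("calculate_latency", int(end - start))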
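Because truncating to whole seconds reports 0 for any call that finishes in under a second, it can help to record milliseconds instead. The sketch below shows one way to do that with a context manager. It is not part of torchelastic: timed() and ListStream are hypothetical names, and ListStream only imitates the add_value(name, value) call used above so that the snippet runs on its own.

```python
import time
from contextlib import contextmanager


class ListStream:
    """Hypothetical stand-in for a metric stream; it only records values in a list."""

    def __init__(self):
        self.values = []

    def add_value(self, metric_name, metric_value):
        self.values.append((metric_name, metric_value))


@contextmanager
def timed(stream, metric_name):
    """Time the enclosed block and add the elapsed whole milliseconds to the stream."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = int((time.perf_counter() - start) * 1000)
        stream.add_value(metric_name, elapsed_ms)


stream = ListStream()
with timed(stream, "calculate_latency_ms"):
    time.sleep(0.05)  # placeholder for real work such as calculate()
```

time.perf_counter() is used rather than time.time() because it is monotonic and intended for measuring short intervals. Putting the unit in the metric name (here the _ms suffix) follows the checkpoint.write_latency_ms convention mentioned earlier.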
Publish Metrics
The MetricHandler is responsible for emitting the added metric values to a particular destination. Metric groups can be configured with different metric handlers.
By default, torchelastic emits all metrics to /dev/null. With the following configuration, metrics in the torchelastic and my_app metric groups are printed to the console.
import torchelastic.metrics as metrics

metrics.configure(metrics.ConsoleMetricHandler(), group="torchelastic")
metrics.configure(metrics.ConsoleMetricHandler(), group="my_app")
Implementing a Custom Metric Handler
If you want your metrics to be emitted to a custom location, implement the MetricHandler interface and configure your job to use your custom metric handler.
Below is a toy example that prints the metrics to stdout:
import torchelastic.metrics as metrics


class StdoutMetricHandler(metrics.MetricHandler):
    def emit(self, metric_data):
        # Print one line per metric value: [timestamp][group]: name=value
        print(
            f"[{metric_data.timestamp}][{metric_data.group_name}]: {metric_data.name}={metric_data.value}"
        )


metrics.configure(StdoutMetricHandler(), group="my_app")
Now all metrics in the group my_app will be printed to stdout as:
[1574213883.4182858][my_app]: my_metric=<value>
[1574213940.5237644][my_app]: my_metric=<value>
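The same emit() pattern extends to other destinations. As a hedged sketch, the handler below writes one JSON object per metric to any file-like object. To keep it runnable without torchelastic installed, it does not subclass metrics.MetricHandler, and FakeMetricData is a hypothetical stand-in that carries only the four fields read in the example above (timestamp, group_name, name, value). In a real job the class would presumably derive from metrics.MetricHandler and be registered with metrics.configure().

```python
import io
import json
from dataclasses import dataclass


@dataclass
class FakeMetricData:
    """Hypothetical stand-in carrying the four fields the example above reads."""

    timestamp: float
    group_name: str
    name: str
    value: int


class JsonLinesMetricHandler:
    """Writes one JSON object per metric to a file-like object."""

    def __init__(self, out):
        self.out = out

    def emit(self, metric_data):
        record = {
            "timestamp": metric_data.timestamp,
            "group": metric_data.group_name,
            "name": metric_data.name,
            "value": metric_data.value,
        }
        self.out.write(json.dumps(record) + "\n")


buffer = io.StringIO()
handler = JsonLinesMetricHandler(buffer)
handler.emit(FakeMetricData(1574213883.4182858, "my_app", "my_metric", 42))
```

Passing the output object into the constructor lets the same handler write to an in-memory buffer in tests and to an open log file in a job.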