Tracking

Overview & Usage

Note

EXPERIMENTAL, USE AT YOUR OWN RISK, APIs SUBJECT TO CHANGE

In TorchX, applications are binaries (executables), so there is no built-in way to “return” results from them. The torchx.runtime.tracking module allows applications to return simple results (note the keyword “simple”). The return types that the tracker module supports are intentionally constrained. For instance, returning trained model weights, which can be hundreds of GB in size, is not allowed. This module is NOT intended for, nor tuned to, pass around large quantities of data or binary blobs.

When apps are launched as part of a higher-level coordinated effort (e.g. a workflow, pipeline, or hyper-parameter optimization), the result of an app often needs to be accessible to the coordinator or to other apps in the workflow.

Suppose App1 and App2 are launched sequentially, with the output of App1 feeding into App2 as input. Since these are binaries, the typical way to chain inputs and outputs between apps is to pass the output file path of App1 as the input file path of App2:

$ app1 --output-file s3://foo/out/app1.out
$ app2 --input-file s3://foo/out/app1.out


As easy as this may seem, there are a few things one needs to worry about:

1. The format of the file app1.out (app1 needs to write it in the format app2 understands)

2. Actually parsing the url and writing/reading the output file

So the application’s main ends up looking like this (pseudo-code for demonstrative purposes):

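# in app1.py
import json

if __name__ == "__main__":
    accuracy = do_something()
    s3client = ...
    out = {"accuracy": accuracy}

    # serialize the results to a local file in a format app2 understands
    with open("/tmp/out", "w") as f:
        json.dump(out, f)

    # upload the local file to the output location
    with open("/tmp/out", "rb") as f:
        s3client.put(args.output_file, f)

# in app2.py
import json

if __name__ == "__main__":
    s3client = ...
    # download app1's output to a local file
    with open("/tmp/out", "wb") as f:
        s3client.get(args.input_file, f)

    # parse the results written by app1
    with open("/tmp/out", "r") as f:
        results = json.load(f)

    do_something_else(results["accuracy"])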


Instead, trackers created with the same tracker_base can be used across apps to make the return values of one app available to another. There is no need to chain the output file path of one app to the input file path of another, or to deal with custom serialization and file writing.

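# in app1.py
from torchx.runtime.tracking import FsspecResultTracker

if __name__ == "__main__":
    accuracy = do_something()
    tracker = FsspecResultTracker(args.tracker_base)
    tracker["app1_out"] = {"accuracy": accuracy}

# in app2.py
from torchx.runtime.tracking import FsspecResultTracker

if __name__ == "__main__":
    tracker = FsspecResultTracker(args.tracker_base)
    # the tracker returns the whole result dict; pick out the value by name
    app1_accuracy = tracker["app1_out"]["accuracy"]
    do_something_else(app1_accuracy)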
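On the coordinator side, the same pattern extends to several trials. The sketch below is a hypothetical helper (`best_trial` is not part of TorchX) and assumes each trial wrote a result such as `{"accuracy": ...}` under its own key. It accepts any mapping-like tracker, so a plain dict stands in for `FsspecResultTracker` here.

```python
from typing import Dict, Iterable, Mapping, Optional, Union

Key = Union[int, str]
Result = Mapping[str, object]


def best_trial(
    tracker: Mapping[Key, Result],
    keys: Iterable[Key],
    metric: str = "accuracy",
) -> Optional[Key]:
    """Return the key whose result has the highest `metric`, or None.

    Hypothetical helper: trials that have not reported the metric yet
    (missing key or missing metric) are skipped.
    """
    best_key: Optional[Key] = None
    best_value = float("-inf")
    for key in keys:
        try:
            result = tracker[key]
        except KeyError:
            continue
        value = result.get(metric)
        if isinstance(value, (int, float)) and value > best_value:
            best_key, best_value = key, value
    return best_key


# A plain dict stands in for a tracker shared by three trials.
results: Dict[Key, Result] = {
    "trial_0": {"accuracy": 0.71},
    "trial_1": {"accuracy": 0.84},
    "trial_2": {},  # has not reported yet
}
print(best_trial(results, ["trial_0", "trial_1", "trial_2", "trial_3"]))  # trial_1
```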


ResultTracker

Base

class torchx.runtime.tracking.ResultTracker

Base result tracker, which should be sub-classed to implement trackers. Typically there exists a tracker implementation per backing store.

Usage:

# get and put APIs can be used directly or in map-like API
# the following are equivalent
tracker.put("foo", l2norm=1.2)
tracker["foo"] = {"l2norm": 1.2}

# so are these
tracker.get("foo")["l2norm"] == 1.2
tracker["foo"]["l2norm"] == 1.2
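To make the equivalence between the two API styles concrete, here is a minimal in-memory sketch. `InMemoryResultTracker` is a hypothetical class that deliberately does not subclass the real `ResultTracker`, so it runs without TorchX installed. It assumes the map-like operators simply delegate to `put` and `get`, and it raises `KeyError` for unknown keys, which may differ from how a real tracker behaves.

```python
from typing import Dict, Union

Key = Union[int, str]


class InMemoryResultTracker:
    """Illustrative stand-in that mirrors the put/get and map-like APIs.

    Assumption: this class is hypothetical and does not subclass the real
    torchx.runtime.tracking.ResultTracker, so it runs without TorchX.
    """

    def __init__(self) -> None:
        self._store: Dict[Key, Dict[str, object]] = {}

    def put(self, key: Key, **kwargs: object) -> None:
        # Results are passed as keyword arguments and stored as one dict.
        self._store[key] = dict(kwargs)

    def get(self, key: Key) -> Dict[str, object]:
        # Return a copy so callers cannot mutate the stored result.
        return dict(self._store[key])

    def __setitem__(self, key: Key, results: Dict[str, object]) -> None:
        # tracker[key] = {...} delegates to tracker.put(key, **{...})
        self.put(key, **results)

    def __getitem__(self, key: Key) -> Dict[str, object]:
        return self.get(key)


tracker = InMemoryResultTracker()
tracker.put("foo", l2norm=1.2)
tracker["bar"] = {"l2norm": 1.2}
print(tracker.get("foo") == tracker["bar"])  # True
```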


Valid result types are:

1. numeric: int, float

2. literal: str (1 KB size limit when UTF-8 encoded)

Valid key types are:

1. int

2. str
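The type constraints above can be checked before writing to a tracker. The following is a sketch only: `validate_entry` and `MAX_STR_BYTES` are hypothetical names, the limit is assumed to mean 1024 bytes, and rejecting `bool` (a subclass of `int` in Python) is a choice made for this example rather than documented behavior.

```python
from typing import Mapping

MAX_STR_BYTES = 1024  # assumption: the documented "1kb" is read as 1024 bytes


def validate_entry(key: object, results: Mapping[str, object]) -> None:
    """Raise if the key or any result value is outside the listed types."""
    # bool is a subclass of int in Python; this sketch rejects it explicitly.
    if isinstance(key, bool) or not isinstance(key, (int, str)):
        raise TypeError(f"key must be int or str, got {type(key).__name__}")
    for name, value in results.items():
        if isinstance(value, bool) or not isinstance(value, (int, float, str)):
            raise TypeError(f"{name}: unsupported type {type(value).__name__}")
        # The size limit applies to the encoded bytes, not the character count.
        if isinstance(value, str) and len(value.encode("utf-8")) > MAX_STR_BYTES:
            raise ValueError(f"{name}: str exceeds {MAX_STR_BYTES} bytes in UTF-8")


validate_entry("job_1", {"l2norm": 1.2, "note": "ok"})  # passes silently
```

Note that the check measures encoded bytes: a string of 600 two-byte characters is over the limit even though it has fewer than 1024 characters.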

By convention, slashes can be used in result names to store several statistics of the same metric. For instance, to store the mean and the standard error of the mean (sem) of l2norm:

tracker[key] = {"l2norm/mean" : 1.2, "l2norm/sem": 3.4}
tracker[key]["l2norm/mean"] # returns 1.2
tracker[key]["l2norm/sem"] # returns 3.4
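When reading such results back, it can be convenient to regroup the flat names by metric. The helper below is a hypothetical sketch (`group_stats` is not part of TorchX); it splits each name on the first slash and files names without a slash under the stat name `"value"`, which is an arbitrary choice for this example.

```python
from typing import Dict, Mapping


def group_stats(results: Mapping[str, object]) -> Dict[str, Dict[str, object]]:
    """Group "metric/stat" entries into {metric: {stat: value}}.

    Hypothetical helper; entries without a slash are stored under the
    stat name "value".
    """
    grouped: Dict[str, Dict[str, object]] = {}
    for name, value in results.items():
        # partition splits on the first "/" only; sep is "" when absent.
        metric, sep, stat = name.partition("/")
        grouped.setdefault(metric, {})[stat if sep else "value"] = value
    return grouped


print(group_stats({"l2norm/mean": 1.2, "l2norm/sem": 3.4, "epochs": 10}))
# {'l2norm': {'mean': 1.2, 'sem': 3.4}, 'epochs': {'value': 10}}
```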


Keys are assumed to be unique within the scope of the tracker’s backing store. For example, if a tracker is backed by a local directory and the key is the name of the file within that directory where the results are saved, then the same key can be reused under different directories:

# same key, different backing directory -> results are not overwritten
FsspecResultTracker("/tmp/foo")["1"] = {"l2norm":1.2}
FsspecResultTracker("/tmp/bar")["1"] = {"l2norm":3.4}
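The scoping rule can be illustrated with the standard library alone. This sketch assumes a one-JSON-file-per-key layout purely for illustration; the `put` and `get` functions here are hypothetical and do not claim to match the on-disk format that `FsspecResultTracker` actually uses.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def put(base: Path, key: str, results: Dict[str, object]) -> None:
    # One JSON file per key, stored under the tracker's base directory.
    base.mkdir(parents=True, exist_ok=True)
    (base / key).write_text(json.dumps(results))


def get(base: Path, key: str) -> Dict[str, object]:
    return json.loads((base / key).read_text())


with tempfile.TemporaryDirectory() as tmp:
    foo, bar = Path(tmp, "foo"), Path(tmp, "bar")
    put(foo, "1", {"l2norm": 1.2})
    put(bar, "1", {"l2norm": 3.4})
    # Same key, different base directories: both results survive.
    foo_result, bar_result = get(foo, "1"), get(bar, "1")

print(foo_result, bar_result)  # {'l2norm': 1.2} {'l2norm': 3.4}
```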


The tracker is NOT a central entity, hence no strong consistency guarantees (beyond what the backing store provides) are made between put and get operations on the same key. Similarly, no strong consistency guarantees are made between two consecutive put or get operations on the same key.

For example:

tracker[1] = {"l2norm":1.2}
tracker[1] = {"l2norm":3.4}
tracker[1] # NOT GUARANTEED TO BE 3.4!

sleep(1*MIN)
tracker[1] # more likely to be 3.4 but still not guaranteed!
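Because reads may lag behind writes, a reader that needs a specific result can poll until it shows up or give up after a timeout. The following is a sketch under stated assumptions: `wait_for_result` is a hypothetical helper, it works with any mapping-like tracker, and it treats both a `KeyError` and an empty result as "not available yet" since the behavior for a missing key depends on the tracker implementation.

```python
import time
from typing import Callable, Mapping, Optional, Union

Key = Union[int, str]


def wait_for_result(
    tracker: Mapping[Key, Mapping[str, object]],
    key: Key,
    is_ready: Callable[[Mapping[str, object]], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
) -> Optional[Mapping[str, object]]:
    """Poll tracker[key] until is_ready(result) holds or the timeout expires.

    Hypothetical helper. A missing key (KeyError) and an empty result are
    both treated as "not available yet". Returns None on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            result = tracker[key]
        except KeyError:
            result = {}
        if result and is_ready(result):
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval_s)


# A plain dict stands in for the tracker; the write is already visible here.
store = {1: {"l2norm": 3.4, "step": 2}}
latest = wait_for_result(store, 1, lambda r: r.get("step") == 2, timeout_s=1.0)
print(latest)  # {'l2norm': 3.4, 'step': 2}
```

Storing a marker such as the hypothetical `step` field alongside the metric gives the reader a way to tell a stale result from the one it is waiting for.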


It is STRONGLY advised that a unique id is used as the key. This id is often the job id for simple jobs or can be a concatenation of (experiment_id, trial_number) or (job id, replica/worker rank) for iterative applications like hyper-parameter optimization.
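One way to build such keys is sketched below. The function names and the choice of a slash as the separator are assumptions made for this example, not a format prescribed by TorchX.

```python
def trial_key(experiment_id: str, trial_number: int) -> str:
    """Key for one trial of a hyper-parameter search (hypothetical format)."""
    return f"{experiment_id}/{trial_number}"


def replica_key(job_id: str, rank: int) -> str:
    """Key for one replica/worker of a job (hypothetical format)."""
    return f"{job_id}/{rank}"


print(trial_key("exp_42", 3))     # exp_42/3
print(replica_key("job_abc", 0))  # job_abc/0
```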

Fsspec

class torchx.runtime.tracking.FsspecResultTracker(tracker_base: str)

Tracker that uses fsspec under the hood to save results.

Usage:

from torchx.runtime.tracking import FsspecResultTracker

# PUT: in trainer.py
tracker_base = "/tmp/foobar" # also supports URIs (e.g. "s3://bucket/trainer/123")
tracker = FsspecResultTracker(tracker_base)
tracker["attempt_1/out"] = {"accuracy": 0.233}

# GET: anywhere outside trainer.py
tracker = FsspecResultTracker(tracker_base)
print(tracker["attempt_1/out"]["accuracy"])

0.233