Tracking¶
Overview & Usage¶
Note
EXPERIMENTAL, USE AT YOUR OWN RISK, APIs SUBJECT TO CHANGE
In TorchX, applications are binaries (executables), hence there is no built-in way to “return” results from applications.
The torchx.runtime.tracking
module allows applications
to return simple results (note the keyword “simple”). The return types
that the tracker module supports are intentionally constrained. For instance,
attempting to return the trained model weights, which can be hundreds of GB in size,
is not allowed. This module is NOT intended or tuned to pass around large
quantities of data or binary blobs.
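If a large artifact (e.g. trained model weights) does need to be handed off, one workaround is to track a small pointer, such as a str URI to where the artifact was written, rather than the artifact itself. A minimal sketch; the tracker base and checkpoint URI below are illustrative, not prescribed by the module:

from torchx.runtime.tracking import FsspecResultTracker

tracker = FsspecResultTracker("/tmp/tracker")  # illustrative tracker base
tracker["trainer/out"] = {
    "accuracy": 0.92,                                      # small numeric result: supported
    "checkpoint": "s3://bucket/experiments/123/model.pt",  # str pointer to the large artifact, not the artifact itself
}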
When apps are launched as part of a higher-level coordinated effort (e.g. a workflow, pipeline, or hyper-parameter optimization run), the result of an app often needs to be accessible to the coordinator or to other apps in the workflow.
Suppose App1 and App2 are launched sequentially, with the output of App1 feeding into App2 as input. Since these are binaries, the typical way to chain inputs/outputs between apps is by passing the output file path of App1 as the input file path of App2:
$ app1 --output-file s3://foo/out/app1.out
$ app2 --input-file s3://foo/out/app1.out
As easy as this may seem, there are a few things one needs to worry about:
- The format of the file app1.out (app1 needs to write it in a format that app2 understands)
- Actually parsing the URL and writing/reading the output file
So the application’s main ends up looking like this (pseudo-code for demonstrative purposes):
# in app1.py
import json

if __name__ == "__main__":
    accuracy = do_something()
    s3client = ...  # create the s3 client

    # write the result locally in a format app2 understands
    out = {"accuracy": accuracy}
    with open("/tmp/out", "wb") as f:
        f.write(json.dumps(out).encode("utf-8"))

    # upload the local file to the output path passed on the CLI
    s3client.put(args.output_file, "/tmp/out")
# in app2.py
import json

if __name__ == "__main__":
    s3client = ...  # create the s3 client

    # download app1's output from the input path passed on the CLI
    with open("/tmp/out", "wb") as f:
        s3client.get(args.input_file, f)

    # parse the file in the format app1 wrote it in
    with open("/tmp/out", "rb") as f:
        app1_out = json.loads(f.read().decode("utf-8"))

    do_something_else(app1_out["accuracy"])
Instead, a tracker with the same tracker_base can be used across apps to make the return values of one app available to another, without needing to chain the output file path of one app to the input file path of another or to deal with custom serialization and file writing.
# in app1.py
from torchx.runtime.tracking import FsspecResultTracker

if __name__ == "__main__":
    accuracy = do_something()
    tracker = FsspecResultTracker(args.tracker_base)
    tracker["app1_out"] = {"accuracy": accuracy}
# in app2.py
from torchx.runtime.tracking import FsspecResultTracker

if __name__ == "__main__":
    tracker = FsspecResultTracker(args.tracker_base)
    app1_accuracy = tracker["app1_out"]["accuracy"]
    do_something_else(app1_accuracy)
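Both apps only need to be launched with the same tracker_base; the flag name and URI below are illustrative:

$ app1 --tracker-base s3://foo/tracker
$ app2 --tracker-base s3://foo/tracker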
ResultTracker¶
Base¶
- class torchx.runtime.tracking.ResultTracker[source]¶
Base result tracker, which should be sub-classed to implement trackers. Typically there exists a tracker implementation per backing store.
Usage:
# get and put APIs can be used directly or via the map-like API
# the following are equivalent
tracker.put("foo", l2norm=1.2)
tracker["foo"] = {"l2norm": 1.2}

# so are these
tracker.get("foo")["l2norm"] == 1.2
tracker["foo"]["l2norm"] == 1.2
Valid result types are:
- numeric: int, float
- literal: str (1kb size limit when utf-8 encoded)

Valid key types are:
- int
- str
As a convention, “slashes” can be used in the key to store results that are statistical. For instance, to store the mean and sem of l2norm:
tracker[key] = {"l2norm/mean" : 1.2, "l2norm/sem": 3.4} tracker[key]["l2norm/mean"] # returns 1.2 tracker[key]["l2norm/sem"] # returns 3.4
Keys are assumed to be unique within the scope of the tracker’s backing store. For example, if a tracker is backed by a local directory and the key is the file within the directory where the results are saved, then:

# same key, different backing directory -> results are not overwritten
FsspecResultTracker("/tmp/foo")["1"] = {"l2norm": 1.2}
FsspecResultTracker("/tmp/bar")["1"] = {"l2norm": 3.4}
The tracker is NOT a central entity, hence no strong consistency guarantees (beyond what the backing store provides) are made between put and get operations on the same key. Similarly, no strong consistency guarantees are made between two consecutive put or get operations on the same key. For example:

tracker[1] = {"l2norm": 1.2}
tracker[1] = {"l2norm": 3.4}

tracker[1]  # NOT GUARANTEED TO BE 3.4!
sleep(1 * MIN)
tracker[1]  # more likely to be 3.4 but still not guaranteed!
It is STRONGLY advised that a unique id is used as the key. This id is often the job id for simple jobs or can be a concatenation of (experiment_id, trial_number) or (job id, replica/worker rank) for iterative applications like hyper-parameter optimization.
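For instance, a key for one trial of a hyper-parameter optimization experiment could be built by concatenating the experiment id and trial number; the ids and tracker base below are illustrative:

experiment_id = "exp_123"  # illustrative experiment id
trial_number = 7           # illustrative trial number

tracker = FsspecResultTracker("/tmp/tracker")
tracker[f"{experiment_id}/{trial_number}"] = {"l2norm": 1.2}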
Fsspec¶
- class torchx.runtime.tracking.FsspecResultTracker(tracker_base: str)[source]¶
Tracker that uses fsspec under the hood to save results.
Usage:
from torchx.runtime.tracking import FsspecResultTracker

# PUT: in trainer.py
tracker_base = "/tmp/foobar"  # also supports URIs (e.g. "s3://bucket/trainer/123")
tracker = FsspecResultTracker(tracker_base)
tracker["attempt_1/out"] = {"accuracy": 0.233}

# GET: anywhere outside trainer.py
tracker = FsspecResultTracker(tracker_base)
print(tracker["attempt_1/out"]["accuracy"])

0.233