CLI¶
The torchx CLI is a command-line tool around torchx.runner.Runner. It allows users to launch torchx.specs.AppDef directly onto one of the supported schedulers without authoring a pipeline (aka workflow). This is convenient for quickly iterating on application logic without incurring the technical and cognitive overhead of learning, writing, and dealing with pipelines.
Note
When in doubt, use torchx --help.
Listing the builtin components¶
Most of the components under the torchx.components module are what the CLI considers "built-in" apps. Before you write your own component, browse through the builtins to see if one already fits your needs. If so, there is no need to even author an app spec!
$ torchx builtins
Found <n> builtin configs:
1. metrics.tensorboard
2. serve.torchserve
3. utils.binary
... <omitted for brevity>
Listing the supported schedulers and arguments¶
To get a list of supported schedulers that you can launch your job into, run:
$ torchx runopts
local_docker:
{ 'log_dir': { 'default': 'None',
'help': 'dir to write stdout/stderr log files of replicas',
'type': 'str'}}
local_cwd:
...
slurm:
...
kubernetes:
...
Running a component as a job¶
The run subcommand takes one of:
name of the builtin

$ torchx run --scheduler <sched_name> utils.echo

full python module path of the component function

$ torchx run --scheduler <sched_name> torchx.components.utils.echo

file path of the *.py file that defines the component, along with the component function name in that file

$ cat ~/my_trainer_spec.py
import torchx.specs as specs

def my_trainer(foo: int, bar: str) -> specs.AppDef:
    <...spec file details omitted for brevity...>

$ torchx run --scheduler <sched_name> ~/my_trainer_spec.py:my_trainer
Now that you know how to choose which app to launch, it is time to see what parameters need to be passed. There are three sets of parameters:
arguments to the run subcommand; see a list of them by running:

$ torchx run --help
arguments to the scheduler (--scheduler_args, also known as run_options or run_configs); each scheduler takes different args. To find out the args for a specific scheduler, run torchx runopts <sched_name> (command for the local_cwd scheduler shown below):

$ torchx runopts local_cwd
{ 'log_dir': { 'default': 'None',
               'help': 'dir to write stdout/stderr log files of replicas',
               'type': 'str'}}

# pass run options as comma-delimited k=v pairs
$ torchx run --scheduler local_cwd --scheduler_args log_dir=/tmp ...
arguments to the component (the app args are included here); these depend on the component and can be seen by passing --help to the component:

$ torchx run --scheduler local_cwd utils.echo --help
usage: torchx run echo.torchx [-h] [--msg MSG]

Echos a message

optional arguments:
  -h, --help  show this help message and exit
  --msg MSG   Message to echo
Putting everything together, running echo with the local_cwd scheduler:
$ torchx run --scheduler local_cwd --scheduler_args log_dir=/tmp utils.echo --msg "hello $USER"
=== RUN RESULT ===
torchx 2022-06-15 16:08:57 INFO Log files located in: /tmp/torchx/echo-crls3hcpwjmhc/echo/0
local_cwd://torchx/echo-crls3hcpwjmhc
By default the run subcommand does not block until the job finishes; instead it simply schedules the job on the specified scheduler and prints an app handle, which is a URL of the form $scheduler_name://torchx/$job_id. Take note of this handle, since it is what you'll need to provide to other subcommands to identify your job.
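Because the CLI is a thin wrapper around torchx.runner.Runner, the same launch can be done programmatically. Below is a minimal sketch assuming the runner API of recent torchx releases (get_runner, Runner.run, Runner.wait); verify against your installed version:

from torchx.runner import get_runner
from torchx.components.utils import echo

runner = get_runner()
app = echo(msg="hello world")  # component functions return a specs.AppDef
app_handle = runner.run(app, scheduler="local_cwd", cfg={"log_dir": "/tmp"})
print(app_handle)  # e.g. local_cwd://torchx/echo-<job_id>

# unlike the CLI, the runner lets you block until the job finishes
print(runner.wait(app_handle))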
Note
If the --scheduler option is not provided, it defaults to the scheduler backend default, which points to local_cwd. To change the default scheduler, see: Registering Custom Schedulers.
Inspecting what will run (dryrun)¶
When you are authoring or debugging a component, you may want to inspect both the scheduler request object that the runner submits and the AppDef object that is created by the component function. To do this, use the --dryrun option of the run subcommand:
$ torchx run --dryrun utils.echo --msg hello_world
=== APPLICATION ===
{ 'metadata': {},
'name': 'echo',
'roles': [ { 'args': ['hello_world'],
'entrypoint': '/bin/echo',
'env': {},
'image': '/tmp',
'max_retries': 0,
'name': 'echo',
'num_replicas': 1,
'port_map': {},
'resource': { 'capabilities': {},
'cpu': -1,
'gpu': -1,
'memMB': -1},
'retry_policy': <RetryPolicy.APPLICATION: 'APPLICATION'>}]}
=== SCHEDULER REQUEST ===
PopenRequest(
app_id='echo_c944ffb2',
log_dir='/tmp/torchx_asmtmyqj/torchx_kiuk/echo_c944ffb2',
role_params={
'echo': [
ReplicaParam(
args=['/bin/echo', 'hello_world'],
env={'TORCHELASTIC_ERROR_FILE': '/tmp/torchx_asmtmyqj/torchx_kiuk/echo_c944ffb2/echo/0/error.json'},
stdout=None,
stderr=None)
]
},
role_log_dirs={'echo': ['/tmp/torchx_asmtmyqj/torchx_kiuk/echo_c944ffb2/echo/0']})
Note
The scheduler request printout will look different depending on the scheduler type. The example above is a faux request since the scheduler is a local scheduler, which simply uses subprocess.Popen to simulate replicas as POSIX processes. Nevertheless, the scheduler request offers valuable insight into how the runner translates the AppDef for a particular scheduler backend.
Describing and querying the status of a job¶
The describe subcommand is essentially the inverse of the run command. That is, it prints the app spec given an app_handle.
$ torchx describe <app handle>
Important
The describe command attempts to recreate the app spec by querying the scheduler for the job description, so what you see printed is not always 100% the exact app spec that was given to run. The extent to which the runner can recreate the app spec depends on numerous factors, such as how descriptive the scheduler's describe_job API is, as well as whether fields in the app spec were ignored when submitting the job because the scheduler does not support such a parameter/functionality. NEVER rely on the describe API as a storage function for your app spec; it is simply there to help you spot-check things.
To get the status of a running job:
$ torchx status <app_handle> # prints status for all replicas and roles
$ torchx status --role trainer <app_handle> # filters it down to the trainer role
Filtering by --role is useful for large jobs that have multiple roles.
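Programmatically, these subcommands map to the runner's describe and status methods. A sketch (same API assumptions as above; the handle below is the example one printed earlier):

from torchx.runner import get_runner

runner = get_runner()
app_handle = "local_cwd://torchx/echo-crls3hcpwjmhc"  # handle printed by `run`

print(runner.describe(app_handle))  # best-effort reconstruction of the AppDef
print(runner.status(app_handle))    # current status of the job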
Viewing Logs¶
Note
This functionality depends on how long your scheduler setup retains logs. TorchX DOES NOT archive logs on your behalf; rather, it relies on the scheduler's get_log API to obtain the logs. Refer to your scheduler's user manual to set up log retention properly.
The log subcommand is a simple wrapper around the scheduler's get_log API that lets you pull/print the logs for all replicas and roles from one place. It also lets you pull replica- or role-specific logs. Below are a few useful, self-explanatory log access patterns:
$ torchx log <app_handle>
Prints all logs from all replicas and roles (each log line is prefixed with the role name and replica id)

$ torchx log --tail <app_handle>
If the job is still running, tails the logs

$ torchx log --regex ".*Exception.*" <app_handle>
Applies a regex filter for exceptions

$ torchx log <app_handle>/<role>
$ torchx log <app_handle>/trainer
Pulls all logs for the role trainer

$ torchx log <app_handle>/<role_name>/<replica_id>
$ torchx log <app_handle>/trainer/0,1
Only pulls logs for trainer 0 and trainer 1 (node, not rank)
Note
Some schedulers do not support server-side regex filters. In that case, the regex filter is applied client-side, meaning the full logs have to be passed through to the client. This can be very taxing on the local host, so use your best judgment when using the logs API.
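Under the hood, the log subcommand uses the runner's log_lines method, which can also be called directly. A sketch; the exact keyword arguments shown (role_name, k, regex) are assumptions based on recent torchx releases:

from torchx.runner import get_runner

runner = get_runner()
app_handle = "local_cwd://torchx/echo-crls3hcpwjmhc"  # handle printed by `run`

# pull logs for replica 0 of the "echo" role, filtered for exceptions
for line in runner.log_lines(app_handle, role_name="echo", k=0, regex=".*Exception.*"):
    print(line)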
Listing Jobs¶
The list subcommand lets you list the app handles and statuses of apps launched on a scheduler. You can then use the app handles to further describe the app, view the logs, etc.
$ torchx list -s <scheduler_name>
$ torchx list -s kubernetes
APP HANDLE APP STATUS
-------------------------------------------------- ------------
kubernetes://torchx/default:trainer-a5qvfhe1hyq2w SUCCEEDED
kubernetes://torchx/default:trainer-d796ei2tdtest SUCCEEDED
kubernetes://torchx/default:trainer-em0iao2m90000 FAILED
kubernetes://torchx/default:trainer-ew33oxmdg0123 SUCCEEDED
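If your torchx version exposes a corresponding runner method (recent releases have Runner.list; treat this as an assumption to verify against your installed version), the same listing can be done programmatically:

from torchx.runner import get_runner

runner = get_runner()
# list apps launched on the kubernetes scheduler (example scheduler choice)
for app in runner.list(scheduler="kubernetes"):
    print(app)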