This page describes the high-level concepts behind TorchX and the project structure. To learn how to create and run an app, check out the Quickstart Guide.
The top level modules in TorchX are:
torchx.specs: application spec (job definition) APIs
torchx.components: predefined (builtin) app specs
torchx.workspace: handles patching images for remote execution
torchx.cli: CLI tool
torchx.runner: given an app spec, submits the app as a job on a scheduler
torchx.schedulers: backend job schedulers that the runner supports
torchx.pipelines: adapters that convert the given app spec to a “stage” in an ML pipeline platform
torchx.runtime: util and abstraction libraries you can use in authoring apps (not app spec)
Below is a UML diagram
In TorchX an AppDef is simply a struct with the definition of
the actual application. In scheduler lingo, this is a JobDefinition, and a
similar concept in Kubernetes is the spec.yaml. To disambiguate between the
application binary (logic) and the spec, we typically refer to a TorchX
AppDef as an “app spec” or specs.AppDef. It is the common interface understood by
torchx.runner and torchx.pipelines, allowing you to run your app as a standalone job
or as a stage in an ML pipeline.
Below is a simple example of a
specs.AppDef that echoes “hello world”:
import torchx.specs as specs
As you can see, specs.AppDef is a pure Python dataclass that
simply encodes the name of the main binary (entrypoint), the arguments to
pass to it, and a few other runtime parameters, such as
information about the container in which to run.
The app spec is flexible and can encode specs for a variety of app topologies.
For example, num_replicas > 1 means that the application is distributed.
Specifying multiple specs.Roles makes it possible to represent a
non-homogeneous distributed application, such as those that require a single
“coordinator” and many “workers”.
See the torchx.specs API Docs to learn more.
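The topologies described above can be sketched with plain dataclasses. This is a minimal illustration, not the torchx.specs API: the Role and AppDef classes below are hypothetical stand-ins that mirror only a few fields, and the image name and helper functions are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-ins for illustration only; NOT the torchx.specs classes.
@dataclass
class Role:
    name: str
    image: str
    entrypoint: str
    args: List[str] = field(default_factory=list)
    num_replicas: int = 1


@dataclass
class AppDef:
    name: str
    roles: List[Role] = field(default_factory=list)


def total_replicas(app: AppDef) -> int:
    """Count the replicas across all roles of the app."""
    return sum(role.num_replicas for role in app.roles)


def is_distributed(app: AppDef) -> bool:
    """An app with more than one replica in total is distributed."""
    return total_replicas(app) > 1


# Single role, single replica: a plain (non-distributed) app.
echo_app = AppDef(
    name="echo",
    roles=[
        Role(
            name="echo",
            image="example/image:latest",  # placeholder image name
            entrypoint="/bin/echo",
            args=["hello world"],
        )
    ],
)

# Two roles with different replica counts: a non-homogeneous distributed app.
trainer_app = AppDef(
    name="trainer",
    roles=[
        Role(name="coordinator", image="example/image:latest",
             entrypoint="python", args=["coordinator.py"], num_replicas=1),
        Role(name="worker", image="example/image:latest",
             entrypoint="python", args=["worker.py"], num_replicas=4),
    ],
)
```

Keeping each role as its own record is what lets one spec describe both the single-binary case and the coordinator/worker case without changing shape.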
The flexibility of app specs comes at a cost: they have many fields. The good
news is that in most cases you don’t have to build an app spec from scratch.
Rather, you would use a templatized app spec called a component.
A component in TorchX is simply a templatized
specs.AppDef. You can
think of components as convenient “factory methods” for specs.AppDef.
Unlike applications, components don’t map to an actual Python dataclass.
Rather, a factory function that returns a specs.AppDef
is called a component.
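The "factory function" idea can be sketched as follows. This is only an illustration under stated assumptions: Role and AppDef are hypothetical stand-in dataclasses rather than the torchx.specs classes, and the echo function, its parameters, and the image name are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-ins for illustration only; NOT the torchx.specs classes.
@dataclass
class Role:
    name: str
    image: str
    entrypoint: str
    args: List[str] = field(default_factory=list)
    num_replicas: int = 1


@dataclass
class AppDef:
    name: str
    roles: List[Role] = field(default_factory=list)


def echo(msg: str = "hello world", image: str = "example/image:latest") -> AppDef:
    """A ready-to-run component: the binary (/bin/echo) is hardcoded,
    and only the message and image are left as parameters."""
    return AppDef(
        name="echo",
        roles=[Role(name="echo", image=image, entrypoint="/bin/echo", args=[msg])],
    )


# Calling the component produces a fully populated app spec.
app = echo("hi")
```

The component itself carries no state; it is just a function whose return value is the spec.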
The granularity at which the app spec is templatized varies. Some components,
such as the
echo example above, are ready-to-run, meaning that they
have hardcoded application binaries. Others, such as
ddp (distributed data parallel)
specs, only specify the topology of the application. Below is one possible templatization
of a ddp style trainer app spec that specifies a homogeneous node topology:
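import torchx.specs as specs

def ddp(jobname: str, nnodes: int, image: str, entrypoint: str, *script_args: str) -> specs.AppDef:
    # A single role replicated nnodes times gives a homogeneous node topology.
    single_gpu = specs.Resource(cpu=4, gpu=1, memMB=1024)
    trainer = specs.Role(name="trainer", image=image, entrypoint=entrypoint, args=list(script_args), num_replicas=nnodes, resource=single_gpu)
    return specs.AppDef(name=jobname, roles=[trainer])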
As you can see, the level of parameterization is completely up to the component author, and the effort of creating a component is no more than writing a Python function. Don’t try to over-generalize components by parameterizing everything. Components are easy and cheap to create, so create as many as you want for your repetitive use cases.
PROTIP 1: Since components are Python functions, component composition can be achieved through Python function composition rather than object composition. However, we do not recommend component composition, for maintainability reasons.
PROTIP 2: To define dependencies between components, use a pipelining DSL. See the Pipeline Adapters section below to understand how TorchX components are used in the context of pipelines.
Before authoring your own component, browse through the library of Components that are included with TorchX to see if one fits your needs.
Runner and Schedulers
The runner does exactly what you would expect: given an app spec, it
launches the application as a job onto a cluster through a job scheduler.
There are two ways to access runners in TorchX: through the CLI (torchx.cli) or programmatically through the torchx.runner module. With the CLI:
torchx run ~/app_spec.py
See Schedulers for a list of schedulers that the runner can launch apps to.
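The relationship between a runner and its schedulers can be sketched conceptually as a dispatcher over pluggable backends. This is not the torchx.runner API: AppSpec, Scheduler, RecordingScheduler, Runner, and the handle format are all hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


# Hypothetical minimal app spec; NOT the torchx.specs.AppDef class.
@dataclass
class AppSpec:
    name: str
    entrypoint: str
    args: List[str] = field(default_factory=list)


class Scheduler(Protocol):
    """Anything that can accept an app spec and return a job id."""

    def submit(self, app: AppSpec) -> str:
        ...


class RecordingScheduler:
    """Toy backend that records submissions instead of launching jobs."""

    def __init__(self) -> None:
        self.submitted: List[AppSpec] = []

    def submit(self, app: AppSpec) -> str:
        self.submitted.append(app)
        return f"{app.name}-{len(self.submitted)}"


class Runner:
    """Looks up the requested scheduler and submits the app spec to it."""

    def __init__(self, schedulers: Dict[str, Scheduler]) -> None:
        self._schedulers = schedulers

    def run(self, app: AppSpec, scheduler: str) -> str:
        if scheduler not in self._schedulers:
            raise ValueError(f"unknown scheduler: {scheduler!r}")
        job_id = self._schedulers[scheduler].submit(app)
        # The handle format below is an assumption made for this sketch.
        return f"{scheduler}://{job_id}"


local = RecordingScheduler()
runner = Runner({"local": local})
handle = runner.run(AppSpec("echo", "/bin/echo", ["hello world"]), scheduler="local")
```

Because the runner only depends on the submit interface, the same app spec can be sent to any registered backend without modification.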
While runners launch components as standalone jobs, torchx.pipelines
makes it possible to plug components into an ML pipeline/workflow. For a
specific target pipeline platform (e.g. Kubeflow Pipelines), TorchX
defines an adapter that converts a TorchX app spec to whatever the
“stage” representation is in the target platform. For instance, the
torchx.pipelines.kfp adapter for Kubeflow Pipelines converts an
app spec to a
kfp.ContainerOp (or more accurately, a kfp “component spec” yaml).
In most cases an app spec maps to a “stage” (or node) in a pipeline. However, advanced components, especially those that have a mini control flow of their own (e.g. HPO), may map to a “sub-pipeline” or an “inline-pipeline”. The exact semantics of how these advanced components map to the pipeline depend on the target pipeline platform. For example, if the pipeline DSL allows dynamically adding stages to a pipeline from an upstream stage, then TorchX may take advantage of that feature to “inline” the sub-pipeline into the main pipeline. TorchX generally tries its best to adapt app specs to the most canonical representation in the target pipeline platform.
See Pipelines for a list of supported pipeline platforms.
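The adapter idea described in this section can be sketched as a function that converts an app spec into a platform-specific stage. The sketch below does not use torchx.pipelines or kfp: AppSpec, to_stage, to_linear_pipeline, and the dictionary-based stage format are all hypothetical, standing in for whatever the target platform expects.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


# Hypothetical minimal app spec; NOT the torchx.specs.AppDef class.
@dataclass
class AppSpec:
    name: str
    image: str
    entrypoint: str
    args: List[str] = field(default_factory=list)


def to_stage(app: AppSpec) -> Dict[str, Any]:
    """Adapter: convert one app spec into one stage of a made-up pipeline format."""
    return {
        "name": app.name,
        "image": app.image,
        "command": [app.entrypoint, *app.args],
        "depends_on": [],
    }


def to_linear_pipeline(apps: List[AppSpec]) -> List[Dict[str, Any]]:
    """Chain the stages so that each one depends on the stage before it."""
    stages = [to_stage(app) for app in apps]
    for previous, current in zip(stages, stages[1:]):
        current["depends_on"].append(previous["name"])
    return stages


pipeline = to_linear_pipeline(
    [
        AppSpec("preprocess", "example/image:latest", "python", ["preprocess.py"]),
        AppSpec("train", "example/image:latest", "python", ["train.py"]),
    ]
)
```

Note that the dependency between stages lives in the pipeline layer, not in the app specs themselves, which is why the same spec can also be launched as a standalone job.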
torchx.runtime is by no means a requirement to use TorchX.
If your infrastructure is fixed and you don’t need your application
to be portable across different types of schedulers and pipelines,
you can skip this section.
Your application (not the app spec, but the actual app binary) has ZERO dependencies
on TorchX (e.g. /bin/echo does not use TorchX, but an app spec
can be created for it).
torchx.runtime is the ONLY module you should be using when
authoring your application binary!
However, because TorchX essentially allows your app to run anywhere, it is recommended that your application be written in a scheduler/infrastructure-agnostic fashion.
This typically means adding an API layer at the touch-points with the scheduler/infra. For example, the following application is NOT infra agnostic:
import boto3
import torch

def main(input_path: str):
    # Assumes input_path has the form "<bucket>/<key>" on AWS S3.
    s3 = boto3.session.Session().client("s3")
    path = input_path.split("/")
    bucket = path[0]
    key = "/".join(path[1:])
    s3.download_file(bucket, key, "/tmp/input")
    input = torch.load("/tmp/input")
    # ...<rest of code omitted for brevity>...
The binary above makes an implicit assumption that the input_path
is an AWS S3 path. One way to make this trainer storage agnostic is to introduce a
FileSystem abstraction layer. For file systems, frameworks like
PyTorch Lightning already define such
layers (Lightning uses fsspec
under the hood). The binary above can be rewritten to be storage agnostic with Lightning's io utilities:
import pytorch_lightning.utilities.io as io
import torch

def main(input_url: str):
    # The filesystem is selected from the URL, so no storage backend is hardcoded.
    fs = io.get_filesystem(input_url)
    with fs.open(input_url, "rb") as f:
        input = torch.load(f)
    # ...<rest of code omitted for brevity>...
Now main can be called with URLs that point to different storage backends,
making it compatible with input stored in various storage systems.
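The same idea of an API layer at the storage touch-point can be sketched with only the standard library. This is an illustration, not how Lightning or fsspec are implemented: open_url, the opener registry, and the in-memory "memory" scheme are hypothetical names introduced for the example.

```python
import io
import tempfile
from typing import BinaryIO, Callable, Dict
from urllib.parse import urlparse

# Toy in-memory "storage" keyed by "<bucket>/<key>"; stands in for a remote store.
_MEMORY_STORE: Dict[str, bytes] = {}


def _open_file(url: str) -> BinaryIO:
    """Open a local file given a file:// URL or a plain path."""
    return open(urlparse(url).path, "rb")


def _open_memory(url: str) -> BinaryIO:
    """Open an object from the toy in-memory store given a memory:// URL."""
    parsed = urlparse(url)
    return io.BytesIO(_MEMORY_STORE[parsed.netloc + parsed.path])


# Registry mapping URL schemes to openers; new backends plug in here.
_OPENERS: Dict[str, Callable[[str], BinaryIO]] = {
    "file": _open_file,
    "memory": _open_memory,
}


def open_url(url: str) -> BinaryIO:
    """Pick the opener from the URL scheme (defaulting to local files)."""
    scheme = urlparse(url).scheme or "file"
    if scheme not in _OPENERS:
        raise ValueError(f"unsupported storage scheme: {scheme!r}")
    return _OPENERS[scheme](url)


def main(input_url: str) -> bytes:
    """The application only talks to open_url, never to a specific backend."""
    with open_url(input_url) as f:
        return f.read()


# The same main() works against two different storage backends.
_MEMORY_STORE["bucket/data.bin"] = b"from memory"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"from disk")
memory_result = main("memory://bucket/data.bin")
file_result = main("file://" + tmp.name)
```

Supporting another backend means registering one more opener; main itself stays unchanged, which is the point of placing the abstraction at the touch-point.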
In the case of FileSystem, there were existing libraries defining the file system abstraction.
In torchx.runtime, you’ll find libraries or pointers to other libraries
that provide abstractions for various functionalities that you may need to author
an infra-agnostic application. Ideally, features in
torchx.runtime are upstreamed
in a timely fashion to libraries such as Lightning that are intended to be used to
author your application. But finding a proper permanent home for these abstractions
may take time or even require an entirely new OSS project to be created.
Until this happens, the features can mature and be accessible to users through torchx.runtime.
Check out the Quickstart Guide to learn how to create and run components.