Advanced Usage

TorchX defines plugin points for you to configure TorchX to best support your infrastructure setup. Most of the configuration is done through Python’s entry points.

Note

Entry points requires a python package containing them be installed. If you don’t have a python package we recommend you make one so you can share your resource definitions, schedulers and components across your team and org.

The entry points described below can be specified in your project’s setup.py file as

from setuptools import setup

setup(
    name="project foobar",
    entry_points={
        "torchx.schedulers": [
            "my_scheduler = my.custom.scheduler:create_scheduler",
        ],
        "torchx.named_resources": [
            "gpu_x2 = my_module.resources:gpu_x2",
        ],
    }
)

Registering Custom Schedulers

You may implement a custom scheduler by implementing the .. py::class torchx.schedulers.Scheduler interface.

The create_scheduler function should have the following function signature:

from torchx.schedulers import Scheduler

def create_scheduler(session_name: str, **kwargs: object) -> Scheduler:
    return MyScheduler(session_name, **kwargs)

You can then register this custom scheduler by adding an entry_points definition to your python project.

# setup.py
...
entry_points={
    "torchx.schedulers": [
        "my_scheduler = my.custom.scheduler:create_schedule",
    ],
}

Registering Named Resources

A Named Resource is a set of predefined resource specs that are given a string name. This is particularly useful when your cluster has a fixed set of instance types. For instance if your deep learning training kubernetes cluster on AWS is comprised only of p3.16xlarge (64 vcpu, 8 gpu, 488GB), then you may want to enumerate t-shirt sized resource specs for the containers as:

from torchx.specs import Resource

def gpu_x1() -> Resource:
    return Resource(cpu=8,  gpu=1, memMB=61_000)

def gpu_x2() -> Resource:
    return Resource(cpu=16, gpu=2, memMB=122_000)

def gpu_x3() -> Resource:
    return Resource(cpu=32, gpu=4, memMB=244_000)

def gpu_x4() -> Resource:
    return Resource(cpu=64, gpu=8, memMB=488_000)

To make these resource definitions available you then need to register them via entry_points:

# setup.py
...
entry_points={
    "torchx.named_resources": [
        "gpu_x2 = my_module.resources:gpu_x2",
    ],
}

Once you install the package with the entry_points definitions, the named resource can then be used in the following manner:

>>> from torchx.specs import get_named_resources
>>> get_named_resources("gpu_x2")
Resource(cpu=16, gpu=2, memMB=122000, ...)

# my_module.component
from torchx.specs import AppDef, Role, get_named_resources

def test_app(resource: str) -> AppDef:
    return AppDef(name="test_app", roles=[
        Role(
            name="...",
            image="...",
            resource=get_named_resources(resource),
        )
    ])

test_app("gpu_x2")

Registering Custom Components

You can author and register a custom set of components with the torchx CLI as builtins to the CLI. This makes it possible to customize a set of components most relevant to your team or organization and support it as a CLI builtin. This way users will see your custom components when they run

$ torchx builtins

Custom components can be registered via [torchx.components] entrypoints. If my_project.bar had the following directory structure:

$PROJECT_ROOT/my_project/bar/
    |- baz.py

And baz.py had a single component (function) called trainer:

# baz.py
import torchx.specs as specs

def trainer(...) -> specs.AppDef: ...

And the entrypoints were added as:

# setup.py
...
entry_points={
    "torchx.components": [
        "foo = my_project.bar",
    ],
}

TorchX will search the module my_project.bar for all defined components and group the found components under the foo.* prefix. In this case, the component my_project.bar.baz.trainer would be registered with the name foo.baz.trainer.

Note

Only python packages (those directories with an __init__.py file) are searched for and TorchX makes no attempt to recurse into namespace packages (directories without a __init__.py file). However you may register a top level namespace package.

torchx CLI will display registered components via:

$ torchx builtins
Found 1 builtin components:
1. foo.baz.trainer

The custom component can then be used as:

$ torchx run foo.baz.trainer -- --name "test app"

When you register your own components, TorchX will not include its own builtins. To add TorchX’s builtin components you must specify another entry as:

# setup.py
...
entry_points={
    "torchx.components": [
        "foo = my_project.bar",
        "torchx = torchx.components",
    ],
}

This will add back the TorchX builtins but with a torchx.* component name prefix (e.g. torchx.dist.ddp versus the default dist.ddp).

If there are two registry entries pointing to the same component, for instance

# setup.py
...
entry_points={
    "torchx.components": [
        "foo = my_project.bar",
        "test = my_project",
    ],
}

There will be two sets of overlapping components for those components in my_project.bar with different prefix aliases: foo.* and test.bar.*. Concretely,

$ torchx builtins
Found 2 builtin components:
1. foo.baz.trainer
2. test.bar.baz.trainer

To omit groupings and make the component names shorter, use underscore (e.g _ or _0, _1, etc). For example:

# setup.py
...
entry_points={
    "torchx.components": [
        "_0 = my_project.bar",
        "_1 = torchx.components",
    ],
}

This has the effect of exposing the trainer component as baz.trainer (as opposed to foo.baz.trainer) and adds back the builtin components as in the vanilla installation of torchx, without the torchx.* prefix.

$ torchx builtins
Found 11 builtin components:
1. baz.trainer
2. dist.ddp
3. utils.python
4. ... <more builtins from torchx.components.* ...>

Advanced Usage

Registering Custom Schedulers

Registering Named Resources

Registering Custom Components

Docs

Tutorials

Resources