Advanced Usage
======================
TorchX defines plugin points for you to configure TorchX to best support
your infrastructure setup. Most of the configuration is done through
Python's `entry points `__.
.. note::
Entry points requires a python package containing them be installed.
If you don't have a python package we recommend you make one so you can share
your resource definitions, schedulers and components across your team and org.
The entry points described below can be specified in your project's `setup.py`
file as
.. testsetup:: setup
import sys
sys.argv = ["setup.py", "--version"]
.. testcode:: setup
from setuptools import setup
setup(
name="project foobar",
entry_points={
"torchx.schedulers": [
"my_scheduler = my.custom.scheduler:create_scheduler",
],
"torchx.named_resources": [
"gpu_x2 = my_module.resources:gpu_x2",
],
}
)
.. testoutput:: setup
:hide:
0.0.0
Registering Custom Schedulers
--------------------------------
You may implement a custom scheduler by implementing the
.. py::class torchx.schedulers.Scheduler interface.
The ``create_scheduler`` function should have the following function signature:
.. testcode::
from torchx.schedulers import Scheduler
def create_scheduler(session_name: str, **kwargs: object) -> Scheduler:
return MyScheduler(session_name, **kwargs)
You can then register this custom scheduler by adding an entry_points definition
to your python project.
.. testcode::
# setup.py
...
entry_points={
"torchx.schedulers": [
"my_scheduler = my.custom.scheduler:create_schedule",
],
}
Registering Named Resources
-------------------------------
A Named Resource is a set of predefined resource specs that are given a
string name. This is particularly useful
when your cluster has a fixed set of instance types. For instance if your
deep learning training kubernetes cluster on AWS is
comprised only of p3.16xlarge (64 vcpu, 8 gpu, 488GB), then you may want to
enumerate t-shirt sized resource specs for the containers as:
.. testcode:: python
from torchx.specs import Resource
def gpu_x1() -> Resource:
return Resource(cpu=8, gpu=1, memMB=61_000)
def gpu_x2() -> Resource:
return Resource(cpu=16, gpu=2, memMB=122_000)
def gpu_x3() -> Resource:
return Resource(cpu=32, gpu=4, memMB=244_000)
def gpu_x4() -> Resource:
return Resource(cpu=64, gpu=8, memMB=488_000)
.. testcode:: python
:hide:
gpu_x1()
gpu_x2()
gpu_x3()
gpu_x4()
To make these resource definitions available you then need to register them via
entry_points:
.. testcode::
# setup.py
...
entry_points={
"torchx.named_resources": [
"gpu_x2 = my_module.resources:gpu_x2",
],
}
Once you install the package with the entry_points definitions, the named
resource can then be used in the following manner:
.. testsetup:: role
from torchx.specs import _named_resource_factories, Resource
_named_resource_factories["gpu_x2"] = lambda: Resource(cpu=16, gpu=2, memMB=122_000)
.. doctest:: role
>>> from torchx.specs import get_named_resources
>>> get_named_resources("gpu_x2")
Resource(cpu=16, gpu=2, memMB=122000, ...)
.. testcode:: role
# my_module.component
from torchx.specs import AppDef, Role, get_named_resources
def test_app(resource: str) -> AppDef:
return AppDef(name="test_app", roles=[
Role(
name="...",
image="...",
resource=get_named_resources(resource),
)
])
test_app("gpu_x2")
Registering Custom Components
-------------------------------
You can author and register a custom set of components with the
``torchx`` CLI as builtins to the CLI. This makes it possible to customize
a set of components most relevant to your team or organization and support
it as a CLI ``builtin``. This way users will see your custom components
when they run
.. code-block:: shell-session
$ torchx builtins
Custom components can be registered via ``[torchx.components]`` entrypoints.
If ``my_project.bar`` had the following directory structure:
::
$PROJECT_ROOT/my_project/bar/
|- baz.py
And ``baz.py`` had a single component (function) called ``trainer``:
::
# baz.py
import torchx.specs as specs
def trainer(...) -> specs.AppDef: ...
And the entrypoints were added as:
.. testcode::
# setup.py
...
entry_points={
"torchx.components": [
"foo = my_project.bar",
],
}
TorchX will search the module ``my_project.bar`` for all defined components and group the found
components under the ``foo.*`` prefix. In this case, the component ``my_project.bar.baz.trainer``
would be registered with the name ``foo.baz.trainer``.
.. note::
Only python packages (those directories with an ``__init__.py`` file)
are searched for and TorchX makes no attempt to recurse into namespace packages
(directories without a ``__init__.py`` file).
However you may register a top level namespace package.
``torchx`` CLI will display registered components via:
.. code-block:: shell-session
$ torchx builtins
Found 1 builtin components:
1. foo.baz.trainer
The custom component can then be used as:
.. code-block:: shell-session
$ torchx run foo.baz.trainer -- --name "test app"
When you register your own components, TorchX will not include its own builtins. To add TorchX's
builtin components you must specify another entry as:
.. testcode::
# setup.py
...
entry_points={
"torchx.components": [
"foo = my_project.bar",
"torchx = torchx.components",
],
}
This will add back the TorchX builtins but with a ``torchx.*`` component name prefix (e.g. ``torchx.dist.ddp``
versus the default ``dist.ddp``).
If there are two registry entries pointing to the same component, for instance
.. testcode::
# setup.py
...
entry_points={
"torchx.components": [
"foo = my_project.bar",
"test = my_project",
],
}
There will be two sets of overlapping components for those components in ``my_project.bar`` with different
prefix aliases: ``foo.*`` and ``test.bar.*``. Concretely,
.. code-block:: shell-session
$ torchx builtins
Found 2 builtin components:
1. foo.baz.trainer
2. test.bar.baz.trainer
To omit groupings and make the component names shorter, use underscore (e.g ``_`` or ``_0``, ``_1``, etc).
For example:
.. testcode::
# setup.py
...
entry_points={
"torchx.components": [
"_0 = my_project.bar",
"_1 = torchx.components",
],
}
This has the effect of exposing the trainer component as ``baz.trainer`` (as opposed to ``foo.baz.trainer``)
and adds back the builtin components as in the vanilla installation of torchx, without the ``torchx.*`` prefix.
.. code-block:: shell-session
$ torchx builtins
Found 11 builtin components:
1. baz.trainer
2. dist.ddp
3. utils.python
4. ...