Component Best Practices¶

This has a list of common things you might want to do with a component and best practices for them. Components are designed to be flexible so you can deviate from these practices if necessary however these are the best practices we use for the builtin TorchX components.

See App Best Practices for information how to write apps using TorchX.

Entrypoints¶

When possible it’s best to call your reusable component via python -m <module> instead of specifying the path to the main module. This makes it so it can be used in multiple different environments such as docker and slurm by relying on the python module resolution instead of the directory structure.

If your app isn’t python based, you can place your app in a folder on your PATH so it’s accessible regardless of the directory structure.

def trainer(img_name: str, img_version: str) -> AppDef:
    return AppDef(roles=[
        Role(
            entrypoint="python",
            args=[
                "-m",
                "your.app",
            ],
        )
    ])

Simplify¶

When writing a component you want to keep each component as simple as possible to make it easier for others to reuse and understand.

Argument Processing¶

Argument processing makes it hard to use the component in other environments. For images in particular we want to directly pass the image field to the AppDef since any sort of manipulation will make it impossible to use in other environments with different image naming conventions.

def trainer(image: str):
    return AppDef(roles=[Role(image=image)...)

Branching Logic¶

You should avoid branching logic in the components. If you have a case where you feel like you need an if statement in the component you should prefer to create multiple components with shared logic. Complex arguments make it hard for others to understand how to use it.

def trainer_test():
    return _trainer(num_replicas=1)

def trainer_prod() -> AppDef:
    return _trainer(num_replicas=10)

# not a component just a function
def _trainer(num_replicas: int) -> AppDef:
     return AppDef(roles=[Role(..., num_replicas=num_replicas)])

Documentation¶

The documentation is optional, but it is the best practice to keep component functions documented, especially if you want to share your components. See :ref:Component Authoring<components/overview:Authoring> for more details.

Named Resources¶

When writing components it’s best to use TorchX’s named resources support instead of manually specifying cpu and memory allocations. Named resources allow your component to be environment independent and allow for better scheduling behavior by using t-shirt sizes.

See torchx.specs.get_named_resources() for more info.

Composing Components¶

For common component styles we provide base component definitions. These can be called from your custom component definition and an alternative to creating a full AppDef from scratch.

See:

torchx.components.base for simple single node components.
torchx.components.dist.ddp() for distributed components.

For even more complex components it’s possible to merge multiple existing components into a single one. For instance you could use a metrics UI component and merge the roles from it into training component roles to have a sidecar service to your main training job.

Distributed Components¶

If you’re writing a component for distributed training or other similar distributed computation, we recommend using the torchx.components.dist.ddp() component since it provides out of the box support for torch.distributed.elastic jobs.

You can extend the ddp component by writing a custom component that simple imports the ddp component and calls it with your app configuration.

Define All Arguments¶

It’s preferable to define all component arguments as function arguments instead of consuming a dictionary of arguments. This makes it easier for users to figure out the options as well as can provide static typing when used with pyre or mypy.

Unit Tests¶

You can unit test the component definitions as you would normal Python code since they are valid Python definitions.

We do recommend using ComponentTestCase to ensure that your component can be parsed by the TorchX CLI. The CLI requires stricter formatting on the doc string than pure Python as the doc string is used for parsing CLI args.

class torchx.components.component_test_base.ComponentTestCase(methodName='runTest')[source]¶

run_component(component: Callable[[...], AppDef], args: Optional[Dict[str, Any]] = None, scheduler_params: Optional[Dict[str, Any]] = None, scheduler: str = 'local_cwd', interval: float = 0.1, timeout: float = 1) → Optional[AppStatus][source]¶

Helper function that hides complexity of setting up the runner and polling results. Note: method is blocking until either scheduler exits or timeout is reached (for non-blocking schedulers).

Parameters:

components – component function, factory for AppDef
args – optional component factory arguments
scheduler_params – optional parameters for scheduler factory method
scheduler – scheduler name
interval – scheduler comppletion polling interval
timeout – max time for scheduler to complete

setUp() → None[source]¶: Hook method for setting up the test fixture before exercising it.

tearDown() → None[source]¶: Hook method for deconstructing the test fixture after testing it.

validate(module: module, function_name: str) → None[source]¶

Validates the component by effectively running:

$ torchx run COMPONENT.py:FN --help

Integration Tests¶

You can setup integration tests with your components by either using the programmatic runner API or write a bash script to call the CLI.

You can see both styles in use in the core TorchX scheduler integration tests.