Component Best Practices
This page lists common things you might want to do with a component, along with the best practice for each. Components are designed to be flexible, so you can deviate from these practices when necessary; however, these are the practices we follow for the built-in TorchX components.
See App Best Practices for information on how to write apps using TorchX.
When possible, it’s best to call your reusable component via python -m <module> instead of specifying the path to the main module. This lets the component be used in multiple environments, such as Docker and Slurm, because it relies on Python’s module resolution instead of the directory structure.
If your app isn’t Python-based, you can place it in a folder on your PATH so that it’s accessible regardless of the directory structure.
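from torchx.specs import AppDef, Role


def trainer(image: str) -> AppDef:
    return AppDef(
        name="trainer",
        roles=[
            Role(
                name="trainer",
                image=image,
                entrypoint="python",
                # Launch the app as a module so that Python's import system,
                # not the directory layout, locates it.
                args=["-m", "your.app"],
            )
        ],
    )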
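To illustrate why module invocation does not depend on the directory structure, the following sketch runs a module with python -m from an unrelated working directory. It uses only the standard library: json.tool stands in for your own module (your.app above), and the helper name run_module is hypothetical.

```python
import json
import subprocess
import sys
import tempfile


def run_module(module: str, stdin_text: str, cwd: str) -> str:
    """Run `python -m <module>` from `cwd` and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-m", module],
        input=stdin_text,
        capture_output=True,
        text=True,
        cwd=cwd,
        check=True,
    )
    return result.stdout


# The working directory is a fresh temp dir with no relation to where
# json.tool is installed; Python's module resolution still finds it.
with tempfile.TemporaryDirectory() as unrelated_dir:
    output = run_module("json.tool", '{"a": 1}', cwd=unrelated_dir)

parsed = json.loads(output)
print(parsed)
```

A path-based invocation such as python path/to/main.py would instead require the same relative layout in every environment.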
Keep each component as simple as possible so that others can easily understand and reuse it.
Argument processing makes it hard to use the component in other environments. For images in particular, pass the image argument straight through to the Role in the AppDef, since any manipulation makes the component unusable in environments with different image naming conventions.
def trainer(image: str) -> AppDef:
    # Pass the image through unchanged; do not build or rewrite it here.
    return AppDef(roles=[Role(..., image=image)])
You should avoid branching logic in components. If you feel you need an if statement in a component, prefer creating multiple components that share logic instead. Complex arguments make it hard for others to understand how to use the component.
def trainer_test() -> AppDef:
    return _trainer(num_replicas=1)


def trainer_prod() -> AppDef:
    return _trainer(num_replicas=10)


# not a component, just a function
def _trainer(num_replicas: int) -> AppDef:
    return AppDef(roles=[Role(..., num_replicas=num_replicas)])
Documentation is optional, but it is best practice to document component functions, especially if you want to share your components. See Component Authoring for more details.
When writing components it’s best to use TorchX’s named resources support instead of manually specifying cpu and memory allocations. Named resources allow your component to be environment independent and allow for better scheduling behavior by using t-shirt sizes.
See torchx.specs.get_named_resources() for more info.
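The idea behind named resources can be sketched with a small in-memory registry. This is an illustration only, not the TorchX implementation: Resource, NAMED_RESOURCES, get_resource, and the size names and numbers are all hypothetical stand-ins defined here so the example runs without TorchX installed.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Resource:
    """Stand-in resource spec: CPU count, GPU count, memory in MB."""

    cpu: int
    gpu: int
    memMB: int


# Hypothetical t-shirt sizes; each environment would register its own values.
NAMED_RESOURCES: Dict[str, Resource] = {
    "small": Resource(cpu=1, gpu=0, memMB=1024),
    "medium": Resource(cpu=4, gpu=0, memMB=8192),
    "large": Resource(cpu=16, gpu=1, memMB=32768),
}


def get_resource(name: str) -> Resource:
    """Look up a named resource, failing loudly on unknown names."""
    try:
        return NAMED_RESOURCES[name]
    except KeyError:
        known = ", ".join(sorted(NAMED_RESOURCES))
        raise ValueError(f"unknown resource {name!r}; known: {known}") from None


def trainer_resource(size: str = "medium") -> Resource:
    # The component only names a size; the concrete numbers live in the
    # registry, so they can differ per environment without code changes.
    return get_resource(size)


print(trainer_resource())
```

Because the component refers to a size by name, moving it to a cluster with different hardware only requires a different registry, not a different component.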
For common component styles we provide base component definitions. These can be called from your custom component definition and are an alternative to creating a full AppDef from scratch. See:
- torchx.components.base for simple single-node components.
- torchx.components.dist.ddp() for distributed components.
For even more complex components it’s possible to merge multiple existing components into a single one. For instance, you could take a metrics UI component and merge its roles into the training component’s roles to run a sidecar service alongside your main training job.
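A minimal sketch of such a merge follows. To keep it runnable without TorchX, Role and AppDef here are simplified stand-in dataclasses rather than the real torchx.specs classes, and the metrics_ui component and the your.metrics_ui module name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Role:
    """Simplified stand-in for a role: one container/process type."""

    name: str
    image: str
    entrypoint: str = ""
    args: List[str] = field(default_factory=list)


@dataclass
class AppDef:
    """Simplified stand-in for an app definition: a named list of roles."""

    name: str
    roles: List[Role] = field(default_factory=list)


def trainer(image: str) -> AppDef:
    role = Role("trainer", image, "python", ["-m", "your.app"])
    return AppDef(name="trainer", roles=[role])


def metrics_ui(image: str) -> AppDef:
    # Hypothetical sidecar component serving a metrics dashboard.
    role = Role("metrics-ui", image, "python", ["-m", "your.metrics_ui"])
    return AppDef(name="metrics-ui", roles=[role])


def trainer_with_metrics(trainer_image: str, metrics_image: str) -> AppDef:
    """Merge the metrics UI roles into the trainer's roles."""
    train_app = trainer(trainer_image)
    ui_app = metrics_ui(metrics_image)
    return AppDef(name=train_app.name, roles=train_app.roles + ui_app.roles)


app = trainer_with_metrics("trainer:1", "metrics:1")
print([role.name for role in app.roles])
```

The merged component reuses both existing definitions unchanged, so a fix to either one is picked up automatically.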
If you’re writing a component for distributed training or other similar distributed computation, we recommend using the torchx.components.dist.ddp() component, since it provides out-of-the-box support for them.
You can extend the ddp component by writing a custom component that simply imports the ddp component and calls it with your app configuration.
Define All Arguments
It’s preferable to define all component arguments as function arguments instead of consuming a dictionary of arguments. This makes it easier for users to discover the available options and enables static type checking with pyre or mypy.
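The difference in discoverability can be seen by inspecting the two styles. In this sketch both functions return a plain dict in place of an AppDef so that it runs without TorchX; the function names, the lr parameter, and the describe helper are hypothetical.

```python
import inspect
from typing import Any, Callable, Dict


def trainer_explicit(
    image: str, num_replicas: int = 1, lr: float = 0.01
) -> Dict[str, Any]:
    """Preferred: every option is a named, typed function argument."""
    return {"image": image, "num_replicas": num_replicas, "lr": lr}


def trainer_opaque(config: Dict[str, Any]) -> Dict[str, Any]:
    """Discouraged: the accepted keys are invisible to users and tools."""
    return dict(config)


def describe(fn: Callable[..., Any]) -> Dict[str, str]:
    """Map each parameter name to the name of its annotated type."""
    return {
        name: getattr(param.annotation, "__name__", str(param.annotation))
        for name, param in inspect.signature(fn).parameters.items()
    }


print(describe(trainer_explicit))
print(list(describe(trainer_opaque)))
```

The explicit signature exposes each option and its type, while the dictionary version exposes only a single config parameter, so neither a user nor a type checker can tell which keys are valid.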
You can unit test the component definitions as you would normal Python code since they are valid Python definitions.
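As a rough sketch, a plain unittest test for the trainer_test and trainer_prod components shown earlier might look like the following. Role and AppDef are simplified stand-in dataclasses so the example runs without TorchX, and the image parameter is added here for illustration.

```python
import unittest
from dataclasses import dataclass, field
from typing import List


@dataclass
class Role:
    """Simplified stand-in for a role."""

    name: str
    image: str
    num_replicas: int = 1


@dataclass
class AppDef:
    """Simplified stand-in for an app definition."""

    name: str
    roles: List[Role] = field(default_factory=list)


def _trainer(image: str, num_replicas: int) -> AppDef:
    # Shared logic; not a component itself.
    role = Role(name="trainer", image=image, num_replicas=num_replicas)
    return AppDef(name="trainer", roles=[role])


def trainer_test(image: str) -> AppDef:
    return _trainer(image, num_replicas=1)


def trainer_prod(image: str) -> AppDef:
    return _trainer(image, num_replicas=10)


class TrainerComponentTest(unittest.TestCase):
    def test_replica_counts(self) -> None:
        self.assertEqual(trainer_test("img:1").roles[0].num_replicas, 1)
        self.assertEqual(trainer_prod("img:1").roles[0].num_replicas, 10)

    def test_image_is_passed_through_unchanged(self) -> None:
        image = "registry.example/img:1"
        self.assertEqual(trainer_prod(image).roles[0].image, image)
```

Tests like these can be run with python -m unittest, the same as any other Python test module.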
We do recommend using ComponentTestCase to ensure that your component can be parsed by the TorchX CLI. The CLI requires stricter formatting of the docstring than pure Python does, as the docstring is used for parsing CLI arguments.
class torchx.components.component_test_base.ComponentTestCase(methodName='runTest')
    run_component(component: Callable[[...], AppDef], args: Optional[Dict[str, Any]] = None, scheduler_params: Optional[Dict[str, Any]] = None, scheduler: str = 'local_cwd', interval: float = 0.1, timeout: float = 1) -> Optional[AppStatus]
    Helper function that hides the complexity of setting up the runner and polling for results. Note: this method blocks until either the scheduler exits or the timeout is reached (for non-blocking schedulers).
    Parameters:
    - component – component function, factory for AppDef
    - args – optional component factory arguments
    - scheduler_params – optional parameters for the scheduler factory method
    - scheduler – scheduler name
    - interval – scheduler completion polling interval
    - timeout – max time for the scheduler to complete
You can set up integration tests for your components either by using the programmatic runner API or by writing a bash script that calls the CLI.
You can see both styles in use in the core TorchX scheduler integration tests.