# Quickstart

This is a self-contained guide on how to write a simple app and start launching distributed jobs on local and remote clusters.

## Installation

The first thing we need to do is install the TorchX python package, which includes the CLI and the library.

```sh
# install torchx with all dependencies
$ pip install torchx[dev]
```

See the [README](https://github.com/pytorch/torchx) for more information on installation.

```sh
torchx --help
```

## Hello World

Let's start off by writing a simple "Hello World" python app. This is just a normal python program and can contain anything you'd like.
> **Note:** This example uses the Jupyter Notebook `%%writefile` magic to create local files for example purposes. Under normal usage you would have these as standalone files.
```python
%%writefile my_app.py

import sys

print(f"Hello, {sys.argv[1]}!")
```

## Launching

We can execute our app via `torchx run`. The `local_cwd` scheduler executes the app relative to the current directory. For this we'll use the `utils.python` component:

```sh
torchx run --scheduler local_cwd utils.python --help
```

The component takes in the script name; any extra arguments are passed to the script itself.

```sh
torchx run --scheduler local_cwd utils.python --script my_app.py "your name"
```

We can run the exact same app via the `local_docker` scheduler. This scheduler packages up the local workspace as a layer on top of the specified image, which provides a very similar environment to the container-based remote schedulers.
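To make the "layer on top of the specified image" idea concrete, here is a rough hand-written equivalent of the overlay that TorchX builds for you. This is a sketch, not the literal command TorchX runs, and the `:latest` tag and `torchx_patched` name are assumptions:

```sh
# Illustrative only: the Docker workspace patching roughly amounts to
# building a small overlay image from the base image plus the contents
# of your current directory.
docker build -t torchx_patched -f - . <<'EOF'
FROM ghcr.io/pytorch/torchx:latest
COPY . .
EOF
```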
> **Note:** Running with the `local_docker` scheduler requires Docker to be installed and won't work in environments such as Google Colab. See the [Docker install instructions](https://docs.docker.com/get-docker/).
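If you're not sure whether the daemon is available, a quick sanity check before launching:

```sh
# prints daemon details; exits with a non-zero status if the
# Docker daemon isn't reachable
docker info
```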
With Docker running, the same app launches under the Docker scheduler:

```sh
torchx run --scheduler local_docker utils.python --script my_app.py "your name"
```

TorchX defaults to using the [ghcr.io/pytorch/torchx](https://ghcr.io/pytorch/torchx) Docker container image, which contains the PyTorch libraries, TorchX and related dependencies.

## Distributed

TorchX's `dist.ddp` component uses [TorchElastic](https://pytorch.org/docs/stable/distributed.elastic.html) to manage the workers. This means you can launch multi-worker and multi-host jobs out of the box on all of the schedulers we support.

```sh
torchx run --scheduler local_docker dist.ddp --help
```

Let's create a slightly more interesting app to leverage the TorchX distributed support.

```python
%%writefile dist_app.py

import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
print(f"I am worker {dist.get_rank()} of {dist.get_world_size()}!")

a = torch.tensor([dist.get_rank()])
dist.all_reduce(a)
print(f"all_reduce output = {a}")
```

Let's launch a small job with 2 nodes and 2 worker processes per node:

```sh
torchx run --scheduler local_docker dist.ddp -j 2x2 --script dist_app.py
```

Since `all_reduce` defaults to summing, each of the 4 workers (ranks 0 through 3) should print `all_reduce output = tensor([6])`.

## Workspaces / Patching

For each scheduler there's a concept of an `image`. For `local_cwd` and `slurm` the image is the current working directory; for container-based schedulers such as `local_docker`, `kubernetes` and `aws_batch` it's a Docker container.

To provide the same environment between local and remote jobs, the TorchX CLI uses workspaces to automatically patch images for remote jobs on a per-scheduler basis. When you launch a job via `torchx run`, TorchX overlays the current directory on top of the provided image so your code is available in the launched job.

For Docker-based schedulers you'll need a local Docker daemon to build and push the image to your remote Docker repository.

## `.torchxconfig`

Arguments to schedulers can be specified either on the command line via `torchx run -s <scheduler> -c <key=value,...>` or on a per-scheduler basis via a `.torchxconfig` file.

```python
%%writefile .torchxconfig

[kubernetes]
queue=torchx
image_repo=<your docker image repository>

[slurm]
partition=torchx
```

## Remote Schedulers

TorchX supports a large number of schedulers. Don't see yours? [Request it!](https://github.com/pytorch/torchx/issues/new?assignees=&labels=&template=feature-request.md)

Remote schedulers operate the exact same way the local schedulers do: the same run command that works locally works out of the box on a remote cluster.

```sh
$ torchx run --scheduler slurm dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler kubernetes dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler aws_batch dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler ray dist.ddp -j 2x2 --script dist_app.py
```

Depending on the scheduler there may be a few extra configuration parameters so TorchX knows where to run the job and upload built images. These can either be set via `-c` or in the `.torchxconfig` file.

To see all config options:

```sh
torchx runopts
```

## Custom Images

### Docker-based Schedulers

If you want more than the standard PyTorch libraries you can add a custom Dockerfile, or build your own Docker container and use it as the base image for your TorchX jobs.

```python
%%writefile timm_app.py

import timm

print(timm.models.resnet18())
```

```python
%%writefile Dockerfile.torchx

FROM pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime
RUN pip install timm
COPY . .
```
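If you'd rather build and manage the container yourself (the second option mentioned above), you can build it from this Dockerfile and point the component at it explicitly via its `--image` argument. A minimal sketch, where `my-timm-app:latest` is a hypothetical tag to substitute with your own:

```sh
# build the custom image yourself from the Dockerfile above
docker build -t my-timm-app:latest -f Dockerfile.torchx .
# launch using the custom image as the base
# (workspace patching may still overlay your current directory on top)
torchx run --scheduler local_docker utils.python --image my-timm-app:latest --script timm_app.py
```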
More commonly, once the Dockerfile is created you can launch as normal and TorchX will automatically build the image with the newly provided Dockerfile instead of the default one:

```sh
torchx run --scheduler local_docker utils.python --script timm_app.py
```

### Slurm

The `slurm` and `local_cwd` schedulers use the current environment, so you can use `pip` and `conda` as normal.

## Next Steps

1. Check out other features of the [torchx CLI](cli.rst)
2. Take a look at the [list of schedulers](schedulers.rst) supported by the runner
3. Browse through the collection of [builtin components](components/overview.rst)
4. See which [ML pipeline platforms](pipelines.rst) you can run components on
5. See a [training app example](examples_apps/index.rst)