torchelastic on Kubernetes
==========================

This directory contains Kubernetes resources that help users run
torchelastic jobs on Kubernetes. We use `Amazon
EKS <https://aws.amazon.com/eks/>`__ as an example here, but the steps
also work on other cloud providers (you will need to find an
alternative storage option in your environment).

Prerequisites
-------------

1. Set up AWS credentials on your machine (see the sketch after this
   list).
2. `eksctl <https://eksctl.io/>`__ to manage the Kubernetes cluster.
3. Network file storage that supports the ReadWriteMany access mode.
   `EFS <https://aws.amazon.com/efs/>`__ and `FSx for
   Lustre <https://aws.amazon.com/fsx/lustre/>`__ are good examples.
4. An `S3 <https://aws.amazon.com/s3/>`__ bucket to host training scripts.
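
A minimal sanity check for the first two prerequisites, assuming the
AWS CLI and ``eksctl`` are already installed:

.. code:: bash

   # Configure your access key, secret key, and default region.
   $ aws configure

   # Verify that the credentials resolve to your account.
   $ aws sts get-caller-identity
   $ eksctl version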

Quickstart
----------

This guide shows how to get a torchelastic job running on EKS.

Create Kubernetes cluster
~~~~~~~~~~~~~~~~~~~~~~~~~

We highly recommend using `eksctl <https://eksctl.io/>`__ to create the
Amazon EKS cluster. This process takes about 10 to 15 minutes.

.. code:: bash

   $ eksctl create cluster \
       --name=torchelastic \
       --node-type=p3.2xlarge \
       --region=us-west-2 \
       --version=1.14 \
       --ssh-access \
       --ssh-public-key=~/.ssh/id_rsa.pub \
       --nodes=2
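
``eksctl`` adds the new cluster's context to your kubeconfig when it
finishes, so a quick check that both GPU nodes registered is:

.. code:: bash

   $ kubectl get nodes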

Install Nvidia Device Plugin
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To enable GPU support in your EKS cluster, deploy the following
DaemonSet:

.. code:: bash

   $ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
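
After the plugin pods start, the GPUs should appear as allocatable
resources on each node. One way to check (the custom-columns expression
below simply reads ``nvidia.com/gpu`` from each node's status):

.. code:: bash

   $ kubectl get nodes \
       "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"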

Prepare storage
~~~~~~~~~~~~~~~

Create EFS and mount targets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Create an EFS volume in the AWS console. ``eksctl`` creates a new VPC
for the cluster; select that VPC so the EKS cluster nodes can access
EFS. Select the ``Public`` subnets when you create the mount targets,
and associate the mount targets with the security group whose name
contains ``ClusterSharedNodeSecurityGroup`` (created by ``eksctl``).

.. figure:: _static/img/efs-setup.jpg
   :alt: efs-setup-diagram

   efs-setup-diagram
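
If you prefer the CLI over the console, the same setup can be sketched
with the AWS CLI. The subnet and security group IDs below are
placeholders: use the public subnets of the VPC created by ``eksctl``
and its shared node security group.

.. code:: bash

   # Create the file system (the creation token is an arbitrary
   # idempotency string).
   $ aws efs create-file-system --creation-token torchelastic-efs --region us-west-2

   # Create one mount target per public subnet of the cluster VPC.
   $ aws efs create-mount-target \
       --file-system-id <EFS_ID> \
       --subnet-id <subnet-id> \
       --security-groups <security-group-id>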

Install EFS CSI Driver
^^^^^^^^^^^^^^^^^^^^^^

EFS CSI Driver manages the lifecycle of Amazon EFS filesystems in
Kubernetes.

.. code:: bash

   kubectl apply -k "github.com/kubernetes-sigs/aws-efs-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"

..

   Note: If your Kubernetes cluster version is below 1.14, please follow
   the `instructions <https://github.com/kubernetes-sigs/aws-efs-csi-driver>`__
   to install an older version of the CSI driver.
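
You can confirm the driver's node pods are running before creating the
volume:

.. code:: bash

   $ kubectl get pods -n kube-system | grep efs-csi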

Create PersistentVolume and PersistentVolumeClaim
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Once the file system is statically provisioned and the CSI driver is
ready, EFS can be mounted inside a container as a volume using the
driver.

.. code:: bash

   export EFS_ID=your_efs_id
   sed -i '' 's/<EFS_ID>/'"$EFS_ID"'/' storage.yaml

   kubectl apply -f storage.yaml
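
Before moving on, verify that the claim bound to the volume;
``efs-claim`` is the claim name that the dataset and worker pods mount
later:

.. code:: bash

   $ kubectl get pv
   $ kubectl get pvc efs-claim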

Download imagenet dataset
^^^^^^^^^^^^^^^^^^^^^^^^^

In this example, we use the `tiny imagenet
dataset <https://tiny-imagenet.herokuapp.com/>`__ instead of the full
dataset because it is small and easy to download. It has only 200
classes, with 500 training images, 50 validation images, and 50 test
images per class.

   Note: You can download the full imagenet dataset
   `here <http://www.image-net.org/>`__.

Run the following script to create a pod that prefetches the dataset
onto the shared EFS volume.

.. code:: bash

   cat <<EOF | kubectl apply -f -
   apiVersion: v1
   kind: Pod
   metadata:
     name: download-dataset-task
   spec:
     restartPolicy: OnFailure
     containers:
     - name: app
       image: centos:latest
       command:
       - /bin/sh
       - "-c"
       - |
         /bin/bash <<'EOF'
         # Update packages and install download tools, then fetch and
         # unpack the dataset onto the EFS-backed /data mount.
         yum -y update
         yum install -y wget unzip
         wget http://cs231n.stanford.edu/tiny-imagenet-200.zip
         unzip tiny-imagenet-200.zip -d /data
         EOF
       volumeMounts:
       - name: persistent-storage
         mountPath: /data
     volumes:
     - name: persistent-storage
       persistentVolumeClaim:
         claimName: efs-claim
   EOF

..

   Note: The job running time may vary from 10 to 30 minutes depending on
   the throughput and performance mode you configure for your EFS file
   system.
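
You can follow the download and confirm the pod finished before
launching training:

.. code:: bash

   $ kubectl logs -f download-dataset-task
   $ kubectl get pod download-dataset-task   # STATUS should eventually be Completed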

Launch etcd instance
~~~~~~~~~~~~~~~~~~~~

This gives you a single etcd instance; your training workers will talk
to it for peer discovery and distributed synchronization.

.. code:: bash

   $ kubectl apply -f etcd.yaml
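
Once the manifest is applied you should see an etcd pod and a service
named ``etcd``, matching the rendezvous endpoint (``etcd://etcd:2379``)
the workers use:

.. code:: bash

   $ kubectl get pods | grep etcd
   $ kubectl get svc etcd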

Launch torchelastic job
~~~~~~~~~~~~~~~~~~~~~~~

Update training scripts location
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this example we reuse the container image
``torchelastic/examples:0.1.0rc1``, so we need to upload the training
script `main.py <../examples/imagenet/main.py>`__ to an S3 bucket. In
addition, the args in ``imagenet.yaml`` need to be updated: replace
``s3://<BUCKET>/petctl/<USER>/<JOB_ID>/main.py`` with your file
location.

.. code:: yaml

   containers:
     - name: elastic-trainer-worker
       image: torchelastic/examples:0.1.0rc1
       args:
       - "s3://<BUCKET>/petctl/<USER>/<JOB_ID>/main.py"
       - "--input_path"
       ....
       ....

..

   Note: You can export the environment variables ``BUCKET``, ``USER``,
   and ``JOB_ID`` and replace the placeholder values in the file:

::

   sed -i '' 's/<BUCKET>/'"$BUCKET"'/; s/<USER>/'"$USER"'/; s/<JOB_ID>/'"$JOB_ID"'/;' imagenet.yaml
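
For example, assuming you keep the ``petctl``-style S3 prefix from the
manifest, the upload can be scripted along these lines (set the
variables to your own bucket, user, and job id) before running the
``sed`` command above:

.. code:: bash

   export BUCKET=your_bucket
   export USER=your_user
   export JOB_ID=your_job_id

   # Upload the training script to the location referenced in imagenet.yaml.
   aws s3 cp ../examples/imagenet/main.py \
       "s3://${BUCKET}/petctl/${USER}/${JOB_ID}/main.py"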

..

   Note: If you prefer not to use S3, or you use a different cloud
   provider, please modify the `Dockerfile <../examples/Dockerfile>`__
   and build your own container image.

Attach S3 access to EKS worker nodes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For the training container to successfully download ``main.py`` from
S3, you need to grant S3 access to your worker nodes. In the AWS IAM
console, attach the ``AmazonS3ReadOnlyAccess`` policy to the worker
node role. The nodegroup role created by ``eksctl`` looks like
``eksctl-torchelastic-nodegroup-***-NodeInstanceRole-***``.
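
If you prefer the CLI, the role can be located and the policy attached
with the AWS CLI; the grep pattern below assumes the default ``eksctl``
naming:

.. code:: bash

   # Find the node instance role created by eksctl for this cluster.
   $ aws iam list-roles --query 'Roles[].RoleName' --output text | \
       tr '\t' '\n' | grep torchelastic | grep NodeInstanceRole

   # Attach the managed read-only S3 policy to that role.
   $ aws iam attach-role-policy \
       --role-name <NodeInstanceRole-name-from-above> \
       --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess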

Create workers
^^^^^^^^^^^^^^

Applying the manifest creates two pods and two headless services. The
headless services allow the pods to reach each other by hostname.

.. code:: bash

   $ kubectl apply -f imagenet.yaml
   pod/imagenet-worker-1 created
   service/imagenet-worker-1 created
   pod/imagenet-worker-2 created
   service/imagenet-worker-2 created

Check worker logs
~~~~~~~~~~~~~~~~~

.. code:: bash

   $ kubectl logs -f imagenet-worker-1

   download: s3://torchelastic-shjiaxin-1h71m-s3bucket-m1b9b9pjldqw/petctl/shjiaxin/imagenet-job/main.py to ../tmp/fetch_and_run_ee5yh8qm/s3_file_x8ure7if
   [INFO] 2020-01-03 23:08:49,297 main: rdzv init method=etcd://etcd:2379/imagenet?min_workers=1&max_workers=3&last_call_timeout=5
   [INFO] 2020-01-03 23:08:49,298 main: Loading data from: /data/tiny-imagenet-200/train
   [INFO] 2020-01-03 23:09:16,761 main: Loading model: resnet101
   [INFO] 2020-01-03 23:09:20,231 main: Rank [0] running on GPU [0]
   INFO 2020-01-03 23:09:20,234 Etcd machines: ['http://0.0.0.0:2379']
   [INFO] 2020-01-03 23:09:20,241 main: Entering torchelastic train_loop
   INFO 2020-01-03 23:09:20,242 Attempting to join next rendezvous
   INFO 2020-01-03 23:09:20,244 Observed existing rendezvous state: {'status': 'joinable', 'version': '1', 'participants': [0]}
   INFO 2020-01-03 23:09:20,255 Joined rendezvous version 1 as rank 1. Full state: {'status': 'joinable', 'version': '1', 'participants': [0, 1]}
   INFO 2020-01-03 23:09:20,255 Waiting for remaining peers.
   INFO 2020-01-03 23:09:25,265 All peers arrived. Confirming membership.
   INFO 2020-01-03 23:09:25,363 Waiting for confirmations from all peers.
   INFO 2020-01-03 23:09:25,367 Rendezvous version 1 is complete. Final state: {'status': 'final', 'version': '1', 'participants': [0, 1], 'keep_alives': ['/torchelastic/p2p/run_imagenet/rdzv/v_1/rank_1', '/torchelastic/p2p/run_imagenet/rdzv/v_1/rank_0'], 'num_workers_waiting': 0}
   INFO 2020-01-03 23:09:25,367 Using TCPStore for c10d::Store implementation
   INFO 2020-01-03 23:09:25,371 Rank 1 will conenct to TCPStore server at imagenet-worker-2:51903
   [INFO] 2020-01-03 23:09:25,372 coordinator_p2p: Got next rendezvous: rank 1, world size 2
   [INFO] 2020-01-03 23:09:25,383 coordinator_p2p: Initialized process group rank 1, world size 2
   [INFO] 2020-01-03 23:09:25,385 main: Rank 1: Model state synced from rank: 0
       batch_size=32
       num_data_workers=0
       data_start_index=0
       iteration=0
       epoch=0/10
   [INFO] 2020-01-03 23:09:25,629 train_loop: Rank 1 synced state with other nodes
   [INFO] 2020-01-03 23:09:27,288 main: epoch: 0, iteration: 0, data_idx: 0
   [INFO] 2020-01-03 23:09:28,856 main: epoch: 0, iteration: 1, data_idx: 96
   [INFO] 2020-01-03 23:09:30,434 main: epoch: 0, iteration: 2, data_idx: 192
   [INFO] 2020-01-03 23:09:31,992 main: epoch: 0, iteration: 3, data_idx: 288
   [INFO] 2020-01-03 23:09:33,543 main: epoch: 0, iteration: 4, data_idx: 384
   [INFO] 2020-01-03 23:09:35,120 main: epoch: 0, iteration: 5, data_idx: 480
   [INFO] 2020-01-03 23:09:36,685 main: epoch: 0, iteration: 6, data_idx: 576
   [INFO] 2020-01-03 23:09:38,256 main: epoch: 0, iteration: 7, data_idx: 672
   [INFO] 2020-01-03 23:09:39,835 main: epoch: 0, iteration: 8, data_idx: 768
   [INFO] 2020-01-03 23:09:41,420 main: epoch: 0, iteration: 9, data_idx: 864
   [INFO] 2020-01-03 23:09:42,981 main: epoch: 0, iteration: 10, data_idx: 960

Clean up job
~~~~~~~~~~~~

::

   $ kubectl delete -f imagenet.yaml
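
If you are done with the environment entirely, the remaining resources
and the cluster itself can be removed as well (the EFS file system must
be deleted separately in the AWS console or CLI):

.. code:: bash

   $ kubectl delete -f etcd.yaml
   $ kubectl delete -f storage.yaml
   $ kubectl delete pod download-dataset-task
   $ eksctl delete cluster --name=torchelastic --region=us-west-2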

Elasticity
----------

You can launch new pods with the same configuration to join the
training, or delete individual pods to verify that training continues
with the remaining workers.
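
For example, to observe elasticity you can delete one of the workers
and watch the survivor go through a new rendezvous round
(``min_workers=1`` in the rendezvous configuration allows training to
continue):

.. code:: bash

   $ kubectl delete pod imagenet-worker-2
   $ kubectl logs -f imagenet-worker-1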

torchelastic operator
---------------------

The example above is the hard way to launch a torchelastic job on
Kubernetes: the user needs to launch etcd, write pod and headless
service manifests, and manage the pod lifecycle. A Kubernetes
`operator <https://coreos.com/operators/>`__ can be introduced here to
simplify torchelastic job setup. Creation and destruction of etcd and
the workers can then be handled by a controller, and dynamic resource
allocation and preemption can be added at this layer later as well.