# Bazel in PyTorch/XLA

[Bazel](https://bazel.build/) is a free software tool used for the
automation of building and testing software.
[TensorFlow](https://www.tensorflow.org/) and
[OpenXLA](https://github.com/openxla/xla) both use it, which makes it a
good fit for PyTorch/XLA as well.

## Bazel dependencies

TensorFlow is a [Bazel external dependency](https://bazel.build/external/overview) for PyTorch/XLA,
which can be seen in the `WORKSPACE` file:

`WORKSPACE`

``` python
http_archive(
    name = "org_tensorflow",
    strip_prefix = "tensorflow-f7759359f8420d3ca7b9fd19493f2a01bd47b4ef",
    urls = [
        "https://github.com/tensorflow/tensorflow/archive/f7759359f8420d3ca7b9fd19493f2a01bd47b4ef.tar.gz",
    ],
)
```

The TensorFlow pin can be updated by pointing this repository to a
different revision. Patches may be added as needed. Bazel will resolve
the dependency, prepare the code and patch it hermetically.
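
In the `http_archive` rule above, both `strip_prefix` and the archive URL are derived from the same commit hash, so they have to be changed together when the pin moves. A minimal sketch of that relationship follows; the helper name `tensorflow_pin_fields` is hypothetical and not part of the repository.

``` python
def tensorflow_pin_fields(commit: str) -> dict:
    """Return the pin-dependent http_archive fields for a TensorFlow commit.

    Hypothetical helper for illustration; it only mirrors the pattern
    visible in the WORKSPACE snippet above.
    """
    commit = commit.strip().lower()
    # A full Git SHA-1 is 40 hexadecimal characters.
    if len(commit) != 40 or not all(c in "0123456789abcdef" for c in commit):
        raise ValueError(f"expected a full 40-character commit hash, got {commit!r}")
    return {
        "strip_prefix": f"tensorflow-{commit}",
        "urls": [
            f"https://github.com/tensorflow/tensorflow/archive/{commit}.tar.gz",
        ],
    }


if __name__ == "__main__":
    fields = tensorflow_pin_fields("f7759359f8420d3ca7b9fd19493f2a01bd47b4ef")
    print(fields["strip_prefix"])
    print(fields["urls"][0])
```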

For PyTorch, a different dependency mechanism is used because
PyTorch/XLA builds against a local
[PyTorch](https://github.com/pytorch/pytorch) checkout. This checkout
has to be built from source and ideally installed on the system for
version compatibility (e.g. codegen in PyTorch/XLA uses the `torchgen`
Python module, which should be installed on the system).

The local directory can either be set in `bazel/dependencies.bzl`, or
overridden on the command line:

``` bash
bazel build --override_repository=org_tensorflow=/path/to/exported/tf_repo //...
```

``` bash
bazel build --override_repository=torch=/path/to/exported/and/built/torch_repo //...
```

Please make sure that the overridden repositories are at the appropriate
revisions and, in the case of `torch`, that it has been built with
`USE_CUDA=0 python setup.py bdist_wheel` so that all expected build
objects are present; ideally it is also installed on the system.

`WORKSPACE`

``` python
new_local_repository(
    name = "torch",
    build_file = "//bazel:torch.BUILD",
    path = PYTORCH_LOCAL_DIR,
)
```

PyTorch headers are sourced directly from the `torch` dependency, that
is, the local checkout of PyTorch. The shared libraries (e.g.
`libtorch.so`) are sourced from the same local checkout, where the code
has been built and `build/lib/` contains the built objects. For this to
work, `-isystemexternal/torch` must be passed to the compiler so that it
can find headers included as `system` headers and satisfy them from the
local checkout. Some headers are included as `<system>` and some as
`"user"` headers.

Bazel brings in [pybind11](https://github.com/pybind/pybind11) embedded
Python and links against it to provide `libpython` to the plugin using
this mechanism. Python headers are also sourced from there instead of
depending on the system version. These are satisfied from the
`"@pybind11//:pybind11_embed"`, which sets up compiler options for
linking with `libpython` transitively.
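
Because the `torch` repository is satisfied from a local checkout, a build fails late if that checkout was never built. The following is a small pre-flight sketch, assuming only what is stated above: that the built objects live under `build/lib/` and include `libtorch.so`. The function name and the default list of required files are illustrative, not something the repository provides.

``` python
from pathlib import Path


def missing_torch_artifacts(checkout, required=("build/lib/libtorch.so",)):
    """Return the required build artifacts that are absent from a PyTorch checkout.

    `required` is an illustrative default; extend it with whatever your
    build actually links against.
    """
    root = Path(checkout)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    import sys

    missing = missing_torch_artifacts(sys.argv[1] if len(sys.argv) > 1 else ".")
    if missing:
        print("PyTorch checkout looks unbuilt, missing:", ", ".join(missing))
    else:
        print("PyTorch checkout contains the expected build objects")
```

An empty list means the checkout can plausibly be passed to `--override_repository=torch=...`; it does not verify that the revision is compatible.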

## How to build XLA libraries

Building the libraries is simple:

``` bash
bazel build //torch_xla/csrc/runtime/...
```

Bazel is configured via `.bazelrc`, but it can also take flags on the
command line.

``` bash
bazel build --config=remote_cache //torch_xla/csrc/runtime/...
```

The `remote_cache` configuration uses gcloud for caching and is usually
faster, but requires authentication with gcloud. See `.bazelrc` for the
configuration.

Using bazel makes it easy to express complex dependencies, and there is
a lot to gain from having a single build graph with everything expressed
in the same way. Therefore, there is no need to build the XLA libraries
separately from the rest of the plugin, as used to be the case. Building
the whole repository, or the plugin shared object that links everything
else in, is enough.

## How to build the Torch/XLA plugin

The normal build can be done by invoking the standard
`python setup.py bdist_wheel`, but the C++ bindings alone can be built
with:

``` bash
bazel build //:_XLAC.so
```

This will build the XLA client and the PyTorch plugin and link it all
together. This can be useful when testing changes: the C++ code can be
compiled without building the Python plugin, for faster iteration
cycles.

## Remote caching

Bazel comes with [remote caching](https://bazel.build/remote/caching)
built in. There are plenty of cache backends that can be used; we deploy
our caching on
[GCS](https://bazel.build/remote/caching#cloud-storage). You can see
the configuration in `.bazelrc`, under config name `remote_cache`.

Remote caching is disabled by default, but because it speeds up
incremental builds by a large margin, it is almost always recommended.
It is enabled by default in the CI automation and on Cloud Build.

To authenticate on a machine, please ensure that you have the
credentials present with:

``` bash
gcloud auth application-default login --no-launch-browser
```

Using the remote cache set up by the `remote_cache` configuration
requires authentication with GCP, and there are several ways to
authenticate.

Individual developers who have access to the development GCP project
only need to pass the `--config=remote_cache` flag to bazel. The default
`--google_default_credentials` is then used, and if the gcloud token is
present on the machine, it works out of the box, authenticating as the
logged-in user. The user needs to have remote build permissions in GCP
(add new developers to the `Remote Bazel` role).

In the CI, a service account key is used for authentication and is
passed to bazel using
`--config=remote_cache --google_credentials=path/to/service.key`.

On [Cloud Build](https://cloud.google.com/build),
`docker build --network=cloudbuild` is used to pass the authentication
from the service account running the cloud build down into the docker
image doing the compilation: [Application Default
Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc)
does the work there and authenticates as the service account.

All accounts, both user and service accounts, need to have remote cache
read/write permissions.

The remote cache uses cache silos. Each unique machine and build should
specify a unique silo key to benefit from consistent caching. The silo
key can be passed using a flag:
`--remote_default_exec_properties=cache-silo-key=SOME_SILO_KEY`.
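
The flags mentioned in this section can be combined into a single bazel invocation. As a sketch only: the helper below assembles them into an argument list. The function name is hypothetical, and this is not how `setup.py` builds its command line; it only uses the flags named in the text (`--config=remote_cache`, `--remote_default_exec_properties=cache-silo-key=...`, and optionally `--google_credentials=...`).

``` python
def remote_cache_flags(silo_key, credentials_file=None):
    """Assemble the bazel flags for a remote-cache build (illustrative helper)."""
    if not silo_key:
        raise ValueError("a non-empty silo key is required")
    flags = [
        "--config=remote_cache",
        f"--remote_default_exec_properties=cache-silo-key={silo_key}",
    ]
    # Without an explicit key file, the default credentials are used.
    if credentials_file is not None:
        flags.append(f"--google_credentials={credentials_file}")
    return flags


if __name__ == "__main__":
    command = ["bazel", "build", *remote_cache_flags("SOME_SILO_KEY"), "//:_XLAC.so"]
    print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` as is, which avoids shell quoting problems when the silo key contains unusual characters.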

Running the build with remote cache:

``` bash
BAZEL_REMOTE_CACHE=1 SILO_NAME="cache-silo-YOUR-USER" TPUVM_MODE=1 python setup.py bdist_wheel
```

Adding

``` bash
GCLOUD_SERVICE_KEY_FILE=~/.config/gcloud/application_default_credentials.json
```

might help too if `bazel` cannot find the auth token.

`YOUR-USER` here can be the author's username or machine name, a unique
name that ensures good cache behavior. Other `setup.py` functionality
works as intended too (e.g. `develop`).
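
When the build is driven from a script, the environment variables shown above can be filled in programmatically. This is a minimal sketch assuming the `cache-silo-YOUR-USER` naming used in this section; the function name is hypothetical, and falling back to the login name via `getpass.getuser()` is just one reasonable choice.

``` python
import getpass
import os


def remote_cache_build_env(user=None, base_env=None):
    """Return an environment for `python setup.py ...` with remote caching enabled."""
    env = dict(os.environ if base_env is None else base_env)
    name = user or getpass.getuser()
    env.update(
        {
            "BAZEL_REMOTE_CACHE": "1",
            "SILO_NAME": f"cache-silo-{name}",
            "TPUVM_MODE": "1",
        }
    )
    return env


if __name__ == "__main__":
    env = remote_cache_build_env()
    print(env["SILO_NAME"])
    # To start the build with this environment:
    # import subprocess
    # subprocess.run(["python", "setup.py", "bdist_wheel"], env=env, check=True)
```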

The first time the code is compiled with a new cache key will be slow,
because everything is compiled from scratch, but incremental
compilations will be very fast. After the TensorFlow pin is updated, the
first build per key will once again be slower, and then builds are fast
again until the next update.

## Running tests

Currently C++ code is built and tested by bazel. Python code will be
migrated in the future.

Bazel is a test platform too, making it easy to run tests:

``` bash
bazel test //test/cpp:main
```

The XLA and PJRT configuration has to be present in the environment to
run the tests. Not all environment variables are passed into the bazel
test environment, so that remote cache misses are not too common (the
environment is part of the cache key). See the test configuration in
`.bazelrc` for the variables that are passed in, and add new ones as
required.
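
To see why forwarding fewer variables improves the hit rate, consider a toy model in which the cache key is a hash over the command and the environment. This is not Bazel's actual key computation, only an illustration of the principle; the function names are hypothetical and the variable names are used as examples.

``` python
import hashlib
import json


def toy_cache_key(command, env):
    """Hash a command together with its environment (toy model, not Bazel's algorithm)."""
    payload = json.dumps({"command": list(command), "env": sorted(env.items())})
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def filter_env(env, allowlist):
    """Keep only the variables that are explicitly passed through."""
    return {name: value for name, value in env.items() if name in allowlist}


if __name__ == "__main__":
    command = ["run_test"]
    alice = {"PJRT_DEVICE": "CPU", "HOME": "/home/alice"}
    bob = {"PJRT_DEVICE": "CPU", "HOME": "/home/bob"}

    # With the full environment, an irrelevant variable changes the key.
    print(toy_cache_key(command, alice) == toy_cache_key(command, bob))

    # With an allowlist, both users share the same key.
    allow = {"PJRT_DEVICE"}
    print(
        toy_cache_key(command, filter_env(alice, allow))
        == toy_cache_key(command, filter_env(bob, allow))
    )
```

In this model every additional forwarded variable is another way for two otherwise identical test runs to end up with different keys, which is the trade-off to keep in mind when adding variables to the test configuration.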

You can run the tests using the helper script too:

``` bash
BAZEL_REMOTE_CACHE=1 SILO_NAME="cache-silo-YOUR-USER" ./test/cpp/run_tests.sh -R
```

The `xla_client` tests are pure hermetic tests that can be easily
executed. The `torch_xla` plugin tests are more complex: they require
`torch` and `torch_xla` to be installed, and they cannot run in
parallel, either because they use an XRT server/client on the same port,
or because they use a GPU or TPU device and only one is available at a
time. For that reason, all tests under `torch_xla/csrc/` are bundled
into a single target `:main` that runs them all sequentially.

## Code coverage

When running tests, it can be useful to calculate code coverage.

``` bash
bazel coverage //torch_xla/csrc/runtime/...
```

Coverage can be visualized using `lcov` as described in [Bazel's
documentation](https://bazel.build/configure/coverage), or in your
editor of choice with lcov plugins, e.g. [Coverage
Gutters](https://marketplace.visualstudio.com/items?itemName=ryanluker.vscode-coverage-gutters)
for VSCode.
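
For a quick summary without any extra tooling, the LCOV tracefile can also be read directly. The sketch below relies only on the standard `SF:`, `DA:` and `end_of_record` records of the LCOV tracefile format; the function name is hypothetical, and where the tracefile ends up depends on your bazel setup.

``` python
from collections import defaultdict


def line_coverage(tracefile_text):
    """Return {source_file: (covered_lines, instrumented_lines)} from LCOV text."""
    hits = defaultdict(dict)
    current = None
    for raw in tracefile_text.splitlines():
        line = raw.strip()
        if line.startswith("SF:"):
            # Start of a per-file record.
            current = line[3:]
        elif line.startswith("DA:") and current is not None:
            # DA:<line number>,<execution count>[,<checksum>]
            fields = line[3:].split(",")
            lineno, count = int(fields[0]), int(fields[1])
            hits[current][lineno] = hits[current].get(lineno, 0) + count
        elif line == "end_of_record":
            current = None
    return {
        source: (sum(1 for count in lines.values() if count > 0), len(lines))
        for source, lines in hits.items()
    }


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], encoding="utf-8") as handle:
        summary = line_coverage(handle.read())
    for source, (covered, total) in sorted(summary.items()):
        print(f"{source}: {covered}/{total} lines covered")
```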

## Language Server

Bazel can power a language server like [clangd](https://clangd.llvm.org/) that brings code references,
autocompletion and semantic understanding of the underlying code to your
editor of choice. For VSCode, one can use [Bazel Stack](https://github.com/stackb/bazel-stack-vscode-cc)
that can be combined with [Visual Studio clangd extension](https://marketplace.visualstudio.com/items?itemName=llvm-vs-code-extensions.vscode-clangd)
functionality to bring powerful features to assist code editing.

## Building PyTorch/XLA

As always, PyTorch/XLA can be built using Python `distutils`:

``` bash
BAZEL_REMOTE_CACHE=1 SILO_NAME="cache-silo-YOUR-USER" TPUVM_MODE=1 python setup.py bdist_wheel
```