Bazel in PyTorch/XLA¶
Bazel is a free software tool used for the automation of building and testing software. TensorFlow and OpenXLA both use it, which makes it a good fit for PyTorch/XLA as well.
Bazel dependencies¶
TensorFlow is a Bazel external dependency of PyTorch/XLA, as can be seen in the WORKSPACE file:
WORKSPACE
http_archive(
    name = "org_tensorflow",
    strip_prefix = "tensorflow-f7759359f8420d3ca7b9fd19493f2a01bd47b4ef",
    urls = [
        "https://github.com/tensorflow/tensorflow/archive/f7759359f8420d3ca7b9fd19493f2a01bd47b4ef.tar.gz",
    ],
)
The TensorFlow pin can be updated by pointing this repository at a different revision, and patches may be added as needed. Bazel will resolve the dependency, prepare the code, and patch it hermetically.
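Since the pin is just a commit hash appearing in WORKSPACE, an update can be scripted; this is only a sketch, and the new revision below is a placeholder, not a real commit:
# Hypothetical pin update: swap the pinned TensorFlow commit in WORKSPACE
OLD_REV=f7759359f8420d3ca7b9fd19493f2a01bd47b4ef
NEW_REV=0000000000000000000000000000000000000000  # placeholder: substitute the real commit sha
sed -i "s/${OLD_REV}/${NEW_REV}/g" WORKSPACE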
For PyTorch, a different dependency mechanism is deployed because a local PyTorch checkout is used. This local checkout has to be built from source and ideally installed on the system for version compatibility (e.g., codegen in PyTorch/XLA uses the torchgen Python module, which should be installed on the system).
The local directory can either be set in bazel/dependencies.bzl, or overridden on the command line:
bazel build --override_repository=org_tensorflow=/path/to/exported/tf_repo //...
bazel build --override_repository=torch=/path/to/exported/and/built/torch_repo //...
Please make sure that the overridden repositories are at the appropriate revisions, and in the case of torch, that it has been built with USE_CUDA=0 python setup.py bdist_wheel so that all expected build objects are present; ideally it should also be installed into the system.
WORKSPACE
new_local_repository(
    name = "torch",
    build_file = "//bazel:torch.BUILD",
    path = PYTORCH_LOCAL_DIR,
)
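A minimal sketch of preparing such a local checkout, assuming it lives at /path/to/pytorch (the path is illustrative):
# Build PyTorch from source without CUDA and install the wheel into the environment
cd /path/to/pytorch
USE_CUDA=0 python setup.py bdist_wheel
pip install dist/torch-*.whl  # also makes the torchgen module available on the system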
PyTorch headers are sourced directly from the torch dependency, i.e. the local checkout of PyTorch. The shared libraries (e.g. libtorch.so) come from the same local checkout, where build/lib/ contains the built objects. For this to work, -isystemexternal/torch has to be passed to the compiler so that it can find these headers and satisfy them from the local checkout; some are included as <system> headers and some as "user" headers.
Bazel brings in pybind11 with embedded Python and links against it to provide libpython to the plugin using this mechanism. Python headers are also sourced from there instead of depending on the system version. These are satisfied by the "@pybind11//:pybind11_embed" target, which sets up the compiler options for linking with libpython transitively.
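One way to sanity-check this wiring is to query the build graph; the grep pattern below is just illustrative:
# List the direct dependencies of the plugin and look for the pybind11 targets
bazel query 'deps(//:_XLAC.so, 1)' | grep pybind11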
How to build XLA libraries¶
Building the libraries is simple:
bazel build //torch_xla/csrc/runtime/...
Bazel is configured via .bazelrc, but it can also take flags on the command line:
bazel build --config=remote_cache //torch_xla/csrc/runtime/...
The remote_cache configurations use gcloud for caching and are usually faster, but they require authentication with gcloud. See .bazelrc for the configuration.
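If gcloud authentication is not an option, Bazel's built-in local disk cache (a standard Bazel flag, independent of this repository's configs) can still speed up rebuilds:
# Cache build outputs on the local disk instead of the remote GCS cache
bazel build --disk_cache="$HOME/.cache/bazel-disk-cache" //torch_xla/csrc/runtime/...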
Using bazel makes it easy to express complex dependencies, and there is a lot to gain from having a single build graph with everything expressed in the same way. There is therefore no need to build the XLA libraries separately from the rest of the plugin, as used to be the case; building the whole repository, or the plugin shared object that links everything else in, is enough.
How to build the Torch/XLA plugin¶
The normal build can be achieved by invoking the standard python setup.py bdist_wheel, but the C++ bindings can be built on their own with:
bazel build //:_XLAC.so
This will build the XLA client and the PyTorch plugin and link it all together. It can be useful when testing changes, because compiling the C++ code without building the whole Python plugin gives faster iteration cycles.
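A possible iteration loop, assuming the standard bazel-bin output location (the copy destination is an assumption about the local layout, not a required step):
# Rebuild only the C++ extension...
bazel build //:_XLAC.so
# ...and copy it into the source tree for a quick local check (path is illustrative)
cp bazel-bin/_XLAC.so torch_xla/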
Remote caching¶
Bazel comes with remote caching built in. There are plenty of cache backends that can be used; we deploy our caching on Google Cloud Storage (GCS), see https://bazel.build/remote/caching#cloud-storage. You can see the configuration in .bazelrc, under the config name remote_cache.
Remote caching is disabled by default, but because it speeds up incremental builds by a huge margin, it is almost always recommended; it is enabled by default in the CI automation and on Cloud Build.
To authenticate on a machine, please ensure that you have the credentials present with:
gcloud auth application-default login --no-launch-browser
Using the remote cache via the remote_cache configuration requires authentication with GCP, and there are various ways to authenticate. Individual developers who have access to the development GCP project only need to pass the --config=remote_cache flag to bazel: the default --google_default_credentials will be used, and if a gcloud token is present on the machine, it works out of the box, authenticating as the logged-in user. The user needs to have remote build permissions in GCP (add new developers to the Remote Bazel role). In the CI, a service account key is used for authentication and is passed to bazel using --config=remote_cache --google_credentials=path/to/service.key. On Cloud Build, docker build --network=cloudbuild is used to pass the authentication from the service account running the cloud build down into the docker image doing the compilation; Application Default Credentials do the work there and authenticate as the service account. All accounts, both user and service accounts, need to have remote cache read/write permissions.
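For example, a CI-style invocation with a service account key (the key path here is illustrative) looks like:
bazel build --config=remote_cache --google_credentials=/path/to/service.key //torch_xla/csrc/runtime/...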
The remote cache uses cache silos. Each unique machine and build should specify a unique silo key to benefit from consistent caching. The silo key can be passed using a flag: --remote_default_exec_properties=cache-silo-key=SOME_SILO_KEY.
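Putting it together, a developer build with a personal silo key might look like this (the key value is just an example):
bazel build --config=remote_cache --remote_default_exec_properties=cache-silo-key=my-dev-machine //torch_xla/csrc/runtime/...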
Running the build with remote cache:
BAZEL_REMOTE_CACHE=1 SILO_NAME="cache-silo-YOUR-USER" TPUVM_MODE=1 python setup.py bdist_wheel
Adding
GCLOUD_SERVICE_KEY_FILE=~/.config/gcloud/application_default_credentials.json
might help too if bazel
cannot find the auth token.
YOUR-USER here can be the author's username or machine name, a unique name that ensures good cache behavior. Other setup.py functionality works as intended too (e.g. develop).
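For instance, an editable install with the same caching setup, mirroring the bdist_wheel command above:
BAZEL_REMOTE_CACHE=1 SILO_NAME="cache-silo-YOUR-USER" TPUVM_MODE=1 python setup.py develop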
The first compilation with a new cache key will be slow because everything is built from scratch, but incremental compilations will be very fast. After a TensorFlow pin update, the first build per key will once again be a bit slower, and then quite fast again until the next update.
Running tests¶
Currently C++ code is built and tested by bazel. Python code will be migrated in the future.
Bazel is a test platform too, making it easy to run tests:
bazel test //test/cpp:main
Of course, the XLA and PJRT configuration has to be present in the environment to run the tests. Not all environment variables are passed into the bazel test environment, to make sure that remote cache misses are not too common (the environment is part of the cache key); see the test configuration in .bazelrc for which ones are passed in, and add new ones as required.
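To try an extra variable without editing .bazelrc, it can also be passed on the command line; PJRT_DEVICE=CPU is just an example value:
bazel test --test_env=PJRT_DEVICE=CPU //test/cpp:main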
You can run the tests using the helper script too:
BAZEL_REMOTE_CACHE=1 SILO_NAME="cache-silo-YOUR-USER" ./test/cpp/run_tests.sh -R
The xla_client tests are pure, hermetic tests that can be easily executed. The torch_xla plugin tests are more complex: they require torch and torch_xla to be installed, and they cannot run in parallel, since they either use the XRT server/client on the same port, or they use a GPU or TPU device and there is only one available at a time. For that reason, all tests under torch_xla/csrc/ are bundled into a single target :main that runs them all sequentially.
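When only a subset is of interest, a filter can usually be forwarded to the test binary, assuming the target uses googletest (the suite name below is a placeholder):
bazel test //test/cpp:main --test_arg=--gtest_filter='SomeSuite.*'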
Code coverage¶
When running tests, it can be useful to calculate code coverage.
bazel coverage //torch_xla/csrc/runtime/...
Coverage can be visualized using lcov as described in Bazel's documentation, or in your editor of choice with lcov plugins, e.g. Coverage Gutters for VSCode.
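As a sketch: bazel writes the combined report to bazel-out/_coverage/_coverage_report.dat (the standard Bazel location), and lcov's genhtml can turn it into an HTML report:
genhtml bazel-out/_coverage/_coverage_report.dat --output-directory coverage_html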
Language Server¶
Bazel can power a language server like clangd, bringing code references, autocompletion, and semantic understanding of the underlying code to your editor of choice. For VSCode, one can use Bazel Stack, combined with the Visual Studio Code clangd extension, to bring powerful features that assist code editing.
Building PyTorch/XLA¶
As always, PyTorch/XLA can be built using Python distutils:
BAZEL_REMOTE_CACHE=1 SILO_NAME="cache-silo-YOUR-USER" TPUVM_MODE=1 python setup.py bdist_wheel