torchft
This repository implements primitives and end-to-end (E2E) solutions for per-step fault tolerance, so that training can continue when errors occur instead of interrupting the entire training job.
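To make "per-step" concrete, here is a minimal pure-Python sketch of the idea. It is a toy simulation, not torchft's API: `replica_gradient`, `train`, and all of their parameters are hypothetical names introduced only for illustration. Each step averages the gradients of whichever replicas succeeded, so a failure removes a replica from that step only and the job keeps going.

```python
import random


def replica_gradient(replica_id, step, fail_prob, rng):
    """Return a fake gradient, or raise to simulate a replica failure."""
    if rng.random() < fail_prob:
        raise RuntimeError(f"replica {replica_id} failed at step {step}")
    return 1.0  # stand-in for a real gradient


def train(num_replicas=4, num_steps=5, fail_prob=0.3, seed=0):
    """Run a toy training loop that tolerates failures one step at a time."""
    rng = random.Random(seed)
    weight = 0.0
    history = []  # (step, number of replicas that contributed)
    for step in range(num_steps):
        grads = []
        for replica_id in range(num_replicas):
            try:
                grads.append(replica_gradient(replica_id, step, fail_prob, rng))
            except RuntimeError:
                continue  # drop the failed replica for this step only
        if not grads:
            # No healthy replicas: skip the update but keep the job alive.
            history.append((step, 0))
            continue
        # Average over the surviving replicas and apply the update.
        weight -= 0.1 * sum(grads) / len(grads)
        history.append((step, len(grads)))
    return weight, history


if __name__ == "__main__":
    final_weight, history = train()
    for step, healthy in history:
        print(f"step {step}: {healthy} replica(s) contributed")
    print(f"final weight: {final_weight:.2f}")
```

The design point the sketch illustrates is that the unit of recovery is a single step: a failure costs at most one replica's contribution to one update, rather than a restart of the whole job.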
Getting started? See Install and Usage in the README.
Reference
License
torchft is BSD 3-Clause licensed. See LICENSE for more details.
Copyright © Meta Platforms, Inc.