torchft
This repository implements primitives and end-to-end (E2E) solutions for per-step fault tolerance, so that training can continue when errors occur instead of interrupting the entire training job.
GETTING STARTED? See Install and Usage in the README.
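To make the idea of per-step fault tolerance concrete, here is a minimal sketch in plain Python. It does not use torchft's API. The names `StepFailure`, `flaky_gradient`, and `train` are hypothetical and exist only for illustration, and the failure is simulated rather than caused by a real worker or network error. The point is only that a failed step is discarded on its own while the loop and the model state carry on.

```python
class StepFailure(Exception):
    """Hypothetical error standing in for a worker or communication failure."""


def flaky_gradient(weight: float, step: int, fail_steps: set[int]) -> float:
    """Return the gradient of (weight - 3)^2, failing on the chosen steps."""
    if step in fail_steps:
        raise StepFailure(f"simulated failure at step {step}")
    return 2.0 * (weight - 3.0)


def train(num_steps: int, fail_steps: set[int], lr: float = 0.1):
    """Run gradient descent, skipping any step whose gradient computation fails."""
    weight = 0.0
    committed = 0
    skipped = 0
    for step in range(num_steps):
        try:
            grad = flaky_gradient(weight, step, fail_steps)
        except StepFailure:
            # Discard only this step; the loop and the model state survive.
            skipped += 1
            continue
        # Commit the update only when the step succeeded.
        weight -= lr * grad
        committed += 1
    return weight, committed, skipped


weight, committed, skipped = train(num_steps=50, fail_steps={5, 17, 18})
print(f"weight={weight:.3f} committed={committed} skipped={skipped}")
```

In this toy setting the three simulated failures cost three updates and nothing more. A real distributed job also has to decide which replicas take part in each step and how a recovering replica catches up, which is beyond what this sketch covers.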
License
torchft is BSD 3-Clause licensed. See LICENSE for more details.
Copyright © Meta Platforms, Inc.