Coordination (Low Level API)
Warning
As torchft is still in development, the APIs in this module are subject to change.
This module exposes low-level coordination APIs that let you build your own custom fault tolerance algorithms on top of torchft.
If you’re looking for a more complete solution, please use the other modules in torchft.
It provides direct access to the Lighthouse and Manager servers and clients.
- class torchft.coordination.LighthouseClient(addr, connect_timeout)
Bases:
object
LighthouseClient is a gRPC client to the lighthouse service.
It is used to communicate directly with the lighthouse server.
- Parameters:
addr (str) – The HTTP address of the lighthouse server.
connect_timeout (timedelta) – The timeout for connecting to the lighthouse server.
- quorum(replica_id, timeout, address=Ellipsis, store_address=Ellipsis, step=0, world_size=0, shrink_only=False, data=None)
quorum sends a request to the lighthouse server to form a quorum.
- Parameters:
replica_id (str) – The string id of the replica calling quorum.
timeout (timedelta) – The timeout for quorum.
address (str) – The address of the replica calling quorum. Default: “”.
store_address (str) – The address of the store. Default: “”.
step (int) – The step of the replica calling quorum. Default: 0.
world_size (int) – The world size of the replica calling quorum. Default: 0.
shrink_only (bool) – Whether the quorum is for shrinking only. Default: False.
data (Optional[dict]) – The data to be passed with quorum. Default: None.
- Returns:
Current quorum if successful.
- Return type:
Quorum
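As a rough illustration of the call shape, the sketch below wraps quorum in a small helper. It assumes the documented parameters can be passed by keyword. The name request_quorum and the address in the trailing comment are hypothetical. Constructing a real client needs a running lighthouse, so that part is shown only as a comment.

```python
from datetime import timedelta


def request_quorum(client, replica_id, step, timeout_s=10.0):
    """Ask the lighthouse for a quorum on behalf of one replica.

    `client` is expected to behave like torchft.coordination.LighthouseClient;
    only its documented `quorum` method is used.
    """
    return client.quorum(
        replica_id=replica_id,
        timeout=timedelta(seconds=timeout_s),
        step=step,
    )


# Assumed usage against a lighthouse that is already running
# (the address is a placeholder, not a torchft default):
#
#   from torchft.coordination import LighthouseClient
#   client = LighthouseClient("http://localhost:29510", timedelta(seconds=5))
#   quorum = request_quorum(client, replica_id="replica_0", step=0)
```

Parameters left out of the call (address, store_address, world_size, shrink_only, data) keep the defaults listed above.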
- class torchft.coordination.LighthouseServer(bind, min_replicas, join_timeout_ms=None, quorum_tick_ms=None, heartbeat_timeout_ms=None)
Bases:
object
LighthouseServer is a gRPC server for the lighthouse service.
It is used to coordinate the ManagerServer for each replica group.
This entrypoint is primarily for testing and debugging purposes. The torchft_lighthouse command is recommended for most use cases.
- Parameters:
bind (str) – The HTTP address to bind the server to.
min_replicas (int) – The minimum number of replicas required to form a quorum.
join_timeout_ms (int) – The timeout for joining the quorum, in milliseconds.
quorum_tick_ms (int) – The interval at which the quorum is checked, in milliseconds.
heartbeat_timeout_ms (int) – The timeout for heartbeats, in milliseconds.
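The timing parameters here are integer milliseconds, while the client classes in this module take timedelta values. The sketch below shows one way to keep timedelta at the call site and convert when building the constructor arguments. The helper names to_ms and lighthouse_server_kwargs are hypothetical, the bind value in the comment is a placeholder, and passing the arguments by keyword is an assumption.

```python
from datetime import timedelta


def to_ms(delta):
    """Convert a timedelta to a whole number of milliseconds (rounded down)."""
    return delta // timedelta(milliseconds=1)


def lighthouse_server_kwargs(bind, min_replicas, join_timeout,
                             quorum_tick, heartbeat_timeout):
    """Build keyword arguments matching the documented LighthouseServer parameters."""
    return {
        "bind": bind,
        "min_replicas": min_replicas,
        "join_timeout_ms": to_ms(join_timeout),
        "quorum_tick_ms": to_ms(quorum_tick),
        "heartbeat_timeout_ms": to_ms(heartbeat_timeout),
    }


# Assumed usage in a test (values are illustrative, not torchft defaults):
#
#   from torchft.coordination import LighthouseServer
#   server = LighthouseServer(**lighthouse_server_kwargs(
#       bind="[::]:0",
#       min_replicas=1,
#       join_timeout=timedelta(milliseconds=100),
#       quorum_tick=timedelta(milliseconds=100),
#       heartbeat_timeout=timedelta(seconds=5),
#   ))
```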
- class torchft.coordination.ManagerClient(addr, connect_timeout)
Bases:
object
ManagerClient is a gRPC client to the manager service.
It is used by the trainer to communicate with the ManagerServer.
- Parameters:
addr (str) – The HTTP address of the manager server.
connect_timeout (timedelta) – The timeout for connecting to the manager server.
- should_commit(rank, step, should_commit, timeout)
should_commit makes a request to the manager to determine if the trainer should commit the current step. This waits until all ranks check in at the specified step and will return False if any worker passes should_commit=False.
- Parameters:
rank (int) – The rank of the trainer.
step (int) – The step of the trainer.
should_commit (bool) – Whether the trainer should commit the current step.
timeout (timedelta) – The timeout for the request. If the request times out, a TimeoutError is raised.
- Returns:
Whether the trainer should commit the current step.
- Return type:
bool
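The voting behaviour described above can be wrapped as in the following sketch. The helper name try_commit is hypothetical, and treating a timeout as "do not commit" is a policy chosen for this example rather than something the API prescribes.

```python
from datetime import timedelta


def try_commit(client, rank, step, local_ok, timeout_s=30.0):
    """Return True only if the manager reports that the step should be committed.

    `client` is expected to behave like torchft.coordination.ManagerClient.
    `local_ok` is this rank's own vote: pass False if the local step failed.
    """
    try:
        return client.should_commit(
            rank, step, local_ok, timedelta(seconds=timeout_s)
        )
    except TimeoutError:
        # Assumption for this sketch: a rank that never checked in means
        # the step cannot be safely committed.
        return False
```

A trainer would typically call this once per step and apply the update only when it returns True.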
- class torchft.coordination.ManagerServer(replica_id, lighthouse_addr, hostname, bind, store_addr, world_size, heartbeat_interval, connect_timeout)
Bases:
object
ManagerServer is a gRPC server for the manager service. There should be one manager server per replica group (typically running on the rank 0 host). The individual ranks within a replica group should use ManagerClient to communicate with the manager server and participate in quorum operations.
- Parameters:
replica_id (str) – The ID of the replica group.
lighthouse_addr (str) – The HTTP address of the lighthouse server.
hostname (str) – The hostname of the manager server.
bind (str) – The HTTP address to bind the server to.
store_addr (str) – The HTTP address of the store server.
world_size (int) – The world size of the replica group.
heartbeat_interval (timedelta) – The interval at which heartbeats are sent.
connect_timeout (timedelta) – The timeout for connecting to the lighthouse server.
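Since there is one manager server per replica group, a launcher usually needs to assemble these arguments once, on the rank 0 host. The sketch below collects them in a dictionary. The helper name manager_server_kwargs is hypothetical. The bind value, the two timing values, and the use of keyword arguments are all assumptions made for illustration.

```python
import socket
from datetime import timedelta


def manager_server_kwargs(replica_id, lighthouse_addr, store_addr,
                          world_size, hostname=None):
    """Build keyword arguments matching the documented ManagerServer parameters."""
    return {
        "replica_id": replica_id,
        "lighthouse_addr": lighthouse_addr,
        # Fall back to the local hostname when none is given.
        "hostname": hostname or socket.gethostname(),
        # Assumption: port 0 asks the OS to pick a free port.
        "bind": "[::]:0",
        "store_addr": store_addr,
        "world_size": world_size,
        # Illustrative values, not torchft defaults.
        "heartbeat_interval": timedelta(milliseconds=100),
        "connect_timeout": timedelta(seconds=10),
    }


# Assumed usage, on rank 0 of a replica group only
# (rank, lighthouse_addr and store_addr come from the launcher):
#
#   from torchft.coordination import ManagerServer
#   if rank == 0:
#       server = ManagerServer(**manager_server_kwargs(
#           "replica_0", lighthouse_addr, store_addr, world_size=8))
```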
- class torchft.coordination.Quorum
Bases:
object
The result of a quorum request.
- Parameters:
quorum_id (int) – The ID of the current quorum.
participants (list[QuorumMember]) – All members within the quorum.
created (timedelta) – The time at which the quorum was created on the server.
- class torchft.coordination.QuorumMember
Bases:
object
A single member of a quorum.
- Parameters:
replica_id (str) – The string id of the replica calling quorum.
address (str) – The address of the replica calling quorum.
store_address (str) – The address of the store.
step (int) – The step of the replica calling quorum.
world_size (int) – The world size of the replica calling quorum.
shrink_only (bool) – Whether the quorum is for shrinking only.
timeout (timedelta) – The timeout for quorum.
data (dict or None) – The data to be passed with quorum.
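A custom fault tolerance algorithm will usually inspect the returned Quorum to decide what each replica should do next. The sketch below finds the highest step among the participants and lists the replicas that are behind it. It assumes the documented parameters (participants, replica_id, step) are readable as attributes, and the helper name summarize_quorum is hypothetical.

```python
def summarize_quorum(quorum):
    """Return (max_step, replica_ids) where replica_ids are behind max_step.

    `quorum` is expected to expose `participants`, each with `replica_id`
    and `step`, as described for Quorum and QuorumMember.
    """
    max_step = max((m.step for m in quorum.participants), default=0)
    behind = [m.replica_id for m in quorum.participants if m.step < max_step]
    return max_step, behind
```

What to do with the lagging replicas, for example recovering state from an up-to-date peer, is left to the algorithm built on top of these APIs.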