
Coordination (Low Level API)

Warning

As torchft is still in development, the APIs in this module are subject to change.

This module exposes low level coordination APIs to allow you to build your own custom fault tolerance algorithms on top of torchft.

If you’re looking for a more complete solution, please use the other modules in torchft.

This provides direct access to the Lighthouse and Manager servers and clients.

class torchft.coordination.LighthouseClient(addr, connect_timeout)

Bases: object

LighthouseClient is a gRPC client for the lighthouse service.

It is used to communicate directly with the lighthouse server.

Parameters:
  • addr (str) – The HTTP address of the lighthouse server.

  • connect_timeout (timedelta) – The timeout for connecting to the lighthouse server.

quorum(replica_id, timeout, address=Ellipsis, store_address=Ellipsis, step=0, world_size=0, shrink_only=False, data=None)

quorum sends a request to the lighthouse server to form a quorum.

Parameters:
  • replica_id (str) – The string id of the replica calling quorum.

  • timeout (timedelta) – The timeout for quorum.

  • address (str) – The address of the replica calling quorum. Default: “”.

  • store_address (str) – The address of the store. Default: “”.

  • step (python:int) – The step of the replica calling quorum. Default: 0.

  • world_size (python:int) – The world size of the replica calling quorum. Default: 0.

  • shrink_only (bool) – Whether the quorum is for shrinking only. Default: False.

  • data (Optional[dict]) – The data to be passed with quorum.

Returns:

Current quorum if successful.

Return type:

Quorum
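
As a minimal sketch of calling quorum, the helper below wraps the call with a timeout given in seconds. The helper name request_quorum and its timeout_s parameter are illustrative and not part of torchft; only the quorum method and the parameter names from the signature above are used, and passing step by keyword is an assumption about the binding.

```python
from datetime import timedelta


def request_quorum(client, replica_id, step, timeout_s=10.0):
    """Ask the lighthouse for a quorum and return whatever it replies with.

    `client` is expected to behave like torchft.coordination.LighthouseClient.
    This helper is hypothetical; it only forwards to the documented
    `quorum` method.
    """
    return client.quorum(
        replica_id,
        timedelta(seconds=timeout_s),
        step=step,
    )


# Usage against a running lighthouse (the address is a placeholder):
#
#   from torchft.coordination import LighthouseClient
#   client = LighthouseClient(
#       addr="http://localhost:29510",
#       connect_timeout=timedelta(seconds=5),
#   )
#   quorum = request_quorum(client, replica_id="replica_0", step=0)
```

Keeping the client as a parameter makes the helper easy to exercise with a stand-in object in unit tests, without a lighthouse server running.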

class torchft.coordination.LighthouseServer(bind, min_replicas, join_timeout_ms=None, quorum_tick_ms=None, heartbeat_timeout_ms=None)

Bases: object

LighthouseServer is a gRPC server for the lighthouse service.

It is used to coordinate the ManagerServer for each replica group.

This entrypoint is primarily for testing and debugging purposes. The torchft_lighthouse command is recommended for most use cases.

Parameters:
  • bind (str) – The HTTP address to bind the server to.

  • min_replicas (python:int) – The minimum number of replicas required to form a quorum.

  • join_timeout_ms (python:int) – The timeout for joining the quorum, in milliseconds.

  • quorum_tick_ms (python:int) – The interval at which the quorum is checked, in milliseconds.

  • heartbeat_timeout_ms (python:int) – The timeout for heartbeats, in milliseconds.

address()

address returns the address of the lighthouse server.

Returns:

The address of the lighthouse server.

Return type:

str

shutdown()

shutdown shuts down the lighthouse server.
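
Because the server has to be shut down explicitly, it can be convenient to tie shutdown to a with block when using it in tests. The following is a sketch only: the running helper is hypothetical and not part of torchft, and it relies solely on the shutdown method documented above. The bind value in the usage comment is a placeholder.

```python
from contextlib import contextmanager


@contextmanager
def running(server):
    """Yield `server` and always call its shutdown() on exit.

    Works with any object exposing shutdown(), such as a
    LighthouseServer. This helper is illustrative, not a torchft API.
    """
    try:
        yield server
    finally:
        # Runs on normal exit and when the body raises.
        server.shutdown()


# Usage in a test (placeholder bind address and replica count):
#
#   from torchft.coordination import LighthouseServer
#   with running(LighthouseServer(bind="[::]:0", min_replicas=1)) as lighthouse:
#       print(lighthouse.address())
```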

class torchft.coordination.ManagerClient(addr, connect_timeout)

Bases: object

ManagerClient is a gRPC client for the manager service.

It is used by the trainer to communicate with the ManagerServer.

Parameters:
  • addr (str) – The HTTP address of the manager server.

  • connect_timeout (timedelta) – The timeout for connecting to the manager server.

should_commit(rank, step, should_commit, timeout)

should_commit asks the manager whether the trainer should commit the current step. The call waits until all ranks have checked in at the specified step and returns False if any worker passes should_commit=False.

Parameters:
  • rank (python:int) – The rank of the trainer.

  • step (python:int) – The step of the trainer.

  • should_commit (bool) – Whether the trainer should commit the current step.

  • timeout (timedelta) – The timeout for the request. If the request times out, a TimeoutError is raised.

Returns:

Whether the trainer should commit the current step.

Return type:

bool
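
One way to use should_commit is to report whether the local step succeeded and then act on the group-wide answer. The sketch below assumes a hypothetical helper, run_step_and_vote, and a caller-supplied train_step callable; neither is part of torchft. Treating any exception as a failed step is a design choice of this example, not documented behavior.

```python
from datetime import timedelta


def run_step_and_vote(client, rank, step, train_step, timeout_s=30.0):
    """Run one local step, then vote on whether the group should commit.

    `client` is expected to behave like torchft.coordination.ManagerClient.
    Returns the manager's decision for this step.
    """
    try:
        train_step()
        local_ok = True
    except Exception:
        # A local failure becomes a "do not commit" vote.
        local_ok = False
    # Arguments follow the documented order: rank, step, should_commit, timeout.
    return client.should_commit(
        rank, step, local_ok, timedelta(seconds=timeout_s)
    )
```

A caller would typically apply the optimizer update only when the returned value is True and discard the step otherwise.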

class torchft.coordination.ManagerServer(replica_id, lighthouse_addr, hostname, bind, store_addr, world_size, heartbeat_interval, connect_timeout)

Bases: object

ManagerServer is a gRPC server for the manager service. There should be one manager server per replica group (typically running on the rank 0 host). The individual ranks within a replica group should use ManagerClient to communicate with the manager server and participate in quorum operations.

Parameters:
  • replica_id (str) – The ID of the replica group.

  • lighthouse_addr (str) – The HTTP address of the lighthouse server.

  • hostname (str) – The hostname of the manager server.

  • bind (str) – The HTTP address to bind the server to.

  • store_addr (str) – The HTTP address of the store server.

  • world_size (python:int) – The world size of the replica group.

  • heartbeat_interval (timedelta) – The interval at which heartbeats are sent.

  • connect_timeout (timedelta) – The timeout for connecting to the lighthouse server.

address()

address returns the address of the manager server.

Returns:

The address of the manager server.

Return type:

str

shutdown()

shutdown shuts down the manager server.
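
Since the constructor takes eight arguments, collecting them in one place can make call sites easier to read. The sketch below builds a keyword dictionary using the parameter names listed above. The function name manager_server_kwargs is hypothetical, and every default value in it (the bind address, the heartbeat interval, the connect timeout) is a placeholder assumption, not a recommended setting.

```python
import socket
from datetime import timedelta


def manager_server_kwargs(replica_id, lighthouse_addr, store_addr, world_size):
    """Build keyword arguments matching the documented ManagerServer parameters.

    Illustrative helper, not part of torchft. Timing values are placeholders.
    """
    return {
        "replica_id": replica_id,
        "lighthouse_addr": lighthouse_addr,
        "hostname": socket.gethostname(),
        "bind": "[::]:0",  # assumption: let the OS choose a free port
        "store_addr": store_addr,
        "world_size": world_size,
        "heartbeat_interval": timedelta(milliseconds=100),
        "connect_timeout": timedelta(seconds=10),
    }


# Usage (addresses are placeholders):
#
#   from torchft.coordination import ManagerServer
#   server = ManagerServer(**manager_server_kwargs(
#       "replica_0", "http://localhost:29510", "localhost:29500", world_size=4,
#   ))
#   print(server.address())
#   server.shutdown()
```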

class torchft.coordination.Quorum

Bases: object

The result of a quorum request.

Parameters:
  • quorum_id (python:int) – The ID of the current quorum.

  • participants (list[QuorumMember]) – All members of the quorum.

  • created (timedelta) – The time at which the quorum was created on the server.

created
participants
quorum_id

class torchft.coordination.QuorumMember

Bases: object

A single member of a quorum.

Parameters:
  • replica_id (str) – The string id of the replica calling quorum.

  • address (str) – The address of the replica calling quorum.

  • store_address (str) – The address of the store.

  • step (python:int) – The step of the replica calling quorum.

  • world_size (python:int) – The world size of the replica calling quorum.

  • shrink_only (bool) – Whether the quorum is for shrinking only.

  • timeout (timedelta) – The timeout for quorum.

  • data (dict or None) – The data to be passed with quorum.

address
data
replica_id
shrink_only
step
store_address
world_size
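
The attributes of Quorum and QuorumMember can be combined to inspect the state of a quorum, for example to find replicas that are behind the most advanced step. This is a sketch under the assumption that the attributes listed above are readable as plain Python attributes; summarize_quorum is a hypothetical helper, not a torchft function.

```python
def summarize_quorum(quorum):
    """Return the quorum ID, member IDs, highest step, and lagging members.

    `quorum` is expected to expose `quorum_id` and `participants`, and each
    participant `replica_id` and `step`, as documented above.
    """
    members = sorted(quorum.participants, key=lambda m: m.replica_id)
    max_step = max((m.step for m in members), default=0)
    return {
        "quorum_id": quorum.quorum_id,
        "replica_ids": [m.replica_id for m in members],
        "max_step": max_step,
        # Members whose step is lower than the highest step in the quorum.
        "behind": [m.replica_id for m in members if m.step < max_step],
    }
```

A custom fault tolerance algorithm could use the "behind" list to decide which replicas need to recover state before training continues.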
