get_rewards_ppo

torchtune.rlhf.get_rewards_ppo(scores: Tensor, logprobs: Tensor, ref_logprobs: Tensor, kl_coeff: float, valid_score_idxs: Optional[Tensor] = None) → Tuple[Tensor, Tensor, Tensor]

Calculates PPO rewards for the given scores, logprobs, and reference logprobs.

Parameters:
  • scores (torch.Tensor) – Reward model scores, shape (b,).

  • logprobs (torch.Tensor) – Policy logprobs, shape (b, response_len).

  • ref_logprobs (torch.Tensor) – Reference base model logprobs, shape (b, response_len).

  • kl_coeff (float) – KL reward contribution coefficient.

  • valid_score_idxs (Optional[torch.Tensor]) – A tensor of indexes for valid (non-padded) token predictions. This is useful when calculating rewards for padded sequences, as scores and value estimates are defined for the last valid predicted token. Shape: (b,). Default None.

Returns:

A tuple of tensors, each with shape (b, response_len):
  • total_reward: the total reward, combining per-token KL rewards and the reward model score.

  • kl: the KL divergence between policy and reference policy logprobs.

  • kl_reward: the KL divergence scaled by kl_coeff.

Return type:

Tuple[torch.Tensor, torch.Tensor, torch.Tensor]

Notation used for tensor shapes:
  • b: batch size

  • response_len: model response length
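
Example

A minimal usage sketch. The tensor values are illustrative, and the per-token reward shaping described in the comments follows the standard PPO recipe implied by the return values above; it is an assumption about the exact implementation, not a statement of it.

```python
import torch
from torchtune import rlhf

b, response_len = 2, 4
scores = torch.tensor([1.0, -0.5])            # reward model scores, shape (b,)
logprobs = torch.randn(b, response_len)       # policy logprobs, shape (b, response_len)
ref_logprobs = torch.randn(b, response_len)   # reference model logprobs, shape (b, response_len)

total_reward, kl, kl_reward = rlhf.get_rewards_ppo(
    scores, logprobs, ref_logprobs, kl_coeff=0.1
)

# Conceptually (standard PPO reward shaping; a sketch, not necessarily the
# exact torchtune implementation):
#   kl           = logprobs - ref_logprobs
#   kl_reward    = -kl_coeff * kl
#   total_reward = kl_reward, with the reward model score added at the last
#                  valid token of each sequence (or at valid_score_idxs, if given).
print(total_reward.shape, kl.shape, kl_reward.shape)  # each is (b, response_len)
```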
