get_rewards_ppo
- torchtune.rlhf.get_rewards_ppo(scores: Tensor, logprobs: Tensor, ref_logprobs: Tensor, kl_coeff: float, valid_score_idxs: Optional[Tensor] = None) → Tuple[Tensor, Tensor, Tensor] [source]
Calculates PPO rewards for the given scores, logprobs, and reference logprobs.
- Parameters:
scores (torch.Tensor) – Reward model scores, shape (b,).
logprobs (torch.Tensor) – Policy logprobs, shape (b, response_len).
ref_logprobs (torch.Tensor) – Reference base model logprobs, shape (b, response_len).
kl_coeff (float) – KL reward contribution coefficient.
valid_score_idxs (Optional[torch.Tensor]) – A tensor of indexes for valid (non-padded) token predictions. This is useful when calculating rewards for padded sequences, as scores and value estimates are defined for the last valid predicted token. Shape (b,). Default: None.
- Returns:
- A tuple of tensors, each with shape (b, response_len):
total_reward: total reward combining per-token KL rewards and the reward model score.
kl: KL divergence between policy and reference policy logprobs.
kl_reward: KL divergence scaled by kl_coeff.
- Return type:
Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
- Notation used for tensor shapes:
b: batch size
response_len: model response length
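A minimal usage sketch is shown below; it is not part of the official docs. The batch size, response length, tensor values, padding index, and the 0.1 KL coefficient are illustrative placeholders.

```python
import torch
from torchtune.rlhf import get_rewards_ppo

b, response_len = 2, 5  # notation from above: batch size, model response length

scores = torch.tensor([1.0, 0.5])            # reward model scores, shape (b,)
logprobs = torch.randn(b, response_len)      # policy logprobs, shape (b, response_len)
ref_logprobs = torch.randn(b, response_len)  # reference base model logprobs, same shape

# Index of the last valid (non-padded) predicted token per sequence; here the
# second sequence is assumed to be padded after its fourth token.
valid_score_idxs = torch.tensor([response_len - 1, 3])

total_reward, kl, kl_reward = get_rewards_ppo(
    scores,
    logprobs,
    ref_logprobs,
    kl_coeff=0.1,  # illustrative value
    valid_score_idxs=valid_score_idxs,
)

# Each returned tensor has shape (b, response_len).
print(total_reward.shape, kl.shape, kl_reward.shape)
```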