torchtune.rlhf
Components and losses for RLHF algorithms like PPO and DPO.
estimate_advantages
    Estimates the advantages and returns for the PPO algorithm using Generalized Advantage Estimation: https://arxiv.org/pdf/1506.02438.pdf
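To make the recursion concrete, here is a minimal PyTorch sketch of GAE; the function name, tensor shapes, and default gamma/lmbda values are illustrative assumptions, not torchtune's actual signature:

    import torch

    def gae_sketch(rewards, values, gamma=0.99, lmbda=0.95):
        # Illustrative GAE, not torchtune's implementation.
        # rewards, values: [batch, seq_len]; the trajectory is assumed to end
        # at the final position, so no value is bootstrapped past it.
        batch, seq_len = rewards.shape
        advantages = torch.zeros_like(rewards)
        gae = torch.zeros(batch)
        for t in reversed(range(seq_len)):
            next_value = values[:, t + 1] if t < seq_len - 1 else torch.zeros(batch)
            # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
            delta = rewards[:, t] + gamma * next_value - values[:, t]
            # A_t = delta_t + gamma * lambda * A_{t+1}
            gae = delta + gamma * lmbda * gae
            advantages[:, t] = gae
        returns = advantages + values  # regression targets for the value head
        return advantages, returns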
get_rewards_ppo
    Calculates PPO rewards for the given scores, logprobs, and reference logprobs.
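The usual shaping, sketched below, is a per-token KL penalty against the reference policy with the reward-model score credited to the final token. The shapes (one scalar score per sequence, per-token log-probs) and the function name are assumptions for illustration, not torchtune's exact signature:

    import torch

    def ppo_rewards_sketch(scores, logprobs, ref_logprobs, kl_coeff=0.1):
        # Illustrative reward shaping, not torchtune's implementation.
        # scores: [batch]; logprobs, ref_logprobs: [batch, seq_len].
        kl = logprobs - ref_logprobs   # per-token KL estimate vs. the reference
        rewards = -kl_coeff * kl       # penalize drift from the reference policy
        rewards[:, -1] += scores       # the scalar score lands on the last token
        return rewards, kl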
truncate_sequence_at_first_stop_token
    Truncates sequence(s) after the first stop token and pads with fill_value.
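A minimal sketch of this operation, assuming a single stop-token id and [batch, seq_len] integer tokens; names are illustrative, not torchtune's signature:

    import torch

    def truncate_after_stop_sketch(tokens, stop_token_id, fill_value=0):
        # Illustrative truncation, not torchtune's implementation.
        # Keeps each row up to and including its first stop token, then
        # overwrites everything after it with fill_value.
        is_stop = (tokens == stop_token_id).int()
        after_stop = (torch.cumsum(is_stop, dim=-1) - is_stop) > 0
        return tokens.masked_fill(after_stop, fill_value)

For example, with stop_token_id=2 and fill_value=0, the row [5, 7, 2, 9, 9] becomes [5, 7, 2, 0, 0].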
loss.PPOLoss
    Proximal Policy Optimization (PPO) Loss module.
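The policy part of PPO is the clipped surrogate objective, sketched here for illustration; a full PPO loss typically also includes a clipped value loss, omitted below, and the names and default epsilon are assumptions:

    import torch

    def ppo_clip_loss_sketch(logprobs, old_logprobs, advantages, epsilon=0.2):
        # Illustrative clipped surrogate objective, not torchtune's implementation.
        ratio = torch.exp(logprobs - old_logprobs)  # pi_theta / pi_theta_old
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
        # Pessimistic bound: take the smaller surrogate, negate for gradient descent.
        return -torch.minimum(unclipped, clipped).mean()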
loss.DPOLoss
    Direct Preference Optimization (DPO) Loss module: https://arxiv.org/abs/2305.18290
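The core of the DPO objective, sketched under the assumption that each input is a [batch] tensor of summed sequence log-probabilities; an illustration, not torchtune's module:

    import torch
    import torch.nn.functional as F

    def dpo_loss_sketch(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Illustrative DPO loss, not torchtune's implementation.
        chosen_logratio = policy_chosen_logps - ref_chosen_logps
        rejected_logratio = policy_rejected_logps - ref_rejected_logps
        # -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))
        return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()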
loss.RSOLoss
    Statistical Rejection Sampling Optimization (RSO) or "hinge" loss module: https://arxiv.org/abs/2309.06657.
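RSO swaps DPO's log-sigmoid for a hinge on the same policy-vs-reference logits; a sketch under the same assumptions as above, with an illustrative name and default gamma:

    import torch

    def rso_hinge_loss_sketch(policy_chosen_logps, policy_rejected_logps,
                              ref_chosen_logps, ref_rejected_logps, gamma=0.1):
        # Illustrative RSO hinge loss, not torchtune's implementation.
        logits = (policy_chosen_logps - ref_chosen_logps) - (
            policy_rejected_logps - ref_rejected_logps
        )
        return torch.relu(1 - gamma * logits).mean()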
loss.SimPOLoss
    SimPO: Simple Preference Optimization with a Reference-Free Reward: https://arxiv.org/abs/2405.14734.
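SimPO drops the reference model entirely: rewards are length-normalized sequence log-probabilities, and a target margin gamma separates chosen from rejected. A sketch with illustrative names and assumed default hyperparameters:

    import torch
    import torch.nn.functional as F

    def simpo_loss_sketch(chosen_logps, rejected_logps,
                          chosen_lens, rejected_lens, beta=2.0, gamma=0.5):
        # Illustrative SimPO loss, not torchtune's implementation.
        chosen_reward = beta * (chosen_logps / chosen_lens)
        rejected_reward = beta * (rejected_logps / rejected_lens)
        return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()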