torchtune.rlhf
Components and losses for RLHF algorithms like PPO and DPO.
estimate_advantages
    Estimates the advantages and returns for the PPO algorithm using Generalized Advantage Estimation: https://arxiv.org/pdf/1506.02438.pdf
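To make the recursion concrete, here is a minimal PyTorch sketch of GAE; the function name, tensor shapes, and default gamma/lmbda values are illustrative assumptions, not torchtune's actual signature:

    import torch

    def gae_sketch(rewards, values, gamma=0.99, lmbda=0.95):
        # Illustrative GAE, not torchtune's implementation.
        # rewards, values: [batch, seq_len]; the trajectory is assumed to end
        # at the final position, so no value is bootstrapped past it.
        batch, seq_len = rewards.shape
        advantages = torch.zeros_like(rewards)
        gae = torch.zeros(batch)
        for t in reversed(range(seq_len)):
            next_value = values[:, t + 1] if t < seq_len - 1 else torch.zeros(batch)
            # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
            delta = rewards[:, t] + gamma * next_value - values[:, t]
            # A_t = delta_t + gamma * lambda * A_{t+1}
            gae = delta + gamma * lmbda * gae
            advantages[:, t] = gae
        returns = advantages + values  # regression targets for the value head
        return advantages, returns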
get_rewards_ppo
    Calculates PPO rewards for the given scores, logprobs, and reference logprobs.
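The usual shaping, sketched below, is a per-token KL penalty against the reference policy with the reward-model score credited to the final token. The shapes (one scalar score per sequence, per-token log-probs) and the function name are assumptions for illustration, not torchtune's exact signature:

    import torch

    def ppo_rewards_sketch(scores, logprobs, ref_logprobs, kl_coeff=0.1):
        # Illustrative reward shaping, not torchtune's implementation.
        # scores: [batch]; logprobs, ref_logprobs: [batch, seq_len].
        kl = logprobs - ref_logprobs   # per-token KL estimate vs. the reference
        rewards = -kl_coeff * kl       # penalize drift from the reference policy
        rewards[:, -1] += scores       # the scalar score lands on the last token
        return rewards, kl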
truncate_sequence_at_first_stop_token
    Truncates sequence(s) after the first stop token and pads with fill_value.
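A minimal sketch of this operation, assuming a single stop-token id and [batch, seq_len] integer tokens; names are illustrative, not torchtune's signature:

    import torch

    def truncate_after_stop_sketch(tokens, stop_token_id, fill_value=0):
        # Illustrative truncation, not torchtune's implementation.
        # Keeps each row up to and including its first stop token, then
        # overwrites everything after it with fill_value.
        is_stop = (tokens == stop_token_id).int()
        after_stop = (torch.cumsum(is_stop, dim=-1) - is_stop) > 0
        return tokens.masked_fill(after_stop, fill_value)

For example, with stop_token_id=2 and fill_value=0, the row [5, 7, 2, 9, 9] becomes [5, 7, 2, 0, 0].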
loss.PPOLoss
    Proximal Policy Optimization (PPO) Loss module.
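The policy part of PPO is the clipped surrogate objective, sketched here for illustration; a full PPO loss typically also includes a clipped value loss, omitted below, and the names and default epsilon are assumptions:

    import torch

    def ppo_clip_loss_sketch(logprobs, old_logprobs, advantages, epsilon=0.2):
        # Illustrative clipped surrogate objective, not torchtune's implementation.
        ratio = torch.exp(logprobs - old_logprobs)  # pi_theta / pi_theta_old
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
        # Pessimistic bound: take the smaller surrogate, negate for gradient descent.
        return -torch.minimum(unclipped, clipped).mean()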
loss.DPOLoss
    Direct Preference Optimization (DPO) Loss module: https://arxiv.org/abs/2305.18290
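The core of the DPO objective, sketched under the assumption that each input is a [batch] tensor of summed sequence log-probabilities; an illustration, not torchtune's module:

    import torch
    import torch.nn.functional as F

    def dpo_loss_sketch(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Illustrative DPO loss, not torchtune's implementation.
        chosen_logratio = policy_chosen_logps - ref_chosen_logps
        rejected_logratio = policy_rejected_logps - ref_rejected_logps
        # -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))
        return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()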
loss.RSOLoss
    Statistical Rejection Sampling Optimization (RSO) or "hinge" loss module: https://arxiv.org/abs/2309.06657.
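RSO swaps DPO's log-sigmoid for a hinge on the same policy-vs-reference logits; a sketch under the same assumptions as above, with an illustrative name and default gamma:

    import torch

    def rso_hinge_loss_sketch(policy_chosen_logps, policy_rejected_logps,
                              ref_chosen_logps, ref_rejected_logps, gamma=0.1):
        # Illustrative RSO hinge loss, not torchtune's implementation.
        logits = (policy_chosen_logps - ref_chosen_logps) - (
            policy_rejected_logps - ref_rejected_logps
        )
        return torch.relu(1 - gamma * logits).mean()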
loss.SimPOLoss
    SimPO: Simple Preference Optimization with a Reference-Free Reward: https://arxiv.org/abs/2405.14734.
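SimPO drops the reference model entirely: rewards are length-normalized sequence log-probabilities, and a target margin gamma separates chosen from rejected. A sketch with illustrative names and assumed default hyperparameters:

    import torch
    import torch.nn.functional as F

    def simpo_loss_sketch(chosen_logps, rejected_logps,
                          chosen_lens, rejected_lens, beta=2.0, gamma=0.5):
        # Illustrative SimPO loss, not torchtune's implementation.
        chosen_reward = beta * (chosen_logps / chosen_lens)
        rejected_reward = beta * (rejected_logps / rejected_lens)
        return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()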