torchtune.rlhf

Components and losses for RLHF algorithms like PPO and DPO.

estimate_advantages

Estimates the advantages and returns for the PPO algorithm using Generalized Advantage Estimation (GAE): https://arxiv.org/pdf/1506.02438.pdf.
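
As a rough illustration of what GAE computes, the sketch below applies the standard recurrence delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), A_t = delta_t + gamma * lambda * A_{t+1} right-to-left over a trajectory. The function name, tensor shapes, and defaults here are assumptions for illustration, not torchtune's exact signature::

    import torch

    def gae_sketch(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 0.99, lmbda: float = 0.95):
        """Illustrative GAE over [batch, seq_len] tensors."""
        advantages = torch.zeros_like(rewards)
        last_advantage = torch.zeros_like(rewards[:, 0])
        seq_len = rewards.shape[1]
        for t in reversed(range(seq_len)):
            # V(s_{t+1}); zero past the end of the trajectory
            next_value = (values[:, t + 1] if t < seq_len - 1
                          else torch.zeros_like(values[:, t]))
            # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
            delta = rewards[:, t] + gamma * next_value - values[:, t]
            # GAE recurrence: A_t = delta_t + gamma * lambda * A_{t+1}
            last_advantage = delta + gamma * lmbda * last_advantage
            advantages[:, t] = last_advantage
        # Value-function targets ("returns") are advantages + values
        returns = advantages + values
        return advantages, returns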

get_rewards_ppo

Calculates PPO rewards for the given scores, logprobs, and reference logprobs.
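
A common formulation of RLHF reward shaping, sketched below under assumptions: a per-token KL penalty against the frozen reference model, with the scalar reward-model score added at the final token. Argument names, shapes, and the score placement are illustrative, not the library's exact API::

    import torch

    def rewards_sketch(scores: torch.Tensor, logprobs: torch.Tensor,
                       ref_logprobs: torch.Tensor, kl_coeff: float = 0.1):
        """Illustrative: scores [batch], logprobs / ref_logprobs [batch, seq_len]."""
        # Per-token KL estimate between the policy and the frozen reference
        kl = logprobs - ref_logprobs
        # The KL penalty discourages drifting away from the reference model
        rewards = -kl_coeff * kl
        # The scalar reward-model score lands on the final token of each sequence
        rewards[:, -1] += scores
        return rewards, kl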

truncate_sequence_at_first_stop_token

Truncates sequence(s) after the first stop token and pads with fill_value.
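
A minimal sketch of the idea, assuming batched token-id tensors: keep everything up to and including the first stop token, then overwrite the rest with fill_value. The helper name and mask semantics below are assumptions for illustration::

    import torch

    def truncate_sketch(sequences: torch.Tensor, stop_tokens: torch.Tensor,
                        fill_value: int = 0):
        """Illustrative: sequences [batch, seq_len], stop_tokens is a 1-D id tensor."""
        # Flag every stop-token position, then count stops seen so far
        is_stop = torch.isin(sequences, stop_tokens)
        stops_so_far = is_stop.cumsum(dim=-1)
        # Pad strictly after the first stop token (the stop token itself is kept)
        pad_mask = (stops_so_far > 0) & ~(is_stop & (stops_so_far == 1))
        return sequences.masked_fill(pad_mask, fill_value), pad_mask

For example, with stop_tokens = torch.tensor([2]), the sequence [[5, 6, 2, 7, 8]] would become [[5, 6, 2, 0, 0]].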

loss.PPOLoss

Proximal Policy Optimization (PPO) Loss module.
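
At its core this is the clipped surrogate objective from the PPO paper, sketched below over per-token tensors; torchtune's actual module also handles value-head losses and masking, so treat the names and reduction here as assumptions::

    import torch

    def ppo_loss_sketch(logprobs: torch.Tensor, old_logprobs: torch.Tensor,
                        advantages: torch.Tensor, epsilon: float = 0.2):
        """Illustrative clipped surrogate policy loss."""
        # Probability ratio between the current policy and the sampling policy
        ratio = torch.exp(logprobs - old_logprobs)
        # Unclipped vs. clipped surrogate objectives
        surrogate = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
        # Pessimistic (elementwise min) bound, negated because optimizers minimize
        return -torch.minimum(surrogate, clipped).mean()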

loss.DPOLoss

Direct Preference Optimization (DPO) loss module: https://arxiv.org/abs/2305.18290.
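
The DPO objective is a logistic loss on the margin between the policy's implicit rewards for the chosen and rejected responses, each measured as a beta-scaled log-ratio against the reference model. A minimal sketch, with function name and input shapes assumed for illustration::

    import torch
    import torch.nn.functional as F

    def dpo_loss_sketch(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
        """Illustrative DPO loss over summed per-sequence log-probs, shape [batch]."""
        # Implicit reward of each response: beta-scaled log-ratio vs. the reference
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # Logistic loss on the preference margin
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()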

loss.RSOLoss

Statistical Rejection Sampling Optimization (RSO) or "hinge" loss module: https://arxiv.org/abs/2309.06657.
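
RSO replaces DPO's logistic loss with a hinge on the same preference margin. The sketch below is an assumption-laden illustration (scaling-factor placement and names are not guaranteed to match torchtune's module)::

    import torch

    def rso_loss_sketch(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps, gamma: float = 0.1):
        """Illustrative RSO hinge loss; the logits are the same margin DPO uses."""
        logits = ((policy_chosen_logps - ref_chosen_logps)
                  - (policy_rejected_logps - ref_rejected_logps))
        # Hinge: loss is zero once the scaled margin exceeds 1
        return torch.relu(1.0 - gamma * logits).mean()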

loss.SimPOLoss

Simple Preference Optimization with a Reference-Free Reward: https://arxiv.org/abs/2405.14734.
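
SimPO drops the reference model entirely: the implicit reward is the length-normalized average log-probability of each response, and the chosen response must beat the rejected one by a target margin gamma. A minimal sketch, assuming per-sequence summed log-probs and token counts as inputs::

    import torch
    import torch.nn.functional as F

    def simpo_loss_sketch(chosen_logps, rejected_logps,
                          chosen_lens, rejected_lens,
                          beta: float = 2.0, gamma: float = 0.5):
        """Illustrative SimPO: reference-free, length-normalized, shape [batch]."""
        # Average log-probability per token acts as the implicit reward
        chosen_avg = chosen_logps / chosen_lens
        rejected_avg = rejected_logps / rejected_lens
        # Require the chosen response to win by at least the target margin gamma
        logits = beta * (chosen_avg - rejected_avg) - gamma
        return -F.logsigmoid(logits).mean()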
