DPOLoss

class torchtune.rlhf.loss.DPOLoss(beta: float = 0.1, label_smoothing: float = 0.0)[source]

Direct Preference Optimization (DPO) loss module: https://arxiv.org/abs/2305.18290. As stated in the paper:

Intuitively, the DPO update increases the relative log probability of preferred to dispreferred responses, but it incorporates a dynamic, per-example importance weight that prevents the model degeneration that we find occurs with a naive probability ratio objective.

Based on the implementation in HF’s TRL library: https://github.com/huggingface/trl/blob/5d1deb1445828cfd0e947cb3a7925b1c03a283fc/trl/trainer/dpo_trainer.py#L844

DPO is similar in spirit to PPO-based RLHF (https://arxiv.org/abs/2009.01325): it optimizes a policy (language) model to align with human preferences, and regularizes the loss using a baseline reference (the frozen, initial language model) to prevent over-fitting to the preference dataset. It differs from PPO by optimizing the policy model directly on labelled preference data, rather than relying on a separate reward model to provide feedback. This significantly simplifies training and reduces compute overhead.
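For intuition, the core per-example computation can be sketched as below. This is a minimal sketch of the standard (optionally label-smoothed) DPO objective using toy random tensors, not necessarily the exact torchtune implementation:

>>> import torch
>>> import torch.nn.functional as F
>>> beta, label_smoothing = 0.1, 0.0
>>> # toy per-example log probabilities, shape (batch_size,)
>>> policy_chosen_logps, policy_rejected_logps = torch.randn(4), torch.randn(4)
>>> reference_chosen_logps, reference_rejected_logps = torch.randn(4), torch.randn(4)
>>> # difference of policy and reference log-ratios for chosen vs. rejected
>>> logits = (policy_chosen_logps - policy_rejected_logps) - (
...     reference_chosen_logps - reference_rejected_logps
... )
>>> # sigmoid loss, with optional label smoothing to encode label uncertainty
>>> losses = (
...     -F.logsigmoid(beta * logits) * (1 - label_smoothing)
...     - F.logsigmoid(-beta * logits) * label_smoothing
... )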

Parameters:
  • beta (float) – Temperature parameter for the DPO loss, typically in the range of 0.1 to 0.5. Default is 0.1.

  • label_smoothing (float) – Parameter encoding uncertainty about the labels. Default is 0.

forward(policy_chosen_logps: Tensor, policy_rejected_logps: Tensor, reference_chosen_logps: Tensor, reference_rejected_logps: Tensor) → Tuple[Tensor, Tensor, Tensor][source]

Compute the DPO loss for a batch of policy and reference model log probabilities.

Parameters:
  • policy_chosen_logps (torch.Tensor) – Log probabilities of the policy model for the chosen responses. Shape: (batch_size)

  • policy_rejected_logps (torch.Tensor) – Log probabilities of the policy model for the rejected responses. Shape: (batch_size)

  • reference_chosen_logps (torch.Tensor) – Log probabilities of the reference model for the chosen responses. Shape: (batch_size)

  • reference_rejected_logps (torch.Tensor) – Log probabilities of the reference model for the rejected responses. Shape: (batch_size)

Returns:

A tuple of three tensors:
  • losses: The DPO loss for each example in the batch.

  • chosen_rewards: Rewards for the chosen responses.

  • rejected_rewards: Rewards for the rejected responses.

Return type:

Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
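
Example. A minimal usage sketch, assuming the class is importable from the module path shown above and that per-example log probabilities have already been summed over the response tokens of each sequence (the random tensors here are placeholders for those values):

>>> import torch
>>> from torchtune.rlhf.loss import DPOLoss
>>> loss_fn = DPOLoss(beta=0.1, label_smoothing=0.0)
>>> batch_size = 4
>>> # per-example summed log probabilities, each of shape (batch_size,)
>>> policy_chosen_logps = torch.randn(batch_size)
>>> policy_rejected_logps = torch.randn(batch_size)
>>> reference_chosen_logps = torch.randn(batch_size)
>>> reference_rejected_logps = torch.randn(batch_size)
>>> losses, chosen_rewards, rejected_rewards = loss_fn(
...     policy_chosen_logps,
...     policy_rejected_logps,
...     reference_chosen_logps,
...     reference_rejected_logps,
... )
>>> losses.shape, chosen_rewards.shape, rejected_rewards.shape
(torch.Size([4]), torch.Size([4]), torch.Size([4]))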
