- torchrl.objectives.value.functional.reward2go(reward, done, gamma, time_dim: int = -2)
Compute the discounted cumulative sum of rewards given multiple trajectories and the episode ends.
- Parameters:
  - reward (torch.Tensor) – A tensor containing the rewards received at each time step over multiple trajectories.
  - done (torch.Tensor) – A boolean tensor marking done (or truncated) steps, i.e. the ends of trajectories.
  - gamma (float, optional) – The discount factor to use for computing the discounted cumulative sum of rewards. Defaults to 1.0.
  - time_dim (int) – dimension where the time is unrolled. Defaults to -2.
- Returns:
  A tensor of shape [B, T] containing the discounted cumulative sum of rewards (reward-to-go) at each time step.
- Return type:
  torch.Tensor
- Examples:
  >>> reward = torch.ones(1, 10)
  >>> done = torch.zeros(1, 10, dtype=torch.bool)
  >>> done[:, [3, 7]] = True
  >>> reward2go(reward, done, 0.99, time_dim=-1)
  tensor([[3.9404],
          [2.9701],
          [1.9900],
          [1.0000],
          [3.9404],
          [2.9701],
          [1.9900],
          [1.0000],
          [1.9900],
          [1.0000]])
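The recursion this function computes can be sketched in pure Python (an illustrative reference, not the torchrl implementation): walking backwards through time, the running discounted sum is reset whenever a `done` flag marks the last step of an episode.

```python
def reward_to_go(rewards, dones, gamma=1.0):
    """Illustrative reward-to-go over a single flattened trajectory batch.

    rewards: list of float rewards per step
    dones:   list of bool flags; True marks the last step of an episode
    gamma:   discount factor
    """
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if dones[t]:
            # t is the final step of an episode: discard any accumulated
            # return from the following (separate) episode.
            running = 0.0
        running = rewards[t] + gamma * running
        out[t] = running
    return out
```

With `gamma=0.99`, unit rewards, and episode ends at steps 3 and 7, this reproduces the values shown in the example above (3.9404, 2.9701, 1.9900, 1.0000, ...).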