• Docs >
  • Pendulum: Writing your environment and transforms with TorchRL
Shortcuts

Pendulum: Writing your environment and transforms with TorchRL

Author: Vincent Moens

Creating an environment (a simulator or an interface to a physical control system) is an integrative part of reinforcement learning and control engineering.

TorchRL provides a set of tools to do this in multiple contexts. This tutorial demonstrates how to use PyTorch and TorchRL code a pendulum simulator from the ground up. It is freely inspired by the Pendulum-v1 implementation from OpenAI-Gym/Farama-Gymnasium control library.

Pendulum

Simple Pendulum

Key learnings:

  • How to design an environment in TorchRL: - Writing specs (input, observation and reward); - Implementing behavior: seeding, reset and step.

  • Transforming your environment inputs and outputs, and writing your own transforms;

  • How to use TensorDict to carry arbitrary data structures through the codebase.

    In the process, we will touch three crucial components of TorchRL:

To give a sense of what can be achieved with TorchRL’s environments, we will be designing a stateless environment. While stateful environments keep track of the latest physical state encountered and rely on this to simulate the state-to-state transition, stateless environments expect the current state to be provided to them at each step, along with the action undertaken. TorchRL supports both types of environments, but stateless environments are more generic and hence cover a broader range of features of the environment API in TorchRL.

Modeling stateless environments gives users full control over the input and outputs of the simulator: one can reset an experiment at any stage or actively modify the dynamics from the outside. However, it assumes that we have some control over a task, which may not always be the case: solving a problem where we cannot control the current state is more challenging but has a much wider set of applications.

Another advantage of stateless environments is that they can enable batched execution of transition simulations. If the backend and the implementation allow it, an algebraic operation can be executed seamlessly on scalars, vectors, or tensors. This tutorial gives such examples.

This tutorial will be structured as follows:

  • We will first get acquainted with the environment properties: its shape (batch_size), its methods (mainly step(), reset() and set_seed()) and finally its specs.

  • After having coded our simulator, we will demonstrate how it can be used during training with transforms.

  • We will explore new avenues that follow from the TorchRL’s API, including: the possibility of transforming inputs, the vectorized execution of the simulation and the possibility of backpropagation through the simulation graph.

  • Finally, we will train a simple policy to solve the system we implemented.

A built-in version of this environment can be found in class:~torchrl.envs.PendulumEnv.

from collections import defaultdict
from typing import Optional

import numpy as np
import torch
import tqdm
from tensordict import TensorDict, TensorDictBase
from tensordict.nn import TensorDictModule
from torch import nn

from torchrl.data import Bounded, Composite, Unbounded
from torchrl.envs import (
    CatTensors,
    EnvBase,
    Transform,
    TransformedEnv,
    UnsqueezeTransform,
)
from torchrl.envs.transforms.transforms import _apply_to_composite
from torchrl.envs.utils import check_env_specs, step_mdp

DEFAULT_X = np.pi
DEFAULT_Y = 1.0

There are four things you must take care of when designing a new environment class:

  • EnvBase._reset(), which codes for the resetting of the simulator at a (potentially random) initial state;

  • EnvBase._step() which codes for the state transition dynamic;

  • EnvBase._set_seed() which implements the seeding mechanism;

  • the environment specs.

Let us first describe the problem at hand: we would like to model a simple pendulum over which we can control the torque applied on its fixed point. Our goal is to place the pendulum in upward position (angular position at 0 by convention) and having it standing still in that position. To design our dynamic system, we need to define two equations: the motion equation following an action (the torque applied) and the reward equation that will constitute our objective function.

For the motion equation, we will update the angular velocity following:

\[\dot{\theta}_{t+1} = \dot{\theta}_t + (3 * g / (2 * L) * \sin(\theta_t) + 3 / (m * L^2) * u) * dt\]

where \(\dot{\theta}\) is the angular velocity in rad/sec, \(g\) is the gravitational force, \(L\) is the pendulum length, \(m\) is its mass, \(\theta\) is its angular position and \(u\) is the torque. The angular position is then updated according to

\[\theta_{t+1} = \theta_{t} + \dot{\theta}_{t+1} dt\]

We define our reward as

\[r = -(\theta^2 + 0.1 * \dot{\theta}^2 + 0.001 * u^2)\]

which will be maximized when the angle is close to 0 (pendulum in upward position), the angular velocity is close to 0 (no motion) and the torque is 0 too.

Coding the effect of an action: _step()

The step method is the first thing to consider, as it will encode the simulation that is of interest to us. In TorchRL, the EnvBase class has a EnvBase.step() method that receives a tensordict.TensorDict instance with an "action" entry indicating what action is to be taken.

To facilitate the reading and writing from that tensordict and to make sure that the keys are consistent with what’s expected from the library, the simulation part has been delegated to a private abstract method _step() which reads input data from a tensordict, and writes a new tensordict with the output data.

The _step() method should do the following:

  1. Read the input keys (such as "action") and execute the simulation based on these;

  2. Retrieve observations, done state and reward;

  3. Write the set of observation values along with the reward and done state at the corresponding entries in a new TensorDict.

Next, the step() method will merge the output of step() in the input tensordict to enforce input/output consistency.

Typically, for stateful environments, this will look like this:

>>> policy(env.reset())
>>> print(tensordict)
TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)
>>> env.step(tensordict)
>>> print(tensordict)
TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=cpu,
            is_shared=False),
        observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)

Notice that the root tensordict has not changed, the only modification is the appearance of a new "next" entry that contains the new information.

In the Pendulum example, our _step() method will read the relevant entries from the input tensordict and compute the position and velocity of the pendulum after the force encoded by the "action" key has been applied onto it. We compute the new angular position of the pendulum "new_th" as the result of the previous position "th" plus the new velocity "new_thdot" over a time interval dt.

Since our goal is to turn the pendulum up and maintain it still in that position, our cost (negative reward) function is lower for positions close to the target and low speeds. Indeed, we want to discourage positions that are far from being “upward” and/or speeds that are far from 0.

In our example, EnvBase._step() is encoded as a static method since our environment is stateless. In stateful settings, the self argument is needed as the state needs to be read from the environment.

def _step(tensordict):
    th, thdot = tensordict["th"], tensordict["thdot"]  # th := theta

    g_force = tensordict["params", "g"]
    mass = tensordict["params", "m"]
    length = tensordict["params", "l"]
    dt = tensordict["params", "dt"]
    u = tensordict["action"].squeeze(-1)
    u = u.clamp(-tensordict["params", "max_torque"], tensordict["params", "max_torque"])
    costs = angle_normalize(th) ** 2 + 0.1 * thdot**2 + 0.001 * (u**2)

    new_thdot = (
        thdot
        + (3 * g_force / (2 * length) * th.sin() + 3.0 / (mass * length**2) * u) * dt
    )
    new_thdot = new_thdot.clamp(
        -tensordict["params", "max_speed"], tensordict["params", "max_speed"]
    )
    new_th = th + new_thdot * dt
    reward = -costs.view(*tensordict.shape, 1)
    done = torch.zeros_like(reward, dtype=torch.bool)
    out = TensorDict(
        {
            "th": new_th,
            "thdot": new_thdot,
            "params": tensordict["params"],
            "reward": reward,
            "done": done,
        },
        tensordict.shape,
    )
    return out


def angle_normalize(x):
    return ((x + torch.pi) % (2 * torch.pi)) - torch.pi

Resetting the simulator: _reset()

The second method we need to care about is the _reset() method. Like _step(), it should write the observation entries and possibly a done state in the tensordict it outputs (if the done state is omitted, it will be filled as False by the parent method reset()). In some contexts, it is required that the _reset method receives a command from the function that called it (for example, in multi-agent settings we may want to indicate which agents need to be reset). This is why the _reset() method also expects a tensordict as input, albeit it may perfectly be empty or None.

The parent EnvBase.reset() does some simple checks like the EnvBase.step() does, such as making sure that a "done" state is returned in the output tensordict and that the shapes match what is expected from the specs.

For us, the only important thing to consider is whether EnvBase._reset() contains all the expected observations. Once more, since we are working with a stateless environment, we pass the configuration of the pendulum in a nested tensordict named "params".

In this example, we do not pass a done state as this is not mandatory for _reset() and our environment is non-terminating, so we always expect it to be False.

def _reset(self, tensordict):
    if tensordict is None or tensordict.is_empty():
        # if no ``tensordict`` is passed, we generate a single set of hyperparameters
        # Otherwise, we assume that the input ``tensordict`` contains all the relevant
        # parameters to get started.
        tensordict = self.gen_params(batch_size=self.batch_size)

    high_th = torch.tensor(DEFAULT_X, device=self.device)
    high_thdot = torch.tensor(DEFAULT_Y, device=self.device)
    low_th = -high_th
    low_thdot = -high_thdot

    # for non batch-locked environments, the input ``tensordict`` shape dictates the number
    # of simulators run simultaneously. In other contexts, the initial
    # random state's shape will depend upon the environment batch-size instead.
    th = (
        torch.rand(tensordict.shape, generator=self.rng, device=self.device)
        * (high_th - low_th)
        + low_th
    )
    thdot = (
        torch.rand(tensordict.shape, generator=self.rng, device=self.device)
        * (high_thdot - low_thdot)
        + low_thdot
    )
    out = TensorDict(
        {
            "th": th,
            "thdot": thdot,
            "params": tensordict["params"],
        },
        batch_size=tensordict.shape,
    )
    return out

Environment metadata: env.*_spec

The specs define the input and output domain of the environment. It is important that the specs accurately define the tensors that will be received at runtime, as they are often used to carry information about environments in multiprocessing and distributed settings. They can also be used to instantiate lazily defined neural networks and test scripts without actually querying the environment (which can be costly with real-world physical systems for instance).

There are four specs that we must code in our environment:

  • EnvBase.observation_spec: This will be a CompositeSpec instance where each key is an observation (a CompositeSpec can be viewed as a dictionary of specs).

  • EnvBase.action_spec: It can be any type of spec, but it is required that it corresponds to the "action" entry in the input tensordict;

  • EnvBase.reward_spec: provides information about the reward space;

  • EnvBase.done_spec: provides information about the space of the done flag.

TorchRL specs are organized in two general containers: input_spec which contains the specs of the information that the step function reads (divided between action_spec containing the action and state_spec containing all the rest), and output_spec which encodes the specs that the step outputs (observation_spec, reward_spec and done_spec). In general, you should not interact directly with output_spec and input_spec but only with their content: observation_spec, reward_spec, done_spec, action_spec and state_spec. The reason if that the specs are organized in a non-trivial way within output_spec and input_spec and neither of these should be directly modified.

In other words, the observation_spec and related properties are convenient shortcuts to the content of the output and input spec containers.

TorchRL offers multiple TensorSpec subclasses to encode the environment’s input and output characteristics.

Specs shape

The environment specs leading dimensions must match the environment batch-size. This is done to enforce that every component of an environment (including its transforms) have an accurate representation of the expected input and output shapes. This is something that should be accurately coded in stateful settings.

For non batch-locked environments, such as the one in our example (see below), this is irrelevant as the environment batch size will most likely be empty.

def _make_spec(self, td_params):
    # Under the hood, this will populate self.output_spec["observation"]
    self.observation_spec = Composite(
        th=Bounded(
            low=-torch.pi,
            high=torch.pi,
            shape=(),
            dtype=torch.float32,
        ),
        thdot=Bounded(
            low=-td_params["params", "max_speed"],
            high=td_params["params", "max_speed"],
            shape=(),
            dtype=torch.float32,
        ),
        # we need to add the ``params`` to the observation specs, as we want
        # to pass it at each step during a rollout
        params=make_composite_from_td(td_params["params"]),
        shape=(),
    )
    # since the environment is stateless, we expect the previous output as input.
    # For this, ``EnvBase`` expects some state_spec to be available
    self.state_spec = self.observation_spec.clone()
    # action-spec will be automatically wrapped in input_spec when
    # `self.action_spec = spec` will be called supported
    self.action_spec = Bounded(
        low=-td_params["params", "max_torque"],
        high=td_params["params", "max_torque"],
        shape=(1,),
        dtype=torch.float32,
    )
    self.reward_spec = Unbounded(shape=(*td_params.shape, 1))


def make_composite_from_td(td):
    # custom function to convert a ``tensordict`` in a similar spec structure
    # of unbounded values.
    composite = Composite(
        {
            key: make_composite_from_td(tensor)
            if isinstance(tensor, TensorDictBase)
            else Unbounded(dtype=tensor.dtype, device=tensor.device, shape=tensor.shape)
            for key, tensor in td.items()
        },
        shape=td.shape,
    )
    return composite

Reproducible experiments: seeding

Seeding an environment is a common operation when initializing an experiment. The only goal of EnvBase._set_seed() is to set the seed of the contained simulator. If possible, this operation should not call reset() or interact with the environment execution. The parent EnvBase.set_seed() method incorporates a mechanism that allows seeding multiple environments with a different pseudo-random and reproducible seed.

def _set_seed(self, seed: Optional[int]):
    rng = torch.manual_seed(seed)
    self.rng = rng

Wrapping things together: the EnvBase class

We can finally put together the pieces and design our environment class. The specs initialization needs to be performed during the environment construction, so we must take care of calling the _make_spec() method within PendulumEnv.__init__().

We add a static method PendulumEnv.gen_params() which deterministically generates a set of hyperparameters to be used during execution:

def gen_params(g=10.0, batch_size=None) -> TensorDictBase:
    """Returns a ``tensordict`` containing the physical parameters such as gravitational force and torque or speed limits."""
    if batch_size is None:
        batch_size = []
    td = TensorDict(
        {
            "params": TensorDict(
                {
                    "max_speed": 8,
                    "max_torque": 2.0,
                    "dt": 0.05,
                    "g": g,
                    "m": 1.0,
                    "l": 1.0,
                },
                [],
            )
        },
        [],
    )
    if batch_size:
        td = td.expand(batch_size).contiguous()
    return td

We define the environment as non-batch_locked by turning the homonymous attribute to False. This means that we will not enforce the input tensordict to have a batch-size that matches the one of the environment.

The following code will just put together the pieces we have coded above.

class PendulumEnv(EnvBase):
    metadata = {
        "render_modes": ["human", "rgb_array"],
        "render_fps": 30,
    }
    batch_locked = False

    def __init__(self, td_params=None, seed=None, device="cpu"):
        if td_params is None:
            td_params = self.gen_params()

        super().__init__(device=device, batch_size=[])
        self._make_spec(td_params)
        if seed is None:
            seed = torch.empty((), dtype=torch.int64).random_().item()
        self.set_seed(seed)

    # Helpers: _make_step and gen_params
    gen_params = staticmethod(gen_params)
    _make_spec = _make_spec

    # Mandatory methods: _step, _reset and _set_seed
    _reset = _reset
    _step = staticmethod(_step)
    _set_seed = _set_seed

Testing our environment

TorchRL provides a simple function check_env_specs() to check that a (transformed) environment has an input/output structure that matches the one dictated by its specs. Let us try it out:

env = PendulumEnv()
check_env_specs(env)

We can have a look at our specs to have a visual representation of the environment signature:

print("observation_spec:", env.observation_spec)
print("state_spec:", env.state_spec)
print("reward_spec:", env.reward_spec)
observation_spec: Composite(
    th: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    thdot: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    params: Composite(
        max_speed: UnboundedDiscrete(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True)),
            device=cpu,
            dtype=torch.int64,
            domain=discrete),
        max_torque: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        dt: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        g: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        m: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        l: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        device=cpu,
        shape=torch.Size([])),
    device=cpu,
    shape=torch.Size([]))
state_spec: Composite(
    th: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    thdot: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    params: Composite(
        max_speed: UnboundedDiscrete(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True)),
            device=cpu,
            dtype=torch.int64,
            domain=discrete),
        max_torque: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        dt: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        g: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        m: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        l: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        device=cpu,
        shape=torch.Size([])),
    device=cpu,
    shape=torch.Size([]))
reward_spec: UnboundedContinuous(
    shape=torch.Size([1]),
    space=ContinuousBox(
        low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True),
        high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True)),
    device=cpu,
    dtype=torch.float32,
    domain=continuous)

We can execute a couple of commands too to check that the output structure matches what is expected.

td = env.reset()
print("reset tensordict", td)
reset tensordict TensorDict(
    fields={
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False),
        terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=None,
    is_shared=False)

We can run the env.rand_step() to generate an action randomly from the action_spec domain. A tensordict containing the hyperparameters and the current state must be passed since our environment is stateless. In stateful contexts, env.rand_step() works perfectly too.

td = env.rand_step(td)
print("random step tensordict", td)
random step tensordict TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False),
        terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=None,
    is_shared=False)

Transforming an environment

Writing environment transforms for stateless simulators is slightly more complicated than for stateful ones: transforming an output entry that needs to be read at the following iteration requires to apply the inverse transform before calling meth.step() at the next step. This is an ideal scenario to showcase all the features of TorchRL’s transforms!

For instance, in the following transformed environment we unsqueeze the entries ["th", "thdot"] to be able to stack them along the last dimension. We also pass them as in_keys_inv to squeeze them back to their original shape once they are passed as input in the next iteration.

env = TransformedEnv(
    env,
    # ``Unsqueeze`` the observations that we will concatenate
    UnsqueezeTransform(
        dim=-1,
        in_keys=["th", "thdot"],
        in_keys_inv=["th", "thdot"],
    ),
)

Writing custom transforms

TorchRL’s transforms may not cover all the operations one wants to execute after an environment has been executed. Writing a transform does not require much effort. As for the environment design, there are two steps in writing a transform:

  • Getting the dynamics right (forward and inverse);

  • Adapting the environment specs.

A transform can be used in two settings: on its own, it can be used as a Module. It can also be used appended to a TransformedEnv. The structure of the class allows to customize the behavior in the different contexts.

A Transform skeleton can be summarized as follows:

class Transform(nn.Module):
    def forward(self, tensordict):
        ...
    def _apply_transform(self, tensordict):
        ...
    def _step(self, tensordict):
        ...
    def _call(self, tensordict):
        ...
    def inv(self, tensordict):
        ...
    def _inv_apply_transform(self, tensordict):
        ...

There are three entry points (forward(), _step() and inv()) which all receive tensordict.TensorDict instances. The first two will eventually go through the keys indicated by in_keys and call _apply_transform() to each of these. The results will be written in the entries pointed by Transform.out_keys if provided (if not the in_keys will be updated with the transformed values). If inverse transforms need to be executed, a similar data flow will be executed but with the Transform.inv() and Transform._inv_apply_transform() methods and across the in_keys_inv and out_keys_inv list of keys. The following figure summarized this flow for environments and replay buffers.

Transform API

In some cases, a transform will not work on a subset of keys in a unitary manner, but will execute some operation on the parent environment or work with the entire input tensordict. In those cases, the _call() and forward() methods should be re-written, and the _apply_transform() method can be skipped.

Let us code new transforms that will compute the sine and cosine values of the position angle, as these values are more useful to us to learn a policy than the raw angle value:

class SinTransform(Transform):
    def _apply_transform(self, obs: torch.Tensor) -> None:
        return obs.sin()

    # The transform must also modify the data at reset time
    def _reset(
        self, tensordict: TensorDictBase, tensordict_reset: TensorDictBase
    ) -> TensorDictBase:
        return self._call(tensordict_reset)

    # _apply_to_composite will execute the observation spec transform across all
    # in_keys/out_keys pairs and write the result in the observation_spec which
    # is of type ``Composite``
    @_apply_to_composite
    def transform_observation_spec(self, observation_spec):
        return Bounded(
            low=-1,
            high=1,
            shape=observation_spec.shape,
            dtype=observation_spec.dtype,
            device=observation_spec.device,
        )


class CosTransform(Transform):
    def _apply_transform(self, obs: torch.Tensor) -> None:
        return obs.cos()

    # The transform must also modify the data at reset time
    def _reset(
        self, tensordict: TensorDictBase, tensordict_reset: TensorDictBase
    ) -> TensorDictBase:
        return self._call(tensordict_reset)

    # _apply_to_composite will execute the observation spec transform across all
    # in_keys/out_keys pairs and write the result in the observation_spec which
    # is of type ``Composite``
    @_apply_to_composite
    def transform_observation_spec(self, observation_spec):
        return Bounded(
            low=-1,
            high=1,
            shape=observation_spec.shape,
            dtype=observation_spec.dtype,
            device=observation_spec.device,
        )


t_sin = SinTransform(in_keys=["th"], out_keys=["sin"])
t_cos = CosTransform(in_keys=["th"], out_keys=["cos"])
env.append_transform(t_sin)
env.append_transform(t_cos)
TransformedEnv(
    env=PendulumEnv(),
    transform=Compose(
            UnsqueezeTransform(dim=-1, in_keys=['th', 'thdot'], out_keys=['th', 'thdot'], in_keys_inv=['th', 'thdot'], out_keys_inv=['th', 'thdot']),
            SinTransform(keys=['th']),
            CosTransform(keys=['th'])))

Concatenates the observations onto an “observation” entry. del_keys=False ensures that we keep these values for the next iteration.

cat_transform = CatTensors(
    in_keys=["sin", "cos", "thdot"], dim=-1, out_key="observation", del_keys=False
)
env.append_transform(cat_transform)
TransformedEnv(
    env=PendulumEnv(),
    transform=Compose(
            UnsqueezeTransform(dim=-1, in_keys=['th', 'thdot'], out_keys=['th', 'thdot'], in_keys_inv=['th', 'thdot'], out_keys_inv=['th', 'thdot']),
            SinTransform(keys=['th']),
            CosTransform(keys=['th']),
            CatTensors(in_keys=['cos', 'sin', 'thdot'], out_key=observation)))

Once more, let us check that our environment specs match what is received:

check_env_specs(env)

Executing a rollout

Executing a rollout is a succession of simple steps:

  • reset the environment

  • while some condition is not met:

    • compute an action given a policy

    • execute a step given this action

    • collect the data

    • make a MDP step

  • gather the data and return

These operations have been conveniently wrapped in the rollout() method, from which we provide a simplified version here below.

def simple_rollout(steps=100):
    # preallocate:
    data = TensorDict({}, [steps])
    # reset
    _data = env.reset()
    for i in range(steps):
        _data["action"] = env.action_spec.rand()
        _data = env.step(_data)
        data[i] = _data
        _data = step_mdp(_data, keep_other=True)
    return data


print("data from rollout:", simple_rollout(100))
data from rollout: TensorDict(
    fields={
        action: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        cos: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                cos: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                done: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([100, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([100]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                sin: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([100]),
            device=None,
            is_shared=False),
        observation: Tensor(shape=torch.Size([100, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([100]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([100]),
    device=None,
    is_shared=False)

Batching computations

The last unexplored end of our tutorial is the ability that we have to batch computations in TorchRL. Because our environment does not make any assumptions regarding the input data shape, we can seamlessly execute it over batches of data. Even better: for non-batch-locked environments such as our Pendulum, we can change the batch size on the fly without recreating the environment. To do this, we just generate parameters with the desired shape.

batch_size = 10  # number of environments to be executed in batch
td = env.reset(env.gen_params(batch_size=[batch_size]))
print("reset (batch size of 10)", td)
td = env.rand_step(td)
print("rand step (batch size of 10)", td)
reset (batch size of 10) TensorDict(
    fields={
        cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([10]),
    device=None,
    is_shared=False)
rand step (batch size of 10) TensorDict(
    fields={
        action: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([10]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10]),
            device=None,
            is_shared=False),
        observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([10]),
    device=None,
    is_shared=False)

Executing a rollout with a batch of data requires us to reset the environment out of the rollout function, since we need to define the batch_size dynamically and this is not supported by rollout():

rollout = env.rollout(
    3,
    auto_reset=False,  # we're executing the reset out of the ``rollout`` call
    tensordict=env.reset(env.gen_params(batch_size=[batch_size])),
)
print("rollout of len 3 (batch size of 10):", rollout)
rollout of len 3 (batch size of 10): TensorDict(
    fields={
        action: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        cos: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                cos: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                done: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([10, 3, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([10, 3]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                sin: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10, 3]),
            device=None,
            is_shared=False),
        observation: Tensor(shape=torch.Size([10, 3, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10, 3]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([10, 3]),
    device=None,
    is_shared=False)

Training a simple policy

In this example, we will train a simple policy using the reward as a differentiable objective, such as a negative loss. We will take advantage of the fact that our dynamic system is fully differentiable to backpropagate through the trajectory return and adjust the weights of our policy to maximize this value directly. Of course, in many settings many of the assumptions we make do not hold, such as differentiable system and full access to the underlying mechanics.

Still, this is a very simple example that showcases how a training loop can be coded with a custom environment in TorchRL.

Let us first write the policy network:

torch.manual_seed(0)
env.set_seed(0)

net = nn.Sequential(
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(1),
)
policy = TensorDictModule(
    net,
    in_keys=["observation"],
    out_keys=["action"],
)

and our optimizer:

Training loop

We will successively:

  • generate a trajectory

  • sum the rewards

  • backpropagate through the graph defined by these operations

  • clip the gradient norm and make an optimization step

  • repeat

At the end of the training loop, we should have a final reward close to 0 which demonstrates that the pendulum is upward and still as desired.

batch_size = 32
pbar = tqdm.tqdm(range(20_000 // batch_size))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optim, 20_000)
logs = defaultdict(list)

for _ in pbar:
    init_td = env.reset(env.gen_params(batch_size=[batch_size]))
    rollout = env.rollout(100, policy, tensordict=init_td, auto_reset=False)
    traj_return = rollout["next", "reward"].mean()
    (-traj_return).backward()
    gn = torch.nn.utils.clip_grad_norm_(net.parameters(), 1.0)
    optim.step()
    optim.zero_grad()
    pbar.set_description(
        f"reward: {traj_return: 4.4f}, "
        f"last reward: {rollout[..., -1]['next', 'reward'].mean(): 4.4f}, gradient norm: {gn: 4.4}"
    )
    logs["return"].append(traj_return.item())
    logs["last_reward"].append(rollout[..., -1]["next", "reward"].mean().item())
    scheduler.step()


def plot():
    import matplotlib
    from matplotlib import pyplot as plt

    is_ipython = "inline" in matplotlib.get_backend()
    if is_ipython:
        from IPython import display

    with plt.ion():
        plt.figure(figsize=(10, 5))
        plt.subplot(1, 2, 1)
        plt.plot(logs["return"])
        plt.title("returns")
        plt.xlabel("iteration")
        plt.subplot(1, 2, 2)
        plt.plot(logs["last_reward"])
        plt.title("last reward")
        plt.xlabel("iteration")
        if is_ipython:
            display.display(plt.gcf())
            display.clear_output(wait=True)
        plt.show()


plot()
returns, last reward
  0%|          | 0/625 [00:00<?, ?it/s]
reward: -6.0488, last reward: -5.0748, gradient norm:  8.519:   0%|          | 0/625 [00:00<?, ?it/s]
reward: -6.0488, last reward: -5.0748, gradient norm:  8.519:   0%|          | 1/625 [00:00<01:33,  6.69it/s]
reward: -7.0499, last reward: -7.4472, gradient norm:  5.073:   0%|          | 1/625 [00:00<01:33,  6.69it/s]
reward: -7.0499, last reward: -7.4472, gradient norm:  5.073:   0%|          | 2/625 [00:00<01:33,  6.63it/s]
reward: -7.0685, last reward: -7.0408, gradient norm:  5.552:   0%|          | 2/625 [00:00<01:33,  6.63it/s]
reward: -7.0685, last reward: -7.0408, gradient norm:  5.552:   0%|          | 3/625 [00:00<01:33,  6.63it/s]
reward: -6.5154, last reward: -5.9086, gradient norm:  2.527:   0%|          | 3/625 [00:00<01:33,  6.63it/s]
reward: -6.5154, last reward: -5.9086, gradient norm:  2.527:   1%|          | 4/625 [00:00<01:34,  6.60it/s]
reward: -6.2006, last reward: -5.9385, gradient norm:  8.155:   1%|          | 4/625 [00:00<01:34,  6.60it/s]
reward: -6.2006, last reward: -5.9385, gradient norm:  8.155:   1%|          | 5/625 [00:00<01:33,  6.60it/s]
reward: -6.2568, last reward: -5.4981, gradient norm:  6.223:   1%|          | 5/625 [00:00<01:33,  6.60it/s]
reward: -6.2568, last reward: -5.4981, gradient norm:  6.223:   1%|          | 6/625 [00:00<01:33,  6.62it/s]
reward: -5.8929, last reward: -8.4491, gradient norm:  4.581:   1%|          | 6/625 [00:01<01:33,  6.62it/s]
reward: -5.8929, last reward: -8.4491, gradient norm:  4.581:   1%|          | 7/625 [00:01<01:33,  6.63it/s]
reward: -6.3233, last reward: -9.0664, gradient norm:  7.596:   1%|          | 7/625 [00:01<01:33,  6.63it/s]
reward: -6.3233, last reward: -9.0664, gradient norm:  7.596:   1%|▏         | 8/625 [00:01<01:34,  6.56it/s]
reward: -6.1021, last reward: -9.5263, gradient norm:  0.9579:   1%|▏         | 8/625 [00:01<01:34,  6.56it/s]
reward: -6.1021, last reward: -9.5263, gradient norm:  0.9579:   1%|▏         | 9/625 [00:01<01:34,  6.53it/s]
reward: -6.5807, last reward: -8.8075, gradient norm:  3.212:   1%|▏         | 9/625 [00:01<01:34,  6.53it/s]
reward: -6.5807, last reward: -8.8075, gradient norm:  3.212:   2%|▏         | 10/625 [00:01<01:33,  6.56it/s]
reward: -6.2009, last reward: -8.5525, gradient norm:  2.914:   2%|▏         | 10/625 [00:01<01:33,  6.56it/s]
reward: -6.2009, last reward: -8.5525, gradient norm:  2.914:   2%|▏         | 11/625 [00:01<01:33,  6.58it/s]
reward: -6.2894, last reward: -8.0115, gradient norm:  52.06:   2%|▏         | 11/625 [00:01<01:33,  6.58it/s]
reward: -6.2894, last reward: -8.0115, gradient norm:  52.06:   2%|▏         | 12/625 [00:01<01:32,  6.59it/s]
reward: -6.0977, last reward: -6.1845, gradient norm:  18.09:   2%|▏         | 12/625 [00:01<01:32,  6.59it/s]
reward: -6.0977, last reward: -6.1845, gradient norm:  18.09:   2%|▏         | 13/625 [00:01<01:32,  6.60it/s]
reward: -6.1830, last reward: -7.4858, gradient norm:  5.233:   2%|▏         | 13/625 [00:02<01:32,  6.60it/s]
reward: -6.1830, last reward: -7.4858, gradient norm:  5.233:   2%|▏         | 14/625 [00:02<01:32,  6.61it/s]
reward: -6.2863, last reward: -5.0297, gradient norm:  1.464:   2%|▏         | 14/625 [00:02<01:32,  6.61it/s]
reward: -6.2863, last reward: -5.0297, gradient norm:  1.464:   2%|▏         | 15/625 [00:02<01:31,  6.63it/s]
reward: -6.4617, last reward: -5.5997, gradient norm:  2.904:   2%|▏         | 15/625 [00:02<01:31,  6.63it/s]
reward: -6.4617, last reward: -5.5997, gradient norm:  2.904:   3%|▎         | 16/625 [00:02<01:31,  6.63it/s]
reward: -6.1647, last reward: -6.0777, gradient norm:  4.901:   3%|▎         | 16/625 [00:02<01:31,  6.63it/s]
reward: -6.1647, last reward: -6.0777, gradient norm:  4.901:   3%|▎         | 17/625 [00:02<01:31,  6.64it/s]
reward: -6.4709, last reward: -6.6813, gradient norm:  0.8317:   3%|▎         | 17/625 [00:02<01:31,  6.64it/s]
reward: -6.4709, last reward: -6.6813, gradient norm:  0.8317:   3%|▎         | 18/625 [00:02<01:31,  6.64it/s]
reward: -6.3221, last reward: -6.5554, gradient norm:  1.276:   3%|▎         | 18/625 [00:02<01:31,  6.64it/s]
reward: -6.3221, last reward: -6.5554, gradient norm:  1.276:   3%|▎         | 19/625 [00:02<01:31,  6.64it/s]
reward: -6.3353, last reward: -7.9999, gradient norm:  4.701:   3%|▎         | 19/625 [00:03<01:31,  6.64it/s]
reward: -6.3353, last reward: -7.9999, gradient norm:  4.701:   3%|▎         | 20/625 [00:03<01:31,  6.65it/s]
reward: -5.8570, last reward: -7.6656, gradient norm:  5.463:   3%|▎         | 20/625 [00:03<01:31,  6.65it/s]
reward: -5.8570, last reward: -7.6656, gradient norm:  5.463:   3%|▎         | 21/625 [00:03<01:30,  6.65it/s]
reward: -5.7779, last reward: -6.6911, gradient norm:  6.875:   3%|▎         | 21/625 [00:03<01:30,  6.65it/s]
reward: -5.7779, last reward: -6.6911, gradient norm:  6.875:   4%|▎         | 22/625 [00:03<01:30,  6.66it/s]
reward: -6.0796, last reward: -5.7082, gradient norm:  5.308:   4%|▎         | 22/625 [00:03<01:30,  6.66it/s]
reward: -6.0796, last reward: -5.7082, gradient norm:  5.308:   4%|▎         | 23/625 [00:03<01:30,  6.65it/s]
reward: -6.0421, last reward: -6.1496, gradient norm:  12.4:   4%|▎         | 23/625 [00:03<01:30,  6.65it/s]
reward: -6.0421, last reward: -6.1496, gradient norm:  12.4:   4%|▍         | 24/625 [00:03<01:30,  6.65it/s]
reward: -5.5037, last reward: -5.1755, gradient norm:  22.62:   4%|▍         | 24/625 [00:03<01:30,  6.65it/s]
reward: -5.5037, last reward: -5.1755, gradient norm:  22.62:   4%|▍         | 25/625 [00:03<01:30,  6.65it/s]
reward: -5.5029, last reward: -4.9454, gradient norm:  3.665:   4%|▍         | 25/625 [00:03<01:30,  6.65it/s]
reward: -5.5029, last reward: -4.9454, gradient norm:  3.665:   4%|▍         | 26/625 [00:03<01:30,  6.65it/s]
reward: -5.9330, last reward: -6.2118, gradient norm:  5.444:   4%|▍         | 26/625 [00:04<01:30,  6.65it/s]
reward: -5.9330, last reward: -6.2118, gradient norm:  5.444:   4%|▍         | 27/625 [00:04<01:29,  6.65it/s]
reward: -6.0995, last reward: -6.6294, gradient norm:  11.69:   4%|▍         | 27/625 [00:04<01:29,  6.65it/s]
reward: -6.0995, last reward: -6.6294, gradient norm:  11.69:   4%|▍         | 28/625 [00:04<01:29,  6.65it/s]
reward: -6.3146, last reward: -7.2909, gradient norm:  5.461:   4%|▍         | 28/625 [00:04<01:29,  6.65it/s]
reward: -6.3146, last reward: -7.2909, gradient norm:  5.461:   5%|▍         | 29/625 [00:04<01:29,  6.66it/s]
reward: -5.9720, last reward: -6.1298, gradient norm:  19.91:   5%|▍         | 29/625 [00:04<01:29,  6.66it/s]
reward: -5.9720, last reward: -6.1298, gradient norm:  19.91:   5%|▍         | 30/625 [00:04<01:29,  6.66it/s]
reward: -5.9923, last reward: -7.0345, gradient norm:  3.464:   5%|▍         | 30/625 [00:04<01:29,  6.66it/s]
reward: -5.9923, last reward: -7.0345, gradient norm:  3.464:   5%|▍         | 31/625 [00:04<01:29,  6.65it/s]
reward: -5.3438, last reward: -4.3688, gradient norm:  2.424:   5%|▍         | 31/625 [00:04<01:29,  6.65it/s]
reward: -5.3438, last reward: -4.3688, gradient norm:  2.424:   5%|▌         | 32/625 [00:04<01:29,  6.64it/s]
reward: -5.6953, last reward: -4.5233, gradient norm:  3.411:   5%|▌         | 32/625 [00:04<01:29,  6.64it/s]
reward: -5.6953, last reward: -4.5233, gradient norm:  3.411:   5%|▌         | 33/625 [00:04<01:29,  6.64it/s]
reward: -5.4288, last reward: -2.8011, gradient norm:  10.82:   5%|▌         | 33/625 [00:05<01:29,  6.64it/s]
reward: -5.4288, last reward: -2.8011, gradient norm:  10.82:   5%|▌         | 34/625 [00:05<01:29,  6.64it/s]
reward: -5.5329, last reward: -4.2677, gradient norm:  15.71:   5%|▌         | 34/625 [00:05<01:29,  6.64it/s]
reward: -5.5329, last reward: -4.2677, gradient norm:  15.71:   6%|▌         | 35/625 [00:05<01:28,  6.64it/s]
reward: -5.6969, last reward: -3.7010, gradient norm:  1.376:   6%|▌         | 35/625 [00:05<01:28,  6.64it/s]
reward: -5.6969, last reward: -3.7010, gradient norm:  1.376:   6%|▌         | 36/625 [00:05<01:28,  6.65it/s]
reward: -5.9352, last reward: -4.7707, gradient norm:  15.49:   6%|▌         | 36/625 [00:05<01:28,  6.65it/s]
reward: -5.9352, last reward: -4.7707, gradient norm:  15.49:   6%|▌         | 37/625 [00:05<01:28,  6.65it/s]
reward: -5.6178, last reward: -4.5646, gradient norm:  3.348:   6%|▌         | 37/625 [00:05<01:28,  6.65it/s]
reward: -5.6178, last reward: -4.5646, gradient norm:  3.348:   6%|▌         | 38/625 [00:05<01:28,  6.65it/s]
reward: -5.7304, last reward: -3.9407, gradient norm:  4.942:   6%|▌         | 38/625 [00:05<01:28,  6.65it/s]
reward: -5.7304, last reward: -3.9407, gradient norm:  4.942:   6%|▌         | 39/625 [00:05<01:28,  6.64it/s]
reward: -5.3882, last reward: -3.7604, gradient norm:  9.85:   6%|▌         | 39/625 [00:06<01:28,  6.64it/s]
reward: -5.3882, last reward: -3.7604, gradient norm:  9.85:   6%|▋         | 40/625 [00:06<01:28,  6.63it/s]
reward: -5.3507, last reward: -2.8928, gradient norm:  1.258:   6%|▋         | 40/625 [00:06<01:28,  6.63it/s]
reward: -5.3507, last reward: -2.8928, gradient norm:  1.258:   7%|▋         | 41/625 [00:06<01:27,  6.64it/s]
reward: -5.6978, last reward: -4.4641, gradient norm:  4.549:   7%|▋         | 41/625 [00:06<01:27,  6.64it/s]
reward: -5.6978, last reward: -4.4641, gradient norm:  4.549:   7%|▋         | 42/625 [00:06<01:27,  6.64it/s]
reward: -5.5263, last reward: -3.6047, gradient norm:  2.544:   7%|▋         | 42/625 [00:06<01:27,  6.64it/s]
reward: -5.5263, last reward: -3.6047, gradient norm:  2.544:   7%|▋         | 43/625 [00:06<01:27,  6.64it/s]
reward: -5.5005, last reward: -4.4136, gradient norm:  11.49:   7%|▋         | 43/625 [00:06<01:27,  6.64it/s]
reward: -5.5005, last reward: -4.4136, gradient norm:  11.49:   7%|▋         | 44/625 [00:06<01:27,  6.64it/s]
reward: -5.2993, last reward: -6.3222, gradient norm:  32.53:   7%|▋         | 44/625 [00:06<01:27,  6.64it/s]
reward: -5.2993, last reward: -6.3222, gradient norm:  32.53:   7%|▋         | 45/625 [00:06<01:27,  6.65it/s]
reward: -5.4046, last reward: -5.7314, gradient norm:  7.275:   7%|▋         | 45/625 [00:06<01:27,  6.65it/s]
reward: -5.4046, last reward: -5.7314, gradient norm:  7.275:   7%|▋         | 46/625 [00:06<01:27,  6.65it/s]
reward: -5.6331, last reward: -4.9318, gradient norm:  6.961:   7%|▋         | 46/625 [00:07<01:27,  6.65it/s]
reward: -5.6331, last reward: -4.9318, gradient norm:  6.961:   8%|▊         | 47/625 [00:07<01:26,  6.65it/s]
reward: -4.8331, last reward: -4.1604, gradient norm:  26.26:   8%|▊         | 47/625 [00:07<01:26,  6.65it/s]
reward: -4.8331, last reward: -4.1604, gradient norm:  26.26:   8%|▊         | 48/625 [00:07<01:26,  6.66it/s]
reward: -5.4099, last reward: -4.4761, gradient norm:  8.125:   8%|▊         | 48/625 [00:07<01:26,  6.66it/s]
reward: -5.4099, last reward: -4.4761, gradient norm:  8.125:   8%|▊         | 49/625 [00:07<01:26,  6.65it/s]
reward: -5.4262, last reward: -3.6363, gradient norm:  2.382:   8%|▊         | 49/625 [00:07<01:26,  6.65it/s]
reward: -5.4262, last reward: -3.6363, gradient norm:  2.382:   8%|▊         | 50/625 [00:07<01:26,  6.65it/s]
reward: -5.3593, last reward: -5.7377, gradient norm:  22.62:   8%|▊         | 50/625 [00:07<01:26,  6.65it/s]
reward: -5.3593, last reward: -5.7377, gradient norm:  22.62:   8%|▊         | 51/625 [00:07<01:26,  6.64it/s]
reward: -5.2847, last reward: -3.3443, gradient norm:  2.867:   8%|▊         | 51/625 [00:07<01:26,  6.64it/s]
reward: -5.2847, last reward: -3.3443, gradient norm:  2.867:   8%|▊         | 52/625 [00:07<01:26,  6.64it/s]
reward: -5.3592, last reward: -6.4760, gradient norm:  8.441:   8%|▊         | 52/625 [00:07<01:26,  6.64it/s]
reward: -5.3592, last reward: -6.4760, gradient norm:  8.441:   8%|▊         | 53/625 [00:07<01:26,  6.64it/s]
reward: -5.9950, last reward: -10.8021, gradient norm:  11.77:   8%|▊         | 53/625 [00:08<01:26,  6.64it/s]
reward: -5.9950, last reward: -10.8021, gradient norm:  11.77:   9%|▊         | 54/625 [00:08<01:25,  6.65it/s]
reward: -6.3528, last reward: -7.1214, gradient norm:  7.708:   9%|▊         | 54/625 [00:08<01:25,  6.65it/s]
reward: -6.3528, last reward: -7.1214, gradient norm:  7.708:   9%|▉         | 55/625 [00:08<01:25,  6.64it/s]
reward: -6.4023, last reward: -7.3583, gradient norm:  9.041:   9%|▉         | 55/625 [00:08<01:25,  6.64it/s]
reward: -6.4023, last reward: -7.3583, gradient norm:  9.041:   9%|▉         | 56/625 [00:08<01:25,  6.64it/s]
reward: -6.3801, last reward: -7.0310, gradient norm:  120.1:   9%|▉         | 56/625 [00:08<01:25,  6.64it/s]
reward: -6.3801, last reward: -7.0310, gradient norm:  120.1:   9%|▉         | 57/625 [00:08<01:25,  6.64it/s]
reward: -6.4244, last reward: -6.2039, gradient norm:  15.48:   9%|▉         | 57/625 [00:08<01:25,  6.64it/s]
reward: -6.4244, last reward: -6.2039, gradient norm:  15.48:   9%|▉         | 58/625 [00:08<01:25,  6.64it/s]
reward: -6.4850, last reward: -6.8748, gradient norm:  4.706:   9%|▉         | 58/625 [00:08<01:25,  6.64it/s]
reward: -6.4850, last reward: -6.8748, gradient norm:  4.706:   9%|▉         | 59/625 [00:08<01:25,  6.63it/s]
reward: -6.4897, last reward: -5.9210, gradient norm:  11.63:   9%|▉         | 59/625 [00:09<01:25,  6.63it/s]
reward: -6.4897, last reward: -5.9210, gradient norm:  11.63:  10%|▉         | 60/625 [00:09<01:25,  6.64it/s]
reward: -6.2299, last reward: -7.8964, gradient norm:  13.35:  10%|▉         | 60/625 [00:09<01:25,  6.64it/s]
reward: -6.2299, last reward: -7.8964, gradient norm:  13.35:  10%|▉         | 61/625 [00:09<01:24,  6.65it/s]
reward: -6.0832, last reward: -9.3934, gradient norm:  4.456:  10%|▉         | 61/625 [00:09<01:24,  6.65it/s]
reward: -6.0832, last reward: -9.3934, gradient norm:  4.456:  10%|▉         | 62/625 [00:09<01:24,  6.65it/s]
reward: -5.8971, last reward: -10.2933, gradient norm:  10.74:  10%|▉         | 62/625 [00:09<01:24,  6.65it/s]
reward: -5.8971, last reward: -10.2933, gradient norm:  10.74:  10%|█         | 63/625 [00:09<01:24,  6.64it/s]
reward: -5.3377, last reward: -4.6996, gradient norm:  23.29:  10%|█         | 63/625 [00:09<01:24,  6.64it/s]
reward: -5.3377, last reward: -4.6996, gradient norm:  23.29:  10%|█         | 64/625 [00:09<01:24,  6.63it/s]
reward: -5.2274, last reward: -2.8916, gradient norm:  4.098:  10%|█         | 64/625 [00:09<01:24,  6.63it/s]
reward: -5.2274, last reward: -2.8916, gradient norm:  4.098:  10%|█         | 65/625 [00:09<01:24,  6.64it/s]
reward: -5.2660, last reward: -4.9110, gradient norm:  12.28:  10%|█         | 65/625 [00:09<01:24,  6.64it/s]
reward: -5.2660, last reward: -4.9110, gradient norm:  12.28:  11%|█         | 66/625 [00:09<01:24,  6.64it/s]
reward: -5.4503, last reward: -5.6956, gradient norm:  12.22:  11%|█         | 66/625 [00:10<01:24,  6.64it/s]
reward: -5.4503, last reward: -5.6956, gradient norm:  12.22:  11%|█         | 67/625 [00:10<01:23,  6.65it/s]
reward: -5.9172, last reward: -5.4026, gradient norm:  7.946:  11%|█         | 67/625 [00:10<01:23,  6.65it/s]
reward: -5.9172, last reward: -5.4026, gradient norm:  7.946:  11%|█         | 68/625 [00:10<01:23,  6.65it/s]
reward: -5.9229, last reward: -4.5205, gradient norm:  6.294:  11%|█         | 68/625 [00:10<01:23,  6.65it/s]
reward: -5.9229, last reward: -4.5205, gradient norm:  6.294:  11%|█         | 69/625 [00:10<01:23,  6.65it/s]
reward: -5.8872, last reward: -5.6637, gradient norm:  8.019:  11%|█         | 69/625 [00:10<01:23,  6.65it/s]
reward: -5.8872, last reward: -5.6637, gradient norm:  8.019:  11%|█         | 70/625 [00:10<01:23,  6.64it/s]
reward: -5.9281, last reward: -4.2082, gradient norm:  5.724:  11%|█         | 70/625 [00:10<01:23,  6.64it/s]
reward: -5.9281, last reward: -4.2082, gradient norm:  5.724:  11%|█▏        | 71/625 [00:10<01:23,  6.63it/s]
reward: -5.8561, last reward: -5.6574, gradient norm:  8.357:  11%|█▏        | 71/625 [00:10<01:23,  6.63it/s]
reward: -5.8561, last reward: -5.6574, gradient norm:  8.357:  12%|█▏        | 72/625 [00:10<01:23,  6.62it/s]
reward: -5.4138, last reward: -4.5230, gradient norm:  7.385:  12%|█▏        | 72/625 [00:11<01:23,  6.62it/s]
reward: -5.4138, last reward: -4.5230, gradient norm:  7.385:  12%|█▏        | 73/625 [00:11<01:23,  6.61it/s]
reward: -5.4065, last reward: -5.5642, gradient norm:  9.921:  12%|█▏        | 73/625 [00:11<01:23,  6.61it/s]
reward: -5.4065, last reward: -5.5642, gradient norm:  9.921:  12%|█▏        | 74/625 [00:11<01:23,  6.62it/s]
reward: -4.9786, last reward: -3.2894, gradient norm:  32.73:  12%|█▏        | 74/625 [00:11<01:23,  6.62it/s]
reward: -4.9786, last reward: -3.2894, gradient norm:  32.73:  12%|█▏        | 75/625 [00:11<01:23,  6.63it/s]
reward: -5.4129, last reward: -7.5831, gradient norm:  9.266:  12%|█▏        | 75/625 [00:11<01:23,  6.63it/s]
reward: -5.4129, last reward: -7.5831, gradient norm:  9.266:  12%|█▏        | 76/625 [00:11<01:22,  6.62it/s]
reward: -5.7723, last reward: -7.4152, gradient norm:  5.608:  12%|█▏        | 76/625 [00:11<01:22,  6.62it/s]
reward: -5.7723, last reward: -7.4152, gradient norm:  5.608:  12%|█▏        | 77/625 [00:11<01:22,  6.63it/s]
reward: -6.1604, last reward: -8.0898, gradient norm:  4.389:  12%|█▏        | 77/625 [00:11<01:22,  6.63it/s]
reward: -6.1604, last reward: -8.0898, gradient norm:  4.389:  12%|█▏        | 78/625 [00:11<01:22,  6.63it/s]
reward: -6.5155, last reward: -5.5376, gradient norm:  36.34:  12%|█▏        | 78/625 [00:11<01:22,  6.63it/s]
reward: -6.5155, last reward: -5.5376, gradient norm:  36.34:  13%|█▎        | 79/625 [00:11<01:22,  6.61it/s]
reward: -6.5616, last reward: -6.4094, gradient norm:  8.283:  13%|█▎        | 79/625 [00:12<01:22,  6.61it/s]
reward: -6.5616, last reward: -6.4094, gradient norm:  8.283:  13%|█▎        | 80/625 [00:12<01:22,  6.61it/s]
reward: -6.5333, last reward: -7.4803, gradient norm:  5.895:  13%|█▎        | 80/625 [00:12<01:22,  6.61it/s]
reward: -6.5333, last reward: -7.4803, gradient norm:  5.895:  13%|█▎        | 81/625 [00:12<01:22,  6.62it/s]
reward: -6.6566, last reward: -5.2588, gradient norm:  7.662:  13%|█▎        | 81/625 [00:12<01:22,  6.62it/s]
reward: -6.6566, last reward: -5.2588, gradient norm:  7.662:  13%|█▎        | 82/625 [00:12<01:21,  6.62it/s]
reward: -6.4732, last reward: -6.7503, gradient norm:  6.068:  13%|█▎        | 82/625 [00:12<01:21,  6.62it/s]
reward: -6.4732, last reward: -6.7503, gradient norm:  6.068:  13%|█▎        | 83/625 [00:12<01:22,  6.60it/s]
reward: -6.0714, last reward: -7.3370, gradient norm:  8.059:  13%|█▎        | 83/625 [00:12<01:22,  6.60it/s]
reward: -6.0714, last reward: -7.3370, gradient norm:  8.059:  13%|█▎        | 84/625 [00:12<01:21,  6.61it/s]
reward: -5.8612, last reward: -6.1915, gradient norm:  9.3:  13%|█▎        | 84/625 [00:12<01:21,  6.61it/s]
reward: -5.8612, last reward: -6.1915, gradient norm:  9.3:  14%|█▎        | 85/625 [00:12<01:21,  6.61it/s]
reward: -5.3855, last reward: -5.0349, gradient norm:  15.2:  14%|█▎        | 85/625 [00:12<01:21,  6.61it/s]
reward: -5.3855, last reward: -5.0349, gradient norm:  15.2:  14%|█▍        | 86/625 [00:12<01:21,  6.61it/s]
reward: -4.9644, last reward: -3.4538, gradient norm:  3.445:  14%|█▍        | 86/625 [00:13<01:21,  6.61it/s]
reward: -4.9644, last reward: -3.4538, gradient norm:  3.445:  14%|█▍        | 87/625 [00:13<01:21,  6.62it/s]
reward: -5.0392, last reward: -4.4080, gradient norm:  11.45:  14%|█▍        | 87/625 [00:13<01:21,  6.62it/s]
reward: -5.0392, last reward: -4.4080, gradient norm:  11.45:  14%|█▍        | 88/625 [00:13<01:21,  6.62it/s]
reward: -5.1648, last reward: -5.9599, gradient norm:  143.4:  14%|█▍        | 88/625 [00:13<01:21,  6.62it/s]
reward: -5.1648, last reward: -5.9599, gradient norm:  143.4:  14%|█▍        | 89/625 [00:13<01:20,  6.62it/s]
reward: -5.4284, last reward: -5.5946, gradient norm:  10.3:  14%|█▍        | 89/625 [00:13<01:20,  6.62it/s]
reward: -5.4284, last reward: -5.5946, gradient norm:  10.3:  14%|█▍        | 90/625 [00:13<01:20,  6.63it/s]
reward: -5.2590, last reward: -5.9181, gradient norm:  11.15:  14%|█▍        | 90/625 [00:13<01:20,  6.63it/s]
reward: -5.2590, last reward: -5.9181, gradient norm:  11.15:  15%|█▍        | 91/625 [00:13<01:20,  6.63it/s]
reward: -5.4621, last reward: -5.9075, gradient norm:  8.674:  15%|█▍        | 91/625 [00:13<01:20,  6.63it/s]
reward: -5.4621, last reward: -5.9075, gradient norm:  8.674:  15%|█▍        | 92/625 [00:13<01:20,  6.62it/s]
reward: -5.1772, last reward: -4.9444, gradient norm:  8.351:  15%|█▍        | 92/625 [00:14<01:20,  6.62it/s]
reward: -5.1772, last reward: -4.9444, gradient norm:  8.351:  15%|█▍        | 93/625 [00:14<01:20,  6.63it/s]
reward: -4.9391, last reward: -4.5595, gradient norm:  8.1:  15%|█▍        | 93/625 [00:14<01:20,  6.63it/s]
reward: -4.9391, last reward: -4.5595, gradient norm:  8.1:  15%|█▌        | 94/625 [00:14<01:20,  6.61it/s]
reward: -4.8673, last reward: -4.6240, gradient norm:  14.43:  15%|█▌        | 94/625 [00:14<01:20,  6.61it/s]
reward: -4.8673, last reward: -4.6240, gradient norm:  14.43:  15%|█▌        | 95/625 [00:14<01:20,  6.61it/s]
reward: -4.5919, last reward: -5.0018, gradient norm:  26.09:  15%|█▌        | 95/625 [00:14<01:20,  6.61it/s]
reward: -4.5919, last reward: -5.0018, gradient norm:  26.09:  15%|█▌        | 96/625 [00:14<01:20,  6.61it/s]
reward: -5.1071, last reward: -3.9127, gradient norm:  2.251:  15%|█▌        | 96/625 [00:14<01:20,  6.61it/s]
reward: -5.1071, last reward: -3.9127, gradient norm:  2.251:  16%|█▌        | 97/625 [00:14<01:19,  6.62it/s]
reward: -4.9799, last reward: -5.3131, gradient norm:  19.65:  16%|█▌        | 97/625 [00:14<01:19,  6.62it/s]
reward: -4.9799, last reward: -5.3131, gradient norm:  19.65:  16%|█▌        | 98/625 [00:14<01:19,  6.63it/s]
reward: -4.9612, last reward: -3.9705, gradient norm:  12.55:  16%|█▌        | 98/625 [00:14<01:19,  6.63it/s]
reward: -4.9612, last reward: -3.9705, gradient norm:  12.55:  16%|█▌        | 99/625 [00:14<01:19,  6.61it/s]
reward: -4.8741, last reward: -4.2230, gradient norm:  6.19:  16%|█▌        | 99/625 [00:15<01:19,  6.61it/s]
reward: -4.8741, last reward: -4.2230, gradient norm:  6.19:  16%|█▌        | 100/625 [00:15<01:19,  6.62it/s]
reward: -5.0972, last reward: -5.0337, gradient norm:  11.86:  16%|█▌        | 100/625 [00:15<01:19,  6.62it/s]
reward: -5.0972, last reward: -5.0337, gradient norm:  11.86:  16%|█▌        | 101/625 [00:15<01:19,  6.62it/s]
reward: -5.0350, last reward: -5.0654, gradient norm:  10.83:  16%|█▌        | 101/625 [00:15<01:19,  6.62it/s]
reward: -5.0350, last reward: -5.0654, gradient norm:  10.83:  16%|█▋        | 102/625 [00:15<01:18,  6.62it/s]
reward: -5.2441, last reward: -4.4596, gradient norm:  7.362:  16%|█▋        | 102/625 [00:15<01:18,  6.62it/s]
reward: -5.2441, last reward: -4.4596, gradient norm:  7.362:  16%|█▋        | 103/625 [00:15<01:18,  6.63it/s]
reward: -5.1664, last reward: -5.4362, gradient norm:  8.171:  16%|█▋        | 103/625 [00:15<01:18,  6.63it/s]
reward: -5.1664, last reward: -5.4362, gradient norm:  8.171:  17%|█▋        | 104/625 [00:15<01:18,  6.61it/s]
reward: -5.4041, last reward: -5.6907, gradient norm:  7.77:  17%|█▋        | 104/625 [00:15<01:18,  6.61it/s]
reward: -5.4041, last reward: -5.6907, gradient norm:  7.77:  17%|█▋        | 105/625 [00:15<01:18,  6.58it/s]
reward: -5.4664, last reward: -6.2760, gradient norm:  11.19:  17%|█▋        | 105/625 [00:15<01:18,  6.58it/s]
reward: -5.4664, last reward: -6.2760, gradient norm:  11.19:  17%|█▋        | 106/625 [00:15<01:18,  6.57it/s]
reward: -5.0299, last reward: -3.9712, gradient norm:  9.349:  17%|█▋        | 106/625 [00:16<01:18,  6.57it/s]
reward: -5.0299, last reward: -3.9712, gradient norm:  9.349:  17%|█▋        | 107/625 [00:16<01:18,  6.58it/s]
reward: -4.3332, last reward: -2.4479, gradient norm:  5.772:  17%|█▋        | 107/625 [00:16<01:18,  6.58it/s]
reward: -4.3332, last reward: -2.4479, gradient norm:  5.772:  17%|█▋        | 108/625 [00:16<01:18,  6.60it/s]
reward: -4.4357, last reward: -2.9591, gradient norm:  4.543:  17%|█▋        | 108/625 [00:16<01:18,  6.60it/s]
reward: -4.4357, last reward: -2.9591, gradient norm:  4.543:  17%|█▋        | 109/625 [00:16<01:18,  6.58it/s]
reward: -4.6216, last reward: -3.1353, gradient norm:  4.692:  17%|█▋        | 109/625 [00:16<01:18,  6.58it/s]
reward: -4.6216, last reward: -3.1353, gradient norm:  4.692:  18%|█▊        | 110/625 [00:16<01:18,  6.60it/s]
reward: -4.6261, last reward: -3.7086, gradient norm:  4.496:  18%|█▊        | 110/625 [00:16<01:18,  6.60it/s]
reward: -4.6261, last reward: -3.7086, gradient norm:  4.496:  18%|█▊        | 111/625 [00:16<01:17,  6.60it/s]
reward: -4.7758, last reward: -5.9818, gradient norm:  21.71:  18%|█▊        | 111/625 [00:16<01:17,  6.60it/s]
reward: -4.7758, last reward: -5.9818, gradient norm:  21.71:  18%|█▊        | 112/625 [00:16<01:17,  6.59it/s]
reward: -4.7772, last reward: -7.5055, gradient norm:  62.86:  18%|█▊        | 112/625 [00:17<01:17,  6.59it/s]
reward: -4.7772, last reward: -7.5055, gradient norm:  62.86:  18%|█▊        | 113/625 [00:17<01:17,  6.60it/s]
reward: -4.5840, last reward: -5.3180, gradient norm:  18.74:  18%|█▊        | 113/625 [00:17<01:17,  6.60it/s]
reward: -4.5840, last reward: -5.3180, gradient norm:  18.74:  18%|█▊        | 114/625 [00:17<01:17,  6.62it/s]
reward: -4.2976, last reward: -3.2083, gradient norm:  10.63:  18%|█▊        | 114/625 [00:17<01:17,  6.62it/s]
reward: -4.2976, last reward: -3.2083, gradient norm:  10.63:  18%|█▊        | 115/625 [00:17<01:17,  6.62it/s]
reward: -4.5275, last reward: -3.6873, gradient norm:  15.65:  18%|█▊        | 115/625 [00:17<01:17,  6.62it/s]
reward: -4.5275, last reward: -3.6873, gradient norm:  15.65:  19%|█▊        | 116/625 [00:17<01:16,  6.63it/s]
reward: -4.4107, last reward: -3.1624, gradient norm:  19.7:  19%|█▊        | 116/625 [00:17<01:16,  6.63it/s]
reward: -4.4107, last reward: -3.1624, gradient norm:  19.7:  19%|█▊        | 117/625 [00:17<01:16,  6.63it/s]
reward: -4.6372, last reward: -3.2571, gradient norm:  15.83:  19%|█▊        | 117/625 [00:17<01:16,  6.63it/s]
reward: -4.6372, last reward: -3.2571, gradient norm:  15.83:  19%|█▉        | 118/625 [00:17<01:16,  6.62it/s]
reward: -4.4039, last reward: -4.4428, gradient norm:  13.06:  19%|█▉        | 118/625 [00:17<01:16,  6.62it/s]
reward: -4.4039, last reward: -4.4428, gradient norm:  13.06:  19%|█▉        | 119/625 [00:17<01:16,  6.63it/s]
reward: -4.4728, last reward: -3.5628, gradient norm:  12.04:  19%|█▉        | 119/625 [00:18<01:16,  6.63it/s]
reward: -4.4728, last reward: -3.5628, gradient norm:  12.04:  19%|█▉        | 120/625 [00:18<01:16,  6.61it/s]
reward: -4.6767, last reward: -5.2466, gradient norm:  6.522:  19%|█▉        | 120/625 [00:18<01:16,  6.61it/s]
reward: -4.6767, last reward: -5.2466, gradient norm:  6.522:  19%|█▉        | 121/625 [00:18<01:16,  6.60it/s]
reward: -4.5873, last reward: -6.5072, gradient norm:  19.21:  19%|█▉        | 121/625 [00:18<01:16,  6.60it/s]
reward: -4.5873, last reward: -6.5072, gradient norm:  19.21:  20%|█▉        | 122/625 [00:18<01:16,  6.60it/s]
reward: -4.6548, last reward: -6.3766, gradient norm:  5.692:  20%|█▉        | 122/625 [00:18<01:16,  6.60it/s]
reward: -4.6548, last reward: -6.3766, gradient norm:  5.692:  20%|█▉        | 123/625 [00:18<01:16,  6.60it/s]
reward: -4.5134, last reward: -7.1955, gradient norm:  11.11:  20%|█▉        | 123/625 [00:18<01:16,  6.60it/s]
reward: -4.5134, last reward: -7.1955, gradient norm:  11.11:  20%|█▉        | 124/625 [00:18<01:15,  6.62it/s]
reward: -4.2481, last reward: -7.0591, gradient norm:  11.85:  20%|█▉        | 124/625 [00:18<01:15,  6.62it/s]
reward: -4.2481, last reward: -7.0591, gradient norm:  11.85:  20%|██        | 125/625 [00:18<01:15,  6.60it/s]
reward: -4.4500, last reward: -5.3368, gradient norm:  10.19:  20%|██        | 125/625 [00:19<01:15,  6.60it/s]
reward: -4.4500, last reward: -5.3368, gradient norm:  10.19:  20%|██        | 126/625 [00:19<01:15,  6.60it/s]
reward: -3.9708, last reward: -2.7059, gradient norm:  42.81:  20%|██        | 126/625 [00:19<01:15,  6.60it/s]
reward: -3.9708, last reward: -2.7059, gradient norm:  42.81:  20%|██        | 127/625 [00:19<01:15,  6.61it/s]
reward: -4.3031, last reward: -3.2534, gradient norm:  4.843:  20%|██        | 127/625 [00:19<01:15,  6.61it/s]
reward: -4.3031, last reward: -3.2534, gradient norm:  4.843:  20%|██        | 128/625 [00:19<01:15,  6.61it/s]
reward: -4.3327, last reward: -4.6193, gradient norm:  20.96:  20%|██        | 128/625 [00:19<01:15,  6.61it/s]
reward: -4.3327, last reward: -4.6193, gradient norm:  20.96:  21%|██        | 129/625 [00:19<01:15,  6.61it/s]
reward: -4.4831, last reward: -4.1172, gradient norm:  24.81:  21%|██        | 129/625 [00:19<01:15,  6.61it/s]
reward: -4.4831, last reward: -4.1172, gradient norm:  24.81:  21%|██        | 130/625 [00:19<01:14,  6.61it/s]
reward: -4.2593, last reward: -4.4219, gradient norm:  5.962:  21%|██        | 130/625 [00:19<01:14,  6.61it/s]
reward: -4.2593, last reward: -4.4219, gradient norm:  5.962:  21%|██        | 131/625 [00:19<01:14,  6.61it/s]
reward: -4.4800, last reward: -3.8380, gradient norm:  2.899:  21%|██        | 131/625 [00:19<01:14,  6.61it/s]
reward: -4.4800, last reward: -3.8380, gradient norm:  2.899:  21%|██        | 132/625 [00:19<01:14,  6.59it/s]
reward: -4.2721, last reward: -4.9048, gradient norm:  7.166:  21%|██        | 132/625 [00:20<01:14,  6.59it/s]
reward: -4.2721, last reward: -4.9048, gradient norm:  7.166:  21%|██▏       | 133/625 [00:20<01:14,  6.59it/s]
reward: -4.2419, last reward: -4.5248, gradient norm:  25.93:  21%|██▏       | 133/625 [00:20<01:14,  6.59it/s]
reward: -4.2419, last reward: -4.5248, gradient norm:  25.93:  21%|██▏       | 134/625 [00:20<01:14,  6.60it/s]
reward: -4.2139, last reward: -4.4278, gradient norm:  20.26:  21%|██▏       | 134/625 [00:20<01:14,  6.60it/s]
reward: -4.2139, last reward: -4.4278, gradient norm:  20.26:  22%|██▏       | 135/625 [00:20<01:14,  6.62it/s]
reward: -4.0690, last reward: -2.5140, gradient norm:  22.5:  22%|██▏       | 135/625 [00:20<01:14,  6.62it/s]
reward: -4.0690, last reward: -2.5140, gradient norm:  22.5:  22%|██▏       | 136/625 [00:20<01:13,  6.62it/s]
reward: -4.1140, last reward: -3.7402, gradient norm:  11.11:  22%|██▏       | 136/625 [00:20<01:13,  6.62it/s]
reward: -4.1140, last reward: -3.7402, gradient norm:  11.11:  22%|██▏       | 137/625 [00:20<01:13,  6.62it/s]
reward: -4.5356, last reward: -5.1636, gradient norm:  400.1:  22%|██▏       | 137/625 [00:20<01:13,  6.62it/s]
reward: -4.5356, last reward: -5.1636, gradient norm:  400.1:  22%|██▏       | 138/625 [00:20<01:13,  6.60it/s]
reward: -5.0671, last reward: -5.8798, gradient norm:  13.34:  22%|██▏       | 138/625 [00:20<01:13,  6.60it/s]
reward: -5.0671, last reward: -5.8798, gradient norm:  13.34:  22%|██▏       | 139/625 [00:20<01:13,  6.62it/s]
reward: -4.8918, last reward: -6.3298, gradient norm:  7.307:  22%|██▏       | 139/625 [00:21<01:13,  6.62it/s]
reward: -4.8918, last reward: -6.3298, gradient norm:  7.307:  22%|██▏       | 140/625 [00:21<01:13,  6.62it/s]
reward: -5.1779, last reward: -4.1915, gradient norm:  11.43:  22%|██▏       | 140/625 [00:21<01:13,  6.62it/s]
reward: -5.1779, last reward: -4.1915, gradient norm:  11.43:  23%|██▎       | 141/625 [00:21<01:13,  6.62it/s]
reward: -5.1771, last reward: -4.3624, gradient norm:  6.936:  23%|██▎       | 141/625 [00:21<01:13,  6.62it/s]
reward: -5.1771, last reward: -4.3624, gradient norm:  6.936:  23%|██▎       | 142/625 [00:21<01:12,  6.63it/s]
reward: -5.1683, last reward: -3.4810, gradient norm:  13.29:  23%|██▎       | 142/625 [00:21<01:12,  6.63it/s]
reward: -5.1683, last reward: -3.4810, gradient norm:  13.29:  23%|██▎       | 143/625 [00:21<01:13,  6.60it/s]
reward: -4.9373, last reward: -5.4435, gradient norm:  19.33:  23%|██▎       | 143/625 [00:21<01:13,  6.60it/s]
reward: -4.9373, last reward: -5.4435, gradient norm:  19.33:  23%|██▎       | 144/625 [00:21<01:12,  6.61it/s]
reward: -4.4396, last reward: -4.8092, gradient norm:  118.9:  23%|██▎       | 144/625 [00:21<01:12,  6.61it/s]
reward: -4.4396, last reward: -4.8092, gradient norm:  118.9:  23%|██▎       | 145/625 [00:21<01:12,  6.60it/s]
reward: -4.3911, last reward: -8.2572, gradient norm:  15.04:  23%|██▎       | 145/625 [00:22<01:12,  6.60it/s]
reward: -4.3911, last reward: -8.2572, gradient norm:  15.04:  23%|██▎       | 146/625 [00:22<01:12,  6.62it/s]
reward: -4.4212, last reward: -3.0260, gradient norm:  26.01:  23%|██▎       | 146/625 [00:22<01:12,  6.62it/s]
reward: -4.4212, last reward: -3.0260, gradient norm:  26.01:  24%|██▎       | 147/625 [00:22<01:12,  6.62it/s]
reward: -4.0939, last reward: -4.6478, gradient norm:  9.605:  24%|██▎       | 147/625 [00:22<01:12,  6.62it/s]
reward: -4.0939, last reward: -4.6478, gradient norm:  9.605:  24%|██▎       | 148/625 [00:22<01:12,  6.62it/s]
reward: -4.6606, last reward: -4.7289, gradient norm:  11.19:  24%|██▎       | 148/625 [00:22<01:12,  6.62it/s]
reward: -4.6606, last reward: -4.7289, gradient norm:  11.19:  24%|██▍       | 149/625 [00:22<01:11,  6.63it/s]
reward: -4.9300, last reward: -4.7193, gradient norm:  8.563:  24%|██▍       | 149/625 [00:22<01:11,  6.63it/s]
reward: -4.9300, last reward: -4.7193, gradient norm:  8.563:  24%|██▍       | 150/625 [00:22<01:11,  6.63it/s]
reward: -5.1166, last reward: -4.8514, gradient norm:  8.384:  24%|██▍       | 150/625 [00:22<01:11,  6.63it/s]
reward: -5.1166, last reward: -4.8514, gradient norm:  8.384:  24%|██▍       | 151/625 [00:22<01:11,  6.63it/s]
reward: -4.9108, last reward: -5.0672, gradient norm:  9.292:  24%|██▍       | 151/625 [00:22<01:11,  6.63it/s]
reward: -4.9108, last reward: -5.0672, gradient norm:  9.292:  24%|██▍       | 152/625 [00:22<01:11,  6.63it/s]
reward: -4.8591, last reward: -4.3768, gradient norm:  9.72:  24%|██▍       | 152/625 [00:23<01:11,  6.63it/s]
reward: -4.8591, last reward: -4.3768, gradient norm:  9.72:  24%|██▍       | 153/625 [00:23<01:11,  6.63it/s]
reward: -4.2721, last reward: -3.9976, gradient norm:  10.37:  24%|██▍       | 153/625 [00:23<01:11,  6.63it/s]
reward: -4.2721, last reward: -3.9976, gradient norm:  10.37:  25%|██▍       | 154/625 [00:23<01:11,  6.63it/s]
reward: -4.0576, last reward: -2.0067, gradient norm:  8.935:  25%|██▍       | 154/625 [00:23<01:11,  6.63it/s]
reward: -4.0576, last reward: -2.0067, gradient norm:  8.935:  25%|██▍       | 155/625 [00:23<01:10,  6.64it/s]
reward: -4.4199, last reward: -5.1722, gradient norm:  18.7:  25%|██▍       | 155/625 [00:23<01:10,  6.64it/s]
reward: -4.4199, last reward: -5.1722, gradient norm:  18.7:  25%|██▍       | 156/625 [00:23<01:10,  6.64it/s]
reward: -4.8310, last reward: -7.3466, gradient norm:  28.52:  25%|██▍       | 156/625 [00:23<01:10,  6.64it/s]
reward: -4.8310, last reward: -7.3466, gradient norm:  28.52:  25%|██▌       | 157/625 [00:23<01:10,  6.64it/s]
reward: -4.8631, last reward: -6.2492, gradient norm:  89.17:  25%|██▌       | 157/625 [00:23<01:10,  6.64it/s]
reward: -4.8631, last reward: -6.2492, gradient norm:  89.17:  25%|██▌       | 158/625 [00:23<01:10,  6.64it/s]
reward: -4.8763, last reward: -6.1277, gradient norm:  24.43:  25%|██▌       | 158/625 [00:24<01:10,  6.64it/s]
reward: -4.8763, last reward: -6.1277, gradient norm:  24.43:  25%|██▌       | 159/625 [00:24<01:10,  6.64it/s]
reward: -4.5562, last reward: -5.7446, gradient norm:  23.35:  25%|██▌       | 159/625 [00:24<01:10,  6.64it/s]
reward: -4.5562, last reward: -5.7446, gradient norm:  23.35:  26%|██▌       | 160/625 [00:24<01:10,  6.64it/s]
reward: -4.1082, last reward: -4.9830, gradient norm:  22.14:  26%|██▌       | 160/625 [00:24<01:10,  6.64it/s]
reward: -4.1082, last reward: -4.9830, gradient norm:  22.14:  26%|██▌       | 161/625 [00:24<01:09,  6.64it/s]
reward: -4.0946, last reward: -2.5229, gradient norm:  10.47:  26%|██▌       | 161/625 [00:24<01:09,  6.64it/s]
reward: -4.0946, last reward: -2.5229, gradient norm:  10.47:  26%|██▌       | 162/625 [00:24<01:09,  6.64it/s]
reward: -4.4574, last reward: -4.6900, gradient norm:  112.6:  26%|██▌       | 162/625 [00:24<01:09,  6.64it/s]
reward: -4.4574, last reward: -4.6900, gradient norm:  112.6:  26%|██▌       | 163/625 [00:24<01:09,  6.63it/s]
reward: -5.2229, last reward: -4.0318, gradient norm:  6.482:  26%|██▌       | 163/625 [00:24<01:09,  6.63it/s]
reward: -5.2229, last reward: -4.0318, gradient norm:  6.482:  26%|██▌       | 164/625 [00:24<01:09,  6.64it/s]
reward: -5.0543, last reward: -4.0817, gradient norm:  5.761:  26%|██▌       | 164/625 [00:24<01:09,  6.64it/s]
reward: -5.0543, last reward: -4.0817, gradient norm:  5.761:  26%|██▋       | 165/625 [00:24<01:09,  6.64it/s]
reward: -5.2809, last reward: -4.5118, gradient norm:  5.366:  26%|██▋       | 165/625 [00:25<01:09,  6.64it/s]
reward: -5.2809, last reward: -4.5118, gradient norm:  5.366:  27%|██▋       | 166/625 [00:25<01:08,  6.65it/s]
reward: -5.1142, last reward: -4.5635, gradient norm:  5.04:  27%|██▋       | 166/625 [00:25<01:08,  6.65it/s]
reward: -5.1142, last reward: -4.5635, gradient norm:  5.04:  27%|██▋       | 167/625 [00:25<01:08,  6.64it/s]
reward: -5.1949, last reward: -4.2327, gradient norm:  4.982:  27%|██▋       | 167/625 [00:25<01:08,  6.64it/s]
reward: -5.1949, last reward: -4.2327, gradient norm:  4.982:  27%|██▋       | 168/625 [00:25<01:08,  6.65it/s]
reward: -5.0967, last reward: -5.0387, gradient norm:  7.457:  27%|██▋       | 168/625 [00:25<01:08,  6.65it/s]
reward: -5.0967, last reward: -5.0387, gradient norm:  7.457:  27%|██▋       | 169/625 [00:25<01:08,  6.65it/s]
reward: -5.0782, last reward: -5.2150, gradient norm:  10.54:  27%|██▋       | 169/625 [00:25<01:08,  6.65it/s]
reward: -5.0782, last reward: -5.2150, gradient norm:  10.54:  27%|██▋       | 170/625 [00:25<01:08,  6.65it/s]
reward: -4.5222, last reward: -4.3725, gradient norm:  22.63:  27%|██▋       | 170/625 [00:25<01:08,  6.65it/s]
reward: -4.5222, last reward: -4.3725, gradient norm:  22.63:  27%|██▋       | 171/625 [00:25<01:08,  6.64it/s]
reward: -3.9288, last reward: -3.9837, gradient norm:  83.59:  27%|██▋       | 171/625 [00:25<01:08,  6.64it/s]
reward: -3.9288, last reward: -3.9837, gradient norm:  83.59:  28%|██▊       | 172/625 [00:25<01:08,  6.63it/s]
reward: -4.1416, last reward: -4.1099, gradient norm:  30.57:  28%|██▊       | 172/625 [00:26<01:08,  6.63it/s]
reward: -4.1416, last reward: -4.1099, gradient norm:  30.57:  28%|██▊       | 173/625 [00:26<01:08,  6.62it/s]
reward: -4.8620, last reward: -6.8475, gradient norm:  18.91:  28%|██▊       | 173/625 [00:26<01:08,  6.62it/s]
reward: -4.8620, last reward: -6.8475, gradient norm:  18.91:  28%|██▊       | 174/625 [00:26<01:08,  6.62it/s]
reward: -5.1807, last reward: -6.4375, gradient norm:  18.48:  28%|██▊       | 174/625 [00:26<01:08,  6.62it/s]
reward: -5.1807, last reward: -6.4375, gradient norm:  18.48:  28%|██▊       | 175/625 [00:26<01:07,  6.63it/s]
reward: -5.1148, last reward: -5.0645, gradient norm:  14.36:  28%|██▊       | 175/625 [00:26<01:07,  6.63it/s]
reward: -5.1148, last reward: -5.0645, gradient norm:  14.36:  28%|██▊       | 176/625 [00:26<01:07,  6.62it/s]
reward: -5.2751, last reward: -4.8313, gradient norm:  15.32:  28%|██▊       | 176/625 [00:26<01:07,  6.62it/s]
reward: -5.2751, last reward: -4.8313, gradient norm:  15.32:  28%|██▊       | 177/625 [00:26<01:07,  6.63it/s]
reward: -4.9286, last reward: -6.9770, gradient norm:  24.75:  28%|██▊       | 177/625 [00:26<01:07,  6.63it/s]
reward: -4.9286, last reward: -6.9770, gradient norm:  24.75:  28%|██▊       | 178/625 [00:26<01:07,  6.62it/s]
reward: -4.5735, last reward: -5.2837, gradient norm:  15.2:  28%|██▊       | 178/625 [00:27<01:07,  6.62it/s]
reward: -4.5735, last reward: -5.2837, gradient norm:  15.2:  29%|██▊       | 179/625 [00:27<01:07,  6.63it/s]
reward: -4.2926, last reward: -1.9489, gradient norm:  18.24:  29%|██▊       | 179/625 [00:27<01:07,  6.63it/s]
reward: -4.2926, last reward: -1.9489, gradient norm:  18.24:  29%|██▉       | 180/625 [00:27<01:07,  6.63it/s]
reward: -4.1507, last reward: -3.5593, gradient norm:  37.66:  29%|██▉       | 180/625 [00:27<01:07,  6.63it/s]
reward: -4.1507, last reward: -3.5593, gradient norm:  37.66:  29%|██▉       | 181/625 [00:27<01:06,  6.64it/s]
reward: -3.8724, last reward: -4.3567, gradient norm:  16.67:  29%|██▉       | 181/625 [00:27<01:06,  6.64it/s]
reward: -3.8724, last reward: -4.3567, gradient norm:  16.67:  29%|██▉       | 182/625 [00:27<01:06,  6.64it/s]
reward: -4.3574, last reward: -3.6140, gradient norm:  13.96:  29%|██▉       | 182/625 [00:27<01:06,  6.64it/s]
reward: -4.3574, last reward: -3.6140, gradient norm:  13.96:  29%|██▉       | 183/625 [00:27<01:06,  6.63it/s]
reward: -4.7895, last reward: -6.2518, gradient norm:  14.74:  29%|██▉       | 183/625 [00:27<01:06,  6.63it/s]
reward: -4.7895, last reward: -6.2518, gradient norm:  14.74:  29%|██▉       | 184/625 [00:27<01:06,  6.63it/s]
reward: -4.6146, last reward: -5.6969, gradient norm:  11.45:  29%|██▉       | 184/625 [00:27<01:06,  6.63it/s]
reward: -4.6146, last reward: -5.6969, gradient norm:  11.45:  30%|██▉       | 185/625 [00:27<01:06,  6.63it/s]
reward: -4.8776, last reward: -5.7358, gradient norm:  13.16:  30%|██▉       | 185/625 [00:28<01:06,  6.63it/s]
reward: -4.8776, last reward: -5.7358, gradient norm:  13.16:  30%|██▉       | 186/625 [00:28<01:06,  6.62it/s]
reward: -4.3722, last reward: -4.8428, gradient norm:  23.57:  30%|██▉       | 186/625 [00:28<01:06,  6.62it/s]
reward: -4.3722, last reward: -4.8428, gradient norm:  23.57:  30%|██▉       | 187/625 [00:28<01:06,  6.62it/s]
reward: -4.2656, last reward: -3.7955, gradient norm:  54.67:  30%|██▉       | 187/625 [00:28<01:06,  6.62it/s]
reward: -4.2656, last reward: -3.7955, gradient norm:  54.67:  30%|███       | 188/625 [00:28<01:05,  6.63it/s]
reward: -4.0092, last reward: -1.7106, gradient norm:  7.829:  30%|███       | 188/625 [00:28<01:05,  6.63it/s]
reward: -4.0092, last reward: -1.7106, gradient norm:  7.829:  30%|███       | 189/625 [00:28<01:30,  4.79it/s]
reward: -4.2264, last reward: -3.6919, gradient norm:  16.17:  30%|███       | 189/625 [00:28<01:30,  4.79it/s]
reward: -4.2264, last reward: -3.6919, gradient norm:  16.17:  30%|███       | 190/625 [00:28<01:23,  5.23it/s]
reward: -4.1438, last reward: -2.1362, gradient norm:  19.43:  30%|███       | 190/625 [00:29<01:23,  5.23it/s]
reward: -4.1438, last reward: -2.1362, gradient norm:  19.43:  31%|███       | 191/625 [00:29<01:17,  5.58it/s]
reward: -4.0618, last reward: -2.8217, gradient norm:  73.63:  31%|███       | 191/625 [00:29<01:17,  5.58it/s]
reward: -4.0618, last reward: -2.8217, gradient norm:  73.63:  31%|███       | 192/625 [00:29<01:13,  5.85it/s]
reward: -3.9420, last reward: -3.6765, gradient norm:  34.1:  31%|███       | 192/625 [00:29<01:13,  5.85it/s]
reward: -3.9420, last reward: -3.6765, gradient norm:  34.1:  31%|███       | 193/625 [00:29<01:11,  6.07it/s]
reward: -3.7745, last reward: -4.0709, gradient norm:  26.48:  31%|███       | 193/625 [00:29<01:11,  6.07it/s]
reward: -3.7745, last reward: -4.0709, gradient norm:  26.48:  31%|███       | 194/625 [00:29<01:09,  6.23it/s]
reward: -3.9478, last reward: -2.6867, gradient norm:  22.82:  31%|███       | 194/625 [00:29<01:09,  6.23it/s]
reward: -3.9478, last reward: -2.6867, gradient norm:  22.82:  31%|███       | 195/625 [00:29<01:07,  6.34it/s]
reward: -3.6507, last reward: -2.6225, gradient norm:  37.44:  31%|███       | 195/625 [00:29<01:07,  6.34it/s]
reward: -3.6507, last reward: -2.6225, gradient norm:  37.44:  31%|███▏      | 196/625 [00:29<01:06,  6.40it/s]
reward: -4.2244, last reward: -3.2195, gradient norm:  10.71:  31%|███▏      | 196/625 [00:29<01:06,  6.40it/s]
reward: -4.2244, last reward: -3.2195, gradient norm:  10.71:  32%|███▏      | 197/625 [00:29<01:07,  6.37it/s]
reward: -4.5385, last reward: -3.9263, gradient norm:  31.03:  32%|███▏      | 197/625 [00:30<01:07,  6.37it/s]
reward: -4.5385, last reward: -3.9263, gradient norm:  31.03:  32%|███▏      | 198/625 [00:30<01:06,  6.44it/s]
reward: -4.1878, last reward: -3.2374, gradient norm:  34.35:  32%|███▏      | 198/625 [00:30<01:06,  6.44it/s]
reward: -4.1878, last reward: -3.2374, gradient norm:  34.35:  32%|███▏      | 199/625 [00:30<01:05,  6.49it/s]
reward: -3.8054, last reward: -2.3504, gradient norm:  5.557:  32%|███▏      | 199/625 [00:30<01:05,  6.49it/s]
reward: -3.8054, last reward: -2.3504, gradient norm:  5.557:  32%|███▏      | 200/625 [00:30<01:05,  6.52it/s]
reward: -4.0766, last reward: -4.6825, gradient norm:  38.72:  32%|███▏      | 200/625 [00:30<01:05,  6.52it/s]
reward: -4.0766, last reward: -4.6825, gradient norm:  38.72:  32%|███▏      | 201/625 [00:30<01:04,  6.55it/s]
reward: -4.2011, last reward: -5.8393, gradient norm:  21.06:  32%|███▏      | 201/625 [00:30<01:04,  6.55it/s]
reward: -4.2011, last reward: -5.8393, gradient norm:  21.06:  32%|███▏      | 202/625 [00:30<01:04,  6.56it/s]
reward: -4.0803, last reward: -3.7815, gradient norm:  10.6:  32%|███▏      | 202/625 [00:30<01:04,  6.56it/s]
reward: -4.0803, last reward: -3.7815, gradient norm:  10.6:  32%|███▏      | 203/625 [00:30<01:04,  6.58it/s]
reward: -3.8363, last reward: -3.2460, gradient norm:  32.57:  32%|███▏      | 203/625 [00:30<01:04,  6.58it/s]
reward: -3.8363, last reward: -3.2460, gradient norm:  32.57:  33%|███▎      | 204/625 [00:30<01:03,  6.59it/s]
reward: -3.8643, last reward: -3.2191, gradient norm:  8.593:  33%|███▎      | 204/625 [00:31<01:03,  6.59it/s]
reward: -3.8643, last reward: -3.2191, gradient norm:  8.593:  33%|███▎      | 205/625 [00:31<01:03,  6.59it/s]
reward: -4.0773, last reward: -5.1343, gradient norm:  14.49:  33%|███▎      | 205/625 [00:31<01:03,  6.59it/s]
reward: -4.0773, last reward: -5.1343, gradient norm:  14.49:  33%|███▎      | 206/625 [00:31<01:03,  6.60it/s]
reward: -4.1400, last reward: -5.8657, gradient norm:  17.05:  33%|███▎      | 206/625 [00:31<01:03,  6.60it/s]
reward: -4.1400, last reward: -5.8657, gradient norm:  17.05:  33%|███▎      | 207/625 [00:31<01:03,  6.60it/s]
reward: -3.9304, last reward: -2.7584, gradient norm:  33.25:  33%|███▎      | 207/625 [00:31<01:03,  6.60it/s]
reward: -3.9304, last reward: -2.7584, gradient norm:  33.25:  33%|███▎      | 208/625 [00:31<01:03,  6.60it/s]
reward: -3.8752, last reward: -4.2307, gradient norm:  10.76:  33%|███▎      | 208/625 [00:31<01:03,  6.60it/s]
reward: -3.8752, last reward: -4.2307, gradient norm:  10.76:  33%|███▎      | 209/625 [00:31<01:02,  6.61it/s]
reward: -3.5250, last reward: -1.4869, gradient norm:  40.8:  33%|███▎      | 209/625 [00:31<01:02,  6.61it/s]
reward: -3.5250, last reward: -1.4869, gradient norm:  40.8:  34%|███▎      | 210/625 [00:31<01:02,  6.61it/s]
reward: -3.7837, last reward: -2.5762, gradient norm:  193.3:  34%|███▎      | 210/625 [00:32<01:02,  6.61it/s]
reward: -3.7837, last reward: -2.5762, gradient norm:  193.3:  34%|███▍      | 211/625 [00:32<01:02,  6.61it/s]
reward: -3.6661, last reward: -1.8600, gradient norm:  136.5:  34%|███▍      | 211/625 [00:32<01:02,  6.61it/s]
reward: -3.6661, last reward: -1.8600, gradient norm:  136.5:  34%|███▍      | 212/625 [00:32<01:02,  6.61it/s]
reward: -4.2502, last reward: -3.1752, gradient norm:  21.44:  34%|███▍      | 212/625 [00:32<01:02,  6.61it/s]
reward: -4.2502, last reward: -3.1752, gradient norm:  21.44:  34%|███▍      | 213/625 [00:32<01:02,  6.61it/s]
reward: -4.3075, last reward: -2.8871, gradient norm:  30.65:  34%|███▍      | 213/625 [00:32<01:02,  6.61it/s]
reward: -4.3075, last reward: -2.8871, gradient norm:  30.65:  34%|███▍      | 214/625 [00:32<01:02,  6.58it/s]
reward: -3.9406, last reward: -2.8090, gradient norm:  20.18:  34%|███▍      | 214/625 [00:32<01:02,  6.58it/s]
reward: -3.9406, last reward: -2.8090, gradient norm:  20.18:  34%|███▍      | 215/625 [00:32<01:02,  6.54it/s]
reward: -3.6291, last reward: -2.8923, gradient norm:  7.876:  34%|███▍      | 215/625 [00:32<01:02,  6.54it/s]
reward: -3.6291, last reward: -2.8923, gradient norm:  7.876:  35%|███▍      | 216/625 [00:32<01:02,  6.52it/s]
reward: -3.5112, last reward: -3.9504, gradient norm:  3.21e+03:  35%|███▍      | 216/625 [00:32<01:02,  6.52it/s]
reward: -3.5112, last reward: -3.9504, gradient norm:  3.21e+03:  35%|███▍      | 217/625 [00:32<01:02,  6.52it/s]
reward: -3.7431, last reward: -2.7880, gradient norm:  13.73:  35%|███▍      | 217/625 [00:33<01:02,  6.52it/s]
reward: -3.7431, last reward: -2.7880, gradient norm:  13.73:  35%|███▍      | 218/625 [00:33<01:02,  6.52it/s]
reward: -3.4463, last reward: -4.5432, gradient norm:  32.37:  35%|███▍      | 218/625 [00:33<01:02,  6.52it/s]
reward: -3.4463, last reward: -4.5432, gradient norm:  32.37:  35%|███▌      | 219/625 [00:33<01:02,  6.51it/s]
reward: -3.3793, last reward: -3.3313, gradient norm:  60.63:  35%|███▌      | 219/625 [00:33<01:02,  6.51it/s]
reward: -3.3793, last reward: -3.3313, gradient norm:  60.63:  35%|███▌      | 220/625 [00:33<01:02,  6.51it/s]
reward: -3.8843, last reward: -3.0369, gradient norm:  5.065:  35%|███▌      | 220/625 [00:33<01:02,  6.51it/s]
reward: -3.8843, last reward: -3.0369, gradient norm:  5.065:  35%|███▌      | 221/625 [00:33<01:02,  6.51it/s]
reward: -3.4828, last reward: -3.8391, gradient norm:  59.85:  35%|███▌      | 221/625 [00:33<01:02,  6.51it/s]
reward: -3.4828, last reward: -3.8391, gradient norm:  59.85:  36%|███▌      | 222/625 [00:33<01:01,  6.51it/s]
reward: -3.6265, last reward: -4.2913, gradient norm:  8.947:  36%|███▌      | 222/625 [00:33<01:01,  6.51it/s]
reward: -3.6265, last reward: -4.2913, gradient norm:  8.947:  36%|███▌      | 223/625 [00:33<01:01,  6.50it/s]
reward: -3.5541, last reward: -4.1252, gradient norm:  255.9:  36%|███▌      | 223/625 [00:34<01:01,  6.50it/s]
reward: -3.5541, last reward: -4.1252, gradient norm:  255.9:  36%|███▌      | 224/625 [00:34<01:01,  6.49it/s]
reward: -3.7342, last reward: -2.2396, gradient norm:  7.995:  36%|███▌      | 224/625 [00:34<01:01,  6.49it/s]
reward: -3.7342, last reward: -2.2396, gradient norm:  7.995:  36%|███▌      | 225/625 [00:34<01:01,  6.48it/s]
reward: -3.5936, last reward: -4.1924, gradient norm:  59.49:  36%|███▌      | 225/625 [00:34<01:01,  6.48it/s]
reward: -3.5936, last reward: -4.1924, gradient norm:  59.49:  36%|███▌      | 226/625 [00:34<01:01,  6.48it/s]
reward: -3.9975, last reward: -4.2045, gradient norm:  21.77:  36%|███▌      | 226/625 [00:34<01:01,  6.48it/s]
reward: -3.9975, last reward: -4.2045, gradient norm:  21.77:  36%|███▋      | 227/625 [00:34<01:01,  6.48it/s]
reward: -3.8367, last reward: -1.9540, gradient norm:  32.26:  36%|███▋      | 227/625 [00:34<01:01,  6.48it/s]
reward: -3.8367, last reward: -1.9540, gradient norm:  32.26:  36%|███▋      | 228/625 [00:34<01:01,  6.48it/s]
reward: -3.7259, last reward: -3.6743, gradient norm:  28.62:  36%|███▋      | 228/625 [00:34<01:01,  6.48it/s]
reward: -3.7259, last reward: -3.6743, gradient norm:  28.62:  37%|███▋      | 229/625 [00:34<01:01,  6.41it/s]
reward: -3.4827, last reward: -3.7528, gradient norm:  64.85:  37%|███▋      | 229/625 [00:34<01:01,  6.41it/s]
reward: -3.4827, last reward: -3.7528, gradient norm:  64.85:  37%|███▋      | 230/625 [00:34<01:01,  6.46it/s]
reward: -3.7361, last reward: -3.8756, gradient norm:  24.69:  37%|███▋      | 230/625 [00:35<01:01,  6.46it/s]
reward: -3.7361, last reward: -3.8756, gradient norm:  24.69:  37%|███▋      | 231/625 [00:35<01:00,  6.51it/s]
reward: -3.7646, last reward: -3.1116, gradient norm:  14.25:  37%|███▋      | 231/625 [00:35<01:00,  6.51it/s]
reward: -3.7646, last reward: -3.1116, gradient norm:  14.25:  37%|███▋      | 232/625 [00:35<01:00,  6.53it/s]
reward: -3.5426, last reward: -2.8385, gradient norm:  34.07:  37%|███▋      | 232/625 [00:35<01:00,  6.53it/s]
reward: -3.5426, last reward: -2.8385, gradient norm:  34.07:  37%|███▋      | 233/625 [00:35<00:59,  6.56it/s]
reward: -3.5662, last reward: -1.8585, gradient norm:  11.26:  37%|███▋      | 233/625 [00:35<00:59,  6.56it/s]
reward: -3.5662, last reward: -1.8585, gradient norm:  11.26:  37%|███▋      | 234/625 [00:35<00:59,  6.58it/s]
reward: -3.8234, last reward: -2.7930, gradient norm:  32.18:  37%|███▋      | 234/625 [00:35<00:59,  6.58it/s]
reward: -3.8234, last reward: -2.7930, gradient norm:  32.18:  38%|███▊      | 235/625 [00:35<00:59,  6.59it/s]
reward: -4.2648, last reward: -4.9309, gradient norm:  24.83:  38%|███▊      | 235/625 [00:35<00:59,  6.59it/s]
reward: -4.2648, last reward: -4.9309, gradient norm:  24.83:  38%|███▊      | 236/625 [00:35<00:59,  6.59it/s]
reward: -4.2039, last reward: -3.6817, gradient norm:  19.24:  38%|███▊      | 236/625 [00:36<00:59,  6.59it/s]
reward: -4.2039, last reward: -3.6817, gradient norm:  19.24:  38%|███▊      | 237/625 [00:36<00:58,  6.60it/s]
reward: -4.0943, last reward: -3.1533, gradient norm:  145.1:  38%|███▊      | 237/625 [00:36<00:58,  6.60it/s]
reward: -4.0943, last reward: -3.1533, gradient norm:  145.1:  38%|███▊      | 238/625 [00:36<00:58,  6.61it/s]
reward: -4.3045, last reward: -3.0483, gradient norm:  20.89:  38%|███▊      | 238/625 [00:36<00:58,  6.61it/s]
reward: -4.3045, last reward: -3.0483, gradient norm:  20.89:  38%|███▊      | 239/625 [00:36<00:58,  6.56it/s]
reward: -4.4128, last reward: -5.2528, gradient norm:  24.97:  38%|███▊      | 239/625 [00:36<00:58,  6.56it/s]
reward: -4.4128, last reward: -5.2528, gradient norm:  24.97:  38%|███▊      | 240/625 [00:36<00:58,  6.56it/s]
reward: -4.6415, last reward: -8.0201, gradient norm:  26.74:  38%|███▊      | 240/625 [00:36<00:58,  6.56it/s]
reward: -4.6415, last reward: -8.0201, gradient norm:  26.74:  39%|███▊      | 241/625 [00:36<00:58,  6.53it/s]
reward: -4.4437, last reward: -5.4365, gradient norm:  132.7:  39%|███▊      | 241/625 [00:36<00:58,  6.53it/s]
reward: -4.4437, last reward: -5.4365, gradient norm:  132.7:  39%|███▊      | 242/625 [00:36<00:58,  6.53it/s]
reward: -4.0358, last reward: -3.4943, gradient norm:  11.46:  39%|███▊      | 242/625 [00:36<00:58,  6.53it/s]
reward: -4.0358, last reward: -3.4943, gradient norm:  11.46:  39%|███▉      | 243/625 [00:36<00:58,  6.52it/s]
reward: -4.1272, last reward: -3.5003, gradient norm:  68.09:  39%|███▉      | 243/625 [00:37<00:58,  6.52it/s]
reward: -4.1272, last reward: -3.5003, gradient norm:  68.09:  39%|███▉      | 244/625 [00:37<00:58,  6.52it/s]
reward: -4.1180, last reward: -4.2637, gradient norm:  39.25:  39%|███▉      | 244/625 [00:37<00:58,  6.52it/s]
reward: -4.1180, last reward: -4.2637, gradient norm:  39.25:  39%|███▉      | 245/625 [00:37<00:58,  6.52it/s]
reward: -4.7197, last reward: -3.0873, gradient norm:  12.2:  39%|███▉      | 245/625 [00:37<00:58,  6.52it/s]
reward: -4.7197, last reward: -3.0873, gradient norm:  12.2:  39%|███▉      | 246/625 [00:37<00:58,  6.52it/s]
reward: -4.2917, last reward: -3.6656, gradient norm:  17.17:  39%|███▉      | 246/625 [00:37<00:58,  6.52it/s]
reward: -4.2917, last reward: -3.6656, gradient norm:  17.17:  40%|███▉      | 247/625 [00:37<00:58,  6.51it/s]
reward: -4.0160, last reward: -3.0738, gradient norm:  43.07:  40%|███▉      | 247/625 [00:37<00:58,  6.51it/s]
reward: -4.0160, last reward: -3.0738, gradient norm:  43.07:  40%|███▉      | 248/625 [00:37<00:57,  6.51it/s]
reward: -4.3689, last reward: -4.0120, gradient norm:  11.81:  40%|███▉      | 248/625 [00:37<00:57,  6.51it/s]
reward: -4.3689, last reward: -4.0120, gradient norm:  11.81:  40%|███▉      | 249/625 [00:37<00:57,  6.51it/s]
reward: -4.5570, last reward: -7.0475, gradient norm:  22.45:  40%|███▉      | 249/625 [00:38<00:57,  6.51it/s]
reward: -4.5570, last reward: -7.0475, gradient norm:  22.45:  40%|████      | 250/625 [00:38<00:57,  6.49it/s]
reward: -4.4423, last reward: -5.2220, gradient norm:  18.4:  40%|████      | 250/625 [00:38<00:57,  6.49it/s]
reward: -4.4423, last reward: -5.2220, gradient norm:  18.4:  40%|████      | 251/625 [00:38<00:57,  6.48it/s]
reward: -4.2118, last reward: -4.6803, gradient norm:  15.86:  40%|████      | 251/625 [00:38<00:57,  6.48it/s]
reward: -4.2118, last reward: -4.6803, gradient norm:  15.86:  40%|████      | 252/625 [00:38<00:57,  6.48it/s]
reward: -4.1465, last reward: -3.7214, gradient norm:  25.93:  40%|████      | 252/625 [00:38<00:57,  6.48it/s]
reward: -4.1465, last reward: -3.7214, gradient norm:  25.93:  40%|████      | 253/625 [00:38<00:57,  6.49it/s]
reward: -3.8801, last reward: -2.7034, gradient norm:  103.6:  40%|████      | 253/625 [00:38<00:57,  6.49it/s]
reward: -3.8801, last reward: -2.7034, gradient norm:  103.6:  41%|████      | 254/625 [00:38<00:57,  6.48it/s]
reward: -3.9136, last reward: -4.4076, gradient norm:  17.63:  41%|████      | 254/625 [00:38<00:57,  6.48it/s]
reward: -3.9136, last reward: -4.4076, gradient norm:  17.63:  41%|████      | 255/625 [00:38<00:56,  6.50it/s]
reward: -3.7589, last reward: -4.5013, gradient norm:  143.3:  41%|████      | 255/625 [00:38<00:56,  6.50it/s]
reward: -3.7589, last reward: -4.5013, gradient norm:  143.3:  41%|████      | 256/625 [00:38<00:56,  6.49it/s]
reward: -3.8150, last reward: -3.2241, gradient norm:  113.9:  41%|████      | 256/625 [00:39<00:56,  6.49it/s]
reward: -3.8150, last reward: -3.2241, gradient norm:  113.9:  41%|████      | 257/625 [00:39<00:56,  6.51it/s]
reward: -4.0753, last reward: -3.8081, gradient norm:  14.8:  41%|████      | 257/625 [00:39<00:56,  6.51it/s]
reward: -4.0753, last reward: -3.8081, gradient norm:  14.8:  41%|████▏     | 258/625 [00:39<00:56,  6.50it/s]
reward: -4.1951, last reward: -4.8314, gradient norm:  27.63:  41%|████▏     | 258/625 [00:39<00:56,  6.50it/s]
reward: -4.1951, last reward: -4.8314, gradient norm:  27.63:  41%|████▏     | 259/625 [00:39<00:56,  6.53it/s]
reward: -4.0038, last reward: -2.5333, gradient norm:  42.85:  41%|████▏     | 259/625 [00:39<00:56,  6.53it/s]
reward: -4.0038, last reward: -2.5333, gradient norm:  42.85:  42%|████▏     | 260/625 [00:39<00:56,  6.51it/s]
reward: -4.0889, last reward: -2.4616, gradient norm:  13.78:  42%|████▏     | 260/625 [00:39<00:56,  6.51it/s]
reward: -4.0889, last reward: -2.4616, gradient norm:  13.78:  42%|████▏     | 261/625 [00:39<00:55,  6.51it/s]
reward: -4.0655, last reward: -2.6873, gradient norm:  10.98:  42%|████▏     | 261/625 [00:39<00:55,  6.51it/s]
reward: -4.0655, last reward: -2.6873, gradient norm:  10.98:  42%|████▏     | 262/625 [00:39<00:55,  6.51it/s]
reward: -3.8333, last reward: -1.9476, gradient norm:  13.47:  42%|████▏     | 262/625 [00:40<00:55,  6.51it/s]
reward: -3.8333, last reward: -1.9476, gradient norm:  13.47:  42%|████▏     | 263/625 [00:40<00:55,  6.50it/s]
reward: -3.7554, last reward: -4.3798, gradient norm:  41.76:  42%|████▏     | 263/625 [00:40<00:55,  6.50it/s]
reward: -3.7554, last reward: -4.3798, gradient norm:  41.76:  42%|████▏     | 264/625 [00:40<00:55,  6.49it/s]
reward: -3.3717, last reward: -2.3947, gradient norm:  6.529:  42%|████▏     | 264/625 [00:40<00:55,  6.49it/s]
reward: -3.3717, last reward: -2.3947, gradient norm:  6.529:  42%|████▏     | 265/625 [00:40<00:55,  6.50it/s]
reward: -4.3060, last reward: -4.6495, gradient norm:  11.24:  42%|████▏     | 265/625 [00:40<00:55,  6.50it/s]
reward: -4.3060, last reward: -4.6495, gradient norm:  11.24:  43%|████▎     | 266/625 [00:40<00:55,  6.49it/s]
reward: -4.7467, last reward: -5.8889, gradient norm:  12.35:  43%|████▎     | 266/625 [00:40<00:55,  6.49it/s]
reward: -4.7467, last reward: -5.8889, gradient norm:  12.35:  43%|████▎     | 267/625 [00:40<00:55,  6.50it/s]
reward: -4.9281, last reward: -4.8457, gradient norm:  6.591:  43%|████▎     | 267/625 [00:40<00:55,  6.50it/s]
reward: -4.9281, last reward: -4.8457, gradient norm:  6.591:  43%|████▎     | 268/625 [00:40<00:55,  6.49it/s]
reward: -4.7137, last reward: -4.0536, gradient norm:  5.771:  43%|████▎     | 268/625 [00:40<00:55,  6.49it/s]
reward: -4.7137, last reward: -4.0536, gradient norm:  5.771:  43%|████▎     | 269/625 [00:40<00:54,  6.48it/s]
reward: -4.7197, last reward: -4.1651, gradient norm:  5.388:  43%|████▎     | 269/625 [00:41<00:54,  6.48it/s]
reward: -4.7197, last reward: -4.1651, gradient norm:  5.388:  43%|████▎     | 270/625 [00:41<00:54,  6.47it/s]
reward: -4.8246, last reward: -5.5709, gradient norm:  8.281:  43%|████▎     | 270/625 [00:41<00:54,  6.47it/s]
reward: -4.8246, last reward: -5.5709, gradient norm:  8.281:  43%|████▎     | 271/625 [00:41<00:54,  6.48it/s]
reward: -4.7502, last reward: -5.0521, gradient norm:  9.032:  43%|████▎     | 271/625 [00:41<00:54,  6.48it/s]
reward: -4.7502, last reward: -5.0521, gradient norm:  9.032:  44%|████▎     | 272/625 [00:41<00:54,  6.52it/s]
reward: -4.5475, last reward: -4.7253, gradient norm:  21.18:  44%|████▎     | 272/625 [00:41<00:54,  6.52it/s]
reward: -4.5475, last reward: -4.7253, gradient norm:  21.18:  44%|████▎     | 273/625 [00:41<00:53,  6.53it/s]
reward: -4.2856, last reward: -3.7130, gradient norm:  13.53:  44%|████▎     | 273/625 [00:41<00:53,  6.53it/s]
reward: -4.2856, last reward: -3.7130, gradient norm:  13.53:  44%|████▍     | 274/625 [00:41<00:53,  6.51it/s]
reward: -3.2778, last reward: -3.4122, gradient norm:  28.52:  44%|████▍     | 274/625 [00:41<00:53,  6.51it/s]
reward: -3.2778, last reward: -3.4122, gradient norm:  28.52:  44%|████▍     | 275/625 [00:41<00:53,  6.52it/s]
reward: -3.8368, last reward: -2.1841, gradient norm:  2.07:  44%|████▍     | 275/625 [00:42<00:53,  6.52it/s]
reward: -3.8368, last reward: -2.1841, gradient norm:  2.07:  44%|████▍     | 276/625 [00:42<00:53,  6.51it/s]
reward: -3.9622, last reward: -3.1603, gradient norm:  1.003e+03:  44%|████▍     | 276/625 [00:42<00:53,  6.51it/s]
reward: -3.9622, last reward: -3.1603, gradient norm:  1.003e+03:  44%|████▍     | 277/625 [00:42<00:53,  6.53it/s]
reward: -4.0247, last reward: -2.9830, gradient norm:  8.346:  44%|████▍     | 277/625 [00:42<00:53,  6.53it/s]
reward: -4.0247, last reward: -2.9830, gradient norm:  8.346:  44%|████▍     | 278/625 [00:42<00:53,  6.51it/s]
reward: -4.2238, last reward: -4.6418, gradient norm:  14.55:  44%|████▍     | 278/625 [00:42<00:53,  6.51it/s]
reward: -4.2238, last reward: -4.6418, gradient norm:  14.55:  45%|████▍     | 279/625 [00:42<00:53,  6.51it/s]
reward: -4.0626, last reward: -4.2538, gradient norm:  17.88:  45%|████▍     | 279/625 [00:42<00:53,  6.51it/s]
reward: -4.0626, last reward: -4.2538, gradient norm:  17.88:  45%|████▍     | 280/625 [00:42<00:53,  6.50it/s]
reward: -4.0149, last reward: -3.7380, gradient norm:  13.13:  45%|████▍     | 280/625 [00:42<00:53,  6.50it/s]
reward: -4.0149, last reward: -3.7380, gradient norm:  13.13:  45%|████▍     | 281/625 [00:42<00:52,  6.50it/s]
reward: -4.2167, last reward: -2.8911, gradient norm:  11.41:  45%|████▍     | 281/625 [00:42<00:52,  6.50it/s]
reward: -4.2167, last reward: -2.8911, gradient norm:  11.41:  45%|████▌     | 282/625 [00:42<00:52,  6.47it/s]
reward: -3.8725, last reward: -4.1983, gradient norm:  18.88:  45%|████▌     | 282/625 [00:43<00:52,  6.47it/s]
reward: -3.8725, last reward: -4.1983, gradient norm:  18.88:  45%|████▌     | 283/625 [00:43<00:52,  6.48it/s]
reward: -2.8142, last reward: -2.3709, gradient norm:  43.73:  45%|████▌     | 283/625 [00:43<00:52,  6.48it/s]
reward: -2.8142, last reward: -2.3709, gradient norm:  43.73:  45%|████▌     | 284/625 [00:43<00:52,  6.47it/s]
reward: -3.2022, last reward: -2.4989, gradient norm:  11.14:  45%|████▌     | 284/625 [00:43<00:52,  6.47it/s]
reward: -3.2022, last reward: -2.4989, gradient norm:  11.14:  46%|████▌     | 285/625 [00:43<00:52,  6.50it/s]
reward: -3.6464, last reward: -1.6210, gradient norm:  43.37:  46%|████▌     | 285/625 [00:43<00:52,  6.50it/s]
reward: -3.6464, last reward: -1.6210, gradient norm:  43.37:  46%|████▌     | 286/625 [00:43<00:52,  6.50it/s]
reward: -3.9726, last reward: -3.0820, gradient norm:  39.93:  46%|████▌     | 286/625 [00:43<00:52,  6.50it/s]
reward: -3.9726, last reward: -3.0820, gradient norm:  39.93:  46%|████▌     | 287/625 [00:43<00:52,  6.49it/s]
reward: -3.6975, last reward: -2.9091, gradient norm:  29.46:  46%|████▌     | 287/625 [00:43<00:52,  6.49it/s]
reward: -3.6975, last reward: -2.9091, gradient norm:  29.46:  46%|████▌     | 288/625 [00:43<00:51,  6.49it/s]
reward: -3.4926, last reward: -2.4791, gradient norm:  160.7:  46%|████▌     | 288/625 [00:44<00:51,  6.49it/s]
reward: -3.4926, last reward: -2.4791, gradient norm:  160.7:  46%|████▌     | 289/625 [00:44<00:52,  6.40it/s]
reward: -3.0905, last reward: -1.3500, gradient norm:  31.38:  46%|████▌     | 289/625 [00:44<00:52,  6.40it/s]
reward: -3.0905, last reward: -1.3500, gradient norm:  31.38:  46%|████▋     | 290/625 [00:44<00:51,  6.46it/s]
reward: -3.2287, last reward: -2.7137, gradient norm:  26.31:  46%|████▋     | 290/625 [00:44<00:51,  6.46it/s]
reward: -3.2287, last reward: -2.7137, gradient norm:  26.31:  47%|████▋     | 291/625 [00:44<00:51,  6.50it/s]
reward: -2.9918, last reward: -1.5543, gradient norm:  29.73:  47%|████▋     | 291/625 [00:44<00:51,  6.50it/s]
reward: -2.9918, last reward: -1.5543, gradient norm:  29.73:  47%|████▋     | 292/625 [00:44<00:50,  6.54it/s]
reward: -2.9245, last reward: -0.6444, gradient norm:  2.631:  47%|████▋     | 292/625 [00:44<00:50,  6.54it/s]
reward: -2.9245, last reward: -0.6444, gradient norm:  2.631:  47%|████▋     | 293/625 [00:44<00:50,  6.56it/s]
reward: -3.0448, last reward: -0.4769, gradient norm:  7.266:  47%|████▋     | 293/625 [00:44<00:50,  6.56it/s]
reward: -3.0448, last reward: -0.4769, gradient norm:  7.266:  47%|████▋     | 294/625 [00:44<00:50,  6.58it/s]
reward: -2.8566, last reward: -1.7208, gradient norm:  25.22:  47%|████▋     | 294/625 [00:44<00:50,  6.58it/s]
reward: -2.8566, last reward: -1.7208, gradient norm:  25.22:  47%|████▋     | 295/625 [00:44<00:50,  6.60it/s]
reward: -2.8872, last reward: -1.0966, gradient norm:  8.247:  47%|████▋     | 295/625 [00:45<00:50,  6.60it/s]
reward: -2.8872, last reward: -1.0966, gradient norm:  8.247:  47%|████▋     | 296/625 [00:45<00:49,  6.61it/s]
reward: -2.5303, last reward: -0.1537, gradient norm:  2.023:  47%|████▋     | 296/625 [00:45<00:49,  6.61it/s]
reward: -2.5303, last reward: -0.1537, gradient norm:  2.023:  48%|████▊     | 297/625 [00:45<00:49,  6.62it/s]
reward: -2.6817, last reward: -0.2682, gradient norm:  7.564:  48%|████▊     | 297/625 [00:45<00:49,  6.62it/s]
reward: -2.6817, last reward: -0.2682, gradient norm:  7.564:  48%|████▊     | 298/625 [00:45<00:49,  6.63it/s]
reward: -2.4318, last reward: -0.5063, gradient norm:  14.87:  48%|████▊     | 298/625 [00:45<00:49,  6.63it/s]
reward: -2.4318, last reward: -0.5063, gradient norm:  14.87:  48%|████▊     | 299/625 [00:45<00:49,  6.62it/s]
reward: -2.7475, last reward: -1.4190, gradient norm:  21.66:  48%|████▊     | 299/625 [00:45<00:49,  6.62it/s]
reward: -2.7475, last reward: -1.4190, gradient norm:  21.66:  48%|████▊     | 300/625 [00:45<00:49,  6.63it/s]
reward: -2.8186, last reward: -2.5077, gradient norm:  22.4:  48%|████▊     | 300/625 [00:45<00:49,  6.63it/s]
reward: -2.8186, last reward: -2.5077, gradient norm:  22.4:  48%|████▊     | 301/625 [00:45<00:48,  6.63it/s]
reward: -3.1883, last reward: -1.5291, gradient norm:  7.472:  48%|████▊     | 301/625 [00:46<00:48,  6.63it/s]
reward: -3.1883, last reward: -1.5291, gradient norm:  7.472:  48%|████▊     | 302/625 [00:46<00:48,  6.63it/s]
reward: -2.1256, last reward: -0.3998, gradient norm:  11.01:  48%|████▊     | 302/625 [00:46<00:48,  6.63it/s]
reward: -2.1256, last reward: -0.3998, gradient norm:  11.01:  48%|████▊     | 303/625 [00:46<00:48,  6.63it/s]
reward: -2.3622, last reward: -0.0930, gradient norm:  1.626:  48%|████▊     | 303/625 [00:46<00:48,  6.63it/s]
reward: -2.3622, last reward: -0.0930, gradient norm:  1.626:  49%|████▊     | 304/625 [00:46<00:48,  6.63it/s]
reward: -1.9500, last reward: -0.0075, gradient norm:  0.5664:  49%|████▊     | 304/625 [00:46<00:48,  6.63it/s]
reward: -1.9500, last reward: -0.0075, gradient norm:  0.5664:  49%|████▉     | 305/625 [00:46<00:48,  6.64it/s]
reward: -2.5697, last reward: -0.3024, gradient norm:  22.61:  49%|████▉     | 305/625 [00:46<00:48,  6.64it/s]
reward: -2.5697, last reward: -0.3024, gradient norm:  22.61:  49%|████▉     | 306/625 [00:46<00:48,  6.64it/s]
reward: -2.3117, last reward: -0.0052, gradient norm:  1.006:  49%|████▉     | 306/625 [00:46<00:48,  6.64it/s]
reward: -2.3117, last reward: -0.0052, gradient norm:  1.006:  49%|████▉     | 307/625 [00:46<00:47,  6.63it/s]
reward: -2.0981, last reward: -0.0018, gradient norm:  0.9312:  49%|████▉     | 307/625 [00:46<00:47,  6.63it/s]
reward: -2.0981, last reward: -0.0018, gradient norm:  0.9312:  49%|████▉     | 308/625 [00:46<00:47,  6.63it/s]
reward: -2.5140, last reward: -0.3873, gradient norm:  3.93:  49%|████▉     | 308/625 [00:47<00:47,  6.63it/s]
reward: -2.5140, last reward: -0.3873, gradient norm:  3.93:  49%|████▉     | 309/625 [00:47<00:47,  6.63it/s]
reward: -2.0411, last reward: -0.2650, gradient norm:  3.183:  49%|████▉     | 309/625 [00:47<00:47,  6.63it/s]
reward: -2.0411, last reward: -0.2650, gradient norm:  3.183:  50%|████▉     | 310/625 [00:47<00:47,  6.63it/s]
reward: -2.1656, last reward: -0.0228, gradient norm:  2.004:  50%|████▉     | 310/625 [00:47<00:47,  6.63it/s]
reward: -2.1656, last reward: -0.0228, gradient norm:  2.004:  50%|████▉     | 311/625 [00:47<00:47,  6.62it/s]
reward: -2.1196, last reward: -0.2478, gradient norm:  11.78:  50%|████▉     | 311/625 [00:47<00:47,  6.62it/s]
reward: -2.1196, last reward: -0.2478, gradient norm:  11.78:  50%|████▉     | 312/625 [00:47<00:47,  6.62it/s]
reward: -2.7353, last reward: -3.0812, gradient norm:  82.91:  50%|████▉     | 312/625 [00:47<00:47,  6.62it/s]
reward: -2.7353, last reward: -3.0812, gradient norm:  82.91:  50%|█████     | 313/625 [00:47<00:47,  6.62it/s]
reward: -3.0995, last reward: -2.3022, gradient norm:  8.758:  50%|█████     | 313/625 [00:47<00:47,  6.62it/s]
reward: -3.0995, last reward: -2.3022, gradient norm:  8.758:  50%|█████     | 314/625 [00:47<00:46,  6.62it/s]
reward: -3.1406, last reward: -2.4626, gradient norm:  15.99:  50%|█████     | 314/625 [00:47<00:46,  6.62it/s]
reward: -3.1406, last reward: -2.4626, gradient norm:  15.99:  50%|█████     | 315/625 [00:47<00:46,  6.62it/s]
reward: -3.2156, last reward: -1.9055, gradient norm:  7.851:  50%|█████     | 315/625 [00:48<00:46,  6.62it/s]
reward: -3.2156, last reward: -1.9055, gradient norm:  7.851:  51%|█████     | 316/625 [00:48<00:46,  6.62it/s]
reward: -3.1953, last reward: -2.3774, gradient norm:  19.78:  51%|█████     | 316/625 [00:48<00:46,  6.62it/s]
reward: -3.1953, last reward: -2.3774, gradient norm:  19.78:  51%|█████     | 317/625 [00:48<00:46,  6.63it/s]
reward: -2.6385, last reward: -0.9917, gradient norm:  16.15:  51%|█████     | 317/625 [00:48<00:46,  6.63it/s]
reward: -2.6385, last reward: -0.9917, gradient norm:  16.15:  51%|█████     | 318/625 [00:48<00:46,  6.62it/s]
reward: -2.2764, last reward: -0.0536, gradient norm:  2.905:  51%|█████     | 318/625 [00:48<00:46,  6.62it/s]
reward: -2.2764, last reward: -0.0536, gradient norm:  2.905:  51%|█████     | 319/625 [00:48<00:46,  6.62it/s]
reward: -2.6391, last reward: -1.9317, gradient norm:  23.78:  51%|█████     | 319/625 [00:48<00:46,  6.62it/s]
reward: -2.6391, last reward: -1.9317, gradient norm:  23.78:  51%|█████     | 320/625 [00:48<00:46,  6.62it/s]
reward: -2.9748, last reward: -4.2679, gradient norm:  59.43:  51%|█████     | 320/625 [00:48<00:46,  6.62it/s]
reward: -2.9748, last reward: -4.2679, gradient norm:  59.43:  51%|█████▏    | 321/625 [00:48<00:45,  6.62it/s]
reward: -2.8495, last reward: -4.5125, gradient norm:  52.19:  51%|█████▏    | 321/625 [00:49<00:45,  6.62it/s]
reward: -2.8495, last reward: -4.5125, gradient norm:  52.19:  52%|█████▏    | 322/625 [00:49<00:45,  6.62it/s]
reward: -2.8177, last reward: -2.6602, gradient norm:  52.75:  52%|█████▏    | 322/625 [00:49<00:45,  6.62it/s]
reward: -2.8177, last reward: -2.6602, gradient norm:  52.75:  52%|█████▏    | 323/625 [00:49<00:45,  6.62it/s]
reward: -2.0704, last reward: -0.5776, gradient norm:  59.07:  52%|█████▏    | 323/625 [00:49<00:45,  6.62it/s]
reward: -2.0704, last reward: -0.5776, gradient norm:  59.07:  52%|█████▏    | 324/625 [00:49<00:45,  6.62it/s]
reward: -1.9833, last reward: -0.1339, gradient norm:  4.402:  52%|█████▏    | 324/625 [00:49<00:45,  6.62it/s]
reward: -1.9833, last reward: -0.1339, gradient norm:  4.402:  52%|█████▏    | 325/625 [00:49<00:45,  6.62it/s]
reward: -2.2760, last reward: -2.1238, gradient norm:  30.36:  52%|█████▏    | 325/625 [00:49<00:45,  6.62it/s]
reward: -2.2760, last reward: -2.1238, gradient norm:  30.36:  52%|█████▏    | 326/625 [00:49<00:45,  6.63it/s]
reward: -2.9299, last reward: -5.0227, gradient norm:  100.5:  52%|█████▏    | 326/625 [00:49<00:45,  6.63it/s]
reward: -2.9299, last reward: -5.0227, gradient norm:  100.5:  52%|█████▏    | 327/625 [00:49<00:44,  6.63it/s]
reward: -2.7727, last reward: -2.1607, gradient norm:  336.7:  52%|█████▏    | 327/625 [00:49<00:44,  6.63it/s]
reward: -2.7727, last reward: -2.1607, gradient norm:  336.7:  52%|█████▏    | 328/625 [00:49<00:44,  6.62it/s]
reward: -2.3958, last reward: -0.3223, gradient norm:  2.763:  52%|█████▏    | 328/625 [00:50<00:44,  6.62it/s]
reward: -2.3958, last reward: -0.3223, gradient norm:  2.763:  53%|█████▎    | 329/625 [00:50<00:44,  6.63it/s]
reward: -2.4742, last reward: -0.1797, gradient norm:  47.32:  53%|█████▎    | 329/625 [00:50<00:44,  6.63it/s]
reward: -2.4742, last reward: -0.1797, gradient norm:  47.32:  53%|█████▎    | 330/625 [00:50<00:44,  6.63it/s]
reward: -2.0144, last reward: -0.0085, gradient norm:  4.791:  53%|█████▎    | 330/625 [00:50<00:44,  6.63it/s]
reward: -2.0144, last reward: -0.0085, gradient norm:  4.791:  53%|█████▎    | 331/625 [00:50<00:44,  6.64it/s]
reward: -1.8284, last reward: -0.0428, gradient norm:  12.29:  53%|█████▎    | 331/625 [00:50<00:44,  6.64it/s]
reward: -1.8284, last reward: -0.0428, gradient norm:  12.29:  53%|█████▎    | 332/625 [00:50<00:44,  6.64it/s]
reward: -2.5229, last reward: -0.0098, gradient norm:  0.7365:  53%|█████▎    | 332/625 [00:50<00:44,  6.64it/s]
reward: -2.5229, last reward: -0.0098, gradient norm:  0.7365:  53%|█████▎    | 333/625 [00:50<00:44,  6.63it/s]
reward: -2.4566, last reward: -0.0781, gradient norm:  2.086:  53%|█████▎    | 333/625 [00:50<00:44,  6.63it/s]
reward: -2.4566, last reward: -0.0781, gradient norm:  2.086:  53%|█████▎    | 334/625 [00:50<00:43,  6.64it/s]
reward: -2.3355, last reward: -0.0230, gradient norm:  1.311:  53%|█████▎    | 334/625 [00:50<00:43,  6.64it/s]
reward: -2.3355, last reward: -0.0230, gradient norm:  1.311:  54%|█████▎    | 335/625 [00:50<00:43,  6.63it/s]
reward: -1.9346, last reward: -0.0423, gradient norm:  1.076:  54%|█████▎    | 335/625 [00:51<00:43,  6.63it/s]
reward: -1.9346, last reward: -0.0423, gradient norm:  1.076:  54%|█████▍    | 336/625 [00:51<00:43,  6.63it/s]
reward: -2.3711, last reward: -0.1335, gradient norm:  0.6855:  54%|█████▍    | 336/625 [00:51<00:43,  6.63it/s]
reward: -2.3711, last reward: -0.1335, gradient norm:  0.6855:  54%|█████▍    | 337/625 [00:51<00:43,  6.63it/s]
reward: -2.0304, last reward: -0.0023, gradient norm:  0.8459:  54%|█████▍    | 337/625 [00:51<00:43,  6.63it/s]
reward: -2.0304, last reward: -0.0023, gradient norm:  0.8459:  54%|█████▍    | 338/625 [00:51<00:43,  6.63it/s]
reward: -1.9998, last reward: -0.4399, gradient norm:  13.1:  54%|█████▍    | 338/625 [00:51<00:43,  6.63it/s]
reward: -1.9998, last reward: -0.4399, gradient norm:  13.1:  54%|█████▍    | 339/625 [00:51<00:43,  6.64it/s]
reward: -2.2303, last reward: -2.1346, gradient norm:  45.99:  54%|█████▍    | 339/625 [00:51<00:43,  6.64it/s]
reward: -2.2303, last reward: -2.1346, gradient norm:  45.99:  54%|█████▍    | 340/625 [00:51<00:43,  6.63it/s]
reward: -2.2915, last reward: -1.7116, gradient norm:  40.34:  54%|█████▍    | 340/625 [00:51<00:43,  6.63it/s]
reward: -2.2915, last reward: -1.7116, gradient norm:  40.34:  55%|█████▍    | 341/625 [00:51<00:42,  6.62it/s]
reward: -2.5560, last reward: -0.0487, gradient norm:  1.195:  55%|█████▍    | 341/625 [00:52<00:42,  6.62it/s]
reward: -2.5560, last reward: -0.0487, gradient norm:  1.195:  55%|█████▍    | 342/625 [00:52<00:42,  6.62it/s]
reward: -2.5119, last reward: -0.0358, gradient norm:  1.061:  55%|█████▍    | 342/625 [00:52<00:42,  6.62it/s]
reward: -2.5119, last reward: -0.0358, gradient norm:  1.061:  55%|█████▍    | 343/625 [00:52<00:42,  6.63it/s]
reward: -2.3305, last reward: -0.3705, gradient norm:  1.957:  55%|█████▍    | 343/625 [00:52<00:42,  6.63it/s]
reward: -2.3305, last reward: -0.3705, gradient norm:  1.957:  55%|█████▌    | 344/625 [00:52<00:42,  6.64it/s]
reward: -2.6068, last reward: -0.2112, gradient norm:  13.83:  55%|█████▌    | 344/625 [00:52<00:42,  6.64it/s]
reward: -2.6068, last reward: -0.2112, gradient norm:  13.83:  55%|█████▌    | 345/625 [00:52<00:42,  6.64it/s]
reward: -2.5731, last reward: -1.8455, gradient norm:  66.75:  55%|█████▌    | 345/625 [00:52<00:42,  6.64it/s]
reward: -2.5731, last reward: -1.8455, gradient norm:  66.75:  55%|█████▌    | 346/625 [00:52<00:42,  6.63it/s]
reward: -2.3897, last reward: -0.0376, gradient norm:  1.608:  55%|█████▌    | 346/625 [00:52<00:42,  6.63it/s]
reward: -2.3897, last reward: -0.0376, gradient norm:  1.608:  56%|█████▌    | 347/625 [00:52<00:41,  6.62it/s]
reward: -2.2264, last reward: -0.0434, gradient norm:  2.012:  56%|█████▌    | 347/625 [00:52<00:41,  6.62it/s]
reward: -2.2264, last reward: -0.0434, gradient norm:  2.012:  56%|█████▌    | 348/625 [00:52<00:41,  6.63it/s]
reward: -2.1300, last reward: -0.1215, gradient norm:  2.557:  56%|█████▌    | 348/625 [00:53<00:41,  6.63it/s]
reward: -2.1300, last reward: -0.1215, gradient norm:  2.557:  56%|█████▌    | 349/625 [00:53<00:41,  6.62it/s]
reward: -2.0968, last reward: -0.0885, gradient norm:  3.389:  56%|█████▌    | 349/625 [00:53<00:41,  6.62it/s]
reward: -2.0968, last reward: -0.0885, gradient norm:  3.389:  56%|█████▌    | 350/625 [00:53<00:41,  6.61it/s]
reward: -2.1348, last reward: -0.0073, gradient norm:  0.5052:  56%|█████▌    | 350/625 [00:53<00:41,  6.61it/s]
reward: -2.1348, last reward: -0.0073, gradient norm:  0.5052:  56%|█████▌    | 351/625 [00:53<00:41,  6.62it/s]
reward: -2.4184, last reward: -3.2817, gradient norm:  108.6:  56%|█████▌    | 351/625 [00:53<00:41,  6.62it/s]
reward: -2.4184, last reward: -3.2817, gradient norm:  108.6:  56%|█████▋    | 352/625 [00:53<00:41,  6.61it/s]
reward: -2.3774, last reward: -1.8887, gradient norm:  54.07:  56%|█████▋    | 352/625 [00:53<00:41,  6.61it/s]
reward: -2.3774, last reward: -1.8887, gradient norm:  54.07:  56%|█████▋    | 353/625 [00:53<00:41,  6.62it/s]
reward: -2.4779, last reward: -0.1009, gradient norm:  10.91:  56%|█████▋    | 353/625 [00:53<00:41,  6.62it/s]
reward: -2.4779, last reward: -0.1009, gradient norm:  10.91:  57%|█████▋    | 354/625 [00:53<00:40,  6.61it/s]
reward: -2.2588, last reward: -0.0604, gradient norm:  2.599:  57%|█████▋    | 354/625 [00:54<00:40,  6.61it/s]
reward: -2.2588, last reward: -0.0604, gradient norm:  2.599:  57%|█████▋    | 355/625 [00:54<00:40,  6.62it/s]
reward: -2.4486, last reward: -0.1176, gradient norm:  3.656:  57%|█████▋    | 355/625 [00:54<00:40,  6.62it/s]
reward: -2.4486, last reward: -0.1176, gradient norm:  3.656:  57%|█████▋    | 356/625 [00:54<00:40,  6.62it/s]
reward: -2.2436, last reward: -0.0668, gradient norm:  2.724:  57%|█████▋    | 356/625 [00:54<00:40,  6.62it/s]
reward: -2.2436, last reward: -0.0668, gradient norm:  2.724:  57%|█████▋    | 357/625 [00:54<00:40,  6.62it/s]
reward: -1.8849, last reward: -0.0012, gradient norm:  5.326:  57%|█████▋    | 357/625 [00:54<00:40,  6.62it/s]
reward: -1.8849, last reward: -0.0012, gradient norm:  5.326:  57%|█████▋    | 358/625 [00:54<00:40,  6.62it/s]
reward: -2.7511, last reward: -0.8804, gradient norm:  13.6:  57%|█████▋    | 358/625 [00:54<00:40,  6.62it/s]
reward: -2.7511, last reward: -0.8804, gradient norm:  13.6:  57%|█████▋    | 359/625 [00:54<00:40,  6.62it/s]
reward: -2.8870, last reward: -3.6728, gradient norm:  33.56:  57%|█████▋    | 359/625 [00:54<00:40,  6.62it/s]
reward: -2.8870, last reward: -3.6728, gradient norm:  33.56:  58%|█████▊    | 360/625 [00:54<00:40,  6.62it/s]
reward: -2.8841, last reward: -2.5508, gradient norm:  30.93:  58%|█████▊    | 360/625 [00:54<00:40,  6.62it/s]
reward: -2.8841, last reward: -2.5508, gradient norm:  30.93:  58%|█████▊    | 361/625 [00:54<00:40,  6.59it/s]
reward: -2.5242, last reward: -1.0268, gradient norm:  33.15:  58%|█████▊    | 361/625 [00:55<00:40,  6.59it/s]
reward: -2.5242, last reward: -1.0268, gradient norm:  33.15:  58%|█████▊    | 362/625 [00:55<00:39,  6.60it/s]
reward: -2.3232, last reward: -0.0013, gradient norm:  0.6185:  58%|█████▊    | 362/625 [00:55<00:39,  6.60it/s]
reward: -2.3232, last reward: -0.0013, gradient norm:  0.6185:  58%|█████▊    | 363/625 [00:55<00:39,  6.61it/s]
reward: -2.1378, last reward: -0.0204, gradient norm:  1.337:  58%|█████▊    | 363/625 [00:55<00:39,  6.61it/s]
reward: -2.1378, last reward: -0.0204, gradient norm:  1.337:  58%|█████▊    | 364/625 [00:55<00:39,  6.61it/s]
reward: -2.2677, last reward: -0.0355, gradient norm:  1.685:  58%|█████▊    | 364/625 [00:55<00:39,  6.61it/s]
reward: -2.2677, last reward: -0.0355, gradient norm:  1.685:  58%|█████▊    | 365/625 [00:55<00:39,  6.61it/s]
reward: -2.4884, last reward: -0.0231, gradient norm:  1.213:  58%|█████▊    | 365/625 [00:55<00:39,  6.61it/s]
reward: -2.4884, last reward: -0.0231, gradient norm:  1.213:  59%|█████▊    | 366/625 [00:55<00:39,  6.62it/s]
reward: -2.0770, last reward: -0.0014, gradient norm:  0.6793:  59%|█████▊    | 366/625 [00:55<00:39,  6.62it/s]
reward: -2.0770, last reward: -0.0014, gradient norm:  0.6793:  59%|█████▊    | 367/625 [00:55<00:39,  6.60it/s]
reward: -1.9834, last reward: -0.0349, gradient norm:  1.863:  59%|█████▊    | 367/625 [00:55<00:39,  6.60it/s]
reward: -1.9834, last reward: -0.0349, gradient norm:  1.863:  59%|█████▉    | 368/625 [00:55<00:38,  6.60it/s]
reward: -2.6709, last reward: -0.1416, gradient norm:  5.462:  59%|█████▉    | 368/625 [00:56<00:38,  6.60it/s]
reward: -2.6709, last reward: -0.1416, gradient norm:  5.462:  59%|█████▉    | 369/625 [00:56<00:38,  6.61it/s]
reward: -2.5199, last reward: -3.9790, gradient norm:  47.67:  59%|█████▉    | 369/625 [00:56<00:38,  6.61it/s]
reward: -2.5199, last reward: -3.9790, gradient norm:  47.67:  59%|█████▉    | 370/625 [00:56<00:38,  6.62it/s]
reward: -2.9401, last reward: -3.7802, gradient norm:  32.47:  59%|█████▉    | 370/625 [00:56<00:38,  6.62it/s]
reward: -2.9401, last reward: -3.7802, gradient norm:  32.47:  59%|█████▉    | 371/625 [00:56<00:38,  6.60it/s]
reward: -2.6723, last reward: -3.6507, gradient norm:  45.1:  59%|█████▉    | 371/625 [00:56<00:38,  6.60it/s]
reward: -2.6723, last reward: -3.6507, gradient norm:  45.1:  60%|█████▉    | 372/625 [00:56<00:38,  6.60it/s]
reward: -2.2678, last reward: -0.6201, gradient norm:  32.94:  60%|█████▉    | 372/625 [00:56<00:38,  6.60it/s]
reward: -2.2678, last reward: -0.6201, gradient norm:  32.94:  60%|█████▉    | 373/625 [00:56<00:38,  6.61it/s]
reward: -2.2184, last reward: -0.0075, gradient norm:  0.7385:  60%|█████▉    | 373/625 [00:56<00:38,  6.61it/s]
reward: -2.2184, last reward: -0.0075, gradient norm:  0.7385:  60%|█████▉    | 374/625 [00:56<00:37,  6.62it/s]
reward: -2.6344, last reward: -0.0576, gradient norm:  1.617:  60%|█████▉    | 374/625 [00:57<00:37,  6.62it/s]
reward: -2.6344, last reward: -0.0576, gradient norm:  1.617:  60%|██████    | 375/625 [00:57<00:37,  6.62it/s]
reward: -1.9945, last reward: -0.0772, gradient norm:  2.567:  60%|██████    | 375/625 [00:57<00:37,  6.62it/s]
reward: -1.9945, last reward: -0.0772, gradient norm:  2.567:  60%|██████    | 376/625 [00:57<00:37,  6.59it/s]
reward: -1.7576, last reward: -0.0398, gradient norm:  1.961:  60%|██████    | 376/625 [00:57<00:37,  6.59it/s]
reward: -1.7576, last reward: -0.0398, gradient norm:  1.961:  60%|██████    | 377/625 [00:57<00:37,  6.61it/s]
reward: -2.3396, last reward: -0.0022, gradient norm:  1.094:  60%|██████    | 377/625 [00:57<00:37,  6.61it/s]
reward: -2.3396, last reward: -0.0022, gradient norm:  1.094:  60%|██████    | 378/625 [00:57<00:37,  6.61it/s]
reward: -2.3073, last reward: -0.4018, gradient norm:  29.23:  60%|██████    | 378/625 [00:57<00:37,  6.61it/s]
reward: -2.3073, last reward: -0.4018, gradient norm:  29.23:  61%|██████    | 379/625 [00:57<00:51,  4.77it/s]
reward: -2.3313, last reward: -1.1869, gradient norm:  38.62:  61%|██████    | 379/625 [00:57<00:51,  4.77it/s]
reward: -2.3313, last reward: -1.1869, gradient norm:  38.62:  61%|██████    | 380/625 [00:57<00:47,  5.20it/s]
reward: -2.0481, last reward: -0.1117, gradient norm:  5.321:  61%|██████    | 380/625 [00:58<00:47,  5.20it/s]
reward: -2.0481, last reward: -0.1117, gradient norm:  5.321:  61%|██████    | 381/625 [00:58<00:43,  5.56it/s]
reward: -1.6823, last reward: -0.0001, gradient norm:  1.981:  61%|██████    | 381/625 [00:58<00:43,  5.56it/s]
reward: -1.6823, last reward: -0.0001, gradient norm:  1.981:  61%|██████    | 382/625 [00:58<00:41,  5.84it/s]
reward: -1.8305, last reward: -0.0210, gradient norm:  1.228:  61%|██████    | 382/625 [00:58<00:41,  5.84it/s]
reward: -1.8305, last reward: -0.0210, gradient norm:  1.228:  61%|██████▏   | 383/625 [00:58<00:39,  6.06it/s]
reward: -1.4908, last reward: -0.0272, gradient norm:  1.538:  61%|██████▏   | 383/625 [00:58<00:39,  6.06it/s]
reward: -1.4908, last reward: -0.0272, gradient norm:  1.538:  61%|██████▏   | 384/625 [00:58<00:38,  6.22it/s]
reward: -2.3267, last reward: -0.0111, gradient norm:  0.7965:  61%|██████▏   | 384/625 [00:58<00:38,  6.22it/s]
reward: -2.3267, last reward: -0.0111, gradient norm:  0.7965:  62%|██████▏   | 385/625 [00:58<00:37,  6.33it/s]
reward: -2.1796, last reward: -0.0039, gradient norm:  0.5396:  62%|██████▏   | 385/625 [00:58<00:37,  6.33it/s]
reward: -2.1796, last reward: -0.0039, gradient norm:  0.5396:  62%|██████▏   | 386/625 [00:58<00:37,  6.41it/s]
reward: -2.3757, last reward: -0.0490, gradient norm:  2.237:  62%|██████▏   | 386/625 [00:59<00:37,  6.41it/s]
reward: -2.3757, last reward: -0.0490, gradient norm:  2.237:  62%|██████▏   | 387/625 [00:59<00:36,  6.47it/s]
reward: -2.1394, last reward: -0.4187, gradient norm:  52.11:  62%|██████▏   | 387/625 [00:59<00:36,  6.47it/s]
reward: -2.1394, last reward: -0.4187, gradient norm:  52.11:  62%|██████▏   | 388/625 [00:59<00:36,  6.52it/s]
reward: -2.2986, last reward: -0.0038, gradient norm:  0.7954:  62%|██████▏   | 388/625 [00:59<00:36,  6.52it/s]
reward: -2.2986, last reward: -0.0038, gradient norm:  0.7954:  62%|██████▏   | 389/625 [00:59<00:36,  6.55it/s]
reward: -2.1274, last reward: -0.0063, gradient norm:  0.813:  62%|██████▏   | 389/625 [00:59<00:36,  6.55it/s]
reward: -2.1274, last reward: -0.0063, gradient norm:  0.813:  62%|██████▏   | 390/625 [00:59<00:35,  6.57it/s]
reward: -1.8706, last reward: -0.0114, gradient norm:  3.325:  62%|██████▏   | 390/625 [00:59<00:35,  6.57it/s]
reward: -1.8706, last reward: -0.0114, gradient norm:  3.325:  63%|██████▎   | 391/625 [00:59<00:35,  6.58it/s]
reward: -1.6922, last reward: -0.0004, gradient norm:  0.2423:  63%|██████▎   | 391/625 [00:59<00:35,  6.58it/s]
reward: -1.6922, last reward: -0.0004, gradient norm:  0.2423:  63%|██████▎   | 392/625 [00:59<00:35,  6.59it/s]
reward: -1.9115, last reward: -0.2602, gradient norm:  2.599:  63%|██████▎   | 392/625 [00:59<00:35,  6.59it/s]
reward: -1.9115, last reward: -0.2602, gradient norm:  2.599:  63%|██████▎   | 393/625 [00:59<00:35,  6.60it/s]
reward: -2.2449, last reward: -0.0783, gradient norm:  5.199:  63%|██████▎   | 393/625 [01:00<00:35,  6.60it/s]
reward: -2.2449, last reward: -0.0783, gradient norm:  5.199:  63%|██████▎   | 394/625 [01:00<00:34,  6.61it/s]
reward: -2.0631, last reward: -0.0057, gradient norm:  0.7444:  63%|██████▎   | 394/625 [01:00<00:34,  6.61it/s]
reward: -2.0631, last reward: -0.0057, gradient norm:  0.7444:  63%|██████▎   | 395/625 [01:00<00:34,  6.61it/s]
reward: -2.3339, last reward: -0.0167, gradient norm:  1.39:  63%|██████▎   | 395/625 [01:00<00:34,  6.61it/s]
reward: -2.3339, last reward: -0.0167, gradient norm:  1.39:  63%|██████▎   | 396/625 [01:00<00:34,  6.62it/s]
reward: -2.4806, last reward: -0.0023, gradient norm:  2.317:  63%|██████▎   | 396/625 [01:00<00:34,  6.62it/s]
reward: -2.4806, last reward: -0.0023, gradient norm:  2.317:  64%|██████▎   | 397/625 [01:00<00:34,  6.61it/s]
reward: -2.4171, last reward: -0.1438, gradient norm:  5.067:  64%|██████▎   | 397/625 [01:00<00:34,  6.61it/s]
reward: -2.4171, last reward: -0.1438, gradient norm:  5.067:  64%|██████▎   | 398/625 [01:00<00:34,  6.62it/s]
reward: -2.2618, last reward: -0.5809, gradient norm:  20.39:  64%|██████▎   | 398/625 [01:00<00:34,  6.62it/s]
reward: -2.2618, last reward: -0.5809, gradient norm:  20.39:  64%|██████▍   | 399/625 [01:00<00:34,  6.62it/s]
reward: -2.0115, last reward: -0.0054, gradient norm:  0.3364:  64%|██████▍   | 399/625 [01:01<00:34,  6.62it/s]
reward: -2.0115, last reward: -0.0054, gradient norm:  0.3364:  64%|██████▍   | 400/625 [01:01<00:34,  6.62it/s]
reward: -1.8733, last reward: -0.0184, gradient norm:  2.275:  64%|██████▍   | 400/625 [01:01<00:34,  6.62it/s]
reward: -1.8733, last reward: -0.0184, gradient norm:  2.275:  64%|██████▍   | 401/625 [01:01<00:33,  6.61it/s]
reward: -1.9137, last reward: -0.0113, gradient norm:  1.025:  64%|██████▍   | 401/625 [01:01<00:33,  6.61it/s]
reward: -1.9137, last reward: -0.0113, gradient norm:  1.025:  64%|██████▍   | 402/625 [01:01<00:33,  6.61it/s]
reward: -2.0386, last reward: -0.0625, gradient norm:  2.763:  64%|██████▍   | 402/625 [01:01<00:33,  6.61it/s]
reward: -2.0386, last reward: -0.0625, gradient norm:  2.763:  64%|██████▍   | 403/625 [01:01<00:33,  6.62it/s]
reward: -2.1332, last reward: -0.0582, gradient norm:  0.7816:  64%|██████▍   | 403/625 [01:01<00:33,  6.62it/s]
reward: -2.1332, last reward: -0.0582, gradient norm:  0.7816:  65%|██████▍   | 404/625 [01:01<00:33,  6.63it/s]
reward: -1.8341, last reward: -0.0941, gradient norm:  5.854:  65%|██████▍   | 404/625 [01:01<00:33,  6.63it/s]
reward: -1.8341, last reward: -0.0941, gradient norm:  5.854:  65%|██████▍   | 405/625 [01:01<00:33,  6.63it/s]
reward: -1.8615, last reward: -0.0968, gradient norm:  4.588:  65%|██████▍   | 405/625 [01:01<00:33,  6.63it/s]
reward: -1.8615, last reward: -0.0968, gradient norm:  4.588:  65%|██████▍   | 406/625 [01:01<00:33,  6.62it/s]
reward: -2.0981, last reward: -0.3849, gradient norm:  6.008:  65%|██████▍   | 406/625 [01:02<00:33,  6.62it/s]
reward: -2.0981, last reward: -0.3849, gradient norm:  6.008:  65%|██████▌   | 407/625 [01:02<00:32,  6.62it/s]
reward: -1.9395, last reward: -0.0765, gradient norm:  4.055:  65%|██████▌   | 407/625 [01:02<00:32,  6.62it/s]
reward: -1.9395, last reward: -0.0765, gradient norm:  4.055:  65%|██████▌   | 408/625 [01:02<00:32,  6.62it/s]
reward: -2.2685, last reward: -0.2235, gradient norm:  1.688:  65%|██████▌   | 408/625 [01:02<00:32,  6.62it/s]
reward: -2.2685, last reward: -0.2235, gradient norm:  1.688:  65%|██████▌   | 409/625 [01:02<00:32,  6.62it/s]
reward: -2.3052, last reward: -1.4249, gradient norm:  25.99:  65%|██████▌   | 409/625 [01:02<00:32,  6.62it/s]
reward: -2.3052, last reward: -1.4249, gradient norm:  25.99:  66%|██████▌   | 410/625 [01:02<00:32,  6.61it/s]
reward: -2.6806, last reward: -1.6383, gradient norm:  30.59:  66%|██████▌   | 410/625 [01:02<00:32,  6.61it/s]
reward: -2.6806, last reward: -1.6383, gradient norm:  30.59:  66%|██████▌   | 411/625 [01:02<00:32,  6.58it/s]
reward: -2.3721, last reward: -2.9981, gradient norm:  74.37:  66%|██████▌   | 411/625 [01:02<00:32,  6.58it/s]
reward: -2.3721, last reward: -2.9981, gradient norm:  74.37:  66%|██████▌   | 412/625 [01:02<00:32,  6.60it/s]
reward: -2.1862, last reward: -0.0063, gradient norm:  1.822:  66%|██████▌   | 412/625 [01:02<00:32,  6.60it/s]
reward: -2.1862, last reward: -0.0063, gradient norm:  1.822:  66%|██████▌   | 413/625 [01:02<00:32,  6.61it/s]
reward: -1.9811, last reward: -0.0171, gradient norm:  1.013:  66%|██████▌   | 413/625 [01:03<00:32,  6.61it/s]
reward: -1.9811, last reward: -0.0171, gradient norm:  1.013:  66%|██████▌   | 414/625 [01:03<00:31,  6.62it/s]
reward: -2.0252, last reward: -0.0049, gradient norm:  0.6205:  66%|██████▌   | 414/625 [01:03<00:31,  6.62it/s]
reward: -2.0252, last reward: -0.0049, gradient norm:  0.6205:  66%|██████▋   | 415/625 [01:03<00:31,  6.62it/s]
reward: -2.1108, last reward: -0.4921, gradient norm:  23.74:  66%|██████▋   | 415/625 [01:03<00:31,  6.62it/s]
reward: -2.1108, last reward: -0.4921, gradient norm:  23.74:  67%|██████▋   | 416/625 [01:03<00:31,  6.64it/s]
reward: -1.9142, last reward: -0.8130, gradient norm:  52.65:  67%|██████▋   | 416/625 [01:03<00:31,  6.64it/s]
reward: -1.9142, last reward: -0.8130, gradient norm:  52.65:  67%|██████▋   | 417/625 [01:03<00:31,  6.64it/s]
reward: -2.1725, last reward: -0.0036, gradient norm:  0.3196:  67%|██████▋   | 417/625 [01:03<00:31,  6.64it/s]
reward: -2.1725, last reward: -0.0036, gradient norm:  0.3196:  67%|██████▋   | 418/625 [01:03<00:31,  6.65it/s]
reward: -1.7795, last reward: -0.0242, gradient norm:  1.799:  67%|██████▋   | 418/625 [01:03<00:31,  6.65it/s]
reward: -1.7795, last reward: -0.0242, gradient norm:  1.799:  67%|██████▋   | 419/625 [01:03<00:31,  6.64it/s]
reward: -1.7737, last reward: -0.0138, gradient norm:  1.39:  67%|██████▋   | 419/625 [01:04<00:31,  6.64it/s]
reward: -1.7737, last reward: -0.0138, gradient norm:  1.39:  67%|██████▋   | 420/625 [01:04<00:30,  6.64it/s]
reward: -2.1462, last reward: -0.0053, gradient norm:  0.47:  67%|██████▋   | 420/625 [01:04<00:30,  6.64it/s]
reward: -2.1462, last reward: -0.0053, gradient norm:  0.47:  67%|██████▋   | 421/625 [01:04<00:30,  6.63it/s]
reward: -1.9226, last reward: -0.6139, gradient norm:  40.3:  67%|██████▋   | 421/625 [01:04<00:30,  6.63it/s]
reward: -1.9226, last reward: -0.6139, gradient norm:  40.3:  68%|██████▊   | 422/625 [01:04<00:30,  6.63it/s]
reward: -1.9889, last reward: -0.0403, gradient norm:  1.112:  68%|██████▊   | 422/625 [01:04<00:30,  6.63it/s]
reward: -1.9889, last reward: -0.0403, gradient norm:  1.112:  68%|██████▊   | 423/625 [01:04<00:30,  6.63it/s]
reward: -1.6194, last reward: -0.0032, gradient norm:  0.79:  68%|██████▊   | 423/625 [01:04<00:30,  6.63it/s]
reward: -1.6194, last reward: -0.0032, gradient norm:  0.79:  68%|██████▊   | 424/625 [01:04<00:30,  6.62it/s]
reward: -2.3989, last reward: -0.0104, gradient norm:  1.134:  68%|██████▊   | 424/625 [01:04<00:30,  6.62it/s]
reward: -2.3989, last reward: -0.0104, gradient norm:  1.134:  68%|██████▊   | 425/625 [01:04<00:30,  6.62it/s]
reward: -1.9960, last reward: -0.0009, gradient norm:  0.6009:  68%|██████▊   | 425/625 [01:04<00:30,  6.62it/s]
reward: -1.9960, last reward: -0.0009, gradient norm:  0.6009:  68%|██████▊   | 426/625 [01:04<00:30,  6.62it/s]
reward: -2.2697, last reward: -0.0914, gradient norm:  2.905:  68%|██████▊   | 426/625 [01:05<00:30,  6.62it/s]
reward: -2.2697, last reward: -0.0914, gradient norm:  2.905:  68%|██████▊   | 427/625 [01:05<00:29,  6.62it/s]
reward: -2.4256, last reward: -0.1114, gradient norm:  2.102:  68%|██████▊   | 427/625 [01:05<00:29,  6.62it/s]
reward: -2.4256, last reward: -0.1114, gradient norm:  2.102:  68%|██████▊   | 428/625 [01:05<00:29,  6.62it/s]
reward: -1.9862, last reward: -0.1932, gradient norm:  22.44:  68%|██████▊   | 428/625 [01:05<00:29,  6.62it/s]
reward: -1.9862, last reward: -0.1932, gradient norm:  22.44:  69%|██████▊   | 429/625 [01:05<00:29,  6.62it/s]
reward: -2.0637, last reward: -0.0623, gradient norm:  3.082:  69%|██████▊   | 429/625 [01:05<00:29,  6.62it/s]
reward: -2.0637, last reward: -0.0623, gradient norm:  3.082:  69%|██████▉   | 430/625 [01:05<00:29,  6.63it/s]
reward: -1.9906, last reward: -0.2031, gradient norm:  5.5:  69%|██████▉   | 430/625 [01:05<00:29,  6.63it/s]
reward: -1.9906, last reward: -0.2031, gradient norm:  5.5:  69%|██████▉   | 431/625 [01:05<00:29,  6.63it/s]
reward: -1.9948, last reward: -0.0895, gradient norm:  3.456:  69%|██████▉   | 431/625 [01:05<00:29,  6.63it/s]
reward: -1.9948, last reward: -0.0895, gradient norm:  3.456:  69%|██████▉   | 432/625 [01:05<00:29,  6.64it/s]
reward: -2.1970, last reward: -0.0256, gradient norm:  1.593:  69%|██████▉   | 432/625 [01:05<00:29,  6.64it/s]
reward: -2.1970, last reward: -0.0256, gradient norm:  1.593:  69%|██████▉   | 433/625 [01:05<00:29,  6.62it/s]
reward: -2.4231, last reward: -0.0449, gradient norm:  3.644:  69%|██████▉   | 433/625 [01:06<00:29,  6.62it/s]
reward: -2.4231, last reward: -0.0449, gradient norm:  3.644:  69%|██████▉   | 434/625 [01:06<00:28,  6.62it/s]
reward: -2.1039, last reward: -3.1973, gradient norm:  87.37:  69%|██████▉   | 434/625 [01:06<00:28,  6.62it/s]
reward: -2.1039, last reward: -3.1973, gradient norm:  87.37:  70%|██████▉   | 435/625 [01:06<00:28,  6.62it/s]
reward: -2.4561, last reward: -0.1225, gradient norm:  6.119:  70%|██████▉   | 435/625 [01:06<00:28,  6.62it/s]
reward: -2.4561, last reward: -0.1225, gradient norm:  6.119:  70%|██████▉   | 436/625 [01:06<00:28,  6.58it/s]
reward: -2.0211, last reward: -0.2125, gradient norm:  2.94:  70%|██████▉   | 436/625 [01:06<00:28,  6.58it/s]
reward: -2.0211, last reward: -0.2125, gradient norm:  2.94:  70%|██████▉   | 437/625 [01:06<00:28,  6.57it/s]
reward: -2.3866, last reward: -0.0050, gradient norm:  0.7202:  70%|██████▉   | 437/625 [01:06<00:28,  6.57it/s]
reward: -2.3866, last reward: -0.0050, gradient norm:  0.7202:  70%|███████   | 438/625 [01:06<00:28,  6.54it/s]
reward: -1.6388, last reward: -0.0072, gradient norm:  0.8657:  70%|███████   | 438/625 [01:06<00:28,  6.54it/s]
reward: -1.6388, last reward: -0.0072, gradient norm:  0.8657:  70%|███████   | 439/625 [01:06<00:28,  6.54it/s]
reward: -2.1187, last reward: -0.0015, gradient norm:  0.5116:  70%|███████   | 439/625 [01:07<00:28,  6.54it/s]
reward: -2.1187, last reward: -0.0015, gradient norm:  0.5116:  70%|███████   | 440/625 [01:07<00:28,  6.52it/s]
reward: -2.0432, last reward: -0.0025, gradient norm:  0.7809:  70%|███████   | 440/625 [01:07<00:28,  6.52it/s]
reward: -2.0432, last reward: -0.0025, gradient norm:  0.7809:  71%|███████   | 441/625 [01:07<00:28,  6.52it/s]
reward: -2.1925, last reward: -0.0103, gradient norm:  2.83:  71%|███████   | 441/625 [01:07<00:28,  6.52it/s]
reward: -2.1925, last reward: -0.0103, gradient norm:  2.83:  71%|███████   | 442/625 [01:07<00:27,  6.54it/s]
reward: -1.9570, last reward: -0.0002, gradient norm:  0.35:  71%|███████   | 442/625 [01:07<00:27,  6.54it/s]
reward: -1.9570, last reward: -0.0002, gradient norm:  0.35:  71%|███████   | 443/625 [01:07<00:27,  6.53it/s]
reward: -2.0871, last reward: -0.0022, gradient norm:  0.5601:  71%|███████   | 443/625 [01:07<00:27,  6.53it/s]
reward: -2.0871, last reward: -0.0022, gradient norm:  0.5601:  71%|███████   | 444/625 [01:07<00:27,  6.51it/s]
reward: -2.0165, last reward: -0.0047, gradient norm:  0.6061:  71%|███████   | 444/625 [01:07<00:27,  6.51it/s]
reward: -2.0165, last reward: -0.0047, gradient norm:  0.6061:  71%|███████   | 445/625 [01:07<00:27,  6.49it/s]
reward: -2.2746, last reward: -0.0027, gradient norm:  0.7887:  71%|███████   | 445/625 [01:07<00:27,  6.49it/s]
reward: -2.2746, last reward: -0.0027, gradient norm:  0.7887:  71%|███████▏  | 446/625 [01:07<00:27,  6.50it/s]
reward: -2.1835, last reward: -0.0035, gradient norm:  0.855:  71%|███████▏  | 446/625 [01:08<00:27,  6.50it/s]
reward: -2.1835, last reward: -0.0035, gradient norm:  0.855:  72%|███████▏  | 447/625 [01:08<00:27,  6.52it/s]
reward: -1.8420, last reward: -0.0103, gradient norm:  1.548:  72%|███████▏  | 447/625 [01:08<00:27,  6.52it/s]
reward: -1.8420, last reward: -0.0103, gradient norm:  1.548:  72%|███████▏  | 448/625 [01:08<00:27,  6.51it/s]
reward: -2.2653, last reward: -0.0126, gradient norm:  0.9736:  72%|███████▏  | 448/625 [01:08<00:27,  6.51it/s]
reward: -2.2653, last reward: -0.0126, gradient norm:  0.9736:  72%|███████▏  | 449/625 [01:08<00:27,  6.52it/s]
reward: -2.0594, last reward: -0.0119, gradient norm:  0.6196:  72%|███████▏  | 449/625 [01:08<00:27,  6.52it/s]
reward: -2.0594, last reward: -0.0119, gradient norm:  0.6196:  72%|███████▏  | 450/625 [01:08<00:26,  6.50it/s]
reward: -2.4509, last reward: -0.0373, gradient norm:  11.44:  72%|███████▏  | 450/625 [01:08<00:26,  6.50it/s]
reward: -2.4509, last reward: -0.0373, gradient norm:  11.44:  72%|███████▏  | 451/625 [01:08<00:26,  6.51it/s]
reward: -2.2528, last reward: -0.0620, gradient norm:  3.992:  72%|███████▏  | 451/625 [01:08<00:26,  6.51it/s]
reward: -2.2528, last reward: -0.0620, gradient norm:  3.992:  72%|███████▏  | 452/625 [01:08<00:26,  6.51it/s]
reward: -1.6898, last reward: -0.3235, gradient norm:  6.687:  72%|███████▏  | 452/625 [01:09<00:26,  6.51it/s]
reward: -1.6898, last reward: -0.3235, gradient norm:  6.687:  72%|███████▏  | 453/625 [01:09<00:26,  6.48it/s]
reward: -1.5879, last reward: -0.0905, gradient norm:  2.84:  72%|███████▏  | 453/625 [01:09<00:26,  6.48it/s]
reward: -1.5879, last reward: -0.0905, gradient norm:  2.84:  73%|███████▎  | 454/625 [01:09<00:26,  6.49it/s]
reward: -1.8406, last reward: -0.0694, gradient norm:  2.288:  73%|███████▎  | 454/625 [01:09<00:26,  6.49it/s]
reward: -1.8406, last reward: -0.0694, gradient norm:  2.288:  73%|███████▎  | 455/625 [01:09<00:26,  6.49it/s]
reward: -1.8259, last reward: -0.0235, gradient norm:  1.304:  73%|███████▎  | 455/625 [01:09<00:26,  6.49it/s]
reward: -1.8259, last reward: -0.0235, gradient norm:  1.304:  73%|███████▎  | 456/625 [01:09<00:26,  6.49it/s]
reward: -1.8500, last reward: -0.0024, gradient norm:  1.416:  73%|███████▎  | 456/625 [01:09<00:26,  6.49it/s]
reward: -1.8500, last reward: -0.0024, gradient norm:  1.416:  73%|███████▎  | 457/625 [01:09<00:25,  6.48it/s]
reward: -1.9649, last reward: -0.4054, gradient norm:  39.3:  73%|███████▎  | 457/625 [01:09<00:25,  6.48it/s]
reward: -1.9649, last reward: -0.4054, gradient norm:  39.3:  73%|███████▎  | 458/625 [01:09<00:25,  6.49it/s]
reward: -2.2027, last reward: -0.0894, gradient norm:  4.275:  73%|███████▎  | 458/625 [01:09<00:25,  6.49it/s]
reward: -2.2027, last reward: -0.0894, gradient norm:  4.275:  73%|███████▎  | 459/625 [01:09<00:25,  6.49it/s]
reward: -1.5966, last reward: -0.0113, gradient norm:  1.368:  73%|███████▎  | 459/625 [01:10<00:25,  6.49it/s]
reward: -1.5966, last reward: -0.0113, gradient norm:  1.368:  74%|███████▎  | 460/625 [01:10<00:25,  6.48it/s]
reward: -1.6942, last reward: -0.0016, gradient norm:  0.4254:  74%|███████▎  | 460/625 [01:10<00:25,  6.48it/s]
reward: -1.6942, last reward: -0.0016, gradient norm:  0.4254:  74%|███████▍  | 461/625 [01:10<00:25,  6.48it/s]
reward: -1.6703, last reward: -0.0145, gradient norm:  2.142:  74%|███████▍  | 461/625 [01:10<00:25,  6.48it/s]
reward: -1.6703, last reward: -0.0145, gradient norm:  2.142:  74%|███████▍  | 462/625 [01:10<00:25,  6.49it/s]
reward: -1.8124, last reward: -0.0218, gradient norm:  0.9196:  74%|███████▍  | 462/625 [01:10<00:25,  6.49it/s]
reward: -1.8124, last reward: -0.0218, gradient norm:  0.9196:  74%|███████▍  | 463/625 [01:10<00:24,  6.49it/s]
reward: -1.8657, last reward: -0.0188, gradient norm:  0.8986:  74%|███████▍  | 463/625 [01:10<00:24,  6.49it/s]
reward: -1.8657, last reward: -0.0188, gradient norm:  0.8986:  74%|███████▍  | 464/625 [01:10<00:24,  6.49it/s]
reward: -2.0884, last reward: -0.0084, gradient norm:  0.5624:  74%|███████▍  | 464/625 [01:10<00:24,  6.49it/s]
reward: -2.0884, last reward: -0.0084, gradient norm:  0.5624:  74%|███████▍  | 465/625 [01:10<00:24,  6.49it/s]
reward: -1.8862, last reward: -0.0006, gradient norm:  0.5384:  74%|███████▍  | 465/625 [01:11<00:24,  6.49it/s]
reward: -1.8862, last reward: -0.0006, gradient norm:  0.5384:  75%|███████▍  | 466/625 [01:11<00:24,  6.50it/s]
reward: -2.1973, last reward: -0.0022, gradient norm:  0.5837:  75%|███████▍  | 466/625 [01:11<00:24,  6.50it/s]
reward: -2.1973, last reward: -0.0022, gradient norm:  0.5837:  75%|███████▍  | 467/625 [01:11<00:24,  6.53it/s]
reward: -1.8954, last reward: -0.0101, gradient norm:  0.6751:  75%|███████▍  | 467/625 [01:11<00:24,  6.53it/s]
reward: -1.8954, last reward: -0.0101, gradient norm:  0.6751:  75%|███████▍  | 468/625 [01:11<00:24,  6.51it/s]
reward: -1.8063, last reward: -0.0122, gradient norm:  0.9635:  75%|███████▍  | 468/625 [01:11<00:24,  6.51it/s]
reward: -1.8063, last reward: -0.0122, gradient norm:  0.9635:  75%|███████▌  | 469/625 [01:11<00:23,  6.51it/s]
reward: -2.0692, last reward: -0.0027, gradient norm:  0.4216:  75%|███████▌  | 469/625 [01:11<00:23,  6.51it/s]
reward: -2.0692, last reward: -0.0027, gradient norm:  0.4216:  75%|███████▌  | 470/625 [01:11<00:23,  6.51it/s]
reward: -2.1227, last reward: -0.0586, gradient norm:  3.162e+03:  75%|███████▌  | 470/625 [01:11<00:23,  6.51it/s]
reward: -2.1227, last reward: -0.0586, gradient norm:  3.162e+03:  75%|███████▌  | 471/625 [01:11<00:23,  6.50it/s]
reward: -1.9690, last reward: -0.0074, gradient norm:  0.4166:  75%|███████▌  | 471/625 [01:11<00:23,  6.50it/s]
reward: -1.9690, last reward: -0.0074, gradient norm:  0.4166:  76%|███████▌  | 472/625 [01:11<00:23,  6.52it/s]
reward: -2.6324, last reward: -0.0119, gradient norm:  1.345:  76%|███████▌  | 472/625 [01:12<00:23,  6.52it/s]
reward: -2.6324, last reward: -0.0119, gradient norm:  1.345:  76%|███████▌  | 473/625 [01:12<00:23,  6.54it/s]
reward: -2.0778, last reward: -0.0098, gradient norm:  1.166:  76%|███████▌  | 473/625 [01:12<00:23,  6.54it/s]
reward: -2.0778, last reward: -0.0098, gradient norm:  1.166:  76%|███████▌  | 474/625 [01:12<00:23,  6.56it/s]
reward: -1.8548, last reward: -0.0017, gradient norm:  0.4408:  76%|███████▌  | 474/625 [01:12<00:23,  6.56it/s]
reward: -1.8548, last reward: -0.0017, gradient norm:  0.4408:  76%|███████▌  | 475/625 [01:12<00:22,  6.58it/s]
reward: -1.8125, last reward: -0.0003, gradient norm:  0.1515:  76%|███████▌  | 475/625 [01:12<00:22,  6.58it/s]
reward: -1.8125, last reward: -0.0003, gradient norm:  0.1515:  76%|███████▌  | 476/625 [01:12<00:22,  6.59it/s]
reward: -2.2733, last reward: -0.0044, gradient norm:  0.2836:  76%|███████▌  | 476/625 [01:12<00:22,  6.59it/s]
reward: -2.2733, last reward: -0.0044, gradient norm:  0.2836:  76%|███████▋  | 477/625 [01:12<00:22,  6.60it/s]
reward: -1.7497, last reward: -0.0149, gradient norm:  0.7681:  76%|███████▋  | 477/625 [01:12<00:22,  6.60it/s]
reward: -1.7497, last reward: -0.0149, gradient norm:  0.7681:  76%|███████▋  | 478/625 [01:12<00:22,  6.60it/s]
reward: -1.8547, last reward: -0.0105, gradient norm:  0.7212:  76%|███████▋  | 478/625 [01:13<00:22,  6.60it/s]
reward: -1.8547, last reward: -0.0105, gradient norm:  0.7212:  77%|███████▋  | 479/625 [01:13<00:22,  6.59it/s]
reward: -1.9848, last reward: -0.0019, gradient norm:  0.6498:  77%|███████▋  | 479/625 [01:13<00:22,  6.59it/s]
reward: -1.9848, last reward: -0.0019, gradient norm:  0.6498:  77%|███████▋  | 480/625 [01:13<00:21,  6.60it/s]
reward: -2.1987, last reward: -0.0011, gradient norm:  0.5473:  77%|███████▋  | 480/625 [01:13<00:21,  6.60it/s]
reward: -2.1987, last reward: -0.0011, gradient norm:  0.5473:  77%|███████▋  | 481/625 [01:13<00:21,  6.59it/s]
reward: -1.8991, last reward: -0.0033, gradient norm:  0.6091:  77%|███████▋  | 481/625 [01:13<00:21,  6.59it/s]
reward: -1.8991, last reward: -0.0033, gradient norm:  0.6091:  77%|███████▋  | 482/625 [01:13<00:21,  6.60it/s]
reward: -1.9189, last reward: -0.0032, gradient norm:  0.5771:  77%|███████▋  | 482/625 [01:13<00:21,  6.60it/s]
reward: -1.9189, last reward: -0.0032, gradient norm:  0.5771:  77%|███████▋  | 483/625 [01:13<00:21,  6.61it/s]
reward: -1.6781, last reward: -0.0004, gradient norm:  0.7542:  77%|███████▋  | 483/625 [01:13<00:21,  6.61it/s]
reward: -1.6781, last reward: -0.0004, gradient norm:  0.7542:  77%|███████▋  | 484/625 [01:13<00:21,  6.61it/s]
reward: -1.5959, last reward: -0.0064, gradient norm:  0.4295:  77%|███████▋  | 484/625 [01:13<00:21,  6.61it/s]
reward: -1.5959, last reward: -0.0064, gradient norm:  0.4295:  78%|███████▊  | 485/625 [01:13<00:21,  6.60it/s]
reward: -2.2547, last reward: -0.0103, gradient norm:  0.4641:  78%|███████▊  | 485/625 [01:14<00:21,  6.60it/s]
reward: -2.2547, last reward: -0.0103, gradient norm:  0.4641:  78%|███████▊  | 486/625 [01:14<00:21,  6.58it/s]
reward: -2.1509, last reward: -0.0636, gradient norm:  6.547:  78%|███████▊  | 486/625 [01:14<00:21,  6.58it/s]
reward: -2.1509, last reward: -0.0636, gradient norm:  6.547:  78%|███████▊  | 487/625 [01:14<00:20,  6.60it/s]
reward: -2.0972, last reward: -0.0065, gradient norm:  0.2593:  78%|███████▊  | 487/625 [01:14<00:20,  6.60it/s]
reward: -2.0972, last reward: -0.0065, gradient norm:  0.2593:  78%|███████▊  | 488/625 [01:14<00:20,  6.61it/s]
reward: -2.1694, last reward: -0.0083, gradient norm:  0.5759:  78%|███████▊  | 488/625 [01:14<00:20,  6.61it/s]
reward: -2.1694, last reward: -0.0083, gradient norm:  0.5759:  78%|███████▊  | 489/625 [01:14<00:20,  6.60it/s]
reward: -2.0493, last reward: -0.0021, gradient norm:  0.7805:  78%|███████▊  | 489/625 [01:14<00:20,  6.60it/s]
reward: -2.0493, last reward: -0.0021, gradient norm:  0.7805:  78%|███████▊  | 490/625 [01:14<00:20,  6.61it/s]
reward: -2.0950, last reward: -0.0021, gradient norm:  0.497:  78%|███████▊  | 490/625 [01:14<00:20,  6.61it/s]
reward: -2.0950, last reward: -0.0021, gradient norm:  0.497:  79%|███████▊  | 491/625 [01:14<00:20,  6.57it/s]
reward: -1.9717, last reward: -0.0012, gradient norm:  0.3672:  79%|███████▊  | 491/625 [01:15<00:20,  6.57it/s]
reward: -1.9717, last reward: -0.0012, gradient norm:  0.3672:  79%|███████▊  | 492/625 [01:15<00:20,  6.58it/s]
reward: -2.0207, last reward: -0.0009, gradient norm:  0.331:  79%|███████▊  | 492/625 [01:15<00:20,  6.58it/s]
reward: -2.0207, last reward: -0.0009, gradient norm:  0.331:  79%|███████▉  | 493/625 [01:15<00:20,  6.59it/s]
reward: -1.8266, last reward: -0.0069, gradient norm:  0.5365:  79%|███████▉  | 493/625 [01:15<00:20,  6.59it/s]
reward: -1.8266, last reward: -0.0069, gradient norm:  0.5365:  79%|███████▉  | 494/625 [01:15<00:19,  6.57it/s]
reward: -2.2623, last reward: -0.0065, gradient norm:  0.5078:  79%|███████▉  | 494/625 [01:15<00:19,  6.57it/s]
reward: -2.2623, last reward: -0.0065, gradient norm:  0.5078:  79%|███████▉  | 495/625 [01:15<00:19,  6.59it/s]
reward: -2.0230, last reward: -0.0027, gradient norm:  0.4545:  79%|███████▉  | 495/625 [01:15<00:19,  6.59it/s]
reward: -2.0230, last reward: -0.0027, gradient norm:  0.4545:  79%|███████▉  | 496/625 [01:15<00:19,  6.60it/s]
reward: -1.6047, last reward: -0.0000, gradient norm:  0.09636:  79%|███████▉  | 496/625 [01:15<00:19,  6.60it/s]
reward: -1.6047, last reward: -0.0000, gradient norm:  0.09636:  80%|███████▉  | 497/625 [01:15<00:19,  6.60it/s]
reward: -1.8754, last reward: -0.0010, gradient norm:  0.2:  80%|███████▉  | 497/625 [01:15<00:19,  6.60it/s]
reward: -1.8754, last reward: -0.0010, gradient norm:  0.2:  80%|███████▉  | 498/625 [01:15<00:19,  6.58it/s]
reward: -2.6216, last reward: -0.0031, gradient norm:  0.8269:  80%|███████▉  | 498/625 [01:16<00:19,  6.58it/s]
reward: -2.6216, last reward: -0.0031, gradient norm:  0.8269:  80%|███████▉  | 499/625 [01:16<00:19,  6.58it/s]
reward: -1.7361, last reward: -0.0023, gradient norm:  0.4082:  80%|███████▉  | 499/625 [01:16<00:19,  6.58it/s]
reward: -1.7361, last reward: -0.0023, gradient norm:  0.4082:  80%|████████  | 500/625 [01:16<00:18,  6.60it/s]
reward: -1.6642, last reward: -0.0006, gradient norm:  0.2284:  80%|████████  | 500/625 [01:16<00:18,  6.60it/s]
reward: -1.6642, last reward: -0.0006, gradient norm:  0.2284:  80%|████████  | 501/625 [01:16<00:18,  6.61it/s]
reward: -1.9130, last reward: -0.0008, gradient norm:  0.3031:  80%|████████  | 501/625 [01:16<00:18,  6.61it/s]
reward: -1.9130, last reward: -0.0008, gradient norm:  0.3031:  80%|████████  | 502/625 [01:16<00:18,  6.62it/s]
reward: -2.2944, last reward: -0.0035, gradient norm:  0.2986:  80%|████████  | 502/625 [01:16<00:18,  6.62it/s]
reward: -2.2944, last reward: -0.0035, gradient norm:  0.2986:  80%|████████  | 503/625 [01:16<00:18,  6.61it/s]
reward: -1.7624, last reward: -0.0056, gradient norm:  0.3858:  80%|████████  | 503/625 [01:16<00:18,  6.61it/s]
reward: -1.7624, last reward: -0.0056, gradient norm:  0.3858:  81%|████████  | 504/625 [01:16<00:18,  6.62it/s]
reward: -2.0890, last reward: -0.0042, gradient norm:  0.38:  81%|████████  | 504/625 [01:16<00:18,  6.62it/s]
reward: -2.0890, last reward: -0.0042, gradient norm:  0.38:  81%|████████  | 505/625 [01:16<00:18,  6.63it/s]
reward: -1.7505, last reward: -0.0017, gradient norm:  0.2157:  81%|████████  | 505/625 [01:17<00:18,  6.63it/s]
reward: -1.7505, last reward: -0.0017, gradient norm:  0.2157:  81%|████████  | 506/625 [01:17<00:17,  6.64it/s]
reward: -1.8394, last reward: -0.0013, gradient norm:  0.3413:  81%|████████  | 506/625 [01:17<00:17,  6.64it/s]
reward: -1.8394, last reward: -0.0013, gradient norm:  0.3413:  81%|████████  | 507/625 [01:17<00:17,  6.64it/s]
reward: -1.9609, last reward: -0.0041, gradient norm:  0.6905:  81%|████████  | 507/625 [01:17<00:17,  6.64it/s]
reward: -1.9609, last reward: -0.0041, gradient norm:  0.6905:  81%|████████▏ | 508/625 [01:17<00:17,  6.65it/s]
reward: -1.8467, last reward: -0.0011, gradient norm:  0.4409:  81%|████████▏ | 508/625 [01:17<00:17,  6.65it/s]
reward: -1.8467, last reward: -0.0011, gradient norm:  0.4409:  81%|████████▏ | 509/625 [01:17<00:17,  6.65it/s]
reward: -2.0252, last reward: -0.0021, gradient norm:  0.213:  81%|████████▏ | 509/625 [01:17<00:17,  6.65it/s]
reward: -2.0252, last reward: -0.0021, gradient norm:  0.213:  82%|████████▏ | 510/625 [01:17<00:17,  6.65it/s]
reward: -1.8128, last reward: -0.0073, gradient norm:  0.3559:  82%|████████▏ | 510/625 [01:17<00:17,  6.65it/s]
reward: -1.8128, last reward: -0.0073, gradient norm:  0.3559:  82%|████████▏ | 511/625 [01:17<00:17,  6.60it/s]
reward: -2.1479, last reward: -0.0264, gradient norm:  3.68:  82%|████████▏ | 511/625 [01:18<00:17,  6.60it/s]
reward: -2.1479, last reward: -0.0264, gradient norm:  3.68:  82%|████████▏ | 512/625 [01:18<00:17,  6.61it/s]
reward: -2.1589, last reward: -0.0025, gradient norm:  5.566:  82%|████████▏ | 512/625 [01:18<00:17,  6.61it/s]
reward: -2.1589, last reward: -0.0025, gradient norm:  5.566:  82%|████████▏ | 513/625 [01:18<00:16,  6.59it/s]
reward: -2.2756, last reward: -0.0046, gradient norm:  0.5266:  82%|████████▏ | 513/625 [01:18<00:16,  6.59it/s]
reward: -2.2756, last reward: -0.0046, gradient norm:  0.5266:  82%|████████▏ | 514/625 [01:18<00:16,  6.56it/s]
reward: -1.9873, last reward: -0.0112, gradient norm:  0.9314:  82%|████████▏ | 514/625 [01:18<00:16,  6.56it/s]
reward: -1.9873, last reward: -0.0112, gradient norm:  0.9314:  82%|████████▏ | 515/625 [01:18<00:16,  6.56it/s]
reward: -2.3791, last reward: -0.0721, gradient norm:  1.14:  82%|████████▏ | 515/625 [01:18<00:16,  6.56it/s]
reward: -2.3791, last reward: -0.0721, gradient norm:  1.14:  83%|████████▎ | 516/625 [01:18<00:16,  6.53it/s]
reward: -2.4580, last reward: -0.0758, gradient norm:  0.6114:  83%|████████▎ | 516/625 [01:18<00:16,  6.53it/s]
reward: -2.4580, last reward: -0.0758, gradient norm:  0.6114:  83%|████████▎ | 517/625 [01:18<00:16,  6.53it/s]
reward: -1.9748, last reward: -0.0001, gradient norm:  0.2431:  83%|████████▎ | 517/625 [01:18<00:16,  6.53it/s]
reward: -1.9748, last reward: -0.0001, gradient norm:  0.2431:  83%|████████▎ | 518/625 [01:18<00:16,  6.52it/s]
reward: -2.1958, last reward: -0.0044, gradient norm:  0.5553:  83%|████████▎ | 518/625 [01:19<00:16,  6.52it/s]
reward: -2.1958, last reward: -0.0044, gradient norm:  0.5553:  83%|████████▎ | 519/625 [01:19<00:16,  6.51it/s]
reward: -1.8924, last reward: -0.0097, gradient norm:  17.34:  83%|████████▎ | 519/625 [01:19<00:16,  6.51it/s]
reward: -1.8924, last reward: -0.0097, gradient norm:  17.34:  83%|████████▎ | 520/625 [01:19<00:16,  6.52it/s]
reward: -2.3737, last reward: -0.0234, gradient norm:  1.899:  83%|████████▎ | 520/625 [01:19<00:16,  6.52it/s]
reward: -2.3737, last reward: -0.0234, gradient norm:  1.899:  83%|████████▎ | 521/625 [01:19<00:15,  6.52it/s]
reward: -1.9125, last reward: -0.0063, gradient norm:  0.4623:  83%|████████▎ | 521/625 [01:19<00:15,  6.52it/s]
reward: -1.9125, last reward: -0.0063, gradient norm:  0.4623:  84%|████████▎ | 522/625 [01:19<00:15,  6.52it/s]
reward: -2.3230, last reward: -0.0589, gradient norm:  0.3784:  84%|████████▎ | 522/625 [01:19<00:15,  6.52it/s]
reward: -2.3230, last reward: -0.0589, gradient norm:  0.3784:  84%|████████▎ | 523/625 [01:19<00:15,  6.55it/s]
reward: -1.9482, last reward: -0.0051, gradient norm:  1.105:  84%|████████▎ | 523/625 [01:19<00:15,  6.55it/s]
reward: -1.9482, last reward: -0.0051, gradient norm:  1.105:  84%|████████▍ | 524/625 [01:19<00:15,  6.54it/s]
reward: -2.1979, last reward: -0.0045, gradient norm:  0.6401:  84%|████████▍ | 524/625 [01:20<00:15,  6.54it/s]
reward: -2.1979, last reward: -0.0045, gradient norm:  0.6401:  84%|████████▍ | 525/625 [01:20<00:15,  6.53it/s]
reward: -2.1588, last reward: -0.0048, gradient norm:  0.6255:  84%|████████▍ | 525/625 [01:20<00:15,  6.53it/s]
reward: -2.1588, last reward: -0.0048, gradient norm:  0.6255:  84%|████████▍ | 526/625 [01:20<00:15,  6.53it/s]
reward: -1.6084, last reward: -0.0010, gradient norm:  0.3477:  84%|████████▍ | 526/625 [01:20<00:15,  6.53it/s]
reward: -1.6084, last reward: -0.0010, gradient norm:  0.3477:  84%|████████▍ | 527/625 [01:20<00:15,  6.52it/s]
reward: -2.1475, last reward: -0.0209, gradient norm:  0.3456:  84%|████████▍ | 527/625 [01:20<00:15,  6.52it/s]
reward: -2.1475, last reward: -0.0209, gradient norm:  0.3456:  84%|████████▍ | 528/625 [01:20<00:14,  6.52it/s]
reward: -1.7611, last reward: -0.1040, gradient norm:  18.52:  84%|████████▍ | 528/625 [01:20<00:14,  6.52it/s]
reward: -1.7611, last reward: -0.1040, gradient norm:  18.52:  85%|████████▍ | 529/625 [01:20<00:14,  6.51it/s]
reward: -2.0099, last reward: -0.0173, gradient norm:  1.643:  85%|████████▍ | 529/625 [01:20<00:14,  6.51it/s]
reward: -2.0099, last reward: -0.0173, gradient norm:  1.643:  85%|████████▍ | 530/625 [01:20<00:14,  6.51it/s]
reward: -2.8189, last reward: -1.4358, gradient norm:  46.61:  85%|████████▍ | 530/625 [01:20<00:14,  6.51it/s]
reward: -2.8189, last reward: -1.4358, gradient norm:  46.61:  85%|████████▍ | 531/625 [01:20<00:14,  6.51it/s]
reward: -2.9897, last reward: -2.4869, gradient norm:  51.23:  85%|████████▍ | 531/625 [01:21<00:14,  6.51it/s]
reward: -2.9897, last reward: -2.4869, gradient norm:  51.23:  85%|████████▌ | 532/625 [01:21<00:14,  6.52it/s]
reward: -2.1548, last reward: -0.9751, gradient norm:  72.21:  85%|████████▌ | 532/625 [01:21<00:14,  6.52it/s]
reward: -2.1548, last reward: -0.9751, gradient norm:  72.21:  85%|████████▌ | 533/625 [01:21<00:14,  6.55it/s]
reward: -1.6362, last reward: -0.0022, gradient norm:  0.7495:  85%|████████▌ | 533/625 [01:21<00:14,  6.55it/s]
reward: -1.6362, last reward: -0.0022, gradient norm:  0.7495:  85%|████████▌ | 534/625 [01:21<00:13,  6.53it/s]
reward: -2.1749, last reward: -0.0105, gradient norm:  0.9513:  85%|████████▌ | 534/625 [01:21<00:13,  6.53it/s]
reward: -2.1749, last reward: -0.0105, gradient norm:  0.9513:  86%|████████▌ | 535/625 [01:21<00:13,  6.51it/s]
reward: -1.7708, last reward: -0.0371, gradient norm:  1.432:  86%|████████▌ | 535/625 [01:21<00:13,  6.51it/s]
reward: -1.7708, last reward: -0.0371, gradient norm:  1.432:  86%|████████▌ | 536/625 [01:21<00:13,  6.51it/s]
reward: -2.2649, last reward: -0.0437, gradient norm:  2.327:  86%|████████▌ | 536/625 [01:21<00:13,  6.51it/s]
reward: -2.2649, last reward: -0.0437, gradient norm:  2.327:  86%|████████▌ | 537/625 [01:21<00:13,  6.49it/s]
reward: -2.5491, last reward: -0.0276, gradient norm:  1.246:  86%|████████▌ | 537/625 [01:22<00:13,  6.49it/s]
reward: -2.5491, last reward: -0.0276, gradient norm:  1.246:  86%|████████▌ | 538/625 [01:22<00:13,  6.49it/s]
reward: -2.6426, last reward: -0.7294, gradient norm:  1.078e+03:  86%|████████▌ | 538/625 [01:22<00:13,  6.49it/s]
reward: -2.6426, last reward: -0.7294, gradient norm:  1.078e+03:  86%|████████▌ | 539/625 [01:22<00:13,  6.49it/s]
reward: -1.9928, last reward: -0.0003, gradient norm:  1.576:  86%|████████▌ | 539/625 [01:22<00:13,  6.49it/s]
reward: -1.9928, last reward: -0.0003, gradient norm:  1.576:  86%|████████▋ | 540/625 [01:22<00:13,  6.48it/s]
reward: -1.7937, last reward: -0.0124, gradient norm:  0.9664:  86%|████████▋ | 540/625 [01:22<00:13,  6.48it/s]
reward: -1.7937, last reward: -0.0124, gradient norm:  0.9664:  87%|████████▋ | 541/625 [01:22<00:12,  6.49it/s]
reward: -2.3342, last reward: -0.0204, gradient norm:  1.81:  87%|████████▋ | 541/625 [01:22<00:12,  6.49it/s]
reward: -2.3342, last reward: -0.0204, gradient norm:  1.81:  87%|████████▋ | 542/625 [01:22<00:12,  6.48it/s]
reward: -2.2046, last reward: -0.0122, gradient norm:  1.004:  87%|████████▋ | 542/625 [01:22<00:12,  6.48it/s]
reward: -2.2046, last reward: -0.0122, gradient norm:  1.004:  87%|████████▋ | 543/625 [01:22<00:12,  6.49it/s]
reward: -2.0000, last reward: -0.0014, gradient norm:  0.5496:  87%|████████▋ | 543/625 [01:22<00:12,  6.49it/s]
reward: -2.0000, last reward: -0.0014, gradient norm:  0.5496:  87%|████████▋ | 544/625 [01:22<00:12,  6.50it/s]
reward: -2.0956, last reward: -0.0059, gradient norm:  1.425:  87%|████████▋ | 544/625 [01:23<00:12,  6.50it/s]
reward: -2.0956, last reward: -0.0059, gradient norm:  1.425:  87%|████████▋ | 545/625 [01:23<00:12,  6.50it/s]
reward: -2.9028, last reward: -0.5843, gradient norm:  21.12:  87%|████████▋ | 545/625 [01:23<00:12,  6.50it/s]
reward: -2.9028, last reward: -0.5843, gradient norm:  21.12:  87%|████████▋ | 546/625 [01:23<00:12,  6.48it/s]
reward: -2.0674, last reward: -0.0178, gradient norm:  0.797:  87%|████████▋ | 546/625 [01:23<00:12,  6.48it/s]
reward: -2.0674, last reward: -0.0178, gradient norm:  0.797:  88%|████████▊ | 547/625 [01:23<00:12,  6.50it/s]
reward: -2.2815, last reward: -0.0599, gradient norm:  1.227:  88%|████████▊ | 547/625 [01:23<00:12,  6.50it/s]
reward: -2.2815, last reward: -0.0599, gradient norm:  1.227:  88%|████████▊ | 548/625 [01:23<00:11,  6.50it/s]
reward: -3.1587, last reward: -0.9276, gradient norm:  20.56:  88%|████████▊ | 548/625 [01:23<00:11,  6.50it/s]
reward: -3.1587, last reward: -0.9276, gradient norm:  20.56:  88%|████████▊ | 549/625 [01:23<00:11,  6.50it/s]
reward: -3.8228, last reward: -2.9229, gradient norm:  308.2:  88%|████████▊ | 549/625 [01:23<00:11,  6.50it/s]
reward: -3.8228, last reward: -2.9229, gradient norm:  308.2:  88%|████████▊ | 550/625 [01:23<00:11,  6.49it/s]
reward: -1.6164, last reward: -0.0120, gradient norm:  2.259:  88%|████████▊ | 550/625 [01:24<00:11,  6.49it/s]
reward: -1.6164, last reward: -0.0120, gradient norm:  2.259:  88%|████████▊ | 551/625 [01:24<00:11,  6.49it/s]
reward: -1.6850, last reward: -0.0227, gradient norm:  0.9167:  88%|████████▊ | 551/625 [01:24<00:11,  6.49it/s]
reward: -1.6850, last reward: -0.0227, gradient norm:  0.9167:  88%|████████▊ | 552/625 [01:24<00:11,  6.50it/s]
reward: -2.3092, last reward: -0.0670, gradient norm:  0.9177:  88%|████████▊ | 552/625 [01:24<00:11,  6.50it/s]
reward: -2.3092, last reward: -0.0670, gradient norm:  0.9177:  88%|████████▊ | 553/625 [01:24<00:11,  6.49it/s]
reward: -2.1599, last reward: -0.0043, gradient norm:  1.195:  88%|████████▊ | 553/625 [01:24<00:11,  6.49it/s]
reward: -2.1599, last reward: -0.0043, gradient norm:  1.195:  89%|████████▊ | 554/625 [01:24<00:10,  6.50it/s]
reward: -2.4672, last reward: -0.0057, gradient norm:  0.6367:  89%|████████▊ | 554/625 [01:24<00:10,  6.50it/s]
reward: -2.4672, last reward: -0.0057, gradient norm:  0.6367:  89%|████████▉ | 555/625 [01:24<00:10,  6.49it/s]
reward: -2.3657, last reward: -0.1970, gradient norm:  4.202:  89%|████████▉ | 555/625 [01:24<00:10,  6.49it/s]
reward: -2.3657, last reward: -0.1970, gradient norm:  4.202:  89%|████████▉ | 556/625 [01:24<00:10,  6.49it/s]
reward: -2.6694, last reward: -0.1215, gradient norm:  1.324:  89%|████████▉ | 556/625 [01:24<00:10,  6.49it/s]
reward: -2.6694, last reward: -0.1215, gradient norm:  1.324:  89%|████████▉ | 557/625 [01:24<00:10,  6.50it/s]
reward: -2.2622, last reward: -0.0372, gradient norm:  0.4841:  89%|████████▉ | 557/625 [01:25<00:10,  6.50it/s]
reward: -2.2622, last reward: -0.0372, gradient norm:  0.4841:  89%|████████▉ | 558/625 [01:25<00:10,  6.50it/s]
reward: -2.2707, last reward: -0.0058, gradient norm:  5.757:  89%|████████▉ | 558/625 [01:25<00:10,  6.50it/s]
reward: -2.2707, last reward: -0.0058, gradient norm:  5.757:  89%|████████▉ | 559/625 [01:25<00:10,  6.49it/s]
reward: -2.2267, last reward: -0.0014, gradient norm:  0.5415:  89%|████████▉ | 559/625 [01:25<00:10,  6.49it/s]
reward: -2.2267, last reward: -0.0014, gradient norm:  0.5415:  90%|████████▉ | 560/625 [01:25<00:10,  6.49it/s]
reward: -2.4556, last reward: -0.0163, gradient norm:  1.146:  90%|████████▉ | 560/625 [01:25<00:10,  6.49it/s]
reward: -2.4556, last reward: -0.0163, gradient norm:  1.146:  90%|████████▉ | 561/625 [01:25<00:09,  6.47it/s]
reward: -2.1839, last reward: -0.0809, gradient norm:  0.6262:  90%|████████▉ | 561/625 [01:25<00:09,  6.47it/s]
reward: -2.1839, last reward: -0.0809, gradient norm:  0.6262:  90%|████████▉ | 562/625 [01:25<00:09,  6.49it/s]
reward: -2.0278, last reward: -0.0018, gradient norm:  1.327:  90%|████████▉ | 562/625 [01:25<00:09,  6.49it/s]
reward: -2.0278, last reward: -0.0018, gradient norm:  1.327:  90%|█████████ | 563/625 [01:25<00:09,  6.50it/s]
reward: -2.1112, last reward: -0.0011, gradient norm:  0.354:  90%|█████████ | 563/625 [01:26<00:09,  6.50it/s]
reward: -2.1112, last reward: -0.0011, gradient norm:  0.354:  90%|█████████ | 564/625 [01:26<00:09,  6.52it/s]
reward: -2.6155, last reward: -0.0004, gradient norm:  2.008:  90%|█████████ | 564/625 [01:26<00:09,  6.52it/s]
reward: -2.6155, last reward: -0.0004, gradient norm:  2.008:  90%|█████████ | 565/625 [01:26<00:09,  6.54it/s]
reward: -3.1427, last reward: -0.3582, gradient norm:  7.624:  90%|█████████ | 565/625 [01:26<00:09,  6.54it/s]
reward: -3.1427, last reward: -0.3582, gradient norm:  7.624:  91%|█████████ | 566/625 [01:26<00:08,  6.57it/s]
reward: -2.7870, last reward: -0.9490, gradient norm:  18.26:  91%|█████████ | 566/625 [01:26<00:08,  6.57it/s]
reward: -2.7870, last reward: -0.9490, gradient norm:  18.26:  91%|█████████ | 567/625 [01:26<00:08,  6.59it/s]
reward: -3.0439, last reward: -0.8796, gradient norm:  29.89:  91%|█████████ | 567/625 [01:26<00:08,  6.59it/s]
reward: -3.0439, last reward: -0.8796, gradient norm:  29.89:  91%|█████████ | 568/625 [01:26<00:08,  6.59it/s]
reward: -2.8026, last reward: -0.2720, gradient norm:  8.612:  91%|█████████ | 568/625 [01:26<00:08,  6.59it/s]
reward: -2.8026, last reward: -0.2720, gradient norm:  8.612:  91%|█████████ | 569/625 [01:26<00:08,  6.60it/s]
reward: -2.3147, last reward: -0.8486, gradient norm:  41.13:  91%|█████████ | 569/625 [01:27<00:08,  6.60it/s]
reward: -2.3147, last reward: -0.8486, gradient norm:  41.13:  91%|█████████ | 570/625 [01:27<00:11,  4.78it/s]
reward: -1.7917, last reward: -0.0129, gradient norm:  2.365:  91%|█████████ | 570/625 [01:27<00:11,  4.78it/s]
reward: -1.7917, last reward: -0.0129, gradient norm:  2.365:  91%|█████████▏| 571/625 [01:27<00:10,  5.22it/s]
reward: -1.9553, last reward: -0.0020, gradient norm:  0.6871:  91%|█████████▏| 571/625 [01:27<00:10,  5.22it/s]
reward: -1.9553, last reward: -0.0020, gradient norm:  0.6871:  92%|█████████▏| 572/625 [01:27<00:09,  5.57it/s]
reward: -2.3132, last reward: -0.0159, gradient norm:  0.8646:  92%|█████████▏| 572/625 [01:27<00:09,  5.57it/s]
reward: -2.3132, last reward: -0.0159, gradient norm:  0.8646:  92%|█████████▏| 573/625 [01:27<00:08,  5.85it/s]
reward: -1.5320, last reward: -0.0269, gradient norm:  1.02:  92%|█████████▏| 573/625 [01:27<00:08,  5.85it/s]
reward: -1.5320, last reward: -0.0269, gradient norm:  1.02:  92%|█████████▏| 574/625 [01:27<00:08,  6.05it/s]
reward: -2.2955, last reward: -0.0245, gradient norm:  1.267:  92%|█████████▏| 574/625 [01:27<00:08,  6.05it/s]
reward: -2.2955, last reward: -0.0245, gradient norm:  1.267:  92%|█████████▏| 575/625 [01:27<00:08,  6.21it/s]
reward: -2.3347, last reward: -0.0179, gradient norm:  1.528:  92%|█████████▏| 575/625 [01:28<00:08,  6.21it/s]
reward: -2.3347, last reward: -0.0179, gradient norm:  1.528:  92%|█████████▏| 576/625 [01:28<00:07,  6.33it/s]
reward: -1.9718, last reward: -0.1629, gradient norm:  8.804:  92%|█████████▏| 576/625 [01:28<00:07,  6.33it/s]
reward: -1.9718, last reward: -0.1629, gradient norm:  8.804:  92%|█████████▏| 577/625 [01:28<00:07,  6.42it/s]
reward: -2.4164, last reward: -0.0070, gradient norm:  0.4335:  92%|█████████▏| 577/625 [01:28<00:07,  6.42it/s]
reward: -2.4164, last reward: -0.0070, gradient norm:  0.4335:  92%|█████████▏| 578/625 [01:28<00:07,  6.47it/s]
reward: -2.2993, last reward: -0.0011, gradient norm:  1.371:  92%|█████████▏| 578/625 [01:28<00:07,  6.47it/s]
reward: -2.2993, last reward: -0.0011, gradient norm:  1.371:  93%|█████████▎| 579/625 [01:28<00:07,  6.52it/s]
reward: -3.3049, last reward: -0.9063, gradient norm:  34.23:  93%|█████████▎| 579/625 [01:28<00:07,  6.52it/s]
reward: -3.3049, last reward: -0.9063, gradient norm:  34.23:  93%|█████████▎| 580/625 [01:28<00:06,  6.56it/s]
reward: -2.8785, last reward: -0.3295, gradient norm:  10.91:  93%|█████████▎| 580/625 [01:28<00:06,  6.56it/s]
reward: -2.8785, last reward: -0.3295, gradient norm:  10.91:  93%|█████████▎| 581/625 [01:28<00:06,  6.58it/s]
reward: -2.5184, last reward: -0.0546, gradient norm:  21.09:  93%|█████████▎| 581/625 [01:28<00:06,  6.58it/s]
reward: -2.5184, last reward: -0.0546, gradient norm:  21.09:  93%|█████████▎| 582/625 [01:28<00:06,  6.59it/s]
reward: -2.4039, last reward: -0.4589, gradient norm:  10.86:  93%|█████████▎| 582/625 [01:29<00:06,  6.59it/s]
reward: -2.4039, last reward: -0.4589, gradient norm:  10.86:  93%|█████████▎| 583/625 [01:29<00:06,  6.61it/s]
reward: -2.4697, last reward: -0.2476, gradient norm:  4.689:  93%|█████████▎| 583/625 [01:29<00:06,  6.61it/s]
reward: -2.4697, last reward: -0.2476, gradient norm:  4.689:  93%|█████████▎| 584/625 [01:29<00:06,  6.60it/s]
reward: -2.0018, last reward: -0.2397, gradient norm:  8.393:  93%|█████████▎| 584/625 [01:29<00:06,  6.60it/s]
reward: -2.0018, last reward: -0.2397, gradient norm:  8.393:  94%|█████████▎| 585/625 [01:29<00:06,  6.61it/s]
reward: -2.4953, last reward: -0.1775, gradient norm:  24.17:  94%|█████████▎| 585/625 [01:29<00:06,  6.61it/s]
reward: -2.4953, last reward: -0.1775, gradient norm:  24.17:  94%|█████████▍| 586/625 [01:29<00:05,  6.61it/s]
reward: -2.2258, last reward: -0.0110, gradient norm:  0.7671:  94%|█████████▍| 586/625 [01:29<00:05,  6.61it/s]
reward: -2.2258, last reward: -0.0110, gradient norm:  0.7671:  94%|█████████▍| 587/625 [01:29<00:05,  6.62it/s]
reward: -2.3981, last reward: -0.0011, gradient norm:  1.617:  94%|█████████▍| 587/625 [01:29<00:05,  6.62it/s]
reward: -2.3981, last reward: -0.0011, gradient norm:  1.617:  94%|█████████▍| 588/625 [01:29<00:05,  6.63it/s]
reward: -1.8590, last reward: -0.0007, gradient norm:  1.131:  94%|█████████▍| 588/625 [01:29<00:05,  6.63it/s]
reward: -1.8590, last reward: -0.0007, gradient norm:  1.131:  94%|█████████▍| 589/625 [01:29<00:05,  6.64it/s]
reward: -1.9820, last reward: -0.4221, gradient norm:  49.4:  94%|█████████▍| 589/625 [01:30<00:05,  6.64it/s]
reward: -1.9820, last reward: -0.4221, gradient norm:  49.4:  94%|█████████▍| 590/625 [01:30<00:05,  6.64it/s]
reward: -2.1293, last reward: -0.0116, gradient norm:  0.868:  94%|█████████▍| 590/625 [01:30<00:05,  6.64it/s]
reward: -2.1293, last reward: -0.0116, gradient norm:  0.868:  95%|█████████▍| 591/625 [01:30<00:05,  6.63it/s]
reward: -2.1675, last reward: -0.0173, gradient norm:  0.5931:  95%|█████████▍| 591/625 [01:30<00:05,  6.63it/s]
reward: -2.1675, last reward: -0.0173, gradient norm:  0.5931:  95%|█████████▍| 592/625 [01:30<00:04,  6.64it/s]
reward: -2.2910, last reward: -0.0207, gradient norm:  0.5219:  95%|█████████▍| 592/625 [01:30<00:04,  6.64it/s]
reward: -2.2910, last reward: -0.0207, gradient norm:  0.5219:  95%|█████████▍| 593/625 [01:30<00:04,  6.63it/s]
reward: -2.2124, last reward: -0.1730, gradient norm:  5.737:  95%|█████████▍| 593/625 [01:30<00:04,  6.63it/s]
reward: -2.2124, last reward: -0.1730, gradient norm:  5.737:  95%|█████████▌| 594/625 [01:30<00:04,  6.63it/s]
reward: -2.2914, last reward: -0.0206, gradient norm:  0.485:  95%|█████████▌| 594/625 [01:30<00:04,  6.63it/s]
reward: -2.2914, last reward: -0.0206, gradient norm:  0.485:  95%|█████████▌| 595/625 [01:30<00:04,  6.63it/s]
reward: -2.0890, last reward: -0.0172, gradient norm:  0.3982:  95%|█████████▌| 595/625 [01:31<00:04,  6.63it/s]
reward: -2.0890, last reward: -0.0172, gradient norm:  0.3982:  95%|█████████▌| 596/625 [01:31<00:04,  6.63it/s]
reward: -2.0945, last reward: -0.0121, gradient norm:  0.4789:  95%|█████████▌| 596/625 [01:31<00:04,  6.63it/s]
reward: -2.0945, last reward: -0.0121, gradient norm:  0.4789:  96%|█████████▌| 597/625 [01:31<00:04,  6.63it/s]
reward: -2.3805, last reward: -0.0069, gradient norm:  0.4074:  96%|█████████▌| 597/625 [01:31<00:04,  6.63it/s]
reward: -2.3805, last reward: -0.0069, gradient norm:  0.4074:  96%|█████████▌| 598/625 [01:31<00:04,  6.62it/s]
reward: -2.3310, last reward: -0.0031, gradient norm:  0.5065:  96%|█████████▌| 598/625 [01:31<00:04,  6.62it/s]
reward: -2.3310, last reward: -0.0031, gradient norm:  0.5065:  96%|█████████▌| 599/625 [01:31<00:03,  6.62it/s]
reward: -2.6028, last reward: -0.0006, gradient norm:  0.6316:  96%|█████████▌| 599/625 [01:31<00:03,  6.62it/s]
reward: -2.6028, last reward: -0.0006, gradient norm:  0.6316:  96%|█████████▌| 600/625 [01:31<00:03,  6.62it/s]
reward: -2.6724, last reward: -0.0001, gradient norm:  0.6523:  96%|█████████▌| 600/625 [01:31<00:03,  6.62it/s]
reward: -2.6724, last reward: -0.0001, gradient norm:  0.6523:  96%|█████████▌| 601/625 [01:31<00:03,  6.63it/s]
reward: -2.2481, last reward: -0.0136, gradient norm:  0.4298:  96%|█████████▌| 601/625 [01:31<00:03,  6.63it/s]
reward: -2.2481, last reward: -0.0136, gradient norm:  0.4298:  96%|█████████▋| 602/625 [01:31<00:03,  6.63it/s]
reward: -2.3524, last reward: -0.0043, gradient norm:  0.2629:  96%|█████████▋| 602/625 [01:32<00:03,  6.63it/s]
reward: -2.3524, last reward: -0.0043, gradient norm:  0.2629:  96%|█████████▋| 603/625 [01:32<00:03,  6.62it/s]
reward: -2.2635, last reward: -0.0069, gradient norm:  0.7839:  96%|█████████▋| 603/625 [01:32<00:03,  6.62it/s]
reward: -2.2635, last reward: -0.0069, gradient norm:  0.7839:  97%|█████████▋| 604/625 [01:32<00:03,  6.63it/s]
reward: -2.6041, last reward: -0.8027, gradient norm:  11.7:  97%|█████████▋| 604/625 [01:32<00:03,  6.63it/s]
reward: -2.6041, last reward: -0.8027, gradient norm:  11.7:  97%|█████████▋| 605/625 [01:32<00:03,  6.63it/s]
reward: -4.4170, last reward: -3.4675, gradient norm:  60.04:  97%|█████████▋| 605/625 [01:32<00:03,  6.63it/s]
reward: -4.4170, last reward: -3.4675, gradient norm:  60.04:  97%|█████████▋| 606/625 [01:32<00:02,  6.63it/s]
reward: -4.3153, last reward: -2.9316, gradient norm:  53.11:  97%|█████████▋| 606/625 [01:32<00:02,  6.63it/s]
reward: -4.3153, last reward: -2.9316, gradient norm:  53.11:  97%|█████████▋| 607/625 [01:32<00:02,  6.63it/s]
reward: -3.0649, last reward: -0.9722, gradient norm:  30.84:  97%|█████████▋| 607/625 [01:32<00:02,  6.63it/s]
reward: -3.0649, last reward: -0.9722, gradient norm:  30.84:  97%|█████████▋| 608/625 [01:32<00:02,  6.63it/s]
reward: -2.7989, last reward: -0.0329, gradient norm:  1.261:  97%|█████████▋| 608/625 [01:33<00:02,  6.63it/s]
reward: -2.7989, last reward: -0.0329, gradient norm:  1.261:  97%|█████████▋| 609/625 [01:33<00:02,  6.64it/s]
reward: -2.1976, last reward: -0.6852, gradient norm:  20.33:  97%|█████████▋| 609/625 [01:33<00:02,  6.64it/s]
reward: -2.1976, last reward: -0.6852, gradient norm:  20.33:  98%|█████████▊| 610/625 [01:33<00:02,  6.64it/s]
reward: -2.4793, last reward: -0.1255, gradient norm:  14.69:  98%|█████████▊| 610/625 [01:33<00:02,  6.64it/s]
reward: -2.4793, last reward: -0.1255, gradient norm:  14.69:  98%|█████████▊| 611/625 [01:33<00:02,  6.64it/s]
reward: -2.4581, last reward: -0.0394, gradient norm:  2.429:  98%|█████████▊| 611/625 [01:33<00:02,  6.64it/s]
reward: -2.4581, last reward: -0.0394, gradient norm:  2.429:  98%|█████████▊| 612/625 [01:33<00:01,  6.64it/s]
reward: -2.2047, last reward: -0.0326, gradient norm:  1.147:  98%|█████████▊| 612/625 [01:33<00:01,  6.64it/s]
reward: -2.2047, last reward: -0.0326, gradient norm:  1.147:  98%|█████████▊| 613/625 [01:33<00:01,  6.63it/s]
reward: -1.8967, last reward: -0.0129, gradient norm:  0.8619:  98%|█████████▊| 613/625 [01:33<00:01,  6.63it/s]
reward: -1.8967, last reward: -0.0129, gradient norm:  0.8619:  98%|█████████▊| 614/625 [01:33<00:01,  6.63it/s]
reward: -2.5906, last reward: -0.0015, gradient norm:  0.6491:  98%|█████████▊| 614/625 [01:33<00:01,  6.63it/s]
reward: -2.5906, last reward: -0.0015, gradient norm:  0.6491:  98%|█████████▊| 615/625 [01:33<00:01,  6.63it/s]
reward: -1.6634, last reward: -0.0007, gradient norm:  0.4394:  98%|█████████▊| 615/625 [01:34<00:01,  6.63it/s]
reward: -1.6634, last reward: -0.0007, gradient norm:  0.4394:  99%|█████████▊| 616/625 [01:34<00:01,  6.64it/s]
reward: -2.0624, last reward: -0.0061, gradient norm:  0.5676:  99%|█████████▊| 616/625 [01:34<00:01,  6.64it/s]
reward: -2.0624, last reward: -0.0061, gradient norm:  0.5676:  99%|█████████▊| 617/625 [01:34<00:01,  6.64it/s]
reward: -2.3259, last reward: -0.0131, gradient norm:  0.7733:  99%|█████████▊| 617/625 [01:34<00:01,  6.64it/s]
reward: -2.3259, last reward: -0.0131, gradient norm:  0.7733:  99%|█████████▉| 618/625 [01:34<00:01,  6.65it/s]
reward: -1.7515, last reward: -0.0189, gradient norm:  0.5575:  99%|█████████▉| 618/625 [01:34<00:01,  6.65it/s]
reward: -1.7515, last reward: -0.0189, gradient norm:  0.5575:  99%|█████████▉| 619/625 [01:34<00:00,  6.64it/s]
reward: -1.9313, last reward: -0.0207, gradient norm:  0.6286:  99%|█████████▉| 619/625 [01:34<00:00,  6.64it/s]
reward: -1.9313, last reward: -0.0207, gradient norm:  0.6286:  99%|█████████▉| 620/625 [01:34<00:00,  6.64it/s]
reward: -2.4325, last reward: -0.0171, gradient norm:  0.7832:  99%|█████████▉| 620/625 [01:34<00:00,  6.64it/s]
reward: -2.4325, last reward: -0.0171, gradient norm:  0.7832:  99%|█████████▉| 621/625 [01:34<00:00,  6.63it/s]
reward: -2.1134, last reward: -0.0144, gradient norm:  1.96:  99%|█████████▉| 621/625 [01:34<00:00,  6.63it/s]
reward: -2.1134, last reward: -0.0144, gradient norm:  1.96: 100%|█████████▉| 622/625 [01:34<00:00,  6.63it/s]
reward: -2.4572, last reward: -0.0500, gradient norm:  0.5838: 100%|█████████▉| 622/625 [01:35<00:00,  6.63it/s]
reward: -2.4572, last reward: -0.0500, gradient norm:  0.5838: 100%|█████████▉| 623/625 [01:35<00:00,  6.63it/s]
reward: -2.3818, last reward: -0.0019, gradient norm:  0.8623: 100%|█████████▉| 623/625 [01:35<00:00,  6.63it/s]
reward: -2.3818, last reward: -0.0019, gradient norm:  0.8623: 100%|█████████▉| 624/625 [01:35<00:00,  6.62it/s]
reward: -2.1253, last reward: -0.0001, gradient norm:  0.6622: 100%|█████████▉| 624/625 [01:35<00:00,  6.62it/s]
reward: -2.1253, last reward: -0.0001, gradient norm:  0.6622: 100%|██████████| 625/625 [01:35<00:00,  6.63it/s]
reward: -2.1253, last reward: -0.0001, gradient norm:  0.6622: 100%|██████████| 625/625 [01:35<00:00,  6.55it/s]

Conclusion

In this tutorial, we have learned how to code a stateless environment from scratch. We touched the subjects of:

  • The four essential components that need to be taken care of when coding an environment (step, reset, seeding and building specs). We saw how these methods and classes interact with the TensorDict class;

  • How to test that an environment is properly coded using check_env_specs();

  • How to append transforms in the context of stateless environments and how to write custom transformations;

  • How to train a policy on a fully differentiable simulator.

Total running time of the script: (2 minutes 51.454 seconds)

Estimated memory usage: 320 MB

Gallery generated by Sphinx-Gallery

Docs

Access comprehensive developer documentation for PyTorch

View Docs

Tutorials

Get in-depth tutorials for beginners and advanced developers

View Tutorials

Resources

Find development resources and get your questions answered

View Resources