.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/coding_dqn.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_tutorials_coding_dqn.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tutorials_coding_dqn.py:


TorchRL trainer: A DQN example
==============================
**Author**: `Vincent Moens <https://github.com/vmoens>`_

.. _coding_dqn:

.. GENERATED FROM PYTHON SOURCE LINES 12-86

TorchRL provides a generic :class:`~torchrl.trainers.Trainer` class to handle
your training loop. The trainer executes a nested loop where the outer loop
is the data collection and the inner loop consumes this data or some data
retrieved from the replay buffer to train the model.
At various points in this training loop, hooks can be attached and executed at
given intervals.
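
The nested loop described above can be sketched in plain Python. This is only an illustration of the control flow, not TorchRL's implementation: ``ToyTrainer`` and its attributes are hypothetical, and the hook location names are simply borrowed from the ones used later in this tutorial.

```python
from collections import defaultdict


class ToyTrainer:
    """Hypothetical nested training loop with named hook locations."""

    def __init__(self, batches, optim_steps_per_batch):
        self.batches = batches
        self.optim_steps_per_batch = optim_steps_per_batch
        self.hooks = defaultdict(list)

    def register_op(self, location, fn):
        # Attach a callable to a named location of the loop.
        self.hooks[location].append(fn)

    def _call(self, location, *args):
        for fn in self.hooks[location]:
            fn(*args)

    def train(self):
        for batch in self.batches:  # outer loop: data collection
            self._call("batch_process", batch)
            for _ in range(self.optim_steps_per_batch):  # inner loop: optimization
                self._call("post_optim")
            self._call("post_steps")


toy_calls = []
toy_trainer = ToyTrainer(batches=[[1, 2], [3, 4]], optim_steps_per_batch=3)
toy_trainer.register_op("batch_process", lambda batch: toy_calls.append(("batch", len(batch))))
toy_trainer.register_op("post_optim", lambda: toy_calls.append("optim"))
toy_trainer.register_op("post_steps", lambda: toy_calls.append("steps"))
toy_trainer.train()
```

With two batches and three optimization steps per batch, the ``post_optim`` hooks run six times and the ``post_steps`` hooks run twice.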

In this tutorial, we will be using the trainer class to train a DQN algorithm
to solve the CartPole task from scratch.

Main takeaways:

- Building a trainer with its essential components: data collector, loss
  module, replay buffer and optimizer.
- Adding hooks to a trainer, such as loggers, target network updaters and such.

The trainer is fully customisable and offers a large set of functionalities.
The tutorial is organised around its construction.
We will be detailing how to build each of the components of the library first,
and then put the pieces together using the :class:`~torchrl.trainers.Trainer`
class.

Along the way, we will also focus on some other aspects of the library:

- how to build an environment in TorchRL, including transforms (e.g. data
  normalization, frame concatenation, resizing and turning to grayscale)
  and parallel execution. Unlike what we did in the
  `DDPG tutorial <https://pytorch.org/rl/tutorials/coding_ddpg.html>`_, we
  will normalize the pixels and not the state vector.
- how to design a :class:`~torchrl.modules.QValueActor` object, i.e. an actor
  that estimates the action values and picks up the action with the highest
  estimated return;
- how to collect data from your environment efficiently and store them
  in a replay buffer;
- how to use multi-step, a simple preprocessing step for off-policy algorithms;
- and finally how to evaluate your model.

**Prerequisites**: We encourage you to get familiar with torchrl through the
`PPO tutorial <https://pytorch.org/rl/tutorials/coding_ppo.html>`_ first.

DQN
---

DQN (`Deep Q-Learning <https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf>`_) was
the founding work in deep reinforcement learning.

At a high level, the algorithm is quite simple: Q-learning consists in
learning a table of state-action values in such a way that, when
encountering any particular state, we know which action to pick just by
searching for the one with the highest value. This simple setting
requires the actions and states to be
discrete, otherwise a lookup table cannot be built.
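
As a point of comparison with what follows, the tabular setting can be sketched in a few lines of plain Python. This is a generic Q-learning update written for illustration; the function names, the state labels and the learning rate are assumptions and are not part of TorchRL.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, done, actions, lr=0.5, gamma=0.99):
    """One tabular Q-learning step on a (state, action) -> value table."""
    # Value of the best action in the next state (zero if the episode ended).
    best_next = 0.0 if done else max(q_table[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q_table[(state, action)] += lr * (target - q_table[(state, action)])


def greedy_action(q_table, state, actions):
    """Look up the action with the highest stored value for this state."""
    return max(actions, key=lambda a: q_table[(state, a)])


q_table = defaultdict(float)  # unseen entries default to 0.0
actions = ["left", "right"]
q_learning_update(q_table, "s0", "right", reward=1.0, next_state="s1", done=False, actions=actions)
best = greedy_action(q_table, "s0", actions)
```

The table only holds entries for the discrete states and actions that were visited, which is exactly the limitation a function approximator removes.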

DQN uses a neural network that encodes a map from the state-action space to
a value (scalar) space, which amortizes the cost of storing and exploring all
the possible state-action combinations: if a state has not been seen in the
past, we can still pass it in conjunction with the various actions available
through our neural network and get an interpolated value for each of the
actions available.

We will solve the classic control problem of the cart pole. The Gymnasium
documentation, from which this environment is taken, describes it as follows:

| A pole is attached by an un-actuated joint to a cart, which moves along a
| frictionless track. The pendulum is placed upright on the cart and the goal
| is to balance the pole by applying forces in the left and right direction
| on the cart.

.. figure:: /_static/img/cartpole_demo.gif
   :alt: Cart Pole

We do not aim at giving a SOTA implementation of the algorithm, but rather
to provide a high-level illustration of TorchRL features in the context
of this algorithm.

.. GENERATED FROM PYTHON SOURCE LINES 86-137

.. code-block:: Python


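    import multiprocessing
    import os
    import tempfile
    import uuid
    import warnings

    import torch
    from tensordict.nn import TensorDictSequential
    from torch import nn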
    from torchrl.collectors import MultiaSyncDataCollector
    from torchrl.data import LazyMemmapStorage, MultiStep, TensorDictReplayBuffer
    from torchrl.envs import (
        EnvCreator,
        ExplorationType,
        ParallelEnv,
        RewardScaling,
        StepCounter,
    )
    from torchrl.envs.libs.gym import GymEnv
    from torchrl.envs.transforms import (
        CatFrames,
        Compose,
        GrayScale,
        ObservationNorm,
        Resize,
        ToTensorImage,
        TransformedEnv,
    )
    from torchrl.modules import DuelingCnnDQNet, EGreedyModule, QValueActor

    from torchrl.objectives import DQNLoss, SoftUpdate
    from torchrl.record.loggers.csv import CSVLogger
    from torchrl.trainers import (
        LogReward,
        Recorder,
        ReplayBufferTrainer,
        Trainer,
        UpdateWeights,
    )


    def is_notebook() -> bool:
        try:
            shell = get_ipython().__class__.__name__
            if shell == "ZMQInteractiveShell":
                return True  # Jupyter notebook or qtconsole
            elif shell == "TerminalInteractiveShell":
                return False  # Terminal running IPython
            else:
                return False  # Other type (?)
        except NameError:
            return False  # Probably standard Python interpreter


.. GENERATED FROM PYTHON SOURCE LINES 163-215

Let's get started with the various pieces we need for our algorithm:

- An environment;
- A policy (and related modules that we group under the "model" umbrella);
- A data collector, which makes the policy play in the environment and
  delivers training data;
- A replay buffer to store the training data;
- A loss module, which computes the objective function to train our policy
  to maximise the return;
- An optimizer, which performs parameter updates based on our loss.

Additional modules include a logger, a recorder (executes the policy in
"eval" mode) and a target network updater. With all these components in
place, it is easy to see how one could misplace or misuse one component in
the training script. The trainer is there to orchestrate everything for you!

Building the environment
------------------------

First let's write a helper function that will output an environment. As usual,
the "raw" environment may be too simple to be used in practice and we'll need
some data transformation to expose its output to the policy.

We will be using seven transforms:

- :class:`~torchrl.envs.StepCounter` to count the number of steps in each trajectory;
- :class:`~torchrl.envs.transforms.ToTensorImage` will convert a ``[W, H, C]`` uint8
  tensor into a floating point tensor in the ``[0, 1]`` space with shape
  ``[C, W, H]``;
- :class:`~torchrl.envs.transforms.RewardScaling` to reduce the scale of the return;
- :class:`~torchrl.envs.transforms.GrayScale` will turn our image into grayscale;
- :class:`~torchrl.envs.transforms.Resize` will resize the image to a 64x64 format;
- :class:`~torchrl.envs.transforms.CatFrames` will concatenate an arbitrary number of
  successive frames (``N=4``) in a single tensor along the channel dimension.
  This is useful as a single image does not carry information about the
  motion of the cartpole. Some memory about past observations and actions
  is needed, either via a recurrent neural network or using a stack of
  frames.
- :class:`~torchrl.envs.transforms.ObservationNorm` which will normalize our observations
  given some custom summary statistics.
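
To make the role of frame stacking concrete, here is a minimal plain-Python sketch of a buffer that keeps the last four frames. ``FrameStack`` is a hypothetical helper and not the :class:`~torchrl.envs.transforms.CatFrames` implementation; padding the first observation by repeating it is an assumption made for this sketch.

```python
from collections import deque


class FrameStack:
    """Hypothetical helper that keeps the ``n_frames`` most recent frames."""

    def __init__(self, n_frames):
        self.frames = deque(maxlen=n_frames)

    def reset(self, first_frame):
        # At reset there is no history yet, so repeat the first frame.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, frame):
        # The oldest frame is dropped automatically once the deque is full.
        self.frames.append(frame)
        return list(self.frames)


stack = FrameStack(n_frames=4)
at_reset = stack.reset("f0")
after_one = stack.step("f1")
after_two = stack.step("f2")
```

Two consecutive stacks differ by one frame, which is what lets a feed-forward network infer the direction and speed of the pole.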

In practice, our environment builder has two arguments:

- ``parallel``: determines whether multiple environments have to be run in
  parallel. We stack the transforms after the
  :class:`~torchrl.envs.ParallelEnv` to take advantage
  of vectorization of the operations on device, although this would
  technically work with every single environment attached to its own set of
  transforms.
- ``obs_norm_sd`` will contain the normalizing constants for
  the :class:`~torchrl.envs.ObservationNorm` transform.


.. GENERATED FROM PYTHON SOURCE LINES 215-258

.. code-block:: Python


    def make_env(
        parallel=False,
        obs_norm_sd=None,
    ):
        if obs_norm_sd is None:
            obs_norm_sd = {"standard_normal": True}
        if parallel:
            base_env = ParallelEnv(
                num_workers,
                EnvCreator(
                    lambda: GymEnv(
                        "CartPole-v1",
                        from_pixels=True,
                        pixels_only=True,
                        device=device,
                    )
                ),
            )
        else:
            base_env = GymEnv(
                "CartPole-v1",
                from_pixels=True,
                pixels_only=True,
                device=device,
            )

        env = TransformedEnv(
            base_env,
            Compose(
                StepCounter(),  # to count the steps of each trajectory
                ToTensorImage(),
                RewardScaling(loc=0.0, scale=0.1),
                GrayScale(),
                Resize(64, 64),
                CatFrames(4, in_keys=["pixels"], dim=-3),
                ObservationNorm(in_keys=["pixels"], **obs_norm_sd),
            ),
        )
        return env


.. GENERATED FROM PYTHON SOURCE LINES 259-270

Compute normalizing constants
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To normalize images, we don't want to normalize each pixel independently
with a full ``[C, W, H]`` normalizing mask, but with a simpler ``[C, 1, 1]``-shaped
set of normalizing constants (loc and scale parameters).
We will be using the ``reduce_dim`` argument
of :meth:`~torchrl.envs.ObservationNorm.init_stats` to instruct which
dimensions must be reduced, and the ``keep_dims`` parameter to ensure that
not all dimensions disappear in the process:


.. GENERATED FROM PYTHON SOURCE LINES 270-285

.. code-block:: Python


    def get_norm_stats():
        test_env = make_env()
        test_env.transform[-1].init_stats(
            num_iter=1000, cat_dim=0, reduce_dim=[-1, -2, -4], keep_dims=(-1, -2)
        )
        obs_norm_sd = test_env.transform[-1].state_dict()
        # let's check that normalizing constants have a size of ``[C, 1, 1]`` where
        # ``C=4`` (because of :class:`~torchrl.envs.CatFrames`).
        print("state dict of the observation norm:", obs_norm_sd)
        test_env.close()
        return obs_norm_sd
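
The effect of reducing over every dimension except the channel one can be pictured with nested lists standing in for a ``[N, C, W, H]`` batch. This is a plain-Python sketch, not what ``init_stats`` runs internally: ``channel_stats`` is a hypothetical name, and the use of the population standard deviation is an assumption that may differ from the library's estimator.

```python
import statistics


def channel_stats(frames):
    """Return one (loc, scale) pair per channel.

    ``frames[n][c]`` is a 2D list of pixel rows for sample ``n`` and channel ``c``.
    """
    n_channels = len(frames[0])
    locs, scales = [], []
    for c in range(n_channels):
        # Pool all pixels of channel ``c`` across samples, rows and columns.
        pixels = [p for sample in frames for row in sample[c] for p in row]
        locs.append(statistics.fmean(pixels))
        scales.append(statistics.pstdev(pixels))
    return locs, scales


# Two samples, two channels, one row of two pixels each.
toy_frames = [
    [[[0.0, 1.0]], [[2.0, 2.0]]],
    [[[1.0, 0.0]], [[4.0, 4.0]]],
]
toy_locs, toy_scales = channel_stats(toy_frames)
```

Each channel ends up with a single loc and a single scale, which corresponds to the ``[C, 1, 1]`` shape discussed above.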


.. GENERATED FROM PYTHON SOURCE LINES 286-308

Building the model (Deep Q-network)
-----------------------------------

The following function builds a :class:`~torchrl.modules.DuelingCnnDQNet`
object which is a simple CNN followed by a two-layer MLP. The only trick used
here is that the action values (i.e. left and right action value) are
computed using

.. math::

   \mathbb{v} = b(obs) + v(obs) - \mathbb{E}[v(obs)]

where :math:`\mathbb{v}` is our vector of action values,
:math:`b` is a :math:`\mathbb{R}^n \rightarrow \mathbb{R}` function and :math:`v` is a
:math:`\mathbb{R}^n \rightarrow \mathbb{R}^m` function, for
:math:`n = \# obs` and :math:`m = \# actions`.

Our network is wrapped in a :class:`~torchrl.modules.QValueActor`,
which will read the state-action
values, pick up the one with the maximum value and write all those results
in the input :class:`tensordict.TensorDict`.
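
The aggregation formula and the greedy selection can be checked by hand with a short plain-Python sketch. The function names are hypothetical, and returning the chosen action as a one-hot vector is an assumption made for this illustration.

```python
def dueling_q_values(baseline, per_action_values):
    """Combine b(obs) and v(obs) as b + v - mean(v)."""
    mean_value = sum(per_action_values) / len(per_action_values)
    return [baseline + v - mean_value for v in per_action_values]


def greedy_one_hot(q_values):
    """One-hot encoding of the action with the highest value."""
    best = max(range(len(q_values)), key=q_values.__getitem__)
    return [1 if i == best else 0 for i in range(len(q_values))]


# Two actions (left, right): b(obs) = 2.0 and v(obs) = [0.5, 1.5].
toy_q = dueling_q_values(2.0, [0.5, 1.5])
toy_action = greedy_one_hot(toy_q)
```

Here the mean of ``v`` is 1.0, so the action values are ``[1.5, 2.5]`` and the second action is selected.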


.. GENERATED FROM PYTHON SOURCE LINES 308-352

.. code-block:: Python


    def make_model(dummy_env):
        cnn_kwargs = {
            "num_cells": [32, 64, 64],
            "kernel_sizes": [6, 4, 3],
            "strides": [2, 2, 1],
            "activation_class": nn.ELU,
            # This can be used to reduce the size of the last layer of the CNN
            # "squeeze_output": True,
            # "aggregator_class": nn.AdaptiveAvgPool2d,
            # "aggregator_kwargs": {"output_size": (1, 1)},
        }
        mlp_kwargs = {
            "depth": 2,
            "num_cells": [
                64,
                64,
            ],
            "activation_class": nn.ELU,
        }
        net = DuelingCnnDQNet(
            dummy_env.action_spec.shape[-1], 1, cnn_kwargs, mlp_kwargs
        ).to(device)
        net.value[-1].bias.data.fill_(init_bias)

        actor = QValueActor(net, in_keys=["pixels"], spec=dummy_env.action_spec).to(device)
        # init actor: because the model is composed of lazy conv/linear layers,
        # we must pass a fake batch of data through it to instantiate them.
        tensordict = dummy_env.fake_tensordict()
        actor(tensordict)

        # we join our actor with an EGreedyModule for data collection
        exploration_module = EGreedyModule(
            spec=dummy_env.action_spec,
            annealing_num_steps=total_frames,
            eps_init=eps_greedy_val,
            eps_end=eps_greedy_val_env,
        )
        actor_explore = TensorDictSequential(actor, exploration_module)

        return actor, actor_explore


.. GENERATED FROM PYTHON SOURCE LINES 353-372

Collecting and storing data
---------------------------

Replay buffers
~~~~~~~~~~~~~~

Replay buffers play a central role in off-policy RL algorithms such as DQN.
They constitute the dataset we will be sampling from during training.

Here, we will use a regular sampling strategy, although a prioritized RB
could improve the performance significantly.

We place the storage on disk using
:class:`~torchrl.data.replay_buffers.storages.LazyMemmapStorage` class. This
storage is created in a lazy manner: it will only be instantiated once the
first batch of data is passed to it.

The only requirement of this storage is that the data passed to it at write
time must always have the same shape.

.. GENERATED FROM PYTHON SOURCE LINES 372-383

.. code-block:: Python


    def get_replay_buffer(buffer_size, n_optim, batch_size):
        replay_buffer = TensorDictReplayBuffer(
            batch_size=batch_size,
            storage=LazyMemmapStorage(buffer_size),
            prefetch=n_optim,
        )
        return replay_buffer
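
For intuition, the behaviour of a fixed-capacity buffer with uniform sampling can be sketched in plain Python. ``UniformReplayBuffer`` is a hypothetical in-memory stand-in: it does not memory-map data to disk and does not prefetch, unlike the buffer built above.

```python
import random
from collections import deque


class UniformReplayBuffer:
    """Hypothetical fixed-size buffer that samples uniformly at random."""

    def __init__(self, capacity, seed=None):
        self.storage = deque(maxlen=capacity)  # oldest items are evicted first
        self.rng = random.Random(seed)

    def extend(self, transitions):
        self.storage.extend(transitions)

    def sample(self, batch_size):
        # Sampling with replacement, so this also works before the buffer is full.
        return self.rng.choices(list(self.storage), k=batch_size)


toy_buffer = UniformReplayBuffer(capacity=3, seed=0)
toy_buffer.extend(range(5))  # only the three most recent items are kept
toy_batch = toy_buffer.sample(batch_size=4)
```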


.. GENERATED FROM PYTHON SOURCE LINES 384-419

Data collector
~~~~~~~~~~~~~~

As in `PPO <https://pytorch.org/rl/tutorials/coding_ppo.html>`_ and
`DDPG <https://pytorch.org/rl/tutorials/coding_ddpg.html>`_, we will be using
a data collector as a dataloader in the outer loop.

We choose the following configuration: each collector runs a batch of
environments synchronously in parallel, and the collectors themselves run
in parallel but asynchronously.

.. note::
  This feature is only available when running the code within the "spawn"
  start method of python multiprocessing library. If this tutorial is run
  directly as a script (thereby using the "fork" method) we will be using
  a regular :class:`~torchrl.collectors.SyncDataCollector`.

The advantage of this configuration is that we can balance the amount of
compute that is executed in batch with what we want to be executed
asynchronously. We encourage the reader to experiment with how the collection
speed is impacted by modifying the number of collectors (i.e. the number of
environment constructors passed to the collector) and the number of
environments executed in parallel in each collector (controlled by the
``num_workers`` hyperparameter).

The collector's devices are fully parametrizable through the ``device`` (general),
``policy_device``, ``env_device`` and ``storing_device`` arguments.
The ``storing_device`` argument will modify the
location of the data being collected: if the batches that we are gathering
have a considerable size, we may want to store them on a different location
than the device where the computation is happening. For asynchronous data
collectors such as ours, different storing devices mean that the data that
we collect won't sit on the same device each time, which is something that
our training loop must account for. For simplicity, we set the devices to
the same value for all sub-collectors.

.. GENERATED FROM PYTHON SOURCE LINES 419-448

.. code-block:: Python


    def get_collector(
        stats,
        num_collectors,
        actor_explore,
        frames_per_batch,
        total_frames,
        device,
    ):
        cls = MultiaSyncDataCollector
        env_arg = [make_env(parallel=True, obs_norm_sd=stats)] * num_collectors
        data_collector = cls(
            env_arg,
            policy=actor_explore,
            frames_per_batch=frames_per_batch,
            total_frames=total_frames,
            # this is the default behaviour: the collector runs in ``"random"`` (or explorative) mode
            exploration_type=ExplorationType.RANDOM,
            # We set all the devices to be identical here; heterogeneous
            # devices are also possible
            device=device,
            storing_device=device,
            split_trajs=False,
            postproc=MultiStep(gamma=gamma, n_steps=5),
        )
        return data_collector
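
The reward aggregation behind multi-step can be sketched in plain Python. ``n_step_returns`` is a hypothetical helper that only shows the discounted sum over a window of rewards within one trajectory; it ignores episode boundaries and the bookkeeping of the next observation that a full multi-step transform also has to handle.

```python
def n_step_returns(rewards, gamma, n_steps):
    """Discounted sum of up to ``n_steps`` rewards starting at each time step."""
    returns = []
    for t in range(len(rewards)):
        window = rewards[t : t + n_steps]  # shorter near the end of the trajectory
        returns.append(sum(gamma**k * r for k, r in enumerate(window)))
    return returns


toy_returns = n_step_returns([1.0, 1.0, 1.0], gamma=0.5, n_steps=2)
```

With ``gamma=0.5`` and a two-step window, the first two entries are ``1.0 + 0.5 * 1.0 = 1.5`` and the last one, which has no successor, is ``1.0``.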


.. GENERATED FROM PYTHON SOURCE LINES 449-466

Loss function
-------------

Building our loss function is straightforward: we only need to provide
the model and a bunch of hyperparameters to the DQNLoss class.

Target parameters
~~~~~~~~~~~~~~~~~

Many off-policy RL algorithms use the concept of "target parameters" when it
comes to estimating the value of the next state or state-action pair.
The target parameters are lagged copies of the model parameters. Because
their predictions mismatch those of the current model configuration, they
help learning by putting a pessimistic bound on the value being estimated.
This is a powerful trick (closely related to "Double Q-Learning") that is
ubiquitous in similar algorithms.


.. GENERATED FROM PYTHON SOURCE LINES 466-475

.. code-block:: Python


    def get_loss_module(actor, gamma):
        loss_module = DQNLoss(actor, delay_value=True)
        loss_module.make_value_estimator(gamma=gamma)
        target_updater = SoftUpdate(loss_module, eps=0.995)
        return loss_module, target_updater
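
What a soft (Polyak) update does to lagged parameters can be sketched on plain lists of floats. ``soft_update`` is a hypothetical helper, and the convention that ``eps`` weights the old target value is an assumption of this sketch.

```python
def soft_update(target_params, params, eps):
    """Move each target parameter a fraction (1 - eps) towards the online one."""
    return [eps * t + (1.0 - eps) * p for t, p in zip(target_params, params)]


toy_target = [0.0, 2.0]
toy_online = [1.0, 2.0]
toy_target = soft_update(toy_target, toy_online, eps=0.5)

# With eps close to 1, the target only slowly tracks the online parameters.
slow_target = [0.0]
for _ in range(100):
    slow_target = soft_update(slow_target, [1.0], eps=0.995)
```

After one step with ``eps=0.5`` the first target parameter sits halfway between its old value and the online one; with ``eps=0.995`` it is still well below the online value after 100 steps.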


.. GENERATED FROM PYTHON SOURCE LINES 476-482

Hyperparameters
---------------

Let's start with our hyperparameters. The following setting should work well
in practice, and the performance of the algorithm should hopefully not be
too sensitive to slight variations of these.

.. GENERATED FROM PYTHON SOURCE LINES 482-490

.. code-block:: Python


    is_fork = multiprocessing.get_start_method() == "fork"
    device = (
        torch.device(0)
        if torch.cuda.is_available() and not is_fork
        else torch.device("cpu")
    )


.. GENERATED FROM PYTHON SOURCE LINES 491-493

Optimizer
~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 493-503

.. code-block:: Python


    # the learning rate of the optimizer
    lr = 2e-3
    # weight decay
    wd = 1e-5
    # the beta parameters of Adam
    betas = (0.9, 0.999)
    # Optimization steps per batch collected (aka UPD or updates per data)
    n_optim = 8


.. GENERATED FROM PYTHON SOURCE LINES 504-507

DQN parameters
~~~~~~~~~~~~~~
Gamma decay factor:

.. GENERATED FROM PYTHON SOURCE LINES 507-509

.. code-block:: Python

    gamma = 0.99


.. GENERATED FROM PYTHON SOURCE LINES 510-513

Smooth target network update decay parameter.
This loosely corresponds to a 1/tau interval with hard target network
update

.. GENERATED FROM PYTHON SOURCE LINES 513-515

.. code-block:: Python

    tau = 0.02


.. GENERATED FROM PYTHON SOURCE LINES 516-529

Data collection and replay buffer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::
  Values to be used for proper training have been commented.

Total frames collected in the environment. In other implementations, the
user defines a maximum number of episodes.
This is harder to do with our data collectors since they return batches
of N collected frames, where N is a constant.
However, one can easily get the same restriction on number of episodes by
breaking the training loop when a certain number of
episodes has been collected.
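
The episode-based stopping rule mentioned above can be sketched in plain Python. Here each batch is represented by a hypothetical list of ``done`` flags, one per frame; ``collect_until`` is an illustrative name and not part of the library.

```python
def collect_until(batches, max_episodes):
    """Consume fixed-size batches until enough episodes have terminated."""
    episodes = 0
    frames = 0
    for done_flags in batches:
        frames += len(done_flags)
        episodes += sum(1 for done in done_flags if done)
        if episodes >= max_episodes:
            break  # stop the outer loop once the episode budget is reached
    return episodes, frames


toy_batches = [[False, True], [False, False], [True, True], [True, False]]
toy_episodes, toy_frames_seen = collect_until(toy_batches, max_episodes=2)
```

Because the check happens once per batch, the loop can overshoot the requested number of episodes: here it stops after three episodes and six frames.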

.. GENERATED FROM PYTHON SOURCE LINES 529-531

.. code-block:: Python

    total_frames = 5_000  # 500000


.. GENERATED FROM PYTHON SOURCE LINES 532-533

Random frames used to initialize the replay buffer.

.. GENERATED FROM PYTHON SOURCE LINES 533-535

.. code-block:: Python

    init_random_frames = 100  # 1000


.. GENERATED FROM PYTHON SOURCE LINES 536-537

Frames in each batch collected.

.. GENERATED FROM PYTHON SOURCE LINES 537-539

.. code-block:: Python

    frames_per_batch = 32  # 128


.. GENERATED FROM PYTHON SOURCE LINES 540-541

Frames sampled from the replay buffer at each optimization step

.. GENERATED FROM PYTHON SOURCE LINES 541-543

.. code-block:: Python

    batch_size = 32  # 256


.. GENERATED FROM PYTHON SOURCE LINES 544-545

Size of the replay buffer in terms of frames

.. GENERATED FROM PYTHON SOURCE LINES 545-547

.. code-block:: Python

    buffer_size = min(total_frames, 100000)


.. GENERATED FROM PYTHON SOURCE LINES 548-549

Number of environments run in parallel in each data collector

.. GENERATED FROM PYTHON SOURCE LINES 549-552

.. code-block:: Python

    num_workers = 2  # 8
    num_collectors = 2  # 4


.. GENERATED FROM PYTHON SOURCE LINES 553-560

Environment and exploration
~~~~~~~~~~~~~~~~~~~~~~~~~~~

We set the initial and final value of the epsilon factor in Epsilon-greedy
exploration.
Since our policy is deterministic, exploration is crucial: without it, the
only source of randomness would be the environment reset.

.. GENERATED FROM PYTHON SOURCE LINES 560-564

.. code-block:: Python


    eps_greedy_val = 0.1
    eps_greedy_val_env = 0.005
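
A plain-Python sketch of epsilon-greedy action selection with an annealed epsilon is given below. The function names are hypothetical, and the linear shape of the schedule is an assumption made for this sketch.

```python
import random


def annealed_epsilon(frames_seen, eps_init, eps_end, annealing_num_steps):
    """Linearly interpolate from ``eps_init`` to ``eps_end``, then stay constant."""
    fraction = min(frames_seen / annealing_num_steps, 1.0)
    return eps_init + fraction * (eps_end - eps_init)


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability ``epsilon``, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
eps_start = annealed_epsilon(0, 0.1, 0.005, 5000)
eps_half = annealed_epsilon(2500, 0.1, 0.005, 5000)
eps_final = annealed_epsilon(10000, 0.1, 0.005, 5000)
greedy_choice = epsilon_greedy([0.2, 0.8], epsilon=0.0, rng=rng)
```

With ``epsilon=0.0`` the choice is always the greedy action; as epsilon decays from 0.1 to 0.005, random actions become progressively rarer.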


.. GENERATED FROM PYTHON SOURCE LINES 565-567

To speed up learning, we set the bias of the last layer of our value network
to a predefined value (this is not mandatory)

.. GENERATED FROM PYTHON SOURCE LINES 567-569

.. code-block:: Python

    init_bias = 2.0


.. GENERATED FROM PYTHON SOURCE LINES 570-575

.. note::
  For fast rendering of the tutorial, the ``total_frames`` hyperparameter
  was set to a very low number. To get a reasonable performance, use a greater
  value, e.g. 500000.


.. GENERATED FROM PYTHON SOURCE LINES 577-594

Building a Trainer
------------------

TorchRL's :class:`~torchrl.trainers.Trainer` class constructor takes the
following keyword-only arguments:

- ``collector``
- ``loss_module``
- ``optimizer``
- ``logger``: the logger that records the training statistics (a
  ``CSVLogger`` in this tutorial).
- ``total_frames``: this parameter defines the lifespan of the trainer.
- ``frame_skip``: when a frame-skip is used, the collector must be made
  aware of it in order to accurately count the number of frames
  collected etc. Making the trainer aware of this parameter is not
  mandatory but helps to have a fairer comparison between settings where
  the total number of frames (budget) is fixed but the frame-skip is
  variable.

.. GENERATED FROM PYTHON SOURCE LINES 594-617

.. code-block:: Python


    stats = get_norm_stats()
    test_env = make_env(parallel=False, obs_norm_sd=stats)
    # Get model
    actor, actor_explore = make_model(test_env)
    loss_module, target_net_updater = get_loss_module(actor, gamma)

    collector = get_collector(
        stats=stats,
        num_collectors=num_collectors,
        actor_explore=actor_explore,
        frames_per_batch=frames_per_batch,
        total_frames=total_frames,
        device=device,
    )
    optimizer = torch.optim.Adam(
        loss_module.parameters(), lr=lr, weight_decay=wd, betas=betas
    )
    exp_name = f"dqn_exp_{uuid.uuid1()}"
    tmpdir = tempfile.TemporaryDirectory()
    logger = CSVLogger(exp_name=exp_name, log_dir=tmpdir.name)
    warnings.warn(f"log dir: {logger.experiment.log_dir}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    state dict of the observation norm: OrderedDict([('standard_normal', tensor(True)), ('loc', tensor([[[0.9895]],

            [[0.9895]],

            [[0.9895]],

            [[0.9895]]])), ('scale', tensor([[[0.0737]],

            [[0.0737]],

            [[0.0737]],

            [[0.0737]]]))])


.. GENERATED FROM PYTHON SOURCE LINES 618-620

We can control how often the scalars should be logged. Here we set this
to a low value as our training loop is short:

.. GENERATED FROM PYTHON SOURCE LINES 620-634

.. code-block:: Python


    log_interval = 500

    trainer = Trainer(
        collector=collector,
        total_frames=total_frames,
        frame_skip=1,
        loss_module=loss_module,
        optimizer=optimizer,
        logger=logger,
        optim_steps_per_batch=n_optim,
        log_interval=log_interval,
    )


.. GENERATED FROM PYTHON SOURCE LINES 635-646

Registering hooks
~~~~~~~~~~~~~~~~~

Registering hooks can be achieved in two separate ways:

- If the hook has it, the :meth:`~torchrl.trainers.TrainerHookBase.register`
  method is the first choice. One just needs to provide the trainer as input
  and the hook will be registered with a default name at a default location.
  For some hooks, the registration can be quite complex: :class:`~torchrl.trainers.ReplayBufferTrainer`
  requires 3 hooks (``extend``, ``sample`` and ``update_priority``) which
  can be cumbersome to implement.

.. GENERATED FROM PYTHON SOURCE LINES 646-666

.. code-block:: Python

    buffer_hook = ReplayBufferTrainer(
        get_replay_buffer(buffer_size, n_optim, batch_size=batch_size),
        flatten_tensordicts=True,
    )
    buffer_hook.register(trainer)
    weight_updater = UpdateWeights(collector, update_weights_interval=1)
    weight_updater.register(trainer)
    recorder = Recorder(
        record_interval=100,  # log every 100 optimization steps
        record_frames=1000,  # maximum number of frames in the record
        frame_skip=1,
        policy_exploration=actor_explore,
        environment=test_env,
        exploration_type=ExplorationType.MODE,
        log_keys=[("next", "reward")],
        out_keys={("next", "reward"): "rewards"},
        log_pbar=True,
    )
    recorder.register(trainer)


.. GENERATED FROM PYTHON SOURCE LINES 667-669

The exploration module epsilon factor is also annealed:


.. GENERATED FROM PYTHON SOURCE LINES 669-672

.. code-block:: Python


    trainer.register_op("post_steps", actor_explore[1].step, frames=frames_per_batch)


.. GENERATED FROM PYTHON SOURCE LINES 673-681

- Any callable (including :class:`~torchrl.trainers.TrainerHookBase`
  subclasses) can be registered using :meth:`~torchrl.trainers.Trainer.register_op`.
  In this case, a location must be explicitly passed. This method gives
  more control over the location of the hook but it also requires more
  understanding of the Trainer mechanism.
  Check the `trainer documentation <https://pytorch.org/rl/reference/trainers.html>`_
  for a detailed description of the trainer hooks.


.. GENERATED FROM PYTHON SOURCE LINES 681-683

.. code-block:: Python

    trainer.register_op("post_optim", target_net_updater.step)


.. GENERATED FROM PYTHON SOURCE LINES 684-691

We can log the training rewards too. Note that this is of limited interest
with CartPole, as rewards are always 1. The discounted sum of rewards is
maximised not by getting higher rewards but by keeping the cart-pole alive
for longer.
This will be reflected by the ``total_rewards`` value displayed in the
progress bar.
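
The arithmetic behind this remark can be checked with a short plain-Python sketch. ``discounted_return`` is a hypothetical helper that assumes a constant reward at every step until the episode ends.

```python
def discounted_return(n_steps, reward=1.0, gamma=0.99):
    """Discounted sum of a constant reward over an episode of ``n_steps`` steps."""
    return sum(reward * gamma**t for t in range(n_steps))


short_episode = discounted_return(20)
long_episode = discounted_return(200)
two_steps = discounted_return(2, reward=1.0, gamma=0.5)
```

Since the per-step reward never changes, the only way to increase the discounted sum is to survive for more steps, and the sum is bounded above by ``reward / (1 - gamma)``.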


.. GENERATED FROM PYTHON SOURCE LINES 691-694

.. code-block:: Python

    log_reward = LogReward(log_pbar=True)
    log_reward.register(trainer)


.. GENERATED FROM PYTHON SOURCE LINES 695-704

.. note::
  It is possible to link multiple optimizers to the trainer if needed.
  In this case, each optimizer will be tied to a field in the loss
  dictionary.
  Check the :class:`~torchrl.trainers.OptimizerHook` to learn more.

Here we are, ready to train our algorithm! A simple call to
``trainer.train()`` and we'll be getting our results logged in.


.. GENERATED FROM PYTHON SOURCE LINES 704-706

.. code-block:: Python

    trainer.train()


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

      0%|          | 0/5000 [00:00<?, ?it/s]
    r_training: 0.3688, rewards: 0.1000, total_rewards: 0.9434:   1%|          | 32/5000 [00:07<19:07,  4.33it/s]
    r_training: 0.3688, rewards: 0.1000, total_rewards: 0.9434:   1%|▏         | 64/5000 [00:07<08:21,  9.84it/s]
    r_training: 0.3323, rewards: 0.1000, total_rewards: 0.9434:   2%|▏         | 96/5000 [00:08<04:53, 16.70it/s]
    r_training: 0.3688, rewards: 0.1000, total_rewards: 0.9434:   3%|▎         | 128/5000 [00:08<03:15, 24.87it/s]
    r_training: 0.3688, rewards: 0.1000, total_rewards: 0.9434:   3%|▎         | 160/5000 [00:08<02:23, 33.70it/s]
    r_training: 0.3718, rewards: 0.1000, total_rewards: 0.9434:   4%|▍         | 192/5000 [00:09<01:51, 43.22it/s]
    r_training: 0.3688, rewards: 0.1000, total_rewards: 0.9434:   4%|▍         | 224/5000 [00:09<01:32, 51.72it/s]
    r_training: 0.3899, rewards: 0.1000, total_rewards: 0.9434:   5%|▌         | 256/5000 [00:09<01:18, 60.75it/s]
    r_training: 0.3778, rewards: 0.1000, total_rewards: 0.9434:   6%|▌         | 288/5000 [00:10<01:08, 68.85it/s]
    r_training: 0.3688, rewards: 0.1000, total_rewards: 0.9434:   6%|▋         | 320/5000 [00:10<01:02, 75.42it/s]
    r_training: 0.3808, rewards: 0.1000, total_rewards: 0.9434:   7%|▋         | 352/5000 [00:10<00:58, 79.47it/s]
    r_training: 0.3475, rewards: 0.1000, total_rewards: 0.9434:   8%|▊         | 384/5000 [00:11<00:55, 82.45it/s]
    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:   8%|▊         | 416/5000 [00:11<00:53, 85.93it/s]
    r_training: 0.3688, rewards: 0.1000, total_rewards: 0.9434:   9%|▉         | 448/5000 [00:11<00:51, 87.81it/s]
    r_training: 0.3899, rewards: 0.1000, total_rewards: 0.9434:  10%|▉         | 480/5000 [00:12<00:50, 89.53it/s]
    r_training: 0.3475, rewards: 0.1000, total_rewards: 0.9434:  10%|█         | 512/5000 [00:12<00:50, 89.66it/s]
    r_training: 0.3688, rewards: 0.1000, total_rewards: 0.9434:  11%|█         | 544/5000 [00:12<00:48, 91.46it/s]
    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  12%|█▏        | 576/5000 [00:13<00:46, 95.67it/s]
    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  12%|█▏        | 608/5000 [00:13<00:46, 94.14it/s]
    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  13%|█▎        | 640/5000 [00:13<00:47, 92.11it/s]
    r_training: 0.3688, rewards: 0.1000, total_rewards: 0.9434:  13%|█▎        | 672/5000 [00:14<00:46, 92.27it/s]
    r_training: 0.3869, rewards: 0.1000, total_rewards: 0.9434:  14%|█▍        | 704/5000 [00:14<00:46, 92.10it/s]
    r_training: 0.3869, rewards: 0.1000, total_rewards: 0.9434:  15%|█▍        | 736/5000 [00:14<00:46, 91.05it/s]
    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  15%|█▌        | 768/5000 [00:15<00:47, 89.63it/s]
    r_training: 0.3778, rewards: 0.1000, total_rewards: 0.9434:  16%|█▌        | 800/5000 [00:15<00:46, 89.89it/s]
    r_training: 0.3688, rewards: 0.1000, total_rewards: 0.9434:  17%|█▋        | 832/5000 [00:16<00:46, 89.07it/s]
    r_training: 0.3688, rewards: 
0.1000, total_rewards: 0.9434:  17%|█▋        | 864/5000 [00:16<00:46, 88.07it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  17%|█▋        | 864/5000 [00:16<00:46, 88.07it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  18%|█▊        | 896/5000 [00:16<00:46, 89.06it/s]    r_training: 0.3899, rewards: 0.1000, total_rewards: 0.9434:  18%|█▊        | 896/5000 [00:16<00:46, 89.06it/s]    r_training: 0.3899, rewards: 0.1000, total_rewards: 0.9434:  19%|█▊        | 928/5000 [00:17<00:44, 90.53it/s]    r_training: 0.3445, rewards: 0.1000, total_rewards: 0.9434:  19%|█▊        | 928/5000 [00:17<00:44, 90.53it/s]    r_training: 0.3445, rewards: 0.1000, total_rewards: 0.9434:  19%|█▉        | 960/5000 [00:17<00:44, 91.34it/s]    r_training: 0.3869, rewards: 0.1000, total_rewards: 0.9434:  19%|█▉        | 960/5000 [00:17<00:44, 91.34it/s]    r_training: 0.3869, rewards: 0.1000, total_rewards: 0.9434:  20%|█▉        | 992/5000 [00:17<00:43, 92.97it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  20%|█▉        | 992/5000 [00:17<00:43, 92.97it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  20%|██        | 1024/5000 [00:18<00:41, 94.70it/s]    r_training: 0.4082, rewards: 0.1000, total_rewards: 0.9434:  20%|██        | 1024/5000 [00:18<00:41, 94.70it/s]    r_training: 0.4082, rewards: 0.1000, total_rewards: 0.9434:  21%|██        | 1056/5000 [00:18<00:41, 94.55it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  21%|██        | 1056/5000 [00:18<00:41, 94.55it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  22%|██▏       | 1088/5000 [00:18<00:41, 93.90it/s]    r_training: 0.3688, rewards: 0.1000, total_rewards: 0.9434:  22%|██▏       | 1088/5000 [00:18<00:41, 93.90it/s]    r_training: 0.3688, rewards: 0.1000, total_rewards: 0.9434:  22%|██▏       | 1120/5000 [00:19<00:41, 94.12it/s]    r_training: 0.4082, rewards: 0.1000, total_rewards: 0.9434:  
22%|██▏       | 1120/5000 [00:19<00:41, 94.12it/s]    r_training: 0.4082, rewards: 0.1000, total_rewards: 0.9434:  23%|██▎       | 1152/5000 [00:19<00:40, 94.11it/s]    r_training: 0.3778, rewards: 0.1000, total_rewards: 0.9434:  23%|██▎       | 1152/5000 [00:19<00:40, 94.11it/s]    r_training: 0.3778, rewards: 0.1000, total_rewards: 0.9434:  24%|██▎       | 1184/5000 [00:19<00:40, 93.53it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  24%|██▎       | 1184/5000 [00:19<00:40, 93.53it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  24%|██▍       | 1216/5000 [00:20<00:39, 95.29it/s]    r_training: 0.4173, rewards: 0.1000, total_rewards: 0.9434:  24%|██▍       | 1216/5000 [00:20<00:39, 95.29it/s]    r_training: 0.4173, rewards: 0.1000, total_rewards: 0.9434:  25%|██▍       | 1248/5000 [00:20<00:39, 95.72it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  25%|██▍       | 1248/5000 [00:20<00:39, 95.72it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  26%|██▌       | 1280/5000 [00:20<00:38, 96.88it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  26%|██▌       | 1280/5000 [00:20<00:38, 96.88it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  26%|██▌       | 1312/5000 [00:21<00:38, 94.91it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  26%|██▌       | 1312/5000 [00:21<00:38, 94.91it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  27%|██▋       | 1344/5000 [00:21<00:39, 93.29it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  27%|██▋       | 1344/5000 [00:21<00:39, 93.29it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  28%|██▊       | 1376/5000 [00:21<00:39, 92.48it/s]    r_training: 0.3869, rewards: 0.1000, total_rewards: 0.9434:  28%|██▊       | 1376/5000 [00:21<00:39, 92.48it/s]    r_training: 0.3869, rewards: 0.1000, total_rewards: 0.9434:  28%|██▊       | 1408/5000 
[00:22<00:38, 92.19it/s]    r_training: 0.4082, rewards: 0.1000, total_rewards: 0.9434:  28%|██▊       | 1408/5000 [00:22<00:38, 92.19it/s]    r_training: 0.4082, rewards: 0.1000, total_rewards: 0.9434:  29%|██▉       | 1440/5000 [00:22<00:38, 92.46it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  29%|██▉       | 1440/5000 [00:22<00:38, 92.46it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  29%|██▉       | 1472/5000 [00:22<00:37, 93.31it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  29%|██▉       | 1472/5000 [00:22<00:37, 93.31it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  30%|███       | 1504/5000 [00:23<00:36, 94.85it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  30%|███       | 1504/5000 [00:23<00:36, 94.85it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  31%|███       | 1536/5000 [00:23<00:36, 94.70it/s]    r_training: 0.4082, rewards: 0.1000, total_rewards: 0.9434:  31%|███       | 1536/5000 [00:23<00:36, 94.70it/s]    r_training: 0.4082, rewards: 0.1000, total_rewards: 0.9434:  31%|███▏      | 1568/5000 [00:23<00:35, 96.43it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  31%|███▏      | 1568/5000 [00:23<00:35, 96.43it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  32%|███▏      | 1600/5000 [00:24<00:35, 96.06it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  32%|███▏      | 1600/5000 [00:24<00:35, 96.06it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  33%|███▎      | 1632/5000 [00:24<00:35, 94.79it/s]    r_training: 0.3899, rewards: 0.1000, total_rewards: 0.9434:  33%|███▎      | 1632/5000 [00:24<00:35, 94.79it/s]    r_training: 0.3899, rewards: 0.1000, total_rewards: 0.9434:  33%|███▎      | 1664/5000 [00:24<00:36, 92.37it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  33%|███▎      | 1664/5000 [00:24<00:36, 92.37it/s]   
 r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  34%|███▍      | 1696/5000 [00:25<00:36, 91.57it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  34%|███▍      | 1696/5000 [00:25<00:36, 91.57it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  35%|███▍      | 1728/5000 [00:25<00:35, 92.77it/s]    r_training: 0.4173, rewards: 0.1000, total_rewards: 0.9434:  35%|███▍      | 1728/5000 [00:25<00:35, 92.77it/s]    r_training: 0.4173, rewards: 0.1000, total_rewards: 0.9434:  35%|███▌      | 1760/5000 [00:25<00:34, 92.65it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  35%|███▌      | 1760/5000 [00:25<00:34, 92.65it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  36%|███▌      | 1792/5000 [00:26<00:34, 92.66it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  36%|███▌      | 1792/5000 [00:26<00:34, 92.66it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  36%|███▋      | 1824/5000 [00:26<00:33, 94.93it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  36%|███▋      | 1824/5000 [00:26<00:33, 94.93it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  37%|███▋      | 1856/5000 [00:26<00:33, 92.57it/s]    r_training: 0.3869, rewards: 0.1000, total_rewards: 0.9434:  37%|███▋      | 1856/5000 [00:26<00:33, 92.57it/s]    r_training: 0.3869, rewards: 0.1000, total_rewards: 0.9434:  38%|███▊      | 1888/5000 [00:27<00:33, 91.98it/s]    r_training: 0.4021, rewards: 0.1000, total_rewards: 0.9434:  38%|███▊      | 1888/5000 [00:27<00:33, 91.98it/s]    r_training: 0.4021, rewards: 0.1000, total_rewards: 0.9434:  38%|███▊      | 1920/5000 [00:27<00:33, 90.86it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  38%|███▊      | 1920/5000 [00:27<00:33, 90.86it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  39%|███▉      | 1952/5000 [00:28<00:33, 91.87it/s]    r_training: 0.3991, 
rewards: 0.1000, total_rewards: 0.9434:  39%|███▉      | 1952/5000 [00:28<00:33, 91.87it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  40%|███▉      | 1984/5000 [00:28<00:32, 92.78it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  40%|███▉      | 1984/5000 [00:28<00:32, 92.78it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  40%|████      | 2016/5000 [00:28<00:31, 94.57it/s]    r_training: 0.4173, rewards: 0.1000, total_rewards: 0.9434:  40%|████      | 2016/5000 [00:28<00:31, 94.57it/s]    r_training: 0.4173, rewards: 0.1000, total_rewards: 0.9434:  41%|████      | 2048/5000 [00:28<00:30, 96.63it/s]    r_training: 0.4021, rewards: 0.1000, total_rewards: 0.9434:  41%|████      | 2048/5000 [00:28<00:30, 96.63it/s]    r_training: 0.4021, rewards: 0.1000, total_rewards: 0.9434:  42%|████▏     | 2080/5000 [00:29<00:30, 97.18it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  42%|████▏     | 2080/5000 [00:29<00:30, 97.18it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  42%|████▏     | 2112/5000 [00:29<00:29, 96.44it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  42%|████▏     | 2112/5000 [00:29<00:29, 96.44it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  43%|████▎     | 2144/5000 [00:29<00:29, 97.31it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  43%|████▎     | 2144/5000 [00:29<00:29, 97.31it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  44%|████▎     | 2176/5000 [00:30<00:29, 95.34it/s]    r_training: 0.4082, rewards: 0.1000, total_rewards: 0.9434:  44%|████▎     | 2176/5000 [00:30<00:29, 95.34it/s]    r_training: 0.4082, rewards: 0.1000, total_rewards: 0.9434:  44%|████▍     | 2208/5000 [00:30<00:29, 95.25it/s]    r_training: 0.4173, rewards: 0.1000, total_rewards: 0.9434:  44%|████▍     | 2208/5000 [00:30<00:29, 95.25it/s]    r_training: 0.4173, rewards: 0.1000, 
total_rewards: 0.9434:  45%|████▍     | 2240/5000 [00:31<00:29, 94.25it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  45%|████▍     | 2240/5000 [00:31<00:29, 94.25it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  45%|████▌     | 2272/5000 [00:31<00:29, 91.91it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  45%|████▌     | 2272/5000 [00:31<00:29, 91.91it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  46%|████▌     | 2304/5000 [00:31<00:29, 91.60it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  46%|████▌     | 2304/5000 [00:31<00:29, 91.60it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  47%|████▋     | 2336/5000 [00:32<00:29, 90.65it/s]    r_training: 0.4082, rewards: 0.1000, total_rewards: 0.9434:  47%|████▋     | 2336/5000 [00:32<00:29, 90.65it/s]    r_training: 0.4082, rewards: 0.1000, total_rewards: 0.9434:  47%|████▋     | 2368/5000 [00:32<00:29, 90.37it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  47%|████▋     | 2368/5000 [00:32<00:29, 90.37it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  48%|████▊     | 2400/5000 [00:32<00:28, 91.13it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  48%|████▊     | 2400/5000 [00:32<00:28, 91.13it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  49%|████▊     | 2432/5000 [00:33<00:28, 90.70it/s]    r_training: 0.3688, rewards: 0.1000, total_rewards: 0.9434:  49%|████▊     | 2432/5000 [00:33<00:28, 90.70it/s]    r_training: 0.3688, rewards: 0.1000, total_rewards: 0.9434:  49%|████▉     | 2464/5000 [00:33<00:27, 91.81it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  49%|████▉     | 2464/5000 [00:33<00:27, 91.81it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  50%|████▉     | 2496/5000 [00:33<00:27, 91.18it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  
50%|████▉     | 2496/5000 [00:33<00:27, 91.18it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  51%|█████     | 2528/5000 [00:34<00:27, 90.90it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  51%|█████     | 2528/5000 [00:34<00:27, 90.90it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  51%|█████     | 2560/5000 [00:34<00:26, 90.84it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  51%|█████     | 2560/5000 [00:34<00:26, 90.84it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  52%|█████▏    | 2592/5000 [00:34<00:26, 92.18it/s]    r_training: 0.4021, rewards: 0.1000, total_rewards: 0.9434:  52%|█████▏    | 2592/5000 [00:34<00:26, 92.18it/s]    r_training: 0.4021, rewards: 0.1000, total_rewards: 0.9434:  52%|█████▏    | 2624/5000 [00:35<00:26, 91.04it/s]    r_training: 0.4173, rewards: 0.1000, total_rewards: 0.9434:  52%|█████▏    | 2624/5000 [00:35<00:26, 91.04it/s]    r_training: 0.4173, rewards: 0.1000, total_rewards: 0.9434:  53%|█████▎    | 2656/5000 [00:35<00:26, 89.48it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  53%|█████▎    | 2656/5000 [00:35<00:26, 89.48it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  54%|█████▍    | 2688/5000 [00:35<00:25, 91.03it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  54%|█████▍    | 2688/5000 [00:35<00:25, 91.03it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  54%|█████▍    | 2720/5000 [00:36<00:24, 92.01it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  54%|█████▍    | 2720/5000 [00:36<00:24, 92.01it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  55%|█████▌    | 2752/5000 [00:36<00:24, 93.01it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  55%|█████▌    | 2752/5000 [00:36<00:24, 93.01it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  56%|█████▌    | 2784/5000 
[00:36<00:23, 94.76it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  56%|█████▌    | 2784/5000 [00:36<00:23, 94.76it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  56%|█████▋    | 2816/5000 [00:37<00:22, 95.82it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  56%|█████▋    | 2816/5000 [00:37<00:22, 95.82it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  57%|█████▋    | 2848/5000 [00:37<00:22, 93.97it/s]    r_training: 0.4082, rewards: 0.1000, total_rewards: 0.9434:  57%|█████▋    | 2848/5000 [00:37<00:22, 93.97it/s]    r_training: 0.4082, rewards: 0.1000, total_rewards: 0.9434:  58%|█████▊    | 2880/5000 [00:37<00:22, 94.13it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  58%|█████▊    | 2880/5000 [00:37<00:22, 94.13it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  58%|█████▊    | 2912/5000 [00:38<00:22, 93.44it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  58%|█████▊    | 2912/5000 [00:38<00:22, 93.44it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  59%|█████▉    | 2944/5000 [00:38<00:21, 93.58it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  59%|█████▉    | 2944/5000 [00:38<00:21, 93.58it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  60%|█████▉    | 2976/5000 [00:39<00:21, 92.27it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  60%|█████▉    | 2976/5000 [00:39<00:21, 92.27it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  60%|██████    | 3008/5000 [00:39<00:21, 93.03it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  60%|██████    | 3008/5000 [00:39<00:21, 93.03it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  61%|██████    | 3040/5000 [00:39<00:21, 91.52it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  61%|██████    | 3040/5000 [00:39<00:21, 91.52it/s]   
 r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  61%|██████▏   | 3072/5000 [00:40<00:20, 92.16it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  61%|██████▏   | 3072/5000 [00:40<00:20, 92.16it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  62%|██████▏   | 3104/5000 [00:40<00:20, 91.32it/s]    r_training: 0.3899, rewards: 0.1000, total_rewards: 0.9434:  62%|██████▏   | 3104/5000 [00:40<00:20, 91.32it/s]    r_training: 0.3899, rewards: 0.1000, total_rewards: 0.9434:  63%|██████▎   | 3136/5000 [00:40<00:20, 91.58it/s]    r_training: 0.4173, rewards: 0.1000, total_rewards: 0.9434:  63%|██████▎   | 3136/5000 [00:40<00:20, 91.58it/s]    r_training: 0.4173, rewards: 0.1000, total_rewards: 0.9434:  63%|██████▎   | 3168/5000 [00:41<00:20, 90.12it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  63%|██████▎   | 3168/5000 [00:41<00:20, 90.12it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 0.9434:  64%|██████▍   | 3200/5000 [00:41<00:20, 89.78it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  64%|██████▍   | 3200/5000 [00:41<00:20, 89.78it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 0.9434:  65%|██████▍   | 3232/5000 [00:48<02:03, 14.32it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  65%|██████▍   | 3232/5000 [00:48<02:03, 14.32it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  65%|██████▌   | 3264/5000 [00:48<01:30, 19.26it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  65%|██████▌   | 3264/5000 [00:48<01:30, 19.26it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  66%|██████▌   | 3296/5000 [00:48<01:07, 25.39it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 4.5455:  66%|██████▌   | 3296/5000 [00:48<01:07, 25.39it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 4.5455:  67%|██████▋   | 3328/5000 [00:49<00:51, 32.37it/s]    r_training: 0.4295, 
rewards: 0.1000, total_rewards: 4.5455:  67%|██████▋   | 3328/5000 [00:49<00:51, 32.37it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  67%|██████▋   | 3360/5000 [00:49<00:41, 39.84it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  67%|██████▋   | 3360/5000 [00:49<00:41, 39.84it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  68%|██████▊   | 3392/5000 [00:49<00:32, 48.75it/s]    r_training: 0.4082, rewards: 0.1000, total_rewards: 4.5455:  68%|██████▊   | 3392/5000 [00:49<00:32, 48.75it/s]    r_training: 0.4082, rewards: 0.1000, total_rewards: 4.5455:  68%|██████▊   | 3424/5000 [00:50<00:27, 56.74it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 4.5455:  68%|██████▊   | 3424/5000 [00:50<00:27, 56.74it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 4.5455:  69%|██████▉   | 3456/5000 [00:50<00:24, 63.61it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  69%|██████▉   | 3456/5000 [00:50<00:24, 63.61it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  70%|██████▉   | 3488/5000 [00:50<00:21, 69.81it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 4.5455:  70%|██████▉   | 3488/5000 [00:50<00:21, 69.81it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 4.5455:  70%|███████   | 3520/5000 [00:51<00:19, 75.70it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  70%|███████   | 3520/5000 [00:51<00:19, 75.70it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  71%|███████   | 3552/5000 [00:51<00:18, 79.60it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  71%|███████   | 3552/5000 [00:51<00:18, 79.60it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  72%|███████▏  | 3584/5000 [00:51<00:17, 81.86it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  72%|███████▏  | 3584/5000 [00:51<00:17, 81.86it/s]    r_training: 0.4295, rewards: 0.1000, 
total_rewards: 4.5455:  72%|███████▏  | 3616/5000 [00:52<00:16, 83.39it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 4.5455:  72%|███████▏  | 3616/5000 [00:52<00:16, 83.39it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 4.5455:  73%|███████▎  | 3648/5000 [00:52<00:15, 85.82it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 4.5455:  73%|███████▎  | 3648/5000 [00:52<00:15, 85.82it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 4.5455:  74%|███████▎  | 3680/5000 [00:52<00:15, 87.83it/s]    r_training: 0.4173, rewards: 0.1000, total_rewards: 4.5455:  74%|███████▎  | 3680/5000 [00:52<00:15, 87.83it/s]    r_training: 0.4173, rewards: 0.1000, total_rewards: 4.5455:  74%|███████▍  | 3712/5000 [00:53<00:14, 87.03it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  74%|███████▍  | 3712/5000 [00:53<00:14, 87.03it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  75%|███████▍  | 3744/5000 [00:53<00:14, 88.19it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  75%|███████▍  | 3744/5000 [00:53<00:14, 88.19it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  76%|███████▌  | 3776/5000 [00:54<00:13, 90.27it/s]    r_training: 0.4021, rewards: 0.1000, total_rewards: 4.5455:  76%|███████▌  | 3776/5000 [00:54<00:13, 90.27it/s]    r_training: 0.4021, rewards: 0.1000, total_rewards: 4.5455:  76%|███████▌  | 3808/5000 [00:54<00:13, 89.49it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  76%|███████▌  | 3808/5000 [00:54<00:13, 89.49it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  77%|███████▋  | 3840/5000 [00:54<00:12, 90.07it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  77%|███████▋  | 3840/5000 [00:54<00:12, 90.07it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  77%|███████▋  | 3872/5000 [00:55<00:12, 91.12it/s]    r_training: 0.4173, rewards: 0.1000, total_rewards: 4.5455:  
77%|███████▋  | 3872/5000 [00:55<00:12, 91.12it/s]    r_training: 0.4173, rewards: 0.1000, total_rewards: 4.5455:  78%|███████▊  | 3904/5000 [00:55<00:11, 91.46it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  78%|███████▊  | 3904/5000 [00:55<00:11, 91.46it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  79%|███████▊  | 3936/5000 [00:55<00:11, 91.04it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  79%|███████▊  | 3936/5000 [00:55<00:11, 91.04it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  79%|███████▉  | 3968/5000 [00:56<00:11, 91.50it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  79%|███████▉  | 3968/5000 [00:56<00:11, 91.50it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  80%|████████  | 4000/5000 [00:56<00:11, 90.81it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  80%|████████  | 4000/5000 [00:56<00:11, 90.81it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  81%|████████  | 4032/5000 [00:56<00:10, 92.69it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 4.5455:  81%|████████  | 4032/5000 [00:56<00:10, 92.69it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 4.5455:  81%|████████▏ | 4064/5000 [00:57<00:10, 91.33it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  81%|████████▏ | 4064/5000 [00:57<00:10, 91.33it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  82%|████████▏ | 4096/5000 [00:57<00:09, 91.04it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  82%|████████▏ | 4096/5000 [00:57<00:09, 91.04it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  83%|████████▎ | 4128/5000 [00:57<00:09, 90.52it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  83%|████████▎ | 4128/5000 [00:57<00:09, 90.52it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  83%|████████▎ | 4160/5000 
[00:58<00:09, 92.30it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  83%|████████▎ | 4160/5000 [00:58<00:09, 92.30it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  84%|████████▍ | 4192/5000 [00:58<00:08, 92.24it/s]    r_training: 0.3718, rewards: 0.1000, total_rewards: 4.5455:  84%|████████▍ | 4192/5000 [00:58<00:08, 92.24it/s]    r_training: 0.3718, rewards: 0.1000, total_rewards: 4.5455:  84%|████████▍ | 4224/5000 [00:58<00:08, 91.29it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  84%|████████▍ | 4224/5000 [00:58<00:08, 91.29it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  85%|████████▌ | 4256/5000 [00:59<00:08, 91.12it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  85%|████████▌ | 4256/5000 [00:59<00:08, 91.12it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  86%|████████▌ | 4288/5000 [00:59<00:07, 92.86it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 4.5455:  86%|████████▌ | 4288/5000 [00:59<00:07, 92.86it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 4.5455:  86%|████████▋ | 4320/5000 [00:59<00:07, 92.47it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  86%|████████▋ | 4320/5000 [00:59<00:07, 92.47it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  87%|████████▋ | 4352/5000 [01:00<00:06, 94.75it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  87%|████████▋ | 4352/5000 [01:00<00:06, 94.75it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  88%|████████▊ | 4384/5000 [01:00<00:06, 95.03it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  88%|████████▊ | 4384/5000 [01:00<00:06, 95.03it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  88%|████████▊ | 4416/5000 [01:00<00:06, 94.11it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 4.5455:  88%|████████▊ | 4416/5000 [01:00<00:06, 94.11it/s]   
 r_training: 0.3991, rewards: 0.1000, total_rewards: 4.5455:  89%|████████▉ | 4448/5000 [01:01<00:05, 94.88it/s]    r_training: 0.4082, rewards: 0.1000, total_rewards: 4.5455:  89%|████████▉ | 4448/5000 [01:01<00:05, 94.88it/s]    r_training: 0.4082, rewards: 0.1000, total_rewards: 4.5455:  90%|████████▉ | 4480/5000 [01:01<00:05, 94.72it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 4.5455:  90%|████████▉ | 4480/5000 [01:01<00:05, 94.72it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 4.5455:  90%|█████████ | 4512/5000 [01:02<00:05, 92.99it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 4.5455:  90%|█████████ | 4512/5000 [01:02<00:05, 92.99it/s]    r_training: 0.3991, rewards: 0.1000, total_rewards: 4.5455:  91%|█████████ | 4544/5000 [01:02<00:04, 94.93it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  91%|█████████ | 4544/5000 [01:02<00:04, 94.93it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  92%|█████████▏| 4576/5000 [01:02<00:04, 93.39it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  92%|█████████▏| 4576/5000 [01:02<00:04, 93.39it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  92%|█████████▏| 4608/5000 [01:03<00:04, 93.80it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  92%|█████████▏| 4608/5000 [01:03<00:04, 93.80it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  93%|█████████▎| 4640/5000 [01:03<00:03, 92.57it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  93%|█████████▎| 4640/5000 [01:03<00:03, 92.57it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  93%|█████████▎| 4672/5000 [01:03<00:03, 90.85it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  93%|█████████▎| 4672/5000 [01:03<00:03, 90.85it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  94%|█████████▍| 4704/5000 [01:04<00:03, 90.47it/s]    r_training: 0.4295, 
rewards: 0.1000, total_rewards: 4.5455:  94%|█████████▍| 4704/5000 [01:04<00:03, 90.47it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  95%|█████████▍| 4736/5000 [01:04<00:02, 90.45it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  95%|█████████▍| 4736/5000 [01:04<00:02, 90.45it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  95%|█████████▌| 4768/5000 [01:04<00:02, 91.66it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  95%|█████████▌| 4768/5000 [01:04<00:02, 91.66it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  96%|█████████▌| 4800/5000 [01:05<00:02, 89.86it/s]    r_training: 0.4173, rewards: 0.1000, total_rewards: 4.5455:  96%|█████████▌| 4800/5000 [01:05<00:02, 89.86it/s]    r_training: 0.4173, rewards: 0.1000, total_rewards: 4.5455:  97%|█████████▋| 4832/5000 [01:05<00:01, 91.62it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  97%|█████████▋| 4832/5000 [01:05<00:01, 91.62it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  97%|█████████▋| 4864/5000 [01:05<00:01, 90.96it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  97%|█████████▋| 4864/5000 [01:05<00:01, 90.96it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  98%|█████████▊| 4896/5000 [01:06<00:01, 90.79it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  98%|█████████▊| 4896/5000 [01:06<00:01, 90.79it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  99%|█████████▊| 4928/5000 [01:06<00:00, 91.51it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  99%|█████████▊| 4928/5000 [01:06<00:00, 91.51it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455:  99%|█████████▉| 4960/5000 [01:06<00:00, 89.47it/s]    r_training: 0.4082, rewards: 0.1000, total_rewards: 4.5455:  99%|█████████▉| 4960/5000 [01:06<00:00, 89.47it/s]    r_training: 0.4082, rewards: 0.1000, 
total_rewards: 4.5455: 100%|█████████▉| 4992/5000 [01:07<00:00, 90.69it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455: 100%|█████████▉| 4992/5000 [01:07<00:00, 90.69it/s]    r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455: : 5024it [01:07, 90.28it/s]                            r_training: 0.4295, rewards: 0.1000, total_rewards: 4.5455: : 5024it [01:07, 90.28it/s]


.. GENERATED FROM PYTHON SOURCE LINES 707-708

We can now quickly check the CSV files that contain the logged results.

.. GENERATED FROM PYTHON SOURCE LINES 708-737

.. code-block:: Python


    def print_csv_files_in_folder(folder_path):
        """
        Finds all CSV files in a folder (including its subfolders) and prints
        the first 10 lines of each file.

        Args:
            folder_path (str): The path to the folder to search.

        """
        csv_files = []
        output_str = ""
        for dirpath, _, filenames in os.walk(folder_path):
            for file in filenames:
                if file.endswith(".csv"):
                    csv_files.append(os.path.join(dirpath, file))
        for csv_file in csv_files:
            output_str += f"File: {csv_file}\n"
            with open(csv_file, "r") as f:
                for i, line in enumerate(f):
                    if i == 10:
                        break
                    output_str += line.strip() + "\n"
            output_str += "\n"
        print(output_str)


    print_csv_files_in_folder(logger.experiment.log_dir)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    File: /tmp/tmp07bwkid3/dqn_exp_706b0f22-02f5-11ef-931b-0242ac110002/scalars/r_training.csv
    512,0.34751853346824646
    1024,0.40816524624824524
    1536,0.40816524624824524
    2048,0.40213119983673096
    2560,0.42945271730422974
    3072,0.42945271730422974
    3584,0.42945271730422974
    4096,0.42945271730422974
    4608,0.42945271730422974

    File: /tmp/tmp07bwkid3/dqn_exp_706b0f22-02f5-11ef-931b-0242ac110002/scalars/optim_steps.csv
    512,128.0
    1024,256.0
    1536,384.0
    2048,512.0
    2560,640.0
    3072,768.0
    3584,896.0
    4096,1024.0
    4608,1152.0

    File: /tmp/tmp07bwkid3/dqn_exp_706b0f22-02f5-11ef-931b-0242ac110002/scalars/loss.csv
    512,0.19233030080795288
    1024,0.18954430520534515
    1536,0.24853023886680603
    2048,0.3870314061641693
    2560,0.34363630414009094
    3072,0.2559172511100769
    3584,0.2244148701429367
    4096,0.3171423673629761
    4608,0.45604708790779114

    File: /tmp/tmp07bwkid3/dqn_exp_706b0f22-02f5-11ef-931b-0242ac110002/scalars/grad_norm_0.csv
    512,2.343657970428467
    1024,2.708251476287842
    1536,3.2506790161132812
    2048,4.042572498321533
    2560,3.310127019882202
    3072,3.0853092670440674
    3584,2.9565651416778564
    4096,4.037069797515869
    4608,5.269522190093994

    File: /tmp/tmp07bwkid3/dqn_exp_706b0f22-02f5-11ef-931b-0242ac110002/scalars/rewards.csv
    3232,0.10000000894069672

    File: /tmp/tmp07bwkid3/dqn_exp_706b0f22-02f5-11ef-931b-0242ac110002/scalars/total_rewards.csv
    3232,4.545454502105713
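
Each scalar file holds one ``step,value`` pair per line with no header row, as
the output above shows. The following is a minimal sketch, not part of the
original script, of how such a file could be loaded for further analysis using
only the standard library. The helper name ``load_scalar_csv`` and the
temporary demo file are hypothetical and only serve the illustration.

.. code-block:: Python

    import csv
    import os
    import tempfile


    def load_scalar_csv(csv_path):
        """Read a two-column ``step,value`` CSV (no header) into a list of tuples."""
        rows = []
        with open(csv_path, newline="") as f:
            for step, value in csv.reader(f):
                rows.append((int(step), float(value)))
        return rows


    # Self-contained demo on a temporary file that mimics ``optim_steps.csv``.
    with tempfile.TemporaryDirectory() as tmp_dir:
        demo_path = os.path.join(tmp_dir, "optim_steps.csv")
        with open(demo_path, "w") as f:
            f.write("512,128.0\n1024,256.0\n1536,384.0\n")
        demo_rows = load_scalar_csv(demo_path)

    print(demo_rows)

With the real logs, the same helper could be pointed at a file such as
``scalars/loss.csv`` inside ``logger.experiment.log_dir``.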


.. GENERATED FROM PYTHON SOURCE LINES 738-760

Conclusion and possible improvements
------------------------------------

In this tutorial we have learned:

- How to write a Trainer, including building its components and registering
  them in the trainer;
- How to code a DQN algorithm, including how to create a policy that selects
  the action with the highest value with
  :class:`~torchrl.modules.QValueActor`;
- How to build a multiprocessed data collector.

Possible improvements to this tutorial could include:

- Using a prioritized replay buffer, which gives a higher sampling priority
  to the samples whose value estimates are the least accurate.
  Learn more in the
  `replay buffer section <https://pytorch.org/rl/reference/data.html#composable-replay-buffers>`_
  of the documentation.
- Using a distributional loss (see :class:`~torchrl.objectives.DistributionalDQNLoss`
  for more information).
- Using more sophisticated exploration techniques, such as
  :class:`~torchrl.modules.NoisyLinear` layers.
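
To give a rough intuition for the prioritized replay idea listed above, here
is a minimal, library-free sketch of proportional prioritized sampling. It is
not TorchRL's implementation: the function name ``sample_prioritized``, the
exponent ``alpha``, the offset ``eps`` and the toy TD errors are all
assumptions chosen for illustration.

.. code-block:: Python

    import random


    def sample_prioritized(td_errors, batch_size, alpha=0.6, eps=1e-6, rng=None):
        """Sample indices with probability proportional to (|td_error| + eps) ** alpha."""
        rng = rng or random.Random()
        # Larger absolute TD error means a larger sampling weight.
        priorities = [(abs(err) + eps) ** alpha for err in td_errors]
        return rng.choices(range(len(td_errors)), weights=priorities, k=batch_size)


    # Toy buffer of four transitions; the third one has a much larger TD error.
    td_errors = [0.01, 0.02, 5.0, 0.03]
    indices = sample_prioritized(td_errors, batch_size=1000, rng=random.Random(0))
    counts = [indices.count(i) for i in range(len(td_errors))]
    print(counts)

In this toy setting, the transition with the largest error is drawn far more
often than the others. Setting ``alpha=0`` would recover uniform sampling.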


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (2 minutes 38.407 seconds)

**Estimated memory usage:**  731 MB


.. _sphx_glr_download_tutorials_coding_dqn.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: coding_dqn.ipynb <coding_dqn.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: coding_dqn.py <coding_dqn.py>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_