.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "intermediate/reinforcement_ppo.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_intermediate_reinforcement_ppo.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_intermediate_reinforcement_ppo.py:


Reinforcement Learning (PPO) with TorchRL Tutorial
==================================================
**Author**: `Vincent Moens <https://github.com/vmoens>`_

This tutorial demonstrates how to use PyTorch and :py:mod:`torchrl` to train a parametric policy
network to solve the Inverted Pendulum task from the `OpenAI-Gym/Farama-Gymnasium
control library <https://github.com/Farama-Foundation/Gymnasium>`__.

.. figure:: /_static/img/invpendulum.gif
   :alt: Inverted pendulum

   Inverted pendulum

Key learnings:

- How to create an environment in TorchRL, transform its outputs, and collect data from this environment;
- How to make your classes talk to each other using :class:`~tensordict.TensorDict`;
- The basics of building your training loop with TorchRL:

  - How to compute the advantage signal for policy gradient methods;
  - How to create a stochastic policy using a probabilistic neural network;
  - How to create a dynamic replay buffer and sample from it without repetition.

We will cover six crucial components of TorchRL:

* `environments <https://pytorch.org/rl/reference/envs.html>`__
* `transforms <https://pytorch.org/rl/reference/envs.html#transforms>`__
* `models (policy and value function) <https://pytorch.org/rl/reference/modules.html>`__
* `loss modules <https://pytorch.org/rl/reference/objectives.html>`__
* `data collectors <https://pytorch.org/rl/reference/collectors.html>`__
* `replay buffers <https://pytorch.org/rl/reference/data.html#replay-buffers>`__

.. GENERATED FROM PYTHON SOURCE LINES 38-106

If you are running this in Google Colab, make sure you install the following dependencies:

.. code-block:: bash

   !pip3 install torchrl
   !pip3 install gym[mujoco]
   !pip3 install tqdm

Proximal Policy Optimization (PPO) is a policy-gradient algorithm where a
batch of data is being collected and directly consumed to train the policy to maximise
the expected return given some proximality constraints. You can think of it
as a sophisticated version of `REINFORCE <https://link.springer.com/content/pdf/10.1007/BF00992696.pdf>`_,
the foundational policy-optimization algorithm. For more information, see the
`Proximal Policy Optimization Algorithms <https://arxiv.org/abs/1707.06347>`_ paper.

PPO is usually regarded as a fast and efficient method for online, on-policy
reinforcement learning. TorchRL provides a loss module that does all the work
for you, so that you can rely on this implementation and focus on solving your
problem rather than re-inventing the wheel every time you want to train a policy.

For completeness, here is a brief overview of what the loss computes, even though
this is taken care of by our :class:`~torchrl.objectives.ClipPPOLoss` module. The
algorithm works as follows:

1. We sample a batch of data by playing the policy in the environment for a given
   number of steps.
2. We then perform a given number of optimization steps with random sub-samples of
   this batch, using a clipped version of the REINFORCE loss.
3. The clipping puts a pessimistic bound on our loss: lower return estimates are
   favored over higher ones.

The precise formula of the loss is:

.. math::

    L(s,a,\theta_k,\theta) = \min\left(
    \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}  A^{\pi_{\theta_k}}(s,a), \;\;
    g(\epsilon, A^{\pi_{\theta_k}}(s,a))
    \right),

There are two components in that loss: in the first part of the minimum operator,
we simply compute an importance-weighted version of the REINFORCE loss (that is, a
REINFORCE loss that we have corrected for the fact that the current policy
configuration differs from the one that was used for the data collection).
The second part of that minimum operator is a similar loss where we have clipped
the ratios when they exceed or fall below a given pair of thresholds.

This loss ensures that whether the advantage is positive or negative, policy
updates that would produce significant shifts from the previous configuration
are being discouraged.
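
To make the clipping concrete, here is a minimal sketch of the objective for a
single sample, written in plain Python rather than PyTorch. It assumes the
definition of :math:`g` commonly used in the PPO literature, namely
:math:`g(\epsilon, A) = (1 + \epsilon) A` if :math:`A \geq 0` and
:math:`(1 - \epsilon) A` otherwise. The function name ``clipped_surrogate`` is
illustrative and is not part of TorchRL.

.. code-block:: python

    def clipped_surrogate(ratio, advantage, clip_epsilon=0.2):
        """Per-sample PPO objective: min(ratio * A, g(eps, A))."""
        # g(eps, A) caps how much the objective can gain from moving the ratio
        # away from 1 in the direction favoured by the advantage.
        if advantage >= 0:
            bound = (1 + clip_epsilon) * advantage
        else:
            bound = (1 - clip_epsilon) * advantage
        return min(ratio * advantage, bound)


    # Positive advantage: raising the ratio beyond 1 + eps brings no extra gain.
    gain_capped = clipped_surrogate(ratio=1.5, advantage=2.0)
    # Negative advantage: lowering the ratio below 1 - eps brings no extra gain.
    loss_capped = clipped_surrogate(ratio=0.5, advantage=-2.0)
    # Moving in the unfavourable direction is never clipped (pessimistic bound).
    unclipped = clipped_surrogate(ratio=1.5, advantage=-2.0)

In the first two calls the clipped bound is the smaller term, so pushing the
ratio further does not change the objective. In the third call the unclipped
term is returned.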

This tutorial is structured as follows:

1. First, we will define a set of hyperparameters we will be using for training.

2. Next, we will focus on creating our environment, or simulator, using TorchRL's
   wrappers and transforms.

3. Next, we will design the policy network and the value model,
   which is indispensable to the loss function. These modules will be used
   to configure our loss module.

4. Next, we will create the replay buffer and data loader.

5. Finally, we will run our training loop and analyze the results.

Throughout this tutorial, we'll be using the :mod:`tensordict` library.
:class:`~tensordict.TensorDict` is the lingua franca of TorchRL: it helps us abstract
what a module reads and writes and care less about the specific data
description and more about the algorithm itself.


.. GENERATED FROM PYTHON SOURCE LINES 106-128

.. code-block:: default


    import multiprocessing
    from collections import defaultdict

    import matplotlib.pyplot as plt
    import torch
    from tensordict.nn import TensorDictModule
    from tensordict.nn.distributions import NormalParamExtractor
    from torch import nn
    from torchrl.collectors import SyncDataCollector
    from torchrl.data.replay_buffers import ReplayBuffer
    from torchrl.data.replay_buffers.samplers import SamplerWithoutReplacement
    from torchrl.data.replay_buffers.storages import LazyTensorStorage
    from torchrl.envs import (Compose, DoubleToFloat, ObservationNorm, StepCounter,
                              TransformedEnv)
    from torchrl.envs.libs.gym import GymEnv
    from torchrl.envs.utils import check_env_specs, ExplorationType, set_exploration_type
    from torchrl.modules import ProbabilisticActor, TanhNormal, ValueOperator
    from torchrl.objectives import ClipPPOLoss
    from torchrl.objectives.value import GAE
    from tqdm import tqdm


.. GENERATED FROM PYTHON SOURCE LINES 144-155

Define Hyperparameters
----------------------

We set the hyperparameters for our algorithm. Depending on the resources
available, one may choose to execute the policy on GPU or on another
device.
The ``frame_skip`` parameter controls for how many frames a single
action is executed. The rest of the arguments that count frames
must be corrected for this value (since one environment step will
actually return ``frame_skip`` frames).


.. GENERATED FROM PYTHON SOURCE LINES 155-166

.. code-block:: default


    is_fork = multiprocessing.get_start_method() == "fork"
    device = (
        torch.device(0)
        if torch.cuda.is_available() and not is_fork
        else torch.device("cpu")
    )
    num_cells = 256  # number of cells in each layer i.e. output dim.
    lr = 3e-4
    max_grad_norm = 1.0


.. GENERATED FROM PYTHON SOURCE LINES 167-177

Data collection parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~

When collecting data, we will be able to choose how big each batch will be
by defining a ``frames_per_batch`` parameter. We will also define how many
frames (that is, the number of interactions with the simulator) we will allow ourselves to
use. In general, the goal of an RL algorithm is to learn to solve the task
as fast as it can in terms of environment interactions: the lower the ``total_frames``
the better.


.. GENERATED FROM PYTHON SOURCE LINES 177-181

.. code-block:: default

    frames_per_batch = 1000
    # For a complete training, bring the number of frames up to 1M
    total_frames = 50_000


.. GENERATED FROM PYTHON SOURCE LINES 182-193

PPO parameters
~~~~~~~~~~~~~~

At each data collection (or batch collection) we will run the optimization
over a certain number of *epochs*, each time consuming the entire data we just
acquired in a nested training loop. Note that ``sub_batch_size`` is different from
``frames_per_batch`` above: recall that we are working with a "batch of data"
coming from our collector, whose size is defined by ``frames_per_batch``, and that
we will further split into smaller sub-batches during the inner training loop.
The size of these sub-batches is controlled by ``sub_batch_size``.


.. GENERATED FROM PYTHON SOURCE LINES 193-202

.. code-block:: default

    sub_batch_size = 64  # cardinality of the sub-samples gathered from the current data in the inner loop
    num_epochs = 10  # number of passes (epochs) over each collected batch
    clip_epsilon = (
        0.2  # clip value for PPO loss: see the equation in the intro for more context.
    )
    gamma = 0.99
    lmbda = 0.95
    entropy_eps = 1e-4
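
As a quick sanity check, the arithmetic below repeats the values defined above
and counts how many optimization steps they imply. It assumes that the inner
loop draws ``frames_per_batch // sub_batch_size`` sub-batches per epoch, so that
the remainder of each batch is left unused in that epoch.

.. code-block:: python

    frames_per_batch = 1000
    total_frames = 50_000
    sub_batch_size = 64
    num_epochs = 10

    # Number of batches the collector yields before it is exhausted.
    num_batches = total_frames // frames_per_batch
    # Sub-batches drawn per epoch (assumption: floor division, remainder unused).
    sub_batches_per_epoch = frames_per_batch // sub_batch_size
    # Optimizer steps performed on each collected batch, and in total.
    steps_per_batch = num_epochs * sub_batches_per_epoch
    total_steps = num_batches * steps_per_batch
    print(num_batches, sub_batches_per_epoch, steps_per_batch, total_steps)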


.. GENERATED FROM PYTHON SOURCE LINES 203-214

Define an environment
---------------------

In RL, an *environment* is usually the way we refer to a simulator or a
control system. Various libraries provide simulation environments for reinforcement
learning, including Gymnasium (previously OpenAI Gym), DeepMind control suite, and
many others.
As a general library, TorchRL's goal is to provide an interchangeable interface
to a large panel of RL simulators, allowing you to easily swap one environment
with another. For example, creating a wrapped gym environment takes a single line:


.. GENERATED FROM PYTHON SOURCE LINES 214-217

.. code-block:: default


    base_env = GymEnv("InvertedDoublePendulum-v4", device=device)


.. GENERATED FROM PYTHON SOURCE LINES 218-268

There are a few things to notice in this code: first, we created
the environment by calling the ``GymEnv`` wrapper. If extra keyword arguments
are passed, they will be transmitted to the ``gym.make`` method, hence covering
the most common environment construction commands.
Alternatively, one could also directly create a gym environment using ``gym.make(env_name, **kwargs)``
and wrap it in a ``GymWrapper`` class.

Also note the ``device`` argument: for gym, this only controls the device where
input action and observed states will be stored, but the execution will always
be done on CPU. The reason for this is simply that gym does not support on-device
execution, unless specified otherwise. For other libraries, we have control over
the execution device and, as much as we can, we try to stay consistent in terms of
storing and execution backends.

Transforms
~~~~~~~~~~

We will append some transforms to our environments to prepare the data for
the policy. In Gym, this is usually achieved via wrappers. TorchRL takes a different
approach, more similar to other PyTorch domain libraries, through the use of transforms.
To add transforms to an environment, one should simply wrap it in a :class:`~torchrl.envs.transforms.TransformedEnv`
instance and append the sequence of transforms to it. The transformed environment will inherit
the device and meta-data of the wrapped environment, and transform these depending on the sequence
of transforms it contains.

Normalization
~~~~~~~~~~~~~

The first transform to add is a normalization transform.
As a rule of thumb, it is preferable to have data that loosely
match a unit Gaussian distribution: to obtain this, we will
run a certain number of random steps in the environment and compute
the summary statistics of these observations.
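
The idea can be sketched in a few lines of plain Python: gather observations,
compute a per-dimension mean and standard deviation, and standardize
observations with them. This only illustrates the principle. The helper names
are hypothetical, the random numbers stand in for environment observations, and
the exact statistics computed by TorchRL may differ.

.. code-block:: python

    import random
    import statistics

    random.seed(0)
    # Stand-in for observations gathered with random actions:
    # 1000 samples of 3 features with very different locations and scales.
    observations = [
        [random.gauss(5.0, 2.0), random.gauss(-1.0, 0.5), random.gauss(0.0, 10.0)]
        for _ in range(1000)
    ]

    # Per-dimension summary statistics (one location and one scale per feature).
    columns = list(zip(*observations))
    loc = [statistics.fmean(col) for col in columns]
    scale = [statistics.pstdev(col) for col in columns]


    def normalize(obs):
        """Standardize one observation with the precomputed statistics."""
        return [(x - m) / s for x, m, s in zip(obs, loc, scale)]


    normalized = [normalize(obs) for obs in observations]

After this step every feature has approximately zero mean and unit standard
deviation over the gathered samples, whatever its original range.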

We'll append two other transforms: the :class:`~torchrl.envs.transforms.DoubleToFloat` transform will
convert double entries to single-precision numbers, ready to be read by the
policy. The :class:`~torchrl.envs.transforms.StepCounter` transform will be used to count the steps before
the environment is terminated. We will use this measure as a supplementary measure
of performance.

As we will see later, many of TorchRL's classes rely on :class:`~tensordict.TensorDict`
to communicate. You could think of it as a Python dictionary with some extra
tensor features. In practice, this means that many modules we will be working
with need to be told what key to read (``in_keys``) and what key to write
(``out_keys``) in the ``tensordict`` they will receive. Usually, if ``out_keys``
is omitted, it is assumed that the ``in_keys`` entries will be updated
in-place. For our transforms, the only entry we are interested in is referred
to as ``"observation"`` and our transform layers will be told to modify this
entry and this entry only:


.. GENERATED FROM PYTHON SOURCE LINES 268-279

.. code-block:: default


    env = TransformedEnv(
        base_env,
        Compose(
            # normalize observations
            ObservationNorm(in_keys=["observation"]),
            DoubleToFloat(),
            StepCounter(),
        ),
    )


.. GENERATED FROM PYTHON SOURCE LINES 280-284

As you may have noticed, we have created a normalization layer but we did not
set its normalization parameters. To do this, :class:`~torchrl.envs.transforms.ObservationNorm` can
automatically gather the summary statistics of our environment:


.. GENERATED FROM PYTHON SOURCE LINES 284-286

.. code-block:: default

    env.transform[0].init_stats(num_iter=1000, reduce_dim=0, cat_dim=0)


.. GENERATED FROM PYTHON SOURCE LINES 287-292

The :class:`~torchrl.envs.transforms.ObservationNorm` transform has now been populated with a
location and a scale that will be used to normalize the data.

Let us do a little sanity check for the shape of our summary stats:


.. GENERATED FROM PYTHON SOURCE LINES 292-294

.. code-block:: default

    print("normalization constant shape:", env.transform[0].loc.shape)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    normalization constant shape: torch.Size([11])


.. GENERATED FROM PYTHON SOURCE LINES 295-314

An environment is not only defined by its simulator and transforms, but also
by a series of metadata that describe what can be expected during its
execution.
For efficiency purposes, TorchRL is quite stringent when it comes to
environment specs, but you can easily check that your environment specs are
adequate.
In our example, the :class:`~torchrl.envs.libs.gym.GymWrapper` and
:class:`~torchrl.envs.libs.gym.GymEnv` that inherits
from it already take care of setting the proper specs for your environment so
you should not have to care about this.

Nevertheless, let's see a concrete example using our transformed
environment by looking at its specs.
There are three specs to look at: ``observation_spec`` which defines what
is to be expected when executing an action in the environment,
``reward_spec`` which indicates the reward domain and finally the
``input_spec`` (which contains the ``action_spec``) and which represents
everything an environment requires to execute a single step.


.. GENERATED FROM PYTHON SOURCE LINES 314-319

.. code-block:: default

    print("observation_spec:", env.observation_spec)
    print("reward_spec:", env.reward_spec)
    print("input_spec:", env.input_spec)
    print("action_spec (as defined by input_spec):", env.action_spec)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    observation_spec: CompositeSpec(
        observation: UnboundedContinuousTensorSpec(
            shape=torch.Size([11]),
            space=None,
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        step_count: BoundedTensorSpec(
            shape=torch.Size([1]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.int64, contiguous=True),
                high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.int64, contiguous=True)),
            device=cpu,
            dtype=torch.int64,
            domain=continuous), device=cpu, shape=torch.Size([]))
    reward_spec: UnboundedContinuousTensorSpec(
        shape=torch.Size([1]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous)
    input_spec: CompositeSpec(
        full_state_spec: CompositeSpec(
            step_count: BoundedTensorSpec(
                shape=torch.Size([1]),
                space=ContinuousBox(
                    low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.int64, contiguous=True),
                    high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.int64, contiguous=True)),
                device=cpu,
                dtype=torch.int64,
                domain=continuous), device=cpu, shape=torch.Size([])),
        full_action_spec: CompositeSpec(
            action: BoundedTensorSpec(
                shape=torch.Size([1]),
                space=ContinuousBox(
                    low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True),
                    high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True)),
                device=cpu,
                dtype=torch.float32,
                domain=continuous), device=cpu, shape=torch.Size([])), device=cpu, shape=torch.Size([]))
    action_spec (as defined by input_spec): BoundedTensorSpec(
        shape=torch.Size([1]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous)


.. GENERATED FROM PYTHON SOURCE LINES 320-323

The :func:`check_env_specs` function runs a small rollout and compares its output against the environment
specs. If no error is raised, we can be confident that the specs are properly defined:


.. GENERATED FROM PYTHON SOURCE LINES 323-325

.. code-block:: default

    check_env_specs(env)


.. GENERATED FROM PYTHON SOURCE LINES 326-340

For fun, let's see what a simple random rollout looks like. You can
call ``env.rollout(n_steps)`` and get an overview of what the environment inputs
and outputs look like. Actions will automatically be drawn from the action spec
domain, so you don't need to care about designing a random sampler.

Typically, at each step, an RL environment receives an
action as input, and outputs an observation, a reward and a done state. The
observation may be composite, meaning that it could be composed of more than one
tensor. This is not a problem for TorchRL, since the whole set of observations
is automatically packed in the output :class:`~tensordict.TensorDict`. After executing a rollout
(that is, a sequence of environment steps and random action generations) over a given
number of steps, we will retrieve a :class:`~tensordict.TensorDict` instance with a shape
that matches this trajectory length:


.. GENERATED FROM PYTHON SOURCE LINES 340-344

.. code-block:: default

    rollout = env.rollout(3)
    print("rollout of three steps:", rollout)
    print("Shape of the rollout TensorDict:", rollout.batch_size)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    rollout of three steps: TensorDict(
        fields={
            action: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
            done: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
            next: TensorDict(
                fields={
                    done: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                    observation: Tensor(shape=torch.Size([3, 11]), device=cpu, dtype=torch.float32, is_shared=False),
                    reward: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                    step_count: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.int64, is_shared=False),
                    terminated: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                    truncated: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
                batch_size=torch.Size([3]),
                device=cpu,
                is_shared=False),
            observation: Tensor(shape=torch.Size([3, 11]), device=cpu, dtype=torch.float32, is_shared=False),
            step_count: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.int64, is_shared=False),
            terminated: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
            truncated: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
        batch_size=torch.Size([3]),
        device=cpu,
        is_shared=False)
    Shape of the rollout TensorDict: torch.Size([3])


.. GENERATED FROM PYTHON SOURCE LINES 345-378

Our rollout data has a shape of ``torch.Size([3])``, which matches the number of steps
we ran it for. The ``"next"`` entry points to the data coming after the current step.
In most cases, the ``"next"`` data at time ``t`` matches the data at ``t+1``, but this
may not be the case if we are using some specific transformations (for example, multi-step).
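
The relationship between the root entries and the ``"next"`` entries can be
illustrated with plain Python lists standing in for tensors. The dictionary
layout mimics the rollout printed above, but the values are made up for the
example.

.. code-block:: python

    # Observations visited along a 3-step trajectory, plus the one reached
    # after the last step.
    visited = [0.0, 0.1, 0.3, 0.6]

    rollout = {
        "observation": visited[:-1],  # o_0, o_1, o_2
        "next": {"observation": visited[1:]},  # o_1, o_2, o_3
    }

    # Without multi-step transforms, "next" at step t is the root entry at t + 1.
    shifted_match = rollout["next"]["observation"][:-1] == rollout["observation"][1:]
    print(shifted_match)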

Policy
------

PPO utilizes a stochastic policy to handle exploration. This means that our
neural network will have to output the parameters of a distribution, rather
than a single value corresponding to the action taken.

As the action space is continuous, we use a Tanh-Normal distribution to respect the
action space boundaries. TorchRL provides such a distribution, and the only
thing we need to care about is to build a neural network that outputs the
right number of parameters for the policy to work with (a location, or mean,
and a scale):

.. math::

    f_{\theta}(\text{observation}) = \mu_{\theta}(\text{observation}), \sigma^{+}_{\theta}(\text{observation})

The only extra difficulty here is to split our output in two
equal parts and map the second to a strictly positive space.

We design the policy in three steps:

1. Define a neural network ``D_obs`` -> ``2 * D_action``. Indeed, our ``loc`` (mu) and ``scale`` (sigma) both have dimension ``D_action``.

2. Append a :class:`~tensordict.nn.distributions.NormalParamExtractor` to extract a location and a scale (that is, it splits the input in two equal parts and applies a positive transformation to the scale parameter).

3. Create a probabilistic :class:`~tensordict.nn.TensorDictModule` that can generate this distribution and sample from it.
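
Steps 2 and 3 can be sketched in plain Python for a one-dimensional action. The
sketch assumes a softplus mapping for the scale and a simple affine rescaling of
the ``tanh`` output to the action bounds. These are common choices made for
illustration, and the transformations actually applied by
:class:`~tensordict.nn.distributions.NormalParamExtractor` and
:class:`~torchrl.modules.TanhNormal` may differ. The function names are
hypothetical.

.. code-block:: python

    import math
    import random


    def extract_normal_params(net_output):
        """Split the network output in two halves: loc and positive scale."""
        half = len(net_output) // 2
        loc = net_output[:half]
        # softplus(x) = log(1 + exp(x)) is positive for every real x.
        scale = [math.log1p(math.exp(x)) for x in net_output[half:]]
        return loc, scale


    def sample_tanh_normal(loc, scale, low, high, rng):
        """Draw a Gaussian sample, squash it with tanh, rescale to [low, high]."""
        action = []
        for mu, sigma in zip(loc, scale):
            squashed = math.tanh(rng.gauss(mu, sigma))  # value in [-1, 1]
            action.append(low + 0.5 * (squashed + 1.0) * (high - low))
        return action


    rng = random.Random(0)
    # D_action = 1, so the network produces 2 outputs.
    loc, scale = extract_normal_params([0.3, -2.0])
    action = sample_tanh_normal(loc, scale, low=-1.0, high=1.0, rng=rng)

Squashing with ``tanh`` guarantees that the sampled action stays within the
bounds, which a plain Gaussian sample would not.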


.. GENERATED FROM PYTHON SOURCE LINES 378-390

.. code-block:: default


    actor_net = nn.Sequential(
        nn.LazyLinear(num_cells, device=device),
        nn.Tanh(),
        nn.LazyLinear(num_cells, device=device),
        nn.Tanh(),
        nn.LazyLinear(num_cells, device=device),
        nn.Tanh(),
        nn.LazyLinear(2 * env.action_spec.shape[-1], device=device),
        NormalParamExtractor(),
    )


.. GENERATED FROM PYTHON SOURCE LINES 391-396

To enable the policy to "talk" with the environment through the ``tensordict``
data carrier, we wrap the ``nn.Module`` in a :class:`~tensordict.nn.TensorDictModule`. This
class will simply read the ``in_keys`` it is provided with and write the
outputs in-place at the registered ``out_keys``.


.. GENERATED FROM PYTHON SOURCE LINES 396-400

.. code-block:: default

    policy_module = TensorDictModule(
        actor_net, in_keys=["observation"], out_keys=["loc", "scale"]
    )
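
Conceptually, the wrapper does little more than the following plain-Python
sketch, in which a dictionary stands in for the ``tensordict``. ``KeyedModule``
and ``toy_policy`` are hypothetical names written for this illustration. This is
not the actual implementation of :class:`~tensordict.nn.TensorDictModule`.

.. code-block:: python

    class KeyedModule:
        """Toy wrapper: read in_keys from a dict, write results to out_keys."""

        def __init__(self, fn, in_keys, out_keys):
            self.fn = fn
            self.in_keys = in_keys
            self.out_keys = out_keys

        def __call__(self, data):
            outputs = self.fn(*(data[key] for key in self.in_keys))
            if not isinstance(outputs, tuple):
                outputs = (outputs,)
            # Write in place so that downstream modules can read the new entries.
            for key, value in zip(self.out_keys, outputs):
                data[key] = value
            return data


    def toy_policy(observation):
        """Stand-in for the actor network: returns a location and a scale."""
        return sum(observation) / len(observation), 1.0


    module = KeyedModule(toy_policy, in_keys=["observation"], out_keys=["loc", "scale"])
    data = module({"observation": [0.5, 1.5], "step_count": 0})
    print(sorted(data))

Entries that the module does not list in its keys, such as ``"step_count"``
here, are carried along untouched.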


.. GENERATED FROM PYTHON SOURCE LINES 401-416

We now need to build a distribution out of the location and scale of our
normal distribution. To do so, we instruct the
:class:`~torchrl.modules.tensordict_module.ProbabilisticActor`
class to build a :class:`~torchrl.modules.TanhNormal` out of the location and scale
parameters. We also provide the minimum and maximum values of this
distribution, which we gather from the environment specs.

The name of the ``in_keys`` (and hence the name of the ``out_keys`` from
the :class:`~tensordict.nn.TensorDictModule` above) cannot be set to any value one may
like, as the :class:`~torchrl.modules.TanhNormal` distribution constructor will expect the
``loc`` and ``scale`` keyword arguments. That being said,
:class:`~torchrl.modules.tensordict_module.ProbabilisticActor` also accepts
``Dict[str, str]`` typed ``in_keys`` where the key-value pair indicates
what ``in_key`` string should be used for every keyword argument that is to be used.


.. GENERATED FROM PYTHON SOURCE LINES 416-429

.. code-block:: default

    policy_module = ProbabilisticActor(
        module=policy_module,
        spec=env.action_spec,
        in_keys=["loc", "scale"],
        distribution_class=TanhNormal,
        distribution_kwargs={
            "min": env.action_spec.space.low,
            "max": env.action_spec.space.high,
        },
        return_log_prob=True,
        # we'll need the log-prob stored at collection time for the denominator
        # of the importance weights
    )


.. GENERATED FROM PYTHON SOURCE LINES 430-441

Value network
-------------

The value network is a crucial component of the PPO algorithm, even though it
won't be used at inference time. This module will read the observations and
return an estimation of the discounted return for the following trajectory.
This allows us to amortize learning by relying on a utility estimate
that is learned on-the-fly during training. Our value network shares the same
structure as the policy, but for simplicity we assign it its own set of
parameters.


.. GENERATED FROM PYTHON SOURCE LINES 441-456

.. code-block:: default

    value_net = nn.Sequential(
        nn.LazyLinear(num_cells, device=device),
        nn.Tanh(),
        nn.LazyLinear(num_cells, device=device),
        nn.Tanh(),
        nn.LazyLinear(num_cells, device=device),
        nn.Tanh(),
        nn.LazyLinear(1, device=device),
    )

    value_module = ValueOperator(
        module=value_net,
        in_keys=["observation"],
    )


.. GENERATED FROM PYTHON SOURCE LINES 457-462

Let's try our policy and value modules. As we said earlier, the usage of
:class:`~tensordict.nn.TensorDictModule` makes it possible to directly read the output
of the environment to run these modules, as they know what information to read
and where to write it:


.. GENERATED FROM PYTHON SOURCE LINES 462-465

.. code-block:: default

    print("Running policy:", policy_module(env.reset()))
    print("Running value:", value_module(env.reset()))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Running policy: TensorDict(
        fields={
            action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
            done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
            loc: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
            observation: Tensor(shape=torch.Size([11]), device=cpu, dtype=torch.float32, is_shared=False),
            sample_log_prob: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
            scale: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
            step_count: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.int64, is_shared=False),
            terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
            truncated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False)},
        batch_size=torch.Size([]),
        device=cpu,
        is_shared=False)
    Running value: TensorDict(
        fields={
            done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
            observation: Tensor(shape=torch.Size([11]), device=cpu, dtype=torch.float32, is_shared=False),
            state_value: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
            step_count: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.int64, is_shared=False),
            terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
            truncated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False)},
        batch_size=torch.Size([]),
        device=cpu,
        is_shared=False)


.. GENERATED FROM PYTHON SOURCE LINES 466-496

Data collector
--------------

TorchRL provides a set of `DataCollector classes <https://pytorch.org/rl/reference/collectors.html>`__.
Briefly, these classes execute three operations: reset an environment,
compute an action given the latest observation, execute a step in the environment,
and repeat the last two steps until the environment signals a stop (or reaches
a done state).

They allow you to control how many frames to collect at each iteration
(through the ``frames_per_batch`` parameter),
when to reset the environment (through the ``max_frames_per_traj`` argument),
on which ``device`` the policy should be executed, etc. They are also
designed to work efficiently with batched and multiprocessed environments.

The simplest data collector is the :class:`~torchrl.collectors.collectors.SyncDataCollector`:
it is an iterator that you can use to get batches of data of a given length, and
that will stop once a total number of frames (``total_frames``) have been
collected.
Other data collectors (:class:`~torchrl.collectors.collectors.MultiSyncDataCollector` and
:class:`~torchrl.collectors.collectors.MultiaSyncDataCollector`) will execute
the same operations in a synchronous or asynchronous manner, respectively, over a
set of multiprocessed workers.
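
The collection loop described above can be sketched as a plain-Python generator.
``ToyEnv`` and ``collect`` are hypothetical stand-ins written for this
illustration, and the sketch ignores devices, batching and multiprocessing. It
is not how :class:`~torchrl.collectors.collectors.SyncDataCollector` is
implemented.

.. code-block:: python

    import random


    class ToyEnv:
        """Minimal environment: an episode ends after max_steps steps."""

        def __init__(self, max_steps=7):
            self.max_steps = max_steps
            self.t = 0

        def reset(self):
            self.t = 0
            return float(self.t)

        def step(self, action):
            self.t += 1
            done = self.t >= self.max_steps
            return float(self.t), 1.0, done  # observation, reward, done


    def collect(env, policy, frames_per_batch, total_frames):
        """Yield lists of transitions until total_frames have been collected."""
        obs, collected = env.reset(), 0
        while collected < total_frames:
            batch = []
            while len(batch) < frames_per_batch:
                action = policy(obs)
                next_obs, reward, done = env.step(action)
                batch.append((obs, action, reward, next_obs, done))
                # Reset when the episode is over, otherwise carry on.
                obs = env.reset() if done else next_obs
            collected += len(batch)
            yield batch


    rng = random.Random(0)


    def random_policy(obs):
        """Stand-in policy: ignores the observation and acts at random."""
        return rng.uniform(-1.0, 1.0)


    batches = list(collect(ToyEnv(), random_policy, frames_per_batch=10, total_frames=50))

Each yielded batch has a fixed size regardless of where episodes end, so a
batch may contain the end of one trajectory and the start of the next.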

As for the policy and environment before, the data collector will return
:class:`~tensordict.TensorDict` instances with a total number of elements that will
match ``frames_per_batch``. Using :class:`~tensordict.TensorDict` to pass data to the
training loop allows you to write data loading pipelines
that are 100% oblivious to the actual specificities of the rollout content.


.. GENERATED FROM PYTHON SOURCE LINES 496-505

.. code-block:: default

    collector = SyncDataCollector(
        env,
        policy_module,
        frames_per_batch=frames_per_batch,
        total_frames=total_frames,
        split_trajs=False,
        device=device,
    )


.. GENERATED FROM PYTHON SOURCE LINES 506-524

Replay buffer
-------------

Replay buffers are a common building piece of off-policy RL algorithms.
In on-policy contexts, a replay buffer is refilled every time a batch of
data is collected, and its data is repeatedly consumed for a certain number
of epochs.

TorchRL's replay buffers are built using a common container
:class:`~torchrl.data.ReplayBuffer` which takes as argument the components
of the buffer: a storage, a writer, a sampler and possibly some transforms.
Only the storage (which indicates the replay buffer capacity) is mandatory.
We also specify a sampler without repetition to avoid sampling multiple times
the same item in one epoch.
Using a replay buffer for PPO is not mandatory and we could simply
sample the sub-batches from the collected batch, but using these classes
makes it easy for us to build the inner training loop in a reproducible way.
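
Sampling without repetition amounts to shuffling the indices of the stored batch
and walking through them in chunks, as in the plain-Python sketch below. The
function name is hypothetical. The sketch drops an incomplete last chunk, which
is an assumption made for simplicity rather than a statement about the
sampler's default behaviour.

.. code-block:: python

    import random


    def iterate_sub_batches(data, sub_batch_size, rng):
        """Yield disjoint sub-batches covering data at most once per epoch."""
        indices = list(range(len(data)))
        rng.shuffle(indices)
        # Walk through the shuffled indices; an incomplete final chunk is dropped.
        for start in range(0, len(indices) - sub_batch_size + 1, sub_batch_size):
            yield [data[i] for i in indices[start : start + sub_batch_size]]


    rng = random.Random(0)
    frames = list(range(1000))  # stand-in for one collected batch
    epoch = list(iterate_sub_batches(frames, sub_batch_size=64, rng=rng))
    seen = [item for sub_batch in epoch for item in sub_batch]

Within one epoch no frame appears twice, unlike sampling with replacement,
where some frames would be drawn repeatedly and others not at all.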


.. GENERATED FROM PYTHON SOURCE LINES 524-530

.. code-block:: default


    replay_buffer = ReplayBuffer(
        storage=LazyTensorStorage(max_size=frames_per_batch),
        sampler=SamplerWithoutReplacement(),
    )


.. GENERATED FROM PYTHON SOURCE LINES 531-552

Loss function
-------------

The PPO loss can be directly imported from TorchRL for convenience using the
:class:`~torchrl.objectives.ClipPPOLoss` class. This is the easiest way of utilizing PPO:
it hides away the mathematical operations of PPO and the control flow that
goes with it.

PPO requires an estimate of the advantage. In short, the advantage measures how
much better the return obtained after taking an action is than the value
expected in that state; generalized advantage estimation (GAE) computes it
while controlling the bias / variance tradeoff.
To compute the advantage, one just needs to (1) build the advantage module, which
uses our value operator, and (2) pass each batch of data through it before each
epoch.
The GAE module will update the input ``tensordict`` with new ``"advantage"`` and
``"value_target"`` entries.
The ``"value_target"`` is a gradient-free tensor holding the empirical
value that the value network should predict for the input observation.
Both of these will be used by :class:`~torchrl.objectives.ClipPPOLoss` to
return the policy and value losses.
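
The recursion behind generalized advantage estimation can be sketched on a
single short trajectory with plain Python lists. This is a simplified
illustration under stated assumptions, not the ``GAE`` module itself: the
function name is hypothetical, ``gamma=0.99`` and ``lmbda=0.95`` are assumed
values, the value target is taken as advantage plus value (a common
convention), and the standardization enabled by ``average_gae=True`` is
omitted.

```python
def gae(rewards, values, next_values, dones, gamma=0.99, lmbda=0.95):
    """Return (advantages, value_targets) for one trajectory, as plain lists."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk the trajectory backwards so each step can reuse the next one.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        # Exponentially weighted sum of the current and future TD errors.
        running = delta + gamma * lmbda * not_done * running
        advantages[t] = running
    value_targets = [a + v for a, v in zip(advantages, values)]
    return advantages, value_targets


advantages, value_targets = gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.5, 0.5, 0.5],
    next_values=[0.5, 0.5, 0.0],
    dones=[False, False, True],
)
```

Setting ``lmbda`` closer to 0 relies more on the value network (lower
variance, more bias), while values closer to 1 rely more on the observed
rewards.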


.. GENERATED FROM PYTHON SOURCE LINES 552-573

.. code-block:: default


    advantage_module = GAE(
        gamma=gamma, lmbda=lmbda, value_network=value_module, average_gae=True
    )

    loss_module = ClipPPOLoss(
        actor_network=policy_module,
        critic_network=value_module,
        clip_epsilon=clip_epsilon,
        entropy_bonus=bool(entropy_eps),
        entropy_coef=entropy_eps,
        # weight of the critic loss and the distance used to compute it
        critic_coef=1.0,
        loss_critic_type="smooth_l1",
    )

    optim = torch.optim.Adam(loss_module.parameters(), lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optim, total_frames // frames_per_batch, 0.0
    )
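
To see what the scheduler does to the learning rate, the closed form of cosine
annealing can be evaluated in plain Python. This is a sketch rather than the
scheduler's actual code path: the function name is hypothetical, and
``base_lr = 3e-4`` is an assumption chosen to be consistent with the
``lr policy:  0.0003`` printed at the start of training.

```python
import math


def cosine_annealed_lr(step, base_lr, t_max, eta_min=0.0):
    """Closed-form cosine annealing from ``base_lr`` down to ``eta_min``."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max)) / 2


base_lr = 3e-4            # assumed initial learning rate
t_max = 50_000 // 1_000   # total_frames // frames_per_batch
schedule = [cosine_annealed_lr(step, base_lr, t_max) for step in range(t_max + 1)]
```

The rate starts at ``base_lr``, passes through half of it at the midpoint, and
approaches zero as the last batch is reached. Since the progress bar prints
only four decimals, the late values are displayed as ``0.0000``.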


.. GENERATED FROM PYTHON SOURCE LINES 574-592

Training loop
-------------
We now have all the pieces needed to code our training loop.
The steps include:

* Collect data

  * Compute advantage

    * Loop over the collected data to compute loss values
    * Back propagate
    * Optimize
    * Repeat

  * Repeat

* Repeat
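
The three nested "Repeat" levels determine how many optimizer steps a run
performs. The arithmetic below is a small sketch of that count:
``total_frames`` and ``frames_per_batch`` are consistent with the progress bar
of this tutorial's run, whereas ``num_epochs = 10`` and ``sub_batch_size = 64``
are assumed values used for illustration.

```python
total_frames = 50_000      # total budget shown by the progress bar
frames_per_batch = 1_000   # frames gathered at each collection step
num_epochs = 10            # assumed number of passes over each batch
sub_batch_size = 64        # assumed size of each sub-batch

num_batches = total_frames // frames_per_batch            # outer loop
updates_per_epoch = frames_per_batch // sub_batch_size    # innermost loop
total_updates = num_batches * num_epochs * updates_per_epoch
```

Under these assumptions, each collected batch feeds 150 gradient updates.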


.. GENERATED FROM PYTHON SOURCE LINES 592-662

.. code-block:: default


    logs = defaultdict(list)
    pbar = tqdm(total=total_frames)
    eval_str = ""

    # We iterate over the collector until it reaches the total number of frames it was
    # designed to collect:
    for i, tensordict_data in enumerate(collector):
        # we now have a batch of data to work with. Let's learn something from it.
        for _ in range(num_epochs):
            # We'll need an "advantage" signal to make PPO work.
            # We re-compute it at each epoch as its value depends on the value
            # network which is updated in the inner loop.
            advantage_module(tensordict_data)
            data_view = tensordict_data.reshape(-1)
            replay_buffer.extend(data_view.cpu())
            for _ in range(frames_per_batch // sub_batch_size):
                subdata = replay_buffer.sample(sub_batch_size)
                loss_vals = loss_module(subdata.to(device))
                loss_value = (
                    loss_vals["loss_objective"]
                    + loss_vals["loss_critic"]
                    + loss_vals["loss_entropy"]
                )

                # Optimization: backward, grad clipping and optimization step
                loss_value.backward()
                # this is not strictly mandatory but it's good practice to keep
                # your gradient norm bounded
                torch.nn.utils.clip_grad_norm_(loss_module.parameters(), max_grad_norm)
                optim.step()
                optim.zero_grad()

        logs["reward"].append(tensordict_data["next", "reward"].mean().item())
        pbar.update(tensordict_data.numel())
        cum_reward_str = (
            f"average reward={logs['reward'][-1]: 4.4f} (init={logs['reward'][0]: 4.4f})"
        )
        logs["step_count"].append(tensordict_data["step_count"].max().item())
        stepcount_str = f"step count (max): {logs['step_count'][-1]}"
        logs["lr"].append(optim.param_groups[0]["lr"])
        lr_str = f"lr policy: {logs['lr'][-1]: 4.4f}"
        if i % 10 == 0:
            # We evaluate the policy once every 10 batches of data.
            # Evaluation is rather simple: execute the policy without exploration
            # (take the expected value of the action distribution) for a given
            # number of steps (1000, which is our ``env`` horizon).
            # The ``rollout`` method of the ``env`` can take a policy as argument:
            # it will then execute this policy at each step.
            with set_exploration_type(ExplorationType.MEAN), torch.no_grad():
                # execute a rollout with the trained policy
                eval_rollout = env.rollout(1000, policy_module)
                logs["eval reward"].append(eval_rollout["next", "reward"].mean().item())
                logs["eval reward (sum)"].append(
                    eval_rollout["next", "reward"].sum().item()
                )
                logs["eval step_count"].append(eval_rollout["step_count"].max().item())
                eval_str = (
                    f"eval cumulative reward: {logs['eval reward (sum)'][-1]: 4.4f} "
                    f"(init: {logs['eval reward (sum)'][0]: 4.4f}), "
                    f"eval step-count: {logs['eval step_count'][-1]}"
                )
                del eval_rollout
        pbar.set_description(", ".join([eval_str, cum_reward_str, stepcount_str, lr_str]))

        # We're also using a learning rate scheduler. Like the gradient clipping,
        # this is a nice-to-have but is not required for PPO to work.
        scheduler.step()


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


      0%|          | 0/50000 [00:00<?, ?it/s]
      2%|2         | 1000/50000 [00:04<04:00, 203.64it/s]
    eval cumulative reward:  110.9623 (init:  110.9623), eval step-count: 11, average reward= 9.0844 (init= 9.0844), step count (max): 10, lr policy:  0.0003:   2%|2         | 1000/50000 [00:04<04:00, 203.64it/s]
    eval cumulative reward:  110.9623 (init:  110.9623), eval step-count: 11, average reward= 9.0844 (init= 9.0844), step count (max): 10, lr policy:  0.0003:   4%|4         | 2000/50000 [00:09<03:55, 203.54it/s]
    eval cumulative reward:  110.9623 (init:  110.9623), eval step-count: 11, average reward= 9.1006 (init= 9.0844), step count (max): 12, lr policy:  0.0003:   4%|4         | 2000/50000 [00:09<03:55, 203.54it/s]
    eval cumulative reward:  110.9623 (init:  110.9623), eval step-count: 11, average reward= 9.1006 (init= 9.0844), step count (max): 12, lr policy:  0.0003:   6%|6         | 3000/50000 [00:14<03:49, 204.45it/s]
    eval cumulative reward:  110.9623 (init:  110.9623), eval step-count: 11, average reward= 9.1527 (init= 9.0844), step count (max): 22, lr policy:  0.0003:   6%|6         | 3000/50000 [00:14<03:49, 204.45it/s]
    eval cumulative reward:  110.9623 (init:  110.9623), eval step-count: 11, average reward= 9.1527 (init= 9.0844), step count (max): 22, lr policy:  0.0003:   8%|8         | 4000/50000 [00:19<03:44, 205.20it/s]
    eval cumulative reward:  110.9623 (init:  110.9623), eval step-count: 11, average reward= 9.1871 (init= 9.0844), step count (max): 21, lr policy:  0.0003:   8%|8         | 4000/50000 [00:19<03:44, 205.20it/s]
    eval cumulative reward:  110.9623 (init:  110.9623), eval step-count: 11, average reward= 9.1871 (init= 9.0844), step count (max): 21, lr policy:  0.0003:  10%|#         | 5000/50000 [00:24<03:41, 203.08it/s]
    eval cumulative reward:  110.9623 (init:  110.9623), eval step-count: 11, average reward= 9.2069 (init= 9.0844), step count (max): 22, lr policy:  0.0003:  10%|#         | 5000/50000 [00:24<03:41, 203.08it/s]
    eval cumulative reward:  110.9623 (init:  110.9623), eval step-count: 11, average reward= 9.2069 (init= 9.0844), step count (max): 22, lr policy:  0.0003:  12%|#2        | 6000/50000 [00:29<03:34, 204.67it/s]
    eval cumulative reward:  110.9623 (init:  110.9623), eval step-count: 11, average reward= 9.2261 (init= 9.0844), step count (max): 32, lr policy:  0.0003:  12%|#2        | 6000/50000 [00:29<03:34, 204.67it/s]
    eval cumulative reward:  110.9623 (init:  110.9623), eval step-count: 11, average reward= 9.2261 (init= 9.0844), step count (max): 32, lr policy:  0.0003:  14%|#4        | 7000/50000 [00:34<03:28, 205.81it/s]
    eval cumulative reward:  110.9623 (init:  110.9623), eval step-count: 11, average reward= 9.2359 (init= 9.0844), step count (max): 32, lr policy:  0.0003:  14%|#4        | 7000/50000 [00:34<03:28, 205.81it/s]
    eval cumulative reward:  110.9623 (init:  110.9623), eval step-count: 11, average reward= 9.2359 (init= 9.0844), step count (max): 32, lr policy:  0.0003:  16%|#6        | 8000/50000 [00:38<03:23, 206.83it/s]
    eval cumulative reward:  110.9623 (init:  110.9623), eval step-count: 11, average reward= 9.2502 (init= 9.0844), step count (max): 42, lr policy:  0.0003:  16%|#6        | 8000/50000 [00:38<03:23, 206.83it/s]
    eval cumulative reward:  110.9623 (init:  110.9623), eval step-count: 11, average reward= 9.2502 (init= 9.0844), step count (max): 42, lr policy:  0.0003:  18%|#8        | 9000/50000 [00:43<03:17, 207.17it/s]
    eval cumulative reward:  110.9623 (init:  110.9623), eval step-count: 11, average reward= 9.2468 (init= 9.0844), step count (max): 39, lr policy:  0.0003:  18%|#8        | 9000/50000 [00:43<03:17, 207.17it/s]
    eval cumulative reward:  110.9623 (init:  110.9623), eval step-count: 11, average reward= 9.2468 (init= 9.0844), step count (max): 39, lr policy:  0.0003:  20%|##        | 10000/50000 [00:48<03:12, 207.66it/s]
    eval cumulative reward:  110.9623 (init:  110.9623), eval step-count: 11, average reward= 9.2538 (init= 9.0844), step count (max): 64, lr policy:  0.0003:  20%|##        | 10000/50000 [00:48<03:12, 207.66it/s]
    eval cumulative reward:  110.9623 (init:  110.9623), eval step-count: 11, average reward= 9.2538 (init= 9.0844), step count (max): 64, lr policy:  0.0003:  22%|##2       | 11000/50000 [00:53<03:07, 208.11it/s]
    eval cumulative reward:  306.3771 (init:  110.9623), eval step-count: 32, average reward= 9.2660 (init= 9.0844), step count (max): 52, lr policy:  0.0003:  22%|##2       | 11000/50000 [00:53<03:07, 208.11it/s]
    eval cumulative reward:  306.3771 (init:  110.9623), eval step-count: 32, average reward= 9.2660 (init= 9.0844), step count (max): 52, lr policy:  0.0003:  24%|##4       | 12000/50000 [00:58<03:03, 207.46it/s]
    eval cumulative reward:  306.3771 (init:  110.9623), eval step-count: 32, average reward= 9.2729 (init= 9.0844), step count (max): 56, lr policy:  0.0003:  24%|##4       | 12000/50000 [00:58<03:03, 207.46it/s]
    eval cumulative reward:  306.3771 (init:  110.9623), eval step-count: 32, average reward= 9.2729 (init= 9.0844), step count (max): 56, lr policy:  0.0003:  26%|##6       | 13000/50000 [01:03<02:59, 205.61it/s]
    eval cumulative reward:  306.3771 (init:  110.9623), eval step-count: 32, average reward= 9.2750 (init= 9.0844), step count (max): 46, lr policy:  0.0003:  26%|##6       | 13000/50000 [01:03<02:59, 205.61it/s]
    eval cumulative reward:  306.3771 (init:  110.9623), eval step-count: 32, average reward= 9.2750 (init= 9.0844), step count (max): 46, lr policy:  0.0003:  28%|##8       | 14000/50000 [01:07<02:54, 206.67it/s]
    eval cumulative reward:  306.3771 (init:  110.9623), eval step-count: 32, average reward= 9.2778 (init= 9.0844), step count (max): 61, lr policy:  0.0003:  28%|##8       | 14000/50000 [01:07<02:54, 206.67it/s]
    eval cumulative reward:  306.3771 (init:  110.9623), eval step-count: 32, average reward= 9.2778 (init= 9.0844), step count (max): 61, lr policy:  0.0003:  30%|###       | 15000/50000 [01:12<02:48, 207.43it/s]
    eval cumulative reward:  306.3771 (init:  110.9623), eval step-count: 32, average reward= 9.2711 (init= 9.0844), step count (max): 51, lr policy:  0.0002:  30%|###       | 15000/50000 [01:12<02:48, 207.43it/s]
    eval cumulative reward:  306.3771 (init:  110.9623), eval step-count: 32, average reward= 9.2711 (init= 9.0844), step count (max): 51, lr policy:  0.0002:  32%|###2      | 16000/50000 [01:17<02:43, 208.13it/s]
    eval cumulative reward:  306.3771 (init:  110.9623), eval step-count: 32, average reward= 9.2907 (init= 9.0844), step count (max): 65, lr policy:  0.0002:  32%|###2      | 16000/50000 [01:17<02:43, 208.13it/s]
    eval cumulative reward:  306.3771 (init:  110.9623), eval step-count: 32, average reward= 9.2907 (init= 9.0844), step count (max): 65, lr policy:  0.0002:  34%|###4      | 17000/50000 [01:22<02:38, 208.53it/s]
    eval cumulative reward:  306.3771 (init:  110.9623), eval step-count: 32, average reward= 9.2913 (init= 9.0844), step count (max): 95, lr policy:  0.0002:  34%|###4      | 17000/50000 [01:22<02:38, 208.53it/s]
    eval cumulative reward:  306.3771 (init:  110.9623), eval step-count: 32, average reward= 9.2913 (init= 9.0844), step count (max): 95, lr policy:  0.0002:  36%|###6      | 18000/50000 [01:27<02:33, 208.78it/s]
    eval cumulative reward:  306.3771 (init:  110.9623), eval step-count: 32, average reward= 9.2776 (init= 9.0844), step count (max): 53, lr policy:  0.0002:  36%|###6      | 18000/50000 [01:27<02:33, 208.78it/s]
    eval cumulative reward:  306.3771 (init:  110.9623), eval step-count: 32, average reward= 9.2776 (init= 9.0844), step count (max): 53, lr policy:  0.0002:  38%|###8      | 19000/50000 [01:31<02:28, 208.85it/s]
    eval cumulative reward:  306.3771 (init:  110.9623), eval step-count: 32, average reward= 9.2714 (init= 9.0844), step count (max): 70, lr policy:  0.0002:  38%|###8      | 19000/50000 [01:31<02:28, 208.85it/s]
    eval cumulative reward:  306.3771 (init:  110.9623), eval step-count: 32, average reward= 9.2714 (init= 9.0844), step count (max): 70, lr policy:  0.0002:  40%|####      | 20000/50000 [01:36<02:25, 206.40it/s]
    eval cumulative reward:  306.3771 (init:  110.9623), eval step-count: 32, average reward= 9.2648 (init= 9.0844), step count (max): 49, lr policy:  0.0002:  40%|####      | 20000/50000 [01:36<02:25, 206.40it/s]
    eval cumulative reward:  306.3771 (init:  110.9623), eval step-count: 32, average reward= 9.2648 (init= 9.0844), step count (max): 49, lr policy:  0.0002:  42%|####2     | 21000/50000 [01:41<02:19, 207.25it/s]
    eval cumulative reward:  362.5167 (init:  110.9623), eval step-count: 38, average reward= 9.2684 (init= 9.0844), step count (max): 58, lr policy:  0.0002:  42%|####2     | 21000/50000 [01:41<02:19, 207.25it/s]
    eval cumulative reward:  362.5167 (init:  110.9623), eval step-count: 38, average reward= 9.2684 (init= 9.0844), step count (max): 58, lr policy:  0.0002:  44%|####4     | 22000/50000 [01:46<02:15, 206.52it/s]
    eval cumulative reward:  362.5167 (init:  110.9623), eval step-count: 38, average reward= 9.2785 (init= 9.0844), step count (max): 54, lr policy:  0.0002:  44%|####4     | 22000/50000 [01:46<02:15, 206.52it/s]
    eval cumulative reward:  362.5167 (init:  110.9623), eval step-count: 38, average reward= 9.2785 (init= 9.0844), step count (max): 54, lr policy:  0.0002:  46%|####6     | 23000/50000 [01:51<02:10, 207.47it/s]
    eval cumulative reward:  362.5167 (init:  110.9623), eval step-count: 38, average reward= 9.2903 (init= 9.0844), step count (max): 65, lr policy:  0.0002:  46%|####6     | 23000/50000 [01:51<02:10, 207.47it/s]
    eval cumulative reward:  362.5167 (init:  110.9623), eval step-count: 38, average reward= 9.2903 (init= 9.0844), step count (max): 65, lr policy:  0.0002:  48%|####8     | 24000/50000 [01:55<02:04, 208.16it/s]
    eval cumulative reward:  362.5167 (init:  110.9623), eval step-count: 38, average reward= 9.2963 (init= 9.0844), step count (max): 76, lr policy:  0.0002:  48%|####8     | 24000/50000 [01:55<02:04, 208.16it/s]
    eval cumulative reward:  362.5167 (init:  110.9623), eval step-count: 38, average reward= 9.2963 (init= 9.0844), step count (max): 76, lr policy:  0.0002:  50%|#####     | 25000/50000 [02:00<01:59, 208.56it/s]
    eval cumulative reward:  362.5167 (init:  110.9623), eval step-count: 38, average reward= 9.2903 (init= 9.0844), step count (max): 87, lr policy:  0.0002:  50%|#####     | 25000/50000 [02:00<01:59, 208.56it/s]
    eval cumulative reward:  362.5167 (init:  110.9623), eval step-count: 38, average reward= 9.2903 (init= 9.0844), step count (max): 87, lr policy:  0.0002:  52%|#####2    | 26000/50000 [02:05<01:54, 208.76it/s]
    eval cumulative reward:  362.5167 (init:  110.9623), eval step-count: 38, average reward= 9.2853 (init= 9.0844), step count (max): 52, lr policy:  0.0001:  52%|#####2    | 26000/50000 [02:05<01:54, 208.76it/s]
    eval cumulative reward:  362.5167 (init:  110.9623), eval step-count: 38, average reward= 9.2853 (init= 9.0844), step count (max): 52, lr policy:  0.0001:  54%|#####4    | 27000/50000 [02:10<01:50, 208.82it/s]
    eval cumulative reward:  362.5167 (init:  110.9623), eval step-count: 38, average reward= 9.2823 (init= 9.0844), step count (max): 61, lr policy:  0.0001:  54%|#####4    | 27000/50000 [02:10<01:50, 208.82it/s]
    eval cumulative reward:  362.5167 (init:  110.9623), eval step-count: 38, average reward= 9.2823 (init= 9.0844), step count (max): 61, lr policy:  0.0001:  56%|#####6    | 28000/50000 [02:15<01:46, 206.38it/s]
    eval cumulative reward:  362.5167 (init:  110.9623), eval step-count: 38, average reward= 9.2812 (init= 9.0844), step count (max): 78, lr policy:  0.0001:  56%|#####6    | 28000/50000 [02:15<01:46, 206.38it/s]
    eval cumulative reward:  362.5167 (init:  110.9623), eval step-count: 38, average reward= 9.2812 (init= 9.0844), step count (max): 78, lr policy:  0.0001:  58%|#####8    | 29000/50000 [02:20<01:41, 207.36it/s]
    eval cumulative reward:  362.5167 (init:  110.9623), eval step-count: 38, average reward= 9.2957 (init= 9.0844), step count (max): 113, lr policy:  0.0001:  58%|#####8    | 29000/50000 [02:20<01:41, 207.36it/s]
    eval cumulative reward:  362.5167 (init:  110.9623), eval step-count: 38, average reward= 9.2957 (init= 9.0844), step count (max): 113, lr policy:  0.0001:  60%|######    | 30000/50000 [02:24<01:36, 207.95it/s]
    eval cumulative reward:  362.5167 (init:  110.9623), eval step-count: 38, average reward= 9.2796 (init= 9.0844), step count (max): 48, lr policy:  0.0001:  60%|######    | 30000/50000 [02:24<01:36, 207.95it/s] 
    eval cumulative reward:  362.5167 (init:  110.9623), eval step-count: 38, average reward= 9.2796 (init= 9.0844), step count (max): 48, lr policy:  0.0001:  62%|######2   | 31000/50000 [02:29<01:31, 208.34it/s]
    eval cumulative reward:  503.0214 (init:  110.9623), eval step-count: 53, average reward= 9.2811 (init= 9.0844), step count (max): 54, lr policy:  0.0001:  62%|######2   | 31000/50000 [02:29<01:31, 208.34it/s]
    eval cumulative reward:  503.0214 (init:  110.9623), eval step-count: 53, average reward= 9.2811 (init= 9.0844), step count (max): 54, lr policy:  0.0001:  64%|######4   | 32000/50000 [02:34<01:26, 207.15it/s]
    eval cumulative reward:  503.0214 (init:  110.9623), eval step-count: 53, average reward= 9.3003 (init= 9.0844), step count (max): 78, lr policy:  0.0001:  64%|######4   | 32000/50000 [02:34<01:26, 207.15it/s]
    eval cumulative reward:  503.0214 (init:  110.9623), eval step-count: 53, average reward= 9.3003 (init= 9.0844), step count (max): 78, lr policy:  0.0001:  66%|######6   | 33000/50000 [02:39<01:21, 207.80it/s]
    eval cumulative reward:  503.0214 (init:  110.9623), eval step-count: 53, average reward= 9.2958 (init= 9.0844), step count (max): 96, lr policy:  0.0001:  66%|######6   | 33000/50000 [02:39<01:21, 207.80it/s]
    eval cumulative reward:  503.0214 (init:  110.9623), eval step-count: 53, average reward= 9.2958 (init= 9.0844), step count (max): 96, lr policy:  0.0001:  68%|######8   | 34000/50000 [02:44<01:16, 208.35it/s]
    eval cumulative reward:  503.0214 (init:  110.9623), eval step-count: 53, average reward= 9.2960 (init= 9.0844), step count (max): 67, lr policy:  0.0001:  68%|######8   | 34000/50000 [02:44<01:16, 208.35it/s]
    eval cumulative reward:  503.0214 (init:  110.9623), eval step-count: 53, average reward= 9.2960 (init= 9.0844), step count (max): 67, lr policy:  0.0001:  70%|#######   | 35000/50000 [02:49<01:12, 206.13it/s]
    eval cumulative reward:  503.0214 (init:  110.9623), eval step-count: 53, average reward= 9.2879 (init= 9.0844), step count (max): 53, lr policy:  0.0001:  70%|#######   | 35000/50000 [02:49<01:12, 206.13it/s]
    eval cumulative reward:  503.0214 (init:  110.9623), eval step-count: 53, average reward= 9.2879 (init= 9.0844), step count (max): 53, lr policy:  0.0001:  72%|#######2  | 36000/50000 [02:53<01:07, 207.16it/s]
    eval cumulative reward:  503.0214 (init:  110.9623), eval step-count: 53, average reward= 9.2927 (init= 9.0844), step count (max): 55, lr policy:  0.0001:  72%|#######2  | 36000/50000 [02:53<01:07, 207.16it/s]
    eval cumulative reward:  503.0214 (init:  110.9623), eval step-count: 53, average reward= 9.2927 (init= 9.0844), step count (max): 55, lr policy:  0.0001:  74%|#######4  | 37000/50000 [02:58<01:02, 207.94it/s]
    eval cumulative reward:  503.0214 (init:  110.9623), eval step-count: 53, average reward= 9.2972 (init= 9.0844), step count (max): 84, lr policy:  0.0001:  74%|#######4  | 37000/50000 [02:58<01:02, 207.94it/s]
    eval cumulative reward:  503.0214 (init:  110.9623), eval step-count: 53, average reward= 9.2972 (init= 9.0844), step count (max): 84, lr policy:  0.0001:  76%|#######6  | 38000/50000 [03:03<00:57, 208.45it/s]
    eval cumulative reward:  503.0214 (init:  110.9623), eval step-count: 53, average reward= 9.2950 (init= 9.0844), step count (max): 64, lr policy:  0.0000:  76%|#######6  | 38000/50000 [03:03<00:57, 208.45it/s]
    eval cumulative reward:  503.0214 (init:  110.9623), eval step-count: 53, average reward= 9.2950 (init= 9.0844), step count (max): 64, lr policy:  0.0000:  78%|#######8  | 39000/50000 [03:08<00:52, 208.65it/s]
    eval cumulative reward:  503.0214 (init:  110.9623), eval step-count: 53, average reward= 9.2946 (init= 9.0844), step count (max): 101, lr policy:  0.0000:  78%|#######8  | 39000/50000 [03:08<00:52, 208.65it/s]
    eval cumulative reward:  503.0214 (init:  110.9623), eval step-count: 53, average reward= 9.2946 (init= 9.0844), step count (max): 101, lr policy:  0.0000:  80%|########  | 40000/50000 [03:12<00:47, 208.71it/s]
    eval cumulative reward:  503.0214 (init:  110.9623), eval step-count: 53, average reward= 9.2920 (init= 9.0844), step count (max): 59, lr policy:  0.0000:  80%|########  | 40000/50000 [03:12<00:47, 208.71it/s] 
    eval cumulative reward:  503.0214 (init:  110.9623), eval step-count: 53, average reward= 9.2920 (init= 9.0844), step count (max): 59, lr policy:  0.0000:  82%|########2 | 41000/50000 [03:17<00:43, 208.82it/s]
    eval cumulative reward:  100.8552 (init:  110.9623), eval step-count: 10, average reward= 9.2969 (init= 9.0844), step count (max): 113, lr policy:  0.0000:  82%|########2 | 41000/50000 [03:17<00:43, 208.82it/s]
    eval cumulative reward:  100.8552 (init:  110.9623), eval step-count: 10, average reward= 9.2969 (init= 9.0844), step count (max): 113, lr policy:  0.0000:  84%|########4 | 42000/50000 [03:22<00:38, 208.65it/s]
    eval cumulative reward:  100.8552 (init:  110.9623), eval step-count: 10, average reward= 9.2977 (init= 9.0844), step count (max): 111, lr policy:  0.0000:  84%|########4 | 42000/50000 [03:22<00:38, 208.65it/s]
    eval cumulative reward:  100.8552 (init:  110.9623), eval step-count: 10, average reward= 9.2977 (init= 9.0844), step count (max): 111, lr policy:  0.0000:  86%|########6 | 43000/50000 [03:27<00:33, 206.53it/s]
    eval cumulative reward:  100.8552 (init:  110.9623), eval step-count: 10, average reward= 9.2978 (init= 9.0844), step count (max): 68, lr policy:  0.0000:  86%|########6 | 43000/50000 [03:27<00:33, 206.53it/s] 
    eval cumulative reward:  100.8552 (init:  110.9623), eval step-count: 10, average reward= 9.2978 (init= 9.0844), step count (max): 68, lr policy:  0.0000:  88%|########8 | 44000/50000 [03:32<00:28, 207.46it/s]
    eval cumulative reward:  100.8552 (init:  110.9623), eval step-count: 10, average reward= 9.2992 (init= 9.0844), step count (max): 108, lr policy:  0.0000:  88%|########8 | 44000/50000 [03:32<00:28, 207.46it/s]
    eval cumulative reward:  100.8552 (init:  110.9623), eval step-count: 10, average reward= 9.2992 (init= 9.0844), step count (max): 108, lr policy:  0.0000:  90%|######### | 45000/50000 [03:37<00:24, 207.84it/s]
    eval cumulative reward:  100.8552 (init:  110.9623), eval step-count: 10, average reward= 9.3017 (init= 9.0844), step count (max): 103, lr policy:  0.0000:  90%|######### | 45000/50000 [03:37<00:24, 207.84it/s]
    eval cumulative reward:  100.8552 (init:  110.9623), eval step-count: 10, average reward= 9.3017 (init= 9.0844), step count (max): 103, lr policy:  0.0000:  92%|#########2| 46000/50000 [03:41<00:19, 208.10it/s]
    eval cumulative reward:  100.8552 (init:  110.9623), eval step-count: 10, average reward= 9.2975 (init= 9.0844), step count (max): 138, lr policy:  0.0000:  92%|#########2| 46000/50000 [03:41<00:19, 208.10it/s]
    eval cumulative reward:  100.8552 (init:  110.9623), eval step-count: 10, average reward= 9.2975 (init= 9.0844), step count (max): 138, lr policy:  0.0000:  94%|#########3| 47000/50000 [03:46<00:14, 208.60it/s]
    eval cumulative reward:  100.8552 (init:  110.9623), eval step-count: 10, average reward= 9.3068 (init= 9.0844), step count (max): 79, lr policy:  0.0000:  94%|#########3| 47000/50000 [03:46<00:14, 208.60it/s] 
    eval cumulative reward:  100.8552 (init:  110.9623), eval step-count: 10, average reward= 9.3068 (init= 9.0844), step count (max): 79, lr policy:  0.0000:  96%|#########6| 48000/50000 [03:51<00:09, 209.01it/s]
    eval cumulative reward:  100.8552 (init:  110.9623), eval step-count: 10, average reward= 9.3066 (init= 9.0844), step count (max): 121, lr policy:  0.0000:  96%|#########6| 48000/50000 [03:51<00:09, 209.01it/s]
    eval cumulative reward:  100.8552 (init:  110.9623), eval step-count: 10, average reward= 9.3066 (init= 9.0844), step count (max): 121, lr policy:  0.0000:  98%|#########8| 49000/50000 [03:56<00:04, 209.23it/s]
    eval cumulative reward:  100.8552 (init:  110.9623), eval step-count: 10, average reward= 9.3064 (init= 9.0844), step count (max): 119, lr policy:  0.0000:  98%|#########8| 49000/50000 [03:56<00:04, 209.23it/s]
    eval cumulative reward:  100.8552 (init:  110.9623), eval step-count: 10, average reward= 9.3064 (init= 9.0844), step count (max): 119, lr policy:  0.0000: 100%|##########| 50000/50000 [04:01<00:00, 206.97it/s]
    eval cumulative reward:  100.8552 (init:  110.9623), eval step-count: 10, average reward= 9.3000 (init= 9.0844), step count (max): 85, lr policy:  0.0000: 100%|##########| 50000/50000 [04:01<00:00, 206.97it/s] 


.. GENERATED FROM PYTHON SOURCE LINES 663-670

Results
-------

With a budget of 1M frames, the algorithm should reach a maximum step count
of 1000 steps, which is the maximum number of steps before the trajectory is
truncated. The run logged above collects only 50,000 frames, so its step
counts remain well below that cap.


.. GENERATED FROM PYTHON SOURCE LINES 670-685

.. code-block:: default

    plt.figure(figsize=(10, 10))
    plt.subplot(2, 2, 1)
    plt.plot(logs["reward"])
    plt.title("training rewards (average)")
    plt.subplot(2, 2, 2)
    plt.plot(logs["step_count"])
    plt.title("Max step count (training)")
    plt.subplot(2, 2, 3)
    plt.plot(logs["eval reward (sum)"])
    plt.title("Return (test)")
    plt.subplot(2, 2, 4)
    plt.plot(logs["eval step_count"])
    plt.title("Max step count (test)")
    plt.show()


.. image-sg:: /intermediate/images/sphx_glr_reinforcement_ppo_001.png
   :alt: training rewards (average), Max step count (training), Return (test), Max step count (test)
   :srcset: /intermediate/images/sphx_glr_reinforcement_ppo_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 686-706

Conclusion and next steps
-------------------------

In this tutorial, we have learned:

1. How to create and customize an environment with :py:mod:`torchrl`;
2. How to write a model and a loss function;
3. How to set up a typical training loop.

If you want to experiment with this tutorial a bit more, you can apply the following modifications:

* From an efficiency perspective,
  we could run several simulations in parallel to speed up data collection.
  Check :class:`~torchrl.envs.ParallelEnv` for further information.

* From a logging perspective, one could enable rendering and add a
  :class:`torchrl.record.VideoRecorder` transform to the environment to
  record videos of the inverted pendulum in action. Check
  :py:mod:`torchrl.record` to learn more.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 4 minutes  2.606 seconds)


.. _sphx_glr_download_intermediate_reinforcement_ppo.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: reinforcement_ppo.py <reinforcement_ppo.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: reinforcement_ppo.ipynb <reinforcement_ppo.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_