
Common PyTorch errors and solutions

My training is too slow [Newcomers / intermediate]

  • RL is known to be CPU-intensive in some instances. Even when running only a few environments in parallel, you can see a great speed-up by asking for more cores on your cluster than the number of environments you’re working with (twice as many, for instance). This is especially true for environments that are rendered (even if they are rendered on GPU).

  • The speed of training depends on several factors, and there is no one-size-fits-all solution to every problem. The common bottlenecks are:

    • data collection: the simulator speed may affect performance, as can the data transformations that follow. Speeding up environment interactions is usually done via vectorization (if the simulator enables it, e.g. Brax and other JAX-based simulators) or parallelization (which is improperly called vectorized envs in gym and other libraries). In TorchRL, transformations can usually be executed on device (see the first sketch after this list).

    • Replay buffer storage and sampling: storing items in a replay buffer can take time if the underlying operation requires heavy memory manipulation or tedious indexing (e.g. with prioritized replay buffers). Sampling can also take a considerable amount of time if the data isn’t stored contiguously and/or if costly stacking or concatenation operations are performed. TorchRL provides efficient contiguous storage solutions and efficient writing and sampling solutions in these cases (see the second sketch after this list).

    • Advantage computation: computing advantage functions can also constitute a computational bottleneck, as these are usually coded using plain for loops. If profiling indicates that this operation takes a considerable amount of time, consider using our fully vectorized solutions instead (see the third sketch after this list).

    • Loss computation: the loss computation and the optimization steps are frequently responsible for a significant share of the compute time. Some techniques can speed things up. For instance, if multiple target networks are being used, using vectorized maps and functional programming (through functorch, now part of torch.func) instead of looping over the model configurations can provide a significant speedup (see the last sketch after this list).
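
Below is a minimal sketch of parallelized data collection with TorchRL’s ParallelEnv. It assumes a Gym/Gymnasium install that provides "Pendulum-v1"; substitute any environment you have available.

    from torchrl.envs import GymEnv, ParallelEnv

    if __name__ == "__main__":
        # Four copies of the env, each stepped in its own worker process
        # (parallelization, as opposed to in-simulator vectorization).
        env = ParallelEnv(4, lambda: GymEnv("Pendulum-v1"))
        # A batched rollout: a TensorDict with batch size [4, up-to-100]
        rollout = env.rollout(max_steps=100)
        print(rollout.shape)
        env.close()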
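The next sketch shows a contiguous replay buffer. The shapes and key names in the dummy data are placeholders, and exact import paths may vary slightly across TorchRL versions.

    import torch
    from tensordict import TensorDict
    from torchrl.data import LazyTensorStorage, ReplayBuffer

    # LazyTensorStorage pre-allocates a contiguous buffer on the first write,
    # so extend() and sample() index into memory rather than stacking lists.
    buffer = ReplayBuffer(storage=LazyTensorStorage(100_000))

    data = TensorDict(
        {"observation": torch.randn(256, 4), "reward": torch.randn(256, 1)},
        batch_size=[256],
    )
    buffer.extend(data)          # contiguous write
    sample = buffer.sample(128)  # contiguous read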
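Third, a sketch of the vectorized advantage computation through TorchRL’s GAE module. The toy linear value network and the trajectory shapes are assumptions for illustration only.

    import torch
    from tensordict import TensorDict
    from tensordict.nn import TensorDictModule
    from torchrl.objectives.value import GAE

    # Wrap a value network so it reads/writes tensordict entries.
    value_net = TensorDictModule(
        torch.nn.Linear(4, 1), in_keys=["observation"], out_keys=["state_value"]
    )
    gae = GAE(gamma=0.99, lmbda=0.95, value_network=value_net)

    # A dummy batch of 8 trajectories of length 64.
    done = torch.zeros(8, 64, 1, dtype=torch.bool)
    td = TensorDict(
        {
            "observation": torch.randn(8, 64, 4),
            "next": TensorDict(
                {
                    "observation": torch.randn(8, 64, 4),
                    "reward": torch.randn(8, 64, 1),
                    "done": done,
                    "terminated": done.clone(),
                },
                batch_size=[8, 64],
            ),
        },
        batch_size=[8, 64],
    )
    gae(td)  # writes "advantage" and "value_target" in one vectorized pass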
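Finally, a sketch of evaluating several networks in one batched call with vectorized maps instead of a Python loop. functorch has since been folded into torch.func; the ensembling pattern below follows the torch.func documentation, with two toy critics standing in for your target networks.

    import copy
    import torch
    from torch.func import functional_call, stack_module_state, vmap

    # Two critics (e.g. twin target Q-networks) evaluated in one batched call.
    critics = [torch.nn.Linear(8, 1) for _ in range(2)]
    params, buffers = stack_module_state(critics)
    base = copy.deepcopy(critics[0]).to("meta")  # stateless template

    def run(p, b, x):
        return functional_call(base, (p, b), (x,))

    x = torch.randn(256, 8)
    q_values = vmap(run, in_dims=(0, 0, None))(params, buffers, x)
    print(q_values.shape)  # torch.Size([2, 256, 1])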

Common bugs

  • For bugs related to MuJoCo (incl. the DeepMind Control suite and other libraries), refer to the MUJOCO_INSTALLATION file.

  • ValueError: bad value(s) in fds_to_keep: this can have multiple causes. One that is common in TorchRL is that you are trying to send a tensor across processes that is a view of another tensor. For instance, when sending the tensor b = tensor.expand(new_shape) across processes, the reference to the original storage will be lost (the expand operation returns a view that keeps a reference to the original tensor rather than owning its own storage). To debug this, look for such operations (view, permute, expand, etc.) and call clone() or contiguous() after the call to the function, as sketched below.
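
A minimal sketch of the failure and the fix, using a hypothetical producer/consumer pair:

    import torch
    import torch.multiprocessing as mp

    def consume(queue):
        print(queue.get().shape)

    if __name__ == "__main__":
        tensor = torch.randn(3)
        bad = tensor.expand(4, 3)           # a view; shares storage with `tensor`
        good = tensor.expand(4, 3).clone()  # owns its storage; safe to send

        queue = mp.Queue()
        proc = mp.Process(target=consume, args=(queue,))
        proc.start()
        queue.put(good)  # sending `bad` instead can raise the fds_to_keep error
        proc.join()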
