Safe Softmax
Motivation
One of the issues that commonly comes up is the need for a safe softmax: if an entire batch is “masked out” or consists entirely of padding (which, in the softmax case, translates to every entry being set to -inf), the result will be NaNs, which can lead to training divergence. For more detail on why this functionality is helpful, see Issue 55056 - Feature Request for Safe Softmax.
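Before turning to MaskedTensor, it helps to see what a manual workaround looks like. The sketch below is illustrative only: the safe_softmax helper and its signature are not part of PyTorch. It computes an ordinary softmax over the -inf-filled tensor and then zeroes out any slice that had no valid entries, which is one common convention for "safe" behavior.

import torch

def safe_softmax(data: torch.Tensor, mask: torch.Tensor, dim: int = 0) -> torch.Tensor:
    # Hypothetical helper, not a PyTorch API.
    # Entries where mask is False are treated as padding and set to -inf,
    # so they contribute zero weight to the softmax.
    filled = data.masked_fill(~mask, float('-inf'))
    result = filled.softmax(dim)
    # A slice with no valid entries produces NaN; zero those slices out instead.
    fully_masked = ~mask.any(dim=dim, keepdim=True)
    return result.masked_fill(fully_masked, 0.0)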
Luckily, MaskedTensor has solved this issue already.
import torch
from maskedtensor import masked_tensor

data = torch.randn(3, 3)
# The middle column (index 1) is entirely masked out.
mask = torch.tensor([
    [True, False, False],
    [True, False, True],
    [False, False, False]
])
x = data.masked_fill(~mask, float('-inf'))
m = masked_tensor(data, mask)
PyTorch result:
x.softmax(0)
tensor([[0.5169,    nan, 0.0000],
        [0.4831,    nan, 1.0000],
        [0.0000,    nan, 0.0000]])
MaskedTensor result:
m.softmax(0)
masked_tensor(
  [
    [0.5169,     --,     --],
    [0.4831,     --, 1.0000],
    [    --,     --,     --]
  ]
)
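Note that the fully masked middle column stays masked in the MaskedTensor result rather than turning into NaNs, while in the partially masked columns the masked-out entries are simply excluded from the normalization.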