Safe Softmax
Motivation
One of the issues that commonly comes up is the need for a safe softmax: if an entire batch is “masked out” or consists entirely of padding (which, in the softmax case, translates to being set to -inf), the result will be NaNs, which can lead to training divergence. This happens because every entry along the softmax dimension becomes exp(-inf) = 0, so the normalization reduces to 0/0. For more detail on why this functionality is helpful, please see Issue 55056 - Feature Request for Safe Softmax.
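For instance, a minimal standalone snippet (separate from the example below) shows the failure mode directly: a fully masked, all -inf slice passed to the regular softmax comes back as NaN:

import torch

# Every entry along the softmax dimension is -inf, so the result is NaN.
torch.full((3,), float('-inf')).softmax(0)
tensor([nan, nan, nan])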
Luckily, MaskedTensor has solved this issue already.
import torch

# This example expects torch 1.11.0; otherwise reinstall from the test channel.
if "1.11.0" not in torch.__version__:
    !pip uninstall -y torch
    !pip install torch -f https://download.pytorch.org/whl/test/cu102/torch_test.html --pre
# Import the MaskedTensor factory functions
from maskedtensor import masked_tensor
from maskedtensor import as_masked_tensor
data = torch.randn(3, 3)
mask = torch.tensor([
    [True, False, False],
    [True, False, True],
    [False, False, False]
])
# Regular tensor: masked-out entries are filled with -inf
x = data.masked_fill(~mask, float('-inf'))
# MaskedTensor: the mask is carried alongside the data
m = masked_tensor(data, mask)
PyTorch result:
x.softmax(0)
tensor([[0.3628,    nan, 0.0000],
        [0.6372,    nan, 1.0000],
        [0.0000,    nan, 0.0000]])
MaskedTensor result:
m.softmax(0)
masked_tensor(
  [
    [ 0.3628,     --,     --],
    [ 0.6372,     --, 1.0000],
    [     --,     --,     --]
  ]
)
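For comparison, a common workaround without MaskedTensor is to post-process the regular softmax output. The sketch below (variable names are illustrative, reusing data and mask from above) zeroes out the NaNs produced by fully masked columns, at the cost of discarding the mask information that MaskedTensor preserves:

# Plain-PyTorch workaround sketch: compute softmax on the -inf-filled tensor,
# then replace the NaNs that fully masked columns produce (0/0) with zeros.
out = data.masked_fill(~mask, float('-inf')).softmax(0)
safe = out.nan_to_num(0.0)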