Safe Softmax
Motivation
One of the issues that commonly comes up is the need for a safe softmax: if an entire batch is “masked out” or consists entirely of padding (which, in the softmax case, translates to being set to -inf), the result will be NaNs, which can lead to training divergence. This happens because every entry along the softmax dimension becomes exp(-inf) = 0, so the normalization reduces to 0/0. For more detail on why this functionality is helpful, please see Issue 55056 - Feature Request for Safe Softmax.
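For instance, a minimal standalone snippet (separate from the example below) shows the failure mode directly: a fully masked, all -inf slice passed to the regular softmax comes back as NaN:

import torch

# Every entry along the softmax dimension is -inf, so the result is NaN.
torch.full((3,), float('-inf')).softmax(0)
tensor([nan, nan, nan])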
Luckily, MaskedTensor has solved this issue already.
import torch

# This example expects torch 1.11.0; otherwise reinstall from the test channel.
if "1.11.0" not in torch.__version__:
    !pip uninstall -y torch
    !pip install torch -f https://download.pytorch.org/whl/test/cu102/torch_test.html --pre
# Import the MaskedTensor factory functions
from maskedtensor import masked_tensor
from maskedtensor import as_masked_tensor
data = torch.randn(3, 3)
mask = torch.tensor([
    [True, False, False],
    [True, False, True],
    [False, False, False]
])
# Regular tensor: masked-out entries are filled with -inf
x = data.masked_fill(~mask, float('-inf'))
# MaskedTensor: the mask is carried alongside the data
m = masked_tensor(data, mask)
PyTorch result:
x.softmax(0)
tensor([[0.3628,    nan, 0.0000],
        [0.6372,    nan, 1.0000],
        [0.0000,    nan, 0.0000]])
MaskedTensor result:
m.softmax(0)
masked_tensor(
  [
    [ 0.3628,     --,     --],
    [ 0.6372,     --, 1.0000],
    [     --,     --,     --]
  ]
)
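For comparison, a common workaround without MaskedTensor is to post-process the regular softmax output. The sketch below (variable names are illustrative, reusing data and mask from above) zeroes out the NaNs produced by fully masked columns, at the cost of discarding the mask information that MaskedTensor preserves:

# Plain-PyTorch workaround sketch: compute softmax on the -inf-filled tensor,
# then replace the NaNs that fully masked columns produce (0/0) with zeros.
out = data.masked_fill(~mask, float('-inf')).softmax(0)
safe = out.nan_to_num(0.0)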