Safe Softmax
Motivation
One of the issues that commonly comes up is the need for a safe softmax: if an entire batch is “masked out” or consists entirely of padding (which, in the softmax case, translates to every entry being set to -inf), the result will be NaNs, which can lead to training divergence. For more detail on why this functionality is helpful, see Issue 55056 - Feature Request for Safe Softmax.
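Before turning to MaskedTensor, it helps to see what a manual workaround looks like. The sketch below is illustrative only: the safe_softmax helper and its signature are not part of PyTorch. It computes an ordinary softmax over the -inf-filled tensor and then zeroes out any slice that had no valid entries, which is one common convention for "safe" behavior.

import torch

def safe_softmax(data: torch.Tensor, mask: torch.Tensor, dim: int = 0) -> torch.Tensor:
    # Hypothetical helper, not a PyTorch API.
    # Entries where mask is False are treated as padding and set to -inf,
    # so they contribute zero weight to the softmax.
    filled = data.masked_fill(~mask, float('-inf'))
    result = filled.softmax(dim)
    # A slice with no valid entries produces NaN; zero those slices out instead.
    fully_masked = ~mask.any(dim=dim, keepdim=True)
    return result.masked_fill(fully_masked, 0.0)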
Luckily, MaskedTensor has solved this issue already.
import torch
from maskedtensor import masked_tensor

data = torch.randn(3, 3)
# The middle column (index 1) is entirely masked out.
mask = torch.tensor([
    [True, False, False],
    [True, False, True],
    [False, False, False]
])
x = data.masked_fill(~mask, float('-inf'))
m = masked_tensor(data, mask)
PyTorch result:
x.softmax(0)
tensor([[0.5169,    nan, 0.0000],
        [0.4831,    nan, 1.0000],
        [0.0000,    nan, 0.0000]])
MaskedTensor result:
m.softmax(0)
masked_tensor(
  [
    [0.5169,     --,     --],
    [0.4831,     --, 1.0000],
    [    --,     --,     --]
  ]
)
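Note that the fully masked middle column stays masked in the MaskedTensor result rather than turning into NaNs, while in the partially masked columns the masked-out entries are simply excluded from the normalization.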