Efficiency of writing “sparse” semantics for Adagrad

Motivation

Issue 1369 discussed the additional lines of code that were introduced while writing “sparse” semantics for Adagrad. But really, the code doesn’t use sparsity as a compression and optimization technique; it wants to use masked semantics. We worked around this by introducing one-off semantics and operators that encode this behavior while forcing users to be aware of storage details such as indices and values. Let’s look at the current implementation of Adagrad (functional) to illustrate that.

In particular we’ll point out when sparsity is used as a semantic extension, i.e. when unspecified values are not zero, and when it is just used to compress zeros. We’ll also compare and contrast this with equivalent code written using MaskedTensor. At the end, the code snippets are repeated without additional comments to show the difference in brevity.

Code

import torch
from maskedtensor import masked_tensor

Original Sparse Implementation

# Helper that packs the given values into a sparse COO tensor that shares the
# gradient's shape and indices.
def _make_sparse(grad, grad_indices, values):
    size = grad.size()
    if grad_indices.numel() == 0 or values.numel() == 0:
        return torch.empty_like(grad)
    return torch.sparse_coo_tensor(grad_indices, values, size)

# Construct a dense parameter and a sparse COO gradient by hand
param = torch.arange(8).reshape(2, 4).float()
i = torch.tensor([[0, 1, 1],
                  [2, 0, 2]])
v = torch.tensor([3, 4, 5], dtype=torch.float32)
grad = torch.sparse_coo_tensor(i, v, [2, 4])
state_sum = torch.full_like(param, 0.5) # initial value for state sum

print("param:\n", param)
print("grad:\n", grad.to_dense())
print("state_sum:\n", state_sum)

# Some hyperparameters
eps = 1e-10
clr = 0.1
param:
 tensor([[0., 1., 2., 3.],
        [4., 5., 6., 7.]])
grad:
 tensor([[0., 0., 3., 0.],
        [4., 0., 5., 0.]])
state_sum:
 tensor([[0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000]])
state_sum = torch.full_like(param, 0.5) # initial value for state sum
print(state_sum)

grad = grad.coalesce()  # the update is non-linear so indices must be unique
grad_indices = grad._indices()
grad_values = grad._values()

# pow(2) has the same semantics for both sparse and dense memory layouts since 0^2 is zero
state_sum.add_(_make_sparse(grad, grad_indices, grad_values.pow(2)))
# We take care to make std sparse, even though state_sum clearly is not.
# This means that we only apply the gradient to the parts of state_sum
# for which it is specified. This further drives home the point that the
# passed gradient is not sparse, but masked.
std = state_sum.sparse_mask(grad)
print("state_sum:\n", state_sum)
print("std:\n", std.to_dense())
tensor([[0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000]])
state_sum:
 tensor([[ 0.5000,  0.5000,  9.5000,  0.5000],
        [16.5000,  0.5000, 25.5000,  0.5000]])
std:
 tensor([[ 0.0000,  0.0000,  9.5000,  0.0000],
        [16.5000,  0.0000, 25.5000,  0.0000]])

This is where we see an important divergence. The addition of eps should technically be applied to all values, but instead it is only applied to the specified values. Here we’re using sparsity as a semantic extension, i.e. to enforce a certain pattern of defined and undefined values. If parts of the gradient’s values happen to be zero, they are still included as specified values when materialized, even though other sparse storage layouts could compress them away. This is technically quite brittle, even though one could argue that eps is always very small.
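
To make that divergence concrete, here is a small out-of-place sketch (not part of the original implementation) that contrasts a fully dense update, where eps lands on every element, with the sparse code path below, where eps only reaches the specified values:

# Hedged illustration only; the out-of-place ops leave std untouched for the
# in-place update that follows.
dense_std = std.to_dense().sqrt().add(eps)     # eps is added to every element
sparse_vals = std._values().sqrt().add(eps)    # eps only reaches specified values
print("dense_std:\n", dense_std)
print("sparse_vals:\n", sparse_vals)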

Moreover, an implementation of add_ that treats sparsity as a storage layout and compression scheme should cause densification, but we force it not to. For this one-off case that is fine, until we want to introduce new compression schemes such as CSR, BSR, or 2:4 block sparsity. We would then need to introduce separate Tensor types for each and write variations for gradients compressed using different storage formats.
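
As a hedged sketch of that concern (not in the original walkthrough, and assuming a PyTorch build with CSR support), the same gradient stored as CSR exposes entirely different accessors, so the COO-specific helper above would need a format-specific variant:

# Hypothetical illustration: a CSR gradient has no COO-style _indices()/_values(),
# so _make_sparse and the surrounding code would need a CSR-specific rewrite.
grad_csr = grad.to_dense().to_sparse_csr()
print(grad_csr.crow_indices(), grad_csr.col_indices(), grad_csr.values())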

# We currently dodge all of these concerns by using the private method _values().
std_values = std._values().sqrt_().add_(eps)

# We currently don't support div for sparse Tensors because zero / zero is
# not well defined. For a MaskedTensor undefined / undefined is undefined.
param.add_(_make_sparse(grad, grad_indices, grad_values / std_values), alpha=-clr)
print("param:\n", param)
param:
 tensor([[0.0000, 1.0000, 1.9027, 3.0000],
        [3.9015, 5.0000, 5.9010, 7.0000]])

MaskedTensor Sparse Implementation

We’ve been conflating sparsity as an optimization with sparsity as a semantic extension to PyTorch. MaskedTensor proposes to call this semantic extension “masked” instead of routing it through sparsity. Currently we can’t have dense semantics with sparse storage, or masked semantics with dense storage. MaskedTensor fixes that because it separates the storage from the semantics. Let’s look at the above example using a masked gradient.
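
As a small sketch of that separation (not part of the original example), the same masked values can just as well be backed by plain dense storage; only the mask decides which elements are defined:

# Hedged illustration: masked semantics with dense storage. Both the data and
# the mask are ordinary dense tensors; the sparse-backed version follows below.
dense_masked_grad = masked_tensor(grad.to_dense(), grad.to_dense() != 0)
print("dense_masked_grad:\n", dense_masked_grad)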

# Create an entirely new set of parameters to avoid errors
param2 = torch.arange(8).reshape(2, 4).float()
state_sum2 = torch.full_like(param, 0.5)  # initial value for state sum

mask = (grad.to_dense() != 0).to_sparse()
masked_grad = masked_tensor(grad, mask)
print("masked_grad:\n", masked_grad)
masked_grad:
 masked_tensor(
  [
    [      --,       --,   3.0000,       --],
    [  4.0000,       --,   5.0000,       --]
  ]
)
# We can eventually construct a masked std backed by a sparse layout
state_sum2 = state_sum2 + masked_grad.pow(2).data()
std2 = masked_tensor(state_sum2.to_sparse(), mask)

# Let's print both this version and the regular version for easier comparison
print("state_sum:\n", state_sum)
print("std:\n", std)
print("state_sum2:\n", state_sum2)
print("std2:\n", std2)
state_sum:
 tensor([[ 0.5000,  0.5000,  9.5000,  0.5000],
        [16.5000,  0.5000, 25.5000,  0.5000]])
std:
 tensor(indices=tensor([[0, 1, 1],
                       [2, 0, 2]]),
       values=tensor([3.0822, 4.0620, 5.0498]),
       size=(2, 4), nnz=3, layout=torch.sparse_coo)
state_sum2:
 tensor([[ 0.5000,  0.5000,  9.5000,  0.5000],
        [16.5000,  0.5000, 25.5000,  0.5000]])
std2:
 masked_tensor(
  [
    [      --,       --,   9.5000,       --],
    [ 16.5000,       --,  25.5000,       --]
  ]
)
# We can add support for in-place operations later. Notice how this doesn't
# need to access any storage internals and is in general a lot shorter
std2 = std2.sqrt().add(eps)

print("std:\n", std)
print("std2:\n", std2)

# .data() indeed returns a sparse tensor
param2 = param2.add((masked_grad / std2).data(), alpha=-clr)
print("param2:\n", param2)
std:
 tensor(indices=tensor([[0, 1, 1],
                       [2, 0, 2]]),
       values=tensor([3.0822, 4.0620, 5.0498]),
       size=(2, 4), nnz=3, layout=torch.sparse_coo)
std2:
 masked_tensor(
  [
    [      --,       --,   3.0822,       --],
    [  4.0620,       --,   5.0498,       --]
  ]
)
param2:
 tensor([[0.0000, 1.0000, 1.9027, 3.0000],
        [3.9015, 5.0000, 5.9010, 7.0000]])

Conclusion: Difference in Code

For reference, this is the regular, dense code path without masked gradients or sparsity:

state_sum.addcmul_(grad, grad, value=1)
std = state_sum.sqrt().add_(eps)
param.addcdiv_(grad, std, value=-clr)

The vanilla tensor implementation for sparse gradients:

grad = grad.coalesce()  # the update is non-linear so indices must be unique
grad_indices = grad._indices()
grad_values = grad._values()
size = grad.size()

state_sum.add_(_make_sparse(grad, grad_indices, grad_values.pow(2)))
std = state_sum.sparse_mask(grad)
std_values = std._values().sqrt_().add_(eps)
param.add_(_make_sparse(grad, grad_indices, grad_values / std_values), alpha=-clr)
tensor([[0.0000, 1.0000, 1.8329, 3.0000],
        [3.8314, 5.0000, 5.8306, 7.0000]])

While MaskedTensor minimizes the code to the following snippet:

state_sum2 = state_sum2 + masked_grad.pow(2).data()
std2 = masked_tensor(state_sum2.to_sparse(), mask)
std2 = std2.sqrt().add(eps)
param2 = param2.add((masked_grad / std2).data(), alpha=-clr)

And for good measure, let’s make sure the parameters match:

print("param:\n", param)
print("param2:\n", param2)
param:
 tensor([[0.0000, 1.0000, 1.8329, 3.0000],
        [3.8314, 5.0000, 5.8306, 7.0000]])
param2:
 tensor([[0.0000, 1.0000, 1.8329, 3.0000],
        [3.8314, 5.0000, 5.8306, 7.0000]])