{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# For tips on running notebooks in Google Colab, see\n",
    "# https://pytorch.org/tutorials/beginner/colab\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Neural Tangent Kernels\n",
    "======================\n",
    "\n",
    "The neural tangent kernel (NTK) is a kernel that describes [how a neural\n",
    "network evolves during\n",
    "training](https://en.wikipedia.org/wiki/Neural_tangent_kernel). There\n",
    "has been a lot of research around it [in recent\n",
    "years](https://arxiv.org/abs/1806.07572). This tutorial, inspired by the\n",
    "implementation of [NTKs in\n",
    "JAX](https://github.com/google/neural-tangents) (see [Fast Finite Width\n",
    "Neural Tangent Kernel](https://arxiv.org/abs/2206.08720) for details),\n",
    "demonstrates how to easily compute this quantity using `torch.func`,\n",
    "composable function transforms for PyTorch.\n",
    "\n",
    "<div style=\"background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px\"><strong>NOTE:</strong></div>\n",
    "\n",
    "<div style=\"background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px\">\n",
    "\n",
    "<p>This tutorial requires PyTorch 2.0.0 or later.</p>\n",
    "\n",
    "</div>\n",
    "\n",
    "Setup\n",
    "-----\n",
    "\n",
    "First, some setup. Let\\'s define a simple CNN that we wish to compute\n",
    "the NTK of.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",
    "from torch.func import functional_call, vmap, vjp, jvp, jacrev\n",
    "device = 'cuda' if torch.cuda.device_count() > 0 else 'cpu'\n",
    "\n",
    "class CNN(nn.Module):\n",
    "    def __init__(self):\n",
    "        super(CNN, self).__init__()\n",
    "        self.conv1 = nn.Conv2d(3, 32, (3, 3))\n",
    "        self.conv2 = nn.Conv2d(32, 32, (3, 3))\n",
    "        self.conv3 = nn.Conv2d(32, 32, (3, 3))\n",
    "        self.fc = nn.Linear(21632, 10)\n",
    "\n",
    "    def forward(self, x):\n",
    "        x = self.conv1(x)\n",
    "        x = x.relu()\n",
    "        x = self.conv2(x)\n",
    "        x = x.relu()\n",
    "        x = self.conv3(x)\n",
    "        x = x.flatten(1)\n",
    "        x = self.fc(x)\n",
    "        return x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And let\\'s generate some random data\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "x_train = torch.randn(20, 3, 32, 32, device=device)\n",
    "x_test = torch.randn(5, 3, 32, 32, device=device)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a function version of the model\n",
    "======================================\n",
    "\n",
    "`torch.func` transforms operate on functions. In particular, to compute\n",
    "the NTK, we will need a function that accepts the parameters of the\n",
    "model and a single input (as opposed to a batch of inputs!) and returns\n",
    "a single output.\n",
    "\n",
    "We\\'ll use `torch.func.functional_call`, which allows us to call an\n",
    "`nn.Module` using different parameters/buffers, to help accomplish the\n",
    "first step.\n",
    "\n",
    "Keep in mind that the model was originally written to accept a batch of\n",
    "input data points. In our CNN example, there are no inter-batch\n",
    "operations. That is, each data point in the batch is independent of\n",
    "other data points. With this assumption in mind, we can easily generate\n",
    "a function that evaluates the model on a single data point:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "net = CNN().to(device)\n",
    "\n",
    "# Detaching the parameters because we won't be calling Tensor.backward().\n",
    "params = {k: v.detach() for k, v in net.named_parameters()}\n",
    "\n",
    "def fnet_single(params, x):\n",
    "    return functional_call(net, params, (x.unsqueeze(0),)).squeeze(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute the NTK: method 1 (Jacobian contraction)\n",
    "================================================\n",
    "\n",
    "We\\'re ready to compute the empirical NTK. The empirical NTK for two\n",
    "data points $x_1$ and $x_2$ is defined as the matrix product between the\n",
    "Jacobian of the model evaluated at $x_1$ and the Jacobian of the model\n",
    "evaluated at $x_2$:\n",
    "\n",
    "$$J_{net}(x_1) J_{net}^T(x_2)$$\n",
    "\n",
    "In the batched case where $x_1$ is a batch of data points and $x_2$ is a\n",
    "batch of data points, then we want the matrix product between the\n",
    "Jacobians of all combinations of data points from $x_1$ and $x_2$.\n",
    "\n",
    "The first method consists of doing just that - computing the two\n",
    "Jacobians, and contracting them. Here\\'s how to compute the NTK in the\n",
    "batched case:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def empirical_ntk_jacobian_contraction(fnet_single, params, x1, x2):\n",
    "    # Compute J(x1)\n",
    "    jac1 = vmap(jacrev(fnet_single), (None, 0))(params, x1)\n",
    "    jac1 = jac1.values()\n",
    "    jac1 = [j.flatten(2) for j in jac1]\n",
    "\n",
    "    # Compute J(x2)\n",
    "    jac2 = vmap(jacrev(fnet_single), (None, 0))(params, x2)\n",
    "    jac2 = jac2.values()\n",
    "    jac2 = [j.flatten(2) for j in jac2]\n",
    "\n",
    "    # Compute J(x1) @ J(x2).T\n",
    "    result = torch.stack([torch.einsum('Naf,Mbf->NMab', j1, j2) for j1, j2 in zip(jac1, jac2)])\n",
    "    result = result.sum(0)\n",
    "    return result\n",
    "\n",
    "result = empirical_ntk_jacobian_contraction(fnet_single, params, x_train, x_test)\n",
    "print(result.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In some cases, you may only want the diagonal or the trace of this\n",
    "quantity, especially if you know beforehand that the network\n",
    "architecture results in an NTK where the non-diagonal elements can be\n",
    "approximated by zero. It\\'s easy to adjust the above function to do\n",
    "that:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def empirical_ntk_jacobian_contraction(fnet_single, params, x1, x2, compute='full'):\n",
    "    # Compute J(x1)\n",
    "    jac1 = vmap(jacrev(fnet_single), (None, 0))(params, x1)\n",
    "    jac1 = jac1.values()\n",
    "    jac1 = [j.flatten(2) for j in jac1]\n",
    "\n",
    "    # Compute J(x2)\n",
    "    jac2 = vmap(jacrev(fnet_single), (None, 0))(params, x2)\n",
    "    jac2 = jac2.values()\n",
    "    jac2 = [j.flatten(2) for j in jac2]\n",
    "\n",
    "    # Compute J(x1) @ J(x2).T\n",
    "    einsum_expr = None\n",
    "    if compute == 'full':\n",
    "        einsum_expr = 'Naf,Mbf->NMab'\n",
    "    elif compute == 'trace':\n",
    "        einsum_expr = 'Naf,Maf->NM'\n",
    "    elif compute == 'diagonal':\n",
    "        einsum_expr = 'Naf,Maf->NMa'\n",
    "    else:\n",
    "        assert False\n",
    "\n",
    "    result = torch.stack([torch.einsum(einsum_expr, j1, j2) for j1, j2 in zip(jac1, jac2)])\n",
    "    result = result.sum(0)\n",
    "    return result\n",
    "\n",
    "result = empirical_ntk_jacobian_contraction(fnet_single, params, x_train, x_test, 'trace')\n",
    "print(result.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The asymptotic time complexity of this method is $N O [FP]$ (time to\n",
    "compute the Jacobians) + $N^2 O^2 P$ (time to contract the Jacobians),\n",
    "where $N$ is the batch size of $x_1$ and $x_2$, $O$ is the model\\'s\n",
    "output size, $P$ is the total number of parameters, and $[FP]$ is the\n",
    "cost of a single forward pass through the model. See section 3.2 in\n",
    "[Fast Finite Width Neural Tangent\n",
    "Kernel](https://arxiv.org/abs/2206.08720) for details.\n",
    "\n",
    "Compute the NTK: method 2 (NTK-vector products)\n",
    "===============================================\n",
    "\n",
    "The next method we will discuss is a way to compute the NTK using\n",
    "NTK-vector products.\n",
    "\n",
    "This method reformulates NTK as a stack of NTK-vector products applied\n",
    "to columns of an identity matrix $I_O$ of size $O\\times O$ (where $O$ is\n",
    "the output size of the model):\n",
    "\n",
    "$$J_{net}(x_1) J_{net}^T(x_2) = J_{net}(x_1) J_{net}^T(x_2) I_{O} = \\left[J_{net}(x_1) \\left[J_{net}^T(x_2) e_o\\right]\\right]_{o=1}^{O},$$\n",
    "\n",
    "where $e_o\\in \\mathbb{R}^O$ are column vectors of the identity matrix\n",
    "$I_O$.\n",
    "\n",
    "-   Let $\\textrm{vjp}_o = J_{net}^T(x_2) e_o$. We can use a\n",
    "    vector-Jacobian product to compute this.\n",
    "-   Now, consider $J_{net}(x_1) \\textrm{vjp}_o$. This is a\n",
    "    Jacobian-vector product!\n",
    "-   Finally, we can run the above computation in parallel over all\n",
    "    columns $e_o$ of $I_O$ using `vmap`.\n",
    "\n",
    "This suggests that we can use a combination of reverse-mode AD (to\n",
    "compute the vector-Jacobian product) and forward-mode AD (to compute the\n",
    "Jacobian-vector product) to compute the NTK.\n",
    "\n",
    "Let\\'s code that up:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def empirical_ntk_ntk_vps(func, params, x1, x2, compute='full'):\n",
    "    def get_ntk(x1, x2):\n",
    "        def func_x1(params):\n",
    "            return func(params, x1)\n",
    "\n",
    "        def func_x2(params):\n",
    "            return func(params, x2)\n",
    "\n",
    "        output, vjp_fn = vjp(func_x1, params)\n",
    "\n",
    "        def get_ntk_slice(vec):\n",
    "            # This computes ``vec @ J(x2).T``\n",
    "            # `vec` is some unit vector (a single slice of the Identity matrix)\n",
    "            vjps = vjp_fn(vec)\n",
    "            # This computes ``J(X1) @ vjps``\n",
    "            _, jvps = jvp(func_x2, (params,), vjps)\n",
    "            return jvps\n",
    "\n",
    "        # Here's our identity matrix\n",
    "        basis = torch.eye(output.numel(), dtype=output.dtype, device=output.device).view(output.numel(), -1)\n",
    "        return vmap(get_ntk_slice)(basis)\n",
    "\n",
    "    # ``get_ntk(x1, x2)`` computes the NTK for a single data point x1, x2\n",
    "    # Since the x1, x2 inputs to ``empirical_ntk_ntk_vps`` are batched,\n",
    "    # we actually wish to compute the NTK between every pair of data points\n",
    "    # between {x1} and {x2}. That's what the ``vmaps`` here do.\n",
    "    result = vmap(vmap(get_ntk, (None, 0)), (0, None))(x1, x2)\n",
    "\n",
    "    if compute == 'full':\n",
    "        return result\n",
    "    if compute == 'trace':\n",
    "        return torch.einsum('NMKK->NM', result)\n",
    "    if compute == 'diagonal':\n",
    "        return torch.einsum('NMKK->NMK', result)\n",
    "\n",
    "# Disable TensorFloat-32 for convolutions on Ampere+ GPUs to sacrifice performance in favor of accuracy\n",
    "with torch.backends.cudnn.flags(allow_tf32=False):\n",
    "    result_from_jacobian_contraction = empirical_ntk_jacobian_contraction(fnet_single, params, x_test, x_train)\n",
    "    result_from_ntk_vps = empirical_ntk_ntk_vps(fnet_single, params, x_test, x_train)\n",
    "\n",
    "assert torch.allclose(result_from_jacobian_contraction, result_from_ntk_vps, atol=1e-5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our code for `empirical_ntk_ntk_vps` looks like a direct translation\n",
    "from the math above! This showcases the power of function transforms:\n",
    "good luck trying to write an efficient version of the above by only\n",
    "using `torch.autograd.grad`.\n",
    "\n",
    "The asymptotic time complexity of this method is $N^2 O [FP]$, where $N$\n",
    "is the batch size of $x_1$ and $x_2$, $O$ is the model\\'s output size,\n",
    "and $[FP]$ is the cost of a single forward pass through the model. Hence\n",
    "this method performs more forward passes through the network than method\n",
    "1, Jacobian contraction ($N^2 O$ instead of $N O$), but avoids the\n",
    "contraction cost altogether (no $N^2 O^2 P$ term, where $P$ is the total\n",
    "number of model\\'s parameters). Therefore, this method is preferable\n",
    "when $O P$ is large relative to $[FP]$, such as fully-connected (not\n",
    "convolutional) models with many outputs $O$. Memory-wise, both methods\n",
    "should be comparable. See section 3.3 in [Fast Finite Width Neural\n",
    "Tangent Kernel](https://arxiv.org/abs/2206.08720) for details.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}