Training a neural network with gradient descent is equivalent to kernel gradient descent in function space with respect to the neural tangent kernel. Moreover, in the infinite-width limit this kernel becomes a constant, deterministic, limiting kernel.

Being able to analyze the training process as kernel gradient descent enables many useful analyses. For instance, the theoretical motivation for using Fourier features in deep learning is based on the eigendecomposition of the NTK matrix [1].

There are two key points:

  1. The training dynamics of gradient descent on the parameters are equivalent to the training dynamics of kernel gradient descent in function space.
  2. There is a limiting kernel as the number of random functions $P \rightarrow \infty$ (in the random function approximation case), or as the layer widths $n_1, n_2, \ldots, n_{L-1} \rightarrow \infty$ in the case of a deep neural network with Lipschitz activation $\sigma$ and $L$ layers.

Kernel Regression

Consider a function $f$ defined as a weighted sum of kernels centered at the points $x_1, x_2, \ldots$, the inputs of our dataset:

\[f(x) = \sum_i K(x_i, x)\, \alpha(x_i)\]

The input dimension is $n_0$, and $n_L$ is the output dimension of the final layer.

$f, \alpha: \mathbb{R}^{n_0} \rightarrow \mathbb{R}^{n_L}$, $K: \mathbb{R}^{n_0} \times \mathbb{R}^{n_0} \rightarrow \mathbb{R}^{n_L \times n_L}$

Such a function $f$ is said to be an element of a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_K$.
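
To make the definition concrete, here is a minimal sketch of such a kernel predictor in JAX, assuming a scalar output ($n_L = 1$) and an RBF kernel; the kernel choice, the `bandwidth` parameter, and the coefficients `alphas` are illustrative assumptions rather than anything prescribed by the setup above.

```python
import jax.numpy as jnp

def rbf_kernel(x1, x2, bandwidth=1.0):
    """Stand-in positive-definite kernel K(x1, x2); any valid kernel works here."""
    return jnp.exp(-jnp.sum((x1 - x2) ** 2) / (2.0 * bandwidth ** 2))

def kernel_predictor(x, xs, alphas, kernel=rbf_kernel):
    """f(x) = sum_i K(x_i, x) * alpha_i, an element of the RKHS H_K."""
    weights = jnp.array([kernel(xi, x) for xi in xs])  # K(x_i, x) for each data point
    return jnp.dot(weights, alphas)

# Toy usage: five inputs in R^{n_0} with n_0 = 2 and scalar coefficients alpha_i.
xs = jnp.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
alphas = jnp.array([0.2, -0.1, 0.3, 0.05, -0.4])
print(kernel_predictor(jnp.array([0.25, 0.75]), xs, alphas))
```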

Kernel Gradient

In kernel regression, instead of performing gradient descent by updating the parameters with their partial derivatives, $\theta_{t + 1} = \theta_t - \eta \nabla_\theta L(x, y; \theta_t, f)$, we take steps directly in function space, $f_{t + 1} = f_t - \eta \nabla_K C\vert_{f_t}$, independently of how we parametrize the function $f$, where $C$ is a convex cost functional. The kernel gradient depends on the cost functional $C$, the kernel $K$, and the current function $f_t$.

Consider a cost functional on the function space, $C: \mathcal{H}_K \rightarrow \mathbb{R}$. Then $C(f)$ is analogous to the loss function, evaluating $f$ over the finite dataset $x_1, \ldots, x_N \in \mathbb{R}^{n_0}$.

The kernel gradient for kernel $K$ is written $\nabla_K C$ and is obtained via a mapping $\Phi_K: \mathcal{H}_K^* \rightarrow \mathcal{H}_K$. Evaluated at a function $f_0$, it is $\nabla_K C \vert_{f_0} = \Phi_K(\partial_{f}^{in} C\vert_{f_0})$, where $\partial_{f}^{in} C$ is the functional derivative of $C$ w.r.t. $f$.

Note that $\partial_{f}^{in} C\vert_{f_0} \in \mathcal{H}_K^*$, so we can represent it using an element of $\mathcal{H}_K$ as $\partial_{f}^{in} C\vert_{f_0} = \langle d_{f_0}, \cdot \rangle_{in}$. The map $\Phi_K$ is defined so that $\Phi_K(\partial_{f}^{in} C\vert_{f_0}) = \mathbb{E}_{x' \sim {in}}[K(\cdot, x')\, d_{f_0}(x')]$. Since we are working with the empirical distribution, this gives an explicit expression for the kernel gradient $\nabla_K C \vert_{f_0} \in \mathcal{H}_K$:

\[\nabla_K C \vert_{f_0} = \frac{1}{N} \sum_{i = 1}^N K(\cdot, x_i)\, d_{f_0} (x_i)\]

A quick note on notation: the ${in}$ label refers to the input distribution, which here is the empirical distribution defined by our training dataset, and $\langle f, g\rangle_{in}$ denotes $\mathbb{E}_{x \sim {in}}[f(x)^T g(x)] = \frac{1}{N} \sum_{i = 1}^N f(x_i)^T g(x_i)$.
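
As a concrete sketch, assume a squared cost $C(f) = \frac{1}{2}\langle f - f^*, f - f^*\rangle_{in}$, so that $d_{f_0} = f_0 - f^*$ is just the vector of residuals on the training set. The function below evaluates $\nabla_K C\vert_{f_0}$ at an arbitrary query point using the empirical formula above; the RBF kernel and the particular residual values are illustrative assumptions.

```python
import jax.numpy as jnp

def rbf_kernel(x1, x2, bandwidth=1.0):
    """Stand-in positive-definite kernel K(x1, x2)."""
    return jnp.exp(-jnp.sum((x1 - x2) ** 2) / (2.0 * bandwidth ** 2))

def kernel_gradient_at(x, xs, residuals, kernel=rbf_kernel):
    """grad_K C|_{f_0}(x) = (1/N) sum_i K(x, x_i) d_{f_0}(x_i),
    where d_{f_0}(x_i) = f_0(x_i) - y_i for the squared cost."""
    ks = jnp.array([kernel(x, xi) for xi in xs])
    return jnp.mean(ks * residuals)

xs = jnp.array([[0.0], [0.5], [1.0]])
residuals = jnp.array([0.1, -0.3, 0.2])  # d_{f_0}(x_i), assumed to be known
print(kernel_gradient_at(jnp.array([0.25]), xs, residuals))
```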

If we follow the update rule above, we obtain the following training dynamics in function space: \(\partial_t f_t = - \nabla_K C \vert_{f_t}\)
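
Following directly from these dynamics, the sketch below runs an Euler discretization of kernel gradient descent for the same squared cost, tracking $f_t$ only at the training points (which suffices there, since the update involves only the Gram matrix). The RBF kernel, step size, and toy data are assumptions for illustration.

```python
import jax.numpy as jnp

def rbf_gram(xs, bandwidth=1.0):
    """Gram matrix K(x_i, x_j) for an assumed RBF kernel."""
    sq_dists = jnp.sum((xs[:, None, :] - xs[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-sq_dists / (2.0 * bandwidth ** 2))

def kernel_gradient_descent(xs, ys, lr=1.0, steps=200):
    """Euler-discretized d/dt f_t = -grad_K C|_{f_t}, evaluated at the training points."""
    n = xs.shape[0]
    gram = rbf_gram(xs)            # K(x_i, x_j)
    f = jnp.zeros_like(ys)         # start from f_0 = 0 on the training set
    for _ in range(steps):
        d = f - ys                 # d_{f_t}(x_i) = f_t(x_i) - y_i
        f = f - lr * gram @ d / n  # f_{t+1} = f_t - eta * (1/N) * K * d_{f_t}
    return f

xs = jnp.array([[0.0], [0.5], [1.0], [1.5]])
ys = jnp.array([0.0, 0.8, 0.9, 0.1])
print(kernel_gradient_descent(xs, ys))  # approaches ys as the number of steps grows
```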

Equivalence with Gradient Descent and Limiting Kernel

The original paper demonstrates that the training dynamics are identical for gradient descent and kernel gradient descent in two cases:

  1. Random function approximation (a random function basis), with tangent kernel \(\tilde{K} = \frac{1}{P} \sum_{p = 1}^{P} f^{(p)} \otimes f^{(p)}\)
  2. The deep learning case, with a Lipschitz activation and an $L$-layer neural network. The kernel here is the eponymous neural tangent kernel (NTK): \(\Theta^{(L)}(\theta) = \sum_{p = 1}^P \partial_{\theta_p} F^{(L)}(\theta) \otimes \partial_{\theta_p} F^{(L)}(\theta)\)

The equivalence with kernel gradient descent follows from the definitions. The second part, proving that there is a constant limiting kernel, follows from the law of large numbers in case (1). In case (2), some additional machinery is needed, using Gaussian process results applied to neural networks, to show that the kernel becomes constant as the layer widths $n_1, \ldots, n_{L-1} \rightarrow \infty$. More thorough explanations are available elsewhere [2] [3].
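
One way to see the limiting behaviour empirically is to compute $\Theta^{(L)}(\theta)$ for a finite network as a Jacobian outer product and watch its value concentrate as the width grows. The sketch below does this with `jax.grad` for a two-layer ReLU network with scalar output under NTK-style $1/\sqrt{n}$ scaling; the architecture, widths, and inputs are assumptions chosen purely for illustration.

```python
import jax
import jax.numpy as jnp

def init_params(key, n0, width):
    """Two-layer network; weights are standard Gaussians, with the 1/sqrt(n)
    scaling applied in the forward pass (NTK parametrization)."""
    k1, k2 = jax.random.split(key)
    return {"W1": jax.random.normal(k1, (width, n0)),
            "W2": jax.random.normal(k2, (width,))}

def forward(params, x):
    """F^{(L)}(theta) evaluated at a single input x, with scalar output."""
    h = jax.nn.relu(params["W1"] @ x / jnp.sqrt(x.shape[0]))
    return params["W2"] @ h / jnp.sqrt(h.shape[0])

def empirical_ntk(params, x1, x2):
    """Theta(x1, x2) = sum_p d_theta_p F(x1) * d_theta_p F(x2)."""
    g1 = jax.grad(forward)(params, x1)  # gradients w.r.t. every parameter at x1
    g2 = jax.grad(forward)(params, x2)
    return sum(jnp.vdot(g1[k], g2[k]) for k in g1)

x1, x2 = jnp.ones(3), jnp.arange(3.0)
for width in (10, 100, 1000):
    vals = [float(empirical_ntk(init_params(jax.random.PRNGKey(s), 3, width), x1, x2))
            for s in range(3)]
    print(width, vals)  # the spread across seeds shrinks as the width grows
```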

References