Lecture by Cengiz Pehlevan at Harvard University. My personal takeaway from auditing the presented content.
Course overview at https://klab.tch.harvard.edu/academia/classes/BAI/bai.html
Biological and Artificial Intelligence
Inductive bias of neural networks
A brain can be understood as a network with about 10^11 neurons (nodes) and 10^14 synapses (parameters). Geoffrey Hinton cleverly observed that “The brain has about 10^14 synapses and we only live for about 10^9 seconds. So we have a lot more parameters than (supervised, C.P.) data.” But biologist Anthony Zador argues that animal behaviour is not the result of learning algorithms (supervised or unsupervised) but is encoded in the genome. At birth, an animal’s structured brain connectivity enables it to learn very rapidly.
Deep learning is modeled on brain functions. While we cannot (yet) answer why brains don’t overfit, we can maybe understand why modern deep learning networks with up to 10^11 parameters don’t overfit even when they have orders of magnitude more parameters than data. Double descent implies that beyond the interpolation threshold – roughly one parameter per data point – the test error actually starts to fall again.
Using the simplest possible architecture, we analyze how to map x -> y(x) with two hidden layers and 100 units per layer. We obtain about 10,000 parameters, which turns out to be heavily overparameterized. Any curve shape between two data points would be possible, but only something close to a straight line is produced. It is as if the neural network applied Occam’s Razor. It seems that neural networks are strongly biased towards simple functions.
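A minimal sketch of this toy experiment as I understood it (my own code, not the lecture’s; the data points and the Adam optimizer are my choices):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A handful of 1-D training points (made up for illustration).
x_train = torch.tensor([[-2.0], [-1.0], [0.5], [1.5], [2.0]])
y_train = torch.tensor([[-1.0], [0.0], [0.5], [-0.5], [1.0]])

# Two hidden layers with 100 units each: ~10,000 parameters for 5 data points.
model = nn.Sequential(
    nn.Linear(1, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 1),
)
print("parameters:", sum(p.numel() for p in model.parameters()))

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5000):
    opt.zero_grad()
    loss = ((model(x_train) - y_train) ** 2).mean()
    loss.backward()
    opt.step()
print("train loss:", loss.item())   # should be close to zero

# Evaluate on a dense grid: despite the overparameterization, the learned curve
# between training points typically comes out simple (close to piecewise linear).
x_grid = torch.linspace(-2.5, 2.5, 200).unsqueeze(1)
with torch.no_grad():
    y_grid = model(x_grid)
```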
- What are the inductive biases of overparametrized neural networks?
- What makes a function easy/hard to approximate? How many samples do you need?
- Can we have a theory that actually applies to real data?
- What are the signatures of inductive biases in natural population codes?
Goal of network training
Cost function for training: $\min_{\theta} \; \frac{1}{2} \sum_{\mu=1}^{P} \big( f(x^\mu;\theta) - f_T(x^\mu) \big)^2$
However, with more unknowns than equations, we end up with a hyperplane of possible solutions. Gradient descent lands us somewhere on this hyperplane and therefore produces a bias towards a specific point (based on the random initialization). To understand the bias, we need to understand the function space (what can the network express?), the loss function (how do we define a good match?) and the learning algorithm (how do we update?).
My own thoughts, which I still need to check for correctness: A neural network projects such a hyperplane into the output space, so the simplest/closest projection is probably a line or approximately a line. The answer is that this holds only for linear regression and special setups. The theta space is too complex; only in the weight space can we argue for linearization.
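A tiny linear toy example (my own, in numpy) of this initialization-dependent bias: with more parameters than data points, gradient descent converges to the zero-loss solution closest to the initialization, i.e. the correction it applies lies entirely in the row space of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
P, D = 5, 50                               # more parameters than data points
X = rng.standard_normal((P, D))
y = rng.standard_normal(P)

w0 = 0.1 * rng.standard_normal(D)          # random initialization
w = w0.copy()
for _ in range(50_000):                    # full-batch gradient descent on 0.5*||Xw - y||^2
    w -= 0.01 * X.T @ (X @ w - y)

# Among all zero-loss weights, gradient descent picks the one closest to w0:
# w* = w0 + X^T (X X^T)^{-1} (y - X w0).
w_closest = w0 + X.T @ np.linalg.solve(X @ X.T, y - X @ w0)

print("train error:", np.linalg.norm(X @ w - y))                          # ~0
print("matches closest-to-init solution:", np.allclose(w, w_closest, atol=1e-6))
```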
Can we simplify things to make this hard problem solvable?
Looking at an infinitely wide network, we can study the function space directly. Through the neural tangent kernel, we can see that most of the time the network produces something close to a line. The wider the network, the easier it is to fit the points starting from the random initialization, requiring only a minimal change in the weights. Looking at a Taylor expansion, we see that wide networks linearize under the gradient flow induced by the loss.
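In symbols (my summary of the standard NTK argument, not a formula from my notes): around its initialization $\theta_0$, a very wide network is well approximated by its first-order Taylor expansion in the weights,

$$ f(x;\theta) \approx f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top} (\theta - \theta_0), $$

so gradient-flow training on the squared loss behaves like kernel regression with the neural tangent kernel $K(x,x') = \nabla_\theta f(x;\theta_0) \cdot \nabla_\theta f(x';\theta_0)$.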
Kernel Regression
The goal is to learn a function f: X -> R from a finite number of observations. The function f lives in a reproducing kernel Hilbert space (RKHS) – essentially a special kind of smooth function space equipped with an inner product. The regression is then a minimizer of the squared loss plus a lambda times an inner-product (RKHS norm) term that penalizes complex functions. In the RKHS there is a unique solution, and it matches what an infinite-width neural network trained to zero training error on the quadratic loss converges to. Kernel regression is easier to study than neural networks and can shed light on how neural networks work. The eigenfunctions of the kernel form an orthonormal basis (analogous to eigenvectors) for the RKHS.
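A minimal numpy sketch of this kind of kernel (ridge) regression, using an RBF kernel and a made-up 1-D dataset (my example, not the lecture’s):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K_ij = exp(-gamma * ||a_i - b_j||^2)
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))                 # 20 observations
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)

lam = 1e-3                                           # lambda: penalty on the RKHS norm
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Representer theorem: the unique minimizer is f(x) = sum_mu alpha_mu k(x, x_mu).
X_test = np.linspace(-3, 3, 200)[:, None]
f_test = rbf_kernel(X_test, X) @ alpha

print("train RMSE:", np.sqrt(np.mean((K @ alpha - y) ** 2)))
```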
The functions that a neural network can express in the infinite-width scenario are part of an RKHS. Given this function space, we can think of an infinite-width neural network as a fixed set of eigenfunctions, with training selecting particular weights on them (see the decomposition below).
Own thought: The eigenfunctions are fixed and the weights can be learned like in a perceptron. Can we stack such eigenfunction layers and get something new, or is that just linearizable as well? The answer is that it linearizes again!
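Written out (standard Mercer/RKHS notation, my addition): the kernel decomposes into eigenvalues and eigenfunctions, and any function the infinite-width network expresses is a weighted sum of those fixed eigenfunctions,

$$ K(x,x') = \sum_k \lambda_k \, \phi_k(x)\, \phi_k(x'), \qquad f(x) = \sum_k w_k \, \phi_k(x), \qquad \|f\|_{\mathcal H}^2 = \sum_k \frac{w_k^2}{\lambda_k}, $$

so learning only chooses the weights $w_k$, and the RKHS norm penalizes weight placed on eigenfunctions with small eigenvalues.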
Application to real datasets
Image datasets can be reduced with kernel PCA: take the kernel eigenvalues from KPCA and project the one-hot encoded target values onto the kernel eigenbasis to get the target weights.
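A rough numpy sketch of this step on stand-in data (my own; the real pipeline uses image datasets, and details such as kernel choice or centering may differ from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for an image dataset: 200 samples, 30 features, 3 classes.
X = rng.standard_normal((200, 30))
labels = rng.integers(0, 3, size=200)
Y = np.eye(3)[labels]                        # one-hot encoded targets, shape (200, 3)

# Kernel (Gram) matrix with an RBF kernel.
d2 = (X**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2 * X @ X.T
K = np.exp(-0.05 * d2)

# Kernel eigenvalues / eigenvectors (the KPCA spectrum), sorted in descending order.
eigvals, eigvecs = np.linalg.eigh(K)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Target weights: projection of the one-hot targets onto the kernel eigenbasis.
W = eigvecs.T @ Y                            # row k = weight of eigenmode k per class
```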
Applying the generalization-error theory based on the eigenfunctions, we can look at the relative error of the network’s weights compared to the weights in the eigenfunction basis and obtain a spectral bias that tells us which eigenfunctions are primarily selected. The larger the spectral bias towards a particular eigenfunction, the more the network relies on that eigenfunction to produce its output.
We can use KPCA to understand how the data is split and how many eigenfunctions (KPCA principal components) we need to discriminate between outputs.
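Continuing the same toy setup, one simple way (my own measure, not necessarily the lecture’s exact metric) to ask “how many eigenfunctions do we need” is to look at the cumulative fraction of the target power captured by the top-k kernel eigenmodes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))
labels = rng.integers(0, 3, size=200)
Y = np.eye(3)[labels]                            # one-hot targets

d2 = (X**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2 * X @ X.T
K = np.exp(-0.05 * d2)
eigvecs = np.linalg.eigh(K)[1][:, ::-1]          # eigenvectors, descending eigenvalue order

W = eigvecs.T @ Y                                # target weights per eigenmode
power = (W**2).sum(axis=1)                       # target power carried by each mode
cum_frac = np.cumsum(power) / power.sum()

k90 = int(np.searchsorted(cum_frac, 0.9)) + 1
print(f"top {k90} of {len(power)} eigenmodes capture 90% of the target power")
```

On structured data (e.g. real images with a well-matched kernel) this count is typically much smaller than on the random stand-in data above, which is one way to read off how aligned the task is with the kernel’s leading eigenfunctions.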