Neuro 140/240 – Lecture 5

Lecture by Tomer Ullman at Harvard University. My personal takeaway on auditing the presented content.

Course overview at

Biological and Artificial Intelligence

The development of intuitive physics and intuitive psychology

Turing proposed that an AI could be developed much like a human – starting from an empty notebook, or a child, and maturing into a developed adult. However, early-development research in psychology and cognitive science has shown that the notebook may not be “empty”. There is some core knowledge that seems to be either innate to humans or very early developing, though the notion is generally contested and still under research.

Evolutionarily, it may make sense to kickstart a “new” being with some innate knowledge to give it a head start instead of having it acquire all knowledge on its own. Core knowledge appears limited to a few domains. In core physics knowledge, infants have expectations about objects – among them permanence, cohesion, solidity, smooth paths, and contact causality – and not much beyond these principles. At the moment, the bound on these core expectations seems to reflect limits on how we can probe infants’ knowledge rather than limits on what they know. Research is actively conducted to find a lower limit.

Side note: For preverbal infants, surprise is measured by looking time, but this can be confounded with looking at things or people they are attached to (like parents).

Alternatives to Core Physics?

Physical Reasoning Systems

There could be physical reasoning systems (Luo & Baillargeon, 2005) in which visually observable features are evaluated to decide what physical outcome will happen. In infants, this reasoning system appears to be refined over development. A feedforward deep network by Lerer et al. (2016) was trained to evaluate whether a stack of blocks is stable, but the system did not generalize.

Since 2010, a cognitive revolution of sorts has happened in neural network architecture: decoders, LSTMs, memory, and attention have become “off the shelf” components. Using these systems (Piloto et al., 2018), it is possible to generalize better (51% success classifying surprise).

Mental Game Engine Proposal

Maybe the human brain works like a game engine that emulates physics to approximate reality (Battaglia et al., 2013, in PNAS). A minimal example consists of a model, test stimuli, and data. This is an ongoing area of research. A model of physics understanding at 4 months – built from approximate objects, dynamics, priors, re-sampling, and memory (Smith et al., 2019) – is used to predict the next state, which is then compared to the real next state. In this context, surprise can be defined as the difference between the prediction and the outcome.
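A minimal sketch of this surprise-as-prediction-error idea (this is a toy illustration, not the Smith et al. model; the constant-velocity "engine" and the function names are my own):

```python
# Toy "mental game engine": predict the next state of an object,
# compare to what is actually observed, and call the gap "surprise".

def predict_next(pos, vel, dt=1.0):
    """Noiseless constant-velocity mental simulation of an object."""
    return pos + vel * dt

def surprise(predicted, observed):
    """Surprise = magnitude of the prediction error."""
    return abs(predicted - observed)

# An object moving right at 1 unit per step:
pred = predict_next(pos=3.0, vel=1.0)   # the engine expects 4.0

print(surprise(pred, 4.0))   # expected outcome -> low surprise (0.0)
print(surprise(pred, 9.0))   # object "teleported" -> high surprise (5.0)
```

A richer version would run many noisy simulations and re-sample, but even this skeleton shows how a looking-time-style surprise signal can fall out of a forward model.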

Core Psychology

Mental planning engine proposal

There are also expectations about agents. Infants have ideas about agents’ goals, actions, and planning.


Modelling intelligence has many possible routes.

  1. Human brains could be the way to model intelligence.
  2. Intelligence can be modelled in another way and need not be human-like.
  3. A general/universal function approximator may eventually converge to human behaviour/ability.
  4. A general/universal function approximator actually represents human behaviour/ability.

The problem is that any input-output problem can be represented with a look-up table, which involves no intelligence. Many models may eventually end up in “look-up-table land”, where they don’t learn an actual model but only a simple look-up. Such models can be useful for some tasks, but they show no common sense and fail easily under variation.
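The contrast can be made concrete with a toy example (my own illustration; the task and names are invented):

```python
# "Look-up-table land": a memorized input->output table solves the
# training data perfectly but has no answer for unseen inputs,
# while even a crude model generalizes.

train = {1: 2, 2: 4, 3: 6}     # toy task: double the input

def lookup(x):
    return train.get(x)        # None for anything not memorized

def model(x):
    return 2 * x               # an actual (if trivial) model of the task

print(lookup(2), model(2))     # both correct on seen data: 4 4
print(lookup(7), model(7))     # look-up fails off the table: None 14
```

The look-up table matches the model exactly on the training set, so no amount of in-distribution testing distinguishes them; only variation does.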

At this time, reinforcement learning is only a solution insofar as evolution seems to have “used” it to produce human behaviour. But how that worked, and what conditions are needed to make it work, are still unknown, and thus reinforcement learning is still not the solution for obtaining human-like behaviour.

Neuro 140/240 – Lecture 4

Lecture by Cengiz Pehlevan at Harvard University. My personal takeaway on auditing the presented content.

Course overview at

Biological and Artificial Intelligence

Inductive bias of neural networks

A brain can be understood as a network with about 10^11 neurons (nodes) and 10^14 synapses (parameters). Geoffrey Hinton cleverly observed that “The brain has about 10^14 synapses and we only live for about 10^9 seconds. So we have a lot more parameters than (supervised, C.P.) data.” But biologist Anthony Zador argues that animal behaviour is not the result of learning algorithms (supervised or unsupervised) but is encoded in the genome. When born, an animal’s structured brain connectivity enables it to learn very rapidly.

Deep learning is modeled on brain function. While we cannot answer (yet) why brains don’t overfit, we can maybe understand why modern deep learning networks with up to 10^11 parameters don’t overfit even when they have orders of magnitude more parameters than data. Double descent implies that past the interpolation threshold – that is, one parameter per data point – the test error actually starts to fall again.

Using the simplest possible architecture, we analyze how to map x -> y(x) with two hidden layers and 100 units per layer. We obtain about 10,000 parameters, which turns out to be heavily overparameterized. Any curve between two data points would be possible, yet something close to a straight line is produced. It is as if the neural network applied Occam’s razor: neural networks seem strongly biased towards simple functions.
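The parameter count quoted above is easy to verify (assuming a scalar input and output around the two hidden layers of 100 units):

```python
# Count weights + biases for a fully connected network with the
# given layer sizes: 1 -> 100 -> 100 -> 1.

def n_params(sizes):
    """Each layer pair (a, b) contributes a*b weights and b biases."""
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))

print(n_params([1, 100, 100, 1]))   # -> 10401, i.e. ~10^4 parameters
```

With only a handful of training points, this is overparameterized by several orders of magnitude, which is exactly the regime the lecture discusses.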

  1. What are the inductive biases of overparametrized neural networks?
  2. What makes a function easy/hard to approximate? How many samples do you need?
  3. Can we have a theory that actually applies to real data?
  4. What are the signatures of inductive biases in natural population codes?

Goal of network Training

Cost function for training: min_θ ½ Σ_{μ=1}^{P} ( f(x^μ; θ) − f_T(x^μ) )²

However, with more unknowns than equations, we end up with a hyperplane of possible solutions. Gradient descent lands us somewhere on that hyperplane and therefore produces a bias towards a specific point (based on the random initialization). To understand the bias, we need to understand the function space (what can the network express?), the loss function (how do we define a good match?), and the learning algorithm (how do we update?).

My own thoughts that I need to check: a neural network projects such a hyperplane into the output space, so the simplest/closest projection is probably a line, or approximately a line. The answer: this holds only for linear regression and special setups. The theta space is too complex; only in the weight space can we argue for linearization.

Can we simplify things to make this hard problem tractable?

Looking at an infinitely wide network, we can examine the function space. Through the neural tangent kernel, we see that most of the time we produce something close to a line. The wider the network, the easier it is to fit the points from the random initialization, requiring only a minimal change in the weights. Looking at a Taylor expansion, we see that wide networks linearize in their parameters under the gradient flow of the loss.
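The linearization claim can be checked numerically on a toy network (my own construction, with a numerical gradient): for a small parameter update d, f(x; θ₀ + d) is well approximated by the first-order Taylor expansion around θ₀.

```python
import numpy as np

def f(x, w):
    """Toy 1-hidden-layer network with 3 units: w = [w1 (3), w2 (3)]."""
    w1, w2 = w[:3], w[3:]
    return w2 @ np.tanh(w1 * x)

def grad(x, w, eps=1e-6):
    """Central-difference gradient of f with respect to the parameters."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(x, w + e) - f(x, w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
w0 = rng.normal(size=6)
d = 1e-3 * rng.normal(size=6)          # a small parameter update

exact = f(0.7, w0 + d)
linear = f(0.7, w0) + grad(0.7, w0) @ d
print(abs(exact - linear))             # tiny: f is ~linear in w here
```

For genuinely wide networks the updates needed to fit the data shrink with width, so this linear regime holds throughout training; that is the neural tangent kernel picture.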

Kernel Regression

The goal is to learn a function f: X -> R from a finite number of observations. The function f lies in a reproducing kernel Hilbert space (RKHS) – essentially a special kind of smooth function space with an inner product. The regression then minimizes the data fit plus lambda times an inner-product (norm) term that penalizes complex functions. Under RKHS assumptions there is a unique solution, and it matches what infinite-width neural networks trained to zero training error under the quadratic loss produce. Kernel regression is easier to study than neural networks and can shed light on how they work. The eigenfunctions of the kernel form an orthonormal basis (like eigenvectors) of the RKHS.
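A minimal kernel regression sketch (my own example; the RKHS-penalized problem has the closed-form solution alpha = (K + λI)⁻¹ y, with predictions f(x) = Σᵢ alphaᵢ k(x, xᵢ)):

```python
import numpy as np

def rbf(a, b, width=1.0):
    """Gaussian (RBF) kernel matrix between two 1-D sample arrays."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * width ** 2))

# Toy data: noiseless samples of sin(x).
x_train = np.linspace(-3, 3, 20)
y_train = np.sin(x_train)

lam = 1e-3                                  # the complexity penalty lambda
K = rbf(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

def predict(x_new):
    """f(x) = sum_i alpha_i k(x, x_i)."""
    return rbf(x_new, x_train) @ alpha

print(predict(np.array([0.5])), np.sin(0.5))   # close to the true value
```

As lambda shrinks to zero this approaches kernel interpolation, the regime that mirrors infinite-width networks trained to zero training error.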

The functions that a neural network can express in the infinite-width scenario are part of an RKHS. Taking the space of these functions, we can say that a neural network of infinite width just takes the set of eigenfunctions and selects particular weights over them.

Own thought: the eigenfunctions are fixed, and the weights over them can be learned like in a perceptron. Can we stack layers of these eigenfunctions and get something new, or is that just linearizable as well? The answer: it linearizes again!

Application to real datasets

Image data sets can be reduced with kernel PCA. Take the kernel eigenvalues from KPCA and project the target values onto the kernel eigenbasis to get target weights for the one-hot encoding.
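The recipe above can be sketched in a few lines (my own toy data and a linear kernel for simplicity; real applications would use an image kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))            # 50 samples, 10 features
labels = rng.integers(0, 3, size=50)     # 3 classes
Y = np.eye(3)[labels]                    # one-hot targets, shape (50, 3)

# Kernel PCA: eigendecompose the centered kernel matrix.
K = X @ X.T                              # linear kernel for simplicity
H = np.eye(50) - np.ones((50, 50)) / 50  # centering matrix
Kc = H @ K @ H

eigvals, eigvecs = np.linalg.eigh(Kc)    # eigh returns ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending

# Project the one-hot targets onto the kernel eigenbasis.
weights = eigvecs.T @ Y
print(weights.shape)                     # (50, 3): one target weight per
                                         # eigenfunction and class
```

Large entries in `weights` on top eigenfunctions mean the task is aligned with the kernel's dominant modes, which is what the spectral-bias analysis below quantifies.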

Applying the generalization error based on the eigenfunctions, we can compare the network’s weights to the weights in eigenfunction space and compute a spectral bias that tells us which eigenfunctions are primarily selected. The larger the spectral bias, the more the network is likely to rely on that particular eigenfunction to produce the output.

We can use KPCA to understand how the data is split and how many eigenfunctions (KPCA principal components) we need to discriminate between outputs.

Neuro 140/240 – Lecture 2

Lecture by Richard Born at Harvard University. My personal takeaway on auditing the presented content.

Course overview at

Biological and Artificial Intelligence

Warren Weaver headed the natural sciences division of the Rockefeller Foundation, and in the 1950s he said the future of engineering is to understand the tricks that nature has come up with over the millennia.

Anatomy of visual pathways

The visual system spans large areas of the brain, and damage almost anywhere in the brain often causes malfunctions of the visual system.

The world is mirrored on the retina. About one million axons leave the retina. Vision is also connected to the brainstem to orient the head in space – a semi-automatic system for paying attention. It is also connected to the circadian rhythm to manage the sleep cycle through brightness.

The important region in primates is V1, the striate cortex (area 17). A lesion here makes humans blind. A brain has no direct visual access to the world; it only produces action potentials that are interpreted as vision. In monkeys, there are more than 30 visual areas, roughly grouped into two streams. The ventral stream (lower) is concerned with the “what” (object recognition) and the dorsal stream (upper) with the “where” (spatial perception). Retinotopic representations are aligned with retinal space, but object recognition ought to be object-centred; how the brain converts this to world coordinates is still an open question. Mishkin showed in 1983 that monkeys taught to associate food with a specific object or with a specific location solve the task at random if they have a lesion in the respective brain area.

Receptive Fields

A receptive field is the region of sensory epithelium that can influence a given neuron’s firing rate. Hubel and Wiesel showed that neurons in the lateral geniculate nucleus (LGN) are triggered when their receptive field is excited by a light signal. Hartline showed that surround suppression helps to locate points of interest: the brain is interested in points in visual space where the derivative is not zero. Brains locate contrast (space), color contrast (wavelength), transience (time), motion (space & time), and space & color.
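Surround suppression can be sketched with a 1-D difference-of-Gaussians (center minus surround) filter (my own toy construction): it ignores uniform light and fires where the image changes, i.e. where the derivative is non-zero.

```python
import numpy as np

def dog_filter(size=21, c=1.0, s=3.0):
    """Difference of Gaussians: narrow excitatory center minus a
    broad inhibitory surround; the weights sum to ~0."""
    x = np.arange(size) - size // 2
    center = np.exp(-x**2 / (2 * c**2)) / (c * np.sqrt(2 * np.pi))
    surround = np.exp(-x**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
    return center - surround

dog = dog_filter()
flat = np.ones(100)                                  # uniform illumination
edge = np.concatenate([np.zeros(50), np.ones(50)])   # a luminance step

r_flat = np.convolve(flat, dog, mode="valid")
r_edge = np.convolve(edge, dog, mode="valid")

print(np.abs(r_flat).max())   # ~0: no response to uniform light
print(np.abs(r_edge).max())   # >0: strong response around the edge
```

This is the sense in which the retina and LGN act as derivative detectors: constant input cancels between center and surround, while a contrast boundary does not.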

Hierarchical receptive fields

There is a hierarchical elaboration of receptive fields. Hubel & Wiesel also measured signals in the primary visual cortex and found neurons that encode the orientation of an edge, with a stronger off-response on one side but no response to diffuse illumination. Essentially, we can think of these neurons as filters or convolutions (a simplification). A brain does this in parallel, in contrast to a computer. Horace Barlow noted that the brain focuses on suspicious coincidences (e.g. unusual changes).

We go from LGN (center-surround) to simple cells (orientation) to complex cells (contrast-invariant across an area, i.e. pooling/softmax). In the 1950s, the psychologist Attneave marked the points of maximal curvature on an image of a cat and connected them with lines, producing an abstract representation that was still recognizable as a cat.
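The simple-to-complex step can be sketched as a convolution followed by pooling (my own toy code; real simple cells are of course far richer):

```python
import numpy as np

# An oriented "simple cell" filter detects a vertical edge at a precise
# position; max-pooling over positions (a "complex cell") keeps the
# orientation signal but discards the exact location.

vertical = np.array([[-1.0, 0.0, 1.0]] * 3)    # vertical-edge filter

def simple_cell(img, filt):
    """Valid 2-D cross-correlation followed by rectification."""
    fh, fw = filt.shape
    h, w = img.shape
    out = np.zeros((h - fh + 1, w - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + fh, j:j + fw] * filt)
    return np.maximum(out, 0.0)                # ReLU-like rectification

def complex_cell(responses):
    """Pool over positions: fire if the feature appears anywhere."""
    return responses.max()

img1 = np.zeros((8, 8)); img1[:, 5:] = 1.0     # vertical edge at column 5
img2 = np.zeros((8, 8)); img2[:, 3:] = 1.0     # same edge, shifted left

r1 = complex_cell(simple_cell(img1, vertical))
r2 = complex_cell(simple_cell(img2, vertical))
print(r1 == r2)                                # True: position-invariant
```

This alternation of selectivity (the filter) and invariance (the pooling) is exactly the motif that convolutional networks, discussed next, stack many times over.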

Convolutional Neural Networks

An engineered alternation of selectivity (convolution) and generalization (pooling) led to early success in vision research, but then came deep networks. However, deep networks do in fact apply convolutions, rectification (ReLU), pooling, and lastly normalization. Learning these features rather than hand-engineering them improved performance.

Yamins et al. (2014) showed that AlexNet has some non-trivial similarity with monkey ventral stream visual areas.

What is missing?

When noise is added to images, CNNs fail quickly at around 20% noise, whereas human performance degrades gracefully with the level of noise. Even worse, CNNs can learn solutions for one specific kind of noise but end up failing if the noise changes.

In Ponce et al. (2019), random codes are fed to a generative neural network to synthesize images. A neuron’s response is used as an objective function to rank the synthesized images, and a genetic algorithm is applied to find the codes that maximally drive the neuron.

PredNet predicts future frames of video streams with unsupervised learning.

Tootell and Born showed in 1990 that the visual cortex is still very retinotopic, but MT is organized in hypercolumns that detect motion direction rather than preserving spatial contiguity.

Neurons near each other seem to like to do the same thing. Brains are not just look-up tables but have a semantic structure in their spatial organisation.