Neuro 140/240 – Lecture 1

Lecture by Gabriel Kreiman at Harvard University. My personal takeaway on auditing the presented content.

Course overview at

Biological and Artificial Intelligence

Eventually, we will have ultraintelligent machines, that is, machines that are more intelligent than humans.

Going back to 1950, the Turing test was the first test of whether a machine is indistinguishable from a human. However, the test is limited to language interaction. Turing tests can be adapted to other modalities such as vision: an adapted Turing test for vision would allow any question to be asked about an image.

Intelligence is the greatest problem in science. Tomaso Poggio postulates that if we can understand the brain, we can understand intelligence. Consequently, we might become more intelligent ourselves and intractable problems might be resolved.

What can’t deep convolutional networks do?

An adversarial attack that adds a small amount of noise to an image can change the predicted outcome of a deep convolutional network. Such a network cannot pass the visual Turing test because a human would not be fooled by this simple manipulation.
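To make the idea concrete, here is a minimal sketch of an adversarial perturbation. It uses a toy linear classifier rather than a deep convolutional network, and the weights, the four-value "image" and the noise budget `epsilon` are all made up for illustration. The perturbation follows the spirit of the fast gradient sign method: every input value moves by at most `epsilon` in the direction that hurts the current prediction most.

```python
def predict(weights, bias, x):
    """Return class 1 if the linear score is positive, otherwise class 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def sign(value):
    """Return -1, 0 or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def adversarial_example(weights, x, label, epsilon):
    """Shift every input value by at most epsilon against the given label.

    For a linear model the gradient of the score with respect to the input
    is the weight vector, so the sign of each weight gives the direction.
    """
    direction = -1 if label == 1 else 1
    return [xi + direction * epsilon * sign(w) for w, xi in zip(weights, x)]


# Hypothetical, hand-picked values (not from the lecture).
weights = [0.5, -0.25, 0.75, -1.0]
bias = 0.0
x = [0.2, 0.1, 0.1, 0.1]
epsilon = 0.05

original = predict(weights, bias, x)
x_adv = adversarial_example(weights, x, original, epsilon)
perturbed = predict(weights, bias, x_adv)
print(original, perturbed)
```

No single value changes by more than `epsilon`, yet the predicted class flips. Real attacks on deep networks need the gradient of the network, but the principle is the same.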

Action recognition networks often rely on anything in the image except the action itself, exploiting biases in the data collection process to determine the outcome. Again, the network would fail the visual Turing test where a human would succeed.

The most powerful computational devices on Earth

The human brain is still the most powerful device because it generalises and can solve unknown tasks.

The Biophysics of computation (Neuroscience 101)

A neuron has dendrites that collect information, an internal function in the cell that sums the dendrites’ information, and an activation function that leads to an axon, which outputs a value. Axons are in turn connected to the dendrites of other neurons via synapses.

Source: Wikipedia
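The cartoon of a neuron described above maps directly onto the artificial neuron used in neural networks. The following is a minimal sketch of that cartoon, not a biophysical model: the weights, the bias and the choice of a sigmoid as activation function are assumptions made for illustration.

```python
import math


def neuron(inputs, weights, bias):
    """Sum the weighted inputs (dendrites) and apply a sigmoid activation."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))


# Three dendritic inputs with hypothetical synaptic weights.
hidden = neuron([1.0, 0.0, 1.0], [0.5, -0.3, 0.8], bias=-1.0)

# The axon output of the first neuron becomes the input of a second one,
# which plays the role of a synapse between the two.
output = neuron([hidden], [2.0], bias=-1.0)
print(hidden, output)
```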

Studying animals is critical not only for behaviour but also to understand the underlying mechanisms. David Hubel and Torsten Wiesel placed electrodes in the back of the brain and found neurons that responded to the orientation of objects. Many computer scientists used their cartoons of brain function (and those of others) to design neural networks for AI. Indeed, nearly every type of neural network has an analogue in biology.

Disruptive Neuroscience

Circuit diagrams that show the full connectivity in a cubic micron of brain matter.

Listening to a concert of many neurons. We are now able to record from many neurons (tens of thousands) simultaneously over prolonged periods of time.

Causally interfering with neural activity. We are able to turn specific neurons on and off by triggering their ion channels with light.

Taken together, we now have a better understanding of the connectivity of neurons, the activity of neurons and even the manipulation of neurons. In the long run we may be able to understand biological intelligence.

Tangents to the topic

Consciousness is a matter of major debate. If a machine passes the Turing test, does it have consciousness? Christof Koch argues that consciousness and intelligence are separate phenomena.

Humans already form attachments to simple machines like Tamagotchis. The Atlas robot from Boston Dynamics is trained by being pushed over. Its human-like appearance raises the question of whether cruel behaviour towards machines is ethical.

The perils of AI are:

  1. Redistribution of jobs
  2. Unlikely terminator-like scenarios
  3. Military applications
  4. To err is algorithmic (just like humans)
  5. Biases in training data (note that humans have biases too or create them for the machine -> garbage in / garbage out)
  6. Lack of understanding (we still don’t understand how humans make decisions either)
  7. Social, mental, and political consequences of rapid changes in labour force
  8. Rapid growth, faster than development of regulations

But robots playing football are still years if not decades away from human-like behaviour.

Another challenge is comprehending humour. Humour is based on higher-level abstractions of content. Therefore, a system requires access to knowledge and must reason about the contents shown to infer the humorous component (e.g. a picture of Abraham Lincoln with an iPhone).

P.S.: Note from Lecture 2: Gabriel Kreiman believes that the next revolutions in machine learning will be based on something that we can learn from biological intelligence.

Geometry of Big Data – Monday session

All talks are summarised in my words which may not accurately represent the authors’ opinion. The focus is on aspects I found interesting. Please refer to the authors’ work for more details.

Session 1 – Learning DAGs

The talk DAGs with NO TEARS: Continuous Optimization for Structure Learning is given by Pradeep Ravikumar. A draft is available on arXiv.

Learning directed acyclic graphs (DAGs) can traditionally be done in two ways: conditional-independence-based and score-based. The latter poses a local search problem without a clear answer. More recently the problem has been posed as a continuous (global) optimisation for undirected graphs.

The loss function is the log-likelihood of the data, and we need to find the most appropriate weight matrix W such that X = XW + E, where E is the noise. They provide a new M-estimator.
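A small sketch may help to see what X = XW + E means in practice. It samples data from a known three-variable DAG, evaluates a least-squares loss (which matches the negative Gaussian log-likelihood up to constants when all noise variances are equal, an assumption made here), and computes the smooth acyclicity measure h(W) = tr(exp(W ∘ W)) − d as I recall it from the NOTEARS paper. The example graphs, the sample size and the truncation of the matrix exponential after d terms are my own choices for illustration, not the authors' implementation.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def acyclicity(w):
    """Approximate h(W) = tr(exp(W o W)) - d with the first d series terms.

    For a DAG the series terminates, so the result is exactly zero.
    Any cycle has length at most d, so a cyclic W gives a positive value.
    """
    d = len(w)
    squared = [[value * value for value in row] for row in w]
    power = [[float(i == j) for j in range(d)] for i in range(d)]
    trace_sum = float(d)
    for k in range(1, d + 1):
        power = matmul(power, squared)
        trace_sum += sum(power[i][i] for i in range(d)) / math.factorial(k)
    return trace_sum - d


def sample_linear_sem(w, n, rng):
    """Draw n rows of X = XW + E; assumes variables are topologically ordered."""
    d = len(w)
    rows = []
    for _ in range(n):
        row = [0.0] * d
        for j in range(d):
            row[j] = sum(row[i] * w[i][j] for i in range(j)) + rng.gauss(0, 1)
        rows.append(row)
    return rows


def least_squares_loss(x, w):
    """Return ||X - XW||^2 / (2n)."""
    xw = matmul(x, w)
    n = len(x)
    return sum((x[r][j] - xw[r][j]) ** 2
               for r in range(n) for j in range(len(w))) / (2 * n)


w_dag = [[0.0, 2.0, 0.0], [0.0, 0.0, -1.5], [0.0, 0.0, 0.0]]     # 0 -> 1 -> 2
w_cycle = [[0.0, 2.0, 0.0], [0.0, 0.0, -1.5], [0.5, 0.0, 0.0]]   # adds 2 -> 0

x = sample_linear_sem(w_dag, 500, random.Random(0))
loss_true = least_squares_loss(x, w_dag)
loss_empty = least_squares_loss(x, [[0.0] * 3 for _ in range(3)])
print(acyclicity(w_dag), acyclicity(w_cycle), loss_true, loss_empty)
```

The true graph gives a much lower loss than the empty graph, and h is zero only for the acyclic matrix. Minimising the loss subject to h(W) = 0 is what turns structure learning into a continuous optimisation problem.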

Session 2 – Parallel transport for data alignment

The talk Data Analysis with the Riemannian Geometry of Symmetric Positive-Definite Matrices is given by Ronan Talmon. A draft is available on arXiv.

The talk focuses on how to align data when the intersubject variation is large but consistent and the intrasubject variation can be mapped. Parallel transport aims to align the intersubject values on a symmetric positive-definite (SPD) embedding in n-dimensional space. SPD matrices are embedded on a hyperboloid and all computations can be performed in closed form.

For data from multiple subjects and multiple sessions, it does not matter whether the sessions or the subjects are adapted first. This holds only for parallel transport and not for identity transformations.

Session 3 – Persistence framework for data analysis

The talk Metric learning for persistence-based summaries and application to graph classification is given by Yusu Wang. An underlying paper is available on PLOS ONE.

Persistence diagrams can be used to describe complexity. The features are simpler than the underlying object but persistent. Viewing a geometric object through a filtration produces a summary. A filtration is a growing sequence of spaces. The times at which sets are created and destroyed can be mapped onto a persistence diagram with death time on the y-axis and birth time on the x-axis.
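As a small illustration of births and deaths in a filtration, the sketch below computes the persistence pairs of connected components for a function sampled along a line, growing the space by raising a threshold (a sublevel-set filtration). The sample values are made up, and this is the textbook union-find approach for the simplest case, not the method from the talk.

```python
def sublevel_persistence(values):
    """Return (birth, death) pairs of components for values on a line.

    Points enter the filtration in order of increasing value. A component is
    born at a local minimum and dies when it merges into an older component.
    """
    n = len(values)
    parent = [None] * n          # None marks points not yet in the filtration
    birth = {}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = []
    for i in sorted(range(n), key=lambda k: values[k]):
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                root_i, root_j = find(i), find(j)
                if root_i == root_j:
                    continue
                # Elder rule: the younger component dies at the merge.
                if birth[root_i] > birth[root_j]:
                    young, old = root_i, root_j
                else:
                    young, old = root_j, root_i
                if birth[young] < values[i]:
                    pairs.append((birth[young], values[i]))
                parent[young] = old
    # The oldest component never dies.
    pairs.append((min(values), float("inf")))
    return sorted(pairs)


samples = [1.0, 3.0, 0.0, 2.0, 0.5, 4.0]   # hypothetical function values
diagram = sublevel_persistence(samples)
print(diagram)
```

Each pair is one point of the persistence diagram, with the birth time on the x-axis and the death time on the y-axis.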

The bottleneck distance is based on a matching between two persistence diagrams: among all matchings, it takes the one in which the largest distance between matched features is as small as possible. Features may be matched to a zero-feature on the diagonal (capturing noise) if they are too close to the diagonal. More complex approaches include persistence images, which transform the diagram into a kernel density.
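The matching behind the bottleneck distance can be sketched by brute force for very small diagrams. The version below tries every matching, allows each feature to be matched to the diagonal instead, and uses the max-norm between points, which is the usual convention as far as I know. The two example diagrams are invented, and trying all permutations is only feasible for a handful of points.

```python
from itertools import permutations


def bottleneck_distance(diagram_a, diagram_b):
    """Brute-force bottleneck distance between two small persistence diagrams."""
    # Pad each side with diagonal slots (None) so any feature can stay unmatched.
    left = list(diagram_a) + [None] * len(diagram_b)
    right = list(diagram_b) + [None] * len(diagram_a)

    def cost(p, q):
        if p is None and q is None:
            return 0.0
        if p is None:
            return (q[1] - q[0]) / 2      # distance of q to the diagonal
        if q is None:
            return (p[1] - p[0]) / 2      # distance of p to the diagonal
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

    best = float("inf")
    for order in permutations(range(len(right))):
        worst = max((cost(left[i], right[j]) for i, j in enumerate(order)),
                    default=0.0)
        best = min(best, worst)
    return best


# Hypothetical diagrams: the short-lived feature (1.0, 1.2) acts as noise.
diagram_a = [(0.0, 4.0), (1.0, 1.2)]
diagram_b = [(0.5, 4.5)]
distance = bottleneck_distance(diagram_a, diagram_b)
print(distance)
```

Here the two long-lived features are matched to each other and the noisy feature is matched to the diagonal, so it barely contributes to the distance.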

The weight function should be application-dependent and thus can be learned instead of pre-assigned. We can just take the difference between two persistence images as a weighted kernel for persistence images (WKPI).

For graphs, the following functions can be used for persistence. The discrete Ricci curvature captures the local curvature on the manifold. The Jaccard index function compares how many neighbors two nodes have in common, which works well for noisy networks.
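The Jaccard index is simple enough to write down. The sketch below computes it for pairs of nodes as the size of the shared neighbourhood divided by the size of the combined neighbourhood. Whether the nodes themselves count as part of their neighbourhoods varies between definitions, so treat this as one plausible variant, and the small graph is made up.

```python
def jaccard_index(adjacency, u, v):
    """Return |N(u) & N(v)| / |N(u) | N(v)| for two nodes of a graph."""
    neighbours_u, neighbours_v = adjacency[u], adjacency[v]
    union = neighbours_u | neighbours_v
    return len(neighbours_u & neighbours_v) / len(union) if union else 0.0


# Hypothetical graph: a triangle a-b-c with a pendant node d attached to a.
adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

score_ab = jaccard_index(adjacency, "a", "b")   # edge inside the triangle
score_ad = jaccard_index(adjacency, "a", "d")   # edge to the pendant node
print(score_ab, score_ad)
```

Edges inside a tightly knit group get a higher value than edges to loosely attached nodes, which is what makes the function usable as a filtration on the graph.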

In general, a descriptive function must be found for the domain and may even encode meaningful knowledge on how the object behaves. High weights would describe the more distinct features.

Session 4 – Behold the spikes

The talk Proper regularizers for semi-supervised learning is given by Dejan Slepcev.

A d-dimensional point cloud can be converted to a graph representation using a kernel that connects nearby points with edges (with a fall-off or a discontinuity). As the number of nodes n goes to infinity, the kernel bandwidth should shrink to 0.
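The construction of the graph can be sketched in a few lines. The example below connects all pairs of points that are closer than the bandwidth and weights each edge with a Gaussian fall-off. The points, the bandwidths and the specific kernel are assumptions for illustration only.

```python
import math


def kernel_graph(points, bandwidth):
    """Connect points closer than bandwidth; weight edges by a Gaussian fall-off."""
    edges = {}
    for i, p in enumerate(points):
        for j in range(i + 1, len(points)):
            distance = math.dist(p, points[j])
            if distance <= bandwidth:
                edges[(i, j)] = math.exp(-(distance / bandwidth) ** 2)
    return edges


# Hypothetical 2D point cloud with two well-separated pairs of points.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0)]

narrow = kernel_graph(points, bandwidth=0.2)   # only close neighbours connect
wide = kernel_graph(points, bandwidth=2.0)     # everything connects
print(sorted(narrow), len(wide))
```

With more points the same bandwidth connects ever more neighbours, which is one way to see why the bandwidth should shrink as n grows.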

The error bandwidth is critical. The take-away is that instead of labelling single data points, the label should be extended beyond the kernel bandwidth. A single labelled point can produce spikes because the objective takes smaller values for a flat surface with a single spike than for an appropriate surface.

Session 5

The talk Solving for committor functions in high dimension is given by Jianfeng Lu.

Session 6 – Finding structure in loss

The talk A consistent framework for structured machine learning is given by Lorenzo Rosasco.

Structured machine learning is not structure learning. It refers to learning functional dependencies between arbitrary output and input data. Classical approaches include likelihood estimation models (struct-svm, conditional random fields, but limited guarantees) and surrogate approaches (strong theoretical guarantees but ad hoc and specific).

Applying empirical risk minimisation (ERM) from statistical learning, we can expect that the mean of the empirical data is close to the mean of the class. However, it is hard to pick a class. The inner risk (decomposing into marginal probability) reduces the class size. As a strong assumption, the structured encoding loss function (SELF) requires a Hilbert space and two maps such that the loss function can be represented as an inner product. Using a linear loss function helps. For a crazy space Y (which need not be linear) the SELF gives enough structure to proceed. This enlarges the scope of structured learning to inner risk minimisation (IRM).

There is a function psi hidden in the loss function that encodes and decodes from Y to the Hilbert space. The steps are encode Y in H, learn from X to H, and decode H to Y. In linear estimation with least squares, the encoding/decoding disappears and the output space Y is not needed for computation.
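The encode, learn and decode steps can be sketched without ever building the Hilbert space. In my understanding of the framework, a linear estimator yields weights alpha_i(x) over the training outputs, and decoding picks the candidate output with the smallest weighted loss. The sketch below swaps in simple kernel-smoothing (Nadaraya-Watson) weights in place of least squares, uses strings with the Hamming distance as a stand-in for a non-linear output space, and all data are invented. It is an illustration of the idea, not the method presented in the talk.

```python
import math


def hamming(a, b):
    """Count the positions at which two equally long strings differ."""
    return sum(char_a != char_b for char_a, char_b in zip(a, b))


def kernel_weights(x, train_x, bandwidth):
    """Return normalised Gaussian weights of x relative to each training input."""
    raw = [math.exp(-((x - xi) / bandwidth) ** 2) for xi in train_x]
    total = sum(raw)
    return [value / total for value in raw]


def predict(x, train_x, train_y, candidates, loss, bandwidth=0.5):
    """Decode by choosing the candidate with the smallest weighted loss."""
    alpha = kernel_weights(x, train_x, bandwidth)
    return min(candidates,
               key=lambda y: sum(a * loss(y, yi) for a, yi in zip(alpha, train_y)))


# Hypothetical training set: scalar inputs, string-valued outputs.
train_x = [0.0, 0.1, 0.2, 2.0, 2.1, 2.2]
train_y = ["aab", "aaa", "aba", "bbb", "bba", "bbb"]
candidates = ["aaa", "aab", "aba", "abb", "baa", "bab", "bba", "bbb"]

low = predict(0.1, train_x, train_y, candidates, hamming)
high = predict(2.1, train_x, train_y, candidates, hamming)
print(low, high)
```

Only the loss between outputs is ever evaluated, so the outputs can be any objects for which a loss is defined.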

Starting “Geometry and Learning from Data in 3D and Beyond” at IPAM, UCLA

Today is the first day of my stay at the Institute for Pure and Applied Mathematics (IPAM) at the University of California, Los Angeles (UCLA). Over the coming weeks I will try to discuss interesting talks here at the long course Geometry and Learning from Data in 3D and Beyond. Stay tuned for the first workshop on Geometry of Big Data.

A Manifesto to Cite 50/50

I recently came across Women Also Know Stuff. I think it is a great initiative that helps to slowly combat systemic and structural inequality. They point to many female scientists in most social sciences and I wondered whether I could find a similar program in computer science. The answer was no because apparently we first need to get women into computer science. I would still love to see #WomenAlsoKnowComputerScience on Twitter; alas, the search results are empty. It is not that I don’t know great female computer scientists, but maybe they lack exposure, which makes it all the harder to convince women to join the field.

What I thought could help would be greater exposure in scientific citations. I will need to go a bit off-topic to explain my thinking, but bear with me. Citations produce scale-free networks (Klemm & Eguiluz, 2002).

Comparison of a random network and a scale-free network. The scale-free network shows super-connected nodes in grey. Taken from Wikipedia.

That means that a few super-connected nodes (so-called hubs) take up almost all the citations. In general, if we as scientists need a citation to underline a concept, we are much more likely to end up citing such a super-connected node. What that means is that highly cited scientists will get cited even more and less cited scientists remain so, even if their science is better. Network effects (or economies of scale) ensure that it is not necessarily the best science that is cited the most, but usually the science that preserves the status quo (Wang, Veugelers, & Stephan, 2017). But the effect is even stronger than that. The big names (not only the citations) dominate the field to such an extent that alternative explanations favored by other scientists are locked out of the discussion until such a star departs from the field (Azoulay, Fons-Rosen, & Zivin, 2015).
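The rich-get-richer mechanism is easy to simulate. In the toy model below every new paper cites exactly one earlier paper, chosen with probability proportional to the citations that paper already has (plus one, so that uncited papers can still be picked). The model, the number of papers and the random seed are my own simplifications, not the model from the cited literature, and merit plays no role in it at all.

```python
import random


def simulate_citations(n_papers, rng):
    """Grow a citation network by preferential attachment; return citation counts."""
    citations = [0, 0]                      # two seed papers without citations
    for _ in range(n_papers - 2):
        weights = [count + 1 for count in citations]
        cited = rng.choices(range(len(citations)), weights=weights)[0]
        citations[cited] += 1
        citations.append(0)                 # the new paper starts uncited
    return citations


citations = simulate_citations(2000, random.Random(42))
mean_citations = sum(citations) / len(citations)
median_citations = sorted(citations)[len(citations) // 2]
print(max(citations), mean_citations, median_citations)
```

Even though all papers are interchangeable in this model, a few early papers collect a large share of the citations while the typical paper is never cited.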

So where does that leave us with citing female scientists? They are at a triple-disadvantage:

  1. They have been structurally excluded from the discipline
  2. They (usually) don’t have a big name so their citation counts don’t increase
  3. As there are no role models young women may not take up the field

However, and this is what I would like to stress most, the reason is not the quality of their research. Now, if citations are usually not awarded on merit alone but mainly for structural reasons, why not use them to start shifting the scales today, so that some day in the future women are equally represented in this field (and in many others), as the statistical distribution of people would predict?

The Manifesto to Cite 50/50

Making a citation to underline a concept does not require us to only cite that one citation that we always use. We can vary whom we cite and we can choose to cite female scientists as well.

  • Citing a female scientist does not cost us anything in our career but it may help build those careers that eventually will bring equality.
  • Citing a female scientist when we only have male scientists at hand makes us critically reflect on our own field and possibly helps us to engage with research more deeply to find female scientists.

We probably won’t reach a 50/50 quota any time soon in our citation lists but maybe we can start climbing towards it. I admit I am not there yet and I haven’t done this for any publication I produced yet, but I am of a mind to change this. Maybe you would like to contribute as well? Change is hard and so my first goal is that around 50% of the publications I cite have a female co-author (though first author would be preferable). I am sure I will fail miserably to reach that goal in the next few publications I make. But yesterday I sat down and tried to find a few women in the field that I could cite and it was surprising how relevant their research was and shocking how I had barely heard of any of them (except those who despite the odds managed to become a big name of their own). I think that in the long term this practice will also make me a better and more engaged scholar who (at least sometimes) manages to look beyond the in-group in which my work is circulating.

Computer Science and more

Now I know I specifically focused on computer science but probably such an attempt should not be confined to one discipline. It should be a truly interdisciplinary endeavor.

Azoulay, P., Fons-Rosen, C., & Zivin, J. S. G. (2015). Does science advance one funeral at a time? National Bureau of Economic Research.
Klemm, K., & Eguiluz, V. M. (2002). Highly clustered scale-free networks. Physical Review E. APS.
Wang, J., Veugelers, R., & Stephan, P. (2017). Bias against novelty in science: A cautionary tale for users of bibliometric indicators. Research Policy. Elsevier.