Geometry of Big Data – Tuesday session

All talks are summarised in my words which may not accurately represent the authors’ opinion. The focus is on aspects I found interesting. Please refer to the authors’ work for more details.

Session 1 – Graph-based persistence

The talk On the density of expected persistence diagrams and its kernel based estimation is given by Frederic Chazal. A draft is available on arxiv.

Grow circles around point data to generate a graph whenever other points meet the circle and produce a persistent homology of filtered simplicial complexes (e.g adding edges to possibly change homology). Persistent barcode and persistence diagrams encode the same information produced by this process.

Measures are nicer to work with than sets of points for statistical purposes. If the persistence diagram D is a random variable, then E[D] is a determnistic measure on R². Persistence images reveal E[D] and are more interpretable than persistance diagramms which may be too crowed for visual inspection with a large sample.

Persistence can be used as an additional feature on a dataset. For example, a random sample from the data set can be taken and the persistence diagram/image can be computed and compared between random samples giving us an idea of the stability of the homology.

Session 2 – Log-concave density estimation

The talk Log-concave density estimation: adaptation and high dimensions is given by Richard Samworth. The paper is available at Project Euclid.

To randomly sample a density f_0 there are generally two appraoches parametric and non-parametric methods. A density f is log-concave if log f is concave. The super level sets need to be convex. Univariate examples are normal, logistic and more. The class is closed under marginalisation, conditioning, convlution and linear transformations.

In an unbounded likelihood, the density surface is spiky. The log-concave density addresses this.

Session 3 – Infinite Width Neural Nets

The talk Infinite-Width Bounded-Norm Networks: A View from Function Space given by Nathan Srebro has two parts Infinite Width ReLU Nets and Geometry of Optimization Regularization and Inductive Bias.

Part 1: When we are learning we find a good fit (of weights) for the data. What kind of functions can be approxmiated by Neural Net? Essentially all, but the question is how large does the network have to be to approximate f to within error e. The question should be: what class function can be approximated by low norm Neural Nets? Another question should be: Given a bounded number of units what norm is required to approximate f to within any error e? The cost of the weights is taken as the parameter. This results in linear splines. A neural net with infinite width and one hidden layer solves the Green’s function.

Part 2: How does depth influence this? Deep learning should be considered with infinitive width and implemented with a finite approximation. Deep learning focuses on searching parameter space that maps into a richer function space.

Session 4

The talk Some geometric surprises in modern machine learningis given by Andrea Montanari.

Session 5

The talk Multi-target detection and cryo-EM imaging by autocorrelation analysis is given by Amit Singer.

Session 6

The talk Learning to Solve Inverse Problems in Imaging is given by Rebecca Willett.

Academia, Uncategorized

Geometry of Big Data – Monday session

All talks are summarised in my words which may not accurately represent the authors’ opinion. The focus is on aspects I found interesting. Please refer to the authors’ work for more details.

Session 1 – Learning DAGs

The talk DAGs with NO TEARS: Continuous Optimization for Structure Learning is given by Pradeep Ravikumar. A draft is available on arxiv.

Learning directed acyclical graphs (DAGs) can traditionally be done in two ways: conditional independence and score-based . The latter poses a local search-problem with out a clear answer. More recently the problem has been posted as a continuous (global) optimisation for undirected graphs.

A loss function is a log-likelihood of the data and we need to find the most appropriate W such that X = XW + E. They provide a new M-estimator.

Session 2 – Parallel transport for data alignment

The talk Data Analysis with the Riemannian Geometry of Symmetric Positive-Definite Matrices given by Ronan Talmon. A draft is available on arxiv.

The talk focuses on how to align data when the intersubject variation is large but consistent and the intrasubject variation could be mapped. Parallel transport has the goal to align the intersubject values on an symmetric positive definite (SPD) embedding in n-dimensional space. SPD matrices are embedded on a hyperbole and all computations can be performed in closed-form.

Data from multiple subject and multiple session, it does not matter whether to first adapt the sessions or the subject – which only works for parallel transport and not with identy transformations.

Session 3 – Persistence framework for data analysis

The talk Metric learning for persistence-based summaries and application to graph classification is given by Yusu Wang. An underlying paper is available on PlosOne.

Persistence diagrams can be used to describe complexity. The features are simpler but persistent to the underlying object. A geometric object through a filtration perspective produces a summary. Filtration is a growing sequence of spaces. The time that sets get created and destroyed can be mapped onto a persistence diagram with death time on the y axis and birth dime on the x-axis.

The bottleneck distance is a matching between two persistence diagram such that each feature is matched with the shortest distance. Features may be matched to a zero-feature (capturing noise) if they are to close to the diagonal. More complex approaches include persistence images that transform the diagram (after transforming it) into a kernel density.

The weight function should be application dependent and thus can be learned instead of pre-assigned. We can just take the difference between two persistence images as a weighted kernel for persistence images (WLPI).

For graphs the following metrics can be used for persistence. The Discrete Ricci curvature captures the local curvature on the manifold. The Jaccard index function compares for nodes who has common neighbors which is good for noisy networks.

In general, a descriptive function must be found for the domain and may even encode meaningful knowledge on how the object behaves. High weights would describe the more distinct features.

Session 4 – Behold the spikes

The talk Proper regularizers for semi-supervised learning is given by Dejan Slepcev.

A d-dimensional point cloud can be converted to a graph representation using a kernel that connects close edges (with a fall-off or discontinuity). As the number of nodes n goes to infinity, the kernel bandwidth should shrink to 0.

The error bandwidth is critical. The take-away is that instead of producing single labeled data points, the label should be extended beyond the kernel bandwidth. A single data label can produce spikes because essentially the minimiser obtains smaller values for a flat surface with a single spike than for an appropriate surface.

Session 5

The talk Solving for committor functions in high dimension is given by Jianfeng Lu.

Session 6 – Finding structure in loss

The talk A consistent framework for structure machine learning is given by Lorenzo Rosasco.

Structured machine learning is not structure learning. It refers to learning functional dependencies between arbitrary output and input data. Classical approaches include likelihood estimation models (struct-svm, conditional random fields, but limited guarantees) and surrogate approaches (strong theoretical guarantees but ad hoc and specific).

Applying empirical risk minimisation (ERM) from statistical learning we can expect that the mean of the empirical data is close to the mean of the class. However, it is hard to pick a class. The inner risk (decomposing into marginal probability) reduces the class size. Making a strong assumption the structured encoding loss function (SELF) requires a Hilbert space and two maps such that the loss function can be presented as an inner product. Using a linear loss function helps. For a crazy space Y (need not be linear) the SELF gives enough structure to proceed. This enlarges the scope of structured learning to inner risk minimisation (IRM).

There is a function psi hidden in the loss function that encodes and decodes from Y to the Hilbert space. The steps are encode Y in H, learn from X to H, and decode H to Y. In linear estimation with least squares, the encoding/decoding disappears and the output space Y is not needed for computation.


Starting “Geometry and Learning from Data in 3D and Beyond” at IPAM, UCLA

Today is the first day of my stay at the Institute for Pure and Applied Mathematics (IPAM) at University of California Los Angeles (UCLA). Over the coming weeks I wil try to discuss interesting talks here at the long course Geometry and Learning from Data in 3D and Beyond. Stay tuned for the first workshop on Geometry of Big Data.