Geometry of Big Data – Tuesday session

All talks are summarised in my words which may not accurately represent the authors’ opinion. The focus is on aspects I found interesting. Please refer to the authors’ work for more details.

Session 1 – Graph-based persistence

The talk On the density of expected persistence diagrams and its kernel based estimation is given by Frederic Chazal. A draft is available on arxiv.

Grow circles around the data points and add an edge to a graph whenever a circle reaches another point. The growing graph defines a filtered simplicial complex whose persistent homology can be computed (e.g. adding an edge may change the homology). Persistence barcodes and persistence diagrams encode the same information produced by this process.
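To make the growing process concrete for myself, here is a minimal sketch of its 0-dimensional part (connected components only). It assumes that an edge enters the filtration at the Euclidean distance between its endpoints; the function name `zero_dim_persistence` and the toy points are my own and not from the talk.

```python
import itertools
import math


def zero_dim_persistence(points):
    """Return (birth, death) pairs of connected components.

    Assumption: the filtration value of an edge is the Euclidean distance
    between its endpoints. All components are born at scale 0.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the component representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Edges enter the filtration in order of increasing length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(len(points)), 2)
    )

    pairs = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j       # two components merge ...
            pairs.append((0.0, length))   # ... so one of them dies here
    pairs.append((0.0, math.inf))         # the last component never dies
    return pairs


# Two tight pairs of points that are far apart from each other.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.0, 1.0)]
diagram = zero_dim_persistence(points)
```

The two short-lived components (death at 1.0) correspond to the tight pairs, while the component dying at 4.0 reflects the gap between the two clusters. Higher-dimensional homology needs a proper library and is not covered by this sketch.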

Measures are nicer to work with than sets of points for statistical purposes. If the persistence diagram D is a random variable, then E[D] is a deterministic measure on R². Persistence images reveal E[D] and are more interpretable than persistence diagrams, which may be too crowded for visual inspection with a large sample.

Persistence can be used as an additional feature on a dataset. For example, a random sample from the data set can be taken and the persistence diagram/image can be computed and compared between random samples giving us an idea of the stability of the homology.

Session 2 – Log-concave density estimation

The talk Log-concave density estimation: adaptation and high dimensions is given by Richard Samworth. The paper is available at Project Euclid.

To estimate a density f_0 from a random sample there are generally two approaches: parametric and non-parametric methods. A density f is log-concave if log f is concave, which implies that its super-level sets are convex. Univariate examples include the normal and logistic densities, among others. The class is closed under marginalisation, conditioning, convolution and linear transformations.
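The definition can be checked numerically on a grid. The following is only a sanity check under my own assumptions (equally spaced grid, hypothetical helper `is_log_concave_on_grid`), not a proof of log-concavity; the Cauchy density is added as a counterexample I am sure of, since its log-density is convex in the tails.

```python
import math


def is_log_concave_on_grid(density, lo, hi, n=201, tol=1e-9):
    """Check discrete concavity of log(density) on an equally spaced grid.

    This is only a numerical sanity check on [lo, hi], not a proof.
    """
    step = (hi - lo) / (n - 1)
    logs = [math.log(density(lo + k * step)) for k in range(n)]
    # Concavity on an equally spaced grid: second differences are <= 0.
    return all(
        logs[k - 1] - 2.0 * logs[k] + logs[k + 1] <= tol
        for k in range(1, n - 1)
    )


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def logistic_pdf(x):
    return math.exp(-x) / (1.0 + math.exp(-x)) ** 2


def cauchy_pdf(x):
    # Heavy-tailed counterexample: log f is convex for |x| > 1.
    return 1.0 / (math.pi * (1.0 + x * x))


normal_ok = is_log_concave_on_grid(normal_pdf, -5.0, 5.0)
logistic_ok = is_log_concave_on_grid(logistic_pdf, -5.0, 5.0)
cauchy_ok = is_log_concave_on_grid(cauchy_pdf, -5.0, 5.0)
```

The normal and logistic densities pass the check, the Cauchy density does not.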

In an unbounded likelihood, the density surface is spiky. The log-concave density addresses this.

Session 3 – Infinite Width Neural Nets

The talk Infinite-Width Bounded-Norm Networks: A View from Function Space is given by Nathan Srebro and has two parts: Infinite Width ReLU Nets and Geometry of Optimization, Regularization and Inductive Bias.

Part 1: When we are learning we find a good fit (of weights) for the data. What kind of functions can be approximated by a neural net? Essentially all, but the usual question is how large the network has to be to approximate f to within error e. The question should rather be: what class of functions can be approximated by low-norm neural nets? Another question should be: given an unbounded number of units, what norm is required to approximate f to within any error e? The cost of the weights is taken as the parameter. This results in linear splines. A neural net with infinite width and one hidden layer solves the Green’s function.

Part 2: How does depth influence this? Deep learning should be considered with infinite width and implemented with a finite approximation. Deep learning focuses on searching a parameter space that maps into a richer function space.

Session 4

The talk Some geometric surprises in modern machine learning is given by Andrea Montanari.

Session 5

The talk Multi-target detection and cryo-EM imaging by autocorrelation analysis is given by Amit Singer.

Session 6

The talk Learning to Solve Inverse Problems in Imaging is given by Rebecca Willett.

Geometry of Big Data – Monday session

All talks are summarised in my words which may not accurately represent the authors’ opinion. The focus is on aspects I found interesting. Please refer to the authors’ work for more details.

Session 1 – Learning DAGs

The talk DAGs with NO TEARS: Continuous Optimization for Structure Learning is given by Pradeep Ravikumar. A draft is available on arxiv.

Learning directed acyclic graphs (DAGs) can traditionally be done in two ways: conditional-independence-based and score-based. The latter poses a local search problem without a clear answer. More recently the problem has been posed as a continuous (global) optimisation for undirected graphs.

A loss function is a log-likelihood of the data and we need to find the most appropriate W such that X = XW + E. They provide a new M-estimator.
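As far as I understand the NO TEARS paper, the trick that makes the optimisation continuous is a smooth function h(W) = tr(exp(W ∘ W)) − d (with ∘ the elementwise product and d the number of nodes) that is zero exactly when the weighted adjacency matrix W describes a DAG. Below is a minimal sketch of that function only, using a truncated power series instead of a library matrix exponential; the helper names and the two toy matrices are my own, and the optimiser itself is left out.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def acyclicity(w, terms=30):
    """Truncated-series approximation of h(W) = tr(exp(W o W)) - d.

    W o W is the elementwise square of W. h(W) = 0 if and only if the
    weighted adjacency matrix W has no directed cycles.
    """
    d = len(w)
    squared = [[w[i][j] ** 2 for j in range(d)] for i in range(d)]
    power = [[float(i == j) for j in range(d)] for i in range(d)]  # identity
    factorial = 1.0
    trace_sum = 0.0
    for k in range(1, terms + 1):
        power = mat_mul(power, squared)   # (W o W)^k
        factorial *= k                    # k!
        trace_sum += sum(power[i][i] for i in range(d)) / factorial
    return trace_sum  # the k = 0 term tr(I) = d cancels the "- d"


w_dag = [[0.0, 1.5, 0.0],
         [0.0, 0.0, -2.0],
         [0.0, 0.0, 0.0]]    # edges 1 -> 2 -> 3, no cycle
w_cycle = [[0.0, 1.0],
           [1.0, 0.0]]       # edges 1 -> 2 and 2 -> 1

h_dag = acyclicity(w_dag)
h_cycle = acyclicity(w_cycle)
```

For the acyclic matrix every power of W ∘ W has zero trace, so h is exactly 0; the two-node cycle gives a strictly positive value that an optimiser can push towards zero.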

Session 2 – Parallel transport for data alignment

The talk Data Analysis with the Riemannian Geometry of Symmetric Positive-Definite Matrices is given by Ronan Talmon. A draft is available on arxiv.

The talk focuses on how to align data when the intersubject variation is large but consistent and the intrasubject variation could be mapped. Parallel transport has the goal to align the intersubject values on a symmetric positive definite (SPD) embedding in n-dimensional space. SPD matrices are embedded on a hyperboloid and all computations can be performed in closed form.

For data from multiple subjects and multiple sessions, it does not matter whether the sessions or the subjects are adapted first. This only holds for parallel transport and not for identity transformations.

Session 3 – Persistence framework for data analysis

The talk Metric learning for persistence-based summaries and application to graph classification is given by Yusu Wang. An underlying paper is available on PlosOne.

Persistence diagrams can be used to describe complexity. The features are simpler but persistent to the underlying object. A geometric object through a filtration perspective produces a summary. A filtration is a growing sequence of spaces. The times at which features get created and destroyed can be mapped onto a persistence diagram with death time on the y-axis and birth time on the x-axis.

The bottleneck distance is based on a matching between two persistence diagrams: among all matchings, it takes the one whose largest distance between matched features is smallest, and reports that largest distance. Features may be matched to a zero-feature on the diagonal (capturing noise) if they are too close to the diagonal. More complex approaches include persistence images that transform the diagram into a kernel density.
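For a handful of points the bottleneck distance can be computed by brute force, which helped me understand the role of the diagonal. This is a sketch under my own assumptions (sup-norm distance between points, matching to the diagonal costs half the persistence, hypothetical function name); real implementations use far more efficient matching algorithms.

```python
import itertools


def bottleneck_distance(diagram_a, diagram_b):
    """Brute-force bottleneck distance between two small persistence diagrams.

    Points are (birth, death) pairs. Any point may be matched to the
    diagonal, which costs half its persistence in the sup norm.
    Runs over all permutations, so it is only meant for a handful of points.
    """
    def to_diagonal(p):
        return (p[1] - p[0]) / 2.0

    def cost(p, q):
        if p is None and q is None:
            return 0.0                      # diagonal matched to diagonal
        if p is None:
            return to_diagonal(q)
        if q is None:
            return to_diagonal(p)
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))  # sup-norm distance

    # Pad each side with one diagonal slot (None) per point of the other side.
    left = list(diagram_a) + [None] * len(diagram_b)
    right = list(diagram_b) + [None] * len(diagram_a)

    best = float("inf")
    for perm in itertools.permutations(range(len(right))):
        worst = max((cost(left[i], right[j]) for i, j in enumerate(perm)),
                    default=0.0)
        best = min(best, worst)
    return best


# Made-up diagrams: one prominent feature each, plus one noisy point in A.
diagram_a = [(0.0, 4.0), (1.0, 1.2)]
diagram_b = [(0.0, 3.5)]
distance = bottleneck_distance(diagram_a, diagram_b)
```

Here the two prominent features are matched with each other (cost 0.5) and the noisy point near the diagonal is matched to the diagonal, so the distance is 0.5.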

The weight function should be application dependent and thus can be learned instead of pre-assigned. We can just take the difference between two persistence images as a weighted kernel for persistence images (WKPI).

For graphs the following functions can be used for persistence. The discrete Ricci curvature captures the local curvature on the manifold. The Jaccard index function compares nodes by their common neighbours, which is good for noisy networks.

In general, a descriptive function must be found for the domain and may even encode meaningful knowledge on how the object behaves. High weights would describe the more distinct features.

Session 4 – Behold the spikes

The talk Proper regularizers for semi-supervised learning is given by Dejan Slepcev.

A d-dimensional point cloud can be converted to a graph representation using a kernel that connects nearby points (with a fall-off or discontinuity). As the number of nodes n goes to infinity, the kernel bandwidth should shrink to 0.

The error bandwidth is critical. The take-away is that instead of producing single labeled data points, the label should be extended beyond the kernel bandwidth. A single data label can produce spikes because essentially the minimiser obtains smaller values for a flat surface with a single spike than for an appropriate surface.

Session 5

The talk Solving for committor functions in high dimension is given by Jianfeng Lu.

Session 6 – Finding structure in loss

The talk A consistent framework for structure machine learning is given by Lorenzo Rosasco.

Structured machine learning is not structure learning. It refers to learning functional dependencies between arbitrary output and input data. Classical approaches include likelihood estimation models (struct-svm, conditional random fields, but limited guarantees) and surrogate approaches (strong theoretical guarantees but ad hoc and specific).

Applying empirical risk minimisation (ERM) from statistical learning we can expect that the mean of the empirical data is close to the mean of the class. However, it is hard to pick a class. The inner risk (decomposing into marginal probability) reduces the class size. Making a strong assumption the structured encoding loss function (SELF) requires a Hilbert space and two maps such that the loss function can be presented as an inner product. Using a linear loss function helps. For a crazy space Y (need not be linear) the SELF gives enough structure to proceed. This enlarges the scope of structured learning to inner risk minimisation (IRM).

There is a function psi hidden in the loss function that encodes and decodes from Y to the Hilbert space. The steps are encode Y in H, learn from X to H, and decode H to Y. In linear estimation with least squares, the encoding/decoding disappears and the output space Y is not needed for computation.

Starting “Geometry and Learning from Data in 3D and Beyond” at IPAM, UCLA

Today is the first day of my stay at the Institute for Pure and Applied Mathematics (IPAM) at the University of California, Los Angeles (UCLA). Over the coming weeks I will try to discuss interesting talks here at the long course Geometry and Learning from Data in 3D and Beyond. Stay tuned for the first workshop on Geometry of Big Data.

A Manifesto to Cite 50/50

I recently came across Women Also Know Stuff. I think it is a great initiative that helps to slowly combat systemic and structural inequality. They point to many female scientists in most social sciences and I wondered whether I could find a similar program in computer science. The answer was no because apparently we first need to get women into computer science. I would still love to see #WomenAlsoKnowComputerScience on twitter, alas the search results are empty. It is not that I don’t know great female computer scientists but maybe they lack exposure, which makes it all the harder to convince women to join the field.

What I thought could help would be a larger exposure in scientific citations. I will need to go a bit off-topic to explain my thinking but bear with me. Citations produce scale-free networks (Klemm & Eguiluz, 2002).

Comparison of a random network and a scale-free network. The scale free network shows super connecting nodes in grey. Taken from wikipedia.

That means that a few super-connected nodes (so-called hubs) take up almost all the citations. In general, if we as scientists need a citation to underline a concept, we are much more likely to end up citing such a super-connected node. What that means is that highly cited scientists will get even more cited and less cited scientists remain so. That is even if their science was better. Network effects (or economies of scale) ensure that not necessarily the best science is cited the most, but usually the one preserving the status quo (Wang, Veugelers, & Stephan, 2017). But the effect is even stronger than that. The big names (not only the citations) dominate the field to such an extent that alternative explanations favored by other scientists are locked out of the discussion until such a star departs from the field (Azoulay, Fons-Rosen, & Zivin, 2015).

So where does that leave us with citing female scientists? They are at a triple-disadvantage:

  1. They have been structurally excluded from the discipline
  2. They (usually) don’t have a big name so their citation counts don’t increase
  3. As there are no role models young women may not take up the field

However, and this is what I would like to stress most, it is not the quality of their research. Now, if citations are usually not awarded for merit only but mainly due to structural reasons, why not use them to start shifting the scales today, such that some day in the future women are equally represented in this field (and in many others), as the statistical distribution of people would predict?

The Manifesto to Cite 50/50

Making a citation to underline a concept does not require us to only cite that one citation that we always use. We can vary whom we cite and we can choose to cite female scientists as well.

  • Citing a female scientist does not cost us anything in our career but it may help build those careers that eventually will bring equality.
  • Citing a female scientist when we only have male scientists at hand makes us reflect critically on our own field and possibly helps us to engage with research more deeply to find female scientists.

We probably won’t reach a 50/50 quota any time soon in our citation lists but maybe we can start climbing towards it. I admit I am not there yet and I haven’t done this for any publication I produced yet, but I am of a mind to change this. Maybe you would like to contribute as well? Change is hard and so my first goal is to have around 50% of cited publications with a female co-author (though first author would be preferable). I am sure I will fail miserably to reach that goal in the next few publications I make. But yesterday I sat down and tried to find a few women in the field that I could cite and it was surprising how relevant their research was and shocking how I had barely heard of any of them (except those who despite the odds managed to become a big name of their own). I think that in the long term this practice will also make me a better and more engaged scholar that (at least sometimes) manages to look beyond the in-group in which my work is circulating.

Computer Science and more

Now I know I specifically focused on computer science but probably such an attempt should not be confined to one discipline. It should be a truly interdisciplinary endeavor.

Azoulay, P., Fons-Rosen, C., & Zivin, J. S. G. (2015). Does science advance one funeral at a time? National Bureau of Economic Research.
Klemm, K., & Eguiluz, V. M. (2002). Highly clustered scale-free networks. Physical Review E. APS.
Wang, J., Veugelers, R., & Stephan, P. (2017). Bias against novelty in science: A cautionary tale for users of bibliometric indicators. Research Policy. Elsevier.

Off to the Chicago Forum on Global Cities

Today I write you as part of a mini-series on my stay at the Chicago Forum on Global Cities (CFGC). I have been kindly sponsored by ETH Zurich and the Chicago Forum to participate in the event. I am currently sitting in my train to Zurich airport and I am looking forward to 3 days of intensive discussions on the future of global cities. You will also find a post about this event on the ETH Ambassadors Blog and ETH Global Facebook and you may look out for some tweets.

I hope for many interesting meetings and conversations at the Forum, especially about my main topics of interest Big Data in Smart Cities – for which I have a short policy brief with me designed in the Argumentation and Science Communication Course of the ISTP – as well as ways to design better cities based on Big Data and knowledge of human (navigation) behaviour – the topic of my soon to start PhD.

PIE: Ex Post Evaluation: Establishing Causality without Experimentation

So far, we discussed evaluation based on ex ante Randomised Control Trials (RCT). In ex post experiments, we have another opportunity for an evaluation. However, there are strong limitations:

  • Treatment manipulation is no longer possible,
  • observational data only (i.e. the outcome of social processes), and
  • baseline may be missing

To address these issues, the idea is to exploit naturally occurring randomisation (as if randomly assigned) and try to construct a valid counterfactual. In essence, we try to construct an ex post RCT based on historical data. The advantage of such an experiment is that it allows us to learn from the past.

Natural experiments

The randomisation has arisen naturally, for instance after a natural disaster, infrastructure failures or indiscriminate forms of violence. The key task here is to establish that the treatment variation is random, with the limitation that it can only be checked for observable parameters.

These natural experiments are also called quasi-experiments.

Regression Discontinuity Design (RDD)

An RDD exploits that the treatment variable [latex]T[/latex] is determined, either completely or partially, by the value of an assignment variable [latex]X[/latex] being on either side of a fixed cutpoint [latex]c[/latex]. In the limit at cutpoint [latex]c[/latex] the assignment of treatment is random/exogenous. The assumption is that units just left and right of the cutpoint [latex]c[/latex] are identical except with regard to the treatment assignment.

An RDD and an RCT are closely related: an RCT can be seen as an RDD in which each participant is assigned a randomly generated number [latex]v[/latex] from a uniform distribution over the range [latex][0,1][/latex] such that [latex]T_i = 1[/latex] if [latex]v\geq0.5[/latex] and [latex]T_i=0[/latex] otherwise.
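A minimal sketch of a sharp RDD on simulated data may help to see the idea. Everything here is an assumption for illustration: the data-generating process, the true effect of 1.0, the bandwidth of 0.2 and the helper names are made up. The estimate is the jump between two local linear fits at the cutpoint.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rdd_estimate(xs, ys, cutpoint, bandwidth):
    """Difference of the two local linear fits evaluated at the cutpoint."""
    left = [(x, y) for x, y in zip(xs, ys)
            if cutpoint - bandwidth <= x < cutpoint]
    right = [(x, y) for x, y in zip(xs, ys)
             if cutpoint <= x <= cutpoint + bandwidth]
    a_left, b_left = fit_line(*zip(*left))
    a_right, b_right = fit_line(*zip(*right))
    return (a_right + b_right * cutpoint) - (a_left + b_left * cutpoint)


# Simulated data (all numbers are made up for illustration).
rng = random.Random(0)
true_effect = 1.0
cutpoint = 0.5
xs = [rng.random() for _ in range(4000)]                  # assignment variable X
ys = [2.0 + 1.5 * x + true_effect * (x >= cutpoint)       # treated iff X >= c
      + rng.gauss(0.0, 0.3) for x in xs]
estimate = rdd_estimate(xs, ys, cutpoint, bandwidth=0.2)
```

Only units within the bandwidth around the cutpoint enter the estimate, which reflects the assumption that units just left and right of [latex]c[/latex] are comparable.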

However, RDDs are more prone to several issues:

  • Omitted Variable Bias is possible (in contrast to well-designed RCTs), because a variable [latex]Z[/latex], which may affect [latex]T[/latex], changes discontinuously at the cutpoint [latex]c[/latex].
  • Units may be able to manipulate their value on the assignment variable [latex]X[/latex] to influence treatment assignment around [latex]c[/latex].
  • Global functional form misspecification may lead to non-linearities being interpreted as discontinuities.

Instrumental Variable Regression (IV)

There is a set of problems where endogeneity or joint determination of [latex]X[/latex] and [latex]Y[/latex], omitted variable bias (other variables) and measurement errors in [latex]X[/latex] may be an issue.

An instrumental variable [latex]Z[/latex] is introduced. It is considered a valid instrument if and only if:

  • Instrument relevance: [latex]Z[/latex] must be correlated with [latex]X[/latex],
  • Instrument exogeneity: [latex]Z[/latex] must be uncorrelated with all other determinants of [latex]Y[/latex].

Potential sources for instruments are:

  • Nature: e.g. geography, weather, biology in which a truly random source of variation influences [latex]X[/latex] (no endogeneity).
  • History: e.g. things determined a long time ago, which were possibly endogenous contemporaneously, but no longer plausibly influence [latex]Y[/latex].
  • Institutions/Policies: e.g. formal or informal rules that influence the assignment of [latex]X[/latex] in a way unrelated to [latex]Y[/latex].

Potential issues for IV regressions are:

  • Conditional unconfoundedness of [latex]Z[/latex] regarding [latex]X[/latex] (ideally [latex]Z[/latex] as if random with regard to [latex]X[/latex] such as eligibility rule or encouragement design).
  • Weak instrument: [latex]Z[/latex] and [latex]X[/latex] are only weakly correlated.
  • Violation of exclusion restriction: [latex]Z[/latex] affects [latex]Y[/latex] independent of [latex]X[/latex].
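The following simulation sketches why an instrument helps. It is not from the course: the data-generating process, the coefficients and the variable names are assumptions of mine. An unobserved confounder u drives both [latex]X[/latex] and [latex]Y[/latex], so the naive regression slope is biased, while the simple IV estimator cov(Z, Y) / cov(Z, X) only uses the variation in [latex]X[/latex] that is induced by [latex]Z[/latex].

```python
import random


def covariance(us, vs):
    """Population covariance of two equally long sequences."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    return sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs)) / n


rng = random.Random(1)
n = 20000
true_effect = 2.0
zs, xs, ys = [], [], []
for _ in range(n):
    z = float(rng.random() < 0.5)          # instrument, as if randomly assigned
    u = rng.gauss(0.0, 1.0)                # unobserved confounder
    x = 0.8 * z + u + rng.gauss(0.0, 1.0)  # relevance: Z shifts X
    y = true_effect * x + 2.0 * u + rng.gauss(0.0, 1.0)  # Z affects Y only via X
    zs.append(z)
    xs.append(x)
    ys.append(y)

ols_estimate = covariance(xs, ys) / covariance(xs, xs)  # biased by u
iv_estimate = covariance(zs, ys) / covariance(zs, xs)   # uses only Z-driven variation
```

In this setup the naive estimate overshoots the true effect of 2.0 considerably, while the IV estimate lands close to it. Lowering the coefficient 0.8 turns [latex]Z[/latex] into a weak instrument and makes the IV estimate much noisier.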

Difference-in-Differences Estimation

Instead of comparing only one point in time, changes are compared over time (i.e. before and after the policy intervention) between participating units and non-participating units. This requires panel data of at least two time periods for participating and non-participating units before and after the policy intervention. Ideally, we have more than two pre-intervention periods.

All participating units should be included, but there are no particular assumptions about how non-participating units are selected. This allows for an arbitrary comparison group as long as they are a valid counterfactual.

However, as always, there are several issues:

  • Time-varying confounders could be an alternative explanation, since only time-invariant differences are removed by the estimation and any omitted time-varying variable would have an impact.
  • Parallel trend assumption is required to show that there is a similar trajectory and that the difference is due to the intervention.
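The basic two-period estimator is just a difference of two differences. The outcome values below are made up for illustration and the function name is my own; the estimate is only meaningful if the parallel trend assumption holds.

```python
def mean(values):
    return sum(values) / len(values)


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """(change in participating units) - (change in non-participating units)."""
    return (mean(treated_post) - mean(treated_pre)) - (
        mean(control_post) - mean(control_pre)
    )


# Made-up outcomes for illustration only.
treated_pre = [10.0, 12.0, 11.0]
treated_post = [15.0, 17.0, 16.0]
control_pre = [8.0, 9.0, 10.0]
control_post = [10.0, 11.0, 12.0]

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
```

The participating units improve by 5 and the non-participating units by 2, so the estimated effect of the intervention is 3. The level difference between the groups before the intervention drops out.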

Synthetic Control Methods (SCM)

While related to the diff-in-diff estimation strategy, there are a few differences as SCM

  • can only have one participating unit;
  • does not need a fixed time period and can be applied more flexibly;
  • requires a continuous outcome;
  • relaxes the modelling assumptions of diff-in-diff; and
  • does not have a method for formal inference (yet).

Non-participating units can be chosen freely (like in diff-in-diff), but work best with many homogeneous units. It also requires panel data, but with multiple pre-intervention years. The longer the time frame available, the better SCM can construct a valid counterfactual. The synthetic control is constructed of weights of the non-participating units.

Typical issues can arise because the quality of the synthetic control depends on:

  • number of potential controls,
  • homogeneity of potential controls,
  • richness of time varying dataset to create synthetic control,
  • number of pre-intervention period observations, and
  • smoothness of the outcome.


Matching

The idea behind matching is to find pairs of treated and untreated units that are identical on the key confounders. With multiple confounders across multiple dimensions this becomes exceedingly difficult, and the proposed solution is to estimate each unit’s participation propensity given observable characteristics. There are a variety of matching estimators with different advantages and disadvantages (e.g. nearest neighbour, coarsened exact matching, genetic matching, etc.).

In matching, we look at the distributions of the treated and the untreated. Observations outside the region of common support, i.e. units whose characteristics imply that they would always or never be treated, should be excluded.

Usual pitfalls include:

  • Quality of matching estimate requires similar assumptions to hold as regular regression (complete understanding of which factors affect the programme outcome).
  • Matching can be considered a non- or semi-parametric regression, hence not significantly different from a causal inference perspective than multivariate regression.


Quality of ex post evaluation relies on the validity of the counterfactual. RCTs are the gold standard but the ex post methods have the advantage of allowing us to learn from the past. There is no technical/statistical fix that will create a valid counterfactual: it is always a question of design. Finding valid counterfactuals in observational data requires innovative thinking and deep substantive knowledge.

PIE: Ex Ante Evaluations: Randomised Control Trials

For a Randomised Control Trial (RCT) several elements are necessary. Evaluators need to be involved long before it ends, ideally from the conception. Randomisation must take place. The operationalisation and measurement must be defined. The data collection process and the data analysis must be performed rigorously. Randomisation and the data collection process are what make the difference compared to other experiments.

To run an RCT partners are needed. Often firms and non-governmental organisations (NGOs) are partners since they benefit from evaluating their work. Governments are still rare partners, but the number of government-sponsored RCTs is increasing. The programme under evaluation can be either an actual programme (only simple impact evaluation) or a pilot programme (impact evaluation can become a field experiment).

Randomisation needs to be chosen carefully. Usually, access, timing of access or encouragement to participate is randomised. The optimal test would run on access, but ethical concerns may make that impossible. Relaxations of access are obtained by introducing the treatment in waves or by encouraging the population to take up the treatment (and measuring the people who did not take it up as “non-accessed”).

A randomised trial can be run in many circumstances, for instance:

  • New program design,
  • New program,
  • New services,
  • New people,
  • New locations,
  • Over- or under-subscription of existing programs,
  • Rotation of program benefits or burdens,
  • Admission cutoffs, and
  • Admission in phases.

The choice of the randomisation level is another important parameter. Often the type of treatment or randomisation opportunities determine the randomisation level. However, the best choice usually would be the individual level. If the level can be picked, there are still several considerations that need to be made when determining the level:

  • Unit of measurement (experiment constraint),
  • Spillovers (interaction between observed units),
  • Attrition (loss of units throughout the observation),
  • Compliance (will the treatment work),
  • Statistical Power (number of units available), and
  • Feasibility (can the unit be observed (cost-)effectively).

Often a cross-cutting design is used, in which several treatments are applied and distributed across the units such that all combinations are observed. This allows us to assess the individual treatments as well as the cross-interaction between treatments.
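A cross-cutting assignment can be sketched in a few lines. The function name, the treatment labels "A" and "B" and the unit identifiers are hypothetical; the sketch simply shuffles the units and cycles through all on/off combinations so that each combination is observed equally often (up to rounding).

```python
import itertools
import random


def cross_cutting_assignment(units, treatments, seed=0):
    """Assign every unit one on/off combination of the treatments,
    cycling through all combinations so that each one is observed."""
    arms = list(itertools.product([0, 1], repeat=len(treatments)))
    shuffled = list(units)
    random.Random(seed).shuffle(shuffled)   # randomise who gets which arm
    return {
        unit: dict(zip(treatments, arms[k % len(arms)]))
        for k, unit in enumerate(shuffled)
    }


# Eight hypothetical units and two hypothetical treatments.
assignment = cross_cutting_assignment(range(8), ["A", "B"])
observed_arms = {tuple(sorted(arm.items())) for arm in assignment.values()}
```

With two treatments there are four arms (neither, only A, only B, both), and comparing them separates the individual effects from the interaction.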

The data collection process can be described in three steps.

  1. Baseline Measurement (checks whether the randomisation worked and can assess the bias of non-compliance and attrition)
  2. Midstream Measurement (only in long-term projects)
  3. Endline Measurement (in combination with the baseline measurement allows us to estimate unit fixed effects, i.e. difference-in-differences estimation)

Threats to RCTs

The four main threats to RCTs are partial compliance, attrition, spill-overs and evaluation-driven effects.

Partial compliance can be caused by several issues: implementation staff may depart from the allocation or treatment procedures; units in the treatment group may not be treated or units in the control group may be treated; units in the treatment group do not get the complete treatment; and units exhibit the opposite of compliance (so-called defiers).

Attrition may occur for specific reasons. However, often drop-out reasons cannot be measured (or the answer is refused).

Spillovers may occur on different levels: physical, behavioural, informational, general-equilibrium (market-wide) effects (i.e. long-term system-wide effects).

Evaluation-driven effects have been observed. The most important ones include:

  • Hawthorne effect (the treatment group changes behaviour due to being observed; to counter the effect, something that cannot be changed should be measured [alternatively, not telling participants that they are observed would help, but is often unethical]);
  • John Henry effect (the control group changes behaviour due to believing being disadvantaged and trying to compensate);
  • resentment and demoralisation effect (selection into treatment and control changes behaviour);
  • demand effect (participants want to produce the result required by the observer or impress the observer);
  • anticipation effect (the psychological state of participants influences their performance [e.g. if they expected to be good at something they score better]); and
  • survey effect (the framing and order of tasks/questions will influence the response).


RCTs are seen as the gold standard when it comes to impact evaluation, but they are no panacea. Designing a rigorous impact evaluation requires innovative thinking and substantive knowledge of the program and policy area.

Funders in the U.S. and GB have begun to increasingly ask for RCT evaluation of programmes, especially in certain domestic policy areas (e.g. education) and development. Continental Europe is still somewhat lagging in this respect.

PIE: The Fundamental Problem of Causal Inference

We evaluate policies for a multitude of reasons. On the one hand, we wish to increase our knowledge and learn about its underlying function to improve program design and effectiveness. On the other hand, considerations from economy, society, and politics are the reason behind the evaluation. This may include allocation decisions via cost-benefit analysis (economic), transparency and accountability (social), public sector reform and innovation (social), credibility (politics), and overcoming ideological (fact-free) bickering (politics).

Impact Evaluation can offer some answers. In particular,

  • the effect of a programme can be measured,
  • the effectiveness of the programme can be measured (i.e. how much better off are beneficiaries),
  • how much do outcomes change under alternative designs,
  • how differently are different people impacted, and
  • is the programme cost-effective.

A causal framework is required to obtain the answer. However, there are risks inherent to evaluation. Evaluations are neither free nor unproblematic. The main issues are

  • costs in time and resources,
  • distorted incentives by equating measurable and valuable whereby intrinsic motivation is crowded out,
  • Goodhart’s law, i.e. that a measure that becomes a target ceases to be a good measure (because people optimise towards the measure rather than the underlying objective that was evaluated via the measure), and
  • a tendency towards encrustation and self-fulfilling prophecy.

Not every programme needs or should be evaluated: the potential benefits should outweigh the costs.

Causal Inference

Three basic criteria for causation are identified by Hume: spatial and temporal contiguity, temporal succession and constant conjunction (Hume, 1740, 1748). The shortcomings were shown by Mill, who noted that observation without experimentation (supposing no aid from deduction) can ascertain sequences and co-existences, but cannot prove causation (Mill, 1843). Lastly, Lewis refined the notion as “a cause is something that makes a difference, and the difference it makes must be a difference from what would have happened without it” (Lewis, 1974).

Causal Claim

A causal claim is a statement about what did not happen. A statement “X caused Y” means that Y is present, but Y would not have been present if X were not present. This is the counterfactual approach to causality. In this approach there is no notion that just because X caused Y that X is the main or the only reason why Y happened or even that X is “responsible” for Y.

This leads to a fundamental misunderstanding between attribution and contribution. Attribution would claim that X is the one and only cause for Y, whereas contribution merely states that X contributed towards the outcome Y. The approach cannot figure out the causes of Y, only whether some X contributed to bringing Y about. The reason is that there is never a single cause of Y and there is no reason that the effects of different causes should add up to 100% unless all causes could be added up. Furthermore, causes are not rival. The question should always be “how much does X contribute to Y”, not “does X cause Y”.

Causality and Causal Pathways

Causal mechanisms or causal chains are often used to illustrate causality. This can be misleading, as Holland points out: if A is planning action Y and B tries to prevent it, but C intervenes to stop B, then both A and C contribute to Y, but Hume’s criteria are not fulfilled for the contribution of C to Y (Holland, 1986).

Necessary and sufficient conditions

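A necessary condition demands that for Y to occur, X needs to have happened: [latex]Y \implies X[/latex] (equivalently [latex]\neg X \implies \neg Y[/latex]). A sufficient condition demands that if X occurs then Y occurs: [latex]X \implies Y[/latex] (equivalently [latex]\neg Y \implies \neg X[/latex]).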

In causal frameworks the conditions need to be related to allow for probabilistic conditions (probability of Y is higher if X is present) and contingencies (X causes Y if Z is present, but not otherwise).

No Causation Without Manipulation

The counterfactual approach requires one to be able to think through how things might look in different conditions. Causal claims should be restricted to conditions that can conceivably (not necessarily practically) be manipulated (Holland, 1986).

Rubin’s Potential Outcome Framework

The framework considers a dichotomous treatment variable [latex]X[/latex] with [latex]x \in \{0,1\}[/latex], where [latex]x=1[/latex] means treated, and a dichotomous outcome variable [latex]Y[/latex] with [latex]y \in \{0,1\}[/latex]. Furthermore, we define [latex]Y^{x=1}[/latex] as the potential outcome under the treatment and [latex]Y^{x=0}[/latex] as the counterfactual outcome under no treatment.

The outcome of interest would be the individual causal/treatment effect (ITE): [latex]X[/latex] has a causal effect on unit [latex]i[/latex]’s outcome [latex]Y[/latex] if and only if [latex]Y^{x=1}\neq Y^{x=0}[/latex]. However, only one of the outcomes is actually observable (factual). ITEs can therefore not be identified, which is referred to as the fundamental problem of causal inference (Holland, 1986).

We need an alternative measure of the causal effect. It is still possible to figure out whether [latex]X[/latex] causes [latex]Y[/latex] on average if [latex]Pr[Y=1|X=1]-Pr[Y=1|X=0]=Pr[Y^{x=1}=1]-Pr[Y^{x=0}=1][/latex], i.e. treatment and control units are exchangeable (statistically speaking, [latex]Y^x[/latex] is stochastically independent of [latex]X[/latex]). Then and only then does the observed difference equal the average treatment effect (ATE) of [latex]X[/latex] on [latex]Y[/latex], which for a finite population of size [latex]N[/latex] is [latex]Pr[Y^{x=1}=1]-Pr[Y^{x=0}=1] = E[Y^{x=1}-Y^{x=0}] = \frac{1}{N}\sum_{i=1}^N [Y_i^{x=1}-Y_i^{x=0}][/latex].
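
The role of exchangeability can be illustrated with a small simulation. This is a sketch under invented assumptions: the background trait `frail`, all probabilities, and the function name `simulate` are hypothetical and chosen only so that the true ATE is 0.3 by construction.

```python
import random

def simulate(n: int, confounded: bool, seed: int = 0):
    """Return (ATE in the simulated population, observed risk difference)."""
    rng = random.Random(seed)
    effects, treated, control = [], [], []
    for _ in range(n):
        frail = rng.random() < 0.5  # hypothetical background trait
        # Both potential outcomes are known here (never in real data).
        y1 = int(rng.random() < (0.4 if frail else 0.8))  # Y^{x=1}
        y0 = int(rng.random() < (0.1 if frail else 0.5))  # Y^{x=0}
        effects.append(y1 - y0)
        # Randomised: fair coin. Confounded: frail units are treated more often.
        p_treat = (0.8 if frail else 0.2) if confounded else 0.5
        if rng.random() < p_treat:
            treated.append(y1)   # only Y^{x=1} is observed
        else:
            control.append(y0)   # only Y^{x=0} is observed
    ate = sum(effects) / n
    risk_difference = sum(treated) / len(treated) - sum(control) / len(control)
    return ate, risk_difference

ate_rand, rd_rand = simulate(100_000, confounded=False)
ate_conf, rd_conf = simulate(100_000, confounded=True)
print(f"randomised: ATE={ate_rand:.3f}, observed difference={rd_rand:.3f}")
print(f"confounded: ATE={ate_conf:.3f}, observed difference={rd_conf:.3f}")
```

With random assignment the observed difference should land close to the ATE, whereas in the confounded run it should be noticeably smaller, because treated and control units are no longer exchangeable.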

Stochastic independence can be achieved in two ways. The scientific approach relies on a homogeneity assumption, which is impossible to satisfy with heterogeneous units. The statistical approach relies on large numbers and can only achieve exchangeability on average. Furthermore, for the ATE we also need the Stable Unit Treatment Value Assumption (SUTVA) to hold, i.e. no variation in treatment across units and non-interference between the units of observation (the treatment of one unit does not influence the others).


Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396), 945–960.
Hume, D. (1740). An Abstract of a Book Lately Published; Entituled, A Treatise of Human Nature, &c: Wherein the Chief Argument of that Book is Farther Illustrated and Explained. (C. Borbet [ie Corbet], Ed.). over-against St. Dunstan’s Church, in Fleetstreet: Addison’s Head.
Hume, D. (1748). Philosophical Essays Concerning Human Understanding: By the Author of the Essays Moral and Political. opposite Katherine-Street, in the Strand: Millar A.
Lewis, D. (1974). Causation. The Journal of Philosophy, 70(17), 556–567.
Mill, J. S. (1843). A System of Logic. Parker.

ASC: Concepts and Arguments

The evaluation of the correctness of arguments is the core of this blog post.

We will focus on justifications, as the premises themselves are to be evaluated with the scientific method. However, the quality of the premises must still be considered. Only true premises can guarantee the truth of the conclusion, so the reasons must be impeccable. Correspondingly, only acceptable premises can warrant acceptance of the conclusion. Additionally, all premises must be consistent with each other to support the conclusion.

Deductive inference (validity) is then used to come to the conclusion. An inference is valid if, whenever all premises are true, the conclusion must be true (where the “must” refers to the relation between premises and conclusion, not to the conclusion itself). Consequently, a valid inference cannot lead from true premises to a false conclusion.

A central feature of valid inferences is that if the conclusion is false, then at least one of the premises must be false. Conversely, if all premises are true but the conclusion is still false, the inference must be invalid.

Formal validity is based on the structure of the assertion, not the meaning, e.g. [latex](X \in A \lor X \in B) \land \neg (X \in B) \implies X \in A[/latex].

Material validity is based on the relation between the concepts, e.g. from “this is a square” it follows that it has four sides of equal length.

Conditional Claims

A is a sufficient condition for B if [latex]A\implies B[/latex]. Logically, B must be true if A occurs, but B could also be true due to a different condition C. B is a necessary condition for A if [latex]\neg B\implies \neg A[/latex]. A can be true only if B is true; however, there could be a C that is also necessary for A to be true.

The valid inferential schemes for a conditional claim “if [latex]A[/latex] then [latex]B[/latex]” are modus ponens (given [latex]A[/latex], conclude [latex]B[/latex]) and modus tollens (given [latex]\neg B[/latex], conclude [latex]\neg A[/latex]). Invalid schemes are denying the antecedent (given [latex]\neg A[/latex], conclude [latex]\neg B[/latex]) and affirming the consequent (given [latex]B[/latex], conclude [latex]A[/latex]); both are formal fallacies in reasoning (non sequitur).
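
The four schemes can be checked mechanically with a truth table: an inference is valid if no assignment of truth values makes all premises true and the conclusion false. The following Python sketch does this for two propositions; the helper names `implies` and `is_valid` are chosen for illustration only.

```python
from itertools import product

def implies(p: bool, q: bool) -> bool:
    """Material conditional: false only when p is true and q is false."""
    return (not p) or q

def is_valid(premises, conclusion) -> bool:
    """Valid iff no truth assignment makes all premises true and the conclusion false."""
    for a, b in product([True, False], repeat=2):
        if all(p(a, b) for p in premises) and not conclusion(a, b):
            return False  # counterexample found
    return True

def conditional(a, b):
    """The shared premise 'if A then B'."""
    return implies(a, b)

schemes = {
    "modus ponens": is_valid([conditional, lambda a, b: a], lambda a, b: b),
    "modus tollens": is_valid([conditional, lambda a, b: not b], lambda a, b: not a),
    "denying the antecedent": is_valid([conditional, lambda a, b: not a], lambda a, b: not b),
    "affirming the consequent": is_valid([conditional, lambda a, b: b], lambda a, b: a),
}

for name, valid in schemes.items():
    print(f"{name}: {'valid' if valid else 'invalid'}")
```

Both fallacies fail on the same counterexample, namely A false and B true: the conditional holds, yet the drawn conclusion is false.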

In the fallacy of equivocation the same expression is used in different ways in the premises than in the conclusion.

In the naturalistic fallacy a normative claim is deduced from a descriptive claim.

Non-deductive inferences claim to be correct (but not valid). An inference is correct iff its premises together provide a good reason for accepting its conclusion. However, a central characteristic of correct non-deductive inferences is that the conclusion can be false even if the premises are true. The conclusion is supported to different degrees and can be strengthened or weakened with additional premises. Non-formal fallacies may occur if the reasons are too weak to support the conclusion.

Inductive inferences are an important class of non-deductive inferences, where the premises are analysed with the help of the theory of probability and statistics. Enumerative induction infers the distribution of a property in the whole population from its distribution in a sample. Statistical syllogism starts from a population in which two properties mostly occur together and concludes that an individual with one property also has the other. Predictive induction observes that two properties occur together in a sample and concludes that a further individual with one property also has the other. Common fallacies include samples that are too small, non-representative samples, ignoring relevant information, and faulty reasoning about probabilities.
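
To give a feeling for why small samples weaken an enumerative induction, here is a short worked example. It assumes simple random sampling and uses the textbook normal approximation for the margin of error of a proportion; both are assumptions added for illustration and not part of the original notes.

```python
import math

def margin_of_error(p_hat: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a sample proportion (normal approximation)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Same observed share of 50%, but very different sample sizes.
small = margin_of_error(0.5, 10)
large = margin_of_error(0.5, 1000)

print(f"n=10:   0.50 +/- {small:.2f}")   # n=10:   0.50 +/- 0.31
print(f"n=1000: 0.50 +/- {large:.2f}")   # n=1000: 0.50 +/- 0.03
```

With ten observations the population share could plausibly lie anywhere between roughly 20% and 80%, so a conclusion about the whole population is only weakly supported.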

Argument by analogy

A claim is justified by analogy to another claim. This argument is often fallacious; typical problems are merely illustrative analogies (which do not justify conclusions), irrelevant analogies, weak analogies, and failing to consider a relevant disanalogy.

Causal inferences

A factor F is considered to be causally relevant for an event if two otherwise comparable situations differ in that the event only occurs in the situation in which F is present. Typical fallacies include inference from temporal sequence, inference from positive correlation, and inferring the inverse causal relevance.

Inference to Best Explanation

A hypothesis is justified because it is the best (closest) explanation for the obtaining of certain facts.

Rules of reasoning

Shifting the burden of proof means that, instead of justifying a controversial claim, one attacks the opponent’s position or demands a justification from them. Appeals to authority are another way of shifting the burden.

The relevance of reasoning demands that an argument is in favour of one’s own claim. Throwing in arguments that are not related to the claim breaks relevance.

The accuracy of reasoning is undermined by a “straw man” fallacy, where the opponent’s claim is exaggerated or altered to make it more susceptible to criticism. More generally, a different claim is attributed to the opponent in order to attack them.

The freedom of speech needs to be preserved by allowing criticism and justification of arguments. Fallacies include argumentum ad baculum (threatening the opponent to make them accept a claim), argumentum ad misericordiam (have pity because X), and argumentum ad hominem (attacking the person rather than the argument).

Implicit premises must be stated if they complete the argument. Fallacies include attributing false implicit premises to opponents and not accepting implicit premises in one’s own arguments.

Shared premises have to be accepted to reach a reasonable agreement about a controversial claim. Fallacies include retreating from shared premises or falsely presenting claims as shared premises.

Accepting the results of previous argumentation is necessary. Fallacies include argumentum ad ignorantiam (taking absence of evidence as evidence of absence) and equating the defence of a claim with its acceptance.


SMABSC: Cognitive Agents

A cognitive model is a representation of an agent’s control mechanism that resembles the cognitive architecture of a mind. It can be understood as a control system (e.g. a flow graph describing how to react) that takes sensory inputs and produces motor outputs (Piaget, 1985).
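
The control-system view can be sketched as a simple mapping from percepts to actions. The class below is a deliberately minimal, hypothetical example: `RuleBasedAgent`, the percepts, and the actions are made up, and it does not implement any of the published architectures.

```python
from typing import Dict

class RuleBasedAgent:
    """Toy control mechanism: a fixed lookup from sensory input to motor output."""

    def __init__(self, rules: Dict[str, str], default_action: str = "idle"):
        self.rules = rules
        self.default_action = default_action

    def step(self, percept: str) -> str:
        """Take one sensory input and return the corresponding motor output."""
        return self.rules.get(percept, self.default_action)

# Hypothetical rule set for illustration.
agent = RuleBasedAgent({"obstacle": "turn", "food": "approach"})
actions = [agent.step(p) for p in ["food", "obstacle", "noise"]]

print(actions)  # ['approach', 'turn', 'idle']
```

An agent like this has no internal state; a memory component would have to be added as state that `step` both reads and updates.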

More advanced models include adaptive memory (Anderson, 1983).

Famous models include SOAR: State, Operator and Result (Laird, Newell, & Rosenbloom, 1987); BDI: Belief, Desire, and Intention (Bratman, Israel, & Pollack, 1988); PECS: Physical conditions, Emotional state, Cognitive capabilities, Social status (Urban & Schmidt, 2001); ACT-R: Adaptive Control of Thought – Rational (Anderson, Matessa, & Lebiere, 1997); CLARION: Connectionist Learning with Adaptive Rule Induction On-line (Sun, 2006); and Agent Zero (Epstein, 2014).

The commonality of all these is summed up in this slide from the University of Michigan:



Anderson, J. R. (1983). A spreading activation theory of memory. Journal of Verbal Learning and Verbal Behavior, 22(3), 261–295.
Anderson, J. R., Matessa, M., & Lebiere, C. (1997). ACT-R: A theory of higher level cognition and its relation to visual attention. Human-Computer Interaction, 12(4), 439–462.
Bratman, M. E., Israel, D. J., & Pollack, M. E. (1988). Plans and resource‐bounded practical reasoning. Computational Intelligence, 4(3), 349–355.
Epstein, J. M. (2014). Agent_Zero: Toward neurocognitive foundations for generative social science. Princeton University Press.
Laird, J. E., Newell, A., & Rosenbloom, P. S. (1987). Soar: An architecture for general intelligence. Artificial Intelligence, 33(1), 1–64.
Piaget, J. (1985). The equilibration of cognitive structures: The central problem of intellectual development. University of Chicago Press.
Sun, R. (2006). The CLARION cognitive architecture: Extending cognitive modeling to social simulation. In Cognition and multi-agent interaction: From cognitive modeling to social simulation (pp. 79–99). Cambridge University Press.
Urban, C., & Schmidt, B. (2001). PECS–agent-based modelling of human behaviour. In Emotional and Intelligent–The Tangled Knot of Social Cognition. Presented at the AAAI Fall Symposium Series, North Falmouth, MA.