PIE: Ex Post Evaluation: Establishing Causality without Experimentation

So far, we have discussed evaluation based on ex ante Randomised Control Trials (RCTs). Ex post evaluation offers another opportunity, but it comes with strong limitations:

  • Treatment manipulation is no longer possible,
  • only observational data are available (i.e. the outcomes of social processes), and
  • a baseline may be missing.

To address these issues, the idea is to exploit naturally occurring randomisation (as-if random assignment) and to construct a valid counterfactual. In essence, we try to construct an ex post RCT from historical data. The advantage of such a design is that it allows us to learn from the past.

Natural experiments

Here, the randomisation has arisen naturally, for instance after a natural disaster, an infrastructure failure, or indiscriminate forms of violence. The key task is to establish that the treatment variation is random, with the limitation that this can only be checked for observable parameters.

These natural experiments are also called quasi-experiments.

Regression Discontinuity Design (RDD)

An RDD exploits that the treatment variable A is determined, either completely or partially, by whether an assignment variable X falls on either side of a fixed cutpoint c. In the limit at cutpoint c, the assignment of treatment is random/exogenous. The assumption is that units just left and right of the cutpoint c are identical except with regard to the treatment assignment.

An RDD and an RCT are closely related: an RCT can be viewed as an RDD in which each participant is assigned a randomly generated number v from a uniform distribution over [0,1], with T_i = 1 if v \geq 0.5 and T_i = 0 otherwise.

However, RDDs are more prone to several issues:

  • Omitted variable bias is possible (in contrast to well-designed RCTs), because a variable Z that affects Y may change discontinuously at the cutpoint c.
  • Units may be able to manipulate their value of the assignment variable X to influence treatment assignment around c.
  • Global functional form misspecification may lead to non-linearities being interpreted as discontinuities.
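
The estimation logic can be sketched in a minimal simulation (not from the text; all numbers illustrative): fit a local linear regression on each side of the cutpoint and take the difference of the intercepts at c.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative simulated data: assignment variable X, cutpoint c, and an
# outcome Y with a true treatment jump of 2.0 at the cutpoint.
n, c, true_effect = 2000, 0.0, 2.0
x = rng.uniform(-1, 1, n)
t = (x >= c).astype(float)                       # treatment set by the cutpoint
y = 1.0 + 0.8 * x + true_effect * t + rng.normal(0, 0.5, n)

# Local linear RDD estimate: fit a line on each side within bandwidth h
# and take the difference of the two intercepts at the cutpoint.
h = 0.3
left = (x >= c - h) & (x < c)
right = (x >= c) & (x <= c + h)
b_left = np.polyfit(x[left] - c, y[left], 1)     # [slope, intercept]
b_right = np.polyfit(x[right] - c, y[right], 1)
rdd_estimate = b_right[1] - b_left[1]
print(round(rdd_estimate, 2))                    # close to the true effect of 2.0
```

The bandwidth h trades off bias (too wide: functional-form misspecification) against variance (too narrow: few observations), which mirrors the issues listed above.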

Instrumental Variable Regression (IV)

There is a set of problems in which endogeneity (joint determination of X and Y), omitted variable bias (other variables), or measurement error in X may be an issue.

An instrumental variable Z is introduced. It is considered a valid instrument if and only if:

  • Instrument relevance: Z must be correlated with X,
  • Instrument exogeneity: Z must be uncorrelated with all other determinants of Y.

Potential sources for instruments are:

  • Nature: e.g. geography, weather, biology in which a truly random source of variation influences X (no endogeneity).
  • History: e.g. things determined a long time ago, which were possibly endogenous contemporaneously, but no longer plausibly influence Y.
  • Institutions/Policies: e.g. formal or informal rules that influence the assignment of X in a way unrelated to Y.

Potential issues for IV regressions are:

  • Conditional unconfoundedness of Z regarding X (ideally Z is as-if random with regard to X, as in an eligibility rule or an encouragement design).
  • Weak instrument: Z and X are only weakly correlated.
  • Violation of the exclusion restriction: Z affects Y independently of X.
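
The IV logic can be sketched on simulated data (not from the text; all numbers illustrative): when an unobserved confounder drives both X and Y, OLS is biased, while the simple IV (Wald) estimator cov(Z,Y)/cov(Z,X) recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setting: X is endogenous (driven by the unobserved u),
# Z shifts X but affects Y only through X. True effect beta = 1.5.
n, beta = 5000, 1.5
z = rng.normal(size=n)
u = rng.normal(size=n)                         # unobserved confounder
x = 0.7 * z + u + rng.normal(size=n)           # instrument relevance: Z -> X
y = beta * x + 2.0 * u + rng.normal(size=n)    # u also drives Y: OLS is biased

ols = np.cov(x, y)[0, 1] / np.var(x)           # biased upward by the confounder
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]   # Wald/IV estimator
print(round(ols, 2), round(iv, 2))             # OLS well above 1.5; IV near 1.5
```

Note that a weak first stage (a small coefficient on z in x) would blow up the variance of the IV estimate, which is the weak-instrument problem from the list above.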

Difference-in-Differences Estimation

Instead of comparing only one point in time, changes are compared over time (i.e. before and after the policy intervention) between participating and non-participating units. This requires panel data with at least two time periods for participating and non-participating units, before and after the policy intervention. Ideally, we have more than two pre-intervention periods.

All participating units should be included, but there are no particular assumptions about how non-participating units are selected. This allows for an arbitrary comparison group, as long as it is a valid counterfactual.

However, as always, there are several issues:

  • Time-varying confounders could offer an alternative explanation: the estimator nets out only time-invariant differences, so any omitted time-varying variable would bias the estimate.
  • The parallel-trends assumption is required: treated and non-treated units must follow similar trajectories before the intervention, so that the post-intervention difference can be attributed to the intervention.
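
The basic 2x2 case can be computed directly; the group means below are purely illustrative.

```python
# Minimal 2x2 difference-in-differences on illustrative group means.
# The treated group receives the intervention between the two periods.
y_treat_pre, y_treat_post = 10.0, 15.0
y_ctrl_pre, y_ctrl_post = 8.0, 10.0

# DiD subtracts the control group's change (the common time trend under
# the parallel-trends assumption) from the treated group's change.
did = (y_treat_post - y_treat_pre) - (y_ctrl_post - y_ctrl_pre)
print(did)  # 3.0
```

The treated group improved by 5 and the control group by 2, so 2 of the treated group's improvement is attributed to the common trend and 3 to the intervention.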

Synthetic Control Methods (SCM)

While related to the diff-in-diff estimation strategy, SCM differs in several respects: it

  • can only have one participating unit;
  • does not need a fixed time period and can be applied more flexibly;
  • requires a continuous outcome;
  • relaxes the modelling assumptions of diff-in-diff; and
  • does not have a method for formal inference (yet).

Non-participating units can be chosen freely (as in diff-in-diff), but the method works best with many homogeneous units. SCM also requires panel data, with multiple pre-intervention years: the longer the available time frame, the better SCM can construct a valid counterfactual. The synthetic control is constructed as a weighted combination of the non-participating units.

Typical issues arise because the quality of the synthetic control depends on:

  • number of potential controls,
  • homogeneity of potential controls,
  • richness of time varying dataset to create synthetic control,
  • number of pre-intervention period observations, and
  • smoothness of the outcome.
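
A minimal sketch of the weight construction (not from the text, on illustrative simulated data): the synthetic control is the convex combination of control units that best reproduces the treated unit's pre-intervention trajectory, here found with constrained least squares via scipy.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Illustrative simulated data: one treated unit and 5 potential control
# units observed over 20 pre-intervention periods.
T0, J = 20, 5
controls = rng.normal(5, 1, size=(T0, J))
true_w = np.array([0.5, 0.3, 0.2, 0.0, 0.0])
treated = controls @ true_w + rng.normal(0, 0.1, T0)

# Non-negative weights summing to one that minimise the pre-intervention
# fit error between the treated unit and the weighted controls.
loss = lambda w: np.sum((treated - controls @ w) ** 2)
res = minimize(loss, np.full(J, 1 / J), method="SLSQP",
               bounds=[(0, 1)] * J,
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1})
weights = res.x
print(np.round(weights, 2))  # approximately [0.5, 0.3, 0.2, 0, 0]
```

The treatment effect would then be read off as the post-intervention gap between the treated unit and its synthetic control; with few or heterogeneous controls the weights fit poorly, which is the quality issue listed above.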


Matching

The idea behind matching is to find identical pairs on key confounders. With multiple confounders across multiple dimensions this becomes exceedingly difficult; the proposed solution is to estimate each unit's participation propensity given observable characteristics (the propensity score). There are a variety of matching estimators with different advantages and disadvantages (e.g. nearest neighbour, coarsened exact matching, genetic matching, etc.).

In matching, we look at the distributions of the treated and the untreated. Observations outside the region of common support (i.e. units that would never be treated or never be untreated) should be excluded.

Usual pitfalls include:

  • The quality of a matching estimate requires similar assumptions to hold as in regular regression (a complete understanding of which factors affect the programme outcome).
  • Matching can be considered a non- or semi-parametric regression and is hence, from a causal-inference perspective, not fundamentally different from multivariate regression.
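
A minimal sketch of nearest-neighbour matching on the propensity score, assuming the scores have already been estimated (all values illustrative).

```python
import numpy as np

# Hypothetical, already-estimated propensity scores and outcomes for
# three treated and four control units (all values illustrative).
pscore = np.array([0.80, 0.60, 0.30, 0.75, 0.55, 0.35, 0.20])
treated = np.array([1, 1, 1, 0, 0, 0, 0], dtype=bool)
y = np.array([12.0, 10.0, 7.0, 9.0, 8.0, 6.0, 5.0])

# Nearest-neighbour matching: pair each treated unit with the control
# unit whose propensity score is closest, then average the outcome
# differences over the matched pairs (the effect on the treated).
t_idx = np.where(treated)[0]
c_idx = np.where(~treated)[0]
matches = [int(c_idx[np.argmin(np.abs(pscore[c_idx] - pscore[i]))])
           for i in t_idx]
att = float(np.mean(y[t_idx] - y[matches]))
print(matches, att)  # [3, 4, 5] 2.0
```

Each treated unit finds a control with a very similar score here; a treated unit with a score far outside the control distribution would have no credible match, which is the common-support point above.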


The quality of an ex post evaluation relies on the validity of the counterfactual. RCTs are the gold standard, but ex post methods have the advantage of allowing us to learn from the past. There is no technical or statistical fix that will create a valid counterfactual: it is always a question of design. Finding valid counterfactuals in observational data requires innovative thinking and deep substantive knowledge.


PIE: Ex Ante Evaluations: Randomised Control Trials

For a Randomised Control Trial (RCT), several elements are necessary. Evaluators need to be involved long before the programme ends, ideally from its conception. Randomisation must take place, the operationalisation and measurement must be defined, and the data collection and data analysis must be performed rigorously. Randomisation and the data collection process are what distinguish RCTs from other experiments.

To run an RCT, partners are needed. Often firms and non-governmental organisations (NGOs) are partners, since they benefit from evaluating their work. Governments are still rare partners, but the number of government-sponsored RCTs is increasing. The programme under evaluation can be either an actual programme (allowing only simple impact evaluation) or a pilot programme (where impact evaluation can become a field experiment).

Randomisation needs to be chosen carefully. Usually access, the timing of access, or encouragement to participate is randomised. The optimal test would randomise access, but ethical concerns may make that impossible. Relaxations of access are obtained by introducing the treatment in waves or by encouraging the population to take up the treatment (and measuring the people who did not take it up as “non-accessed”).

A randomised trial can be run in many circumstances, for instance:

  • New program design,
  • New program,
  • New services,
  • New people,
  • New locations,
  • Over- or under-subscription of existing programs,
  • Rotation of program benefits or burdens,
  • Admission cutoffs, and
  • Admission in phases.

The choice of the randomisation level is another important parameter. Often the type of treatment or randomisation opportunities determine the randomisation level. However, the best choice usually would be the individual level. If the level can be picked, there are still several considerations that need to be made when determining the level:

  • Unit of measurement (experiment constraint),
  • Spillovers (interaction between observed units),
  • Attrition (loss of units over the course of the observation),
  • Compliance (will units comply with the treatment assignment),
  • Statistical power (number of units available), and
  • Feasibility (can the unit be observed (cost-)effectively).
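
The statistical-power consideration can be made concrete with a standard two-sample sample-size formula (not from the text; two-sided alpha = 0.05 and power = 0.80 assumed).

```python
import math

# Back-of-the-envelope sample-size calculation (standard two-sample
# formula, not from the text). Assumed defaults: two-sided alpha = 0.05
# (z = 1.96) and power = 0.80 (z = 0.84).
def units_per_arm(delta, sigma, z_alpha=1.96, z_beta=0.84):
    """Units per arm to detect a mean difference delta with outcome s.d. sigma."""
    return round(2 * (z_alpha + z_beta) ** 2 * (sigma / delta) ** 2)

# Detecting a 0.2 s.d. effect needs roughly 392 units in each arm.
print(units_per_arm(delta=0.2, sigma=1.0))  # 392
```

Randomising at a higher level (e.g. villages instead of individuals) reduces the effective number of units, so the required sample grows accordingly.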

Often a cross-cutting design is used, where several treatments are applied and distributed across the units such that all combinations are observed. This allows assessing the individual treatments as well as the interactions between treatments.
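
A cross-cutting assignment with two hypothetical binary treatments can be sketched as follows (simulated, illustrative).

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)

# Cross-cutting (factorial) sketch with two hypothetical binary
# treatments A and B: units are randomised across all four cells, so
# both main effects and the A x B interaction are identified.
cells = list(product([0, 1], repeat=2))        # (A, B) combinations
n = 400
assignment = rng.choice(len(cells), size=n)    # one cell per unit
a = np.array([cells[k][0] for k in assignment])
b = np.array([cells[k][1] for k in assignment])
counts = {cell: int(np.sum((a == cell[0]) & (b == cell[1]))) for cell in cells}
print(counts)  # roughly 100 units per cell
```

Comparing cell means then gives the effect of A alone, B alone, and their combination.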

The data collection process can be described in three steps.

  1. Baseline measurement (establishes whether the randomisation worked and allows assessing the bias from non-compliance and attrition)
  2. Midstream measurement (only in long-term projects)
  3. Endline measurement (in combination with the baseline measurement, allows estimating unit fixed effects via difference-in-differences estimation)

Threats to RCTs

The four main threats to RCTs are partial compliance, attrition, spill-overs and evaluation-driven effects.

Partial compliance can be caused by several issues: implementation staff may depart from the allocation or treatment procedures; units in the treatment group may not be treated, or units in the control group may be treated; units in the treatment group may not receive the complete treatment; and units may exhibit the opposite of compliance (so-called defiers).

Attrition may occur for specific reasons. However, the reasons for dropping out often cannot be measured (or respondents refuse to answer).

Spillovers may occur on different levels: physical, behavioural, informational, general-equilibrium (market-wide) effects (i.e. long-term system-wide effects).

Evaluation-driven effects have been observed. The most important ones include:

  • Hawthorne effect (the treatment group changes behaviour due to being observed; to counter the effect, something that cannot be changed should be measured [alternatively, not telling participants that they are observed would help, but is often unethical]);
  • John Henry effect (the control group changes behaviour due to believing being disadvantaged and trying to compensate);
  • resentment and demoralisation effect (selection into treatment and control changes behaviour);
  • demand effect (participants want to produce the result required by the observer or impress the observer);
  • anticipation effect (the psychological state of participants influences their performance [e.g. if they expect to be good at something, they score better]); and
  • survey effect (the framing and order of tasks/questions will influence the response).


RCTs are seen as the gold standard when it comes to impact evaluation, but they are no panacea. Designing a rigorous impact evaluation requires innovative thinking and substantive knowledge of the program and policy area.

Funders in the U.S. and the UK have increasingly begun to ask for RCT evaluations of programmes, especially in certain domestic policy areas (e.g. education) and in development. Continental Europe still lags somewhat in this respect.


PIE: The Fundamental Problem of Causal Inference

We evaluate policies for a multitude of reasons. On the one hand, we wish to increase our knowledge and learn about a programme's underlying functioning in order to improve programme design and effectiveness. On the other hand, economic, social, and political considerations motivate evaluation. These may include allocation decisions via cost-benefit analysis (economic); transparency and accountability (social); public sector reform and innovation (social); credibility (political); and overcoming ideological (fact-free) bickering (political).

Impact Evaluation can offer some answers. In particular,

  • the effect of a programme can be measured,
  • the effectiveness of the programme can be measured (i.e. how much better off beneficiaries are),
  • it can show how much outcomes change under alternative designs,
  • it can show how differently different people are impacted, and
  • it can show whether the programme is cost-effective.

A causal framework is required to obtain these answers. However, there are risks inherent to evaluation: evaluations are not free and unproblematic. The main issues are

  • costs in time and resources,
  • distorted incentives from equating the measurable with the valuable, whereby intrinsic motivation is crowded out,
  • Goodhart’s law, i.e. that a measure that becomes a target ceases to be a good measure (because people optimise towards the measure rather than the underlying objective that was evaluated via the measure), and
  • a tendency towards encrustation and self-fulfilling prophecy.

Not every programme needs or should be evaluated: the potential benefits should outweigh the costs.

Causal Inference

Hume identified three basic criteria for causation: spatial and temporal contiguity, temporal succession, and constant conjunction (Hume, 1740, 1748). Mill showed the shortcomings, noting that observation without experimentation (supposing no aid from deduction) can ascertain sequences and co-existences, but cannot prove causation (Mill, 1843). Lastly, Lewis refined the notion: “a cause is something that makes a difference, and the difference it makes must be a difference from what would have happened without it” (Lewis, 1974).

Causal Claim

A causal claim is a statement about what did not happen: the statement “X caused Y” means that Y is present, but Y would not have been present if X had not been present. This is the counterfactual approach to causality. In this approach, the claim that X caused Y does not imply that X is the main or only reason why Y happened, or even that X is “responsible” for Y.

This points to a fundamental distinction between attribution and contribution. Attribution would claim that X is the one and only cause of Y, whereas contribution merely states that X contributed towards the outcome Y. The approach cannot identify all the causes of Y, only whether some X contributed to bringing Y about. The reason is that there is never a single cause of Y, and there is no reason that the effects of different causes should add up to 100%, since not all causes can be enumerated. Furthermore, causes are not rivals. The question should always be “how much does X contribute to Y”, not “does X cause Y”.

Causality and Causal Pathways

Causal mechanisms or causal chains are often used to illustrate causality. This can be misleading, as Holland points out: if A is planning action Y and B tries to prevent it, but C intervenes to stop B, then both A and C contribute to Y, yet Hume’s criteria are not fulfilled for the contribution of C to Y (Holland, 1986).

Necessary and sufficient conditions

A necessary condition demands that for Y to occur, X must have occurred (Y \implies X). A sufficient condition demands that if X occurs, then Y occurs (X \implies Y, equivalently not Y \implies not X).

In causal frameworks, these conditions need to be relaxed to allow for probabilistic statements (the probability of Y is higher if X is present) and contingencies (X causes Y if Z is present, but not otherwise).

No Causation Without Manipulation

The counterfactual approach requires one to be able to think through how things might look in different conditions. Causal claims should be restricted to conditions that can conceivably (not necessarily practically) be manipulated (Holland, 1986).

Rubin’s Potential Outcome Framework

The framework defines a dichotomous treatment variable X with x \in \{0,1\}, where x=1 means treated, and a dichotomous outcome variable Y with y \in \{0,1\}. Furthermore, we define Y^{x=1} as the potential outcome under treatment and Y^{x=0} as the counterfactual outcome under no treatment.

The quantity of interest would be the individual causal/treatment effect (ITE): X has a causal effect on unit i‘s outcome Y if and only if Y_i^{x=1} \neq Y_i^{x=0}. However, only one of the two potential outcomes is ever actually observable (factual). ITEs are therefore not identified, which is referred to as the fundamental problem of causal inference (Holland, 1986).

We need an alternative measure of the causal effect. It is still possible to determine whether X causes Y on average if

    Y^{x} \perp X \quad \text{for } x \in \{0,1\},

i.e. treatment and control units are exchangeable (statistically speaking, Y^x is stochastically independent of X). Then and only then is the average treatment effect (ATE) of X on Y for a finite population N equal to

    \frac{1}{N}\sum_{i=1}^N \left[Y_i^{x=1}-Y_i^{x=0}\right]


Stochastic independence can be achieved either with the scientific approach, which relies on a homogeneity assumption (impossible with heterogeneous units), or with the statistical approach, which relies on large numbers and can only achieve exchangeability on average. Furthermore, for the ATE we also need the Stable Unit Treatment Value Assumption (SUTVA) to hold, i.e. no variation in treatment across units and non-interference between the units of observation (the treatment of one unit does not influence others).
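
The exchangeability argument can be illustrated in simulation, where (unlike with real data) both potential outcomes are known (all numbers illustrative).

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulation sketch: both potential outcomes are known here, which is
# exactly what the fundamental problem rules out with real data.
n = 100_000
y0 = rng.normal(0, 1, n)      # Y^{x=0}
y1 = y0 + 2.0                 # Y^{x=1}: constant unit effect of 2
ate = np.mean(y1 - y0)        # true ATE = 2 by construction

# Under randomisation, X is independent of the potential outcomes
# (exchangeability), so the simple difference in observed group means
# recovers the ATE even though only one outcome per unit is observed.
x = rng.integers(0, 2, n)
y_obs = np.where(x == 1, y1, y0)
diff_means = y_obs[x == 1].mean() - y_obs[x == 0].mean()
print(round(ate, 2), round(diff_means, 2))  # both close to 2.0
```

If x were instead correlated with y0 (self-selection), the difference in means would no longer equal the ATE, which is why exchangeability is the crucial condition.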


Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396), 945–960.
Hume, D. (1740). An Abstract of a Book Lately Published; Entituled, A Treatise of Human Nature, &c. London: C. Corbet.
Hume, D. (1748). Philosophical Essays Concerning Human Understanding. London: A. Millar.
Lewis, D. (1974). Causation. The Journal of Philosophy, 70(17), 556–567.
Mill, J. S. (1843). A System of Logic. Parker.