So far, we have discussed evaluation based on *ex ante* Randomised Controlled Trials (RCTs). *Ex post* experiments give us another opportunity for evaluation. However, there are strong limitations:

- treatment manipulation is no longer possible,
- only observational data are available (i.e. the outcome of social processes), and
- a baseline may be missing.

To address these issues, the idea is to exploit naturally occurring randomisation (units assigned as if at random) and to construct a valid counterfactual from it. In essence, we try to construct an *ex post* RCT from historical data. The advantage of such an experiment is that it allows us to learn from the past.

*Natural experiments*

Here the randomisation has arisen naturally, for instance after a natural disaster, an infrastructure failure or indiscriminate forms of violence. The key task is to establish that the treatment variation is random, with the limitation that this can only be checked for observable parameters.

These natural experiments are also called quasi-experiments.
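
Checking randomness on observable parameters is often done with a balance table. The following is a minimal sketch, assuming two hypothetical covariates (`age`, `income`) with invented values; the standardised mean difference is one common balance measure, not the only one, and it says nothing about unobservables.

```python
from statistics import mean, stdev

def standardised_mean_difference(treated, untreated):
    """Difference in group means, scaled by the pooled standard deviation."""
    pooled_sd = ((stdev(treated) ** 2 + stdev(untreated) ** 2) / 2) ** 0.5
    return (mean(treated) - mean(untreated)) / pooled_sd

# Hypothetical observable covariates: (values for treated, values for untreated).
covariates = {
    "age": ([30, 40, 50], [31, 39, 50]),
    "income": ([10, 12, 14], [20, 22, 24]),
}

balance = {
    name: standardised_mean_difference(t, u)
    for name, (t, u) in covariates.items()
}

for name, smd in balance.items():
    print(name, round(smd, 2))
```

In this invented example `age` is balanced while `income` differs strongly between the groups, which would cast doubt on the claim that treatment was as good as randomly assigned.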

*Regression Discontinuity Design (RDD)*

An RDD exploits the fact that the treatment variable [latex]T[/latex] is determined, either completely or partially, by whether the value of an assignment variable [latex]X[/latex] lies on one side or the other of a fixed cutpoint [latex]c[/latex]. In the limit at the cutpoint [latex]c[/latex], the assignment of treatment is as good as random (exogenous). The assumption is that units just to the left and right of the cutpoint [latex]c[/latex] are identical *except* with regard to the treatment assignment.

An RDD and an RCT are closely related: an RCT can be seen as an RDD in which each participant is assigned a randomly generated number [latex]v[/latex] from a uniform distribution over the range [latex][0,1][/latex], such that [latex]T_i = 1[/latex] if [latex]v\geq0.5[/latex] and [latex]T_i=0[/latex] otherwise.

However, RDDs are more prone to several issues:

- **Omitted variable bias** is possible (in contrast to well-designed RCTs), because a variable [latex]Z[/latex], which may affect [latex]T[/latex], may change discontinuously at the cutpoint [latex]c[/latex].
- Units may be able to **manipulate** their value on the assignment variable [latex]X[/latex] to influence treatment assignment around [latex]c[/latex].
- **Global functional form misspecification** may lead to non-linearities being interpreted as discontinuities.
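
The comparison of units just left and right of the cutpoint can be sketched as follows. This is a minimal illustration on simulated data, assuming a sharp design, a hand-picked bandwidth and a true jump of 3; the names `ols_line`, `rdd_estimate` and all numbers are hypothetical. Fitting a separate line on each side within a narrow window is one way to reduce the risk of global functional form misspecification.

```python
import random

def ols_line(xs, ys):
    """Return (intercept, slope) of a simple least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def rdd_estimate(x, y, cutoff, bandwidth):
    """Sharp RDD: gap between two local linear fits, evaluated at the cutoff."""
    # Centre X at the cutoff so each intercept is the fitted value at the cutoff.
    left = [(xi - cutoff, yi) for xi, yi in zip(x, y)
            if cutoff - bandwidth <= xi < cutoff]
    right = [(xi - cutoff, yi) for xi, yi in zip(x, y)
             if cutoff <= xi <= cutoff + bandwidth]
    a_left, _ = ols_line(*zip(*left))
    a_right, _ = ols_line(*zip(*right))
    return a_right - a_left

# Simulated data (assumption): outcome is linear in X plus a jump at the cutoff.
rng = random.Random(0)
cutoff, true_effect = 0.5, 3.0
x = [rng.random() for _ in range(2000)]
y = [1.0 + 2.0 * xi + true_effect * (xi >= cutoff) + rng.gauss(0, 0.5) for xi in x]

estimate = rdd_estimate(x, y, cutoff, bandwidth=0.2)
print(round(estimate, 2))
```

The estimate should land close to the simulated jump. In applied work the bandwidth choice and checks for manipulation around the cutpoint need separate attention.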

*Instrumental Variable Regression (IV)*

IV regression addresses a set of problems in which endogeneity or joint determination of [latex]X[/latex] and [latex]Y[/latex], omitted variable bias (other variables) or measurement error in [latex]X[/latex] may be an issue.

An instrumental variable [latex]Z[/latex] is introduced. It is considered a valid instrument if and only if:

- *Instrument relevance*: [latex]Z[/latex] must be correlated with [latex]X[/latex].
- *Instrument exogeneity*: [latex]Z[/latex] must be uncorrelated with all other determinants of [latex]Y[/latex].

Potential sources for instruments are:

- *Nature*: e.g. geography, weather, biology in which a truly random source of variation influences [latex]X[/latex] (no endogeneity).
- *History*: e.g. things determined a long time ago, which were possibly endogenous contemporaneously, but no longer plausibly influence [latex]Y[/latex].
- *Institutions/Policies*: e.g. formal or informal rules that influence the assignment of [latex]X[/latex] in a way unrelated to [latex]Y[/latex].

Potential issues for IV regressions are:

- Conditional unconfoundedness of [latex]Z[/latex] regarding [latex]X[/latex] may not hold (ideally [latex]Z[/latex] is as if random with regard to [latex]X[/latex], as with an eligibility rule or an encouragement design).
- Weak instrument: [latex]Z[/latex] and [latex]X[/latex] are only weakly correlated.
- Violation of the exclusion restriction: [latex]Z[/latex] affects [latex]Y[/latex] independently of [latex]X[/latex].
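
To illustrate how an instrument can remove bias from an unobserved confounder, here is a minimal sketch on simulated data. It assumes a single instrument, in which case the IV estimate reduces to the ratio of two covariances. The data-generating process, the true effect of 2 and all variable names are hypothetical.

```python
import random

def cov(a, b):
    """Sample covariance of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (n - 1)

rng = random.Random(1)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]  # instrument, independent of u by construction
u = [rng.gauss(0, 1) for _ in range(n)]  # unobserved confounder
x = [0.8 * zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
y = [2.0 * xi + 3.0 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

# Naive regression of Y on X picks up the confounder and is biased.
ols_slope = cov(x, y) / cov(x, x)

# IV estimate with one instrument: cov(Z, Y) / cov(Z, X).
iv_slope = cov(z, y) / cov(z, x)

# Relevance check: the first-stage correlation between Z and X.
first_stage_corr = cov(z, x) / (cov(z, z) * cov(x, x)) ** 0.5

print(round(ols_slope, 2), round(iv_slope, 2), round(first_stage_corr, 2))
```

The instrument is exogenous here only because the simulation makes it so. With real data, relevance can be checked as above, whereas exogeneity has to be argued from the design.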

*Difference-in-Differences Estimation*

Instead of comparing only one point in time, changes over time (i.e. before and after the policy intervention) are compared between participating units and non-participating units. This requires panel data covering at least two time periods, one before and one after the policy intervention, for both participating and non-participating units. Ideally, we have more than two pre-intervention periods.

All participating units should be included, but there are no particular assumptions about how non-participating units are selected. This allows for an arbitrary comparison group as long as it is a valid counterfactual.

However, as always, there are several issues:

- **Time-varying confounders** could be an alternative explanation: differencing removes only time-invariant differences between the groups, so any omitted variable that changes over time would still have an impact.
- The **parallel trend assumption** is required: participating and non-participating units must follow a similar trajectory, so that the remaining difference can be attributed to the intervention.
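
The basic calculation can be sketched on a tiny invented panel. The units, periods and outcome values below are hypothetical; periods 0 and 1 are assumed to be pre-intervention and period 2 post-intervention. The second call is a simple placebo comparison of the two pre-intervention periods, which gives a rough indication of whether trends were parallel before the intervention.

```python
from statistics import mean

# Hypothetical panel rows: (unit, period, participates, outcome).
panel = [
    ("a", 0, True, 9.0), ("a", 1, True, 10.0), ("a", 2, True, 15.0),
    ("b", 0, True, 11.0), ("b", 1, True, 12.0), ("b", 2, True, 17.0),
    ("c", 0, False, 7.0), ("c", 1, False, 8.0), ("c", 2, False, 9.0),
    ("d", 0, False, 9.0), ("d", 1, False, 10.0), ("d", 2, False, 11.0),
]

def group_mean(rows, participates, period):
    """Mean outcome of one group in one period."""
    return mean(y for _, t, d, y in rows if d == participates and t == period)

def diff_in_diff(rows, pre, post):
    """Change among participants minus change among non-participants."""
    change_treated = group_mean(rows, True, post) - group_mean(rows, True, pre)
    change_control = group_mean(rows, False, post) - group_mean(rows, False, pre)
    return change_treated - change_control

effect = diff_in_diff(panel, pre=1, post=2)   # 4.0
placebo = diff_in_diff(panel, pre=0, post=1)  # 0.0, i.e. no pre-intervention divergence
print(effect, placebo)
```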

*Synthetic Control Methods (SCM)*

While related to the diff-in-diff estimation strategy, SCM differs in a few respects. SCM

- can only have one participating unit;
- does not need a fixed time period and can be applied more flexibly;
- requires a continuous outcome;
- relaxes the modelling assumptions of diff-in-diff; and
- does not have a method for formal inference (yet).

Non-participating units can be chosen freely (as in diff-in-diff), but SCM works best with many homogeneous units. It also requires panel data, but with multiple pre-intervention years. The longer the time frame available, the better SCM can construct a valid counterfactual. The synthetic control is constructed as a weighted combination of the non-participating units.

Typical issues arise because the quality of the synthetic control depends on the:

- number of potential controls,
- homogeneity of potential controls,
- richness of the time-varying data used to create the synthetic control,
- number of pre-intervention period observations, and
- smoothness of the outcome.
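
The weighting idea can be sketched as follows. This is a deliberately crude illustration, assuming exactly three hypothetical donor units, non-negative weights that sum to one, and a brute-force grid search on pre-intervention outcomes only; practical implementations use constrained optimisation and usually also match on covariates. All series are invented.

```python
def synthetic(weights, donors):
    """Weighted combination of donor series, period by period."""
    periods = range(len(donors[0]))
    return [sum(w * d[t] for w, d in zip(weights, donors)) for t in periods]

def fit_weights(treated_pre, donors_pre, steps=100):
    """Grid search over non-negative weights summing to one (three donors only)."""
    best, best_loss = None, float("inf")
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            w = (i / steps, j / steps, (steps - i - j) / steps)
            fit = synthetic(w, donors_pre)
            loss = sum((a - b) ** 2 for a, b in zip(treated_pre, fit))
            if loss < best_loss:
                best, best_loss = w, loss
    return best

# Hypothetical outcomes: four pre-intervention and two post-intervention periods.
donors_pre = [[10, 12, 14, 16], [20, 19, 18, 17], [5, 9, 4, 8]]
donors_post = [[18, 20], [16, 15], [6, 7]]
treated_pre = [14.0, 14.8, 15.6, 16.4]
treated_post = [19.2, 20.0]

weights = fit_weights(treated_pre, donors_pre)
counterfactual = synthetic(weights, donors_post)
gaps = [obs - syn for obs, syn in zip(treated_post, counterfactual)]
print(weights, [round(g, 2) for g in gaps])
```

The gaps between the observed and synthetic post-intervention outcomes are the estimated effects. With few donors, few pre-intervention periods or a noisy outcome, the fitted weights become much less trustworthy, which is what the list above warns about.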

*Matching*

The idea behind matching is to find pairs of units that are identical on the key confounders. With multiple confounders across multiple dimensions, this becomes exceedingly difficult, and the proposed solution is to estimate each unit's participation propensity given observable characteristics. There are a variety of matching estimators with different advantages and disadvantages (e.g. nearest neighbour, coarsened exact matching, genetic matching, etc.).

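In matching, we look at the distribution of the participation propensity among the treated and the untreated. Observations that, given their characteristics, would always or never be treated have no comparable counterpart in the other group and should be excluded.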
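
A minimal sketch of nearest-neighbour matching on the propensity score is shown below. It assumes the propensity scores have already been estimated (the values here are invented), matches with replacement, and uses a hypothetical caliper of 0.05 to drop treated units without a comparable control.

```python
from statistics import mean

# Hypothetical units: (estimated propensity score, outcome).
treated = [(0.30, 12.0), (0.50, 15.0), (0.70, 18.0), (0.95, 30.0)]
controls = [(0.10, 5.0), (0.28, 10.0), (0.52, 12.0), (0.69, 17.0)]

def match_att(treated, controls, caliper):
    """Average treated-minus-matched-control outcome over matched treated units."""
    diffs = []
    for p_t, y_t in treated:
        # Nearest control on the propensity score (matching with replacement).
        p_c, y_c = min(controls, key=lambda c: abs(c[0] - p_t))
        # Treated units with no control within the caliper are excluded.
        if abs(p_c - p_t) <= caliper:
            diffs.append(y_t - y_c)
    return mean(diffs), len(diffs)

att, n_matched = match_att(treated, controls, caliper=0.05)
print(att, n_matched)  # 2.0 3
```

The treated unit with a propensity of 0.95 has no control nearby and is dropped, so the estimate only speaks for the units that have comparable counterparts.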

Usual pitfalls include:

- The quality of the matching estimate relies on assumptions similar to those of regular regression (a complete understanding of which factors affect the programme outcome).
- Matching can be considered a non- or semi-parametric regression; hence, from a causal inference perspective, it is not substantially different from multivariate regression.

*Conclusion*

The quality of *ex post* evaluation relies on the validity of the counterfactual. RCTs are the gold standard, but the *ex post* methods have the advantage of allowing us to learn from the past. There is no technical/statistical fix that will create a valid counterfactual: it is always a question of design. Finding valid counterfactuals in observational data requires innovative thinking and deep substantive knowledge.