Causality and DAGs (Directed Acyclic Graphs)
Testable implications of DAGs
Estimating causal effects
Final thoughts
Civil engineering doctorates awarded in the US (source: NSF) as a function of per capita mozzarella cheese consumption, 2000 to 2009.
Characterising the role of neurons in behavior is a causal inference question.
Our sensory system routinely makes inferences about the causes of the sensory signals it receives.
Everyone knows what causality is
(mozzarella cheese does not cause engineering PhDs;
if we intervened to increase mozzarella consumption, we would not expect this to have much influence on the number of PhDs awarded)
Hume (1748): “We may define a cause to be an object, followed by another, […] where, if the first object had not been, the second had never existed.”
Much of the development of statistics in the 20th century avoided causality
The word “cause” is not in the vocabulary of probability theory
There is no correlation without causation (Reichenbach, 1956)
If \(X\) and \(Y\) are statistically dependent (\(X \not\!\perp\!\!\!\perp Y\)),
then either:
- \(X\) causes \(Y\);
- \(Y\) causes \(X\);
- a third variable \(Z\) causes both \(X\) and \(Y\)
A DAG is a collection of nodes and edges.
Edges are directed arrows indicating causal links.
A path is a sequence of edges connecting two nodes.
The node that a directed edge starts from is called the parent of the node that the edge points into (the child node).
Nodes separated by more than one edge are ancestors and descendants.
Exogenous variables \(\{ X \}\): external variables not caused by other variables in the model.
Endogenous variables \(\{ Y, W, Z\}\): variables determined by other variables in the model.
A parent variable is a direct cause of its child variables
(\(X\) is a direct cause of \(Y\)).
An ancestor variable is an indirect or potential cause of its descendants
(\(X\) is a potential cause of \(Z\)).
Acyclicity may seem problematic in the presence of feedback loops.
One way to deal with these is to add nodes for repeated measures (e.g. cross-lagged panel models).
Each variable in a DAG is usually assumed to be subject to an error or disturbance term (\(\epsilon_W, \epsilon_X, \ldots\)).
These errors are assumed to be independent of each other, and are usually not shown for simplicity.
“The disturbance terms represent independent background factors that the investigator chooses not to include in the analysis” (Pearl, 2009)
We can write the joint distribution of all variables in a DAG as the product of the conditional distributions \(p(\text{child} \mid \text{parents})\)
\[\begin{align*}p & (X, Y, W, Z) =\\ & p(Z \mid W, Y) \, p(Y \mid X, W) \, p(W \mid X) \, p(X) \end{align*}\]
This holds regardless of the precise functional form of each “arrow”.
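As a concrete illustration, the sketch below simulates data consistent with this factorisation; the linear-Gaussian forms and coefficients are arbitrary choices, not implied by the DAG itself.

```r
# Simulate from p(X, Y, W, Z) = p(Z | W, Y) p(Y | X, W) p(W | X) p(X),
# sampling each node given its parents, in topological order.
set.seed(1)
n <- 1e4
X <- rnorm(n)                       # exogenous: p(X)
W <- 0.8 * X + rnorm(n)             # p(W | X)
Y <- 0.5 * X + 0.4 * W + rnorm(n)   # p(Y | X, W)
Z <- 0.7 * W - 0.3 * Y + rnorm(n)   # p(Z | W, Y)
d <- data.frame(X, W, Y, Z)
```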
A DAG provides a blueprint defining a family of joint probability distributions over a set of variables.
The structure of a DAG introduces constraints on the possible joint distributions of its variables, and these can be tested empirically.
Unconditional (in)dependencies: a DAG is a statement about which variables should (or should not) be associated with one another;
Conditional (in)dependencies: a DAG also states which variables should become associated or non-associated when we condition on some other set of variables.
Conditioning on a variable here can be understood as controlling for it.
Conditioning on \(Z\) when studying the link between \(X\) and \(Y\) means we introduce information about \(Z\) in our analysis and ask whether \(X\) provides any additional information (above and beyond what we already know from \(Z\)).
\[Y \perp\!\!\!\perp X \mid Z\]
How to condition on / control for a variable in practice?
Stratified analysis
Inclusion as covariate in regression models
Matching
All methods of statistical control are affected by measurement errors in confounding variables.
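As a minimal sketch of what “controlling” does in practice, the simulation below assumes a confounder \(Z\) that causes both \(X\) and \(Y\), with no causal effect of \(X\) on \(Y\); including \(Z\) as a covariate removes the spurious association.

```r
# Conditioning via regression adjustment on simulated data (Z -> X, Z -> Y,
# no arrow X -> Y); all coefficients are arbitrary illustration choices.
set.seed(2)
n <- 1e4
Z <- rnorm(n)
X <- 0.8 * Z + rnorm(n)
Y <- 0.8 * Z + rnorm(n)      # Y does not depend on X

coef(lm(Y ~ X))["X"]         # marginal association: clearly non-zero
coef(lm(Y ~ X + Z))["X"]     # conditioning on Z: approximately zero
# A stratified analysis makes the same point: within narrow bands of Z
# (e.g. cut(Z, 10)) the X-Y association disappears.
```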
\(X\) and \(Y\) are independent of each other \(X \perp\!\!\!\perp Y\)
\(C\) is caused by both \(X\) and \(Y\)
They become dependent after we condition on the collider node \(C\);
(\(X \not\!\perp\!\!\!\perp Y \mid C\))
No association between appearance and taste in the ‘population’ of cakes.
Appearance and taste become negatively correlated once we condition our analysis on whether a cake was shortlisted or not.
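A small simulation of the cake example; the scoring rule and shortlisting threshold below are arbitrary assumptions, used only to show the sign flip.

```r
# Appearance and taste are independent in the 'population' of cakes; a cake is
# shortlisted (the collider) when the two scores are jointly high enough.
set.seed(3)
n <- 2000
appearance <- rnorm(n)
taste      <- rnorm(n)                   # independent of appearance
shortlisted <- (appearance + taste) > 1  # caused by both

cor(appearance, taste)                            # ~ 0 overall
cor(appearance[shortlisted], taste[shortlisted])  # negative among shortlisted cakes
```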
Realistic causal models are more complex and have more than one path between variables.
D-separation (“D” stands for directional) describes when some variables in a directed graph are independent of others.
Assuming we are not conditioning on any variable, two variables in a DAG are D-separated if all the paths between them contain a collider.
Paths that do not contain colliders imply a correlation (dependency) between variables.
Some examples are “fork” paths \(X \leftarrow Z \rightarrow Y\) or “pipe” paths \(X \rightarrow Z \rightarrow Y\). These paths are said to be “open”; in order to “close” them (so as to D-separate \(X\) and \(Y\) and make them independent) we need to condition on \(Z\).
\(Z\) and \(Y\) are D-separated (unconditionally independent, \(Z \perp\!\!\!\perp Y\))
However, \(W\) is a collider node, so if we condition our analysis on \(W\) we would find a ‘spurious’ association between \(Z\) and \(Y\)
This association is ‘spurious’ because \(Y\) is not a descendant of \(Z\).
If we were to make an intervention on \(Z\) this would have no effect on \(Y\).
Conditioning on \(U\), the child of the collider node, produces the same effect!
(formally \(Z \not\!\perp\!\!\!\perp Y \mid U\))
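The same point can be checked by simulation; the linear-Gaussian forms below are arbitrary, only the structure \(Z \rightarrow W \leftarrow Y\), \(W \rightarrow U\) matters.

```r
# Z and Y are independent; both cause the collider W, and U is a child of W.
set.seed(4)
n <- 1e4
Z <- rnorm(n)
Y <- rnorm(n)              # independent of Z
W <- Z + Y + rnorm(n)      # collider
U <- W + rnorm(n)          # child of the collider

coef(lm(Y ~ Z))["Z"]       # ~ 0: Z and Y are D-separated
coef(lm(Y ~ Z + W))["Z"]   # clearly non-zero: conditioning on the collider W
coef(lm(Y ~ Z + U))["Z"]   # also non-zero: conditioning on W's child U
```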
Which variables are D-separated (unconditionally independent) in this DAG?
Only \(Z_1\) and \(Z_2\) are unconditionally independent ( \(Z_1 \perp\!\!\!\perp Z_2\)).
Conditional on which variables are \(Z_1\) and \(W\) independent of one another?
Conditioning on \(X\) makes \(Z_1\) and \(W\) independent of one another (\(W \perp\!\!\!\perp Z_1 \mid X\)); all other paths are “blocked” by a collider.
dagitty is also available as an R package.
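As a brief sketch of the R package, the code below encodes the four-variable DAG implied by the factorisation shown earlier (edges \(X \rightarrow W\), \(X \rightarrow Y\), \(W \rightarrow Y\), \(W \rightarrow Z\), \(Y \rightarrow Z\)) and queries its testable implications.

```r
# Define a DAG and query it with the dagitty R package.
library(dagitty)
g <- dagitty("dag {
  X -> W
  X -> Y
  W -> Y
  W -> Z
  Y -> Z
}")

impliedConditionalIndependencies(g)  # testable (conditional) independencies
paths(g, "X", "Z")                   # all paths between X and Z, and whether each is open
```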
We can use D-separation to falsify and test our DAG model!
We have found that \(W \perp\!\!\!\perp Z_1 \mid X\)
(\(W\) is independent of \(Z_1\) conditional on \(X\))
This implies that if we regress \(W\) on \(X\) and \(Z_1\), e.g. \[W = \beta_1 X + \beta_2 Z_1\]
we should find that \(\beta_2 \approx 0\).
Finding that \(\beta_2\ne 0\) would indicate that \(W\) depends on \(Z_1\) given \(X\), implying that the DAG model is wrong.
More specifically, the DAG would be wrong because the ‘true’ model must have a path between \(W\) and \(Z_1\) that is not D-separated by \(X\)
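A sketch of this check on simulated data. The full DAG is given in the figure; for illustration the code simply assumes the fragment \(Z_1 \rightarrow X \rightarrow W\), which implies \(W \perp\!\!\!\perp Z_1 \mid X\).

```r
# Simulate Z1 -> X -> W (arbitrary coefficients), then test the implied
# conditional independence by regressing W on X and Z1.
set.seed(5)
n  <- 1e4
Z1 <- rnorm(n)
X  <- 0.7 * Z1 + rnorm(n)
W  <- 0.9 * X + rnorm(n)

summary(lm(W ~ X + Z1))$coefficients
# The estimate for Z1 should be close to 0; a clearly non-zero estimate would
# count as evidence against the assumed DAG.
```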
What happens if we condition on both \(W\) and \(X\)?
Conditioning on \(\{ W,X \}\) makes \(Z\) and \(Y\) conditionally independent.
\(Z \perp\!\!\!\perp Y \mid \{ W,X \}\)
As complexity increases, the answer becomes less and less intuitive.
Luckily these questions can be addressed by mechanically applying logical rules, which have been automated in software like dagitty
DAGs have implications about the type of conditional and unconditional independencies that should be observed between the variables.
We can use the D-separation criterion to falsify and test a graphical causal model.
This type of local test is different from common practice (e.g. in SEM) of estimating parameters for the whole model and interpreting goodness-of-fit indices (e.g. RMSEA).
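In the dagitty R package these local tests can be run all at once with localTests(); the sketch below assumes continuous variables (type = "cis" uses partial correlations) and reuses the simulated data from the factorisation sketch.

```r
# Run every implied conditional-independence test against the data at once.
library(dagitty)
g <- dagitty("dag {
  X -> W
  X -> Y
  W -> Y
  W -> Z
  Y -> Z
}")

set.seed(1)
n <- 1e4
X <- rnorm(n)
W <- 0.8 * X + rnorm(n)
Y <- 0.5 * X + 0.4 * W + rnorm(n)
Z <- 0.7 * W - 0.3 * Y + rnorm(n)
d <- data.frame(X, W, Y, Z)

localTests(g, data = d, type = "cis")
# One row per testable independence; estimates near 0 are consistent with the DAG.
```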
By testing constraints systematically one can attempt to recover the causal structure from data. This requires a fairly large amount of data (many tests, each subject to sampling error).
When we have latent variables the output is not a unique DAG, but a set of graphs with equivalent implications (an equivalence class).
To estimate the causal effect of some variable \(X\) on some outcome \(Y\), we need to block all “spurious” paths transmitting non-causal associations.
Non-causal paths are essentially paths that connect \(X\) to \(Y\) but have an arrow that enters \(X\). These are non-causal because changing \(X\) will not cause a change in \(Y\) (at least not through this path).
Example:
To estimate the causal effect \(X \rightarrow Y\) we can do a randomized control trial (RCT) in which we allocate people to treatment (\(X = 1\)) and control groups (\(X = 0\)) randomly.
The random allocation can be represented as the deletion of an arrow in the DAG.
If we don’t intervene, the confounder \(Z\) influences the likelihood of taking part in the intervention (\(X\)).
If we intervene on \(X\) by allocating participants randomly we effectively erase the arrow from \(Z\) to \(X\).
In an RCT we can estimate the causal effect simply by measuring the association between \(X\) and \(Y\).
Alternatively (if we can’t intervene) we need to block the non-causal path \(X \leftarrow Z \rightarrow Y\) by “controlling” for \(Z\).
To estimate a variable’s causal effect we adjust for its parents in the DAG.
Often some parent variables, although represented in the DAG, may be unmeasured or inaccessible for measurement.
We then need to find an alternative set of variables to adjust for.
List all of the paths connecting \(X\) (the potential cause of interest) and \(Y\) (the outcome).
Classify each path by whether it is open or closed.
A path is open unless it contains a collider.
Classify each path by whether it is a backdoor path.
A backdoor path has an arrow entering \(X\).
For each open backdoor path, decide which variable(s) to condition on to close it.
It may not be possible to close all backdoor paths, in which case one would conclude that the causal effect cannot be estimated from the available data.
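A sketch of the first steps with the dagitty R package, using the simple confounded DAG \(X \leftarrow Z \rightarrow Y\), \(X \rightarrow Y\) as an example.

```r
library(dagitty)
g <- dagitty("dag {
  Z -> X
  Z -> Y
  X -> Y
}")

paths(g, "X", "Y")   # lists every path and whether it is open
# X -> Y       : the causal path
# X <- Z -> Y  : an open backdoor path (arrow entering X); conditioning on Z closes it
```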
Here \(X\) is the exposure of interest, \(Y\) the outcome, and \(U\) an unobserved (latent) variable.
There are two ‘indirect’ paths between \(X\) and \(Y\)
Path (2) is already blocked by the collider at \(B\). Path (1) needs to be blocked; we can’t condition on \(U\) since it is not observed, so this leaves us with either \(A\) or \(C\) (either will suffice).
The solution can also be found automatically using dagitty:
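The exact DAG is shown in the figure; the code below uses a reconstruction that is merely consistent with the description above (a latent \(U\), one open backdoor path through \(A\), \(U\) and \(C\), and a second one blocked by the collider \(B\)).

```r
library(dagitty)
g <- dagitty("dag {
  U [latent]
  U -> A
  U -> C
  A -> X
  A -> B
  C -> B
  C -> Y
  X -> Y
}")

adjustmentSets(g, exposure = "X", outcome = "Y")
# Returns { A } and { C }: adjusting for either variable closes the open
# backdoor path, while U stays excluded because it is latent.
```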
Pearl (1995) introduced a new operator, \(\text{do}(\cdot)\), to explicitly represent interventions on a variable.
Adopting this notation the causal effect of interest is: \[P(Y=1 \mid \text{do}(X = 1)) - P(Y=1 \mid \text{do}(X = 0))\]
Adjustment formula
\[P(Y \mid \text{do}(X = x)) = \sum_{z} P(Y \mid X = x, Z = z) P(Z = z)\]
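As a sketch, the adjustment formula can be applied directly to simulated binary data in which \(Z\) confounds \(X\) and \(Y\); all the generating probabilities below are arbitrary.

```r
# Z confounds X and Y; the true causal effect of X on Y is 0.3 by construction.
set.seed(6)
n <- 1e5
Z <- rbinom(n, 1, 0.5)
X <- rbinom(n, 1, 0.2 + 0.6 * Z)            # Z makes 'treatment' more likely
Y <- rbinom(n, 1, 0.1 + 0.3 * X + 0.4 * Z)  # both X and Z raise P(Y = 1)

# Naive (confounded) contrast:
mean(Y[X == 1]) - mean(Y[X == 0])

# Adjustment formula: P(Y = 1 | do(X = x)) = sum_z P(Y = 1 | X = x, Z = z) P(Z = z)
p_do <- function(x) {
  sum(sapply(c(0, 1), function(z) mean(Y[X == x & Z == z]) * mean(Z == z)))
}
p_do(1) - p_do(0)   # close to the true causal effect of 0.3
```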
DAGs help us reason explicitly about causal pathways among variables of interest.
By expressing our theory as a DAG, we can use D-separation to derive and test its implications.
Drawing a DAG and checking for backdoor paths provides a principled way to decide which variables to adjust for when estimating causal effects.
Graphical causal models offer a rigorous, transparent method for stating and scrutinizing causal assumptions.
Attempting to draw a plausible DAG can be a highly uncomfortable experience, as it fully exposes our uncertainty about our object of study.
DAGs make our assumptions transparent and easier to critique.
Unless we are able to intervene (as in an RCT), a single study rarely offers conclusive evidence.
Establishing a convincing causal DAG requires broad, convergent evidence from multiple designs and studies to constrain the plausible causal mechanisms.
DAGs are “the simplest possible” model; they clarify assumptions but cannot replace fully mechanistic, dynamic models.
Acyclic graphs can’t capture feedback or highly interconnected systems (e.g., the brain, the economy). Such systems exhibit complexity (e.g. sensitive dependence on initial conditions) that DAGs cannot capture.