Master causal inference. Complete guide to causal machine learning, causal graphs, treatments, and discovering cause-and-effect relationships from data.
Introduction: Causal Inference
Correlation is not causation. Everyone knows this.
Yet it’s the hardest lesson to apply.
You observe: “Users who receive the email campaign spend more.”
You conclude: “Email causes spending.”
You increase email frequency.
Spending drops.
What happened? The original relationship was correlation (users inclined to spend also open emails), not causation. More emails didn’t cause spending; they annoyed users.
Causal inference—rigorously determining cause-and-effect—is one of ML’s hardest and most important problems.
Why important? Because:
- Policy decisions depend on causal effects (“Will this intervention help?”)
- Predictions fail when underlying causal structure changes
- Optimization requires understanding what actions cause outcomes
- Fairness requires causal reasoning (discrimination vs correlation)
This guide covers causal inference: from understanding why correlation fails to methods for discovering causation from data to practical applications.
Correlation vs Causation
Why Correlation Fails
Example: Coffee and Heart Disease
Observation: “People who drink more coffee have more heart disease.”
Possible explanations:
- Coffee causes heart disease (causal)
- Unhealthy people drink more coffee to stay awake (reverse causation)
- Coffee drinkers smoke more (confounding: smoking causes both)
All are compatible with the observed correlation.
Confounding
Most common reason correlation ≠ causation.
Confounding Variable (Confounder): A variable that affects both treatment and outcome.
            Smoking (confounder)
           ↙                  ↘
Coffee Drinking          Heart Disease

Observed: coffee drinking and heart disease rise together
But: smoking is the real cause of both
Result: Coffee appears to cause disease when it doesn’t.
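This story can be simulated in a few lines. All numbers below (smoking rate, disease risks, coffee habits) are invented for illustration, and coffee has zero causal effect by construction; yet the naive comparison shows a large gap that vanishes once we stratify by smoking:

```python
import random

random.seed(0)

population = []
for _ in range(20000):
    smoker = random.random() < 0.3                          # 30% smoke (made up)
    coffee = random.random() < (0.8 if smoker else 0.3)     # smokers drink more coffee
    disease = random.random() < (0.25 if smoker else 0.05)  # risk depends ONLY on smoking
    population.append((smoker, coffee, disease))

def disease_rate(rows):
    return sum(d for _, _, d in rows) / len(rows)

# Naive comparison: coffee drinkers vs non-drinkers (confounded by smoking).
naive_gap = (disease_rate([r for r in population if r[1]])
             - disease_rate([r for r in population if not r[1]]))

# Stratified comparison: hold smoking fixed, then compare.
strata_gaps = []
for s in (True, False):
    stratum = [r for r in population if r[0] == s]
    strata_gaps.append(disease_rate([r for r in stratum if r[1]])
                       - disease_rate([r for r in stratum if not r[1]]))

print(f"naive coffee 'effect':     {naive_gap:+.3f}")      # large, but spurious
print(f"effect within smokers:     {strata_gaps[0]:+.3f}")  # near zero
print(f"effect within non-smokers: {strata_gaps[1]:+.3f}")  # near zero
```

Stratifying on the confounder is the simplest form of "adjustment" discussed later in this guide.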
Selection Bias
Study participants differ from population.
Example: Online survey about dieting
Who responds? People interested in dieting
Biased sample: Over-represents diet enthusiasts
Results don't generalize to population
Causal Graphs and DAGs
Directed Acyclic Graph (DAG): Visual representation of causal relationships.
Example: Income and Education
Parents' Income        Intelligence
            ↘          ↙        ↘
          Education Level ──→ Personal Income
Interpretation:
- Parents’ income affects education (direct cause)
- Intelligence affects education (direct cause)
- Education affects personal income (direct cause)
- Intelligence also affects personal income directly
- No cycles (acyclic)
Reading DAGs
Arrow → means: Causality flows this direction
X → Y: X causes Y (direct effect)
X → Z → Y: X affects Y through Z (indirect effect)
X ← Z → Y: Z affects both X and Y (confounding)
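One lightweight way to work with a DAG in code is a plain adjacency mapping. The sketch below uses a hypothetical three-node graph (Z confounds X and Y, matching the patterns above) and shows how to read off parents, find common causes, and verify acyclicity:

```python
# DAG as a mapping from each node to the nodes it directly causes.
dag = {
    "Z": ["X", "Y"],   # Z causes both X and Y (confounder)
    "X": ["Y"],        # X causes Y (direct effect)
    "Y": [],
}

def parents(dag, node):
    """Nodes with an arrow pointing into `node`."""
    return [p for p, children in dag.items() if node in children]

def confounders(dag, treatment, outcome):
    """Common causes: parents of both treatment and outcome."""
    return sorted(set(parents(dag, treatment)) & set(parents(dag, outcome)))

def is_acyclic(dag):
    """Depth-first search; a back-edge means a directed cycle."""
    visiting, done = set(), set()
    def visit(n):
        if n in done:
            return True
        if n in visiting:
            return False
        visiting.add(n)
        ok = all(visit(c) for c in dag.get(n, []))
        visiting.discard(n)
        done.add(n)
        return ok
    return all(visit(n) for n in dag)

print(confounders(dag, "X", "Y"))  # ['Z']
print(is_acyclic(dag))             # True
```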
Constructing DAGs
Based on:
- Subject matter expertise: Understand how variables relate
- Temporal order: Causes must precede effects
- Prior research: What’s known about relationships
- Domain knowledge: How do systems work
Critical step: Getting the DAG right determines whether causal inference works.
Causal Identification
Question: Can we estimate causal effect from data?
Answer depends on causal structure.
Identifiability
Identifiable: Can estimate causal effect from observational data
Parent Income → Education → Personal Income
Even without randomization, can estimate:
- Effect of education on income
(Using causal inference methods)
Non-Identifiable: Cannot estimate from observational data
Education ← Ability → Earnings
(Ability is unobserved)
Cannot disentangle:
- Effect of education on earnings
- Effect of ability on earnings
Need randomization (or another identification strategy)
Key Concepts
Back-door Criterion: Identifies the confounding paths between treatment and outcome that must be blocked.
Treatment ← Confounder → Outcome
This "back-door" path creates correlation without causation
Must block it (adjust for confounder)
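Blocking the back-door path has a concrete formula: with one observed confounder Z, P(Y=1 | do(T=t)) = Σ_z P(z) · P(Y=1 | t, z), i.e. average the stratum-specific outcome rates using the population weights of Z, not the weights among the treated. A worked toy example with hand-picked (hypothetical) probabilities:

```python
# P(Y=1 | T, Z) and P(Z), chosen by hand for illustration only.
p_y_given_tz = {(1, 1): 0.5, (1, 0): 0.3, (0, 1): 0.4, (0, 0): 0.2}
p_z = {1: 0.4, 0: 0.6}

def p_y_do(t):
    # Back-door adjustment: weight strata by the population distribution of Z.
    return sum(p_z[z] * p_y_given_tz[(t, z)] for z in p_z)

ate = p_y_do(1) - p_y_do(0)
print(f"P(Y=1|do(T=1)) = {p_y_do(1):.2f}")  # 0.38
print(f"P(Y=1|do(T=0)) = {p_y_do(0):.2f}")  # 0.28
print(f"ATE = {ate:.2f}")                   # 0.10
```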
Front-door Criterion: Can estimate causation through mediating variables.
Treatment → Mediator → Outcome
Even with unobserved confounding of Treatment-Outcome
Can estimate the causal effect if the mediating path is fully observed
Randomized Controlled Trials (RCTs)
Gold Standard: Randomly assign treatment, measure outcome.
Why RCTs Work
Random assignment breaks confounding.
Without randomization:
Healthy people exercise more (health causes exercise)
Exercisers healthier (correlation)
Confounded: Can't tell if exercise helps
With randomization:
Randomly assign: Exercise vs No Exercise
Health status equal in both groups (on average)
Difference in outcomes = causal effect
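The exercise example can be simulated. The effect size (+5 health-score points) and the self-selection rule are assumptions for illustration; the point is that the observational contrast is badly inflated while the randomized contrast recovers the true effect:

```python
import random

random.seed(1)
TRUE_EFFECT = 5.0

def outcome(exercises, baseline):
    # Health score: baseline trait + true causal effect of exercise + noise.
    return baseline + (TRUE_EFFECT if exercises else 0.0) + random.gauss(0, 1)

n = 10000
baselines = [random.gauss(50, 10) for _ in range(n)]

# Observational world: healthier people self-select into exercise.
obs = [(b > 50, outcome(b > 50, b)) for b in baselines]

# RCT world: a coin flip assigns exercise, independent of baseline health.
rct = []
for b in baselines:
    assigned = random.random() < 0.5
    rct.append((assigned, outcome(assigned, b)))

def diff_in_means(rows):
    treated = [y for t, y in rows if t]
    control = [y for t, y in rows if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)

print(f"observational estimate: {diff_in_means(obs):+.1f}")  # wildly inflated
print(f"RCT estimate:           {diff_in_means(rct):+.1f}")  # close to +5
```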
Design
Process:
- Recruit participants
- Randomly assign to treatment/control
- Give treatment to treatment group
- Measure outcome in both
- Compare outcomes
Strengths:
- Unbiased estimate of the causal effect
- Simple interpretation
- Handles unmeasured confounders
Limitations:
- Expensive
- Time-consuming
- Sometimes unethical
- May not generalize
Observational Studies
When randomization impossible (unethical or impractical), use observational data.
Observational Data: Not randomly assigned, just observed.
Challenge
With observational data, must be careful about confounding.
Example: Does smoking cause cancer?
Can't randomize (unethical)
Use observational data (who smokes, who gets cancer)
Account for confounders (age, genetics, etc.)
Estimate causal effect
Methods for Adjustment
1. Matching: Compare smokers to similar non-smokers (matched on age, genetics, etc.)
Smoker: Age 45, Genetics A
Compare to non-smoker: Age 45, Genetics A
Difference in cancer → causal effect of smoking
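A minimal sketch of exact matching, using a handful of invented records (the ages, genetic markers, and outcomes are hypothetical): pair each smoker with a not-yet-used non-smoker sharing the same covariates, then average the within-pair outcome differences:

```python
smokers = [
    {"age": 45, "gene": "A", "cancer": 1},
    {"age": 45, "gene": "A", "cancer": 0},
    {"age": 60, "gene": "B", "cancer": 1},
]
nonsmokers = [
    {"age": 45, "gene": "A", "cancer": 0},
    {"age": 60, "gene": "B", "cancer": 0},
    {"age": 60, "gene": "B", "cancer": 1},
]

def match_key(person):
    return (person["age"], person["gene"])

# Index potential controls by their covariates.
pool = {}
for person in nonsmokers:
    pool.setdefault(match_key(person), []).append(person)

diffs = []
for s in smokers:
    candidates = pool.get(match_key(s), [])
    if candidates:
        control = candidates.pop()   # matching without replacement
        diffs.append(s["cancer"] - control["cancer"])

att = sum(diffs) / len(diffs)        # effect among the treated who found a match
print(f"matched pairs: {len(diffs)}, estimated effect: {att:+.2f}")  # 2 pairs, +0.50
```

The unmatched smoker is simply dropped, which is why matching estimates the effect among the treated (and only the matchable treated) rather than the whole population.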
2. Regression Adjustment: Include confounders in regression model.
Cancer ~ Smoking + Age + Genetics + ...
Coefficient on Smoking = causal effect (if all confounders included)
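Regression adjustment can be sketched without any libraries by solving the normal equations directly. The data below are simulated with an assumed true treatment effect of 2.0 and a confounder Z that both pushes people into treatment and raises the outcome; including Z in the regression removes the bias:

```python
import random

random.seed(2)
n = 5000
rows = []
for _ in range(n):
    z = random.gauss(0, 1)                             # confounder
    t = 1.0 if random.gauss(0, 1) + z > 0 else 0.0     # Z drives treatment uptake
    y = 2.0 * t + 3.0 * z + random.gauss(0, 1)         # true treatment effect = 2.0
    rows.append((t, z, y))

def ols(X, y):
    """Solve (X'X) beta = X'y by Gaussian elimination with partial pivoting."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, k))) / A[r][r]
    return beta

y = [yv for _, _, yv in rows]
naive = ols([(1.0, t) for t, _, _ in rows], y)          # Y ~ T only
adjusted = ols([(1.0, t, z) for t, z, _ in rows], y)    # Y ~ T + Z

print(f"naive coefficient on T:    {naive[1]:.2f}")     # inflated by Z
print(f"adjusted coefficient on T: {adjusted[1]:.2f}")  # close to 2.0
```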
3. Instrumental Variables: Use variable that affects treatment but not outcome (except through treatment).
Example: Tax on cigarettes
Affects whether people smoke (treatment)
Doesn't directly affect health (except through smoking)
Can use tax as instrument to estimate smoking's effect on health
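The tax example maps onto the classic IV ratio estimator: effect = Cov(Z, Y) / Cov(Z, T). A simulated sketch, where the instrument strength, the hidden confounder, and the true effect of 1.5 are all assumptions made for illustration:

```python
import random

random.seed(3)
n = 20000
Z, T, Y = [], [], []
for _ in range(n):
    u = random.gauss(0, 1)                       # unobserved confounder
    z = random.choice([0.0, 1.0])                # instrument (e.g. high-tax region)
    t = 0.5 * z + u + random.gauss(0, 1)         # Z and U both move treatment
    y = 1.5 * t + 2.0 * u + random.gauss(0, 1)   # true effect of T = 1.5
    Z.append(z)
    T.append(t)
    Y.append(y)

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (v - mb) for x, v in zip(a, b)) / len(a)

naive = cov(T, Y) / cov(T, T)   # OLS slope, biased upward by U
iv = cov(Z, Y) / cov(Z, T)      # IV ratio estimate

print(f"naive slope: {naive:.2f}")  # biased by the confounder
print(f"IV estimate: {iv:.2f}")     # close to 1.5
```

The key assumptions are that the instrument moves the treatment (relevance) and affects the outcome only through the treatment (exclusion); neither is testable from the data alone.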
Treatment Effects
Types of Treatment Effects
Average Treatment Effect (ATE):
ATE = E[Y(1)] - E[Y(0)]
= Expected outcome if everyone treated - if no one treated
Example: Average effect of medication on health
Heterogeneous Treatment Effects (HTE): Effect differs by person.
Medication helps people with severe disease
Doesn't help people with mild disease
Effect heterogeneous by disease severity
Conditional Average Treatment Effect (CATE):
CATE(X) = E[Y(1) - Y(0) | X]
= Treatment effect for people with characteristics X
Example: Medicine works for older people, not younger
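A simulation makes the ATE/CATE distinction concrete. Note that the code below uses both potential outcomes Y(0) and Y(1) for each person, which only a simulation can do; in real data you observe at most one of them. The effect pattern (+10 for severe cases, 0 for mild) is invented:

```python
import random

random.seed(4)
people = []
for _ in range(1000):
    severe = random.random() < 0.5
    y0 = random.gauss(40, 5)                 # outcome without treatment
    y1 = y0 + (10.0 if severe else 0.0)      # outcome with treatment
    people.append({"severe": severe, "y0": y0, "y1": y1})

def mean(xs):
    return sum(xs) / len(xs)

# ATE = E[Y(1) - Y(0)] over everyone: a blend of the two subgroups.
ate = mean([p["y1"] - p["y0"] for p in people])

# CATE(X) = E[Y(1) - Y(0) | X] for each subgroup.
cate = {
    sev: mean([p["y1"] - p["y0"] for p in people if p["severe"] == sev])
    for sev in (True, False)
}

print(f"ATE:           {ate:+.1f}")          # about +5, a blend
print(f"CATE | severe: {cate[True]:+.1f}")   # +10
print(f"CATE | mild:   {cate[False]:+.1f}")  # 0
```

The ATE here is positive, yet it describes nobody exactly: severe cases gain twice as much, mild cases gain nothing. That is precisely why heterogeneous effects matter for decisions.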
Why Heterogeneous Effects Matter
Example: A/B Test
Overall: Treatment improves conversion 2%
But: Works for new users (+5%), hurts for loyal users (-2%)
Optimal: Treat new users, don’t treat loyal users
Methods for Causal Inference
Propensity Score Matching
Idea: Treated and untreated similar except for treatment.
1. Estimate probability of treatment given characteristics (propensity score)
2. Match treated units with similar untreated units (same propensity score)
3. Compare outcomes
4. Difference = causal effect
Advantage: Simple, interpretable
Disadvantage: Requires strong assumptions
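A sketch of the propensity-score idea, here via stratification on the score (a close cousin of one-to-one matching). The data are simulated with a true effect of 1.0, and for simplicity the true propensity equals the covariate x itself, so binning x stands in for estimating the score:

```python
import random

random.seed(5)
rows = []
for _ in range(20000):
    x = random.uniform(0, 1)                      # observed confounder
    t = random.random() < x                       # true propensity P(T=1|x) = x
    y = 1.0 * t + 3.0 * x + random.gauss(0, 0.5)  # x also raises the outcome
    rows.append((x, t, y))

# Step 1: estimate the propensity score within bins of x.
BINS = 10
def bin_of(x):
    return min(int(x * BINS), BINS - 1)

# Steps 2-4: within each propensity stratum the treated/untreated comparison
# is (approximately) unconfounded; average the strata weighted by size.
total, weight = 0.0, 0
for b in range(BINS):
    stratum = [r for r in rows if bin_of(r[0]) == b]
    treated = [y for _, t, y in stratum if t]
    control = [y for _, t, y in stratum if not t]
    if treated and control:
        gap = sum(treated) / len(treated) - sum(control) / len(control)
        total += gap * len(stratum)
        weight += len(stratum)

stratified_ate = total / weight

naive = (sum(y for _, t, y in rows if t) / sum(1 for _, t, _ in rows if t)
         - sum(y for _, t, y in rows if not t) / sum(1 for _, t, _ in rows if not t))

print(f"naive difference:      {naive:.2f}")           # confounded by x
print(f"propensity-stratified: {stratified_ate:.2f}")  # close to 1.0
```

The "strong assumption" is visible in the code: every confounder must be in x. If an unmeasured confounder drives both t and y, no amount of propensity modeling fixes it.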
Doubly Robust Estimation
Combine two methods, more robust.
1. Propensity score weighting
2. Outcome regression
If either model is correct, the estimate stays consistent
More reliable than either method alone
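A sketch of the doubly robust (AIPW) estimator on simulated data with an assumed true effect of 1.0. To illustrate the robustness, the outcome models below are deliberately crude (plain group means, ignoring the confounder entirely), while the propensity model is correct; AIPW still lands near the truth:

```python
import random

random.seed(6)
rows = []
for _ in range(30000):
    x = random.uniform(0.1, 0.9)                   # confounder, bounded away from 0/1
    t = 1 if random.random() < x else 0            # true propensity e(x) = x
    y = 1.0 * t + 2.0 * x + random.gauss(0, 0.5)   # true effect = 1.0
    rows.append((x, t, y))

# Deliberately misspecified outcome models: just the group means.
m1 = sum(y for _, t, y in rows if t) / sum(1 for _, t, _ in rows if t)
m0 = sum(y for _, t, y in rows if not t) / sum(1 for _, t, _ in rows if not t)

# Correct propensity model (known by construction in this simulation).
def e(x):
    return x

# AIPW: outcome-model prediction plus an inverse-propensity-weighted
# correction using the residuals.
terms = []
for x, t, y in rows:
    aug1 = m1 + t * (y - m1) / e(x)
    aug0 = m0 + (1 - t) * (y - m0) / (1 - e(x))
    terms.append(aug1 - aug0)

aipw = sum(terms) / len(terms)
naive = m1 - m0
print(f"misspecified outcome model alone: {naive:.2f}")  # biased
print(f"AIPW estimate:                    {aipw:.2f}")   # near 1.0
```

The symmetric case also holds: with a correct outcome model and a wrong propensity model, the estimator again stays consistent, which is what "doubly robust" means.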
Causal Forests
Machine learning approach to estimate heterogeneous treatment effects.
Forest learns:
- Who benefits from treatment
- Who doesn't
- How much each group benefits
More flexible than linear methods
Captures non-linear effects
Causal Machine Learning
Combining ML with causal inference.
Challenges
Standard ML optimizes: Predict Y given X (whatever correlation helps)
Causal ML needs: Understand how X causes Y (causal mechanism)
Different objectives:
- Prediction: P(Y | X) any correlation works
- Causation: What happens if we change X?
Approaches
1. Causal Forests (Athey & Wager): ML + causal inference for treatment effects
2. CATE Estimation: Use ML to estimate treatment effect as function of X
ML model learns:
Effect(X) = How much treatment helps someone with characteristics X
3. Double Machine Learning (Chernozhukov et al.): Use ML for nuisance parameters, causal inference for the parameter of interest
Advantage: Leverage ML’s flexibility, maintain causal guarantees
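A minimal sketch of the double-ML recipe: fit flexible nuisance models for E[T|X] and E[Y|X] (simple bin-average predictors stand in for any ML learner here), use cross-fitting so each point is predicted by a model trained on the other fold, then regress outcome residuals on treatment residuals. The data-generating process and the true effect of 1.0 are assumptions for illustration:

```python
import math
import random

random.seed(7)
n = 20000
data = []
for _ in range(n):
    x = random.uniform(0, 1)
    t = x ** 2 + random.gauss(0, 0.3)                    # treatment depends on x
    y = 1.0 * t + math.sin(3 * x) + random.gauss(0, 0.3)  # true effect = 1.0
    data.append((x, t, y))

def fit_bins(train, target_idx, bins=20):
    """Crude 'ML' regressor: predict the bin-average of the target given x."""
    sums, counts = [0.0] * bins, [0] * bins
    for row in train:
        b = min(int(row[0] * bins), bins - 1)
        sums[b] += row[target_idx]
        counts[b] += 1
    overall = sum(sums) / max(sum(counts), 1)
    means = [sums[b] / counts[b] if counts[b] else overall for b in range(bins)]
    return lambda x: means[min(int(x * bins), bins - 1)]

# Cross-fitting: each half is predicted by models trained on the other half.
half = n // 2
folds = [(data[:half], data[half:]), (data[half:], data[:half])]
num = den = 0.0
for train, test in folds:
    t_hat = fit_bins(train, 1)   # nuisance model for E[T|X]
    y_hat = fit_bins(train, 2)   # nuisance model for E[Y|X]
    for x, t, y in test:
        rt = t - t_hat(x)        # partial x out of the treatment
        ry = y - y_hat(x)        # partial x out of the outcome
        num += rt * ry
        den += rt * rt

theta = num / den                # residual-on-residual regression slope
print(f"DML estimate of treatment effect: {theta:.2f}")  # near 1.0
```

Cross-fitting is the detail that preserves valid inference: without it, overfit nuisance models leak into the residuals and bias the final slope.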
Applications
Healthcare
Treatment Selection: Which patients benefit from treatment?
ML identifies:
- Patients most likely to respond
- Patients likely to have side effects
- Optimal treatment per patient
E-commerce
Recommendation Interventions: What recommendations actually cause purchases?
Traditional: Show recommendation, user buys
Causal: Did recommendation cause purchase or was user buying anyway?
Causal inference answers:
Which recommendations actually convert
Policy
Policy Evaluation: Does policy cause intended outcome?
Example: Job training program
Causal inference estimates:
How much does training increase future earnings?
How much for different groups?
What is the ROI of the program?
Key Takeaways
✓ Correlation ≠ Causation – Fundamental principle, easy to forget in practice
✓ Confounding is the usual culprit – Most common reason correlation misleads
✓ DAGs help reasoning – Visualize causal structures
✓ Identifiability crucial – Can we estimate causal effect?
✓ RCTs gold standard – But expensive, sometimes unethical
✓ Observational studies possible – With careful methods
✓ Methods exist – Matching, regression, instrumental variables, etc.
✓ Heterogeneous effects important – Effects differ by person
✓ Causal ML emerging – Combining ML with causal inference
✓ Domain knowledge essential – Causal inference requires understanding domain
Frequently Asked Questions
Q: How do I know if a relationship is causal?
A: Randomization (RCT) is best. Observational: Domain knowledge, DAGs, careful analysis.
Q: Can ML solve causality?
A: No. ML helps with estimation, but causality rests on assumptions about how the data were generated, not on the data alone.
Q: What if I can’t randomize?
A: Use observational methods (matching, regression, IV) but acknowledge limitations.
Q: Are causal effects real?
A: Yes, but often heterogeneous. What works for one person may not work for another.
Q: How do I get causal graphs right?
A: Domain expertise, literature review, expert consultation. Iterative process.

