Master causal discovery. Complete guide to learning causal graphs from data, automated causal structure identification, and discovering cause-and-effect relationships.
Introduction: Learning Causal Structures from Data
In the previous guide on causal inference, you knew the causal graph and estimated effects from it.
But what if you don’t know the graph?
Real-world problem: You have data, but no one documented the causal structure. You observe variables X, Y, Z, but how do they relate causally?
Causal discovery: Automatically learn causal structure from data.
Why it matters:
- Domain knowledge incomplete: No one knows everything
- New phenomena: Discovering new relationships
- Data-driven science: Let data reveal structure
- Automated analysis: Scale beyond manual graph construction
This guide covers causal discovery: from fundamental challenges to methods (constraint-based, score-based, functional) to practical implementation.
Causal Discovery Fundamentals
The Challenge
Multiple graphs consistent with same data.
Possible Graph 1: A → B → C (chain)
Possible Graph 2: A ← B → C (fork)
Possible Graph 3: A ← B ← C (reverse chain)
All three imply the same conditional independence (A ⊥ C | B), so all are compatible with the same observational data!
Cannot distinguish them without additional assumptions
Key insight: Data alone insufficient. Need assumptions about causal process.
Identifiability
Question: Can true causal graph be uniquely identified?
Answer: Only under assumptions.
Common Assumptions:
- Acyclicity: No cycles (effects don’t cause causes)
- Faithfulness: Every independence in the distribution is implied by the graph (no paths that coincidentally cancel)
- Markov condition: Every variable is independent of its non-descendants given its parents
- Causal sufficiency: No hidden confounders
Markov Equivalence
Multiple graphs equivalent (same conditional independences).
A → B → C
A ← B → C
A ← B ← C
These three ARE Markov equivalent: each implies exactly A ⊥ C | B
But:
A → B ← C (collider)
This one is NOT equivalent: it implies A ⊥ C marginally, and A and C become dependent given B, a different independence pattern
Result: Even perfect algorithm can’t distinguish equivalent structures.
Constraint-Based Methods
Learn graph by testing conditional independences.
PC Algorithm (Peter-Clark)
The most famous constraint-based method, named after Peter Spirtes and Clark Glymour.
Process:
- Start with complete graph (all variables connected)
- Test conditional independences
- Remove edges where independence found
- Orient edges using rules
Example:
Start: A-B-C-D (all connected)
Test: Is A ⊥ C | B? (Is A independent of C given B?)
Yes → Remove edge A-C
Test: Is A ⊥ D | B,C?
Yes → Remove edge A-D
Result: DAG reflecting independences
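The edge-removal test can be sketched in a few lines. This is an illustration on synthetic chain data, not a full PC implementation; `partial_corr` is a hand-rolled helper, and real implementations use proper conditional independence tests:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Synthetic data generated from the chain A -> B -> C
A = rng.normal(size=n)
B = 0.8 * A + rng.normal(size=n)
C = 0.8 * B + rng.normal(size=n)

def partial_corr(x, y, z):
    """Correlation between x and y after regressing out z from both."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

r_marginal = np.corrcoef(A, C)[0, 1]  # clearly nonzero: A and C correlate
r_given_B = partial_corr(A, C, B)     # near zero: A independent of C given B
# PC therefore removes the A-C edge while keeping A-B and B-C
```
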
Advantages
- Works with any number of variables
- Identifies some causal directions (v-structures)
- Theoretically grounded
Disadvantages
- Statistical tests can fail (finite sample)
- Assumes no hidden confounders
- Unstable (small changes in the data → big differences in the output graph)
Score-Based Methods
Learn graph by optimization (maximize score).
BIC (Bayesian Information Criterion)
Score balances:
- Fit: How well does graph explain data
- Complexity: How many edges (penalize)
BIC score = log-likelihood - (log(n) / 2) × number_of_parameters
Higher BIC score = Better graph
Search for graph maximizing BIC
Process:
- Start with graph (usually empty)
- Try adding/removing edges
- Compute BIC for each
- Keep edge change that most improves BIC
- Repeat until convergence
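A single step of this search, comparing the score with and without one candidate edge, might look like the following toy linear-Gaussian sketch (`gaussian_bic` is an illustrative helper; real implementations decompose the score per node and search over many edges):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
A = rng.normal(size=n)
B = 0.9 * A + rng.normal(size=n)

def gaussian_bic(residuals, k, n):
    """Local BIC score for one node: Gaussian log-likelihood minus complexity penalty."""
    sigma2 = np.mean(residuals**2)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return loglik - 0.5 * k * np.log(n)

# Candidate 1: B has no parents (modeled by its mean alone, k = 1)
bic_no_edge = gaussian_bic(B - B.mean(), k=1, n=n)
# Candidate 2: edge A -> B (slope + intercept, k = 2)
coef = np.polyfit(A, B, 1)
bic_edge = gaussian_bic(B - np.polyval(coef, A), k=2, n=n)
# Greedy search keeps the change with the higher score: here, adding A -> B
```
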
Advantages
- Theoretically justified (Bayesian perspective)
- Single objective to optimize
- Works with any causal model
Disadvantages
- Computationally expensive (search space huge)
- No guarantees of finding true graph
- Still assumes no hidden confounders
Functional Causal Models
Assume specific functional form.
Linear Non-Gaussian Acyclic Model (LiNGAM)
Assumes linear relationships, non-Gaussian noise, and no cycles.
B = a1 × A + noise_B
C = a2 × B + a3 × A + noise_C
Linear functions with noise
Can recover causal structure
Key insight: Non-Gaussian noise helps identify direction.
If noise Gaussian: A → B and B → A observationally equivalent
If noise non-Gaussian: Can distinguish (identifiable)
Advantage: Closed-form solution (fast)
Disadvantage: Assumes linearity
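The direction-identification idea can be sketched as follows. This is a crude stand-in for LiNGAM's actual independence measure, using a simple cubic correlation rather than a proper independence test:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
# True model: A -> B, with uniform (non-Gaussian) noise
A = rng.uniform(-1, 1, size=n)
B = 0.5 * A + rng.uniform(-1, 1, size=n)

def residual_dependence(cause, effect):
    """Regress effect on cause, then measure leftover (nonlinear)
    dependence between the residual and the candidate cause."""
    slope = np.cov(cause, effect)[0, 1] / np.var(cause)
    resid = effect - slope * cause
    return abs(np.corrcoef(cause**3, resid)[0, 1])

forward = residual_dependence(A, B)   # correct direction: residual ~ independent
backward = residual_dependence(B, A)  # wrong direction: dependence remains
# forward is near zero, backward clearly is not, so we pick A -> B
```
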
Non-Linear Models
Generalize to non-linear relationships.
C = f(A, B) + noise_C
Where f is non-linear function
More flexible but harder to identify
Linear Models
Regression Approach
If the causal order (topological order) is known, coefficients can be identified with regression.
Known order: A → B → C
Then:
- C = α × B + β × A + γ + noise_C
- B = δ × A + ε + noise_B
- A is exogenous (no parents)
Can estimate from data
Advantage: Simple if ordering known
Disadvantage: Must know ordering
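Given the ordering, the coefficients above can be estimated with ordinary least squares. A minimal sketch on synthetic data, regressing each variable on everything earlier in the order:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
# Known topological order A -> B -> C, with a direct A -> C edge too
A = rng.normal(size=n)
B = 2.0 * A + rng.normal(size=n)            # true delta = 2.0
C = 1.5 * B - 1.0 * A + rng.normal(size=n)  # true alpha = 1.5, beta = -1.0

# C regressed on its predecessors B and A (plus intercept gamma)
alpha, beta, gamma = np.linalg.lstsq(
    np.column_stack([B, A, np.ones(n)]), C, rcond=None)[0]
# B regressed on its predecessor A (plus intercept)
delta, intercept = np.linalg.lstsq(
    np.column_stack([A, np.ones(n)]), B, rcond=None)[0]
# Estimates land near the true coefficients
```
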
Instrumental Variables
Use exogenous variables to identify effects.
Education → Health
Ability → Education and Ability → Health (Ability is an unobserved confounder)
Use parental education as instrument
Affects education but not health directly (except through education)
Can identify education's effect on health
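A minimal two-stage sketch of this example on simulated data, using the Wald ratio form of the IV estimator (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20000
ability = rng.normal(size=n)          # unobserved confounder
parent_edu = rng.normal(size=n)       # instrument: affects education only
education = parent_edu + ability + rng.normal(size=n)
health = 2.0 * education + 3.0 * ability + rng.normal(size=n)  # true effect: 2.0

# Naive OLS slope is biased upward by the hidden ability confounder
ols = np.cov(education, health)[0, 1] / np.var(education)
# Wald/IV estimate: ratio of instrument covariances
iv = np.cov(parent_edu, health)[0, 1] / np.cov(parent_edu, education)[0, 1]
# iv recovers roughly 2.0 while ols is biased toward 3.0
```
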
Non-Linear Models
Additive Noise Models
Assume non-linear relationships with additive noise.
Y = f(X) + noise
Non-linearity helps identify direction
More flexible than linear
Kernel Methods
Use kernel-based approaches for non-linear discovery.
Challenges and Limitations
Hidden Confounders
Fundamental limitation: Can’t discover if variables unmeasured.
A → C ← B
Unobserved variable U causes both A and B (U → A, U → B)
Observing only A, B, C:
Can't tell whether U exists
Can't include it in the discovered graph; A and B instead appear directly related
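A small simulation makes the problem concrete. With U hidden, no conditioning set among the observed variables renders A and B independent, so a spurious A-B edge is unavoidable:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
U = rng.normal(size=n)            # unmeasured common cause
A = U + rng.normal(size=n)
B = U + rng.normal(size=n)
C = A + B + rng.normal(size=n)

# Marginally, A and B are dependent...
r_marginal = np.corrcoef(A, B)[0, 1]
# ...and conditioning on C (the only other observed variable) doesn't help,
# since C is a collider of A and B:
rA = A - np.polyval(np.polyfit(C, A, 1), C)
rB = B - np.polyval(np.polyfit(C, B, 1), C)
r_given_C = np.corrcoef(rA, rB)[0, 1]
# No conditioning set removes the dependence, so discovery draws a spurious A-B edge
```
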
Finite Sample Issues
Tests unreliable with small samples.
True independence: A ⊥ B
Small sample: May appear dependent (noise)
Algorithm: Incorrect edge removal
Solution: Large samples, adjustments for multiple testing
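A quick simulation of the false-positive problem, using the approximate 5% two-sided critical correlation for samples of size 20:

```python
import numpy as np

rng = np.random.default_rng(5)
# A and B are truly independent
n_tests, n_per_test = 5000, 20
A = rng.normal(size=(n_tests, n_per_test))
B = rng.normal(size=(n_tests, n_per_test))

# Sample correlation within each small sample
Ac = A - A.mean(axis=1, keepdims=True)
Bc = B - B.mean(axis=1, keepdims=True)
r = (Ac * Bc).sum(axis=1) / np.sqrt((Ac**2).sum(axis=1) * (Bc**2).sum(axis=1))

# 0.444 is roughly the two-sided 5% critical correlation for n = 20
false_positive_rate = np.mean(np.abs(r) > 0.444)
# Around 5% of tests wrongly flag dependence, and each one is a wrong edge decision
```
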
Non-Stationarity
Causal structure changes over time.
Earlier period: A → C
Later period: B → C
Data pooled: Confusing structure
Can't discover if mixing periods
Faithfulness Violations
Real data may not satisfy faithfulness assumptions.
Assumption: Independences in graph match data
Violation: Canceling paths (multiple paths whose effects cancel)
Example: A → B directly, plus A → C → B with opposite sign; the effects cancel, so the data show A ⊥ B even though the edge exists
Algorithm: Can fail
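A minimal simulation of a canceling path, with coefficients chosen so the direct and indirect effects of A on B sum to zero:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10000
A = rng.normal(size=n)
C = A + rng.normal(size=n)
# Direct effect A -> B of -1 cancels the path A -> C -> B of +1
B = C - A + rng.normal(size=n)

# A is a parent of B, yet they are marginally uncorrelated:
r_AB = np.corrcoef(A, B)[0, 1]
# An independence test concludes A is independent of B and wrongly deletes the true A-B edge
```
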
Practical Considerations
Assumptions Check
Before using causal discovery, verify:
- Acyclicity reasonable (feedback loops, common in economics, would violate it)
- No hidden confounders likely
- Faithfulness plausible
- Causal sufficiency holds
Computational Cost
- 10 variables: Feasible
- 50 variables: Hard
- 1000 variables: Intractable
Heuristics needed for large-scale problems.
Evaluation
How do you know if discovered graph correct?
With no ground truth, difficult.
Approaches:
- Domain expert review
- Sensitivity analysis (small changes → big changes?)
- Consistency across methods
- Simulation validation (generate from graph, can it be recovered?)
Tools and Software
PC Algorithm Implementations
R: bnlearn, pcalg packages
library(pcalg)
suffStat <- list(C = cor(data), n = nrow(data))
pc_graph <- pc(suffStat, indepTest = gaussCItest, p = ncol(data), alpha = 0.05)
plot(pc_graph)
DoWhy (Microsoft)
Python library for causal inference. Given a graph, it identifies and estimates causal effects; pair it with a discovery method to produce the graph.
from dowhy import CausalModel
model = CausalModel(data=df, treatment="treatment", outcome="outcome", graph=causal_graph)
identified_estimand = model.identify_effect()
Causal-Learn (CMU)
Python library for causal structure learning.
from causallearn.search.ConstraintBased.PC import pc
cg = pc(data, alpha=0.05)
Key Takeaways
✓ Causal discovery is hard – Multiple graphs fit data
✓ Assumptions necessary – Data alone insufficient
✓ Constraint-based methods – Test independences, remove edges
✓ Score-based methods – Optimize BIC or similar
✓ Functional models – Assume specific functional form
✓ Hidden confounders fundamental limit – Unmeasured variables can’t be discovered
✓ Finite sample issues – Need large data for reliability
✓ Non-stationarity problematic – Structure changes over time
✓ Tools available – Multiple implementations in R, Python
✓ Human review essential – Can’t fully automate discovery
Related Articles
- Causal Inference: Understanding Cause-and-Effect
- Statistical Thinking: From Data to Decisions
- Machine Learning System Design
Frequently Asked Questions
Q: Can I really discover causation from data alone?
A: Not perfectly. Need assumptions. Useful but always review with domain experts.
Q: What if I have hidden confounders?
A: Can’t discover them. Algorithms fail. Assumption unverifiable from data alone.
Q: Should I use constraint-based or score-based?
A: Try both. Constraint-based faster, score-based more flexible. Ensemble often best.
Q: How much data do I need?
A: Depends on graph complexity. A rough rule of thumb is at least 10× as many samples as variables; dense graphs and weak effects need far more.
Q: Can I use causal discovery for prediction?
A: Not directly. Use for understanding. Prediction may not need causal structure.

