Master causal discovery. Complete guide to learning causal graphs from data, automated causal structure identification, and discovering cause-and-effect relationships.
Introduction: Learning Causal Structures from Data
In the previous guide on causal inference, you knew the causal graph and estimated effects from it.
But what if you don’t know the graph?
Real-world problem: You have data, but no one documented the causal structure. You observe variables X, Y, Z, but how do they relate causally?
Causal discovery: Automatically learn causal structure from data.
Why it matters:
- Domain knowledge incomplete: No one knows everything
- New phenomena: Discovering new relationships
- Data-driven science: Let data reveal structure
- Automated analysis: Scale beyond manual graph construction
This guide covers causal discovery: from fundamental challenges to methods (constraint-based, score-based, functional) to practical implementation.
Causal Discovery Fundamentals
The Challenge
Multiple graphs consistent with same data.
Possible Graph 1: A → B → C (chain)
Possible Graph 2: A ← B → C (fork)
Possible Graph 3: A ← B ← C (reverse chain)
All three imply the same conditional independence (A ⊥ C | B), so all are compatible with the same observational data!
Cannot distinguish them without additional assumptions
Key insight: Data alone insufficient. Need assumptions about causal process.
Identifiability
Question: Can true causal graph be uniquely identified?
Answer: Only under assumptions.
Common Assumptions:
- Acyclicity: No cycles (effects don’t cause causes)
- Faithfulness: Every independence in the distribution is implied by the graph (no paths that coincidentally cancel)
- Markov condition: Every variable is independent of its non-descendants given its parents
- Causal sufficiency: No hidden confounders
Markov Equivalence
Multiple graphs equivalent (same conditional independences).
A → B → C
A ← B → C
A ← B ← C
These three ARE Markov equivalent: each implies exactly A ⊥ C | B
But:
A → B ← C (collider)
This one is NOT equivalent: it implies A ⊥ C marginally, and A and C become dependent given B, a different independence pattern
Result: Even perfect algorithm can’t distinguish equivalent structures.
Constraint-Based Methods
Learn graph by testing conditional independences.
PC Algorithm (Peter-Clark)
The most famous constraint-based method, named after Peter Spirtes and Clark Glymour.
Process:
- Start with complete graph (all variables connected)
- Test conditional independences
- Remove edges where independence found
- Orient edges using rules
Example:
Start: A-B-C-D (all connected)
Test: Is A ⊥ C | B? (Is A independent of C given B?)
Yes → Remove edge A-C
Test: Is A ⊥ D | B,C?
Yes → Remove edge A-D
Result: DAG reflecting independences
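The edge-removal test can be sketched in a few lines. This is an illustration on synthetic chain data, not a full PC implementation; `partial_corr` is a hand-rolled helper, and real implementations use proper conditional independence tests:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Synthetic data generated from the chain A -> B -> C
A = rng.normal(size=n)
B = 0.8 * A + rng.normal(size=n)
C = 0.8 * B + rng.normal(size=n)

def partial_corr(x, y, z):
    """Correlation between x and y after regressing out z from both."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

r_marginal = np.corrcoef(A, C)[0, 1]  # clearly nonzero: A and C correlate
r_given_B = partial_corr(A, C, B)     # near zero: A independent of C given B
# PC therefore removes the A-C edge while keeping A-B and B-C
```
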
Advantages
- Works with any number of variables
- Identifies some causal directions (v-structures)
- Theoretically grounded
Disadvantages
- Statistical tests can fail (finite sample)
- Assumes no hidden confounders
- Unstable (small changes in the data → big differences in the output graph)
Score-Based Methods
Learn graph by optimization (maximize score).
BIC (Bayesian Information Criterion)
Score balances:
- Fit: How well does graph explain data
- Complexity: How many edges (penalize)
BIC score = log-likelihood - (log(n) / 2) × number_of_parameters
Higher BIC score = Better graph
Search for graph maximizing BIC
Process:
- Start with graph (usually empty)
- Try adding/removing edges
- Compute BIC for each
- Keep edge change that most improves BIC
- Repeat until convergence
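A single step of this search, comparing the score with and without one candidate edge, might look like the following toy linear-Gaussian sketch (`gaussian_bic` is an illustrative helper; real implementations decompose the score per node and search over many edges):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
A = rng.normal(size=n)
B = 0.9 * A + rng.normal(size=n)

def gaussian_bic(residuals, k, n):
    """Local BIC score for one node: Gaussian log-likelihood minus complexity penalty."""
    sigma2 = np.mean(residuals**2)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return loglik - 0.5 * k * np.log(n)

# Candidate 1: B has no parents (modeled by its mean alone, k = 1)
bic_no_edge = gaussian_bic(B - B.mean(), k=1, n=n)
# Candidate 2: edge A -> B (slope + intercept, k = 2)
coef = np.polyfit(A, B, 1)
bic_edge = gaussian_bic(B - np.polyval(coef, A), k=2, n=n)
# Greedy search keeps the change with the higher score: here, adding A -> B
```
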
Advantages
- Theoretically justified (Bayesian perspective)
- Single objective to optimize
- Works with any causal model
Disadvantages
- Computationally expensive (search space huge)
- No guarantees of finding true graph
- Still assumes no hidden confounders
Functional Causal Models
Assume specific functional form.
Linear Non-Gaussian Acyclic Model (LiNGAM)
Assumes linear relationships, non-Gaussian noise, and no cycles.
B = a1 × A + noise_B
C = a2 × B + a3 × A + noise_C
Linear functions with noise
Can recover causal structure
Key insight: Non-Gaussian noise helps identify direction.
If noise Gaussian: A → B and B → A observationally equivalent
If noise non-Gaussian: Can distinguish (identifiable)
Advantage: Closed-form solution (fast)
Disadvantage: Assumes linearity
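The direction-identification idea can be sketched as follows. This is a crude stand-in for LiNGAM's actual independence measure, using a simple cubic correlation rather than a proper independence test:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
# True model: A -> B, with uniform (non-Gaussian) noise
A = rng.uniform(-1, 1, size=n)
B = 0.5 * A + rng.uniform(-1, 1, size=n)

def residual_dependence(cause, effect):
    """Regress effect on cause, then measure leftover (nonlinear)
    dependence between the residual and the candidate cause."""
    slope = np.cov(cause, effect)[0, 1] / np.var(cause)
    resid = effect - slope * cause
    return abs(np.corrcoef(cause**3, resid)[0, 1])

forward = residual_dependence(A, B)   # correct direction: residual ~ independent
backward = residual_dependence(B, A)  # wrong direction: dependence remains
# forward is near zero, backward clearly is not, so we pick A -> B
```
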
Non-Linear Models
Generalize to non-linear relationships.
C = f(A, B) + noise_C
Where f is non-linear function
More flexible but harder to identify
Linear Models
Regression Approach
If the causal order (topological order) is known, coefficients can be identified with regression.
Known order: A → B → C
Then:
- C = α × B + β × A + γ + noise_C
- B = δ × A + ε + noise_B
- A is exogenous (no parents)
Can estimate from data
Advantage: Simple if ordering known
Disadvantage: Must know ordering
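Given the ordering, the coefficients above can be estimated with ordinary least squares. A minimal sketch on synthetic data, regressing each variable on everything earlier in the order:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
# Known topological order A -> B -> C, with a direct A -> C edge too
A = rng.normal(size=n)
B = 2.0 * A + rng.normal(size=n)            # true delta = 2.0
C = 1.5 * B - 1.0 * A + rng.normal(size=n)  # true alpha = 1.5, beta = -1.0

# C regressed on its predecessors B and A (plus intercept gamma)
alpha, beta, gamma = np.linalg.lstsq(
    np.column_stack([B, A, np.ones(n)]), C, rcond=None)[0]
# B regressed on its predecessor A (plus intercept)
delta, intercept = np.linalg.lstsq(
    np.column_stack([A, np.ones(n)]), B, rcond=None)[0]
# Estimates land near the true coefficients
```
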
Instrumental Variables
Use exogenous variables to identify effects.
Education → Health
Ability → Education and Ability → Health (Ability is an unobserved confounder)
Use parental education as instrument
Affects education but not health directly (except through education)
Can identify education's effect on health
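A minimal two-stage sketch of this example on simulated data, using the Wald ratio form of the IV estimator (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20000
ability = rng.normal(size=n)          # unobserved confounder
parent_edu = rng.normal(size=n)       # instrument: affects education only
education = parent_edu + ability + rng.normal(size=n)
health = 2.0 * education + 3.0 * ability + rng.normal(size=n)  # true effect: 2.0

# Naive OLS slope is biased upward by the hidden ability confounder
ols = np.cov(education, health)[0, 1] / np.var(education)
# Wald/IV estimate: ratio of instrument covariances
iv = np.cov(parent_edu, health)[0, 1] / np.cov(parent_edu, education)[0, 1]
# iv recovers roughly 2.0 while ols is biased toward 3.0
```
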
Non-Linear Models
Additive Noise Models
Assume non-linear relationships with additive noise.
Y = f(X) + noise
Non-linearity helps identify direction
More flexible than linear
Kernel Methods
Use kernel-based approaches for non-linear discovery.
Challenges and Limitations
Hidden Confounders
Fundamental limitation: Can’t discover if variables unmeasured.
A → C ← B
Unobserved variable U causes both A and B (U → A, U → B)
Observing only A, B, C:
Can't tell whether U exists
Can't include it in the discovered graph; A and B instead appear directly related
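A small simulation makes the problem concrete. With U hidden, no conditioning set among the observed variables renders A and B independent, so a spurious A-B edge is unavoidable:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
U = rng.normal(size=n)            # unmeasured common cause
A = U + rng.normal(size=n)
B = U + rng.normal(size=n)
C = A + B + rng.normal(size=n)

# Marginally, A and B are dependent...
r_marginal = np.corrcoef(A, B)[0, 1]
# ...and conditioning on C (the only other observed variable) doesn't help,
# since C is a collider of A and B:
rA = A - np.polyval(np.polyfit(C, A, 1), C)
rB = B - np.polyval(np.polyfit(C, B, 1), C)
r_given_C = np.corrcoef(rA, rB)[0, 1]
# No conditioning set removes the dependence, so discovery draws a spurious A-B edge
```
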
Finite Sample Issues
Tests unreliable with small samples.
True independence: A ⊥ B
Small sample: May appear dependent (noise)
Algorithm: Incorrect edge removal
Solution: Large samples, adjustments for multiple testing
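A quick simulation of the false-positive problem, using the approximate 5% two-sided critical correlation for samples of size 20:

```python
import numpy as np

rng = np.random.default_rng(5)
# A and B are truly independent
n_tests, n_per_test = 5000, 20
A = rng.normal(size=(n_tests, n_per_test))
B = rng.normal(size=(n_tests, n_per_test))

# Sample correlation within each small sample
Ac = A - A.mean(axis=1, keepdims=True)
Bc = B - B.mean(axis=1, keepdims=True)
r = (Ac * Bc).sum(axis=1) / np.sqrt((Ac**2).sum(axis=1) * (Bc**2).sum(axis=1))

# 0.444 is roughly the two-sided 5% critical correlation for n = 20
false_positive_rate = np.mean(np.abs(r) > 0.444)
# Around 5% of tests wrongly flag dependence, and each one is a wrong edge decision
```
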
Non-Stationarity
Causal structure changes over time.
Earlier period: A → C
Later period: B → C
Data pooled: Confusing structure
Can't discover if mixing periods
Faithfulness Violations
Real data may not satisfy faithfulness assumptions.
Assumption: Independences in graph match data
Violation: Canceling paths (multiple paths whose effects cancel)
Example: A → B directly, plus A → C → B with opposite sign; the effects cancel, so the data show A ⊥ B even though the edge exists
Algorithm: Can fail
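A minimal simulation of a canceling path, with coefficients chosen so the direct and indirect effects of A on B sum to zero:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10000
A = rng.normal(size=n)
C = A + rng.normal(size=n)
# Direct effect A -> B of -1 cancels the path A -> C -> B of +1
B = C - A + rng.normal(size=n)

# A is a parent of B, yet they are marginally uncorrelated:
r_AB = np.corrcoef(A, B)[0, 1]
# An independence test concludes A is independent of B and wrongly deletes the true A-B edge
```
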
Practical Considerations
Assumptions Check
Before using causal discovery, verify:
- Acyclicity reasonable (feedback loops, common in economics, would violate it)
- No hidden confounders likely
- Faithfulness plausible
- Causal sufficiency holds
Computational Cost
- 10 variables: Feasible
- 50 variables: Hard
- 1000 variables: Intractable
Heuristics needed for large-scale problems.
Evaluation
How do you know if discovered graph correct?
With no ground truth, difficult.
Approaches:
- Domain expert review
- Sensitivity analysis (small changes → big changes?)
- Consistency across methods
- Simulation validation (generate from graph, can it be recovered?)
Tools and Software
PC Algorithm Implementations
R: bnlearn, pcalg packages
library(pcalg)
suffStat <- list(C = cor(data), n = nrow(data))
pc_graph <- pc(suffStat, indepTest = gaussCItest, p = ncol(data), alpha = 0.05)
plot(pc_graph)
DoWhy (Microsoft)
Python library for causal inference. Given a graph, it identifies and estimates causal effects; pair it with a discovery method to produce the graph.
from dowhy import CausalModel
model = CausalModel(data=df, treatment="treatment", outcome="outcome", graph=causal_graph)
identified_estimand = model.identify_effect()
Causal-Learn (CMU)
Python library for causal structure learning.
from causallearn.search.ConstraintBased.PC import pc
cg = pc(data, alpha=0.05)
Key Takeaways
✓ Causal discovery is hard – Multiple graphs fit data
✓ Assumptions necessary – Data alone insufficient
✓ Constraint-based methods – Test independences, remove edges
✓ Score-based methods – Optimize BIC or similar
✓ Functional models – Assume specific functional form
✓ Hidden confounders fundamental limit – Unmeasured variables can’t be discovered
✓ Finite sample issues – Need large data for reliability
✓ Non-stationarity problematic – Structure changes over time
✓ Tools available – Multiple implementations in R, Python
✓ Human review essential – Can’t fully automate discovery
Related Articles
- Causal Inference: Understanding Cause-and-Effect
- Statistical Thinking: From Data to Decisions
- Machine Learning System Design
Frequently Asked Questions
Q: Can I really discover causation from data alone?
A: Not perfectly. Need assumptions. Useful but always review with domain experts.
Q: What if I have hidden confounders?
A: Can’t discover them. Algorithms fail. Assumption unverifiable from data alone.
Q: Should I use constraint-based or score-based?
A: Try both. Constraint-based faster, score-based more flexible. Ensemble often best.
Q: How much data do I need?
A: Depends on graph complexity. A rough rule of thumb is at least 10× as many samples as variables; dense graphs and weak effects need far more.
Q: Can I use causal discovery for prediction?
A: Not directly. Use for understanding. Prediction may not need causal structure.

