Introduction: A/B Testing and Experimentation
You’ve built an amazing model. It’s theoretically better than the current system. But will it actually improve business metrics?
This is where A/B testing becomes critical.
Many ML teams make a common mistake: they evaluate models on offline metrics (accuracy, precision, recall) and assume that good offline performance translates into good business impact. It often doesn't.
The model might:
- Improve accuracy but make slower predictions (user experience suffers)
- Correctly identify edge cases but confuse common cases (overall worse)
- Reduce one metric while increasing another (trade-off)
- Improve business metrics but harm user experience long-term
- Work in testing environment but fail in production
A/B testing (also called online experimentation or randomized controlled trials) is the gold standard for measuring true impact.
This comprehensive guide covers A/B testing: how to design experiments, avoid common pitfalls, interpret results correctly, and measure true business impact.
Why A/B Testing Matters
The Gap Between Offline and Online
Offline Metrics (Simulation):
- Evaluate on historical data
- Controlled environment
- No distribution shift
- No real users impacted
- Quick results
Online Metrics (A/B Test):
- Real users, real data
- Distribution shifts
- User behavior changes
- True business impact
- Slower but reliable
Why offline and online results diverge:
- Model exploits historical patterns that may not hold
- User behavior changes when treated differently
- Unobserved confounders in observational data
- Latency and system effects matter
Example:
Offline: A "related products" recommender improves click-through by 5%
Online A/B test: Conversion actually drops because users are overwhelmed
Real impact: -2% conversion despite offline improvement
Business Impact
A/B testing connects technical improvements to business outcomes.
What Gets Measured:
- Conversion rate
- Revenue per user
- User retention
- User satisfaction
- Time spent
- Engagement metrics
Example Revenue Impact:
Improvement: +1 percentage point in conversion rate
Users: 1M/month
Revenue: $100/transaction
Monthly impact: 1M × 0.01 × $100 = $1M additional revenue
Justifies massive ML investment
Risk Management
A/B testing protects against bad deployments.
Without A/B Testing:
- Deploy directly to all users
- Bad models harm all users
- Rollback requires manual intervention
- Hard to detect subtle regressions
With A/B Testing:
- Deploy to small percentage first
- Monitor for problems
- Automatic rollback on failure
- Safe experimentation
Statistical Foundations
Null and Alternative Hypotheses
Null Hypothesis (H₀): The new model has no effect.
Alternative Hypothesis (H₁): The new model has an effect.
Goal: Gather evidence to reject the null hypothesis (i.e., show the model has an effect).
Type I and Type II Errors
| | True: No Effect | True: Effect Exists |
|---|---|---|
| Conclude No Effect | Correct | Type II Error (β) |
| Conclude Effect | Type I Error (α) | Correct |
Type I Error: False positive (claim improvement when there’s none)
Type II Error: False negative (miss real improvement)
Standard Thresholds:
- α = 0.05 (5% false positive risk)
- β = 0.20 (20% false negative risk)
- Power = 1 – β = 0.80 (80% detection rate)
P-Values
P-value: Probability of observing results at least as extreme as the data, assuming the null hypothesis is true.
Interpretation:
- p < 0.05: Statistically significant (reject null)
- p ≥ 0.05: Not significant (fail to reject null)
Common Misunderstanding: the p-value is NOT the probability that the null hypothesis is true. It's the probability of the data (or more extreme) given the null hypothesis.
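To make this concrete, here is a minimal sketch of the two-sided two-proportion z-test typically used for conversion-rate experiments; the conversion counts are made-up illustrative numbers.

```python
# Two-sided two-proportion z-test for conversion rates (illustrative numbers).
from math import sqrt
from scipy.stats import norm

# Hypothetical observed data
control_conv, control_n = 1000, 10000      # 10.00% conversion
treatment_conv, treatment_n = 1085, 10000  # 10.85% conversion

p_c = control_conv / control_n
p_t = treatment_conv / treatment_n

# Pooled proportion under the null hypothesis (no difference between variants)
p_pool = (control_conv + treatment_conv) / (control_n + treatment_n)
se = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / treatment_n))

z = (p_t - p_c) / se
p_value = 2 * norm.sf(abs(z))  # two-sided

print(f"lift = {p_t - p_c:+.4f}, z = {z:.2f}, p = {p_value:.4f}")
```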
Statistical vs Practical Significance
Statistical Significance: Difference is real (not random chance)
Practical Significance: Difference is large enough to matter
Example:
Sample size: 1M users
Conversion improvement: 0.01%
P-value: 0.02 (statistically significant)
Business impact: 1M × 0.01% × $100 = $10k/month — statistically real, but possibly too small to justify the cost of shipping
vs.
Sample size: 1000 users
Conversion improvement: 3%
P-value: 0.40 (not statistically significant)
Observed effect: could be real, but at this sample size it's indistinguishable from noise
Experiment Design
Randomization and Assignment
Random Assignment: Each user equally likely to be in control or treatment.
Why Random?
- Removes selection bias
- Creates equivalent groups
- Allows statistical inference
Randomization Approaches:
User-Level:
- Randomize individual users (see the hashing sketch after this list)
- Standard, unbiased
- May have carry-over effects
Request-Level:
- Randomize each request separately
- For stateless interactions (search results)
- More granular
Time-Based:
- Run experiment for period, then switch
- Risky (confounders, trend)
- Use only if cannot randomize users
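For user-level randomization, a common implementation is deterministic bucketing: hash the user ID together with an experiment-specific salt so that assignment is stable across sessions and independent across experiments. A minimal sketch (the experiment names and traffic split are illustrative):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to 'control' or 'treatment'.

    Hashing user_id with an experiment-specific salt keeps the assignment
    stable for the same user and uncorrelated across experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10_000
    return "treatment" if bucket < treatment_share * 10_000 else "control"

# Example usage
print(assign_variant("user_42", "ranker_v2_test"))   # same output every call
print(assign_variant("user_42", "pricing_test"))     # independent of other experiments
```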
Sample Size Calculation
How many users needed to detect effect?
Formula:
n = (2σ² × (Z_α/2 + Z_β)²) / δ²
Where:
- σ = standard deviation of the metric (σ² ≈ p(1 − p) for a conversion rate)
- Z_α/2 = critical value for the Type I error rate (1.96 for α = 0.05, two-sided)
- Z_β = critical value for the desired power (0.84 for 80% power)
- δ = minimum detectable effect (absolute)
Example:
Baseline conversion: 10%
Want to detect: 15% relative improvement (+1.5 percentage points)
Power: 80%, Significance: 95% (two-sided)
Sample needed: roughly 6,000–7,000 users per group (~13k total)
Duration: ~2 weeks (with 1k users/day entering the experiment)
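A rough version of this calculation in code, using the normal-approximation formula above with the baseline conversion rate plugged in for σ²; treat the output as an order-of-magnitude estimate rather than an exact requirement.

```python
# Approximate per-group sample size for a two-proportion test (normal approximation).
from math import ceil
from scipy.stats import norm

def sample_size_per_group(p_baseline: float, min_detectable_lift: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """n = 2 * sigma^2 * (z_{alpha/2} + z_beta)^2 / delta^2."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided
    z_beta = norm.ppf(power)
    sigma_sq = p_baseline * (1 - p_baseline)
    return ceil(2 * sigma_sq * (z_alpha + z_beta) ** 2 / min_detectable_lift ** 2)

# Baseline 10% conversion, detect an absolute lift of 1.5 percentage points
n = sample_size_per_group(0.10, 0.015)
print(n)  # roughly 6,300 per group with these assumptions
```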
Experiment Duration
Why Duration Matters:
- Weekly patterns (Monday ≠ Sunday)
- User novelty effects (users change behavior over time)
- Seasonal patterns
Best Practice:
- Minimum 1-2 weeks
- Capture full weekly cycle
- Avoid holidays and special events
Common Pitfalls
1. Peeking (Early Stopping)
Problem: Looking at results before experiment complete, stopping early.
Why Bad:
- Increases false positive rate
- Statistical tests assume fixed sample size
- Can lead to bad decisions
Solution:
- Define stopping rule before experiment
- No early looks at results
- Pre-register power analysis
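The inflation from peeking is easy to see in simulation: run many A/A experiments (no true difference between variants), check the p-value every day, and stop as soon as it dips below 0.05. A sketch with illustrative parameters:

```python
# Simulate how daily peeking at an A/A test inflates the false positive rate.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_experiments, n_days, users_per_day, p = 2000, 14, 500, 0.10

false_positives = 0
for _ in range(n_experiments):
    c = t = cn = tn = 0
    for _ in range(n_days):
        c += rng.binomial(users_per_day, p); cn += users_per_day
        t += rng.binomial(users_per_day, p); tn += users_per_day
        pool = (c + t) / (cn + tn)
        se = np.sqrt(pool * (1 - pool) * (1 / cn + 1 / tn))
        z = (t / tn - c / cn) / se
        if 2 * norm.sf(abs(z)) < 0.05:   # looks "significant" -> stop early
            false_positives += 1
            break

print(f"False positive rate with daily peeking: {false_positives / n_experiments:.1%}")
# Expect well above the nominal 5% when peeking every day.
```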
2. Selecting Metrics After Seeing Results
Problem: Try many metrics, report favorable ones.
Why Bad:
- With 20 metrics, expect 1 false positive by chance
- Cherry-picking results
Solution:
- Pre-register primary metric
- Secondary metrics okay but adjust significance
- Report all metrics, not just favorable
3. Insufficient Power
Problem: Run experiment too short, underpowered.
Why Bad:
- High false negative rate
- Miss real improvements
- False confidence in null hypothesis
Solution:
- Calculate required sample size
- Run minimum 2 weeks
- 80%+ power standard
4. Population Differences
Problem: Test population differs from deployment population.
Example:
Test: US desktop users
Deploy: Mobile + international
Different behavior, results don't transfer
Solution:
- Test on representative population
- Segment analysis (does effect differ?)
- Plan expansion carefully
5. Multiple Comparisons Problem
Problem: Run multiple tests, increase false positives.
Solution:
- Bonferroni correction (divide α by the number of tests)
- Pre-register metrics
- Control family-wise error rate
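As one concrete option, here is a sketch of a Bonferroni adjustment across several secondary metrics using statsmodels' multipletests helper; the metric names and p-values are made up.

```python
# Adjust p-values from multiple metrics to control the family-wise error rate.
from statsmodels.stats.multitest import multipletests

metrics = ["conversion", "revenue_per_user", "retention_d7", "session_length"]
p_values = [0.012, 0.048, 0.300, 0.041]   # hypothetical raw p-values

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

for metric, p_raw, p_adj, sig in zip(metrics, p_values, p_adjusted, reject):
    print(f"{metric:18s} raw p={p_raw:.3f} adjusted p={p_adj:.3f} significant={sig}")
# With 4 metrics, each raw p-value is effectively compared against 0.05 / 4 = 0.0125.
```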
Advanced Techniques
Stratified Analysis
Analyze subgroups separately.
Why:
- Some groups may benefit, others harmed
- Overall positive hides heterogeneous effects
- Informs rollout strategy
Example:
Control: 10% conversion
Treatment: 12% conversion (overall +2%)
By device:
Desktop: Treatment 13% vs Control 11% (+2%)
Mobile: Treatment 11% vs Control 9% (+2%)
By geography:
US: Treatment 14% vs Control 12% (+2%)
International: Treatment 10% vs Control 10% (0%)
Insight: International users see no lift; adjust (or investigate) the rollout for that segment
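In practice this is usually a grouped aggregation over the experiment log. A minimal pandas sketch, assuming one row per user with hypothetical columns variant, device, geo, and converted:

```python
# Per-segment lift from an experiment log (illustrative schema and data).
import pandas as pd

log = pd.DataFrame({
    "variant":   ["control", "treatment", "control", "treatment", "control", "treatment"],
    "device":    ["desktop", "desktop", "mobile", "mobile", "desktop", "mobile"],
    "geo":       ["US", "US", "Intl", "Intl", "Intl", "US"],
    "converted": [0, 1, 0, 0, 1, 1],
})

for segment in ["device", "geo"]:
    rates = (log.groupby([segment, "variant"])["converted"]
                .mean()
                .unstack("variant"))
    rates["lift"] = rates["treatment"] - rates["control"]
    print(rates, "\n")
```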
CUPED (Controlled-experiment Using Pre-Experiment Data)
Reduce variance using pre-experiment behavior as covariate.
Idea: Use user’s baseline behavior to adjust effect estimate.
Effect:
- Smaller required sample size
- Faster experiments
- More precise estimates
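A minimal CUPED sketch on synthetic data: adjust each user's in-experiment metric using their pre-experiment value of the same metric, with θ estimated as cov(pre, post) / var(pre).

```python
# CUPED variance reduction (synthetic data purely for illustration).
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

# Pre-experiment metric (e.g., last month's spend) correlated with the in-experiment metric.
pre = rng.gamma(shape=2.0, scale=10.0, size=n)
post = 0.8 * pre + rng.normal(0, 5, size=n)          # in-experiment metric

theta = np.cov(pre, post)[0, 1] / np.var(pre, ddof=1)
post_cuped = post - theta * (pre - pre.mean())       # adjusted metric, same mean

print(f"variance before: {post.var():.1f}")
print(f"variance after:  {post_cuped.var():.1f}")    # noticeably smaller
```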
Sequential Testing
Analyze results continuously with adjusted thresholds.
Advantage: Stop early if strong evidence
Requirement: Pre-register stopping rule
Network Effects
Users influence each other (social networks, marketplace).
Problem: Standard randomization violates independence assumption.
Solution:
- Randomize by network cluster
- Larger sample size needed
- More complex analysis
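A sketch of cluster-level assignment, reusing the hashing idea from the randomization section; cluster_id stands in for whatever grouping (city, community, social cluster) bounds the interference, and the names are illustrative.

```python
import hashlib

def assign_cluster_variant(cluster_id: str, experiment: str) -> str:
    """All users in the same cluster get the same variant."""
    key = f"{experiment}:{cluster_id}".encode("utf-8")
    return "treatment" if int(hashlib.sha256(key).hexdigest(), 16) % 2 == 0 else "control"

# Every user is assigned via their cluster, not individually.
user_cluster = {"alice": "city_17", "bob": "city_17", "carol": "city_42"}
for user, cluster in user_cluster.items():
    print(user, assign_cluster_variant(cluster, "marketplace_fee_test"))
# alice and bob (same cluster) always land in the same variant.
```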
Multi-Armed Bandits
Standard A/B Test: Split traffic evenly, evaluate after fixed time.
Problem: Wastes traffic on underperforming variants.
Better Approach (Bandit): Dynamically allocate more traffic to better-performing variants.
Trade-off: Exploration (learn) vs Exploitation (use best).
Thompson Sampling
Probability allocation based on estimated performance.
Algorithm:
- For each variant, maintain a posterior distribution over its success rate
- For each request, sample a value from each variant's posterior
- Serve the variant with the highest sampled value
- Update the posteriors as outcome data arrives
Advantage: More data on better variants, less on worse
Disadvantage: Slightly less information about worse variants
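A minimal Thompson sampling sketch for binary rewards using Beta posteriors; the true conversion rates are simulated purely for illustration.

```python
# Thompson sampling over two variants with Beta-Bernoulli posteriors (simulated rewards).
import numpy as np

rng = np.random.default_rng(1)
true_rates = {"A": 0.10, "B": 0.12}           # unknown in practice; simulated here
alpha = {v: 1.0 for v in true_rates}          # Beta posterior: prior successes + 1
beta = {v: 1.0 for v in true_rates}           # Beta posterior: prior failures + 1
served = {v: 0 for v in true_rates}

for _ in range(10_000):
    # Sample a plausible conversion rate for each variant from its posterior
    draws = {v: rng.beta(alpha[v], beta[v]) for v in true_rates}
    chosen = max(draws, key=draws.get)        # serve the variant with the highest draw
    served[chosen] += 1

    reward = rng.random() < true_rates[chosen]  # simulate the user's response
    alpha[chosen] += reward
    beta[chosen] += 1 - reward

print(served)   # most traffic ends up on the better variant (B)
```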
Contextual Bandits
Adapt variant based on user context.
Example:
User 1 (new user): Show Variant A
User 2 (frequent user): Show Variant B
Allocate variants based on user characteristics
Complexity: Requires ML to map context → variant
Measurement and Metrics
Primary vs Secondary Metrics
Primary Metric:
- Key business metric
- Pre-registered
- Used for decision (ship or not)
- Usually one or two
Secondary Metrics:
- Diagnostic metrics
- Monitor for problems
- Understand effects
- Can be many
Choosing Metrics
Good Metrics:
- Align with business goals
- Sensitive (detect real effects)
- Interpretable
- Not gameable
Bad Metrics:
- Unrelated to goals
- Noisy (insensitive)
- Ambiguous
- Easily manipulated
Example Metric Sets
E-commerce:
- Primary: Revenue per user
- Secondary: Conversion rate, average order value (AOV), churn, return rate
Social Network:
- Primary: User engagement (DAU, posts)
- Secondary: Session duration, virality, retention
Ads:
- Primary: Revenue per user
- Secondary: Click-through rate, Ad recall, Brand lift
Organizational Considerations
Experimentation Culture
Characteristics of Strong Culture:
- Bias toward experimentation
- Data-driven decisions
- Tolerates failed experiments
- Rapid iteration
Building Culture:
- Education (statistical thinking)
- Tools (easy experimentation)
- Incentives (reward learnings, not just wins)
- Stories (share lessons from experiments)
Infrastructure
Required:
- Randomization system (assign users to variants)
- Logging (track all user actions)
- Analysis platform (statistical testing)
- Dashboards (visualize results)
Tools:
- Statsig, LaunchDarkly (feature flags + analytics)
- Optimizely (SaaS experimentation)
- Custom (internal infrastructure)
Decision Framework
Criteria for Shipping:
- Statistically significant (p < 0.05)
- Practically significant (meets business bar)
- No concerning secondary metrics
- Success across key segments
- Engineering and product sign-off
Example:
Metric: +2% conversion, p = 0.03
✓ Statistically significant
✓ Practically significant ($2M revenue impact)
✓ No concerning secondaries
✓ Positive in US, EU, Asia
✓ Product happy
→ Ship it
Tools and Platforms
In-House Solutions
Pros: Full control, no vendor lock-in, customizable
Cons: Expensive to build, ongoing maintenance
SaaS Platforms
Statsig:
- Feature flags + analytics
- Easy setup
- Good for growth
Optimizely:
- Enterprise platform
- Experimentation at scale
- Expensive but powerful
VWO:
- Visual testing
- A/B testing
- Analytics
LaunchDarkly:
- Feature management
- Experimentation
- Powerful controls
Key Takeaways
✓ A/B testing measures real business impact – Offline metrics insufficient
✓ Statistical rigor matters – Type I/II errors, power, sample size
✓ Randomization removes bias – Random assignment essential
✓ Sample size calculation needed – Underpowered experiments miss effects
✓ Avoid peeking – Don’t look at results before conclusion
✓ Pre-register metrics – Prevents cherry-picking
✓ Stratified analysis reveals heterogeneity – Groups may differ
✓ Bandits optimize exploration-exploitation – Better than fixed splits
✓ Metrics tell stories – Choose carefully, interpret thoroughly
✓ Culture and infrastructure matter – Easy experimentation drives innovation
Frequently Asked Questions
Q: How long should experiments run?
A: Minimum 1-2 weeks to capture weekly patterns. Longer better if possible.
Q: What if I can’t randomize users?
A: Use observational methods (causal inference, matching), but they're less reliable. Time-based switching is a last resort.
Q: Should I always run A/B tests?
A: Not always. Run one when the decision affects many users, the investment is large, or the effect is uncertain. For small changes or very low traffic, you can skip it.
Q: What if experiment shows no effect?
A: Either the model isn't actually better, or the experiment was underpowered. Check the achieved power; a real improvement may be too small to detect at your sample size.
Q: Can I stop experiment early if winning?
A: No. Increases false positive rate. Use pre-registered stopping rule if doing sequential testing.

