Your A/B Test Isn't Statistically Significant — Here's What to Do About It
The 3-Day Test That Cost $400K
A DTC brand ran an A/B test on their product page. After 3 days, variant B showed a 12% lift in conversion rate. The marketing team celebrated and shipped it to 100% of traffic. Over the next 60 days, conversion rate actually dropped 4% compared to the original. The "winning" variant cost them an estimated $400K in lost revenue.
What happened? They called the test at 78% confidence with 1,200 visitors per variant. The result was noise, not signal.
The Minimum Viable Stats You Need
Sample Size Calculation
Before running any test, calculate the minimum sample size:
Required sample size per variant:
n = (Z² × 2 × p × (1-p)) / MDE²
Where:
Z = 1.96 (for 95% confidence)
p = baseline conversion rate
MDE = minimum detectable effect (smallest lift worth detecting)
Example:
Baseline conversion rate: 3.5%
MDE: 10% relative lift (0.35% absolute)
n = (1.96² × 2 × 0.035 × 0.965) / 0.0035²
n = (3.84 × 0.0676) / 0.00001225
n ≈ 21,200 per variant
At 5,000 visitors/day → need ~8.5 days minimum
If you can't get 21,200 visitors per variant in a reasonable timeframe, you need a larger MDE — meaning you can only detect bigger lifts. This isn't a limitation of the tool; it's a limitation of your traffic.
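The calculation above is easy to script. This is a minimal sketch of the article's simplified formula; note that standard sample-size calculators also include a power term (Z_β, typically 0.84 for 80% power), so real tools will report somewhat larger numbers than this:

```python
import math

def sample_size_per_variant(baseline_rate, relative_mde, z=1.96):
    """Minimum visitors per variant using the simplified formula
    n = Z^2 * 2 * p * (1 - p) / MDE^2.

    Note: this matches the article's worked example; a full
    calculation would also add a power term (Z_beta).
    """
    p = baseline_rate
    mde_abs = p * relative_mde          # relative lift -> absolute lift
    n = (z ** 2 * 2 * p * (1 - p)) / mde_abs ** 2
    return math.ceil(n)

def days_to_run(n_per_variant, daily_visitors, n_variants=2):
    """Days needed to reach the required sample across all variants."""
    return (n_per_variant * n_variants) / daily_visitors

n = sample_size_per_variant(0.035, 0.10)   # 3.5% baseline, 10% relative MDE
print(n)                                    # ~21,184 per variant
print(round(days_to_run(n, 5000), 1))       # ~8.5 days at 5,000 visitors/day
```

Inverting the same formula tells you the flip side of the traffic limitation: fix n at what your traffic allows, and solve for the smallest MDE you can detect.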
When to Call a Test
The Decision Framework:
Statistical significance ≥ 95% AND sample size met:
→ Ship the winner ✅
Statistical significance ≥ 90% AND large effect size (>15% lift):
→ Probably a real winner, but monitor closely after shipping
Statistical significance < 90% AND sample size met:
→ No detectable difference at your chosen MDE. Treat the variants as practically equivalent.
→ Ship whichever is easier to maintain.
Sample size NOT met (regardless of significance):
→ Keep running. You don't have enough data yet.
→ Do NOT peek and make decisions based on incomplete data.
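The framework above can be sketched as a single function. The thresholds and the 15% effect-size cutoff are the article's; the function name, argument names, and the borderline fallback case (90-95% confidence with a small lift, which the framework doesn't address explicitly) are illustrative assumptions:

```python
def decide(significance, sample_size_met, relative_lift):
    """Turn A/B test results into a decision, per the framework above.

    significance:    confidence level as a fraction, e.g. 0.96
    sample_size_met: True once each variant hit its required n
    relative_lift:   observed lift of B over A, e.g. 0.18 for +18%
    """
    if not sample_size_met:
        return "keep running - not enough data yet"
    if significance >= 0.95:
        return "ship the winner"
    if significance >= 0.90 and abs(relative_lift) > 0.15:
        return "probably a real winner - ship, but monitor closely"
    if significance < 0.90:
        return "no detectable difference - ship whichever is easier to maintain"
    # 90-95% confidence, small lift: not covered by the framework;
    # assumed here that waiting for more data is the safe default
    return "keep running - borderline result"

print(decide(0.78, False, 0.12))   # the $400K scenario: keep running
print(decide(0.97, True, 0.12))    # ship the winner
```

Note that the $400K test from the opening fails the very first check: at 1,200 visitors per variant, sample size wasn't met, so the 78% confidence figure never even enters the decision.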
The Peeking Problem
Every time you check your test results, you increase the chance of a false positive:
Check once at end: 5% false positive rate (as designed)
Check every day for 14 days: ~25% false positive rate
Check every hour: Even worse
Why? Each check is effectively another hypothesis test on overlapping data. Check 14 times and, even though the checks are correlated, the chance that at least one crosses the significance threshold by random chance is far higher than the 5% you designed for.
Solutions:
- Pre-register your end date and don't look until then
- Use sequential testing (tools like Optimizely use this) which adjusts for peeking
- Use Bayesian methods which are naturally less affected by peeking
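The inflation is easy to demonstrate with an A/A simulation: both arms get the same conversion rate, so every "significant" result is by definition a false positive. A sketch, assuming numpy is available (the traffic numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
TRIALS, DAYS, DAILY_N, P = 2000, 14, 1000, 0.035  # A/A test: no real difference

# Daily conversion counts for each arm, accumulated over the test
conv_a = rng.binomial(DAILY_N, P, size=(TRIALS, DAYS)).cumsum(axis=1)
conv_b = rng.binomial(DAILY_N, P, size=(TRIALS, DAYS)).cumsum(axis=1)
n = DAILY_N * np.arange(1, DAYS + 1)              # cumulative visitors per arm

# Two-proportion z-test at every daily check
p_a, p_b = conv_a / n, conv_b / n
pooled = (conv_a + conv_b) / (2 * n)
se = np.sqrt(pooled * (1 - pooled) * 2 / n)
significant = np.abs((p_b - p_a) / se) > 1.96

final_only = significant[:, -1].mean()   # check once, at the end
any_peek = significant.any(axis=1).mean()  # peek daily, stop at first "win"
print(f"false positives, single check at end: {final_only:.1%}")
print(f"false positives, peeking daily:       {any_peek:.1%}")
```

The single end-of-test check comes out near the designed 5%; stopping at the first daily "win" comes out several times higher, in line with the ~25% figure above.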
The Metrics That Actually Matter
Most teams test the wrong metric:
Weak metrics (high variance, misleading):
→ Session conversion rate (affected by repeat visitors)
→ Revenue per session (skewed by outlier orders)
→ Bounce rate (noisy, doesn't correlate with revenue)
Strong metrics (lower variance, actionable):
→ Revenue per unique visitor (deduplicated)
→ Orders per unique visitor
→ Add-to-cart rate (leading indicator)
→ Revenue per visitor including returns (lagging, but truest)
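The difference between the weak and strong versions of a metric is mostly the unit of analysis: per session vs. per unique visitor. A minimal sketch of the deduplication (the event-log format here is hypothetical):

```python
from collections import defaultdict

# Hypothetical event log: (visitor_id, session_id, order_revenue)
events = [
    ("v1", "s1", 0.0),      # v1 browses once before buying
    ("v1", "s2", 80.0),
    ("v2", "s3", 0.0),
    ("v3", "s4", 1200.0),   # one outlier order
    ("v3", "s5", 0.0),
]

# Weak: revenue per session (repeat visits inflate the denominator)
sessions = {s for _, s, _ in events}
revenue_per_session = sum(r for _, _, r in events) / len(sessions)

# Strong: revenue per unique visitor (deduplicated)
by_visitor = defaultdict(float)
for visitor, _, revenue in events:
    by_visitor[visitor] += revenue
revenue_per_visitor = sum(by_visitor.values()) / len(by_visitor)

# Strong: fraction of unique visitors who ordered
converting = sum(1 for r in by_visitor.values() if r > 0)
conversion_per_visitor = converting / len(by_visitor)

print(revenue_per_session)     # 256.0  (1,280 over 5 sessions)
print(revenue_per_visitor)     # ~426.67 (1,280 over 3 unique visitors)
print(conversion_per_visitor)  # ~0.67  (2 of 3 visitors ordered)
```

Same events, different denominators: the per-session number understates what each actual person is worth, and it shifts whenever the variant changes how often people come back.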
The Practical Testing Roadmap
For brands doing $5-50M in revenue:
Month 1: Foundation
→ Set up proper tracking (server-side events, deduplicated)
→ Calculate sample sizes for your traffic level
→ Identify your top 3 highest-impact pages
Month 2: High-Impact Tests
→ Homepage hero: messaging, CTA, social proof placement
→ Product page: image layout, reviews placement, add-to-cart button
→ Cart: urgency elements, cross-sell, shipping threshold
Month 3: Iteration
→ Take the winning concepts and test variations
→ Test pricing page if applicable
→ Start testing email flows (subject lines, send times)
Ongoing:
→ 2-3 tests running at all times
→ Monthly test review meeting (what won, what lost, what we learned)
→ Quarterly strategy update based on accumulated learnings
When Not to Test
Testing everything is as bad as testing nothing. Don't test when:
- Traffic is too low: If reaching sample size takes > 6 weeks, the test isn't worth running. Just make the change and measure before/after.
- The change is obvious: Fixing a broken checkout flow doesn't need an A/B test. Fix it.
- The change is irreversible: Brand redesigns, platform migrations — these aren't A/B testable in the traditional sense.
- The opportunity cost is too high: Running 20 low-impact tests means your highest-impact ideas sit in a queue.
A/B testing is a tool, not a religion. Use it when you have sufficient traffic, a clear hypothesis, and the discipline to wait for real statistical significance. Everything else is just confirmation bias with a dashboard.