A/B Testing Is Lying to You — Statistical Significance Isn't Enough
The 95% Confidence Trap
Your testing tool shows a green banner: "95% statistical significance! Variant B wins!" You ship it. A month later, revenue is flat. You run another test. Same thing. After six months of "winning" tests, your conversion rate hasn't moved.
You're not unlucky. Your testing process is broken.
Why Most A/B Tests Produce False Positives
Problem 1: Peeking
The #1 sin in A/B testing. You check results daily and stop the test when it looks good:
Day 1: Variant B is +15% (p = 0.04) → "Ship it!"
What actually happened:
Day 1: +15% (p = 0.04) ← Random variance
Day 2: +8% (p = 0.12)
Day 3: +3% (p = 0.38)
Day 7: +1% (p = 0.72) ← The real result
Day 14: -0.5% (p = 0.89) ← No difference
Why this happens: Statistical significance fluctuates wildly with small sample sizes. Checking early and stopping when it looks good is like flipping a coin 5 times, getting 4 heads, and concluding the coin is biased.
The fix: Calculate your required sample size BEFORE the test starts. Don't look at results until you hit it.
function requiredSampleSize(
baselineRate: number, // Current conversion rate
minimumEffect: number, // Smallest improvement worth detecting
  power: number = 0.8, // Probability of detecting a real effect
  significance: number = 0.05 // Two-sided false positive rate
): number {
  // Simplified formula for a two-proportion z-test.
  // Note: the z-values below are hardcoded for the default power (0.8)
  // and significance (0.05); if you change those, change the z-values too.
const p1 = baselineRate;
const p2 = baselineRate * (1 + minimumEffect);
const pBar = (p1 + p2) / 2;
const zAlpha = 1.96; // For 95% significance
const zBeta = 0.84; // For 80% power
const n = Math.ceil(
(zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 /
(p2 - p1) ** 2
);
return n; // Per variant
}
// Example: 3% baseline, detect a 10% relative lift
requiredSampleSize(0.03, 0.10);
// → ~53,000 visitors per variant (~106,000 total)
// At 1,000 visitors/day split across both variants ≈ 106 days
Problem 2: Testing Too Many Things
Running 5 tests simultaneously on the same page. Each test has a 5% false positive rate. The probability of at least one false positive across 5 tests:
P(at least one false positive) = 1 - (0.95)^5 = 23%
With 10 simultaneous tests: 40%
With 20 simultaneous tests: 64%
The fix: Run fewer tests with larger expected effects. Apply Bonferroni correction when running multiple tests, or better yet, prioritize ruthlessly.
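To make the arithmetic above concrete, here is a minimal sketch of the family-wise error rate and the Bonferroni-adjusted threshold (the function names are illustrative, not from any testing library):
function familyWiseErrorRate(tests: number, alpha: number = 0.05): number {
  // Probability that at least one of `tests` independent tests
  // produces a false positive at significance level `alpha`
  return 1 - Math.pow(1 - alpha, tests);
}
function bonferroniAlpha(tests: number, alpha: number = 0.05): number {
  // Bonferroni correction: divide the significance threshold by the
  // number of simultaneous tests to cap the family-wise error rate
  return alpha / tests;
}
familyWiseErrorRate(5);  // ≈ 0.23 (the 23% above)
familyWiseErrorRate(20); // ≈ 0.64
bonferroniAlpha(5);      // 0.01: each test must now clear p < 0.01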
Problem 3: Wrong Metric
Testing button color against click-through rate instead of revenue. Testing headline copy against page views instead of purchases. The metric you optimize for determines the result you get.
Metric hierarchy (most to least reliable):
1. Profit per visitor ← Best, if you can measure it
2. Revenue per visitor ← Test against this at minimum
3. Conversion rate ← Acceptable
4. Add-to-cart rate ← Leading indicator only
5. Click-through rate ← Almost meaningless
6. Time on page ← Completely meaningless
The fix: Always test against a metric that's as close to revenue as possible. If your test "wins" on clicks but doesn't move revenue, it didn't win.
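As a rough sketch of what "test against revenue" looks like in code (the data shape and the numbers are hypothetical), compare variants on revenue per visitor rather than clicks:
interface VariantResult {
  visitors: number;
  clicks: number;
  revenue: number; // Total revenue attributed to this variant
}
function revenuePerVisitor(v: VariantResult): number {
  return v.revenue / v.visitors;
}
// A variant can "win" on clicks while losing on revenue
const control: VariantResult = { visitors: 10_000, clicks: 800, revenue: 15_000 };
const variant: VariantResult = { visitors: 10_000, clicks: 950, revenue: 14_200 };
revenuePerVisitor(control); // 1.50
revenuePerVisitor(variant); // 1.42: more clicks, less money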
Problem 4: Ignoring Segments
Your test shows +5% overall. But when you segment:
New visitors: +12% conversion
Returning visitors: -8% conversion
Mobile: +2% (not significant)
Desktop: +9% (significant)
The "winner" is actually hurting your best customers
while improving performance on low-value traffic.
The fix: Pre-define segments before the test. Check results by segment. A test that hurts your highest-value segment isn't a winner.
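A minimal sketch of a per-segment breakdown, assuming you can pull conversion rates by pre-defined segment (the segment names and result shape are made up for illustration):
interface SegmentResult {
  segment: string;     // e.g. "new", "returning", "mobile", "desktop"
  controlRate: number; // Control conversion rate in this segment
  variantRate: number; // Variant conversion rate in this segment
}
function segmentLift(results: SegmentResult[]): Record<string, number> {
  // Relative lift per segment; a positive overall number can hide
  // a negative lift in your most valuable segment
  const lifts: Record<string, number> = {};
  for (const r of results) {
    lifts[r.segment] = (r.variantRate - r.controlRate) / r.controlRate;
  }
  return lifts;
}
segmentLift([
  { segment: "new", controlRate: 0.025, variantRate: 0.028 },
  { segment: "returning", controlRate: 0.060, variantRate: 0.055 },
]);
// → { new: +0.12, returning: -0.083 }: the overall average hides the damage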
Problem 5: Novelty Effect
You redesign the checkout page. Conversion jumps 15% in week one. By week four, it's back to baseline. The "improvement" was just users paying more attention to something new.
The fix: Run tests for at least 2 full business cycles (usually 2-4 weeks). Ignore the first 3-5 days of data. Check for decay in the effect over time.
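A simple decay check, assuming you keep a daily series of relative lifts (a sketch, not any tool's API): compare the first week's average lift against everything after it.
function noveltyCheck(dailyLift: number[]): { week1: number; later: number } {
  // dailyLift[i] = relative lift on day i (e.g. 0.15 for +15%);
  // assumes at least 8 days of data
  const avg = (xs: number[]) => xs.reduce((sum, x) => sum + x, 0) / xs.length;
  return { week1: avg(dailyLift.slice(0, 7)), later: avg(dailyLift.slice(7)) };
}
// If week 1 shows a big lift and later weeks show none, you measured novelty
noveltyCheck([0.15, 0.14, 0.12, 0.10, 0.08, 0.06, 0.05, 0.02, 0.01, 0.01, 0.0]);
// → { week1: 0.10, later: 0.01 }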
The Testing Process That Actually Works
Before the Test
## Test Hypothesis
**What we're changing:** [Specific change]
**Why we think it'll work:** [Based on data, not opinion]
**Primary metric:** [Revenue per visitor]
**Guardrail metrics:** [Bounce rate, support tickets]
**Minimum detectable effect:** [10% relative improvement]
**Required sample size:** [14,300 per variant]
**Estimated duration:** [21 days at current traffic]
**Segments to check:** [New vs returning, mobile vs desktop]
During the Test
Rules:
1. Do NOT check results before the sample size is reached
2. Do NOT stop the test early (even if it "looks good")
3. Do NOT change the test mid-flight
4. DO monitor guardrail metrics for safety (see the sketch after this list)
5. DO check for technical issues (tracking firing correctly?)
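A minimal sketch of the guardrail check from rule 4, assuming you track a few "should not get worse" metrics per variant (the metric names and the 5% tolerance are assumptions):
function guardrailsOk(
  control: Record<string, number>,
  variant: Record<string, number>,
  maxRelativeDegradation: number = 0.05 // Tolerate at most 5% worse
): boolean {
  // Returns false if any guardrail metric (where higher = worse,
  // e.g. bounce rate, support tickets) degrades beyond the tolerance
  return Object.keys(control).every((metric) => {
    const relChange = (variant[metric] - control[metric]) / control[metric];
    return relChange <= maxRelativeDegradation;
  });
}
guardrailsOk(
  { bounceRate: 0.42, supportTicketsPer1k: 3.1 },
  { bounceRate: 0.47, supportTicketsPer1k: 3.0 }
);
// → false: bounce rate is ~12% worse, so pause and investigate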
After the Test
Analysis checklist:
[ ] Sample size reached for all variants
[ ] Test ran for at least 2 full weeks
[ ] Primary metric is significant (p < 0.05; see the z-test sketch after this checklist)
[ ] Effect size is practically meaningful (not just statistically)
[ ] Guardrail metrics are not degraded
[ ] Results hold across key segments
[ ] No novelty effect (compare week 1 vs week 2+)
[ ] Revenue impact estimated in dollars, not just percentages
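To make the significance line concrete, here is a self-contained sketch of the standard two-proportion z-test, the same test the sample-size formula above assumes (the traffic numbers in the example are made up):
function twoProportionZTest(
  conversionsA: number, visitorsA: number,
  conversionsB: number, visitorsB: number
): { zScore: number; pValue: number } {
  const pA = conversionsA / visitorsA;
  const pB = conversionsB / visitorsB;
  // Pooled conversion rate under the null hypothesis of "no difference"
  const pPooled = (conversionsA + conversionsB) / (visitorsA + visitorsB);
  const standardError = Math.sqrt(
    pPooled * (1 - pPooled) * (1 / visitorsA + 1 / visitorsB)
  );
  const zScore = (pB - pA) / standardError;
  // Two-sided p-value from the standard normal distribution
  const pValue = 2 * (1 - standardNormalCdf(Math.abs(zScore)));
  return { zScore, pValue };
}
function standardNormalCdf(z: number): number {
  // Abramowitz–Stegun polynomial approximation of Φ(z), accurate to ~1e-7
  const t = 1 / (1 + 0.2316419 * Math.abs(z));
  const d = 0.3989423 * Math.exp((-z * z) / 2);
  const p = d * t * (0.3193815 + t * (-0.3565638 +
    t * (1.781478 + t * (-1.821256 + t * 1.330274))));
  return z >= 0 ? 1 - p : p;
}
// Example: control converts 429 / 14,800, variant converts 482 / 15,050
twoProportionZTest(429, 14_800, 482, 15_050);
// → { zScore: ≈ 1.53, pValue: ≈ 0.13 }: a ~+10% observed lift that is not significant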
The Testing Roadmap
What to Test (In Order of Impact)
| Area | Typical Lift | Effort |
|---|---|---|
| Checkout flow (steps, friction) | 10-30% | High |
| Pricing and offers | 5-25% | Medium |
| Product page layout | 5-15% | Medium |
| Landing page messaging | 5-20% | Low |
| Email subject lines | 10-30% open rate | Low |
| Navigation and search | 3-10% | Medium |
| Button colors and copy | 1-3% | Low |
Notice: Button colors are at the bottom. Start with the things that actually move revenue.
The Testing Velocity That Works
Small team (< 5 engineers):
→ 2-3 tests per month
→ Focus on high-impact areas only
→ One test per page at a time
Medium team (5-15 engineers):
→ 4-8 tests per month
→ Dedicated experimentation backlog
→ Segment-level analysis
Large team (15+ engineers):
→ 10-20 tests per month
→ Experimentation platform
→ Automated analysis and reporting
Stop Shipping False Positives
The companies that win at experimentation aren't the ones running the most tests. They're the ones running the right tests with the right methodology, and with the discipline to accept "no significant difference" as a valid result.
A test that shows no effect isn't a failure — it's information. A test that shows a false positive and gets shipped? That's the real failure. And it's happening at most companies every single week.
Fix the process. Then trust the results.