A/B Testing Is Lying to You — Statistical Significance Isn't Enough
The 95% Confidence Trap
Your testing tool shows a green banner: "95% statistical significance! Variant B wins!" You ship it. A month later, revenue is flat. You run another test. Same thing. After six months of "winning" tests, your conversion rate hasn't moved.
You're not unlucky. Your testing process is broken.
Why Most A/B Tests Produce False Positives
Problem 1: Peeking
The #1 sin in A/B testing. You check results daily and stop the test when it looks good:
Day 1: Variant B is +15% (p = 0.04) → "Ship it!"
What actually happened:
Day 1: +15% (p = 0.04) ← Random variance
Day 2: +8% (p = 0.12)
Day 3: +3% (p = 0.38)
Day 7: +1% (p = 0.72) ← The real result
Day 14: -0.5% (p = 0.89) ← No difference
Why this happens: Statistical significance fluctuates wildly with small sample sizes. Checking early and stopping when it looks good is like flipping a coin 5 times, getting 4 heads, and concluding the coin is biased.
The fix: Calculate your required sample size BEFORE the test starts. Don't look at results until you hit it.
function requiredSampleSize(
baselineRate: number, // Current conversion rate
minimumEffect: number, // Smallest improvement worth detecting
  power: number = 0.8, // Probability of detecting a real effect
  significance: number = 0.05 // Two-sided false positive rate
): number {
  // Simplified formula for a two-proportion z-test.
  // Note: the z-values below are hardcoded for the default power (0.8)
  // and significance (0.05); if you change those, change the z-values too.
const p1 = baselineRate;
const p2 = baselineRate * (1 + minimumEffect);
const pBar = (p1 + p2) / 2;
const zAlpha = 1.96; // For 95% significance
const zBeta = 0.84; // For 80% power
const n = Math.ceil(
(zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 /
(p2 - p1) ** 2
);
return n; // Per variant
}
// Example: 3% baseline, detect a 10% relative lift
requiredSampleSize(0.03, 0.10);
// → ~53,000 visitors per variant (~106,000 total)
// At 1,000 visitors/day split across both variants ≈ 106 days
Problem 2: Testing Too Many Things
Running 5 tests simultaneously on the same page. Each test has a 5% false positive rate. The probability of at least one false positive across 5 tests:
P(at least one false positive) = 1 - (0.95)^5 = 23%
With 10 simultaneous tests: 40%
With 20 simultaneous tests: 64%
The fix: Run fewer tests with larger expected effects. Apply Bonferroni correction when running multiple tests, or better yet, prioritize ruthlessly.
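To make the arithmetic above concrete, here is a minimal sketch of the family-wise error rate and the Bonferroni-adjusted threshold (the function names are illustrative, not from any testing library):
function familyWiseErrorRate(tests: number, alpha: number = 0.05): number {
  // Probability that at least one of `tests` independent tests
  // produces a false positive at significance level `alpha`
  return 1 - Math.pow(1 - alpha, tests);
}
function bonferroniAlpha(tests: number, alpha: number = 0.05): number {
  // Bonferroni correction: divide the significance threshold by the
  // number of simultaneous tests to cap the family-wise error rate
  return alpha / tests;
}
familyWiseErrorRate(5);  // ≈ 0.23 (the 23% above)
familyWiseErrorRate(20); // ≈ 0.64
bonferroniAlpha(5);      // 0.01: each test must now clear p < 0.01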
Problem 3: Wrong Metric
Testing button color against click-through rate instead of revenue. Testing headline copy against page views instead of purchases. The metric you optimize for determines the result you get.
Metric hierarchy (most to least reliable):
1. Profit per visitor ← Best, if you can measure it
2. Revenue per visitor ← Test against this at minimum
3. Conversion rate ← Acceptable
4. Add-to-cart rate ← Leading indicator only
5. Click-through rate ← Almost meaningless
6. Time on page ← Completely meaningless
The fix: Always test against a metric that's as close to revenue as possible. If your test "wins" on clicks but doesn't move revenue, it didn't win.
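As a rough sketch of what "test against revenue" looks like in code (the data shape and the numbers are hypothetical), compare variants on revenue per visitor rather than clicks:
interface VariantResult {
  visitors: number;
  clicks: number;
  revenue: number; // Total revenue attributed to this variant
}
function revenuePerVisitor(v: VariantResult): number {
  return v.revenue / v.visitors;
}
// A variant can "win" on clicks while losing on revenue
const control: VariantResult = { visitors: 10_000, clicks: 800, revenue: 15_000 };
const variant: VariantResult = { visitors: 10_000, clicks: 950, revenue: 14_200 };
revenuePerVisitor(control); // 1.50
revenuePerVisitor(variant); // 1.42: more clicks, less money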
Problem 4: Ignoring Segments
Your test shows +5% overall. But when you segment:
New visitors: +12% conversion
Returning visitors: -8% conversion
Mobile: +2% (not significant)
Desktop: +9% (significant)
The "winner" is actually hurting your best customers
while improving performance on low-value traffic.
The fix: Pre-define segments before the test. Check results by segment. A test that hurts your highest-value segment isn't a winner.
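A minimal sketch of a per-segment breakdown, assuming you can pull conversion rates by pre-defined segment (the segment names and result shape are made up for illustration):
interface SegmentResult {
  segment: string;     // e.g. "new", "returning", "mobile", "desktop"
  controlRate: number; // Control conversion rate in this segment
  variantRate: number; // Variant conversion rate in this segment
}
function segmentLift(results: SegmentResult[]): Record<string, number> {
  // Relative lift per segment; a positive overall number can hide
  // a negative lift in your most valuable segment
  const lifts: Record<string, number> = {};
  for (const r of results) {
    lifts[r.segment] = (r.variantRate - r.controlRate) / r.controlRate;
  }
  return lifts;
}
segmentLift([
  { segment: "new", controlRate: 0.025, variantRate: 0.028 },
  { segment: "returning", controlRate: 0.060, variantRate: 0.055 },
]);
// → { new: +0.12, returning: -0.083 }: the overall average hides the damage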
Problem 5: Novelty Effect
You redesign the checkout page. Conversion jumps 15% in week one. By week four, it's back to baseline. The "improvement" was just users paying more attention to something new.
The fix: Run tests for at least 2 full business cycles (usually 2-4 weeks). Ignore the first 3-5 days of data. Check for decay in the effect over time.
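A simple decay check, assuming you keep a daily series of relative lifts (a sketch, not any tool's API): compare the first week's average lift against everything after it.
function noveltyCheck(dailyLift: number[]): { week1: number; later: number } {
  // dailyLift[i] = relative lift on day i (e.g. 0.15 for +15%);
  // assumes at least 8 days of data
  const avg = (xs: number[]) => xs.reduce((sum, x) => sum + x, 0) / xs.length;
  return { week1: avg(dailyLift.slice(0, 7)), later: avg(dailyLift.slice(7)) };
}
// If week 1 shows a big lift and later weeks show none, you measured novelty
noveltyCheck([0.15, 0.14, 0.12, 0.10, 0.08, 0.06, 0.05, 0.02, 0.01, 0.01, 0.0]);
// → { week1: 0.10, later: 0.01 }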
The Testing Process That Actually Works
Before the Test
## Test Hypothesis
**What we're changing:** [Specific change]
**Why we think it'll work:** [Based on data, not opinion]
**Primary metric:** [Revenue per visitor]
**Guardrail metrics:** [Bounce rate, support tickets]
**Minimum detectable effect:** [10% relative improvement]
**Required sample size:** [14,300 per variant]
**Estimated duration:** [21 days at current traffic]
**Segments to check:** [New vs returning, mobile vs desktop]
During the Test
Rules:
1. Do NOT check results before the sample size is reached
2. Do NOT stop the test early (even if it "looks good")
3. Do NOT change the test mid-flight
4. DO monitor guardrail metrics for safety (see the sketch after this list)
5. DO check for technical issues (tracking firing correctly?)
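A minimal sketch of the guardrail check from rule 4, assuming you track a few "should not get worse" metrics per variant (the metric names and the 5% tolerance are assumptions):
function guardrailsOk(
  control: Record<string, number>,
  variant: Record<string, number>,
  maxRelativeDegradation: number = 0.05 // Tolerate at most 5% worse
): boolean {
  // Returns false if any guardrail metric (where higher = worse,
  // e.g. bounce rate, support tickets) degrades beyond the tolerance
  return Object.keys(control).every((metric) => {
    const relChange = (variant[metric] - control[metric]) / control[metric];
    return relChange <= maxRelativeDegradation;
  });
}
guardrailsOk(
  { bounceRate: 0.42, supportTicketsPer1k: 3.1 },
  { bounceRate: 0.47, supportTicketsPer1k: 3.0 }
);
// → false: bounce rate is ~12% worse, so pause and investigate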
After the Test
Analysis checklist:
[ ] Sample size reached for all variants
[ ] Test ran for at least 2 full weeks
[ ] Primary metric is significant (p < 0.05; see the z-test sketch after this checklist)
[ ] Effect size is practically meaningful (not just statistically)
[ ] Guardrail metrics are not degraded
[ ] Results hold across key segments
[ ] No novelty effect (compare week 1 vs week 2+)
[ ] Revenue impact estimated in dollars, not just percentages
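To make the significance line concrete, here is a self-contained sketch of the standard two-proportion z-test, the same test the sample-size formula above assumes (the traffic numbers in the example are made up):
function twoProportionZTest(
  conversionsA: number, visitorsA: number,
  conversionsB: number, visitorsB: number
): { zScore: number; pValue: number } {
  const pA = conversionsA / visitorsA;
  const pB = conversionsB / visitorsB;
  // Pooled conversion rate under the null hypothesis of "no difference"
  const pPooled = (conversionsA + conversionsB) / (visitorsA + visitorsB);
  const standardError = Math.sqrt(
    pPooled * (1 - pPooled) * (1 / visitorsA + 1 / visitorsB)
  );
  const zScore = (pB - pA) / standardError;
  // Two-sided p-value from the standard normal distribution
  const pValue = 2 * (1 - standardNormalCdf(Math.abs(zScore)));
  return { zScore, pValue };
}
function standardNormalCdf(z: number): number {
  // Abramowitz–Stegun polynomial approximation of Φ(z), accurate to ~1e-7
  const t = 1 / (1 + 0.2316419 * Math.abs(z));
  const d = 0.3989423 * Math.exp((-z * z) / 2);
  const p = d * t * (0.3193815 + t * (-0.3565638 +
    t * (1.781478 + t * (-1.821256 + t * 1.330274))));
  return z >= 0 ? 1 - p : p;
}
// Example: control converts 429 / 14,800, variant converts 482 / 15,050
twoProportionZTest(429, 14_800, 482, 15_050);
// → { zScore: ≈ 1.53, pValue: ≈ 0.13 }: a ~+10% observed lift that is not significant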
The Testing Roadmap
What to Test (In Order of Impact)
| Area | Typical Lift | Effort |
|---|---|---|
| Checkout flow (steps, friction) | 10-30% | High |
| Pricing and offers | 5-25% | Medium |
| Product page layout | 5-15% | Medium |
| Landing page messaging | 5-20% | Low |
| Email subject lines | 10-30% open rate | Low |
| Navigation and search | 3-10% | Medium |
| Button colors and copy | 1-3% | Low |
Notice: Button colors are at the bottom. Start with the things that actually move revenue.
The Testing Velocity That Works
Small team (< 5 engineers):
→ 2-3 tests per month
→ Focus on high-impact areas only
→ One test per page at a time
Medium team (5-15 engineers):
→ 4-8 tests per month
→ Dedicated experimentation backlog
→ Segment-level analysis
Large team (15+ engineers):
→ 10-20 tests per month
→ Experimentation platform
→ Automated analysis and reporting
Stop Shipping False Positives
The companies that win at experimentation aren't the ones running the most tests. They're the ones running the right tests with the right methodology, and with the discipline to accept "no significant difference" as a valid result.
A test that shows no effect isn't a failure — it's information. A test that shows a false positive and gets shipped? That's the real failure. And it's happening at most companies every single week.
Fix the process. Then trust the results.