Scaled By Design

Fractional CTO + execution partner for revenue-critical systems.


Your A/B Test Isn't Statistically Significant — Here's What to Do About It

March 27, 2026 · ScaledByDesign
Tags: ab-testing, statistics, ecommerce, conversion-optimization, data

The 3-Day Test That Cost $400K

A DTC brand ran an A/B test on their product page. After 3 days, variant B showed a 12% lift in conversion rate. The marketing team celebrated and shipped it to 100% of traffic. Over the next 60 days, conversion rate actually dropped 4% compared to the original. The "winning" variant cost them an estimated $400K in lost revenue.

What happened? They called the test at 78% confidence with 1,200 visitors per variant. The result was noise, not signal.

The Minimum Viable Stats You Need

Sample Size Calculation

Before running any test, calculate the minimum sample size:

Required sample size per variant:
  n = (Z² × 2 × p × (1-p)) / MDE²

Where:
  Z = 1.96 (for 95% confidence)
  p = baseline conversion rate
  MDE = minimum detectable effect, in absolute terms (smallest lift worth detecting)

Note: this simplified formula accounts for significance only. A full calculation
also includes a power term — at 80% power you'd replace Z² with (1.96 + 0.84)²,
roughly doubling n. Treat the numbers below as a floor, not a target.

Example:
  Baseline conversion rate: 3.5%
  MDE: 10% relative lift (0.35% absolute)
  
  n = (1.96² × 2 × 0.035 × 0.965) / 0.0035²
  n = (3.84 × 0.0676) / 0.00001225
  n ≈ 21,200 per variant
  
  At 5,000 visitors/day → need ~8.5 days minimum

If you can't get 21,200 visitors per variant in a reasonable timeframe, you need a larger MDE — meaning you can only detect bigger lifts. This isn't a limitation of the tool; it's a limitation of your traffic.
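The arithmetic above is worth scripting so nobody does it by hand before each test. This is a sketch of the simplified formula used in this article (significance only, Z fixed at 1.96); the function name is illustrative:

```python
import math

def sample_size_per_variant(baseline_rate, relative_mde, z=1.96):
    """Minimum visitors per variant using the simplified formula above.

    baseline_rate: current conversion rate (e.g. 0.035 for 3.5%)
    relative_mde:  smallest relative lift worth detecting (e.g. 0.10 = 10%)
    z:             1.96 for 95% confidence
    """
    absolute_mde = baseline_rate * relative_mde   # 3.5% * 10% = 0.35% absolute
    p = baseline_rate
    n = (z ** 2 * 2 * p * (1 - p)) / absolute_mde ** 2
    return math.ceil(n)

n = sample_size_per_variant(0.035, 0.10)
print(n)                          # ≈ 21,200 per variant, as in the worked example

days = math.ceil(2 * n / 5000)    # both variants share 5,000 visitors/day
print(days)                       # ≈ 9 days minimum
```

Run this with your own baseline rate and traffic before launching anything; if the day count comes back in months, your MDE is too ambitious for your traffic.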

When to Call a Test

The Decision Framework:

  Statistical significance ≥ 95% AND sample size met:
    → Ship the winner ✅

  Statistical significance ≥ 90% AND large effect size (>15% lift):
    → Probably a real winner, but monitor closely after shipping

  Statistical significance < 90% AND sample size met:
    → No detectable difference. The variants are equivalent.
    → Ship whichever is easier to maintain.

  Sample size NOT met (regardless of significance):
    → Keep running. You don't have enough data yet.
    → Do NOT peek and make decisions based on incomplete data.

The Peeking Problem

Every time you check your test results, you increase the chance of a false positive:

Check once at end:        5% false positive rate (as designed)
Check every day for 14d:  ~25% false positive rate
Check every hour:         Even worse

Why? Each check is a separate statistical test. If you check 14 times,
you're running 14 tests — and the chance that at least one shows
significance by random chance is much higher than 5%.
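You can see this inflation directly by simulating A/A tests — two identical variants, so any "significant" result is by definition a false positive. A seeded sketch (all traffic numbers are made up; the significance check is a standard two-proportion z-test):

```python
import math
import random

def z_significant(conv_a, conv_b, n, z_crit=1.96):
    """Two-sided two-proportion z-test at 95%: does A vs. B look 'significant'?"""
    p_a, p_b = conv_a / n, conv_b / n
    p = (conv_a + conv_b) / (2 * n)           # pooled rate
    se = math.sqrt(2 * p * (1 - p) / n)
    if se == 0:
        return False
    return abs(p_a - p_b) / se > z_crit

random.seed(0)
RATE, DAILY, DAYS, RUNS = 0.035, 500, 14, 400  # identical variants: any hit is noise
end_only = peeking = 0
for _ in range(RUNS):
    a = b = 0
    tripped = False
    for day in range(1, DAYS + 1):
        a += sum(random.random() < RATE for _ in range(DAILY))
        b += sum(random.random() < RATE for _ in range(DAILY))
        if z_significant(a, b, day * DAILY):
            tripped = True                    # a daily peek would have "called" it
    peeking += tripped
    end_only += z_significant(a, b, DAYS * DAILY)

print(f"check once at end: {end_only / RUNS:.1%}")  # close to the designed 5%
print(f"peek every day:    {peeking / RUNS:.1%}")   # several times higher
```

The end-of-test rate lands near the designed 5%, while "ship on the first significant peek" fires several times as often — on pure noise.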

Solutions:

  • Pre-register your end date and don't look until then
  • Use sequential testing (tools like Optimizely use this) which adjusts for peeking
  • Use Bayesian methods which are naturally less affected by peeking
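To make the Bayesian option concrete: instead of a p-value you compute the posterior probability that B's true rate beats A's. A minimal sketch with uniform Beta(1,1) priors and Monte Carlo sampling — real tools layer decision thresholds and loss functions on top of this:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=1):
    """P(rate_B > rate_A) under independent Beta(1,1) priors.

    The Beta posterior for a conversion rate after conv successes in
    n visitors is Beta(1 + conv, 1 + n - conv).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        ra = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rb = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rb > ra
    return wins / draws

# The 3-day test from the intro: 1,200 visitors per variant,
# a ~12% observed lift on a ~3.5% baseline (42 vs. 47 orders).
p = prob_b_beats_a(42, 1200, 47, 1200)
print(f"P(B beats A) = {p:.0%}")   # nowhere near a decisive posterior
```

On the intro's numbers this lands around a 70% posterior — a coin flip with a lean, which is a much harder thing to celebrate than "78% confidence" sounded.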

The Metrics That Actually Matter

Most teams test the wrong metric:

Weak metrics (high variance, misleading):
  → Session conversion rate (affected by repeat visitors)
  → Revenue per session (skewed by outlier orders)
  → Bounce rate (noisy, doesn't correlate with revenue)

Strong metrics (lower variance, actionable):
  → Revenue per unique visitor (deduplicated)
  → Orders per unique visitor
  → Add-to-cart rate (leading indicator)
  → Revenue per visitor including returns (lagging, but truest)
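The difference between the weak and strong denominators is easy to see on a toy event log (data entirely made up): a repeat visitor's sessions inflate the session count without adding revenue.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical event log: (visitor_id, session_revenue)
events = [("v1", 0.0), ("v1", 120.0), ("v2", 0.0),
          ("v2", 0.0), ("v3", 45.0), ("v3", 45.0)]

# Weak: revenue per session — repeat visitors count multiple times
rps = mean(rev for _, rev in events)

# Stronger: revenue per unique visitor — deduplicate first
by_visitor = defaultdict(float)
for visitor, rev in events:
    by_visitor[visitor] += rev
rpv = mean(by_visitor.values())

print(f"revenue/session: ${rps:.2f}")   # $35.00 across 6 sessions
print(f"revenue/visitor: ${rpv:.2f}")   # $70.00 across 3 unique visitors
```

Same revenue, very different metric — and the session-based version will drift with any change that alters revisit behavior, not just the one you're testing.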

The Practical Testing Roadmap

For brands doing $5-50M in revenue:

Month 1: Foundation
  → Set up proper tracking (server-side events, deduplicated)
  → Calculate sample sizes for your traffic level
  → Identify your top 3 highest-impact pages

Month 2: High-Impact Tests
  → Homepage hero: messaging, CTA, social proof placement
  → Product page: image layout, reviews placement, add-to-cart button
  → Cart: urgency elements, cross-sell, shipping threshold

Month 3: Iteration
  → Take the winning concepts and test variations
  → Test pricing page if applicable
  → Start testing email flows (subject lines, send times)

Ongoing:
  → 2-3 tests running at all times
  → Monthly test review meeting (what won, what lost, what we learned)
  → Quarterly strategy update based on accumulated learnings

When Not to Test

Testing everything is as bad as testing nothing. Don't test when:

  • Traffic is too low: If reaching sample size takes > 6 weeks, the test isn't worth running. Just make the change and measure before/after.
  • The change is obvious: Fixing a broken checkout flow doesn't need an A/B test. Fix it.
  • The change is irreversible: Brand redesigns, platform migrations — these aren't A/B testable in the traditional sense.
  • The opportunity cost is too high: Running 20 low-impact tests means your highest-impact ideas sit in a queue.

A/B testing is a tool, not a religion. Use it when you have sufficient traffic, a clear hypothesis, and the discipline to wait for real statistical significance. Everything else is just confirmation bias with a dashboard.
