
Experimentation · April 2026

Why most A/B tests are statistically broken

If you run CRO seriously for a year, this is the quiet realisation: most of the "wins" you ship don’t actually work. The test said they did. The data said they did. They didn’t.

Here are the three reasons that’s usually happening.

1. Peeking

The simplest and most common. You start a test on Monday. By Thursday, variation B is up 8% at 93% confidence. You stop the test and ship B.

What happened: you looked at the test every day and stopped it the moment it crossed a threshold. This is called peeking, and it inflates your false-positive rate dramatically. A test designed to be wrong 5% of the time becomes wrong 20-30% of the time when you peek daily and stop on the first significant result.
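If you want to see the inflation for yourself, here is a rough simulation sketch (illustrative traffic numbers, not anyone's real data): run A/A tests where nothing differs between the two arms, peek at a z-test every day, and stop at the first "significant" result.

```python
# Rough sketch: how often does daily peeking declare a "winner"
# when there is NO real difference between A and B?
import numpy as np

rng = np.random.default_rng(0)

def peeking_false_positive_rate(n_tests=2000, days=14, visitors_per_day=1000,
                                base_rate=0.05, z_crit=1.96):
    false_positives = 0
    for _ in range(n_tests):
        conv_a = conv_b = n_a = n_b = 0
        for _ in range(days):
            # both arms share the same true conversion rate (an A/A test)
            conv_a += rng.binomial(visitors_per_day, base_rate)
            conv_b += rng.binomial(visitors_per_day, base_rate)
            n_a += visitors_per_day
            n_b += visitors_per_day
            # the daily "peek": two-proportion z-test on the running totals
            p_pool = (conv_a + conv_b) / (n_a + n_b)
            se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
            z = (conv_b / n_b - conv_a / n_a) / se
            if abs(z) > z_crit:        # stop at the first "significant" day
                false_positives += 1
                break
    return false_positives / n_tests

print(peeking_false_positive_rate())   # comes out well above the nominal 5%
```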

The fix is simple and painful: decide the sample size before the test goes live, and refuse to look at confidence until you hit it.
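In practice, "decide the sample size before launch" means a power calculation. A minimal sketch using the standard two-proportion formula; the baseline rate and minimum detectable effect below are made-up numbers, not client figures.

```python
# Sketch: fix the sample size before launch with the standard
# two-proportion formula. Baseline and MDE are illustrative.
from scipy import stats

def sample_size_per_arm(baseline, mde_relative, alpha=0.05, power=0.8):
    """Visitors needed per variation for a two-sided test."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p2 - p1) ** 2) + 1

# e.g. 3% baseline conversion, hoping to detect a 10% relative lift
print(sample_size_per_arm(0.03, 0.10))   # on the order of ~50k visitors per arm
```

That number is usually the painful part: it tells you the test needs to run far longer than Thursday.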

2. Stopping at the first positive result

Related to peeking but worse. You do the full test. It hits significance. You ship. But the effect size you saw was almost certainly inflated, because of a phenomenon called "winner’s curse". The estimate at the first-crossing moment is systematically higher than the true effect.

Translation: your test showed +8%. The real effect, if there even was one, was probably +2-4%.
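Here is the curse in miniature, as a sketch with illustrative numbers: a small true lift, daily peeking, and the lift reported at the moment the test first "wins".

```python
# Sketch: "winner's curse". True lift is +2%, but the average lift
# observed at the moment a test first crosses significance is far higher.
import numpy as np

rng = np.random.default_rng(1)
base, true_lift = 0.05, 0.02              # 5% baseline, +2% relative lift
observed_lifts = []

for _ in range(5000):
    conv_a = conv_b = n = 0
    for _ in range(14):                   # peek once a day for two weeks
        n += 1000
        conv_a += rng.binomial(1000, base)
        conv_b += rng.binomial(1000, base * (1 + true_lift))
        p_a, p_b = conv_a / n, conv_b / n
        pool = (conv_a + conv_b) / (2 * n)
        se = np.sqrt(pool * (1 - pool) * 2 / n)
        if p_b - p_a > 1.96 * se:         # stop the first time B "wins"
            observed_lifts.append((p_b - p_a) / p_a)
            break

print(f"true lift: +{true_lift:.0%}")
print(f"average lift reported by early-stopped winners: +{np.mean(observed_lifts):.0%}")
```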

The fix: don’t stop at the first crossing. Run for the planned duration. Better yet, run for two full business cycles, so you capture weekday and weekend variance.

3. Segment hunting

Test fails on the main metric. Someone says "let’s check by device" or "let’s look at returning users only". You slice ten times. One of the slices is positive. You ship the "mobile-only win".

If you run 10 independent hypothesis tests at 95% confidence when nothing is actually different, the chance that at least one comes back as a false positive is about 40%. Run 20 and it's 64%. Segment hunting looks rigorous and is actually a false-positive factory.
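The arithmetic behind those numbers, assuming the slices are independent (real segments aren't exactly, but the scale holds):

```python
# Probability that at least one of k independent slices is a
# false positive at 95% confidence: 1 - 0.95**k
for k in (1, 5, 10, 20):
    print(f"{k:>2} slices: {1 - 0.95**k:.0%} chance of at least one false positive")
```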

The fix: commit to the primary metric and audience before the test starts. Segments are hypothesis generation for the next test, not validation for this one.

The quiet cost

A team running 40 tests a year with these problems probably ships 15-20 "winners". Of those, maybe 8-10 are real. The rest is drift: changes that feel like progress but don't move the revenue number when rolled out to 100%.

You can usually tell which team this is. They run lots of tests. They’ve shipped lots of wins. And the revenue line doesn’t bend.

How we run tests

Plain English, no frills:

  • Primary metric locked before launch. Usually revenue-per-visitor.
  • Sample size and duration agreed in writing before launch. Minimum two full business cycles.
  • No peeking. Confidence gets looked at once: at the end.
  • Segments looked at for hypothesis generation only. Never to rescue a failed main-metric test.
  • Guardrails. If a key metric (checkout errors, page load) degrades, we stop regardless.
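For what it's worth, that checklist fits in a few lines of code. A minimal sketch of a pre-registered test plan; the field names and thresholds are illustrative, not a real tool.

```python
# Minimal sketch of a pre-registered test plan. Everything is fixed
# before launch; field names and thresholds are illustrative.
from dataclasses import dataclass, field

@dataclass(frozen=True)            # frozen: fields can't be reassigned mid-test
class TestPlan:
    primary_metric: str            # e.g. revenue per visitor
    sample_size_per_arm: int       # from the power calculation, not from vibes
    min_duration_days: int         # at least two full business cycles
    guardrails: dict = field(default_factory=dict)  # metric -> max tolerated degradation

plan = TestPlan(
    primary_metric="revenue_per_visitor",
    sample_size_per_arm=53000,
    min_duration_days=14,
    guardrails={"checkout_error_rate": 0.002, "p75_page_load_ms": 200},
)
```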

Boring. And it’s the reason our clients’ revenue lines bend when we ship a winner.


If your testing program is shipping wins that don’t seem to show up in the P&L, that’s the problem. Book a 15-minute call and we can have a look at how you’re running tests.