Type I Error (False Positive)
Contributor
Ronny Kohavi, best-selling author and former executive at Microsoft, Airbnb, and Amazon
What is a Type I Error (False Positive)?
In experimentation, a Type I Error, or a false positive, happens when you conclude that a change made a difference when it actually didn’t.
It’s like announcing a winner in a race when the runners tied.
In A/B testing, the null hypothesis assumes no difference between variants. A Type I Error means you incorrectly reject the null hypothesis, thinking your treatment is better when it’s not.
The Type I error rate is represented by alpha (α): the probability of declaring a statistically significant result when the null hypothesis is actually true, i.e., when there is no real effect.
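To see what α = 0.05 means in practice, here is a minimal simulation sketch: thousands of A/A comparisons, where both groups are drawn from the same distribution, each evaluated with a two-sample t-test. The metric values, sample sizes, and seed are arbitrary choices for illustration; roughly 5% of these no-effect tests should still come out "significant."

```python
# Sketch: simulate A/A tests (no true effect) and count how often a
# two-sample t-test reports p < 0.05. The false positive rate should
# hover around alpha = 0.05. All parameters here are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_tests, n_users = 10_000, 1_000

false_positives = 0
for _ in range(n_tests):
    control = rng.normal(loc=10.0, scale=2.0, size=n_users)
    treatment = rng.normal(loc=10.0, scale=2.0, size=n_users)  # same distribution
    _, p_value = stats.ttest_ind(control, treatment)
    if p_value < alpha:
        false_positives += 1

print(f"False positive rate: {false_positives / n_tests:.3f}")  # ~0.05
```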
“When you have a statistically significant A/B test and launch the treatment, the False Positive Risk (FPR) provides an estimate of the probability that you are making a mistake, if your goal was to launch a statistically significant improvement.
The FPR is the probability that a statistically significant result is a false positive; that is, the true effect is consistent with the null hypothesis, given our sample size. A very common misunderstanding is that the p-value is this probability, but the FPR needs to be computed using Bayes’ Rule, and in domains with a low experiment success rate of 10-20%, it can be 4 to 5 times higher than the p-value.
To reduce the FPR, lower alpha, the p-value threshold for statistical significance (e.g., from 0.05 to 0.005), as recommended in this paper by 72 leading authors.
Because lowering alpha increases false negatives, replicate A/B tests with borderline p-values and combine the results from the replications using meta-analysis techniques.”
Ronny Kohavi, Best-selling author and former executive at Microsoft, Airbnb, and Amazon
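To make the quote concrete, here is a small sketch of the Bayes' Rule computation behind the FPR. The success rates and statistical power used below are assumptions for illustration, not measured values:

```python
# Sketch: False Positive Risk (FPR) via Bayes' Rule. The success rates
# and power below are hypothetical illustration values.
def false_positive_risk(alpha: float, power: float, success_rate: float) -> float:
    """P(no true effect | statistically significant result)."""
    p_sig_and_null = alpha * (1 - success_rate)  # significant by pure chance
    p_sig_and_effect = power * success_rate      # significant and real
    return p_sig_and_null / (p_sig_and_null + p_sig_and_effect)

# With 80% power and a 10-20% experiment success rate, the FPR is
# several times the 0.05 significance threshold:
for rate in (0.10, 0.20):
    print(f"success rate {rate:.0%}: FPR = {false_positive_risk(0.05, 0.8, rate):.2f}")
# success rate 10%: FPR = 0.36
# success rate 20%: FPR = 0.20
```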
Why It Happens
Several testing behaviors can inflate your false positive risk:
- Peeking at results and stopping early: This violates the test's statistical assumptions. Each unplanned look at your data increases the chance of being fooled by randomness (simulated in the sketch after this list).
- Multiple comparisons without correction: Running many tests (variants, metrics, or segments) and using the same 0.05 threshold inflates your odds of getting a lucky result.
- Platform or setup errors: Poor randomization, sample ratio mismatches (SRM), or bugs in tracking can create misleading “significance.”
- Misunderstanding p-values: A p-value < 0.05 does not mean there’s a 95% chance your result is real. It means there’s less than a 5% chance of seeing a result at least that extreme if the null hypothesis were true.
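Here is a minimal simulation of the peeking problem described in the first bullet. The interim checkpoints and sample sizes are arbitrary; the point is that checking the same A/A test several times and stopping at the first p < 0.05 pushes the false positive rate well above 5%:

```python
# Sketch: how repeated peeking inflates the Type I error rate. Each
# simulated A/A test is checked at several interim points; stopping the
# first time p < 0.05 yields far more than 5% false positives.
# Checkpoints and sample sizes are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, n_tests, n_users = 0.05, 5_000, 2_000
peek_points = [400, 800, 1_200, 1_600, 2_000]  # interim sample sizes

stopped_early = 0
for _ in range(n_tests):
    control = rng.normal(size=n_users)
    treatment = rng.normal(size=n_users)  # no true difference
    for n in peek_points:
        _, p = stats.ttest_ind(control[:n], treatment[:n])
        if p < alpha:
            stopped_early += 1
            break

print(f"False positive rate with peeking: {stopped_early / n_tests:.3f}")  # well above 0.05
```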
Consequences of Acting on a False Positive
- You roll out a change that doesn’t actually help, and may even hurt performance.
- You waste engineering and design time scaling the wrong idea.
- You lose trust in your experimentation program.
- You delay discovering real, impactful changes because of bad data.
In high-stakes environments, false positives can lead to millions in lost revenue or degraded user experience. That’s why teams often use stricter thresholds (e.g., 0.01 or even 0.005) when the cost of error is high.
How to Reduce Type I Errors
- Stick to your pre-defined alpha: Set your significance threshold before the test and don’t move it.
- Use sequential testing or Bayesian methods if you need to monitor results before the end.
- Correct for multiple comparisons: Use methods like Bonferroni or Benjamini-Hochberg if testing lots of variants or metrics (see the sketch after this list).
- Run A/A tests: This helps verify whether your platform’s error rate is behaving as expected.
- QA your experiment setup: Fix bugs, check triggers, and look for SRMs.
- Lower alpha when needed: In low-tolerance situations, use stricter thresholds like 0.01 or 0.005.
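For the multiple-comparisons point above, here is a short sketch using the multipletests function from statsmodels; the p-values are hypothetical, one per metric or variant tested:

```python
# Sketch: correcting a batch of p-values for multiple comparisons.
# The p-values below are hypothetical.
from statsmodels.stats.multitest import multipletests

p_values = [0.003, 0.021, 0.049, 0.18, 0.62]  # one per metric/variant tested

# Bonferroni: strict, controls the family-wise error rate.
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg: less strict, controls the false discovery rate.
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni rejections:        ", list(reject_bonf))
print("Benjamini-Hochberg rejections:", list(reject_bh))
```

Note how a raw p-value of 0.049 clears the naive 0.05 bar but may not survive either correction once the other comparisons are accounted for.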
What’s the Difference Between P-Value and False Positive Risk?
- P-value is the probability of observing a result at least as extreme as yours, assuming there is no real effect.
- False Positive Risk (FPR) is the chance that a statistically significant result is actually a false positive.
When most of your tested ideas have no real effect (as is often the case), your actual false positive risk can be far higher than your alpha. For example, with a 10% success rate and 80% power, roughly a third of results significant at α = 0.05 are false positives.
This is why some teams advocate for replication and meta-analysis—running the test again or pooling data from multiple tests to see if the effect holds up.
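One simple way to pool a borderline result with its replication is Stouffer's Z method, one common meta-analysis technique, available in scipy. The p-values and weights below are hypothetical:

```python
# Sketch: combining a borderline result with its replication using
# Stouffer's Z method. P-values and weights here are hypothetical.
from scipy.stats import combine_pvalues

p_values = [0.04, 0.06]   # original test and its replication
weights = [100.0, 110.0]  # e.g., roughly the square root of each test's sample size

stat, combined_p = combine_pvalues(p_values, method="stouffer", weights=weights)
print(f"Combined p-value: {combined_p:.4f}")  # stronger evidence than either test alone
```

If the effect is real, the pooled p-value will typically be smaller than either individual result; if the first test was a fluke, the replication will pull the combined evidence back toward the null.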