Statistical Significance

Contributor

Andrea Corvi, Head of Experimentation at Livescore Group.

What is Statistical Significance in A/B Testing?

Statistical significance is a way to quantify whether the difference between a control and variant is likely real or just due to random variation. In an A/B test, a result is statistically significant if the p-value is smaller than a pre-set threshold, often 0.05.

This means the observed difference is unlikely under the assumption that there’s no real difference (the null hypothesis is true). So, you’re confident enough to reject the null and conclude the change may have caused the observed effect.

There’s also a dual relationship between p-values and confidence intervals: If the 95% confidence interval of the difference between groups doesn’t include zero, the result is statistically significant at the 0.05 level.
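
As an illustration, here is a minimal sketch in Python of both views of the same test, assuming hypothetical conversion counts. Note that the p-value/confidence-interval duality is only approximate here, since the two statsmodels functions use slightly different approximations.

```python
# A minimal sketch of a two-proportion z-test, with hypothetical
# conversion counts for a variant and a control.
from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep

conversions = [560, 480]     # variant, control (hypothetical)
visitors = [10_000, 10_000]

# p-value under the null hypothesis that both rates are equal
stat, p_value = proportions_ztest(conversions, visitors)

# 95% confidence interval for the difference in conversion rates
ci_low, ci_high = confint_proportions_2indep(
    conversions[0], visitors[0], conversions[1], visitors[1]
)

print(f"p-value: {p_value:.4f}")
print(f"95% CI for the difference: [{ci_low:.4f}, {ci_high:.4f}]")
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```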

In Simpler Terms…

Imagine you’re testing two versions of a product page. Version A has the original “Buy Now” button. Version B changes the button color from blue to green. After a week, Version B gets more purchases. Great, so green wins, right?

Not so fast.

How do you know that the increase wasn’t just random luck? Maybe it rained that week, and more people stayed indoors and shopped online. Maybe a bug slowed down Version A for a few users. Maybe it’s just noise.

Statistical significance helps you figure out if what you’re seeing is likely real or just random.

It’s like flipping a coin. You expect heads and tails to come up equally over time. But if you flip it 10 times and get 7 heads, is the coin rigged? Probably not. Flip it 1,000 times and get 700 heads? Now that’s suspicious. That’s the point: the bigger and clearer the difference, the less likely it’s random.
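
You can check this intuition directly with an exact binomial test. A quick sketch (scipy 1.7 or newer), mirroring the flip counts above:

```python
# A minimal sketch of the coin-flip intuition using an exact binomial test.
from scipy.stats import binomtest

# 7 heads in 10 flips vs. 700 heads in 1,000 flips,
# each tested against a fair coin (p = 0.5)
small = binomtest(7, n=10, p=0.5)
large = binomtest(700, n=1000, p=0.5)

print(f"7/10 heads:     p-value = {small.pvalue:.3f}")   # ~0.34: plausibly a fair coin
print(f"700/1000 heads: p-value = {large.pvalue:.1e}")   # vanishingly small: suspicious
```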

In an A/B test, we calculate something called a p-value. If it’s below a certain threshold—usually 0.05—we call the result statistically significant. That means the odds of seeing this result purely by chance (assuming there’s no real difference) are very low.

Statistical significance doesn’t guarantee your test is “right.” But it does give you confidence that the change you made likely had a real effect.

Where Statistical Significance Fits in Experimentation

It’s a core part of the hypothesis testing process:

  • Form a testable prediction (hypothesis)
  • Run an A/B test to collect data
  • Use statistical significance to help decide whether the outcome supports that prediction

But it’s not binary or definitive. It’s a probability threshold that helps you reduce risk, not guarantee correctness.

“Statistical Significance—a tongue-twister for most people—is often misused as a ‘magical threshold’ to validate test results and ignore other important aspects of the experiment.

When running A/B tests, teams often fixate on reaching statistical significance without considering the broader statistical landscape and business context. This tunnel vision can cause us to lose sight of our true objective: make an informed decision and reduce the risk of a change.

Speaking of risk, the significance threshold you choose should align with your business’s risk tolerance. A more lenient threshold (90%) might be appropriate for low-risk changes, while critical updates may warrant stricter standards (99%). This decision should factor in your business scale, implementation costs, potential risks, and overall business impact.”

Andrea Corvi, Head of Experimentation at Livescore Group.

How to Interpret Statistical Significance

A result is statistically significant when the p-value is below your predefined threshold (e.g., 0.05). This means:

  • The observed result would be rare if there were no actual difference
  • You have enough evidence to reject the null hypothesis
  • The result is unlikely due to chance, but that doesn’t mean it’s important or repeatable

P-value ≠ probability your hypothesis is true.

It’s the probability of seeing your result (or more extreme) assuming no effect exists.
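
One way to internalize that definition is to simulate it: when both groups share the same true conversion rate, the p-value is roughly the fraction of simulated experiments that show a difference at least as extreme as yours. A minimal sketch, with hypothetical numbers:

```python
# A minimal simulation of what a p-value measures: how often a result
# at least this extreme appears when the null ("no effect") is true.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000                # visitors per group
base_rate = 0.05          # true conversion rate, identical in both groups
observed_diff = 0.004     # the lift observed in our (hypothetical) test

# Simulate many A/B tests where there is genuinely no difference
sims = 20_000
a = rng.binomial(n, base_rate, sims) / n
b = rng.binomial(n, base_rate, sims) / n
null_diffs = b - a

# Fraction of null experiments with a difference at least as extreme
p_sim = np.mean(np.abs(null_diffs) >= observed_diff)
print(f"simulated two-sided p-value: {p_sim:.3f}")
```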

Choosing the Right Significance Threshold

There’s no one-size-fits-all. Your threshold should reflect your business’s risk profile:

Confidence Level   P-value   When to Use
90%                0.10      Low-risk, exploratory tests
95%                0.05      Standard for most A/B tests
99%                0.01      High-risk, costly decisions

Learn more about confidence level.

Use higher confidence (lower p-value) when:

  • You’re testing a critical flow
  • A mistake is hard to undo
  • The test affects lots of users or high-value segments

Statistical vs. Practical Significance

  • Statistical significance tells you if the difference is real
  • Practical significance asks: is the difference big enough to matter?

Example: A 0.5% increase in conversion rate might be statistically significant, but if it costs $20K to implement, it may not be worth it.

Always ask: Is the result actionable, not just significant?
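
To make that concrete, here is a rough back-of-the-envelope sketch. The traffic, order-value, and cost figures are hypothetical, and the 0.5% increase is read as a relative lift:

```python
# A minimal practical-significance check with hypothetical figures.
monthly_visitors = 200_000
baseline_cr = 0.040             # 4.0% conversion rate
absolute_lift = 0.0002          # a 0.5% relative increase: 4.00% -> 4.02%
value_per_conversion = 30.0     # dollars
implementation_cost = 20_000.0  # dollars

extra_conversions = monthly_visitors * absolute_lift     # 40 per month
monthly_gain = extra_conversions * value_per_conversion  # $1,200 per month
payback_months = implementation_cost / monthly_gain      # ~16.7 months

print(f"extra conversions per month: {extra_conversions:.0f}")
print(f"monthly gain: ${monthly_gain:,.0f}")
print(f"payback period: {payback_months:.1f} months")
```

At these (made-up) numbers, even a statistically significant lift takes well over a year to pay for itself.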

Common Pitfalls and Misinterpretations

  • Peeking: Checking results early increases false positives.
  • Over-interpreting non-significant results: A lack of significance doesn’t mean there’s no effect. Maybe the sample size was too small.
  • Ignoring test assumptions: Many tests assume normality, independence, and so on; violating these assumptions weakens your conclusions.
  • Multiple testing: Testing lots of variants or metrics increases false positives unless you correct for it (see the sketch after this list).
  • Misusing power analysis: Post-hoc power calculations are unreliable. Power should be calculated before you run the test.
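
On the multiple-testing pitfall, here is a minimal correction sketch using statsmodels. The five p-values are hypothetical; notice how results that look significant on their own stop being significant after Holm's adjustment.

```python
# A minimal sketch of correcting for multiple comparisons, assuming
# hypothetical p-values from five variants tested against one control.
from statsmodels.stats.multitest import multipletests

p_values = [0.012, 0.034, 0.049, 0.21, 0.65]  # hypothetical

# Holm's step-down procedure controls the family-wise error rate
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")

for p, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p = {p:.3f} -> adjusted p = {p_adj:.3f} "
          f"({'significant' if sig else 'not significant'})")
```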

Best Practices for Using Statistical Significance

  • Predefine your threshold before launching the test
  • Set sample size based on power and the minimum effect of interest (a power calculation sketch follows this list)
  • Always consider practical significance before acting on a result
  • Use confidence intervals to understand uncertainty and estimate the range of plausible outcomes
  • Control for multiple comparisons when testing many variants or metrics
  • Avoid early stopping unless using proper sequential testing
  • Trust, but verify: Run A/A tests and monitor guardrails to ensure platform reliability
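
As promised above, a minimal sketch of that pre-test power calculation, assuming a hypothetical 4% baseline conversion rate and a 0.5-percentage-point minimum effect of interest:

```python
# A minimal pre-test sample size calculation for a two-proportion test,
# with hypothetical baseline and minimum-effect values.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.040
minimum_effect = 0.045  # smallest lift worth detecting: 4.0% -> 4.5%

effect_size = proportion_effectsize(minimum_effect, baseline)  # Cohen's h
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,   # significance threshold
    power=0.80,   # probability of detecting the effect if it exists
    ratio=1.0,    # equal group sizes
)
print(f"required sample size per group: {n_per_group:,.0f}")
```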