P-Hacking
What is P-hacking?
P-hacking happens when experimenters manipulate their experiment’s data or process to make a result look statistically significant, even if no real effect exists.
It breaks the assumptions behind statistical tests, inflates the Type I error rate, and undermines the trustworthiness of A/B testing programs. P-hacking is sometimes called “data dredging” and usually comes from actions like peeking at results too often, stopping tests prematurely, or cherry-picking winning metrics.
Common Behaviors That Lead to P-hacking
- Repeatedly checking results (“peeking”) and stopping when p-values look good.
- Analyzing many metrics or user segments and only reporting the significant ones.
- Testing multiple variants without correcting for multiple comparisons.
- Running the same experiment several times until something appears significant.
- Making decisions based on early, incomplete data.
Why Does P-hacking Happen?
Often, it comes from pressure to show results or from a misunderstanding of what p-values mean. A p-value of 0.05 doesn't mean the result is definitely real; it means there's a 5% chance of observing a result at least this extreme if there were actually no effect.
If you look at your data often enough or test enough metrics, you’ll eventually hit that 5% threshold just by chance.
That's not a real signal. It's a false discovery.
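You can see this inflation directly by simulating A/A tests, where no real effect exists by construction. The sketch below (a minimal pure-Python illustration; the function names and parameters are made up for this example) compares an experimenter who looks at the data once against one who peeks ten times and stops at the first "significant" result:

```python
import random

random.seed(1)

def z_for_diff(a_succ, b_succ, n):
    # Two-proportion z-statistic with pooled variance, equal group sizes n.
    p_a, p_b = a_succ / n, b_succ / n
    p = (a_succ + b_succ) / (2 * n)
    se = (2 * p * (1 - p) / n) ** 0.5
    return 0.0 if se == 0 else (p_a - p_b) / se

def run_test(peeks, n_per_peek, base_rate=0.10):
    # Simulate an A/A test (both arms have the same true rate, so any
    # "significant" result is a false positive). Return True if any peek
    # shows |z| > 1.96, i.e. nominal p < 0.05 and the tester stops early.
    a = b = n = 0
    for _ in range(peeks):
        for _ in range(n_per_peek):
            a += random.random() < base_rate
            b += random.random() < base_rate
        n += n_per_peek
        if abs(z_for_diff(a, b, n)) > 1.96:
            return True
    return False

trials = 500
# Same total traffic (5,000 users per arm); only the number of looks differs.
fp_once = sum(run_test(1, 5000) for _ in range(trials)) / trials
fp_peek = sum(run_test(10, 500) for _ in range(trials)) / trials
print(f"false positive rate, single look: {fp_once:.3f}")  # close to the nominal 0.05
print(f"false positive rate, 10 peeks:   {fp_peek:.3f}")  # noticeably higher
```

Both testers see exactly the same amount of traffic; the only difference is how often they look. Repeated looks give chance more opportunities to cross the threshold, which is exactly the Type I error inflation described above.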
Other reasons include:
- Desire to validate pre-existing ideas, and
- Organizational incentives that reward positive results over careful science.
When companies celebrate “win rates” without checking rigor, they often encourage P-hacking, whether they realize it or not.
Consequences of P-hacking
- Launching features that don’t really improve metrics.
- Wasting development resources.
- Damaging user experience.
- Losing trust in the experimentation program.
- Making decisions based on noise instead of real effects.
In short, P-hacking creates an illusion of progress while hiding the reality.
How to Prevent P-hacking
- Set your sample size and test duration before launching and stick to them.
- Use sequential testing if you need the flexibility to peek without inflating Type I error.
- Correct for multiple comparisons if you analyze multiple metrics, variants, or segments.
- Don’t modify tests mid-flight by adjusting targeting, KPIs, or variations.
- Run regular A/A tests to see how often false positives happen naturally.
- Encourage intellectual humility and reward teams for accurate insights, not just “wins.”
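For the multiple-comparisons point above, a standard correction is the Holm-Bonferroni step-down procedure, which controls the family-wise error rate across all the metrics you analyze. A minimal sketch (the function name and example p-values are illustrative, not from any particular library):

```python
def holm_bonferroni(p_values, alpha=0.05):
    # Holm's step-down procedure: sort p-values ascending and compare the
    # k-th smallest (0-indexed rank) against alpha / (m - rank).
    # Stop at the first failure; everything larger is also non-significant.
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            significant[i] = True
        else:
            break
    return significant

# Eight metrics analyzed on one experiment: several are below 0.05 on
# their own, but only the strongest survives the correction.
p_vals = [0.001, 0.049, 0.03, 0.20, 0.04, 0.65, 0.012, 0.07]
print(holm_bonferroni(p_vals))
```

Without the correction, four of these eight metrics would be declared "significant" at 0.05; with it, only one is. That gap is the multiple-comparisons problem in miniature.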
“P-hacking is one of the most dangerous pitfalls in A/B testing because it makes random noise look like significant results.
When testers stop experiments prematurely after seeing a ‘significant’ p-value, they’re cherry-picking the data points that support their hypothesis while ignoring the underlying statistics. You start rolling out changes you believe are impactful but aren’t, which undermines your credibility.
To avoid p-hacking, determine your sample size and test duration before starting, and have the discipline to run the full experiment regardless of interim results. Otherwise, you’re jeopardizing both the test results and the entire testing program.”
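The sample size mentioned in the quote can be computed before launch from the baseline rate and the smallest lift you care about detecting. One common approach, sketched here with the normal approximation for a two-sided, two-proportion test (the function name and example numbers are illustrative):

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_proportions(p_base, mde, alpha=0.05, power=0.8):
    # Users needed per variant to detect an absolute lift of `mde` over
    # baseline rate `p_base` with the given significance level and power.
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_b = NormalDist().inv_cdf(power)
    p1, p2 = p_base, p_base + mde
    p_bar = (p1 + p2) / 2
    n = ((z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
          + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2) / mde ** 2
    return ceil(n)

# e.g. 10% baseline conversion rate, detecting a +1 point absolute lift:
print(sample_size_two_proportions(0.10, 0.01))
```

Fixing this number up front, and dividing it by expected daily traffic to set the test duration, removes the temptation to stop whenever an interim p-value happens to dip below 0.05.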