Design A/B tests with clear statistical guardrails before the test begins and stick to them during analysis. Recommended practices:
● Pre-calculate sample size and test duration so you don't stop tests early
● Define one primary metric and treat others as guardrails
● Apply Bonferroni or Sidak correction for multiple variants
● Avoid post-test segmentation and metric switching
Tools like Convert help enforce these with primary goal labeling, p-value visualization, sequential testing, and built-in multiple variant correction.
In this article, we're looking at where p-hacking creeps into experimentation (intentionally or accidentally) and how to prevent it without slowing down your decision-making.
What is P-Hacking in A/B Testing?
P-hacking in A/B testing means manipulating your experiment data until statistical significance appears. This is a byproduct of Goodhart's Law: optimizing for test wins rather than insights learned.
When testers are overly focused on getting a winner at the expense of statistical rigor, they often exhibit p-hacking behavior.
What Damage Does P-Hacking Cause in Experimentation Programs?
P-hacking inflates the number of "winning" experiments that are actually statistical noise.
A 2018 analysis of 2,101 commercially run A/B tests found that about 57% of experimenters engaged in p-hacking when their results reached the 90% confidence level. Even a modest amount of p-hacking increases the False Discovery Rate (FDR) of a testing program.
At a 90% confidence threshold, testers who engage in p-hacking push their FDR from 33% to 42%.
This means over 40% of the reported wins could actually be false positives. Sundar Swaminathan explains the danger clearly:
P-hacking is one of the most dangerous pitfalls in A/B testing because it makes random noise look like significant results.
When testers stop experiments prematurely after seeing a “significant” p-value, they’re cherry-picking data points that support their hypothesis while ignoring statistics. You start to roll out changes you think are impactful but aren't, which undermines your credibility.
To avoid p-hacking, determine your sample size and test duration before starting, and be disciplined to run the full experiment regardless of interim results. Otherwise, you’re jeopardizing the test results and the entire testing program.
Learn More: Decode and Master A/B Testing Statistics
When Does P-Hacking Creep into A/B Testing Workflows?
P-hacking isn't always deliberate. It often shows up as small shortcuts taken during experiment execution or analysis.
During the Experiment
- Early stopping after peeking: Simply looking at A/B test results before the pre-calculated stopping point isn't a sin. The real problem is when you allow interim results to determine when the test stops. P-values fluctuate as you collect data. Early statistical significance often disappears as the sample grows. Stopping the test the moment a p-value indicates your preferred winner is when peeking turns into p-hacking.
- Extending the test duration selectively: Opposite of early stopping, this happens when you let the test run longer than its planned duration because it's close to significance (e.g., when p=0.07 and you think it could be lower if you "let it run a bit longer").
- Testing many variants without correcting alpha: Each additional variant compared against the control adds another chance of a false positive. The Bonferroni correction counteracts this by dividing the significance level by the number of comparisons; the slightly more powerful Sidak correction is an alternative.
- Restarting a test without any solid hypothesis: Don't restart a test because you don't like where it's headed. Only restart for QA issues or config errors.
- Stopping early for small wins: The research suggests that early stopping is more common for modest uplifts. Fearing they'll lose the winner, testers buckle under pressure and stop the test early. Because early lifts shrink as data accumulates, this leads to the winner's curse: effects observed at the moment of early stopping tend to be overestimated.
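To see why peeking inflates false positives, here's a minimal A/A simulation sketch (all function names and parameters are illustrative, not any vendor's implementation). Both "variants" share the same 10% conversion rate, yet checking a standard two-proportion z-test at every interim look declares a "winner" far more often than the nominal 5% alpha:

```python
import random
from statistics import NormalDist

def two_prop_p(c_a, n_a, c_b, n_b):
    """Two-tailed p-value for a two-proportion z-test (normal approximation)."""
    p_pool = (c_a + c_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (c_a / n_a - c_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate(looks=10, n_per_look=200, alpha=0.05, runs=1000, seed=1):
    """Fraction of A/A tests (no real effect) declared 'significant'
    when the p-value is checked at every interim look (peeking)
    vs. only once at the planned stopping point."""
    rng = random.Random(seed)
    peek_fp = final_fp = 0
    for _ in range(runs):
        c_a = c_b = n = 0
        stopped = False
        for _ in range(looks):
            # Both arms draw from the SAME 10% true conversion rate
            c_a += sum(rng.random() < 0.10 for _ in range(n_per_look))
            c_b += sum(rng.random() < 0.10 for _ in range(n_per_look))
            n += n_per_look
            if not stopped and two_prop_p(c_a, n, c_b, n) < alpha:
                peek_fp += 1  # would have stopped here and shipped a false "winner"
                stopped = True
        if two_prop_p(c_a, n, c_b, n) < alpha:
            final_fp += 1  # disciplined analysis: one look at the planned end
    return peek_fp / runs, final_fp / runs
```

With ten interim looks, the chance of at least one spurious "significant" reading is typically several times the nominal 5%, while the single planned look stays close to 5%.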
After the Experiment
- Stratification: Analyzing sub-strata of your test audience after the fact, looking for wins. Any patterns you find this way should be treated as hypotheses for your next tests, not confirmed findings. As Ron Kohavi puts it, "Post-analysis patterns are always interesting."
- Cherry-picking metrics: You started with multiple metrics to get a holistic view of your findings, but after the test, you swapped the primary metric for one with a strong positive trend. This is especially risky when switching between conversion rate (a binomial metric) and revenue per visitor (a non-binomial metric), because the underlying statistical test itself changes with the metric type.
- Re-running the same test: Repeating the same test simply to obtain a better p-value introduces bias.
Unintentional P-Hacking Behaviors
- Filtering users after test launch: You may think that including only people who "saw" the changed element is the same as triggering. It isn't. The design may have influenced how far audiences scroll. And post-deployment filtering breaks randomization.
- Adjusting conversion windows: In ad-to-landing-page A/B tests, media buyers sometimes extend the conversion window to capture more purchases and achieve statistical significance. Changing measurement rules mid-experiment distorts results.
- Switching from two-tailed to one-tailed tests: After seeing that one direction is significant, some testers retroactively change the test type to one-tailed to halve the p-value.
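The last point is easy to demonstrate. In this short sketch (the function name is my own, for illustration), a z statistic that misses significance under the pre-registered two-tailed test suddenly "clears" the 0.05 bar when retroactively read one-tailed:

```python
from statistics import NormalDist

def p_values(z):
    """Two-tailed and one-tailed (right-sided) p-values for a z statistic."""
    sf = 1 - NormalDist().cdf(abs(z))  # upper-tail probability
    return 2 * sf, sf                  # (two-tailed, one-tailed)

two, one = p_values(1.8)
# z = 1.8 is not significant two-tailed (p ≈ 0.072), but "becomes"
# significant if retroactively read one-tailed (p ≈ 0.036).
```

Nothing about the data changed; only the rules were rewritten after the result was known, which is exactly why the test type must be fixed before launch.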
How to Prevent P-Hacking (with Convert)
To prevent p-hacking in A/B testing, you need two things: discipline in experiment design and tools that enforce statistical guardrails.
Convert Experiences supports that process. With Convert, you get features that help you maintain statistical rigor throughout the testing workflow and reduce the risk of p-hacking.
1. Define a clear north star with primary goal configuration
With a clearly labeled primary goal, the temptation to switch success criteria after seeing the results drops. Secondary goals will be tracked as guardrails, not success metrics. You only have to pay attention to them when they change significantly.
2. Plan and analyze with pre- and post-test calculators
Plan tests with A/B testing calculators and determine the expected test duration before you begin. Knowing the required sample size and statistical power up front protects your tests and tells you exactly when to stop.
Learn More: Stats 101 - A Visual Guide
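A pre-test calculation like the one these calculators perform can be sketched with the standard normal-approximation formula for a two-proportion test (the function and defaults here are illustrative, not Convert's exact implementation):

```python
from statistics import NormalDist

def sample_size_per_arm(p_base, mde_rel, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-proportion test
    (two-tailed, normal approximation)."""
    nd = NormalDist()
    p1 = p_base
    p2 = p_base * (1 + mde_rel)       # expected rate with the relative lift
    p_bar = (p1 + p2) / 2
    z_a = nd.inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    z_b = nd.inv_cdf(power)           # e.g. 0.84 for 80% power
    num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(num / (p2 - p1) ** 2) + 1

# Example: 5% baseline conversion rate, 10% relative MDE
# (5% -> 5.5%) needs tens of thousands of visitors per arm.
n = sample_size_per_arm(0.05, 0.10)
```

Dividing the per-arm requirement by your daily traffic gives the planned duration; commit to it before launch.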
3. Monitor p-value fluctuation
Convert's Report dashboard visualizes how the p-value fluctuates over the course of a test, which discourages knee-jerk reactions to interim swings. Watching a p-value bounce around makes it clear that a single early reading isn't a reliable gauge of a true difference between your variants.
4. Correct for multiple variants with Bonferroni or Sidak
Convert applies multiple comparison corrections out of the box to adjust significance thresholds when several variants are tested simultaneously. You can choose between Bonferroni, Sidak, or no correction.
Sidak is the recommended option for mission-critical experiments because it controls the family-wise error rate without reducing statistical power as aggressively as Bonferroni.
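The two corrections are simple to compute. A quick sketch (names are illustrative) of the per-comparison thresholds each would apply:

```python
def corrected_alpha(alpha, m):
    """Per-comparison significance thresholds for m variant-vs-control
    comparisons, controlling the family-wise error rate."""
    bonferroni = alpha / m                 # simple, slightly conservative
    sidak = 1 - (1 - alpha) ** (1 / m)     # exact under independence
    return bonferroni, sidak

bonf, sidak = corrected_alpha(0.05, 3)     # three variants vs. control
```

For three comparisons at alpha = 0.05, both thresholds land near 0.017, with Sidak's just slightly looser, which is where its small power advantage comes from.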
5. Adopt sequential testing for continuous monitoring
Sequential testing allows you to check test results while the experiment is running without inflating the Type I error rate.
Convert’s implementation uses confidence sequences (based on Waudby-Smith et al., 2023), which provide always-valid inference at any point during data collection. The tuning parameter (in Stats Settings) controls how tight the confidence sequences are. A higher value demands more data before declaring significance, which is useful for critical decisions. The default is set at 5,000 visitors.
6. Use multi-armed bandits when speed matters
MAB dynamically allocates traffic to stronger variants while the test runs, using one of three strategies: Thompson Sampling (the default), Epsilon-Greedy, or UCB.
This is useful when finding the best-performing variant fast is the priority.
Note that MAB requires Sequential testing to be enabled in Frequentist mode, and it optimizes based on a single primary metric (conversion rate, RPV, or APPV) that you lock in before starting. MAB is designed for speed of optimization, not for generating classical statistical proof of a causal effect.
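To build intuition for how Thompson Sampling shifts traffic, here's a minimal Bernoulli-bandit sketch with Beta posteriors (a simplified illustration, not Convert's implementation):

```python
import random

def thompson_pick(arms, rng):
    """Sample a conversion-rate belief from each arm's Beta posterior
    and route the next visitor to the arm with the highest draw."""
    draws = {name: rng.betavariate(a, b) for name, (a, b) in arms.items()}
    return max(draws, key=draws.get)

def run(true_rates, visitors=5000, seed=7):
    rng = random.Random(seed)
    # Beta(1, 1) = uniform prior; stored as (successes + 1, failures + 1)
    arms = {name: (1, 1) for name in true_rates}
    traffic = {name: 0 for name in true_rates}
    for _ in range(visitors):
        arm = thompson_pick(arms, rng)
        traffic[arm] += 1
        a, b = arms[arm]
        converted = rng.random() < true_rates[arm]
        arms[arm] = (a + converted, b + (not converted))
    return traffic
```

With clearly separated conversion rates, most of the traffic flows to the better arm within a few hundred visitors, which is the speed advantage, but note that the uneven allocation is also why MAB results aren't classical fixed-sample evidence.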
7. Design A/B tests with dynamic triggers
Post-experiment filtering defeats the purpose of randomization. Dynamic triggers solve this at the source: visitors are bucketed into the experiment only when they actually meet the targeting conditions, so there's no need to filter anyone out after the fact.
8. Enable SRM checks to catch broken experiments
Even a perfectly designed test can produce misleading results if the traffic split is broken. A Sample Ratio Mismatch (SRM) occurs when the observed visitor distribution across variants deviates significantly from the expected allocation.
This can happen due to tracking issues, redirect problems, or bot filtering. Convert includes a built-in SRM check (using a Chi-square goodness-of-fit test at 99% confidence) that you can enable in your Project Configuration.
If an SRM is detected, treat the results as unreliable regardless of the p-value. Fix the underlying issue and rerun the test.
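A basic SRM check can be sketched as a chi-square goodness-of-fit test with one degree of freedom (illustrative code, not Convert's implementation):

```python
from statistics import NormalDist

def srm_pvalue(n_a, n_b, expected_split=0.5):
    """Chi-square goodness-of-fit p-value (df = 1) for the observed
    visitor split vs. the planned allocation."""
    total = n_a + n_b
    exp_a = total * expected_split
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # For df = 1: P(X > chi2) = P(|Z| > sqrt(chi2)), Z ~ N(0, 1)
    return 2 * (1 - NormalDist().cdf(chi2 ** 0.5))

# 5,000 vs. 5,300 visitors under a planned 50/50 split gives p < 0.01:
# a mismatch that large is unlikely to be chance, so investigate
# before trusting any metric from the test.
```

At a 99% confidence threshold, any p-value below 0.01 from this check flags the experiment as broken, regardless of how good the conversion numbers look.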
9. Stick to your planned test type
Unless absolutely necessary, do not switch from a two-tailed test to a one-tailed test to halve the p-value after your experiment has already started running. Choose your test type before launching your experience and stick to it.
Convert defaults to two-tailed testing, the most conservative option, but you can also choose one-tailed frequentist tests or, better still, sequential testing with always-valid p-values and confidence intervals. A good habit: check the change logs for any updates to the experiment configuration post-launch.
Conclusion
P-hacking often results from succumbing to pressure: pressure to demonstrate wins, ship improvements quickly, and maintain a high testing velocity.
With rigid stopping rules, predefined metrics, statistical safeguards, and a platform like Convert that helps you reinforce those safeguards, you can protect your A/B tests' statistical integrity and generate insights you can trust.
Frequently Asked Questions
1. How can you detect p-hacking in A/B testing results?
P-hacking warning signs include repeatedly peeking at results, only reporting the smallest p-value, switching metrics, slicing many segments, or rerunning tests until you get a win. To prevent this, lock the primary metric, define stopping rules, and log all your analyses.
2. What is sequential testing, and how does it prevent p-hacking?
Sequential testing allows you to evaluate tests as the data come in using predefined statistical rules that control the false positive rate. Because of its design, you can check results early without biasing significance.
3. Why does repeated significance testing increase false positives?
Each additional look at the data increases the chance of random variation crossing the significance threshold. If you're always eyeing the p-values and stop the test when one dips below 0.05, the statistical assumptions behind the test break. This inflates the Type I error rate, i.e., a higher chance of reporting a false win.
4. What is false discovery rate (FDR) in A/B testing?
This is the proportion of false positives (or false wins) across your entire testing program. This isn't alpha or significance level, which sets the probability of a Type I error (false win) for a single A/B test. Rather, FDR measures the cumulative damage of such errors when you run multiple tests, like when p-hacking behaviors contaminate your program over time.
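A back-of-the-envelope sketch shows how FDR is computed and why it can dwarf alpha. The assumptions below (80% power, 20% of tested hypotheses having a real effect) are purely illustrative, chosen so the numbers land near the figures quoted earlier in this article:

```python
def fdr(alpha, power, true_effect_rate):
    """Expected false discovery rate across a testing program.

    Of all tests run, `true_effect_rate` have a real effect; the rest
    are null. False wins = alpha * nulls; true wins = power * real effects.
    """
    false_wins = alpha * (1 - true_effect_rate)
    true_wins = power * true_effect_rate
    return false_wins / (false_wins + true_wins)

baseline = fdr(alpha=0.10, power=0.80, true_effect_rate=0.20)  # ≈ 0.33
# P-hacking effectively raises alpha; e.g. an effective alpha of ~0.145
# pushes the program-level FDR to ≈ 0.42 under the same assumptions.
hacked = fdr(alpha=0.145, power=0.80, true_effect_rate=0.20)
```

Note that a single test run honestly at alpha = 0.10 still contributes to a one-in-three false-win rate under these assumptions; p-hacking only makes that baseline worse.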
5. How does p-hacking affect the false discovery rate (FDR) of an experimentation program?
Peeking, testing many metrics, or reporting only favorable results makes your experiments appear more successful than they really are. Over time, these p-hacking behaviors contaminate your experiment program and turn random noise into apparent wins, increasing the false discovery rate.
6. Do Bayesian A/B testing methods reduce the risk of p-hacking?
Bayesian tests do not eliminate p-hacking. But because Bayesian methods handle continuous monitoring differently than traditional fixed-horizon tests, they can reduce it. In Convert’s Bayesian mode, you also get a Risk metric (the probability of picking a losing variant) and an Expected Uplift estimate, which gives you a more complete picture of whether a result is trustworthy.