- Metric: Conversion Rate
- Statistics: Z-Test
- Tails: 2
- Confidence: 95%
- Power: 80%
- SRM Confidence: 99%
A/B Testing Significance Calculator: Supports Revenue Metrics like AOV & ARPV
How do you calculate the sample size for an A/B test?
We run power calculations based on the weekly traffic and conversion values that you provide.
The required sample size is one of the key factors that conversion rate optimizers look at when deciding whether a test is feasible. This comes down to common sense: if, given your preferred statistical significance threshold and power setting, the required sample size is three times the monthly traffic to your site or online property, the experiment would take far too long to run.
In general, the less risk you are willing to accept that your test result is due to random chance, the larger the sample size you need to reach a conclusive winner.
Here are some of the factors on which the calculation of sample size depends:
- The control’s conversion rate
- The MDE you are willing to work with
- The statistical significance setting
- The power or sensitivity of a test
- The type of test - whether it is one-tailed or two-tailed
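To make these factors concrete, here is a minimal sketch of a sample size calculation for a two-proportion z-test. It uses a common textbook approximation and is not necessarily the exact formula this calculator uses. The function name, its parameters, and the example inputs (a 5% baseline and a 10% relative MDE) are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_mde,
                            confidence=0.95, power=0.80, two_tailed=True):
    """Approximate visitors needed per variant (textbook z-test formula)."""
    alpha = 1 - confidence
    p1 = baseline_rate                       # control conversion rate
    p2 = baseline_rate * (1 + relative_mde)  # rate we want to be able to detect
    # A two-tailed test splits the significance level across both tails.
    tail_prob = alpha / 2 if two_tailed else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail_prob)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2)


# Hypothetical inputs: 5% baseline conversion rate, 10% relative MDE.
print(sample_size_per_variant(baseline_rate=0.05, relative_mde=0.10))
```

Raising the confidence level or the power, or lowering the MDE, increases the result. Switching to a one-tailed test decreases it.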
What is a conversion rate?
A conversion rate is the ratio of conversions to visitors. If you have an equal number of visitors and conversions, it is 1, or 100%. In practice, the number of conversions is almost always lower than the number of visitors, so the conversion rate of a page is below 1, or 100%. Let’s say you had 100 visitors to a given page on a given day, and 10 of them converted.
Your conversion rate is then 10/100 = 0.1 = 10%.
What is statistical significance in A/B testing?
A result in A/B testing is statistically significant when the p-value falls below our significance threshold. This means the data we observed would be unlikely if the null hypothesis were true, which gives us evidence (though not proof) that the effect we are observing is not due to random chance alone.
Let us assume you’re running an A/B test on two landing pages selling the same product. The first one (we’ll call it A, or the control) does not have 3D product images. The second one (we’ll call it B, or the variation) has them.
The conversion rate in terms of “Add to Cart” is 7% for A and 9% for B. So should you just add 3D images to your product pages across the site?
No! Not all the visitors to your website have seen page B, and you can’t make assumptions about their preferences simply from observing the behavior of a much smaller sample. Right? (PS: Don’t make assumptions in marketing or optimization… being data-driven is the way to go).
How do you solve this little problem?
The p-value comes to the rescue.
The p-value gives you the probability of seeing a 2 percentage point increase in the “Add to Cart” KPI for your variation (Page B), or a more extreme result, if the null hypothesis (that there is really no change) were true. If it is lower than the “risk” you are willing to accept, the standard value being 5% (the significance level), then you can be reasonably confident in your test result at your chosen level of confidence. In this case, that is 100% - 5% significance level = 95% confidence level.
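As an illustration, the p-value for the 7% versus 9% example can be sketched with a pooled two-proportion z-test. The text does not give visitor counts, so the counts below (1000 and 5000 visitors per page) are hypothetical, and the function name is an assumption for this sketch.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a, visitors_a, conv_b, visitors_b):
    """Two-tailed p-value of a pooled two-proportion z-test."""
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b
    # Under the null hypothesis both pages share one conversion rate.
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same 7% vs 9% rates, two hypothetical sample sizes.
p_small = two_proportion_p_value(70, 1000, 90, 1000)
p_large = two_proportion_p_value(350, 5000, 450, 5000)
print(f"1000 visitors per page: p = {p_small:.4f}")
print(f"5000 visitors per page: p = {p_large:.4f}")
```

Under these assumed counts, the same 2 percentage point difference is not significant at the 5% level with 1000 visitors per page, but is with 5000. This is why the observed rates alone are not enough to make a decision.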
What is the power of an A/B test?
The power of an A/B test is a measure of how likely we are to observe an effect if it is present. Usually, we want it to be at least 80%. The higher the power, the lower the risk of missing a real effect!
It plays an important role in sample size calculations because, along with the MDE (minimum detectable effect) and the confidence level, it allows us to determine the sample size we need to collect before analyzing a test.
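The relationship between power and sample size can be sketched as follows, using a normal approximation for a two-tailed, two-proportion z-test. The function name and the example rates (5% versus 5.5%) are assumptions for illustration, and the approximation ignores the negligible chance of a significant result in the wrong direction.

```python
import math
from statistics import NormalDist


def approximate_power(p_control, p_variant, visitors_per_variant,
                      confidence=0.95):
    """Approximate power of a two-tailed two-proportion z-test."""
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    n = visitors_per_variant
    # Standard error of the difference between the two observed rates.
    se = math.sqrt(p_control * (1 - p_control) / n
                   + p_variant * (1 - p_variant) / n)
    z_effect = abs(p_variant - p_control) / se
    return NormalDist().cdf(z_effect - z_alpha)


# Hypothetical true rates of 5% and 5.5% at growing sample sizes.
for n in (5_000, 15_000, 31_231):
    print(n, round(approximate_power(0.05, 0.055, n), 3))
```

With these inputs, power grows with the number of visitors and reaches roughly 80% only at the largest sample size shown.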
What is the confidence level of an A/B test? How does it differ from significance?
The confidence level and the significance level are two sides of the same coin: the significance level is simply 1 minus the confidence level.
Hence, a confidence level of 95% entails a significance level of 1 - 95% = 5% = 0.05. We can then compare our p-value against the significance level. If our p-value < 0.05, we say that our A/B test has reached “statistical significance” at a confidence level of 95%.
What is an MDE?
MDE stands for Minimum Detectable Effect. It is the smallest effect size that we want the test to be able to detect. Below it, the cost or effort of implementing the new variant wouldn’t be worth it.
The more power you give your test, the more “sensitive” it is to detecting conversion lifts. Detecting a smaller Minimum Detectable Effect (MDE) at a given power, or the same MDE at a higher power, requires a larger sample size to conclude.
What is SRM?
SRM stands for Sample Ratio Mismatch. An SRM check is a statistical test that tells us whether the sample ratios per variant differ too much from their predefined levels.
Hence it can be used as a sanity check to make sure that nothing went wrong during sample collection, such as a failure of the randomization system that causes unequal bucketing.
For instance, suppose we have two variants, a control and a variation, each allocated 50% of the traffic, but at the end of the test one has 5500 visitors and the other only 4900. In that case, the SRM test would return positive! But if the second had 5400 visitors, it would return negative, as 5500 and 5400 are “close enough” for the difference to be due to random chance.
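The example above can be checked with a chi-square goodness-of-fit test, which is a common way to run an SRM check. The source does not say which test this calculator uses internally, so treat this as a sketch under that assumption. The function names are hypothetical, and the closed-form p-value only holds for two variants (one degree of freedom).

```python
import math


def srm_p_value(visitors_a, visitors_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value for a two-variant split."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # With two groups the statistic has 1 degree of freedom, so its
    # survival function has the closed form erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(visitors_a, visitors_b, srm_confidence=0.99):
    """Flag a mismatch when the p-value is below 1 - SRM confidence."""
    return srm_p_value(visitors_a, visitors_b) < 1 - srm_confidence


print(has_srm(5500, 4900))  # mismatch flagged
print(has_srm(5500, 5400))  # within random variation
```

At an SRM confidence of 99%, the 5500 versus 4900 split is flagged, while 5500 versus 5400 is not.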
What questions does this calculator answer?
This calculator answers questions about the three most-used metrics in A/B testing: conversion rate, average order value, and average revenue per visitor. It can handle both pre-test and post-test scenarios.
In pre-test mode, it helps with test planning by calculating the minimum detectable effect for each test duration up to 12 weeks, given the current traffic and conversion values. For instance, given 5000 visitors per week with 250 of those converting, it shows the smallest effect that could be detected in a statistically significant fashion after each number of weeks. You can then choose a compromise depending on how long you are ready to wait and how small a lift you wouldn’t want to miss.
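A rough sketch of such a week-by-week MDE table, for the 5000 visitors and 250 conversions per week mentioned above, could look like this. It assumes the weekly traffic is split evenly across two variants and uses the baseline variance for both arms, which is a simplification. The calculator's exact method is not described here, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def weekly_relative_mde(weekly_visitors, weekly_conversions, weeks=12,
                        variants=2, confidence=0.95, power=0.80):
    """Approximate relative MDE after 1..weeks weeks (two-tailed z-test)."""
    baseline = weekly_conversions / weekly_visitors
    z_total = (NormalDist().inv_cdf(1 - (1 - confidence) / 2)
               + NormalDist().inv_cdf(power))
    mdes = []
    for week in range(1, weeks + 1):
        # Assumption: traffic is shared equally between the variants.
        n_per_variant = weekly_visitors * week / variants
        absolute_mde = z_total * math.sqrt(
            2 * baseline * (1 - baseline) / n_per_variant)
        mdes.append(absolute_mde / baseline)
    return mdes


for week, mde in enumerate(weekly_relative_mde(5000, 250), start=1):
    print(f"week {week:2d}: relative MDE ~ {mde:.1%}")
```

The detectable lift shrinks as the weeks go by, which is the trade-off between waiting longer and being able to catch smaller effects.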
In post-test mode, it performs post hoc analysis of the results. Using power calculations, it can dynamically estimate how far along the test currently is.
Both modes support revenue metrics. For pre-test, you will need a CSV file with one column containing a week’s worth of order values to run the MDE calculations. For post-test, your CSV needs at least two columns of order values covering the duration of your experiment.
What are Type I vs Type II errors?
A Type I error is simply a false positive: the test shows that a variant is a winner when in reality it is not. The cause is nothing other than random chance. What you can do is properly “control” the probability of making such an error, let’s say to 5% (in the case of a 95% confidence level), by planning your test properly and sticking to the plan. In other words, you should not act on the results before the planned time has passed or the planned number of samples has been collected.
To help with this, the calculator dynamically computes the test progress as a percentage, based on the effect size observed at that moment and the desired confidence and power targets.
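To see why sticking to the plan matters, here is a small simulation sketch of A/A tests, where both variants share the same true conversion rate and any “winner” is therefore a false positive. All settings (500 runs, 10 peeks, 200 visitors per variant between peeks, a 10% rate, the seed) are arbitrary assumptions for illustration.

```python
import math
import random
from statistics import NormalDist


def is_significant(conv_a, conv_b, n, z_crit):
    """Pooled two-proportion z-test with n visitors in each variant."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return False
    return abs(conv_b - conv_a) / n / se > z_crit


def simulate_aa_tests(runs=500, peeks=10, visitors_per_peek=200,
                      rate=0.10, confidence=0.95, seed=42):
    """Return (false positive rate with peeking, rate at final look only)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits_with_peeking = 0
    hits_final_only = 0
    for _run in range(runs):
        conv_a = conv_b = n = 0
        flagged_at_any_peek = False
        significant_now = False
        for _peek in range(peeks):
            # Both variants convert at the same true rate (no real effect).
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_peek))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_peek))
            n += visitors_per_peek
            significant_now = is_significant(conv_a, conv_b, n, z_crit)
            flagged_at_any_peek = flagged_at_any_peek or significant_now
        hits_with_peeking += flagged_at_any_peek
        hits_final_only += significant_now  # result at the planned end
    return hits_with_peeking / runs, hits_final_only / runs


peeking_rate, final_rate = simulate_aa_tests()
print(f"false positives when stopping at any peek: {peeking_rate:.1%}")
print(f"false positives at the planned end only:   {final_rate:.1%}")
```

Stopping at the first significant peek typically produces noticeably more false positives than the 5% target, while evaluating only at the planned end stays close to it.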
The second type of error is the Type II error, which is nothing other than a false negative.
In reality, there is a change, but it fails to show up at the time we look at our results, and we walk away thinking that our variant is under-performing! The way to control this error is to wait for the test to accumulate enough power before concluding that no change is present. The recommended minimum power level is 80%, but the more the better! Here again, our calculator has a test-progress estimation that takes the desired power level into consideration.
We have written an exhaustive blog on this subject. Check it out here.