How to Reduce Sample Size Pollution for Accurate A/B Test Results
August 7, 2020
You spend hours strategizing your test.
Your team creates a hypothesis.
You run the test and await the results.
But you find your test failed. The results have been tainted. But how?
Don’t beat yourself up. There is a dirty little secret in the testing world called sample size pollution.
Pollution of your sample audience can doom a test before it even starts, without you ever noticing.
There is a long list of potential reasons tests fail, but one of the most frustrating is sample size pollution.
This article will help you understand:
- Why sample size pollution occurs.
- How to know if your test is polluted.
- Steps you can take to minimize sample size pollution.
Let’s have a look…
Sample Size 101
Definition of Sample Size
Sample size is the number of visitors you include in a test. Most online sample size calculators are simple to use. With Convert’s calculator, you only need to plug in three values:
- Existing Conversion Rate
- Expected Improvement
- Confidence Level
If the existing conversion rate is 3% and the expected improvement is 20% while testing two variations at a confidence level of 95%, you would need a sample size of 42,034 to get confident results. At 2,000 daily visitors to this test group, it would take 22 days according to our duration calculator.
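The arithmetic behind a calculation like this can be sketched in a few lines of Python. This is only a rough sketch using the standard two-proportion formula. It assumes the 20% improvement is relative (3% to 3.6%), the test is two-sided, and statistical power is 80%. The power setting and the function names are assumptions for illustration, not a description of how Convert’s calculator works, so the per-variation number it prints will not necessarily match the 42,034 figure.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, relative_lift,
                              confidence=0.95, power=0.80):
    """Approximate visitors needed per variation (two-sided z-test)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)  # assumes a relative lift
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)  # power is an assumed setting
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def duration_days(total_sample_size, daily_visitors):
    """Whole days needed to collect the sample at a steady traffic level."""
    return math.ceil(total_sample_size / daily_visitors)


print(sample_size_per_variation(0.03, 0.20), "visitors per variation")
print(duration_days(42_034, 2_000), "days for the 42,034 figure above")
```

The duration step is simple division rounded up: 42,034 visitors at 2,000 per day comes to 21.02 days, which is why the test needs 22 full days.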
Determine Who Will Be In Your Sample
The easiest way to answer the question of who belongs in your sample, that is, which segment to test, is to review the demographics and sources of your current website visitors. Tap into the existing data for clues. Who are they? Where are they coming from?
Tools like Convert Experiments actually allow you to test using a specific segment of your website visitors and create custom audiences.
Several factors can help you uncover the ‘who’:
- Type of Traffic
Do you get seasonal traffic? Do you expect an influx of visitors based on approaching holidays? Do your traffic numbers fluctuate depending on the day of the week?
- Traffic Source
Where does your traffic come from? People behave differently depending on the source that brings them to your site. For example, a visitor from LinkedIn may not interact with your site the same way as someone coming from Facebook.
Examine Google Analytics to get an overview of visitor engagement based on Source.
- New vs. Returning
Statistics show that returning visitors remain on your site longer than new visitors. Think about how this will affect your test.
The goal of this consideration stage is to help you build representative samples.
A representative sample is one that has strong external validity in relationship to the target population the sample is meant to represent. As such, the findings from the survey can be generalized with confidence to the population of interest.
To make sure you have a representative sample, Convert suggests running a test for at least one business cycle. This ensures your test has time to account for visitor variance that may happen within a cycle.
What is Sample Size Pollution?
Now that you understand what sample size is, you can explore the factors that can corrupt your sample and derail your test. Factors that negatively affect the sample, and with it the validity of your results, are known as sample size pollution.
Invespcro defines sample pollution as:
“…factors that invalidate your A/B test data by influencing the samples or data used while conducting your test.”
This problem is more common than you might think.
In most cases, you want random sampling, which means each visitor to your website has the same chance of seeing a particular variation before they are bucketed. Once placed in a bucket, the user will see the same variant for the duration of the test.
However, if you use an A/B testing tool that doesn’t perform randomization well, random assignment is not guaranteed, and the resulting biased sample can invalidate the test.
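To make the idea of random but sticky bucketing concrete, here is a minimal sketch of one common approach: hashing a visitor ID together with an experiment ID. It is an illustration only. The function name, the ID formats, and the choice of SHA-256 are assumptions, and nothing here describes how Convert or any other tool assigns visitors.

```python
import hashlib


def assign_variant(visitor_id, experiment_id,
                   variants=("control", "variation")):
    """Map a visitor to a variant; the same inputs always give the same bucket."""
    key = f"{experiment_id}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the hash as a large, evenly spread integer.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


# Hypothetical visitor IDs: the split should come out roughly even.
counts = {"control": 0, "variation": 0}
for i in range(10_000):
    counts[assign_variant(f"visitor-{i}", "homepage-test")] += 1
print(counts)
```

Because the assignment depends only on the two IDs, a returning visitor lands in the same bucket every time, as long as the visitor ID itself (usually stored in a cookie) survives.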
A simple way to combat biased sampling is to use a good A/B testing tool like Convert that performs randomization and bucketing correctly. Start your testing with an A/A test to check that the randomization works properly.
You want to be aware of the potential for sample bias when you are considering the details of your test.
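One way to read the results of an A/A test is to check two things: that traffic was split evenly, and that the two identical variants convert at about the same rate. The sketch below uses a chi-square test for the split and a two-proportion z-test for the conversion rates. The function names and the visitor counts are hypothetical, and the 50/50 split is an assumption about how the test was configured.

```python
import math


def split_p_value(visitors_a, visitors_b):
    """Chi-square test (1 degree of freedom) that traffic was split 50/50."""
    expected = (visitors_a + visitors_b) / 2
    chi2 = ((visitors_a - expected) ** 2
            + (visitors_b - expected) ** 2) / expected
    return math.erfc(math.sqrt(chi2 / 2))


def conversion_p_value(conv_a, visitors_a, conv_b, visitors_b):
    """Two-sided two-proportion z-test on the conversion rates."""
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (conv_a / visitors_a - conv_b / visitors_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical A/A results: both p-values should be comfortably above 0.05.
print(split_p_value(10_050, 9_950))
print(conversion_p_value(301, 10_050, 299, 9_950))
```

A very small p-value on either check in an A/A test suggests the bucketing or tracking deserves a closer look before you trust an A/B test run on the same setup.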
Sources That Cause Sample Size Pollution
The four common sources of sample pollution are timing, device, browser, and cookie.
Let’s look at each of them…
Timing
The length of your test influences the validity of your results. So it is no surprise “how long should I run my A/B test” is a common question.
CRO professionals have conflicting ideas on what’s an acceptable benchmark. Actually, your test variables should drive the proper length of your test.
A straightforward solution may appear to be just allowing your test to run and run and run. But this too can cause issues. Added time means an increase in potential pollution from external factors.
You want to find the sweet spot.
Another common mistake regarding the length of testing is stopping a test too early. This may not lead to sample size pollution, but it can negatively affect your test.
The same is true if you stop the test as soon as you reach statistical significance. For the test to be valid, it must also reach the sample size you calculated for your desired MDE (Minimum Detectable Effect).
Along similar lines, never stop a single variant of a running test. This will cause catastrophic pollution. You would be unable to compare the stopped variant against a control that ran the whole time, so you would have no way to compare “apples to apples.” Never stop and later restart a variant in a test.
Don’t interrupt your test until it has reached the required sample size and the data is consistent.
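The stopping guidance above can be written as a simple check. This is a sketch under stated assumptions: the function name is hypothetical, the business cycle is assumed to be seven days, and the 42,034 total from the earlier example is assumed to be split evenly across two variations. Statistical significance is deliberately not an input, because reaching it early is not a reason to stop.

```python
def ready_to_stop(visitors_per_variation, required_per_variation,
                  days_running, business_cycle_days=7):
    """True only when every variation has reached the planned sample size
    and the test has covered at least one full business cycle."""
    enough_sample = all(v >= required_per_variation
                        for v in visitors_per_variation)
    full_cycle = days_running >= business_cycle_days
    return enough_sample and full_cycle


required = 21_017  # 42,034 total, assumed to be split evenly in two
print(ready_to_stop([21_500, 21_300], required, days_running=22))  # True
print(ready_to_stop([9_000, 9_100], required, days_running=9))     # False
```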
Cookies
Cookies may cause the most insidious form of sample size pollution.
A cookie is a text file that a Web browser stores on a user’s machine. Cookies are a way for Web applications to maintain application state. They are used by websites for authentication, storing website information/preferences, other browsing information and anything else that can help the Web browser while accessing Web servers. Cookies are known by several names, including browser cookies, Web cookies, and HTTP cookies.
For marketers, cookies make it possible to track visitors’ behavior on your site. A/B testing tools also rely on cookies to remember which variant a visitor was assigned.
The lifespan of cookies is volatile. Visitors can delete them at the slightest whim.
The longer your test runs, the more vulnerable you are to cookies being deleted, which leads to another form of sample size pollution. To mitigate this phenomenon, Convert advises customers to run tests for no more than 90 days.
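A toy model shows why longer tests are more exposed to cookie deletion. The sketch below assumes, purely for illustration, that every visitor independently clears their cookies with the same small probability each day. The 1% daily rate is a made-up number, not a measured one, and the function name is hypothetical.

```python
def share_with_cookie_intact(days, daily_deletion_rate):
    """Fraction of visitors still carrying their original cookie after
    `days`, assuming the same independent deletion chance every day."""
    return (1 - daily_deletion_rate) ** days


# Hypothetical 1% daily deletion rate, compared across test lengths.
for days in (14, 30, 90):
    print(days, "days:", round(share_with_cookie_intact(days, 0.01), 3))
```

Whatever the real rate is on your site, the share of intact cookies can only shrink as the test gets longer, which is the reasoning behind capping test duration.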
Devices
Visitors come to your site from multiple devices: mobile, laptops, tablets, desktops, and even smartwatches.
Just think of your browsing behavior. You may spot something on your mobile device while at the gym. Later in the day, you may revisit the website on your desktop computer.
If this happens in the confines of your A/B test, it may appear that two different people visited your site when in fact it is the same person browsing from two different devices.
Even more dangerous to your testing efforts, this same person may see a different variant on each device.
There is an inverse example of this. What happens when two people use the same device to visit your website?
Imagine two brothers live in the same house. They share a desktop computer. Both are preparing for vacation and need to order new t-shirts and footwear. If an A/B test is running on the e-commerce site at the time of their visit, the data would show these two people as a single user, again, corrupting your sample size.
Browsers
When the average person goes online, they do not consider the effect that using different browsers to visit the same website will have on an A/B test. But visiting the same website from more than one browser, such as Safari and then Chrome, can lead to the same kind of sample size pollution that occurs with multiple devices.
However, this specific form of pollution is rare, as most people will stick to using one preferred browser per device.
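If your site has a login, you can estimate how much cross-device and cross-browser pollution a test has picked up by looking for users who were shown more than one variant. The sketch below assumes a hypothetical exposure log of (user ID, device or browser, variant) records. The log format, the IDs, and the function name are all made up for illustration. It cannot detect the reverse case of two people sharing one device.

```python
from collections import defaultdict


def visitors_with_mixed_exposure(exposures):
    """Return the user IDs that were shown more than one variant.

    `exposures` is an iterable of (user_id, device_or_browser, variant).
    """
    variants_seen = defaultdict(set)
    for user_id, _client, variant in exposures:
        variants_seen[user_id].add(variant)
    return sorted(u for u, seen in variants_seen.items() if len(seen) > 1)


# Hypothetical log: u1 saw both variants, u2 saw only the control.
log = [
    ("u1", "mobile-safari", "control"),
    ("u1", "desktop-chrome", "variation"),
    ("u2", "desktop-chrome", "control"),
    ("u2", "desktop-safari", "control"),
]
print(visitors_with_mixed_exposure(log))  # ['u1']
```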
Browsers, device type, cookies, and length of tests are the most common sample size pollutants, but it looks like a new pollutant is entering the conversation. Industry professionals are complaining about bots creating sample size pollution.
Thankfully at Convert, we have strong bot mitigation measures embedded within our tool so that will not be an issue.
Tips on How to Limit Sample Size Pollution
Because sample size pollution is a major issue, many companies have come up with creative fixes, like putting users into different buckets based on location.
But such tactics can strip tests of “user randomness,” and can reduce your confidence that the test results are valid.
Below are a few things you can do to reduce the chances of sample pollution:
- Run separate tests for each device type.
- Run separate tests for each browser.
- Identify patterns. How has your data looked in the past? It should look similar during testing. That is data consistency.
Here are a few more things to consider…
Variance and standard deviation go hand-in-hand with consistency. Essentially, they will tell you how far away from the average your numbers are. Low variance means your data is consistent with the average, which puts you at a lower risk of pollution.
You can do the math by hand or use a simple standard deviation calculator.
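If you would rather script it, Python’s built-in statistics module does the same math. The sketch below is only an illustration: the daily conversion rates are made-up numbers, and the function name is hypothetical. It treats the list as the full set of days you care about, so it uses the population versions of variance and standard deviation.

```python
from statistics import mean, pstdev, pvariance


def spread(values):
    """Return (mean, variance, standard deviation) for a list of numbers."""
    return mean(values), pvariance(values), pstdev(values)


# Hypothetical daily conversion rates for one week of a test.
daily_rates = [0.031, 0.029, 0.030, 0.033, 0.028, 0.030, 0.029]
average, variance, std_dev = spread(daily_rates)
print(f"average: {average:.4f}")
print(f"variance: {variance:.8f}")
print(f"standard deviation: {std_dev:.4f}")
```

A standard deviation that is small relative to the average means the days look alike. A sudden jump in it during a test is a cue to look for one of the pollution sources described above.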
Be Aware of Potential Sampling Issues
There are inherent problems with A/B testing, including the possibility of sample size pollution.
Knowledge of potential sample size issues empowers you to make better choices as you design, create treatments, and run experiments.
Now You Can Beat Sample Pollution
Good testing practices mean you start your projects with a full understanding of what can go wrong.
Sample size pollution is a negative by-product that’s experienced when you run A/B tests. Your job is to reduce these negative effects as much as you can so you can have a successful test.
Remember, mitigation happens before your test begins.
Your experimentation strategy and the power of your software will make the difference in how well you minimize sample size pollution.
Now that you know about this potential blind spot in your testing, it can’t creep up on you.
Share with us: do you have any tips or strategies to reduce sample size pollution?