The 2026 Glossary of A/B Testing Terms Every Modern Experimenter Should Know!

From HARKing to the intricacies of Statistical Significance, each term is explained by an expert with decades of experience.


A

A/A Testing

A/A tests aren’t a one-time task—they should be a regular part of your experimentation process. They help ensure your setup works as intended by verifying that traffic is correctly split (e.g., 50/50), all intended visitors are included, and key performance indicators (KPIs) are tracked properly.

If an A/A test shows a big difference between the two identical versions, it could mean there’s a problem—like tracking errors, incorrect visitor splits, or other setup issues. However, if the test results look normal (inconclusive), that doesn’t automatically mean everything is perfect. And if you do see a difference, don’t panic! Random chance can sometimes create false positives, depending on the significance level. Instead of focusing on a single result, think of A/A testing as a routine health check for your platform.

By incorporating A/A tests into your process, you can catch issues early and ensure your A/B test results lead to the right decisions.
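To see why the occasional "significant" A/A result is no cause for panic, here is a minimal simulation sketch (illustrative numbers, Python standard library only): with two identical variants and a 95% confidence threshold, roughly 5% of A/A tests will still flag a difference purely by chance.

```python
import random
from statistics import NormalDist

random.seed(42)

def p_value(conv_a, conv_b, n):
    """Two-sided two-proportion z-test p-value (equal sample sizes)."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    if se == 0:
        return 1.0
    z = (conv_a - conv_b) / (n * se)
    return 2 * (1 - NormalDist().cdf(abs(z)))

runs, n, rate = 500, 2000, 0.05          # both "variants" convert at 5%
false_positives = 0
for _ in range(runs):
    a = sum(random.random() < rate for _ in range(n))
    b = sum(random.random() < rate for _ in range(n))
    if p_value(a, b, n) < 0.05:          # "significant" at the 95% level
        false_positives += 1

print(f"A/A tests flagged as significant: {false_positives / runs:.1%}")
```

A single flagged A/A run proves nothing; a rate well above 5% across many runs is the real health-check signal.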

Contributor

Global Head of Digital Experience Optimization at MediaMarktSaturn

A/B Test

Your "A" version is your control and has no changes. You test this against your "B" version, which has an element/page/component different from the control. You can run them side-by-side to see what impact the change has.

An A/B test can quantify the impact of the change as long as proper statistics are employed. If the correct principles are followed, confidence in decision-making is unparalleled, and A/B testing is seen as the gold standard among test methods. Some people may opt out of this 'gold standard' due to difficulty setting up an A/B test (it can be costly and/or time-consuming) or lack of knowledge to set up and analyze an A/B test.
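As a concrete sketch of what "proper statistics" can look like, here is one common way to quantify a variant's impact, a two-sided two-proportion z-test (the visitor and conversion counts are made-up; Python standard library only):

```python
from statistics import NormalDist

visitors_a, conversions_a = 10_000, 500    # control: 5.0% conversion rate
visitors_b, conversions_b = 10_000, 570    # variant: 5.7% conversion rate

rate_a = conversions_a / visitors_a
rate_b = conversions_b / visitors_b
# Pool the rates under the null hypothesis of "no difference".
pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = (pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)) ** 0.5
z = (rate_b - rate_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"Lift: {(rate_b - rate_a) / rate_a:.1%}, p-value: {p_value:.4f}")
```

With these numbers the lift is 14% and the p-value falls below the conventional 0.05 threshold, which is what "confidence in decision-making" rests on.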

B

Bayesian

Bayesian inference is a probabilistic method used to make more informed decisions. Let's take an example to understand how companies use this method.

Imagine a credit card company wants to determine the likelihood that a transaction is fraudulent, given that it triggered a fraud alert.

To calculate the probability that the transaction is actually fraudulent, we need the following:

• The general probability of a transaction being fraudulent based on historical data (prior probability).
• The probability that a fraudulent transaction triggers the alert (true positive rate or sensitivity).
• The probability that a legitimate transaction also triggers the alert (false positive rate).

By applying Bayes' Theorem with these inputs, the system can estimate how likely the transaction is to be fraudulent. The system may block the transaction or flag it for review if the probability is high. If it's low, it may let the transaction proceed, improving customer experience while maintaining security.
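Here is the fraud-alert example worked through in code. The 1% prior, 90% sensitivity, and 5% false alarm rate are illustrative assumptions, not figures from the entry:

```python
prior_fraud = 0.01        # P(fraud): historical base rate
sensitivity = 0.90        # P(alert | fraud): true positive rate
false_alarm = 0.05        # P(alert | legitimate): false positive rate

# Bayes' Theorem: P(fraud | alert) = P(alert | fraud) * P(fraud) / P(alert)
p_alert = sensitivity * prior_fraud + false_alarm * (1 - prior_fraud)
p_fraud_given_alert = sensitivity * prior_fraud / p_alert

print(f"P(fraud | alert) = {p_fraud_given_alert:.1%}")  # roughly 15.4%
```

Note the counterintuitive result: even with a 90% true positive rate, most alerts are legitimate transactions, because fraud is rare to begin with. That is exactly why the prior matters.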

Contributor

Data Scientist

C

Client-Side Testing

Client-side testing is generally faster, easier to deploy than server-side testing, and a good fit for companies starting or developing their A/B testing culture. Server-side testing requires a stronger collaboration with product and engineering teams but offers more possibilities to test complex logic such as pricing, algorithms, journeys, etc., and a seamless user experience.

If you've managed to handle all testing projects with front-end changes until now, you might not need server-side tools yet. Server-side testing requires more time, process, and technical development. Organizations should assess their technical capabilities, experimentation culture, and whether the added complexity required for server-side testing aligns with their strategic goals and coming projects. Once ready, server-side testing will help mature teams reach the next level.

Contributor

CRO Lead France at Carrefour

Code Editors

To get the most out of your CRO program, you'll want to use a code editor.

The WYSIWYG-style visual editor is a good way to get started with relatively simple changes like headline tests. For experiments beyond those basic surface-level adjustments, you need a code editor.

Most experienced teams and reputable CRO agencies will use code editors even for basic tests. They give you control of the variations in a way that the visual editor doesn't. Developers can directly manipulate JavaScript, HTML, and CSS elements with code editors, allowing them to run more sophisticated experiments.

A major benefit is code quality and debugging. Using code editors reduces the risk of technical issues, which can degrade user experience and negatively impact test results. Developers can implement proper code reviews and approvals. Convert's code editor also helps you catch errors early and test variations properly before going live.

Confidence Level

The higher the confidence level, the more confident you can be that your results are trustworthy. In general, experimenters should aim to achieve a confidence level of 95% or higher.

However, note that the highest confidence level you can reach is 99.99%+; you can never be absolutely 100% sure your data is entirely accurate.

A 95% confidence level indicates the observed difference between versions is unlikely to be just statistical noise or random chance.

However, it's important to realize this metric does NOT mean there's a 95% chance of making the right decision based on the test results. It only tells you that you can be 95% sure the results reported are reliable.

Contributor

MSc, Founder of GuessTheTest

Confounding Variables

In a world where 'move fast and break things' is a leitmotif for so many online companies, understanding the true effect of initiatives can cause headaches.

For example, we run a test to understand if charging customers in their local currency boosts purchases. The test shows a positive (and significant) impact. So we decide to put it live. Unfortunately, the marketing team launched a discount campaign on the same day.

Because of the confounding variable (discount campaign), we can't determine the true effect of the currency test.

Contributor

Founder / Data Analyst, Sterling

Control

Why do we use control or holdout groups in an experiment?

Without a CONTROL GROUP, you don’t know if the change you made in the VARIANT GROUP made the difference—or if it was simply by chance.

A control is your baseline for the same time period in which you run your experiment. It shows what happens without the change being present.

Contributor

Associate Director of Strategy at Speero

Conversion Rate

When doing CRO initiatives or any form of experiment, it's easy to look at conversion rate as a default metric—it's right there in the name, after all. But remember that CRO is about finding ways to alleviate a pain point, bridge a gap, or solve a problem.

Sometimes, conversion rate can't capture and reflect what we're trying to measure. For instance, a test variant saw an increase in conversion rate (say a 10% lift, and that's a win), but checking your test design, you're reminded that your problem is a decline in revenue. So, instead of defaulting to conversion rate, consider average transaction per customer or average ticket size as your test KPI. It could be more reflective of your performance.

Contributor

Head of Client Success Management, Spiralytics, Inc.

Covariates

Covariate selection is important to approach deliberately, emphasizing sound practical and analytical justification. By proactively addressing outcome variability and relevant externalities upfront, you can strengthen the credibility of any causal relationships suggested by the results.

In addition to theory-guided covariates, methods such as LASSO regression can be leveraged to trim down the feature space analytically. Running null simulations and bootstrapped results also lend credence to the overall integrity of the model and confidence in the results.

Contributor

Deployment Strategist at Peregrine

D

Deductive Reasoning

Data alone, or even isolated insights, won't reveal what truly works. How do you drive meaningful change? You start with a hypothesis and aim to validate it through A/B testing.

In practice, some parts of your hypothesis work, and others don't. To truly understand the reasoning behind these outcomes and to develop genuinely data-driven solutions, you must connect your initial hypothesis with your final results.

This involves analyzing different dimensions of your data, constructing a hierarchy of metrics, and applying logical thinking to your business processes and user journey. This is where deductive reasoning comes into play.

Deductive reasoning is essential because, without it, you can only derive meaningful insights if all your experiments yield positive results across all metrics. And let's be honest, that's not very common. By employing deductive reasoning, you can systematically understand and interpret your A/B testing results, leading to more informed decisions and impactful changes.

Contributor

Staff Data Scientist at Meta

Dynamic Triggers

Dynamic triggers improve the fidelity of your experiment result data. They exclude users from the tested sample who would not have experienced the variation change, thereby increasing the signal-to-noise ratio of your data.

For example, suppose you are testing a change to a flyout menu. Not every user will open that menu. The users who don't would not experience the changes your experiment is evaluating, and thus, their behavior wouldn't be affected. You don't want to include users who don't experience the variation change in your experiment result data. This is something you already do with basic test targeting conditions. Dynamic triggers enable you to refine the conditions for variation exposure beyond a simple page/screen load. Using them will fine-tune your sample size and provide more statistically sound results.

Contributor

Founder at Corvus CRO

F

Flicker Effect

The flicker effect, commonly associated with client-side testing, occurs when the original page briefly appears before the variant design loads. Although it's not ideal for user experience, a well-optimized experiment setup can significantly reduce the flicker effect. From my perspective, the flexibility and speed of client-side testing often justify this tradeoff, especially when it empowers both technical and non-technical teams to execute rapid experimentation.

However, it's important to monitor the flicker effect's influence on user behavior by tracking metrics like bounce and exit rates. Severe flicker or load speed issues could detract from the user experience, ultimately hindering the variant's performance and skewing test results.

Contributor

Director of Conversion Rate Optimization at ClassPass

Frequentist

When using frequentist methods, the first step is to determine the required sample size. To calculate it, we need to define the statistical power, significance level, and Minimum Detectable Effect (MDE).

The frequentist approach is more rigid than the Bayesian approach. You must wait until the predetermined sample size is reached before calculating the p-value. Based on the p-value, the null hypothesis is either rejected or not. Usually, we reject the null hypothesis if the p-value is equal to or less than 0.05.

The interpretation of the p-value is confusing for many people. It is not the probability that B is better than A (this is Bayesian), but rather the probability of obtaining the observed data or more extreme results if the null hypothesis were true.
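The sample-size step described above can be sketched with the standard two-proportion formula (Python standard library only; the 5% baseline rate and 10% relative MDE are illustrative assumptions):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p_base, mde_rel, alpha=0.05, power=0.80):
    """Visitors needed per variation for a two-sided two-proportion test."""
    p_var = p_base * (1 + mde_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_var - p_base) ** 2)

n = sample_size_per_arm(p_base=0.05, mde_rel=0.10)
print(f"Required visitors per arm: {n}")
```

Notice how the MDE sits in the denominator squared: halving the effect you want to detect roughly quadruples the required sample size, which is why the MDE must be fixed before the test starts.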

Contributor

CRO Technical Lead at Garaje de ideas

Full-Stack Experimentation

Full-stack experimentation tests both frontend and backend systems to optimize user experiences, infrastructure, and business outcomes across the entire digital stack.

Contributor

Product, Digital Marketing & Conversion Rate Optimisation

H

HARKing

HARKing is generally a bad idea because you don't actually know why the uplift happened. You can venture a guess; I can explain yesterday's horrible weather after the fact for many reasons. But if I were able to predict the weather in advance, it shows I understand something about what brings that weather about.

In experimentation, it's better to know why a certain thing happens than to guess. I know users on my website took action X because of reason Y. To stop HARKing from happening, all hypotheses should include details about the research done to arrive at the hypothesis. That research should be documented and shared openly. Experimentation dies if there is no documentation.

Contributor

Experimentation and Web Analytics Lead at Nawiri Group

Holdout Group

A holdout group is a segment of users intentionally excluded from all experiments and experiment winners over an extended period (e.g., a year). They serve as a baseline for comparison against the group that sees the optimized website.

The main advantage is that you can measure the true impact of your experimentation program. By comparing the holdout group's conversions and AOV with regular website visitors, you know the uplift caused by your experimentation efforts.

While this might sound ideal, it has many challenges and major downsides.

First, it is hard to track users for an extended period. In practice, this only works reliably for logged-in users, which means you can only compare them with other logged-in users. You will therefore miss a lot of data if not everyone is logged in all the time.

Assuming your experimentation leads to higher revenue, you will also miss out on the extra revenue from the users in your holdout group. And because the users in the holdout group cannot be part of any experiment, your MDEs increase, slowing down your experimentation program.

In summary, while a holdout group sounds ideal for analyzing the true impact of experimentation, it is hard to set up and comes with several major downsides.

Contributor

Independent Experimentation & Decision Strategy Leader

Hypothesis

Your hypothesis forms the basis of your experiment. It's a crucial building block. The same test could be considered a success or failure based on whether the data supports or disproves the hypothesis.

A good hypothesis should summarise what you're changing, why, and what you think will happen. You can take it further and explain why you think the change will occur. For example, 'we believe that changing the color of the CTA from red to green will draw more attention, therefore increasing the click rate because psychologically, green is a positive color associated with go rather than stop.'

I also find hypotheses help me focus on analyzing the data that relates to the test rather than losing track.

Contributor

Senior Strategy Consultant at CreativeCX

I

Inconclusive Results

When a test has no KPIs that have reached confidence, the test is considered inconclusive.

When metrics don't reach confidence, you don't know how the test will behave in the wild if you roll it out. It could have a positive effect, it could have a negative effect, or it could do nothing.

At best, this means your customers were ambivalent about the changes; at worst, the changes were too small for customers to notice. When you have inconclusive results, the best thing to do is to review the qualitative data to try and determine why customers reacted the way they did, iterate based on their feedback, and retest. If tests are inconclusive due to low traffic, then you need to expand the test audience and re-run.

Contributor

Sr. Manager, Customer Research and A/B Testing at Tractor Supply Company

Inductive Reasoning

Inductive reasoning uses specific observations to create general explanations or hypotheses. For example, heavy FAQ usage on a product page might suggest that users are uncertain about the product features.

Experiments should be used to validate these generalizations. For example, you could A/B test more precise descriptions to see if engagement with the FAQs decreases without negatively impacting conversion rates.

However, it's important to remember that these generalizations are not guaranteed truths. They are based on human assumptions and are, therefore, subject to biases and thought fallacies. This means that critical thinking should be applied to ensure you’ve explored alternative explanations for the observations. For example, the prominence of the FAQs may have simply drawn more attention.

Overall, inductive reasoning is a valuable tool for understanding where hypotheses came from, encouraging the consideration of alternative ideas, and generating ideas for experiments.

Contributor

Fractional AI Advisor and Founder, Ressada

M

Multi-Armed Bandit

A Multi-Armed Bandit test (MAB) is ideal for high-opportunity-cost scenarios (e.g., Black Friday or Cyber Monday) where traditional testing is not practical, long-term learnings are not the primary goal, and quick optimization matters more than understanding why.

During these 3-4 day events, visitors are heavily "sale-biased," fundamentally different from a typical user, and unlikely to return, making permanent variant launches unnecessary. Traditional A/B testing wastes valuable conversion opportunities during these critical periods.

Launch 3-4 variants (maximum 6-7) for quick exploration and optimization. MABs rapidly identify and automatically shift traffic to top-performing variants, maximizing goal completion within a short timeframe. Be aware that more variants extend exploration time, reducing effectiveness for time-sensitive events.

All major ad platforms use specialized MABs to place different ad copy in front of users, automatically shutting off underperforming variants and reallocating budget to the winners. MABs also power product variant recommendations (e.g., size L) based on user context at large e-commerce stores.
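One common MAB algorithm is Thompson sampling. The sketch below (simulated conversion rates, Python standard library only) shows how traffic automatically shifts toward the best-performing variant:

```python
import random

random.seed(7)
true_rates = [0.03, 0.05, 0.08]           # unknown to the algorithm
wins = [0, 0, 0]                          # conversions per variant
losses = [0, 0, 0]                        # non-conversions per variant

for _ in range(20_000):                   # each loop iteration = one visitor
    # Draw a plausible conversion rate for each variant from its Beta
    # posterior, then send the visitor to the variant with the highest draw.
    draws = [random.betavariate(wins[i] + 1, losses[i] + 1) for i in range(3)]
    arm = draws.index(max(draws))
    if random.random() < true_rates[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1

traffic = [wins[i] + losses[i] for i in range(3)]
print("Visitors per variant:", traffic)   # traffic concentrates on the winner
```

The tradeoff mentioned above is visible here too: the bandit maximizes conversions during the event, but the losing variants end up with too few visitors for a clean statistical read on why they lost.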

Contributor

Data Scientist

Multivariate Testing (MVT)

Multivariate testing (MVT) is a smart way for businesses to test multiple page elements at once and see how they interact. Instead of running separate tests for each change, MVT looks at different combinations of features—like headlines, images, and call-to-action buttons—to understand what works best together.

What makes MVT so powerful is its ability to show how one change affects others. For example, tweaking a headline might change how users engage with a call-to-action button. By testing multiple variations simultaneously, businesses can speed up optimization and make more informed design decisions.

That said, MVT works best for high-traffic websites since it requires a large number of visitors to produce reliable results. Data interpretation can also be complex, requiring solid statistical skills to translate findings into meaningful improvements. Before diving in, companies should consider whether they have enough traffic and the right expertise to get the most value from MVT.

Contributor

Senior Experimentation Consultant at Up Reply

P

P-Hacking

P-hacking is one of the most dangerous pitfalls in A/B testing because it makes random noise look like significant results.

When testers stop experiments prematurely after seeing a "significant" p-value, they're cherry-picking data points that support their hypothesis while ignoring the statistical assumptions behind the test. You start to roll out changes you think are impactful but aren't, which undermines your credibility.

To avoid p-hacking, determine your sample size and test duration before starting, and be disciplined to run the full experiment regardless of interim results. Otherwise, you're jeopardizing the test results and the entire testing program.

Peeking

Peeking refers to checking the interim results of an A/B test with the intent to take action before it completes. It is very common for experiments to look "significant" early on due to noise in the data, novelty effects, etc. This can lead to wrong decisions based on a subset of the sample, which can be very costly for the organization.

Peeking can be avoided by building a robust "test plan" where the sample size, significance level, and test duration are pre-determined before starting an experiment. It also helps to invest in educating the broader stakeholder group on why it’s important to wait until the test sample is reached before any decision is made. Another less common alternative is using sequential tests (instead of more commonly used "fixed sample" tests), which allows for peeking but at the cost of sacrificing some statistical power.
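The cost of peeking is easy to demonstrate with a simulation (illustrative numbers, Python standard library only): even with two identical variants, stopping at the first "significant" interim check inflates the false positive rate well beyond the nominal 5%.

```python
import random
from statistics import NormalDist

random.seed(1)

def p_value(conv_a, conv_b, n):
    """Two-sided two-proportion z-test p-value (equal sample sizes)."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    if se == 0:
        return 1.0
    z = (conv_a - conv_b) / (n * se)
    return 2 * (1 - NormalDist().cdf(abs(z)))

runs, checks, batch, rate = 400, 10, 500, 0.05   # identical 5% rate, A vs. A
stopped_early = 0
for _ in range(runs):
    a = b = n = 0
    for _ in range(checks):                      # peek after every batch
        a += sum(random.random() < rate for _ in range(batch))
        b += sum(random.random() < rate for _ in range(batch))
        n += batch
        if p_value(a, b, n) < 0.05:              # act on an interim "win"
            stopped_early += 1
            break

print(f"Tests wrongly called when peeking: {stopped_early / runs:.0%}")
```

With ten peeks per test, the realized false positive rate lands in the mid-teens rather than the 5% the significance level promises, which is exactly what a pre-registered sample size (or a proper sequential design) protects against.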

Contributor

AI Strategy and Transformation Leader

Q

Quality Assurance (QA)

A robust quality assurance process is crucial in experimentation. It can mean the difference between actionable insights that drive business growth and misleading results caused by undetected errors.

While QA is often viewed as simply testing the experiment design across browsers and devices, it's equally important to account for all potential user states to ensure the experiment performs reliably for every audience segment. Additionally, leveraging guardrail metrics and session replay tools can serve as valuable methods for identifying and addressing any user-facing issues that may compromise the integrity of the results.

Contributor

Director of Conversion Rate Optimization at ClassPass

R

Regression to the Mean

In statistics, the tendency to move back towards the mean is called regression to the mean. It happens because extreme events are usually followed by more typical ones. Since most values are near the average, it's much more likely to get an average number than another extreme one.

Suppose we have a uniform distribution from 1 to 100, with a mean value of about 50. If we pick a random number from this distribution and get an extreme value, like 95, and then pick another number, the second number will likely be less extreme and closer to the mean. That's because 94 of the 100 possible values are smaller than 95, so the second draw will almost certainly land closer to the average.

It's important to remember that in data science, sometimes, big changes we see in data may naturally "regress" without any real reason. Hence, we need to be careful when analyzing results.
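A quick simulation of the uniform-draw example above (Python standard library only):

```python
import random

random.seed(3)
closer_to_mean = 0
trials = 0
while trials < 10_000:
    first = random.randint(1, 100)
    if first < 95:                 # keep only extreme first draws (95-100)
        continue
    second = random.randint(1, 100)
    trials += 1
    # 50.5 is the exact mean of the integers 1..100.
    if abs(second - 50.5) < abs(first - 50.5):
        closer_to_mean += 1

print(f"Second draw closer to the mean: {closer_to_mean / trials:.0%}")
```

Around nine times out of ten the follow-up draw is closer to the mean, with no mechanism involved at all. It's the same reason a test variant that spikes in its first days often "cools off" as data accumulates.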

Contributor

Data Scientist

S

Sample Size

I can't stress this enough: sample size is super important in A/B testing. To make the best decisions, you must ensure that each test variation has a large enough sample size—that is, enough users or observations to collect solid statistical data for your analysis.

Why does this matter? Well, think about it: if you only ask two friends for restaurant recommendations and assume their favorite is the best choice in town, you might miss out on some amazing places. The same is true in testing. You can't say which variation won with a small sample size. This uncertainty can lead to decisions that might not be in your best interest in the long run. So, having a good sample size isn't just a technical detail; it's key to making informed choices that can significantly impact your business.

Contributor

Solution Consultant at Contentsquare

Segmentation

Segmentation helps you speak the same language as your customers. Your messaging and campaigns will resonate better if you group customers by similar attributes, characteristics, behaviors, and needs.

You can split your customers into segments based on attributes such as acquisition channel, time zone, or Recency, Frequency, and Monetary Value (RFM). But you should choose segments that are relevant and actionable.

Showing sandals to people in warmer regions makes sense if you sell shoes. But if you sell software, weather is less relevant than job seniority or company size.

Start by thinking about how you'll gather customer data, and then prioritize your initiatives based on segment size and value. Segmentation is also essential in post-test analysis to help you get deeper insights. For example, segment test metrics by new vs. returning users, desktop vs. mobile, or Android vs. iOS.

Contributor

Growth Strategy Analyst at Salt Bank

Sequential Testing

Sequential testing is useful for optimizing time-sensitive assets or when real-time decision-making is critical.

Sequential testing offers a robust framework for building automated decision-making systems. In digital advertising, for example, sequential testing can be used to test variations of time-sensitive promotions or event-based campaigns. It allows teams to quickly identify the best-performing variations and optimize spend by rolling the winners out to everyone.

Other common use cases include ephemeral content, ramps, fraud & failure detection. Sequential experiments are an extremely powerful tool in an experimenter's armory.

Contributor

Data Science Manager

Server-Side Testing

Server-side testing is critical for experimentation programs that need robust, secure, and consistent user experiences. Server-side testing determines which experience a user receives before any content is sent to their device.

Experimenters should care about server-side testing because it:

• Eliminates "flickering," where users briefly see the original experience before the test variant loads.

• Provides consistent experiences across devices and sessions.

• Enables testing of complex features requiring significant backend changes.

• Improves security by keeping experimental code on your servers.

• Reduces performance impacts since processing happens server-side.

• Allows testing behind authentication or with sensitive user data.

• Supports testing with server-rendered frameworks (Next.js, Django, Rails).

For product analytics teams, server-side testing enables more sophisticated experiments, especially for critical user journeys where performance and consistency are paramount. It also facilitates A/B testing of recommendation algorithms, search results, pricing strategies, and other backend-dependent features.

While requiring more engineering resources than client-side alternatives, server-side testing ultimately provides greater control, reliability, and versatility for mature experimentation programs.

Contributor

Director of Product Analytics, PrizePicks

Statistical Significance

Statistical Significance—a tongue-twister for most people—is often misused as a 'magical threshold' to validate test results and ignore other important aspects of the experiment.

When running A/B tests, teams often fixate on reaching statistical significance without considering the broader statistical landscape and business context. This tunnel vision can cause us to lose sight of our true objective: make an informed decision and reduce the risk of a change.

Speaking of risk, the significance threshold you choose should align with your business's risk tolerance. A more lenient threshold (90%) might be appropriate for low-risk changes, while critical updates may warrant stricter standards (99%). This decision should factor in your business scale, implementation costs, potential risks, and overall business impact.

Contributor

Head of Experimentation at Livescore Group

T

Type I Error (False Positive)

When you have a statistically significant A/B test and launch the treatment, the False Positive Risk (FPR) estimates the probability that you are making a mistake, given that your goal was to launch a statistically significant improvement.

The FPR is the probability that a statistically significant result is a false positive; that is, that the true effect is consistent with the null hypothesis, given our sample size. A very common misunderstanding is that the p-value is this probability, but the FPR needs to be computed using Bayes' Rule, and in domains with a low experiment success rate of 10-20%, it can be 4 to 5 times higher than the p-value.

To reduce the FPR, lower alpha, the p-value threshold for statistical significance (e.g., from 0.05 to 0.005), as recommended in the 2018 paper "Redefine Statistical Significance," signed by 72 leading authors.

Because lowering alpha increases false negatives, replicate A/B tests with borderline p-values and combine the results from the replications using meta-analysis techniques.
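Here is a sketch of the FPR calculation via Bayes' Rule; the 15% experiment success rate, 5% alpha, and 80% power below are illustrative assumptions:

```python
alpha, power = 0.05, 0.80
prior_true = 0.15          # share of experiments with a real effect

# Bayes' Rule: among statistically significant results, what fraction are
# false positives (null actually true) vs. true positives (effect real)?
false_pos = alpha * (1 - prior_true)   # null true, yet significant
true_pos = power * prior_true          # effect real, and detected
fpr = false_pos / (false_pos + true_pos)

print(f"False Positive Risk: {fpr:.1%}")
```

With these inputs the FPR comes out to roughly 26%, about five times the 5% many people read off the p-value threshold, matching the 4-5x figure cited above.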

Contributor

Best-selling author and former executive at Microsoft, Airbnb, and Amazon

Type II Error (False Negative)

The best fix for False Negatives? Avoiding them in the first place. But you can use some of these approaches to investigate your results:

Plan your sample size, or work out if you had enough to begin with. Use tools like G*Power or GLIMMPSE to calculate the minimum number of participants needed for reliable results. Guessing isn't a strategy, and many tools are free.

Swap bloated KPIs for something useful. "Overall engagement" sounds impressive, but what does it actually mean? Track micro-conversions instead—small but telling actions like clicking "Add to Cart" or watching 75% of a video. Real signals, not vanity metrics.

Standardize everything. Remove outliers, control external variables, and test in stable conditions (maybe don't run an experiment during a traffic surge). Messy data makes it impossible to detect real effects.

Refine your experiment design. Segment users by behavior, device, or other factors to catch nuanced trends. One-size-fits-all tests miss way too much.

Cross-check results. Industry benchmarks, past studies, user surveys—use them. Your experiment isn't as unique as you think.

Contributor

CRO Analyst at National Express

U

UTM Parameters

We find that UTM tracking remains an effective method for tracking specific inbound campaign performance across different channels, mediums, and experiments.

UTM tracking is particularly helpful for segmenting data when running experiments, as it allows us to produce stronger, more granular insights that may involve different messaging, promotions, or experiences being served to users.

However, we are mindful of the limitations of UTM tracking in recent years, especially as privacy regulations have become much stricter. Therefore, we favor an agnostic approach to tracking and attribution that includes UTM tracking alongside data modeling techniques and server-side tagging.

Contributor

Co-Founder and CRO Consultant at Convertex Digital

V

Variant

CRO professionals use the term 'variant' to describe the new thing that's being tested in an experiment. It's also often referred to as a "treatment" or "test group." An experiment can have multiple variants, for example:

• A - your 'original' or 'control group'
• B - Variant 1
• C - Variant 2

So, to avoid confusion, it's useful to give your variants descriptive labels in your A/B testing tool. For example: "Variant 1: New button design"

Tests with multiple variants find significant winners more often than those with only one variant. So, it's worth considering if there are other ways of executing your idea. But beware that the more variants you add, the more traffic you need.

Contributor

Conversion Rate Optimization Lead & Founder at Mammoth Website Optimisation

W

WYSIWYG

WYSIWYG, the most hilariously pronounced acronym going. The big spuds give it a lot of hate. But when your dev team has a backlog as long as your arm, it's the ideal way to visualize and test marketer-friendly changes like copy and imagery.

Expand this Glossary With Us!

Submit your favorite A/B testing term using the form below. And we might invite you to contribute a quote for it.

One new term wins “Word of the Month”, bagging its contributor a $50 gift card.
