Machine Learning & A/B Testing
Building confidence in a decision
Table of Contents
- What, why, when
- Experimental Design
1. Choose the one variable to be tested
2. Define the key metric
3. Estimate the sample size
4. Perform a statistical analysis of the metrics gathered
- Common Pitfalls
- Conclusion
This post is part of a blog series on A/B Testing in the context of Machine Learning. Stay tuned for more.
What 🧐, why 🤨, when ⏲
A/B testing is not new. It has been widely used in many business and scientific areas. From e-commerce to psychology, this user experience research methodology has paved the way for data-driven decisions, and Machine Learning (ML) should benefit from it as well.
Most Machine Learning engineers believe that the lifecycle of a project is as follows:
In other words, as soon as they have a model that is performing well, they ship it to production. This is not correct. We are missing the A/B testing component, which should happen at the following stage:
What is A/B testing?
It is an experiment 🧪! A popular and effective way to test whether your new model really is as good as you think it is, because in real settings many things influence user behaviour and metrics.
A basic A/B testing infrastructure requires a controlled setting in which every element is held constant except one: in our case, the ML model. You need to be able to do two things at the same time:
- randomly serve your customers with one of the two ML models, and
- track your desired metrics
The use of random assignment to evenly split your customers into two groups (control and treatment variant) is the secret ingredient of the experiment. Without it, you wouldn’t be in a position to talk about causation.
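As a minimal sketch (everything here is hypothetical: the placeholder models, the in-memory MetricsStore, and the 50/50 split), serving and tracking could look like this:

```python
import random
from dataclasses import dataclass, field

# Hypothetical stand-ins for the two models under comparison:
# "control" is the current champion, "treatment" is the new challenger.
def model_a(features):
    return sum(features) > 1.0  # placeholder champion model

def model_b(features):
    return sum(features) > 0.8  # placeholder challenger model

MODELS = {"control": model_a, "treatment": model_b}

@dataclass
class MetricsStore:
    # Minimal in-memory tracker; a real system would write to a database or event log.
    events: list = field(default_factory=list)

    def log(self, **event):
        self.events.append(event)

def assign_variant() -> str:
    # Random 50/50 split of incoming traffic between the two variants.
    return "control" if random.random() < 0.5 else "treatment"

def serve_prediction(features, metrics_store: MetricsStore):
    variant = assign_variant()
    prediction = MODELS[variant](features)
    # Record which variant served the request so the business metric
    # (e.g. a later conversion) can be attributed to it.
    metrics_store.log(variant=variant, prediction=prediction)
    return prediction

store = MetricsStore()
serve_prediction([0.4, 0.7], store)
```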
Why do we need it?
When we build an ML model, we do it offline. This means that we train the ML algorithm with historical data and then evaluate it in isolation. These offline tests can tell us a lot about the model's performance, but only on historical data. Thus, they tell us nothing about the causal relationship between a model and a user outcome. But what is the real purpose of our ML model? It is to improve certain business metrics (KPIs, or Key Performance Indicators) such as increasing engagement, reducing handling time, or generating more revenue, which means that it is necessary to perform online evaluation as well.
A company invests in ML to improve business results. But when building a model, the ML engineers don't train it based on the KPIs; instead, they measure its performance on historical data. A model that performs well offline will not necessarily perform well online. We cannot establish causality through offline testing, so please repeat after me:
Increasing the performance of a model, does not necessarily translate to a gain in value!
Because correlation does not imply causation. In a nutshell, the offline evaluation measures the correlation, while, the online evaluation measures the causation.
When is it time to perform an A/B test?
When there is a change in your ML cycle: for example, your data preparation changed, you trained a new algorithm, you added more features, or you introduced a post-processing step.
You should also minimise the time spent up to your first online experiment. Yes, you heard that correctly. Do not spend day after day polishing your model, because it can easily take up to a year to reach a deployable level of performance. Instead, have a baseline model ready in 2–3 months (it could be a simple ML model or a rule-based approach) and run an A/B test. This is important because it will allow you to understand the impact of even a very basic model on your business.
Experimental Design
A good starting point is defining the null and alternative hypotheses. The null hypothesis (H₀) always states that there is no significant difference between the control and treatment groups. In our context, H₀ could be: “there is no difference in business revenue between using model A and model B”. The alternative hypothesis (Hₐ) challenges H₀; its purpose is to show that H₀ is simply not true.
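Expressed in terms of the conversion-rate example used later in this post, with p_A and p_B denoting the conversion rates under the control and treatment models, the hypotheses read:

H₀: p_A = p_B (both models yield the same conversion rate)
Hₐ: p_A ≠ p_B (the two models yield different conversion rates)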
1. Choose the one variable to be tested
For example, your team has invested a lot of time into improving your ML classifier and you would like to test if this new (promising) model can lead to a positive impact.
2. Define the key metric
Depending on the task at hand and the nature of your business, the metric will be different. For example, you might want to increase the click through rate or decrease the handling time of your customer support service.
3. Estimate the sample size
In order to compute the sample size for our experiment, we first need to think about Type I and Type II errors.
A Type I error happens when we reject the null hypothesis even though it should not be rejected. The probability of falsely rejecting the null hypothesis is also known as the significance level, or alpha. A 5% significance level is considered the industry standard; it means that when you declare the difference between the control and treatment groups significant, you do so with 95% confidence.
A Type II error means failing to reject the null hypothesis when it is actually false. Statistical power is the probability that a test will correctly reject a false null hypothesis; in other words, the probability of avoiding a Type II error. The power level is usually set to 80%.
Suppose your model aims to increase the conversion rate. As with anything you want to improve, you first need a baseline: the baseline conversion rate. Let’s assume that the mean daily conversion rate for the past 6 months is 15%.
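As a rough sketch of how such a baseline could be computed from historical traffic logs (the DataFrame and its date, visits and conversions columns below are hypothetical):

```python
import pandas as pd

# Hypothetical daily traffic log for the past 6 months.
logs = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=180, freq="D"),
    "visits": [1000] * 180,
    "conversions": [150] * 180,
})

# Daily conversion rate, averaged over the whole period.
baseline_conversion_rate = (logs["conversions"] / logs["visits"]).mean()
print(f"Baseline conversion rate: {baseline_conversion_rate:.1%}")  # 15.0%
```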
The last missing piece for computing the sample size is the Minimum Detectable Effect (MDE). It is “the smallest improvement you are willing to detect in an experiment to a certain degree of statistical significance”. In our case, we assume that the new model should reach at least an 18% conversion rate for us to use it instead of the existing “champion” model. The desired conversion rate lift is therefore 18% − 15% = 3%, which brings us to MDE = 20% via the following formula:
MDE = desired conversion rate lift / baseline conversion rate × 100% = 3% / 15% × 100% = 20%
Now, we can simply plug those numbers into a tool such as Evan Miller’s Awesome A/B Tools, which can help us determine the sample size. According to the site, we are going to need 2,276 samples per variant.
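If you prefer code over a web calculator, a minimal sketch using statsmodels’ power analysis is shown below. Note that different tools rely on slightly different approximations, so the figure will not match the calculator’s number exactly:

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline_rate = 0.15  # current conversion rate
target_rate = 0.18    # smallest conversion rate we want to be able to detect

# Cohen's h effect size for the two proportions.
effect_size = proportion_effectsize(target_rate, baseline_rate)

# Required sample size per variant for alpha = 5% and power = 80%.
# This approximation differs from the one used by Evan Miller's calculator,
# so the resulting number is in the same ballpark but not identical.
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative="two-sided",
)
print(round(n_per_variant))
```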
You can always increase the number of samples, but remember that a larger required sample size means running the experiment for a longer period of time.
4. Perform a statistical analysis of the metrics gathered
The very last step of our A/B test is to perform a statistical analysis of the metrics we gathered while the test was running. This will determine whether there is a statistically significant result. Read more about it in the next post of this series.
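As a preview, such an analysis could be a two-proportion z-test on the conversion counts of the two groups; the numbers below are made up purely for illustration:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results once the experiment reached its target sample size.
conversions = [341, 410]  # control, treatment
samples = [2276, 2276]    # users per variant

z_stat, p_value = proportions_ztest(count=conversions, nobs=samples)
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")

# With the conventional 5% significance level, reject H0 only if p_value < 0.05.
if p_value < 0.05:
    print("Statistically significant difference between control and treatment.")
else:
    print("No statistically significant difference detected.")
```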
Common Pitfalls
Engineering Bugs
In order to measure the business impact between your two groups, you will most probably use SQL queries. It is very common to find bugs in these queries, which may lead you to think that the newer model yields reasonable results when, in reality, the numbers are completely wrong.
Sample size
Finding the right sample size for your experiment is a challenge on its own. A small sample size may lead to inconclusive or inaccurate results, while a very large size will waste time and resources.
Randomisation
A/B testing is successful because it is a randomised experiment. However, achieving true randomness is easier said than done. Imagine visiting a website on your mobile and then on your laptop; you may end up with a different experience on the two devices. The solution is to cluster customers so as to ensure a consistent assignment. But again, this is not easy, and building such a system takes a lot of time and proper infrastructure.
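One common approach, sketched below with a hypothetical customer_id, is to derive the assignment deterministically from a stable customer identifier, so the same person lands in the same group on every device:

```python
import hashlib

def assign_variant(customer_id: str, experiment: str = "new-model-vs-champion") -> str:
    # Hash the customer id together with an experiment name so that the same
    # customer always gets the same variant within this experiment, while
    # different experiments get independent splits.
    digest = hashlib.sha256(f"{experiment}:{customer_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "control" if bucket < 50 else "treatment"

# The same customer maps to the same variant on mobile and on laptop.
assert assign_variant("customer-123") == assign_variant("customer-123")
```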
Early Stopping
Imagine stopping an ongoing experiment as soon as there is 5% significance. The resulting significance tests are simply invalid 😔 Every scientist should only run experiments where the sample size has been fixed in advance, and stick to that sample size with near-religious discipline.
Conclusion
A/B testing is a very powerful tool. It allows you to use the word “because”: if, at the end of the day, you observe a significant difference between the two groups, you can say that it happened because of your treatment.