A/B Testing Quick Guide

The purpose of any experiment is to verify the validity of hypotheses, which, if correct, lead to improvements in various metrics. Put simply, every experiment should answer specific questions:

  • Is the brilliant idea (hypothesis) correct?
  • How does the target metric (what we are trying to influence) change?
  • Are the changes in the target metric sufficient to implement the results in practice?

If your client does not want answers to the above questions (even to one of them), then you have a problem, and it's necessary to clarify why the experiment is needed.

Step 1: Cost-Profit Ratio Evaluation

Experiments provide the opportunity to make decisions based on data, which is often superior to "personal experience." A "data-driven" approach is undoubtedly valuable, but it's essential to understand that the final results of an experiment may not justify the resources invested in its execution.

Therefore, the first stage of any experiment is a preliminary assessment of the potential effectiveness of the changes that will be implemented if the hypothesis is confirmed. It's crucial to grasp how much the target metric should change to recoup the resources invested.


The outcome of the first step in setting up an experiment should be an answer to the question: "How much should our target metric change to justify the expenses incurred in conducting the experiment and implementing changes in the product (if the hypothesis is confirmed)?"

The ideal format for the answer to this question is a number: for example, the number of additional leads or orders needed to justify the experiment.
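As a rough illustration of this break-even question, the sketch below divides the total cost by the profit from one extra conversion. The function name `required_extra_conversions` and all figures are hypothetical, and it assumes every additional conversion brings the same profit.

```python
import math


def required_extra_conversions(experiment_cost: float,
                               implementation_cost: float,
                               profit_per_conversion: float) -> int:
    """Smallest whole number of extra conversions that covers all costs."""
    if profit_per_conversion <= 0:
        raise ValueError("profit_per_conversion must be positive")
    total_cost = experiment_cost + implementation_cost
    return math.ceil(total_cost / profit_per_conversion)


# Hypothetical figures: 3,000 to run the test, 2,000 to ship the change,
# and 50 of profit from each additional order.
extra_orders = required_extra_conversions(3000, 2000, 50)
print(extra_orders)  # 100
```

If this number looks unreachable for the product, that is a signal to reconsider the experiment before spending anything on it.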

N.B. When it's not possible to assess the benefit of increasing the target metric, it's necessary to change the target metric.

Step 2: Target Metric

Before planning an experiment, it's essential to clarify with the client (the author of the testing idea) what the experiment should influence. The answer may surprise you or not provide a clear understanding of what should change during the experiment, or there might be too many target metrics.

In such cases, it's necessary to independently define the target metric/metrics. Remember, the fewer metrics, the better for everyone. The client will receive a clearer and quicker experiment result. Your workload will decrease, and the chance of errors will diminish. Ideally, there should be only one metric left.

How to determine if the designated metric is suitable for conducting the experiment? It's straightforward:

  • Measurability: The metric should be quantitatively measurable.
  • Trackability: You should be able to track the metric during, after, and preferably before the experiment.
  • Logic: Does the metric unquestionably change when the experiment is conducted? Does it make sense? (Changing the color of the corporate building is unlikely to affect the conversion of Google advertising.)

If the metric fails even one of these criteria, you should consider changing the metric.

Examples of "good" metrics:

  • Number of purchases.
  • Average transaction value.

Examples of "bad" metrics:

  • Customer happiness.
  • Engagement.
  • Interests.

Step 3: Audiences

Now that you have your target metrics and have calculated the minimum detectable effect (the effect from the experiment that justifies its conduct), it's time to form the experiment's audiences. You need to find answers to the following questions:

  • What audience size do you need for each of the experiment variants?
  • How long should the experiment run (which is derived from the previous question)?
  • How should you correctly divide the audiences into testing groups?

Let's go through these step by step. The description below provides the surface-level understanding necessary for conducting standard A/B tests; those who want to delve into the formulas and their meaning are welcome to explore further.

How to Determine the Required Audience Size:

To answer this question, you need to understand the following:

Baseline Conversion Rate - the current conversion rate. For example, you've gathered an incredibly magnificent audience and want to verify that using it will increase the average transaction value and the number of purchases.

In this case: Baseline Conversion Rate = the number of users who make a purchase / the number of users who click on your advertising links.

Minimum Detectable Effect - the minimum effect. Go back to Step 1 and see what effect the experiment needs to bring into your life to justify it.

Example calculation: To justify the experiment, you need to attract an additional 100 purchases over one month.

In this case: Minimum Detectable Effect = ((the number of users who make a purchase + 100) / the number of users who click on your advertising links) - (the number of users who make a purchase / the number of users who click on your advertising links).

In simpler terms, the minimum detectable effect is how much the target metric should change to justify the experiment and the implementation of changes at the end of its execution.
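One way to read these two definitions as arithmetic is purchases divided by clicks, before and after adding the extra purchases. The sketch below uses made-up monthly figures; `conversion_rate` and `minimum_detectable_effect` are hypothetical helper names, and it assumes the number of clicks stays the same.

```python
def conversion_rate(purchases: int, clicks: int) -> float:
    """Share of ad clicks that end in a purchase."""
    if clicks <= 0:
        raise ValueError("clicks must be positive")
    return purchases / clicks


def minimum_detectable_effect(purchases: int, clicks: int,
                              extra_purchases: int) -> float:
    """Absolute lift in conversion rate needed to gain `extra_purchases`."""
    baseline = conversion_rate(purchases, clicks)
    target = conversion_rate(purchases + extra_purchases, clicks)
    return target - baseline


# Hypothetical month: 10,000 ad clicks, 500 purchases, 100 more purchases needed.
baseline = conversion_rate(500, 10_000)
mde = minimum_detectable_effect(500, 10_000, 100)
print(f"baseline={baseline:.2%}, MDE={mde:.2%} absolute, {mde / baseline:.0%} relative")
```

Calculators differ in whether they ask for the absolute effect (here one percentage point) or the relative effect (here 20% of the baseline), so it helps to keep both at hand.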

It's also highly desirable to understand the concepts of significance and power. Within this guide, we'll grasp the general logic without delving into the details.

Significance - the percentage of cases where a difference between two samples will be registered even though none actually exists (the false-positive rate).

Power - the percentage of cases where the minimum effect will be registered (assuming it actually exists).

These concepts are related to Type I and Type II statistical errors. For those who want to dive into the details, you can explore further.

A simple example: a Type I error is concluding that the tested change has an effect when it actually has none (a false positive), and a Type II error is concluding that it has no effect when it actually has one (a false negative).

As you can see from the above figure, any experiment risks falling victim to a Type I or Type II error. This is the reality of experimentation, and the question is how likely these errors are. By default, we take a significance level of 5% (some calculators express it as 95% = 100% - 5%) and a power of 80%.

Let's move on to the tools for calculating sample size.

  • Evan Miller's website with the embedded calculator. According to Evan, he's the most understandable and clear guy when it comes to working with data and statistics.

N.B. There are numerous calculators on the internet, and the author of the article does not take responsibility for the accuracy of handpicked calculators. It's important to understand that the numbers obtained using calculators may slightly differ under similar settings. This is normal and is related to the calculation method and rounding values in the formula. So, don't overthink it; just go with the larger sample size.

Conclusion: By entering all the parameters into one of the calculators mentioned above, an experimenter will get an idea of the required sample size. Note that these calculators provide the audience size for one testing segment. Therefore, for an A/B test with two segments, A and B, the total audience size is the audience size for one segment multiplied by 2.
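For readers who prefer to see the calculation rather than trust a calculator, here is a minimal sketch of one common normal-approximation formula for comparing two conversion rates. It is not necessarily the formula any particular calculator uses, so the numbers may differ slightly; `sample_size_per_variant` is a hypothetical name, and the test is assumed to be two-sided.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline: float, mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed in EACH variant to detect an absolute lift `mde`
    over `baseline` with a two-sided test (normal approximation)."""
    p1, p2 = baseline, baseline + mde
    if not (0 < p1 < 1 and 0 < p2 < 1) or mde == 0:
        raise ValueError("rates must lie in (0, 1) and mde must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance, two-sided
    z_beta = NormalDist().inv_cdf(power)           # power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# Hypothetical inputs: 5% baseline conversion, 1 percentage point MDE.
per_variant = sample_size_per_variant(0.05, 0.01)
print(per_variant, "per variant,", 2 * per_variant, "in total for A and B")
```

With these inputs the result is on the order of 8,000 users per variant. Halving the MDE roughly quadruples the required audience, which is why Step 1 matters so much.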

How to Correctly Divide Audiences for Testing:

To do this, you need to make sure that the compared audiences (segments A and B):

  • Are structurally as similar as possible and differ only in the testing factor.
  • Do not overlap in exposure: users in variant A should not see content intended for users in variant B, and vice versa.

The testing factor is the difference between the test audience (variant B) and the default audience (variant A). For example, for web tests, the factor could be a changed form or button color, registration form fields, and so on. For marketing campaigns, the factor might be a set of unique audience characteristics. A simple example is users who made more than N purchases within a selected time period. In this case, the factor is the number of purchases within the chosen time frame.

Tools for testing websites often have built-in mechanisms for dividing users into equivalent groups based on location, demographics, traffic sources, etc. An example of such a tool is Google Optimize. This means that the question of correctly dividing the audience for A/B tests on websites is handled by the tool you're using.

However, this is not always the case. In situations where you need to divide audiences for testing manually (for instance, if you're manually uploading audiences into a CDP), you won't have control over details like a user's gender or location. Nevertheless, you must follow these rules:

  • The sample size for each variant must be the same. Sample size A = sample size B.
  • Variants A and B must be represented in equal proportions for each advertising platform. Since advertising platforms have many differences, variant A and B should be tested for each platform separately. What works for Facebook may not work for Google Ads and so on.
  • Variant A should not see advertisements for variant B, and vice versa.
  • Ideally, in a perfect world, users from variant A and B should not be subjected to outside influences.
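A minimal sketch of a manual split that follows the first three rules might look like this. It assumes the audience is a list of (user ID, platform) pairs with string IDs; `split_audience` and the platform labels are hypothetical, and random shuffling is only one of several reasonable ways to assign users.

```python
import random
from collections import defaultdict


def split_audience(users, seed=42):
    """Split (user_id, platform) pairs into equal, non-overlapping
    A and B groups, separately for each platform."""
    by_platform = defaultdict(list)
    for user_id, platform in users:
        by_platform[platform].append(user_id)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    groups = {}
    for platform, ids in by_platform.items():
        ids = sorted(set(ids))  # drop duplicates, fix the order before shuffling
        rng.shuffle(ids)
        half = len(ids) // 2    # with an odd count, one user is left out
        groups[platform] = {"A": ids[:half], "B": ids[half:2 * half]}
    return groups


# Hypothetical audience of 101 users across two platforms.
users = [(f"u{i}", "facebook" if i % 3 else "google_ads") for i in range(101)]
for platform, variants in split_audience(users).items():
    print(platform, len(variants["A"]), len(variants["B"]))
```

Because each user ID lands in exactly one list per platform, the same person is never uploaded to both variants on that platform. A user who appears on two platforms is still assigned independently on each, which this sketch does not try to prevent.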

Step 4: A/A Test

The best way to verify a new testing tool or ensure that you've collected and divided your testing audiences correctly is to conduct an A/A test.

In an A/A test, you divide your audience into groups that do not differ in the testing factor. For websites, this means a test where both user groups see the same button. For marketing campaigns, it's merely a common pool of users who form the sample (for example, all users exposed to retargeting ads in the Facebook Ads dashboard).

What does such a test provide?

You gain the opportunity to evaluate if there is any statistically significant difference between the testing variants, which, according to the experimenter, should be equivalent.

If there isn't a statistically significant difference, it means the experiment's structure is likely correct.

If there is a statistically significant difference between two "identical" variants, it means the experimenter most likely made a mistake somewhere, and you need to go back to Step 3 and recheck everything.
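One caveat is worth keeping in mind: at a 5% significance level, roughly one correctly configured A/A test in twenty is expected to show a "significant" difference purely by chance. The simulation below is a sketch of that effect, not a validation procedure. It assumes a true conversion rate of 10%, compares groups with a pooled two-proportion z-test, and uses hypothetical function names.

```python
import math
import random
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so nothing to detect
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def aa_false_positive_rate(true_rate=0.10, n_per_group=500,
                           runs=1000, alpha=0.05, seed=1):
    """Share of simulated A/A tests that look significant by chance."""
    rng = random.Random(seed)

    def conversions():
        # Each simulated user converts with the same probability in both groups.
        return sum(rng.random() < true_rate for _ in range(n_per_group))

    hits = sum(
        two_proportion_p_value(conversions(), n_per_group,
                               conversions(), n_per_group) < alpha
        for _ in range(runs)
    )
    return hits / runs


rate = aa_false_positive_rate()
print(f"{rate:.1%} of simulated A/A tests looked significant")
```

The printed share should land near 5%. So a single failed A/A test is a reason to recheck the setup, and a repeated failure is a much stronger sign that something is wrong.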

P.S. Statistical significance will be explained later; for now, it's essential to accept that it might exist.

Step 5: Waiting

You've launched the test, and the best thing you can do is to leave it alone, refrain from trying to obtain intermediate results, and simply wait.

The only thing you can monitor is the number of conversions (desired actions) to understand how much time is needed to reach the sample size calculated in Step 3 for the test.
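A back-of-the-envelope way to turn that monitoring into a waiting time is sketched below. It assumes traffic keeps arriving at the average daily rate seen so far, and that you count observations in the same unit the sample size was calculated in; `days_to_reach_sample` and the figures are hypothetical.

```python
import math


def days_to_reach_sample(required_per_variant: int, variants: int,
                         collected_so_far: int, days_elapsed: int) -> int:
    """Estimated days still needed to reach the planned total sample."""
    if days_elapsed <= 0 or collected_so_far <= 0:
        raise ValueError("need at least one day of data to estimate a rate")
    total_required = required_per_variant * variants
    remaining = max(0, total_required - collected_so_far)
    daily_rate = collected_so_far / days_elapsed
    return math.ceil(remaining / daily_rate)


# Hypothetical test: 8,000 needed per variant, two variants,
# 4,000 observations collected in the first 5 days.
print(days_to_reach_sample(8000, 2, 4000, 5))  # 15
```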

Step 6: Evaluating Results

To evaluate the results, you need to use the calculators familiar from Step 3:

  • Evan Miller's website (calculator). Pay attention to the Chi-Square test section; it's suitable for comparing conversion counts between variants. For example, variant A had 20 conversions out of 100 ad clicks, and variant B had 30 conversions out of 100 ad clicks. To compare means (such as average transaction value), the 2-sample T-test is typically used.

Enter the data into the calculator and get the result. For those who want to understand how the result is obtained and what all of this is about, you can explore further.
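To get a feel for what such a calculator does with the 20-out-of-100 versus 30-out-of-100 example, here is a sketch of a plain Pearson chi-square test on a 2x2 table. It applies no continuity correction, so an online calculator may report a slightly different p-value; `chi_square_2x2` is a hypothetical name.

```python
import math


def chi_square_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-square test (no continuity correction) for two variants.
    Returns (statistic, p_value). Needs at least one conversion and
    one non-conversion overall, otherwise an expected count is zero."""
    observed = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, (n_a - conv_a) + (n_b - conv_b)]
    total = n_a + n_b

    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected

    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


stat, p = chi_square_2x2(20, 100, 30, 100)
print(f"chi2={stat:.2f}, p={p:.3f}")
```

Here the p-value comes out at about 0.10, which is above the 0.05 threshold. So even a jump from 20% to 30% conversion is not statistically significant with only 100 clicks per variant, which is exactly why the sample size from Step 3 matters.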

Once you have the result, congratulations – you're doing great!
