A/B Testing: 7 Common Questions and Answers in Data Science Interviews, Part 2

In this second article of the series, we'll continue to take an interview-driven approach, linking some of the most commonly asked interview questions to different components of A/B testing, including selecting ideas for testing, designing A/B tests, evaluating test results, and making ship-or-no-ship decisions.



 
Note: This is the second part of this article. You can read the first part here.

 

Analyzing Test Results



Photo by Scott Graham on Unsplash

 

Novelty and Primacy Effects

 
When there is a change in the product, people react to it differently. Some are used to the way a product works and are reluctant to change; this is called the primacy effect or change aversion. Others welcome changes, and a new feature attracts them to use the product more; this is called the novelty effect. However, neither effect lasts long, as people's behavior stabilizes after a certain amount of time. If an A/B test shows a larger or smaller initial effect, it is probably due to the novelty or primacy effect. This is a common problem in practice, and many interview questions are about this topic. A sample interview question is:


We ran an A/B test on a new feature and the test won, so we launched the change to all users. However, after launching the feature for a week, we found that the treatment effect quickly declined. What is happening?


The answer is the novelty effect. Over time, as the novelty wears off, repeat usage decreases, so we observe a declining treatment effect.

Now that you understand both the novelty and primacy effects, how do we address these potential issues? This is a typical follow-up question during interviews.

One way to deal with such effects is to rule them out entirely: run the test only on first-time users, since the novelty effect and primacy effect obviously don't affect such users. If we already have a test running and want to analyze whether there is a novelty or primacy effect, we could 1) compare new users' results in the control group to those in the treatment group to evaluate the novelty effect, and 2) compare first-time users' results with existing users' results in the treatment group to get an actual estimate of the impact of the novelty or primacy effect.
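To make both checks concrete, here is a minimal sketch in Python, assuming a hypothetical DataFrame with a group label, a first-time-user flag, and the metric of interest (the column names and the synthetic data are purely illustrative):

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical experiment data; in practice this would come from your logs.
rng = np.random.default_rng(0)
n = 4000
df = pd.DataFrame({
    "group": rng.choice(["control", "treatment"], size=n),
    "is_new_user": rng.choice([True, False], size=n),
    "metric": rng.normal(loc=1.0, scale=0.5, size=n),
})

# 1) Novelty check: compare new users in control vs. treatment.
new_users = df[df["is_new_user"]]
t_stat, p_value = stats.ttest_ind(
    new_users.loc[new_users["group"] == "treatment", "metric"],
    new_users.loc[new_users["group"] == "control", "metric"],
    equal_var=False,  # Welch's t-test
)

# 2) Within the treatment group, compare first-time vs. existing users
#    to estimate the size of the novelty or primacy effect.
treated = df[df["group"] == "treatment"]
effect_gap = (
    treated.loc[treated["is_new_user"], "metric"].mean()
    - treated.loc[~treated["is_new_user"], "metric"].mean()
)
print(p_value, effect_gap)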

 

Multiple Testing Problem

 
In the simplest form of an A/B test, there are two variants: control (A) and treatment (B). Sometimes, we run a test with multiple variants to see which one performs best, for example when testing multiple colors of a button or different versions of a home page. In that case we have more than one treatment group, and we should not simply use the same significance level of 0.05 to decide whether the test is significant, because we are comparing more than two variants and the probability of false discoveries increases. For example, if we have 3 treatment groups to compare with the control group, what is the chance of observing at least 1 false positive (assuming our significance level is 0.05)?

We can first compute the probability that there are no false positives (assuming the comparisons are independent),

 
Pr(FP = 0) = 0.95 * 0.95 * 0.95 = 0.857
 

then obtain the probability that there’s at least 1 false positive

 
Pr(FP >= 1) = 1 - Pr(FP = 0) = 0.143
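
More generally, with k independent treatment comparisons at significance level alpha, the chance of at least one false positive is 1 - (1 - alpha)^k. A tiny sketch (the function name is just illustrative):

def family_wise_error_rate(k: int, alpha: float = 0.05) -> float:
    # Probability of at least one false positive across k independent comparisons.
    return 1 - (1 - alpha) ** k

print(family_wise_error_rate(3))   # ~0.143, matching the calculation above
print(family_wise_error_rate(10))  # ~0.401 with 10 treatment groups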
 

With only 3 treatment groups (4 variants), the probability of a false positive (or Type I error) is over 14%. This is called the "multiple testing" problem. A sample interview question is:


We are running a test with 10 variants, trying different versions of our landing page. One treatment wins and the p-value is less than .05. Would you make the change?


The answer is no, because of the multiple testing problem. There are several ways to approach it. One commonly used method is the Bonferroni correction, which divides the significance level of 0.05 by the number of tests. For the interview question, since we are running 10 tests, the significance level for each test should be 0.05 divided by 10, which is 0.005. Basically, we only declare a test significant if it shows a p-value of less than 0.005. The drawback of the Bonferroni correction is that it tends to be too conservative.
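As a rough illustration of how the correction plays out for the question above, here is a small sketch with made-up p-values for the 10 tests (the numbers are purely hypothetical):

import numpy as np

alpha = 0.05
# Hypothetical p-values, one per variant comparison; several are below 0.05.
p_values = np.array([0.04, 0.20, 0.03, 0.60, 0.01, 0.45, 0.07, 0.30, 0.049, 0.55])

adjusted_alpha = alpha / len(p_values)   # Bonferroni: 0.05 / 10 = 0.005
significant = p_values < adjusted_alpha
print(significant.any())                 # False: nothing survives the correction

Even though several of these hypothetical p-values fall below 0.05, none clears the corrected threshold of 0.005.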

Another method is to control the false discovery rate (FDR):

 
FDR = E[# of false positives / # of rejections]
 

It measures, out of all the rejections of the null hypothesis (that is, all the metrics that you declare to have a statistically significant difference), how many had a real difference as opposed to being false positives. This only makes sense if you have a huge number of metrics, say hundreds. Suppose we have 200 metrics and cap the FDR at 0.05. This means we are okay with 5% of the metrics we declare significant being false positives, so if all 200 metrics showed significant results, we would expect roughly 10 of them to be false positives.
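One common way to control the FDR is the Benjamini-Hochberg procedure; here is a minimal sketch using statsmodels, with randomly generated, purely illustrative p-values standing in for 200 metrics:

import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
p_values = rng.uniform(0, 1, size=200)   # stand-ins for 200 metric p-values

# method="fdr_bh" applies the Benjamini-Hochberg step-up procedure at FDR = 0.05.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(reject.sum(), "metrics flagged as significant after FDR control")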

 

Making Decisions



Photo by You X Ventures on Unsplash

 

Ideally, we see practically significant treatment results and can consider launching the feature to all users. But sometimes we see contradictory results, such as one metric going up while another goes down, and we need to make a win-loss tradeoff. A sample interview question is:


After running a test, you see the desired metric, such as the click-through rate, going up while the number of impressions is decreasing. How would you make a decision?


In reality, making product launch decisions can be quite involved because many factors are taken into consideration, such as the complexity of implementation, project management effort, customer support cost, maintenance cost, opportunity cost, and so on.

During interviews, we could provide a simplified version of the solution, focusing on the current objective of the experiment: is it to maximize engagement, retention, revenue, or something else? We also want to quantify the negative impact, i.e., the negative shift in a non-goal metric, to help us make the decision. For instance, if revenue is the goal, we could choose it over maximizing engagement, assuming the negative impact is acceptable.
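As a back-of-the-envelope sketch of that kind of tradeoff for the sample question above, we can translate both movements into a single unit such as total clicks (all numbers below are hypothetical):

baseline_impressions = 1_000_000
baseline_ctr = 0.020

treated_impressions = 950_000   # impressions down 5% in treatment
treated_ctr = 0.022             # click-through rate up 10% in treatment

baseline_clicks = baseline_impressions * baseline_ctr   # 20,000
treated_clicks = treated_impressions * treated_ctr      # 20,900

# A positive delta means the CTR gain more than offsets the impression loss.
print(treated_clicks - baseline_clicks)                 # 900.0 additional clicks

Whether those extra clicks justify the impression loss still depends on the experiment's goal and the other costs mentioned above.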

 

Resources

 
Lastly, I’d like to recommend two resources for you to learn more about A/B testing.

 
Bio: Emma Ding is a Data Scientist & Software Engineer at Airbnb.

Original. Reposted with permission.
