The A/B Testing Trap

“We A/B test everything”? You shouldn't.

“We A/B test everything” is a trap that product teams fall into: it means you have stopped innovating and are only optimizing. Innovation is hard to A/B test. You often need to start with foundational research into the problem rather than validating solutions, and even when validating solutions, previous KPIs may no longer apply, or you may need many iterations before an innovative solution can beat the control experience in an A/B test.

Product teams of low maturity often focus mostly on delivery: they prioritize and groom the backlog, ensure engineers are always unblocked and working on the highest-priority items, and they obsess over velocity.

As the product team matures, they start to understand that they need to focus on outcomes, not outputs. They realize that “you can't improve what you can't measure” (Peter Drucker), so they define KPIs that proxy user and business value and start tracking them.

They recognize that most improvement ideas fail, so they start validating that their ideas improve the KPIs. They start A/B testing, the gold standard for validating the KPI impact of features and the scientific method of product development. At some point, it happens: the team starts proudly proclaiming “we launch every feature as an A/B test”. They have fallen into the A/B testing trap.

It's a trap!

Why is A/B testing everything a trap? Because while A/B testing works great as a validation technique for optimization, it struggles with innovation. Optimization means removing user friction and incrementally improving the value delivered to users or the value extracted for the business. Innovation is about making bigger steps, creating value for the user in a different way.

A/B testing works so well for optimizing your product because you have already validated that there is a problem, and you generally have quite a good understanding of the problem, how it can be solved, and how it can be measured. With true innovation, you are taking a greater leap: you are either solving an existing problem in a new and different way, or you are solving a new problem that your product isn't solving yet.
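Before looking at these two innovation cases, it helps to recall what an A/B test verdict mechanically is. Here is a minimal sketch for a conversion-style KPI; the function name and all numbers are invented for illustration, not taken from any particular tool:

```python
# Minimal sketch of the statistics behind a conversion-style A/B test
# (function name and all numbers invented for illustration).
from math import sqrt, erfc

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z statistic, two-sided p-value) comparing conversion rates A vs. B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null hypothesis
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))          # two-sided tail of the standard normal
    return z, p_value

# Control converts 5.0% of 10,000 users; treatment converts 5.7% of 10,000.
z, p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=570, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}")  # z ≈ 2.20, p ≈ 0.028: unlikely to be pure chance
```

Note what this presupposes: a single, well-understood KPI and a control worth beating. Both assumptions break down for innovation, as the two cases below show.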

In the first case, solving an existing problem differently, it is possible that the KPIs established to proxy the value delivered by the previous solution are no longer accurate. In that case, A/B testing the new solution will not yield meaningful results. Even if the KPIs still apply, you risk running into the local maximum problem: if you A/B test an innovative solution against a control experience that is already somewhat optimized in terms of your KPIs, the innovative solution may fail the A/B test despite having higher potential, simply because it is not yet optimized itself. You get stuck in a local maximum with respect to your KPIs. The problem is best understood through the following illustration:

[Figure: Value (measured by some KPI) across two solution “hills”. The current solution's hill peaks lower, with the optimized control experience near its top; the innovative solution's hill peaks higher, but its initial iteration sits below the control, so the KPI drops in an A/B test.]

The local maximum problem is difficult to solve through A/B testing alone. You could keep iterating on the innovative solution to see whether it is possible to creep further up the “hill” until the innovative solution beats the control experience in an A/B test. However, A/B testing alone cannot tell you whether the potential of the innovative solution is higher than that of the control: you might simply have landed on a lower hill. Qualitative validation methods help you better understand the potential of a solution and the impediments to realizing its value fully. Moreover, qualitative validation is faster than building and shipping many iterations, since it can be done using prototypes of increasing fidelity.
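To make the local maximum problem concrete, here is a small, purely illustrative simulation; the value functions, starting points, and iteration counts are all invented for this sketch. Greedy, KPI-driven iteration on the innovative solution's hill stalls below the already optimized control, even though the innovative hill peaks higher:

```python
# Illustrative sketch of the local maximum problem (all numbers invented).
# Two "value landscapes" over an iteration/design parameter x:
# the current solution sits near the top of a lower hill (the optimized control),
# while the innovative solution starts low on a higher hill.

def current_solution_value(x: float) -> float:
    # Lower hill: peaks at value 1.0 around x = 0.5
    return 1.0 - 4.0 * (x - 0.5) ** 2

def innovative_solution_value(x: float) -> float:
    # Higher hill: peaks at value 1.5 around x = 0.8
    return 1.5 - 6.0 * (x - 0.8) ** 2

def hill_climb(value, x: float, step: float = 0.05, iterations: int = 3) -> float:
    """Greedy iteration: keep a change only if the KPI improves."""
    for _ in range(iterations):
        for candidate in (x + step, x - step):
            if value(candidate) > value(x):
                x = candidate
    return x

# The control has had many optimization rounds; the innovation gets only a few
# iterations before the A/B test delivers its verdict.
control_kpi = current_solution_value(hill_climb(current_solution_value, 0.4, iterations=20))
innovation_kpi = innovative_solution_value(hill_climb(innovative_solution_value, 0.3, iterations=3))

print(f"optimized control KPI: {control_kpi:.2f}")   # ~1.00
print(f"early innovation KPI:  {innovation_kpi:.2f}")  # ~0.77: loses the A/B test despite a 1.5 potential
```

The simulation only restates the figure above in code: nothing in the A/B comparison itself reveals that the losing variant is sitting on the higher hill.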

Another difference between optimization and innovation is that optimization is usually more locally constrained and therefore easier to measure with a single KPI. Innovation attempts will typically impact multiple KPIs, often in different directions. It is not possible to run an A/B test with multiple KPIs as success criteria, since the trade-offs between the KPIs are not clear. The only way to test for multiple criteria at the same time is to synthetically combine them into a single KPI, but that means making trade-off decisions before understanding the impact of the feature, and it also means the success criteria are no longer intuitively understandable.
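To illustrate, a synthetically combined success criterion might look like a weighted sum of relative KPI changes. The KPI names, weights, and observed deltas below are all invented; the point is that the weights encode trade-off decisions made before the test runs, and the resulting single number has no intuitive meaning of its own:

```python
# Illustrative sketch of a synthetically combined success metric
# (KPI names, weights, and deltas all invented).
# The weights encode trade-offs between KPIs *before* the A/B test runs.

KPI_WEIGHTS = {
    "conversion_rate": 0.5,             # is +1% conversion really worth...
    "avg_session_minutes": 0.3,         # ...this much engagement?
    "support_tickets_per_user": -0.2,   # negative weight: fewer tickets is better
}

def combined_score(kpi_deltas: dict[str, float]) -> float:
    """Weighted sum of relative KPI changes (treatment vs. control)."""
    return sum(KPI_WEIGHTS[k] * delta for k, delta in kpi_deltas.items())

# An innovative feature often moves KPIs in different directions:
observed = {
    "conversion_rate": -0.02,        # -2% relative
    "avg_session_minutes": +0.15,    # +15% relative
    "support_tickets_per_user": +0.10,
}

print(f"combined score: {combined_score(observed):+.3f}")
# +0.015: "positive", but only because of the weights chosen up front,
# and the single number is no longer intuitively understandable.
```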

In the second case, solving a new and different user/customer problem, A/B testing is even further from what you should be focusing on. You need foundational qualitative research to ascertain that there actually is a problem worth solving, and then to understand it as deeply as possible. Even after a solution concept has been developed, the KPIs to proxy the value delivered to users/customers for this new problem will be completely different, and they will have to be established first through analysis of longitudinal usage data, which won't be available until the product has been shipped and used for a while. Therefore, validating that the solution indeed solves the problem and delivers value should also be done qualitatively.

[Figure: Optimization vs. Innovation]

Even after understanding that innovative solutions require qualitative research (both foundational research and solution validation), product teams often still insist on rolling out every feature as an A/B test to understand its impact on user/customer and business outcomes. This approach can make sense, but it comes with risks that need to be understood and managed.

Firstly, it can mean that A/B testing becomes the “default” validation method: “If we're going to validate this in an A/B test anyway, why bother validating qualitatively beforehand?” Instead, the correct validation method needs to be determined independently for each and every improvement effort; choosing it should be an explicit step in the product development process that every improvement effort goes through.

Secondly, A/B testing even innovative solutions can have a chilling effect on innovative improvement efforts. No matter how open-minded the product team is, it never “feels good” when an A/B test yields worse results for the test experience than for the control, and it can feel like “cheating” to ship an innovative solution that has “lost” the A/B test (even after it has been successfully validated qualitatively). If this feeling arises in the product team, even subconsciously, it can mean that people start preferring optimization over innovation; after all, the former is much more likely to “win” the A/B test.

Both of these risks can of course be overcome, but only if everyone in the product team is aware of them and actively counteracts them.

In conclusion, while product teams may choose to A/B test all improvement efforts, they need to be aware that innovative solutions require qualitative research both to understand the problem and to validate the solution. For innovative solutions, A/B testing is often best used only as a supplement, and even then it comes with risks that need to be carefully managed.

If you found this article interesting, feel free to follow me on Twitter where I share articles and thoughts on product management.
