You Can't A/B Test Your Way to Greatness

A/B testing is an awesome tool, but it has severe limitations

First things first: I love A/B testing. For those unfamiliar with the method, A/B testing means changing a product (or its packaging, messaging, pricing...) and measuring how the changed version performs, in terms of some measurable KPIs, compared to the original version (the “control”). The test works by randomly assigning the population of users (or customers, or visitors) to either the control group or the test group and then measuring how the two groups behave differently. Since the allocation is random, the two groups are very likely to be similar in their characteristics (e.g., demographics), so any measurable difference between them must have been caused by the change made to the product.
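To make the mechanics concrete, here is a minimal sketch (in Python, with invented numbers) of how such a comparison is often evaluated: a two-proportion z-test on conversion rates, assuming statsmodels is available.

```python
# Minimal sketch of evaluating an A/B test with a two-proportion z-test.
# All counts are invented; assumes statsmodels is installed.
from statsmodels.stats.proportion import proportions_ztest

conversions = [520, 600]        # conversions in control, test
visitors = [10_000, 10_000]     # randomly assigned users per group

# Null hypothesis: both groups convert at the same underlying rate
z_stat, p_value = proportions_ztest(conversions, visitors)

print(f"control: {conversions[0] / visitors[0]:.2%}")   # 5.20%
print(f"test:    {conversions[1] / visitors[1]:.2%}")   # 6.00%
print(f"p-value: {p_value:.3f}")  # ~0.014 here: unlikely to be pure chance
```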

A/B testing is modeled on the randomized controlled trials used in medicine to establish the effectiveness of new treatments. Since medical trials are held to very high standards, A/B testing has been dubbed the “scientific method of product development”.

It is an inconvenient truth of digital product development that most of our ideas fail. We therefore have to validate that our ideas have the impact we were hoping for, and catch the ones that fail to live up to those hopes. A/B testing is absolutely a viable way to do that.

If the alternative is to just ship our ideas without validating that they have the expected impact, then I will always argue in favor of A/B testing. For a lot of changes, running an A/B test to understand the impact (positive and negative) as accurately as possible is a great idea.

However, A/B testing has its limitations and risks. Most importantly: You can't A/B test your way to greatness.

Creative Destruction

Building something great and novel requires creative destruction: overthrowing aspects of the status quo that are no longer needed or valid, and establishing new fundamental assumptions. For example, the iPhone—possibly the greatest and most successful product of all time—required challenging the assumption that a smartphone needs a keyboard or a stylus.

This creative destruction requires a long-term vision, an underlying hypothesis, a “theory of the case”. As a product team, you need a goal in mind for how your customers’ problems can be solved in a fundamentally better way than today, and then you need to pursue that goal.

That goal doesn’t exist in isolation, of course. It should be informed by foundational research into customer needs and by insights into technology trends and the evolution of the market. However, a truly great and novel idea is likely to be controversial. If the underlying hypothesis were obviously true, someone would have built the new product long ago. (In 2012, around Facebook’s IPO, a business school professor told me how bad Facebook’s financials were. Today, Facebook is obviously extremely profitable, but even when it IPO’ed you could reasonably doubt its business model.)

To achieve greatness, the team needs to doggedly pursue a vision that goes against the assumptions underlying the status quo. There will be setbacks along the way, ideas you try that don’t work out, but you can’t allow them to make you lose sight of the vision. If at first you don’t succeed... you know how it goes.

To build something great, something that’s fundamentally better than what’s already out there, you don’t start with something that’s marginally better than the status quo. You start with something fundamentally different. Since you’re off the beaten path, your first version will likely suck. Challenging assumptions also means throwing out all of the optimization that has happened on top of those assumptions. If you A/B test the first version of a novel approach against an optimized status quo, it is extremely likely you will not make any progress.

The vision you have for a new way of doing things is your root hypothesis, and you should devise experiments to validate it piece by piece. Just building v1 and A/B testing it isn't going to cut it, though.

Clay Christensen’s theory of disruption describes how fundamental shifts in the market are often triggered by a new technology whose performance, by traditional measures, is decidedly lower than the incumbent technology’s. Look at how PCs disrupted minicomputers: PCs performed far worse than minicomputers in every dimension that was thought to matter, but their small size and relative affordability meant they could address a vastly larger market. How are you going to A/B test a fundamental shift like that?

Great products require breaking with the status quo and challenging the underlying assumptions. Building on new assumptions requires a new vision and determination to pursue it, even in the face of setbacks. Of course, that doesn't mean that you shouldn't validate whether your vision is likely to succeed, but A/B testing is likely not the answer.

Incremental Thinking

A/B testing promotes an approach that is diametrically opposed to that creative destruction. It starts with the local maximum problem: if you A/B test a novel, unoptimized approach against an optimized control version and the control wins, there is no way of knowing whether that is simply because the new approach hasn’t been optimized yet. By optimizing the current experience, you have climbed a hill (the local maximum). When you test a drastically different experience against it, what you really want to know is whether the potential of the new experience is higher than the hill you are currently on, and the results of an A/B test cannot tell you that.

[Figure: Value (measured by some KPI) for two solutions. The innovative solution has a higher potential value than the current solution, but the initial iteration of the innovative solution scores below the optimized control experience, so the KPI drops in the A/B test.]
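To see the trap in miniature, here is a toy simulation (every curve and number below is invented purely for illustration) in which the mature solution sits near its modest ceiling while the innovative solution starts lower but has far more headroom:

```python
# Toy model of the local maximum problem. All curves and numbers are invented.
# Each "solution" maps optimization effort to a KPI value.
import math

def current_solution(effort: float) -> float:
    # Mature approach: starts high but plateaus at a modest ceiling (~80)
    return 60 + 20 * (1 - math.exp(-effort))

def innovative_solution(effort: float) -> float:
    # Novel approach: starts low but its ceiling is far higher (~100)
    return 30 + 70 * (1 - math.exp(-effort / 3))

optimized_control = current_solution(5.0)    # years of tweaking: ~79.9
novel_v1 = innovative_solution(0.2)          # first rough version: ~34.5

print(f"optimized control: {optimized_control:.1f}")
print(f"novel v1:          {novel_v1:.1f}   <- 'loses' the A/B test")
print(f"novel ceiling:     {innovative_solution(20.0):.1f}   <- invisible to the test")
```

The single comparison the A/B test gives you (34.5 vs. 79.9) says nothing about where the two curves top out.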

A/B testing thereby promotes an incremental mindset: finding a “winner”, an experience that “beats” control, becomes more important than making progress toward the vision. This incremental approach is great for optimization purposes, but terrible for vision-oriented, big picture thinking.

Overreliance on A/B testing also leads to short-term thinking. Since you want to conclude A/B tests without waiting months or years to measure lagging metrics (for example, long-term customer retention), you have to use short-term leading indicators as success criteria (for example, early customer engagement as a leading indicator of long-term retention). In the short term, however, these indicators can easily be “gamed”: not necessarily intentionally or in bad faith, but in a way that myopically focuses on metrics rather than on customer and business value.

There is also the risk of falling into the trap of running tests that mostly prove “if we make a button bigger, more people will click it”. Stated that baldly, it is clearly a silly experiment (and also one that is true most of the time). In reality, though, the effect is often more subtle and needs to be managed carefully.

As an example, at 8fit we wanted to increase early retention. We knew that people who did their first workout were far more likely to be retained: workout activation was a good leading indicator for retention. So we tried to increase workout activation with a new screen at the end of the onboarding flow that nudged people to either start their first workout right away or schedule it for later (with a push notification as a reminder). The straightforward way to test this would have been to run an A/B test on workout activation. In that case, however, we would have fallen into the trap above. We had made “the button bigger”: by making the entry point to the workout more visible, we made it very likely that more people would start a workout. Even measuring workout completion (i.e., that someone didn’t just start the workout but actually made it to the end) would have been misleading: if more people started the workout, there was almost no chance we wouldn’t also see an increase in people finishing it (even if that increase was smaller). So instead we resorted to the lagging indicator and measured second-week retention to see whether there was a lasting impact. The trap can be avoided, but it needs special attention.
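For illustration, here is a sketch of how second-week retention per variant might be computed from a simple user table. The column names (user_id, variant, signup_date, last_active_date) and the crude “still active a week later” definition are my assumptions for the example, not 8fit’s actual schema or metric:

```python
# Sketch: second-week retention per A/B variant from a minimal user table.
# Columns and retention definition are hypothetical, invented for illustration.
import pandas as pd

users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "variant": ["control", "control", "control", "test", "test", "test"],
    "signup_date": pd.to_datetime(["2024-01-01"] * 6),
    "last_active_date": pd.to_datetime([
        "2024-01-03", "2024-01-10", "2024-01-13",   # control users
        "2024-01-02", "2024-01-09", "2024-01-12",   # test users
    ]),
})

# Crude proxy: a user counts as week-2 retained if they were still active
# at least 7 days after signing up
days_active = (users["last_active_date"] - users["signup_date"]).dt.days
users["week2_retained"] = days_active >= 7

print(users.groupby("variant")["week2_retained"].mean())
```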

All in all, relying exclusively on optimization techniques such as A/B testing means you risk falling into an optimization-only, incremental mindset, which is not going to help you take big, bold steps toward a novel vision.

Speed of Insight

To build something great, you have to move fast. Fast in responding to emerging and evolving customer needs. Faster than the incumbents, even if they are big and rich in resources. Faster than other potential competitors. Speed and agility are how a startup can beat even the competitors with the deepest pockets.

For optimization purposes, A/B testing can often be a very fast way to generate insights. Tweak the copy or the layout of a page, run the A/B test for a couple of weeks, and you know exactly whether the new version outperformed control.

As mentioned above, however, true innovation requires more than marginal optimization. It requires challenging fundamental assumptions and making more drastic changes.

For these more radical changes, A/B testing isn’t so fast anymore. The implementation time is often vastly longer, so a negative A/B test result wastes far more effort. You will also often change more than one aspect of the product at a time in order to build an experience that makes sense as a whole, which makes the A/B test results harder to interpret: in general, a “clean” A/B test varies only one aspect, so that causation can be clearly established.

Another challenge, especially in startups, is that early on there simply isn’t enough data. If you don’t have many customers yet, it can take a very long time to reach statistical significance in an A/B test.
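To put rough numbers on this, here is a sketch using statsmodels’ power calculations. The baseline 5% conversion rate, the 10% relative lift, and the 500 signups per day are all assumptions made up for the example:

```python
# Sketch: how many users are needed to detect a 5.0% -> 5.5% conversion lift
# at alpha = 0.05 with 80% power. All baseline numbers are invented.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.055, 0.05)   # Cohen's h for the two rates
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided",
)

print(f"~{n_per_arm:,.0f} users per arm")     # roughly 16,000 per arm

# At, say, 500 new users per day split across both arms, that is more than
# two months of waiting before the test can even be read out
print(f"~{2 * n_per_arm / 500:.0f} days at 500 users/day")
```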

For radical changes to the product, building very early prototypes and validating them with real customers is the much faster path to insight. If you rely on catching bad ideas further down the line in an A/B test, you will have wasted a lot of time and won’t be moving fast enough to build something truly great. A competitor doing discovery with early prototypes will have tested ten ideas in the time it took you to build out and A/B test one. With radically new approaches, both the risk and the reward tend to be much higher for each idea, so validating ideas as early as possible is paramount.

Refining Your Hypotheses

In general, A/B testing is not a great tool for refining your hypotheses. Again, it is a great tool for optimization once your product’s fundamental hypothesis has been validated and you want to get the flywheel spinning as fast as possible. But if you are still trying to validate your direction, you need richer insights than “this idea didn’t work”.

Qualitative validation methods provide you with these rich insights. They don't just tell you what happened, they also tell you why. They help you uncover what customer pain points you managed to address, which ones you didn't, and perhaps which new ones you inadvertently created. This is the most crucial input you can get in order to refine your product direction and iterate on your product offering.

A/B testing also requires quantifiable success criteria. In general, that’s not such a bad thing; after all, “you can’t improve what you can’t measure”, an adage often attributed to Peter Drucker, is timeless management advice. However, some aspects of a product often have limited measurable short-term benefits, yet ignoring them has negative consequences in the longer term. These include quality and user “delight”, but also privacy, safety, and accessibility.

Moreover, even quantifiable success criteria are much less well established for a novel product than for a mature one. Metrics are only ever proxies for the value being generated for the customer. Relying exclusively on A/B testing means substituting these proxies for actual customer value and for the product vision, and if the metrics are poor proxies for customer value, you will end up optimizing for the wrong thing.


If A/B testing is not the best way to achieve greatness, what should you do instead? A few things: have a clear vision and strategy. Put a lot of emphasis on product discovery: talk to customers, understand their needs and pain points, define areas of opportunity, and only then start coming up with solutions. Validate your solutions qualitatively using prototypes before you fully build them out.

When you've done all of that and have a novel solution that was thoroughly qualitatively validated, can you still A/B test it to see how it performs? Of course. There are a few aspects that I would pay attention to in those cases:

I hope this article was helpful. If it was, feel free to follow me on Twitter where I share thoughts and articles on product management and leadership.
