Sep 07, 2011

A/B Testing: It's Not All Bad

Rachel Elkington
User Experience Architect

A colleague of mine came across a great article on A/B testing and shared it on our internal blog. A/B testing sometimes gets a bad rap in the design community because it is misunderstood. When used correctly, it’s an unstoppable form of awesomeness. When applied to the wrong problem, or wielded clumsily by people who don’t understand the difference between correlation and causation, it is destructive.

Here are my thoughts on each of the points from the article:

1. A/B testing may only be as effective as the designs being tested, which may or may not be high quality solutions. Users are not always the best judge of high quality design. That’s why you hire expert designers with seasoned skills, experience, judgment, and, yes, the conviction to make a call as to what’s better overall.

To each problem its method, I say. Testing is an evaluative method, not a generative one. Meaning, it will not generate the next disruptive design idea on its own. It’s not meant to judge high-quality design. It is meant to show which of the possible alternatives for a particular place in a digital experience will lead to an outcome that is desired by the business.

2. As is true with any usability test, you gotta question the motives behind the participants’ answers/reactions. Instead, biz/tech folks look at A/B test results as “the truth” rather than a data point to be debated. Healthy skepticism is always warranted in any testing. Uncovering the rationale for a metric is vital.

Absolutely. Quantitative testing tells you what users will do, and what they prefer, in statistically significant ways. It can also uncover statistically significant differences between behavioral segments, that is, the preferences of different user groups. That’s a powerful tool.

It’s attractive to business types because a money value can be put on test results, and further features or initiatives can be prioritized in an absolute framework because of that count-ability. It’s not a panacea, as we know, because testing will not tell you WHY users do what they do. Combining qual and quant, and knowing when to use each, is a sophisticated way to conduct testing and research. It’s the space I want Hot to be in.
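To make “statistically significant” a little more concrete, here is a minimal sketch of a two-proportion z-test on invented conversion counts. The numbers and variant labels are hypothetical, not from any real test:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical results: visitors and conversions for each variant.
visitors_a, conversions_a = 10_000, 420   # control
visitors_b, conversions_b = 10_000, 505   # challenger

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b

# Pooled proportion under the null hypothesis that A and B convert equally.
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))

z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed

print(f"A: {p_a:.2%}  B: {p_b:.2%}  z = {z:.2f}  p = {p_value:.4f}")
# A p-value below the threshold chosen up front (say 0.05) is evidence of
# a real difference in behavior -- it still says nothing about WHY.
```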

3. A/B testing is typically used for tightly focused comparisons of granular elements of an interface, resulting in poor pastiches with results drawn from different tests.

That’s what I call the “ransom note” effect. You end up with sites that perform somewhat better than they did before testing but look ugly, as if cut and pasted together from different styles and approaches. It’s a terrible trend caused by testers who don’t know how to set up the theoretical foundations of a test properly. Hypothesis-driven testing makes sure that doesn’t happen: testing according to the scientific method keeps you from conflating correlation with causation, so you can understand which pieces of the site are causing which outcomes.
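To make “hypothesis-driven” concrete, here is a minimal sketch of how a test might be written down before any traffic is split. The fields and example values are hypothetical, not any particular tool’s format:

```python
from dataclasses import dataclass

@dataclass
class TestHypothesis:
    """One test = one causal claim, stated before the test runs."""
    change: str           # the single thing being varied
    rationale: str        # why we believe the change matters
    primary_metric: str   # the one metric that decides the test
    expected_effect: str  # direction and rough size we expect

checkout_button_test = TestHypothesis(
    change="Move the primary CTA above the fold on the product page",
    rationale="Session recordings suggest many users never scroll to it",
    primary_metric="add-to-cart rate",
    expected_effect="relative increase of at least 3%",
)
# Because the hypothesis names one change and one metric up front,
# a win (or a loss) can be attributed to that change. No ransom note.
```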

4. How do you A/B test novel interaction models, conceptual paradigms, visual styles (by the way, visuals & interactions have a two-way rapport, they inform each other, can’t separate them–see Mike Kruzeniski’s talks) which may vary wildly from before? Would you A/B test the Wii or Dyson or Prius or iPhone? Against what???

This is a complicated question. Sometimes what we see in testing is that a new version will outperform an old one just because it is new, and its performance will eventually level out. There is a lot to consider when attributing the results of even the most well-constructed test.
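One rough way to watch for that novelty effect is to look at the lift week by week rather than only in aggregate. A toy sketch with invented weekly numbers:

```python
# Hypothetical weekly conversion rates for control (A) and the new design (B).
weekly = [
    # (week, conv_rate_a, conv_rate_b)
    (1, 0.040, 0.052),
    (2, 0.041, 0.049),
    (3, 0.040, 0.045),
    (4, 0.039, 0.042),
]

for week, a, b in weekly:
    lift = (b - a) / a
    print(f"week {week}: lift = {lift:+.1%}")

# If the lift shrinks steadily toward zero, some of the early "win" was
# probably novelty rather than a durable improvement -- run the test
# longer before declaring a winner.
```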

A/B testing is not the proper method to test a Prius or an iPhone, or a conceptual paradigm, just as stakeholder interviews are not the best way to see how a shopping cart could perform better. To each problem its method.

5. A/B testing locks you into just two comparative options, an exclusively binary (and thus limited) way of thinking. What about C or D or Z or some other alternatives? What if there are elements of A & B that could blend together to form another option? Avenues for generative design options are shut down by looking at only A and only B.

That’s a good case for multivariate testing, which is a standard method in optimization practice. Most good optimization tools offer it, and it measures the combined effects of many different permutations of elements, often using a rad experimental-design technique called the Taguchi method, which relies on orthogonal arrays so you don’t have to run every permutation. Worth mentioning is that you wouldn’t want to use this in a generative process. It’s always and only evaluative.
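As a toy illustration of what that buys you: three page elements with two variants each would take eight cells to test exhaustively, while a Taguchi L4 orthogonal array covers the same three factors in four runs. The element names below are made up:

```python
from itertools import product

# Hypothetical page elements, each with two variants.
elements = {
    "headline":   ["benefit-led", "feature-led"],
    "cta_color":  ["green", "orange"],
    "hero_image": ["product", "lifestyle"],
}

full_factorial = list(product(*elements.values()))
print(len(full_factorial), "combinations in a full factorial test")  # 8

# Taguchi L4 orthogonal array for three two-level factors: every pair of
# columns contains each level combination equally often.
l4 = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

for run in l4:
    combo = {name: options[i]
             for (name, options), i in zip(elements.items(), run)}
    print(combo)
# Four carefully chosen cells let the main effect of each element be
# estimated without running every permutation.
```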

6. Finally, A/B testing can undermine a strong, unified, cohesive design vision by just “picking what the user says.” A designer (and team) should have an opinion at the table and be willing to defend it, not simply cave in to a simplistic math test for interfaces.

Indeed! If we cared only about conversion, or some other measurable metric, many sites would be hideous, covered with CTAs, dancing babies, and flashing “buy now” buttons. A team must decide which decisions are open to testing within a framework of larger goals for the product or service. That said, I’ve run tests where the highest-performing variant is the one that wantonly breaks the style guide. We should at least know what the style guide is costing, and then decide whether and how to stick by it.
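“Knowing what the style guide is costing” can be as simple as putting the measured lift of the rule-breaking variant next to the traffic it would see. A back-of-the-envelope sketch with invented numbers:

```python
# Hypothetical figures -- substitute your own traffic and test results.
monthly_visitors = 500_000
conversion_on_brand = 0.030    # measured rate of the style-guide variant
conversion_off_brand = 0.034   # measured rate of the rule-breaking variant
average_order_value = 60.00    # dollars

extra_orders = monthly_visitors * (conversion_off_brand - conversion_on_brand)
monthly_cost_of_style_guide = extra_orders * average_order_value

print(f"Sticking with the style guide costs about "
      f"${monthly_cost_of_style_guide:,.0f} per month in this scenario.")
# Whether that is worth paying for brand coherence is a design decision,
# not something the test can answer.
```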

See what 37signals has to say about their experience with A/B testing.
