The Challenge of Data-Led Design

This is a condensed version of my talk for the Bulgaria Web Summit, 31st May 2014. I spoke without real notes so the following is simply the main points.

Introduction

How should a UX team respond to the results of randomised controlled trials of their designs? Such “A/B” and “multivariate” tests hold out the possibility of finding the best approach from a number of alternatives. But what does this mean for the practice of design itself? Hotels.com has had one of the most advanced experimentation programmes of any website in the world. As manager of the UX Design team there, it was my job to make sense of quantitative data and apply that to an overall design approach. What did we find out along the way? And in a business infused with numbers, what problems did we face as designers, and how did we show that traditional design methods were still valid?

What Is “Quantitative Testing”?

It’s important to understand what I mean by “testing”, known formally as “randomised controlled trials”.

First, you need to gather the data. In an “A/B test”, a random selection of web traffic is shown a “test variant”, while the rest of the traffic sees the “control variant” – the current design you are testing against, the performance of which is already known. “Multivariate” tests are more sophisticated, and let you isolate combinations of variations within random samples, but the purpose is the same.
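
As an aside, here is a minimal sketch of how such a random split can be made deterministic, so a returning visitor always sees the same variant. The hashing scheme, visitor IDs and test names are purely illustrative, not how hotels.com actually bucketed traffic.

```python
import hashlib

def assign_variant(visitor_id: str, test_name: str, test_share: float = 0.5) -> str:
    """Deterministically bucket a visitor into 'test' or 'control'.

    Hashing the visitor ID together with the test name gives a stable,
    effectively random assignment, so the same visitor always sees the
    same variant for the lifetime of the test.
    """
    digest = hashlib.sha256(f"{test_name}:{visitor_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "test" if bucket < test_share else "control"

print(assign_variant("visitor-12345", "bigger-prices"))  # hypothetical IDs
```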

Get the T-shirt: you need to be confident it’s not just random

Next, you need to analyse the results by establishing three main things: the metric you are testing for (in our case it was usually conversion: people buying things); the amount of traffic participating in the test (the “sample size”); and the “confidence level” you are willing to accept (typically 95%). The latter two factors must produce a “statistically significant” result to reject the “null hypothesis”. This means you have a result in which you are 95% certain that the effect you saw was not just down to random variation.
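
To make the arithmetic concrete, here is a minimal sketch of one common way to check significance for a conversion metric, a two-proportion z-test, on entirely made-up numbers. It illustrates the principle rather than the analysis pipeline we actually used.

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # via the normal CDF

# Entirely made-up numbers: 100,000 visitors per variant
p = two_proportion_z_test(conv_a=3500, n_a=100_000, conv_b=3650, n_b=100_000)
if p < 0.05:
    print(f"p = {p:.3f}: reject the null hypothesis")
else:
    print(f"p = {p:.3f}: could just be random variation")
```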

Once you have your results, you can start putting more traffic into the winning variant until that becomes the control for another test. And so it continues.

It’s worth noting that getting to a statistically significant result is surprisingly hard to do. Even an operation as large as hotels.com found it tricky to get enough data for some parts of the site. We often needed to run tests for weeks or months (and in one case two years!) to achieve that.
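
A rough power calculation shows why: the smaller the uplift you want to detect reliably, the more traffic you need. The baseline rate and uplift below are invented purely to illustrate the arithmetic.

```python
from math import ceil

def sample_size_per_variant(baseline: float, relative_uplift: float) -> int:
    """Approximate visitors needed per variant to detect a relative uplift
    in conversion rate at 95% confidence with 80% power (normal approximation)."""
    z_alpha, z_power = 1.96, 0.84
    p1, p2 = baseline, baseline * (1 + relative_uplift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Detecting a 2% relative uplift on an (invented) 3.5% baseline conversion rate
print(f"{sample_size_per_variant(0.035, 0.02):,} visitors per variant")
# Over a million visitors per variant at these numbers, before any segmentation
```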

Tools such as Optimizely and Google Analytics A/B testing have made it easy to test sites in this way over the last few years. Results are sometimes published and discussed. Sadly, however, when you look at the data behind these tests (if such data is provided at all), very often the analysis does not reject the null hypothesis. They are probably just observing random variation.

Hotels.com: Some Examples

At this point I could list a bunch of tests we did, together with their results, and hope that would be entertaining. However, I strongly suspect that these tests are extremely context dependent. What might win on hotels.com might do nothing, or worse, on another site. I say this because when I’ve looked at test results published by Google, Etsy and others who tested the same things as we did from time to time, their results were often completely different from ours. Make of that what you will!

Instead, I will describe some general phenomena in testing that might be more useful if you find yourself in a similar situation to the one I was in.

The first thing to note is that bugs are an Achilles’ heel. The more complex the design under test, the more likely bugs are to foul it up by affecting one or more variants – assuming you find them while running the test! It took us the best part of two years, for example, to test for the best booking form approach (3-step and 1-step variants against a 2-step control).

Unexpected interconnected effects can also shake your faith in what’s happening. We tested a speed increase for our search results – spending a great deal of effort in optimisations and upgrades to get the site to respond in less than 2 seconds when it used to respond in about 4. After running the test on it, we found no observable effect on conversion. But we rolled it out to more customers none the less…

At the same time, we tested an increase in the number of results returned. Again, no observable effect between 15, 35 and 50 results. So we pushed the 50 results variant anyway (perhaps it might be better for SEO, we thought)…

Imagine our surprise when we realised that customers who got both the fast (2-second) variant and the 50-results variant converted more, and by a large margin at that!

While this was initially a pleasant realisation, we later wondered if the opposite could happen. Could tests that had proved initially positive or flat on their own later prove negative in combination? A chill ran down our spines at that thought.
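
Checking for this kind of interaction means re-cutting results by the combination of concurrent tests a visitor was exposed to, rather than by each test alone. A toy sketch of such a re-cut, on invented per-visitor records, might look like this:

```python
from collections import defaultdict

# Invented per-visitor records: which variant of each concurrent test the
# visitor saw, and whether they converted.
visitors = [
    {"speed": "fast", "results": 50, "converted": True},
    {"speed": "fast", "results": 15, "converted": False},
    {"speed": "slow", "results": 50, "converted": False},
    {"speed": "slow", "results": 15, "converted": False},
    # ... many more rows in reality
]

def conversion_by_combination(rows):
    """Group visitors by the combination of variants they saw, so that
    interaction effects between concurrent tests become visible."""
    tally = defaultdict(lambda: [0, 0])  # combination -> [conversions, visitors]
    for row in rows:
        key = (row["speed"], row["results"])
        tally[key][0] += row["converted"]
        tally[key][1] += 1
    return {key: conv / total for key, (conv, total) in tally.items()}

for combo, rate in conversion_by_combination(visitors).items():
    print(combo, f"{rate:.0%}")
```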

My final examples are chosen to illustrate the next part of my talk – how we came to a way of responding to the results of quantitative testing without compromising our UX design ideals.

“Urgency messaging” involves various forms of UI that essentially nag or bully the customer into buying by instilling a sense of urgency in them. “Only 2 rooms left!”, “4 people booked this hotel in the last hour!”, and so on.

We weren’t the first to do this – we saw it on our rivals’ sites and so decided to copy them. Sure enough, it produced a healthy uplift. But it didn’t sit well with us. How could such an annoying experience entice people to buy? We decided to cross-check our quantitative findings with some qualitative research. Putting some controlled “urgency” messages of various types into a couple of user research sessions showed us that people booking a hotel in the near future were better disposed to the messages. These customers with “short booking windows” reacted very differently from those who were planning to book further in the future, or were still trying to decide between destinations. The former described the messages with words like “helpful” or “reassuring”, while the latter group, with long booking windows, hated them just as we did.

This gave us some actionable data. If the analysts re-cut the original test results to find that booking window affected the conversion rate, perhaps we could get an even bigger uplift by reducing the frequency of the messages for those on long windows (so as not to annoy them into leaving the site without converting), while increasing them for those on short ones. It was a small insight, but a valuable one none the less.
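
To make that re-cut concrete, here is a toy sketch of the kind of analysis we had in mind, comparing the urgency variant against the control separately for short and long booking windows. The records, numbers and seven-day threshold are all invented.

```python
from collections import defaultdict

# Invented records from an urgency-messaging test: variant seen, how far
# ahead the visitor was booking, and whether they converted.
records = [
    {"variant": "urgency", "window_days": 2, "converted": True},
    {"variant": "control", "window_days": 3, "converted": False},
    {"variant": "urgency", "window_days": 45, "converted": False},
    {"variant": "control", "window_days": 60, "converted": True},
    # ... many more rows in reality
]

def uplift_by_booking_window(rows, short_window_days=7):
    """Conversion uplift of 'urgency' over 'control', split by booking window,
    to see whether an overall uplift hides opposite effects in each segment."""
    tally = defaultdict(lambda: [0, 0])  # (segment, variant) -> [conversions, visitors]
    for r in rows:
        segment = "short" if r["window_days"] <= short_window_days else "long"
        tally[(segment, r["variant"])][0] += r["converted"]
        tally[(segment, r["variant"])][1] += 1
    rates = {key: conv / total for key, (conv, total) in tally.items()}
    return {seg: rates[(seg, "urgency")] - rates[(seg, "control")]
            for seg in ("short", "long")}

print(uplift_by_booking_window(records))  # e.g. {'short': 1.0, 'long': -1.0}
```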

The next test (which we later called “Texas pricing”) was initially a mistake. A developer altered the size of text used to display prices on the search results, making it much bigger than intended. It wasn’t a test at the time, but we noticed conversion improving after it happened. So after some investigation to determine the likely cause of the uplift, we ran the large pricing variant as a test. It gave us the biggest uplift of any single intervention we’d ever done.

Not surprisingly, this left us scratching our heads. How could something as simple as an increase in text size produce such a startlingly lucrative lift in sales?

Urgency messaging on the left here, and “Texas pricing” on the far right.

Getting to “Why?”

After a few years and many tests, we began to crave some explanation for the results we were seeing.

When I was looking for material for this talk, I discovered almost the perfect description of how we felt at the time. It’s a quotation from a Facebook employee, overheard in a cafe in Silicon Valley:

We’re blind. … Everything must be tested. It’s feature echolocation: we throw out an idea, the data comes back we look at the numbers. Whatever goes up, that’s what we do. … We don’t operate around innovation. We only optimize. We do what goes up.

I don’t know what that employee did to cure his angst, but we started digging. The first guiding concept we came up with was the idea of the “local maximum problem”.

The local maximum problem – like climbing a mountain in the fog.

Fundamentally, you either need a map, or an intuitive leap to find the real goals. So we kept thinking: given that we can test anything we want, and recognising that we all have opinions about what motivates customers, how do we find the other, higher peak?

Now, we were just designers, so I don’t want to over-dignify our next move, but it was hard to avoid the conclusion that in these circumstances, you use the scientific method.

The Scientific Method (Internet meme version)

We therefore set out to progressively strengthen or weaken our hypotheses about what makes people book hotels. We had the power to test anything, and now we had control of that power to arrive at business-proprietary knowledge. No longer were we throwing out just any old idea, but ideas that got us nearer the truth.

So this is what we did. First we observed something interesting (like large prices doing good things for conversion), then we wondered why that might be. So we came up with a hypothesis for it (perhaps large text conveyed honesty in some way?). We would then predict that, if our hypothesis were correct, making something else bigger that we would normally hide away should also help. How about making a building works notice, or a cancellation policy, bigger? So we’d come up with a design for that, and put it to the test.

But that’s not all. We realised too that a negative result was just as valuable for our improved understanding (if not for the business’s bottom line!) as any other. In which case we could start to think about how we would evolve our design against our hypothesis should the design do better, worse, or remain unchanged against the control. It was like playing chess: thinking a move or two ahead. And the customer was our opponent!

Thinking through to the possible results and how to respond to them in each case.

We even created a database in which to store our hypotheses, and invited anyone in the company to add to it. We would then attach positive or negative findings to these hypotheses as we discovered them in either quantitative or qualitative testing. So in this way we would strengthen or weaken these assumptions over time. We had about 50 in all.
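
The real database was an internal tool; as a purely illustrative guess at its shape, a minimal hypothesis register might look something like this:

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """A statement about customer behaviour, plus the evidence gathered
    for and against it from quantitative and qualitative testing."""
    statement: str
    evidence: list = field(default_factory=list)  # (source, supports, note)

    def record(self, source: str, supports: bool, note: str = "") -> None:
        self.evidence.append((source, supports, note))

    def strength(self) -> int:
        """Crude running score: supporting findings minus contradicting ones."""
        return sum(1 if supports else -1 for _, supports, _ in self.evidence)

register = [
    Hypothesis("If they trust us, they will buy from us."),
    Hypothesis("People know where they want to go."),
]
register[0].record("Texas pricing A/B test", supports=True,
                   note="big, unmissable prices read as honest and lifted conversion")
print(register[0].statement, register[0].strength())
```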

But It Ain’t Easy (in fact it’s pretty difficult)

However, constructing hypotheses was hard. I was also surprised by how many people confused predictions with hypotheses. “My hypothesis is that blue buttons will do better”, or “The funky banner will drive more sales”, are not hypotheses; they are predictions. Even “Infinite scroll is cool” or “Most people don’t like popup windows” are not good hypotheses, because you can’t do much with them.

Predicting the results of a test in terms of the test itself. Fail.

Instead, you need good hypotheses that can be tested multiple times in multiple ways. You need something that you can refine over time and that stimulates different approaches.

Some “fruitful” hypotheses might be:

You cannot show too much information.
People know where they want to go.
If they trust us, they will buy from us.

And there are still problems…

I wish I could say we solved all the problems we encountered. But despite our efforts, various issues relating to the practice of quantitative testing on a large and complex website still remained.

Not surprisingly, shortcutting for immediate results was the main problem: the practice of blindly copying what competitors were doing without attempting to understand what was going on. We called this a “cargo cult”.

Perhaps a more serious issue with the whole idea of randomised controlled trials for websites was the lack of back-testing. Uplift from tests in monetary terms isn’t money actually in the bank. To determine that, you need to apply accounting measures that are probably impossible, or at least extremely hard, to do. I’ve written about this issue before – perhaps I’m the only one.

There was also the phenomenon of people not believing results. Being 95% certain still means a 1:20 chance of it being random after all. And to an extent, playing fast and loose with the maths, machismo, politics, all the familiar foes were there … People being people.

So add to all that the aforementioned interconnected effects, which put the long-term (or even short-term) results of tests into doubt, and things were less than ideal.

So what does it all mean?

I’m in two minds. The designer in me wants to know why, not just what, and rejects dry numbers. We had user research of course, but it’s slow, messy and lacks immediate results. Confronting A/B tests with the scientific method felt like fighting fire with fire, and it was liberating: speaking your mind was OK, because you were saying it in the knowledge that it could be tested.

Even allowing for the problems I’ve revealed to you about quantitative testing, I can say that it is worth doing. But not for the reasons you think. You are not going to find the truth. You are not going to be right. That’s not what this is really about. The best you can hope for is to be less wrong.

And in being less wrong, you can have fun thinking creatively about the people you’re designing for and the statistics that represent them. You can become a better designer because of that.

— ooOoo —

Addendum

One observation about my experience at hotels.com that I didn’t mention in the talk due to time constraints was that in the first two or even three years of our testing programme, I only knew of a single test that had lost against the control. All our tests on new designs proved positive. If I’m honest, I’d say that was because the site was poorly designed in the early years, so improvements were easy! But was it simply luck, or flawed analysis perhaps? Most likely it was granularity. By the time I left, testing very small increments was the norm, whereas large changes were the rule in the early days. For such small changes, the majority of tests fail to reject the null hypothesis.