A Funny Thing About A/B Tests

I was having a look today at this question posted on Quora: “What are the most unexpected things people have learned from A/B tests?”. The writer clearly expects answers about specific tests, but a couple of people have referred to the surprising behaviour of the people who run, or react to, the tests themselves.

I think it is notable that people conducting A/B or multivariate (MVT) tests very often don’t seem to understand what to do with the results they get. Too often, results are interpreted inappropriately, or pressed into service as excuses to play fast and loose with the facts.

Take this presentation by Google’s Marissa Mayer. Google observed that the more search results they provided on the page, the higher the subsequent latency. From this they concluded that with higher numbers of results, people were less willing to wait for the results to appear, and hence performed fewer searches.

To be fair, this is a presentation given by a senior Google person who probably had nothing to do with the conduct of the actual test. It is also a plug for Google’s A/B testing platform. However, anyone with any understanding of research methods or scientific inquiry would not be satisfied at all by the conclusion she cites without corroboration of the latency finding. That is, you would need to test the same number of results at several different latencies to establish whether latency itself suppresses searching. Even then, you’d need to design a third test to make sure that the higher numbers of results were in fact being affected by the latency issue before you could justify spending bazillions on lowering that latency.
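The corroborating design described above is essentially a small factorial experiment: hold one factor fixed while varying the other, across every combination, so the two effects can be separated. A minimal sketch of such an assignment scheme follows; the factor levels and the bucketing helper are hypothetical, not anything Google has described.

```python
import hashlib
import itertools

# Hypothetical factor levels: result count and artificially added latency.
# Crossing them gives 9 cells, so latency varies independently of result count.
RESULT_COUNTS = [10, 20, 30]
ADDED_LATENCY_MS = [0, 200, 400]

VARIANTS = list(itertools.product(RESULT_COUNTS, ADDED_LATENCY_MS))

def assign_variant(user_id: str) -> tuple[int, int]:
    """Deterministically bucket a user into one experimental cell.

    A stable hash (not Python's built-in hash(), which is salted per
    process) ensures the same user always sees the same variant.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]
```

With per-cell search counts in hand, you could then compare cells that share a result count but differ in latency, which is exactly the comparison the original test never isolated.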

Note also that Mayer does not go on to address the obvious question her conclusion raises: what latency is associated with the most searches? How do they know the control variant isn’t itself optimal in that regard? This is of course the “local maximum” problem – unaddressed as usual – and pretty much a dereliction of duty if you’re conducting test-driven design.

I’m not saying Google didn’t consider these things. However, I am often very surprised by people’s willingness to accept unfounded conclusions from transparently flawed research. I’ve written about this in the past, but it really doesn’t seem to get any better. Statistical insignificance; confounding factors; a reluctance to corroborate; a willingness to accord far too much respect to a single finding; a lack of imagination about what further tests to run. These and other methodological failures are rampant in the UX design world.
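The statistical-insignificance complaint above is the easiest of these failures to guard against, since the check costs one formula. As an illustration, here is a standard two-proportion z-test written with only the standard library; the conversion counts in the usage note are invented numbers.

```python
from math import erf, sqrt

def two_proportion_z(conversions_a: int, n_a: int,
                     conversions_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided z-test for a difference between two conversion rates.

    Returns (z statistic, p-value). A large p-value means the observed
    difference is entirely consistent with random noise.
    """
    p_a = conversions_a / n_a
    p_b = conversions_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF (via erf).
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value
```

For example, 120 conversions out of 1,000 versus 100 out of 1,000 looks like a 20% lift, but the test returns a p-value around 0.15 – nowhere near conventional significance, and exactly the kind of “win” that gets shipped anyway.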

If you have an interest in A/B and MVT, don’t you owe it to yourself to take an equal interest in defensibly interpreting the results?