Google’s £200M A/B Test Cobblers

The story – now passed into minor Internet legend – of Marissa Mayer’s testing of 41 shades of blue in 2009 (and the resignation of Google’s Visual Design Lead, partially because of this) has been referred to again this week. The Guardian reports that Google UK’s managing director Dan Cobley says that the winning shade of blue made Google £200 million.

The story touches on several interesting topics. The first is the idea that Google, and potentially other large sites like Facebook and Amazon, can harness large amounts of traffic to conduct randomised controlled trials on UX-related elements of their sites. This in turn implies they might arrive at the best (for which read “most profitable”) design without the aid of traditional designers. The second issue is how to respond to the results of such testing, and the third is whether such testing is worth doing at all. The latter two topics are in my view very important yet hardly considered in the design world. And I have a lot to say about those things, should anyone ask.

Be that as it may, I think it’s worth pointing out that the £200M assertion is completely unprovable, and may well be bogus.

Revenue uplift from an A/B test is given as a theoretical amount based on the results of the test, not the amount you actually get in the bank. So if your winning variant achieved a one basis point (0.01%) increase over the control, and your monthly gross revenue until then was £1 million, then you have “made” an extra £100 a month compared with not introducing the change. However, isolating this value (much less back-testing it against your actual revenues) is impossible in the case of something as tiny as a variation in a link colour. This is because you cannot account for the almost infinite number of factors that change from the moment you complete the test. For all Google knows, the winning shade of blue might have started losing after they introduced another change to the page the day after. Or perhaps users have since become used to the new shade, and the old shade would now do better. The list of possible known and unknown variations is huge, to say nothing of the implications for the further testing of other elements.
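To make the arithmetic concrete, here is a minimal sketch of how such a claim is usually arrived at. All the figures are hypothetical (they simply mirror the £1 million / one basis point example above); the point is that the “extra revenue” is an extrapolation from the test result, not money observed in the bank afterwards.

```python
# Sketch of how a theoretical A/B-test uplift is typically quoted.
# Figures are hypothetical, mirroring the example in the text.

baseline_monthly_revenue = 1_000_000.0  # £1M gross revenue before the change
relative_lift = 0.0001                  # one basis point (0.01%) win for the variant

theoretical_uplift = baseline_monthly_revenue * relative_lift
print(f"Claimed extra revenue: £{theoretical_uplift:,.2f} per month")
# -> Claimed extra revenue: £100.00 per month

# Nothing here back-tests the claim against actual takings, and nothing can:
# every other change to the page, to user behaviour and to the market moves
# at the same time, so the £100 can never be isolated after the fact.
```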

This is also why you will find no use of A/B test wins in revenue forecasting. The best you can say about positive test results is that they are likely to have contributed to rising revenue, assuming that revenues rose overall. Quite what it means if overall revenues fall in the face of successful testing is anyone’s guess – until you apply some UX thinking, at least.

Add to this my completely unfounded, but really quite confident suspicion that most online A/B tests are fatally flawed, and I think you can say that Mr Cobley is talking cobblers.