You've (hopefully) heard many things about A/B testing. People rave about it, and with good reason: It provides you with a solid framework on which you can evaluate how a change you've made has affected your metrics. Having science backgrounds, we're big believers in metrics here at historious, and we think that any business decision should be based, at least in part, on those metrics.
In case you're unfamiliar with A/B testing, here's a small rundown (it might get a bit technical, but please follow through):
You have two different versions of a page, one with a change and one with no change. You want to see if you should actually make that change, so you send half your visitors to the first page and half to the second. You monitor how many of the visitors on each page perform an action (say, sign up for your service), and then you have the signup rate for the first page (the changed one) and the rate for the second page (the unchanged one). Whichever page has the higher signup rate is the one you need to use.
How can you rule out randomness, however? If you show one person the first page and he signs up, and show one person the second page and he doesn't sign up, does this mean that the first page is better? Clearly not, since it might be pure luck. Therefore, we need a way to tell whether the difference in rate is statistically significant or not.
For this reason, we have a special formula that can tell us if the difference is statistically significant. You enter how many people signed up on the first variation, how many visitors saw it in total, and the same for the second variation, then magic happens and you get a percentage that tells you how confident you can be that the difference between the two pages is real. If you get, for example, 90%, it means that, had the two pages actually performed the same, a difference this large would have shown up by pure luck only about one time in ten.
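As a rough sketch of what that "magic" might look like, here is a pooled two-proportion z-test in Python. We're assuming this formula purely for illustration; the function name ab_confidence is made up, and your testing software may well compute its percentage differently.

```python
from math import sqrt
from statistics import NormalDist


def ab_confidence(signups_a, visitors_a, signups_b, visitors_b):
    """Return the two-sided confidence (0..1) that the two signup rates differ.

    Assumed method: pooled two-proportion z-test with a normal approximation,
    so it is only trustworthy once each page has a decent number of visitors.
    """
    rate_a = signups_a / visitors_a
    rate_b = signups_b / visitors_b
    # Pool both pages to estimate the signup rate under "no real difference".
    pooled = (signups_a + signups_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if std_err == 0:
        return 0.0  # nobody (or everybody) signed up: nothing to compare
    z = abs(rate_a - rate_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(z))
    return 1 - p_value


if __name__ == "__main__":
    # Hypothetical numbers: 100 of 1000 signed up on one page, 130 of 1000 on the other.
    print(f"{ab_confidence(100, 1000, 130, 1000):.1%}")
```

With those hypothetical numbers the confidence comes out at about 96%, which would not have cleared a 98% bar.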
Since we wanted to minimise the chance of changes making our user experience worse, we decided to go for over 98%. This way, luck alone was supposed to fool us only once in fifty times.
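You can sanity-check the once-in-fifty figure with a quick simulation. The sketch below runs many pretend tests in which both pages have exactly the same true signup rate, and counts how often the confidence still ends up above 98%. Everything in it is an assumption made for illustration: the 10% signup rate, the 1000 visitors per page, the function names, and the z-test used to compute the confidence.

```python
import random
from math import sqrt
from statistics import NormalDist


def confidence(signups_a, visitors_a, signups_b, visitors_b):
    """Two-sided confidence from a pooled two-proportion z-test (assumed method)."""
    pooled = (signups_a + signups_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if std_err == 0:
        return 0.0
    z = abs(signups_a / visitors_a - signups_b / visitors_b) / std_err
    return 1 - 2 * (1 - NormalDist().cdf(z))


def fluke_rate(true_rate=0.1, visitors=1000, threshold=0.98, runs=2000, seed=42):
    """Fraction of identical-page tests that still clear the threshold by luck."""
    rng = random.Random(seed)
    flukes = 0
    for _ in range(runs):
        # Both pages share the same true signup rate, so any "winner" is noise.
        a = sum(rng.random() < true_rate for _ in range(visitors))
        b = sum(rng.random() < true_rate for _ in range(visitors))
        if confidence(a, visitors, b, visitors) > threshold:
            flukes += 1
    return flukes / runs


if __name__ == "__main__":
    print(f"{fluke_rate():.1%} of identical-page tests looked significant")
```

The fraction should land somewhere near 2%. Note that this only holds when each test is checked once, after a number of visitors fixed in advance.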
We made various decisions this way, e.g. whether to include a presentation of the site's functionality (we decided against it, as it lowered the signup rate), the order of the calls to action, etc. After a month or so of running A/B tests, we decided to test the A/B testing software itself, to verify that everything was actually correct.
To do this, we had two versions of the page that were exactly the same. This should produce no difference in signups, and thus no confidence in the result.
We left this test running for three days, and then came back to see the results. A few thousand visitors had entered the test, and the results were clear: The "changed" page had improved signups by a whole 30%, with 99.8% confidence!
This was an astounding result! We had effectively made visitors 30% more likely to sign up by doing nothing!
It didn't make any sense, however, so we started checking everything: whether the second page actually had a change we hadn't noticed, whether the software biased visitors toward a page, whether there was a bug in the calculation code. We came up with nothing.
Everything was correct, yet, with 99.8% certainty, the variation was better than the original by a lot! Since there was no reasonable explanation for this, we decided to keep the test running and see if the trend continued.
Keep in mind that, with 99.8% confidence, we were well past our certainty threshold, and, were the two pages not the same, we would have definitely used the variation. Now, though, we didn't know what to think. Had our precious A/B testing methods been lying to us all along? Had we made some grave mistake? We searched high and low and asked on forums about possible mistakes we could have made, but found nothing. Surely we hadn't hit the one-in-five-hundred case?
After another three days of running the test, the rates had reverted much closer to 50%, and confidence was a meagre 10%. This, at least, let us heave a sigh of relief, secure in the knowledge that we hadn't just disproved the whole of statistics and mathematics.
How many decisions, though, did we make that were wrong? If 99.8% confidence wasn't enough, what about the changes we made based on 98% confidence?
The important lesson here, and the one you should take away from our experience, is this: Whenever you think you have enough data for the A/B test, get more! Sometimes, you will fall into that 0.2%, and your decision will be wrong, and it might impact your metrics adversely, and you might never find out.
Generally, the smaller the effect of a change, the more data you are likely to need in order to gain a good degree of confidence. For small changes, you might reach a confidence of over 95% at some point, but it would be wise to gather some more data, as that still leaves a one in twenty chance that the difference you're seeing is pure luck.
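To get a feel for how quickly the required data grows as the effect shrinks, here is a sketch based on the standard textbook sample-size approximation for comparing two proportions. It is an estimate under assumptions, not a rule: the function name visitors_needed is made up, and the power parameter (the chance of detecting a real effect of the given size, set here to 80%) is something we're introducing for the calculation.

```python
from math import ceil, sqrt
from statistics import NormalDist


def visitors_needed(base_rate, relative_lift, confidence=0.98, power=0.8):
    """Approximate visitors needed PER PAGE to detect a relative lift.

    base_rate:     signup rate of the original page, e.g. 0.10
    relative_lift: improvement you hope to detect, e.g. 0.30 for +30%
    confidence:    two-sided confidence threshold you will require
    power:         assumed chance of detecting the lift if it is real
    """
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Hypothetical 10% base signup rate, with a large and a small lift.
    for lift in (0.30, 0.05):
        print(f"+{lift:.0%} lift: about {visitors_needed(0.10, lift)} visitors per page")
```

With a hypothetical 10% base signup rate, detecting a 30% lift takes a couple of thousand visitors per page, while a 5% lift takes tens of thousands.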
The good news is that, if the tests are independent of one another (i.e. the probability of a user being in test A is independent of the probability of the user being in test B), you can run multiple tests on the same page and still get good data. Of course, you might have biases in the actual test, e.g. the two changes might work very well together but very badly each on its own, which would skew the data. Generally, though, you should be able to run two tests at the same time. This is especially useful when you're starting out and don't yet have the thousands of daily visitors that would provide you with lots of data.
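One way to get that independence, sketched below, is to pick each user's page by hashing the user's ID together with the name of the test. This is an assumed approach for illustration rather than a description of our own software; the function name variant_for and the test names are hypothetical.

```python
import hashlib
from collections import Counter


def variant_for(user_id, test_name, variants=("original", "variation")):
    """Deterministically assign a user to one variant of one test.

    Mixing the test name into the hash means a user's bucket in one test
    tells you nothing about their bucket in another test.
    """
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    # Two hypothetical tests running at once on the same page.
    combos = Counter(
        (variant_for(uid, "signup_button"), variant_for(uid, "headline"))
        for uid in range(10000)
    )
    for combo, count in sorted(combos.items()):
        print(combo, count)
```

If the assignments are independent, each of the four combinations should receive roughly a quarter of the users, and a returning user always sees the same pages.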
We hope we have given you more insight into what constitutes a good A/B test, so you can improve your process and not repeat our mistakes. Thanks for reading, and have fun!