The CRM-specific challenges of A/B Testing & Statistical Significance
I’m writing this as I’ve just put the finishing touches to a campaign report for one of our clients. We ran a combined direct mail and email promotion to a series of lapsed customers.
To understand whether any uplift we can see is incremental, and wouldn’t have happened naturally, we kept a random control group back from the mailing and emails. This way we can be sure that any difference in purchase behaviour between our test and control groups is down to the campaign.
Well, it’s not always quite that simple.
Statistical Significance – A Challenge for many CRM-ers
When measuring an A/B test like this you need to look for statistical significance in the results. Statistical significance tells you how unlikely a difference as large as the one you observed would be if the two versions actually performed the same, which is a guide to whether the result is real or just random noise.
The inputs for this calculation are the sample size of each group and the conversion rate of each. For a classic subject line test this would be the number of emails delivered in each group, and the number of opens in each (which gives you the open rate).
The maths behind it isn’t something most of us want to do by hand, but luckily there are numerous calculators around, such as the one from Survey Monkey.
What you are looking for is a 95% confidence level. Loosely, this means that if there were really no difference between the two versions, you would see a gap this large by chance only 5% of the time. 95% is the gold standard most people will apply for a scientific or online marketing test.
Going back to the two inputs into the calculation, you are more likely to achieve 95% confidence if:
- You have a large sample size in both test and control groups
- There is a big uplift in performance
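For the curious, the kind of calculation those calculators perform can be sketched in a few lines of Python. This is a minimal sketch, assuming a pooled two-proportion z-test and reporting "confidence" as one minus the two-sided p-value. That is a common convention, but not necessarily the exact method any particular calculator uses. The function name ab_confidence and the example volumes are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def ab_confidence(sent_a, conv_a, sent_b, conv_b):
    """Confidence (1 - two-sided p-value) that the two rates differ.

    Assumes a pooled two-proportion z-test; inputs are the number
    delivered and the number of conversions (e.g. opens) per group.
    """
    rate_a = conv_a / sent_a
    rate_b = conv_b / sent_b
    # Pooled rate: what both groups would share if there were no difference
    pooled = (conv_a + conv_b) / (sent_a + sent_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / sent_a + 1 / sent_b))
    z = (rate_b - rate_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return 1 - p_value


# Illustrative (made-up) volumes: 15% vs 16% opens on 5,000 emails each
small_sample = ab_confidence(5_000, 750, 5_000, 800)
# Same rates, four times the volume
large_sample = ab_confidence(20_000, 3_000, 20_000, 3_200)
# Same small volume, but a bigger uplift (15% vs 18%)
big_uplift = ab_confidence(5_000, 750, 5_000, 900)

for label, value in [("small sample", small_sample),
                     ("large sample", large_sample),
                     ("big uplift", big_uplift)]:
    print(f"{label}: {value:.1%}")
```

With these made-up numbers the first case falls short of 95% while the other two clear it, which is the effect of the two bullet points above.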
This is where email marketers find it more challenging than, say, those running A/B tests on a website with lots of traffic.
Firstly, the sample size might be restrictive, either because you are targeting a refined segment or because you simply don’t have huge volumes to start with. If you are running something such as a subject line test on a small portion of your list before rolling out to the rest, chances are you will struggle to achieve statistical significance without taking up so much of the database that the benefits are wasted.
Some of this is because the difference in results won’t be that great on such a test. You might boost opens from 15% to 16% between subject lines. This 6.67% relative increase is nothing to be sniffed at, but it does mean you need at least 15,000 delivered emails for each version to be sure of a 95% confidence level.
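If you want to estimate the volume needed before you send, the standard normal-approximation formula for comparing two proportions can be sketched as below. Treat it as a rough planning aid rather than a definitive method: the power parameter (the chance of detecting the uplift if it is real) is my assumption and isn't something the figures above specify, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(base_rate, new_rate, confidence=0.95, power=0.80):
    """Approximate emails needed per version to detect base_rate -> new_rate.

    Uses the usual normal approximation for a two-sided test on two
    proportions. `power` is an assumed planning parameter.
    """
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = base_rate * (1 - base_rate) + new_rate * (1 - new_rate)
    return ceil((z_alpha + z_power) ** 2 * variance / (new_rate - base_rate) ** 2)


# The 15% -> 16% open rate example, at two different power settings
print(sample_size_per_group(0.15, 0.16, power=0.80))
print(sample_size_per_group(0.15, 0.16, power=0.50))
```

Under these assumptions the answer ranges from roughly 10,000 per version at 50% power to roughly 20,000 at 80% power, so the exact figure depends on how sure you want to be of spotting the uplift.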
A website conversion manager will simply run a test until they have seen enough traffic, but email marketers don’t have that luxury as they have a finite pool of those they can contact within a short window of time.
And the second issue – most email marketing and CRM activities you test won’t deliver huge wins in performance, but small incremental uplifts unless the activity you are running is truly transformational.
As a result, many of the A/B tests I observe don’t reach the gold standard of a 95% confidence level. One outcome is that the confidence level is ignored, which is dangerous as it can lead to incorrect assumptions about a winner (usually the version the email marketer simply prefers, or whose idea it was). The other, in more data-driven organisations, is that any results are dismissed as not an improvement simply because 95% confidence wasn’t achieved.
Why not achieving 95% confidence in results doesn’t mean it’s not a winning test
Dismissing a test outright just because you don’t have enough sample size and uplift to achieve 95% confidence is misguided. All it means is that the maths can’t rule out, to that standard, that the 10% uplift you generated was down to chance, so you can’t be sure you would see it again if the test was rerun.
You can look at lower levels of confidence like 90%, or even 80% at a push. But remember that at 50% confidence, calling the test a winner is no more reliable than guessing heads on a coin toss.
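To get a feel for what a lower threshold costs you, here is a small simulation sketch. It runs many pretend tests where both versions have exactly the same true open rate, then counts how often each threshold would still declare a difference. Everything here is an assumption for illustration: the 15% rate, the group size of 500, the pooled z-test and the helper names.

```python
import random
from math import sqrt
from statistics import NormalDist


def confidence(sent_a, conv_a, sent_b, conv_b):
    """1 - two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (sent_a + sent_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / sent_a + 1 / sent_b))
    if std_error == 0:
        return 0.0  # no conversions (or all conversions): nothing to compare
    z = (conv_b / sent_b - conv_a / sent_a) / std_error
    return 1 - 2 * (1 - NormalDist().cdf(abs(z)))


def null_confidences(true_rate=0.15, group_size=500, trials=1000, seed=1):
    """Confidence scores from simulated tests with NO real difference."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        conv_a = sum(rng.random() < true_rate for _ in range(group_size))
        conv_b = sum(rng.random() < true_rate for _ in range(group_size))
        scores.append(confidence(group_size, conv_a, group_size, conv_b))
    return scores


scores = null_confidences()
false_winner_rate = {
    threshold: sum(score >= threshold for score in scores) / len(scores)
    for threshold in (0.95, 0.90, 0.80)
}
for threshold, rate in false_winner_rate.items():
    print(f"threshold {threshold:.0%}: false winners in {rate:.1%} of tests")
```

The rates should land close to 5%, 10% and 20% respectively. In other words, at 80% confidence roughly one in five tests with no real difference would still appear to produce a winner, which is the price of relaxing the standard.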
There are other things you can do to help with understanding the reliability of your A/B test.
Break results down by segment
One good way to get a feel for how real the results are is to break them down by segment. Segments might be recency tiers based on how long it is since the recipient signed up or purchased, or perhaps number of orders.
If you see consistent patterns in the test results across these segments then you can have greater confidence that the test is a reliable winner. If you see less consistency, with some groups not following the overall pattern, then it increases the doubt about the reliability of the results. A test won’t be reliable if it’s simply a result of randomness, and randomness doesn’t follow consistent patterns.
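As a rough illustration of that consistency check, the sketch below computes the relative uplift of test over control in each segment and reports whether they all point the same way. The segment names and every number are hypothetical, invented purely to show the mechanics, and the function name is my own.

```python
def segment_uplifts(results):
    """Relative uplift of test over control for each segment.

    `results` maps segment name ->
    (test_sent, test_conversions, control_sent, control_conversions).
    """
    uplifts = {}
    for segment, (t_sent, t_conv, c_sent, c_conv) in results.items():
        test_rate = t_conv / t_sent
        control_rate = c_conv / c_sent
        uplifts[segment] = (test_rate - control_rate) / control_rate
    return uplifts


# Hypothetical recency tiers with made-up volumes and conversions
results = {
    "purchased 0-3 months ago": (4_000, 240, 4_000, 200),
    "purchased 4-12 months ago": (6_000, 270, 6_000, 240),
    "purchased 12+ months ago": (10_000, 210, 10_000, 200),
}

uplifts = segment_uplifts(results)
for segment, uplift in uplifts.items():
    print(f"{segment}: {uplift:+.1%}")

same_direction = all(u > 0 for u in uplifts.values()) or all(u < 0 for u in uplifts.values())
print("Consistent direction across segments:", same_direction)
```

Bear in mind that each segment has a smaller sample than the whole test, so individual segment results are noisier. Agreement in direction is a sanity check, not proof.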
You might also find additional insight with this approach, and it’s especially worthwhile if the results you have just don’t feel right.
One example I recall of this being especially useful was a frequency test we ran for a client several years ago. They wanted to test reducing the frequency of send as there were a lot of customers complaining about the volume of messaging. The initial test showed reducing frequency had a slight negative effect overall. But then we broke the results down by how long people had been customers: those new to the brand, who had only registered during the test, actually performed better with less email, and at a 95% confidence level.
Our hypothesis was those who had already endured the high frequency of send were used to it and switched off. I’ll be honest though I doubt you’ll see that result often – reducing volume usually reduces overall revenue from email.
Another example was an online marketplace which wanted to test whether their daily email strategy delivered any incrementality at all. The top line results indicated not, but as soon as you split the segments out you could clearly see a large segment of their base skewing the results. By stripping them out it was clear at a 95% confidence level that the email programme did deliver incremental benefits, and that they could cut overall email sends by several million per month by excluding the inactives.
If you are measuring something like sales or conversions from an email send, you will naturally have very few conversion events to feed into your statistical significance calculation, and will find it difficult, if not impossible, to achieve 95% confidence.
Instead, you might want to measure micro actions that are further up the funnel to conversion. Typically this will be opens and clicks. These will be far more numerous, so you’ll need less volume for your test.
However, treat this approach with extreme caution: a higher open rate won’t necessarily mean a higher conversion rate. One of the key reasons for this is user intent. Let’s say you have a subject line for a new product line that mentions a special promotion, tested against a control which talks more about the benefits of the new product line.
You might get more people to open the subject line with the incentive in, but they are less motivated by the product benefits, and so less likely to buy, than those who open because they need what the product offers. The result can be fewer clicks and sales overall.
Another example I remember is from a template test where the client tested a cut-down template without any generic links in the header through to the site, just a call to action to the product featured in the email.
The old template had nearly twice the clicks, but when we analysed where the clicks occurred, over half were on the generic template links in the header. The new template actually generated more revenue with fewer clicks, as those clicks went to a better converting funnel and came only from those interested in the product.
Repeat the test
In the conversion optimisation world there can be a lot of talk about repeating a test to see if you get the same results. While this is possible for email marketing there is one subtle difference which makes it less reliable.
Generally speaking, an email test that is repeated will go to those that have already been exposed to the test, and this brings in an additional element, one of freshness. What worked last time might well have simply been a result of it being different to what they normally see and that stimulated them. Each time they see the same test this freshness is reduced and results will close up, or even switch round.
It’s why I’m not a huge fan of subject line testing on everyday emails as you can’t repeat the result. It’s also why you need to ask yourself WHY did something new beat the existing version – is it simply because it was new?
Final thoughts on test results
There is clearly value in ‘Look at what the data says’, as so many within marketing preach, but many of them don’t really understand what they are looking at in the data.
A more in-depth look at the numbers is required to understand the validity of a test and to see whether the numbers really tell you what you think they do. Most importantly, marketers need to plan ahead of time how they will measure the test, and to justify whether the test activity is worthwhile at all: can the uplift they realistically expect pass the statistical significance test or other measures? If not, then what is the point of the test, as you’ll never know the real answer to its success?