Correlation + Analytics – Causation = Confusion
April 17, 2015
What causes ‘something’, and what action to take to either prevent that ‘something’ or make it happen, is the curious question that has led to major scientific discoveries and philosophical theories. But these are based on years of painstaking, focused research in a closed environment.
In recent years, the promise to solve this has hit the business world. The promise being that what causes something is hidden in the data (so much of it!), and analyzing that big data will give the solution. Simple idea, but not as simple as it sounds.
Establishing causation in a business environment with multiple, ever-changing factors and, now, ever-growing data sizes is not a trivial problem. And filtering it down to action, where ‘do this and this will happen’, is probably tougher than scientific ‘cause and effect’ discoveries. Probably because a sure-shot causation does not exist in the business world.
Very often, clients ask – ‘So what should I do? What action should I take?’ And my answers have been kind of similar, and probably frustrating. ‘Well, there are multiple factors, and what action to take depends a lot on what you want to do and where you want to go.’
A bit like the Cat in Alice in Wonderland.
‘Would you tell me, please, which way I ought to go from here?’
‘That depends a good deal on where you want to get to,’ said the Cat.
‘I don’t much care where —’ said Alice.
‘Then it doesn’t matter which way you go,’ said the Cat.
The fixation with ‘data-driven decision-making’ has led to the notion that causal factors for anything can simply be derived from a few things happening together, and that if one finds things in the data happening together and some kind of model to stitch them together, it will be the magic bullet to drive action.
The truth is that what causes something is rarely as simple as that. Observing what happens together is easy, but establishing that one leads to another is not.
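To make the point concrete, here is a minimal simulation sketch. Everything in it is made up for illustration: a hidden factor (call it `season`) drives both a hypothetical `ad_spend` and a hypothetical `sales` series, and ad spend has no effect on sales at all. The two still move together, and the link vanishes as soon as ad spend is set independently of the hidden factor.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical hidden common cause (an assumption, not real data).
season = [rng.gauss(0, 1) for _ in range(n)]

# Both series depend on the season; sales does NOT depend on ad spend.
ad_spend = [s + rng.gauss(0, 0.5) for s in season]
sales = [s + rng.gauss(0, 0.5) for s in season]
r_observed = pearson(ad_spend, sales)

# "Experiment": ad spend is now chosen at random, independent of season.
ad_spend_random = [rng.gauss(0, 1) for _ in range(n)]
sales_after = [s + rng.gauss(0, 0.5) for s in season]
r_intervened = pearson(ad_spend_random, sales_after)

print(f"correlation in observed data:     {r_observed:.2f}")
print(f"correlation when spend is random: {r_intervened:.2f}")
```

With these assumed noise levels the first number should come out strongly positive and the second close to zero. Someone looking only at the observed data would happily conclude ‘spend more, sell more’, and they would be wrong.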
The belief catching on is that one should track all kinds of data in huge volumes, derive connections using powerful machines and databases, and act on them in real time. The presumption being that all this humongous information and processing will reveal correlations and insights so strong that causation or logic is no longer relevant.
Nothing could be further from the truth.
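One reason is that tracking more things makes chance correlations more likely, not less. The sketch below is a toy under stated assumptions: 30 observations (say, 30 months) of a target, and 1,000 tracked metrics that are pure random noise and unrelated to the target by construction. The names and sizes are hypothetical.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


rng = random.Random(7)  # fixed seed so the run is repeatable
n_obs = 30        # assumed: 30 data points per series
n_metrics = 1000  # assumed: 1,000 unrelated metrics being tracked

# The thing we want to "explain", here just random noise.
target = [rng.gauss(0, 1) for _ in range(n_obs)]

# Correlate the target with each unrelated metric and keep the strongest.
strongest = 0.0
for _ in range(n_metrics):
    metric = [rng.gauss(0, 1) for _ in range(n_obs)]
    strongest = max(strongest, abs(pearson(target, metric)))

print(f"strongest correlation found among pure noise: {strongest:.2f}")
```

Nothing here causes anything, yet the best of the 1,000 metrics will typically show a correlation that looks respectable on a slide. Search enough data and something will always turn up.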
Even ‘Smoking causes Lung Cancer’ is not a sure thing. Not all smokers get lung cancer and not all non-smokers escape it, so how can one say sure-shot causality exists? So much so that even for something so universally accepted, most accomplished researchers now speak in probabilities, like ‘Smoking is one of the risk factors linked to Lung Cancer’.
Even in complex production environments, where most things can be measured and the factors are scientifically tuned, tracked, known and follow a fixed pattern, it is still tough to say with complete confidence what exact combination causes a defect or failure.
In more qualitative and customer-centric functions like marketing, it is even tougher to be exact about the factors that cause an increase in sales, a better response from customers, higher customer satisfaction or whatever. So it is tougher still to pin it down to a clear mix of specific causative factors.
In stock markets, technical analysis is entirely based on looking for signals in patterns that repeat themselves. No one has yet found a cause-and-effect rule that works every time. Besides, even for followers of technical analysis, it is not clear which is the cause and which is the effect. Whenever it works, does it work because everyone follows it, or does everyone follow it because it works?
In most business situations, there is no such thing as a sure-shot correlation – so a sure-shot cause and effect is even more remote. Anyone who confidently demonstrates cause and effect, and that too with multiple factors, either doesn’t understand it or is in the business of being confident, not right.
That is not to say correlations are useless. Often they can be used as standalone signals – but that is all they are.
Hence, statisticians come up with the language of correlations, factors and probabilities. And because nobody understands them, everyone thinks what they are saying is causation and surety. That is where the problems start.
‘What do you mean so much probability? What is this statistically significant model? What exactly should I do?’ is the question. Or even better ‘We wanted to get actionable insights!’ Or even better ‘We gave you all the data we had!’
Data, correlations and statistical models without any causation or business logic are like a solution looking for a problem to solve. An insight looking for justification rather than a hypothesis looking for validation. A recipe for disaster even in the best of situations.
As it is, analytics gives you no sure-shot answers and talks in probabilities – even in the best of situations. But in this scenario, the only sure-shot thing that one will end up in is confusion. Even if you don’t want to get there.
Like Alice in Wonderland.
‘But I don’t want to go among mad people,’ Alice remarked.
‘Oh, you can’t help that,’ said the Cat: ‘we’re all mad here. I’m mad. You’re mad.’
‘How do you know I’m mad?’ said Alice.
‘You must be,’ said the Cat, ‘or you wouldn’t have come here.’
Alice didn’t think that proved it at all; however, she went on ‘And how do you know that you’re mad?’
‘To begin with,’ said the Cat, ‘a dog’s not mad. You grant that?’
‘I suppose so,’ said Alice.
‘Well, then,’ the Cat went on, ‘you see, a dog growls when it’s angry, and wags its tail when it’s pleased. Now I growl when I’m pleased, and wag my tail when I’m angry. Therefore I’m mad.’