Why Machine Learning Is Not Made for Causal Estimation



We all know that “Correlation does not imply causation”. But why?

There are two main scenarios. First, as illustrated below in Case 1, the positive relationship between drowning accidents and ice cream sales is arguably just due to a common cause: the weather. When it is sunny, both increase, but there is no direct causal link between drowning accidents and ice cream sales. This is what we call a spurious correlation. The second scenario is depicted in Case 2. There is a direct effect of education on job performance, but cognitive capacity affects both. So, in this situation, the positive correlation between education and job performance is confounded with the effect of cognitive capacity.

The main reason why “correlation does not imply causation”. The arrows represent the direction of the causal links in causal graphs. Image by author.
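To see how a common cause manufactures a correlation, here is a minimal simulation in Python (all numbers are invented for the example): the weather drives both ice cream sales and drowning accidents, and the two end up strongly correlated even though neither affects the other.

```python
# Minimal sketch of Case 1: a common cause ("weather") drives both variables,
# creating a strong correlation with no causal link between them.
import numpy as np

rng = np.random.default_rng(42)
n = 1_000

temperature = rng.normal(25, 5, n)                         # common cause
ice_cream_sales = 10 * temperature + rng.normal(0, 20, n)  # caused by weather only
drownings = 0.5 * temperature + rng.normal(0, 2, n)        # caused by weather only

# Strong correlation (~0.7) despite zero causal effect between the two.
print(np.corrcoef(ice_cream_sales, drownings)[0, 1])
```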

As I mentioned in the introduction, predictive inference exploits correlations. So anyone who knows that ‘Correlation does not imply causation’ should understand that Machine Learning is not inherently suited for causal inference. Ice cream sales might be a good predictor of the risk of drowning accidents on the same day, even if there is no causal link. This relationship is just correlational and driven by a common cause: the weather.

However, if you want to study the potential causal effect of ice cream sales on drowning accidents, you must take this third variable (weather) into account. Otherwise, your estimation of the causal link will be biased due to the famous Omitted Variable Bias. Once you include this third variable in your analysis, you would most certainly find that ice cream sales no longer affect drowning accidents. Often, a simple way to address this is to include the variable in the model so that it is not ‘omitted’ anymore. However, confounders are often unobserved, and hence it is not possible to simply include them in the model. Causal inference has numerous ways to address this issue of unobserved confounders, but discussing them is beyond the scope of this article. If you want to learn more about causal inference, you can follow my guide, here:
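As a sketch of this point, the toy regression below reuses the simulated setup from above (coefficients invented; the true effect of sales on drownings is zero). Omitting the confounder yields a seemingly positive ‘effect’ of ice cream sales; adding weather to the model makes it vanish.

```python
# Omitted Variable Bias in action: regress drownings on ice cream sales,
# first without the confounder (weather), then with it. Simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
weather = rng.normal(25, 5, n)
sales = 10 * weather + rng.normal(0, 20, n)
drownings = 0.5 * weather + rng.normal(0, 2, n)  # sales play no causal role

def ols(y, X):
    """Ordinary least squares with an intercept; returns the coefficients."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print(ols(drownings, sales)[1])                              # ~0.04: spurious "effect"
print(ols(drownings, np.column_stack([sales, weather]))[1])  # ~0: bias removed
```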

Hence, a central difference between causal and predictive inference is the way you select the “features”.

In Machine Learning, you usually include features that might improve the prediction quality, and your algorithm can help to select the best features based on predictive power. However, in causal inference, some features should be included at all costs (confounders/common causes) even if the predictive power is low and the effect is not statistically significant. It is not the predictive power of the confounder that is the primary interest but rather how it affects the coefficient of the cause we are studying. Moreover, there are features that should not be included in the causal inference model, for example, mediators. A mediator represents an indirect causal pathway and controlling for such variables would prevent measuring the total causal effect of interest (see illustration below). Hence, the major difference lies in the fact that the inclusion or not of the feature in causal inference depends on the assumed causal relationship between variables.

Illustration of a mediator. Here, Motivation is a mediator of the effect of training on productivity. Imagine that training increases the staff’s productivity directly but also indirectly through their motivation. The employees are now more motivated because they learned new skills and see that the employer put effort into upskilling them. If you want to measure the effect of the training, most of the time you want to measure the total effect of the treatment (direct and indirect), and including motivation as a control variable would prevent doing so. Image by author.

This is a subtle topic. Please refer to “A Crash Course in Good and Bad Controls” by Cinelli et al. (2022) for more details.
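To make the mediator case concrete, here is a toy simulation of the training example above (all effect sizes are invented). The total effect of training is 2.0 (1.0 direct plus 2.0 × 0.5 through motivation), but once you control for the mediator, you only recover the direct effect.

```python
# Controlling for a mediator hides part of the causal effect we care about.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
training = rng.binomial(1, 0.5, n).astype(float)   # treatment
motivation = 2.0 * training + rng.normal(0, 1, n)  # mediator
productivity = 1.0 * training + 0.5 * motivation + rng.normal(0, 1, n)

def ols(y, X):
    """Ordinary least squares with an intercept; returns the coefficients."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print(ols(productivity, training)[1])  # ~2.0: total effect (what we want)
print(ols(productivity,
          np.column_stack([training, motivation]))[1])  # ~1.0: direct effect only
```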

Imagine that you interpret the positive association between ice cream sales and drowning accidents as causal. You might want to ban ice cream at all costs. But of course, that would likely have little to no effect on the outcome.

A famous correlation is the one between chocolate consumption and Nobel prize laureates (Messerli (2012)). The author found a 0.8 linear correlation coefficient between the two variables at the country level. While this sounds like a great argument to eat more chocolate, it should not be interpreted causally. (Note that the arguments for a potential causal relationship presented in Messerli (2012) were later disproved; see, e.g., Maurage et al. (2013).)

Positive correlation between Nobel Laureates per 10 million population and chocolate consumption (kg/yr/capita) found in Messerli (2012). Image by author.

Now let me share a more serious example. Imagine trying to optimize the posts of a content creator. To do so, you build an ML model including numerous features. The analysis reveals that posts published in the late afternoon or evening perform best. Hence, you recommend a precise schedule where you post exclusively between 5 pm and 9 pm. Once implemented, the impressions per post crash. What happened? The ML algorithm predicts based on current patterns, interpreting the data as it appears: posts made late in the day correlate with higher impressions. As it turned out, the posts published in the evening were the more spontaneous, less planned ones, where the author didn’t aim to please the audience in particular but simply shared something valuable. So the timing was not the cause; the nature of the posts was. This spontaneous character might be hard to capture with an ML model (even with features encoding length, tone, etc., it is not trivial).

In marketing, predictive models are often used to measure the ROI of a marketing campaign.

Often, models such as a simple Marketing Mix Model (MMM) suffer from omitted variable bias, and the measured ROI will be misleading.

Typically, competitors’ behavior might correlate with our campaign and also affect our sales. If this is not properly taken into account, the ROI might be under- or over-estimated, leading to suboptimal business decisions and ad spending.
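Here is a hedged sketch of that scenario (invented numbers; `competitor_promo` stands for the competitor activity that a simple MMM would typically omit). The naive regression understates the true ROI of 2.0, while controlling for competitor activity recovers it.

```python
# Omitting competitor activity biases the estimated ROI of ad spend.
import numpy as np

rng = np.random.default_rng(7)
n = 200  # weeks
competitor_promo = rng.binomial(1, 0.5, n).astype(float)
# We tend to spend more when the competitor is active (e.g., to respond):
ad_spend = 100 + 50 * competitor_promo + rng.normal(0, 10, n)
# True ROI of our spend is 2.0; competitor promos hurt our sales:
sales = 2.0 * ad_spend - 80 * competitor_promo + rng.normal(0, 20, n)

def ols(y, X):
    """Ordinary least squares with an intercept; returns the coefficients."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print(ols(sales, ad_spend)[1])  # ~0.6: ROI under-estimated
print(ols(sales, np.column_stack([ad_spend, competitor_promo]))[1])  # ~2.0
```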

This concept is also important for policy and decision-making. At the beginning of the Covid-19 pandemic, a French “expert” used a graph to argue that lockdowns were counterproductive (see figure below). The graph revealed a positive correlation between the stringency of the lockdown and the number of Covid-related deaths (more severe lockdowns were associated with more deaths). However, this relationship was most likely driven by the opposite causal relationship: when the situation was bad (lots of deaths), countries would impose strict measures. This is called reverse causation. Indeed, when you properly study the trajectory of the number of cases and deaths within a country around the lockdowns, controlling for potential confounders, you find a strong negative effect (cf. Bonardi et al. (2023)).

Replication of the graph used to argue that lockdowns were ineffective. Green corresponds to the least restrictive lockdown measures and red to the most restrictive. Image by author.
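Reverse causation is easy to reproduce in a stylized cross-country simulation (all parameters invented): governments impose stricter measures where the outbreak is worse, and lockdowns genuinely reduce deaths, yet the naive correlation between stringency and deaths still comes out positive.

```python
# Reverse causation: stringency responds to outbreak severity, and its true
# causal effect on deaths is negative, but the raw correlation is positive.
import numpy as np

rng = np.random.default_rng(3)
n = 200  # countries
outbreak_severity = rng.uniform(0, 10, n)  # how bad the situation is
# Reverse channel: stricter measures where the situation is worse.
stringency = 0.8 * outbreak_severity + rng.normal(0, 1, n)
# True causal effect of stringency on deaths is negative (-2 per unit).
deaths = 10 * outbreak_severity - 2 * stringency + rng.normal(0, 5, n)

# Positive (~0.85) despite the negative causal effect.
print(np.corrcoef(stringency, deaths)[0, 1])
```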

Machine learning and causal inference are both profoundly useful; they just serve different purposes.

As usual with numbers and statistics, most of the time the problem is not the metric but its interpretation. Hence, a correlation is informative; it becomes problematic only if you blindly interpret it as a causal effect.

When to use Causal Inference: When you want to understand the cause-and-effect relationship and do impact evaluation.

  • Policy Evaluation: To determine the impact of a new policy, such as the effect of a new educational program on student performance.
  • Medical Studies: To assess the effectiveness of a new drug or treatment on health outcomes.
  • Economics: To understand the effect of interest rate changes on economic indicators like inflation or employment.
  • Marketing: To evaluate the impact of a marketing campaign on sales.

Key Questions in Causal Inference:

  • What is the effect of X on Y?
  • Does changing X cause a change in Y?
  • What would happen to Y if we intervene on X?

When to use Predictive Inference: When you want to make accurate predictions (associations between features and the outcome) and learn patterns from the data.

  • Risk Assessment: To predict the likelihood of credit default or insurance claims.
  • Recommendation Systems: To suggest products or content to users based on their past behavior.
  • Diagnostics: To classify medical images for disease detection.

Key Questions for Predictive Inference:

  • What is the expected value of Y given X?
  • Can we predict Y based on new data about X?
  • How accurately can we forecast Y using current and historical data on X?
