Why “controlling for a variable” doesn't (usually) work

Sep 2022


Comments at substack.

I’ve always seen cathedrals as presenting a kind of implicit argument to atheists. Something like: God must exist, because otherwise it would have been insane for people to build this:

[Image: cathedral]

This is how I feel about the idea of “controlling” for stuff in statistics. You read things like this all the time:

X is associated with Y, controlling for Z.

These analyses are everywhere in contexts where what we care about is if X causes Y. Everyone knows correlation doesn’t imply causation. But here we controlled for Z. So maybe it’s OK? Everyone seems to act like it’s OK. If these analyses didn’t work, then everybody would be crazy. So they’ve got to work, right?

What does it mean to “control” for something? Papers love to use scary terms like Cox Proportional-Hazards Model and Linear Structural Causal Equations. These hide the embarrassing reality that all it means to control for something is to add a variable to a regression.

Say you want to know if drinking alcohol makes people fatter. So you go find some people and measure how much they weigh and how much they drink.

Weight (kg)	Alcohol (drinks per day)
95	0
68	0.5
61	1.5
81	2
71	3

(Of course, you can’t conclude anything from just five data points—I’m just trying to keep things simple.)

If you squint, the trend seems to be that heavier drinkers are a bit lighter. You can confirm this by fitting a regression: You look for constants a and b such that

(weight) ≈ a + b × (alcohol).

You want the best approximation, i.e. the constants that minimize the average squared error. If you do this for the above dataset, you get a fit of

(weight) ≈ 81.6 - 4.5 × (alcohol).

The constant in front of alcohol is negative. In statistics jargon, you can say:

Weight is negatively associated with alcohol.

The naive interpretation of this would be that drinking makes you thinner. Which would be very weird if true.
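If you want to check the one-variable fit yourself, here is a minimal sketch using NumPy's least-squares solver. The variable names are mine, and the five data points are the ones from the table above.

```python
import numpy as np

# The five illustrative data points from the table above.
weight = np.array([95, 68, 61, 81, 71], dtype=float)   # kg
alcohol = np.array([0, 0.5, 1.5, 2, 3], dtype=float)   # drinks per day

# Design matrix: a column of ones (for a) and the alcohol column (for b).
X = np.column_stack([np.ones_like(alcohol), alcohol])

# Least squares: pick a and b to minimize the squared error of
# weight ~ a + b * alcohol.
(a, b), *_ = np.linalg.lstsq(X, weight, rcond=None)

print(f"a = {a:.1f}, b = {b:.1f}")  # b comes out negative, about -4.5
```

Minimizing the average squared error and minimizing the total squared error give the same constants, so an ordinary least-squares solver is enough here.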

But then you have an idea: It could be that the same people who drink less also tend to eat more. So you call up everyone and ask how much they eat in a day:

Weight (kg)	Alcohol (drinks per day)	Calories (thousands)
95	0	3.0
68	0.5	2.0
61	1.5	1.8
81	2	2.7
71	3	1.9

If you fit a regression to this dataset, you’ll get

(weight) ≈ 20.1 + 0.3 × (alcohol) + 24.0 × (calories).

Now the constant in front of alcohol is positive. Or, in statistics jargon:

Weight is positively associated with alcohol, controlling for calories.

That’s it. You “controlled” for calories. In almost all cases, this is what it means to control for something. The details vary a bit depending on the analysis, but it’s ultimately just sticking a new variable into a linear approximation somewhere.
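To make "sticking a new variable into a linear approximation" concrete, here is the same kind of sketch with calories added as a third column. Again, the names are mine and the data is the toy table above.

```python
import numpy as np

weight = np.array([95, 68, 61, 81, 71], dtype=float)    # kg
alcohol = np.array([0, 0.5, 1.5, 2, 3], dtype=float)    # drinks per day
calories = np.array([3.0, 2.0, 1.8, 2.7, 1.9])          # thousands per day

# "Controlling for calories" just means adding one more column here.
X = np.column_stack([np.ones_like(alcohol), alcohol, calories])
(a, b, c), *_ = np.linalg.lstsq(X, weight, rcond=None)

# The coefficients round to 20.1, 0.3 and 24.0.
print(f"weight = {a:.1f} + {b:.1f} * alcohol + {c:.1f} * calories")
```

The only difference from a one-variable fit is the extra column in `X`. Nothing in the code knows or cares which variable causes which.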

So, does this work? Is this a good way to figure out if alcohol makes you fatter?

Not really, no.

Problem 1: Bro come on

To start, just use your intuition. Does it seem like you can uncover causal relationships by doing little regressions?

Think about the previous dataset. There are many possible relationships between weight, alcohol, and calories. Maybe being heavier makes people drink more. Maybe drinking makes people eat more. Maybe alcohol and food directly do nothing, but they’re correlated with exercise, which is what actually determines how fat people are. Maybe the true relationship is nonlinear, and all the variables influence each other in feedback loops. Any of these relationships could produce the previous data.

People seem to think that there’s some secret math that means you don’t need to worry about this stuff. Otherwise, what is everyone doing with these analyses?

There is no secret math. The biggest mistake people make with statistics is to distrust their intuition. In reality, once you do all the math, the things that seemed like they’d be problems are in fact problems. If anything, the math just turns up more things to worry about.

But how is this possible? After all, these methods have careful formal justifications. With proofs! And Borel spaces! Surely those aren’t wrong?

They are not wrong. But you must understand how these proofs get around the difficult problems: Roughly speaking, they assume the problems don’t exist.

Problem 2: Reverse causality

Take a group of aliens that only vary in how much they drink and how fat they are. Nothing else matters: They don’t eat or exercise, they all have the same genes, etc. Now suppose drinking and weight are positively associated. There are three possible explanations:

First, maybe drinking makes them heavier. (Perhaps because alcohol has calories.) We can draw this as

ALCOHOL → WEIGHT.

Second, maybe being heavier causes them to drink more. (Perhaps because there’s a hormonal change in heavier people that makes alcohol taste better.) We can draw this as

ALCOHOL ← WEIGHT.

Third, maybe drinking and weight are in some kind of complex feedback loop. (Perhaps because both of the above things happen.) We can draw this as

ALCOHOL ↔ WEIGHT.

How do you use observational data to figure out which explanation is right?

The answer is easy: You don’t. Any data that comes from one of these causal models could also come from any of the others. The difference is completely invisible. The only way to tell is through intervention—make people drink more and see what happens.

To estimate causal relationships from observational data, you start by making assumptions about which variables cause changes in which others, and then use data to figure out the strengths of the interactions. You assume which direction the arrows are pointing, and then you figure out how big the arrows are.

I know that seems odd—isn’t the direction of the arrows the central question? You’d think that there’s some caveat here, but there just isn’t. People assume X causes Y and then use that assumption to figure out how strongly X causes Y. If that assumption is wrong, the analysis will happily give you a wrong answer.
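Here is a small simulation of the alien example, under assumptions that are entirely mine: both variables are standardized, the relationships are linear, and the noise is Gaussian. One world has alcohol causing weight, the other has weight causing alcohol, and the coefficient 0.5 is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
noise_sd = np.sqrt(0.75)  # chosen so both variables end up with variance 1

# World A: ALCOHOL -> WEIGHT
alcohol_a = rng.normal(size=N)
weight_a = 0.5 * alcohol_a + noise_sd * rng.normal(size=N)

# World B: ALCOHOL <- WEIGHT
weight_b = rng.normal(size=N)
alcohol_b = 0.5 * weight_b + noise_sd * rng.normal(size=N)

# Sample covariance matrices of (alcohol, weight) in each world.
cov_a = np.cov(alcohol_a, weight_a)
cov_b = np.cov(alcohol_b, weight_b)
print(np.round(cov_a, 2))
print(np.round(cov_b, 2))
```

Both matrices come out close to variances of 1 and a covariance of 0.5. In this toy setup the two worlds produce the same joint distribution, so no amount of observational data would tell you which way the arrow points.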

Problem 3: Dependent features

If you say that “drinking makes people fatter”, what do you mean by that? Here are two options:

If people drank more but didn’t change how they eat, they’d be fatter.
If people drank more and changed how they eat as a result of drinking more, they’d be fatter.

These are not the same!

Suppose that eating makes you fatter, and eating makes you drink more, but drinking has no direct impact on weight:

ALCOHOL ← FOOD → WEIGHT

If you did a randomized controlled trial and assigned people to drink different amounts, there would be no impact on weight. However, alcohol and weight are associated, because food acts as a confounder. You can fix the problem by controlling for food, which makes alcohol and weight unassociated. Here, controlling works.

But maybe the causal model is different. Maybe alcohol has no direct influence on weight, but alcohol causes you to eat more, and eating more makes you fatter:

ALCOHOL → FOOD → WEIGHT

Now, an RCT would show that drinking does make people fatter. But just like with the previous causal model, alcohol and weight are associated, but unassociated if you control for food.

Get that? Here’s a little table:

Causal model	ALCOHOL ← FOOD → WEIGHT	ALCOHOL → FOOD → WEIGHT
Alcohol causes fatness in RCT?	NO	YES
Alcohol associated with fatness?	YES	YES
Alcohol associated with fatness, controlling for food?	NO	NO

In the two cases, an RCT will give you totally different results. But alcohol is always associated with fatness, and always unassociated once you control for food.

Why? Because there is no visible difference between the two causal models in observational data. Any dataset that comes from one causal model could just as easily come from the other one. You simply need to guess which one is right, and you better not be wrong.
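The table can be reproduced with a toy simulation. This is only a sketch under my own assumptions: linear effects, Gaussian noise, and made-up coefficients (0.8 and 1.0). The "RCT" is simulated by assigning alcohol at random, which cuts any arrow pointing into ALCOHOL.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

def simulate(model, randomized=False):
    """Draw (alcohol, food, weight) from a toy linear model.

    model="fork":  ALCOHOL <- FOOD -> WEIGHT
    model="chain": ALCOHOL -> FOOD -> WEIGHT
    randomized=True mimics an RCT where alcohol is assigned at random.
    """
    if model == "fork":
        food = rng.normal(size=N)
        if randomized:
            alcohol = rng.normal(size=N)            # ignores food
        else:
            alcohol = 0.8 * food + rng.normal(size=N)
    else:
        alcohol = rng.normal(size=N)                # nothing points into ALCOHOL
        food = 0.8 * alcohol + rng.normal(size=N)
    weight = 1.0 * food + rng.normal(size=N)        # alcohol has no direct effect
    return alcohol, food, weight

def alcohol_slope(weight, alcohol, *controls):
    """Coefficient on alcohol in a least-squares fit of weight."""
    X = np.column_stack([np.ones(len(weight)), alcohol, *controls])
    coef, *_ = np.linalg.lstsq(X, weight, rcond=None)
    return coef[1]

results = {}
for model in ("fork", "chain"):
    alc, food, wt = simulate(model)
    alc_rct, _, wt_rct = simulate(model, randomized=True)
    results[model] = {
        "rct": alcohol_slope(wt_rct, alc_rct),
        "associated": alcohol_slope(wt, alc),
        "controlled": alcohol_slope(wt, alc, food),
    }
    print(model, {k: round(float(v), 2) for k, v in results[model].items()})
```

With these made-up numbers, the RCT slope is about 0 for the fork and about 0.8 for the chain. The uncontrolled slope is clearly nonzero in both, and the controlled slope is about 0 in both. The regression output alone gives you no way to tell which world you are in.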

Problem 4: More dependent features

It gets worse. In humans, it’s likely that alcohol and food both influence each other in a feedback loop. And it’s likely that both have direct impacts on weight, leading to this causal model:

    ALCOHOL ↔ FOOD
         ↘   ↙
        WEIGHT

In an RCT, drinking will again show an impact on weight. But what about the association between alcohol and weight in observational data? Without controlling for food, the association will be stronger than in an RCT, due to food acting as a confounder. But if you do control for food, the association will be weaker than in an RCT, since the control blocks the indirect impact of alcohol on weight via food.

Causal model	ALCOHOL ↔ FOOD ↘ ↙ WEIGHT
Alcohol causes fatness in RCT?	YES
Alcohol associated with fatness?	YES (but stronger than RCT)
Alcohol associated with fatness, controlling for food?	YES (but weaker than RCT)

You’re screwed either way. The only way to get it right would be to somehow guess exactly what fraction of the ALCOHOL ↔ FOOD association is due to causality flowing in either direction. Which is invisible in the data. Good luck with that.
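Here is a sketch of the first causal model in this section (the feedback loop between alcohol and food, with both affecting weight). The assumptions are mine: the system is linear, it settles to an equilibrium, the noise is Gaussian, and the four strengths `a`, `b`, `c`, `d` are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
a, b = 0.5, 0.5   # FOOD -> ALCOHOL and ALCOHOL -> FOOD (made-up strengths)
c, d = 0.3, 1.0   # direct effects of ALCOHOL and FOOD on WEIGHT (made up)

def ols(y, *predictors):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

e_alc, e_food, e_wt = rng.normal(size=(3, N))

# Observational world: the equilibrium of the two feedback equations
#   alcohol = a * food + e_alc
#   food    = b * alcohol + e_food
alcohol = (e_alc + a * e_food) / (1 - a * b)
food = (b * e_alc + e_food) / (1 - a * b)
weight = c * alcohol + d * food + e_wt

unadjusted = ols(weight, alcohol)[1]
controlled = ols(weight, alcohol, food)[1]

# RCT world: alcohol is assigned at random, food still responds to it.
alcohol_rct = rng.normal(size=N)
food_rct = b * alcohol_rct + rng.normal(size=N)
weight_rct = c * alcohol_rct + d * food_rct + rng.normal(size=N)
rct = ols(weight_rct, alcohol_rct)[1]

print(f"unadjusted {unadjusted:.2f}, RCT {rct:.2f}, controlled {controlled:.2f}")
```

With these numbers the RCT effect is c + d × b = 0.8. The unadjusted slope lands near 1.1 and the controlled slope near 0.3, so the truth sits between the two observational answers. Where exactly it sits depends on `a` and `b`, which are the things you can't see in the data.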

But wait, there’s more! Maybe weight also influences what people consume, so everything causes everything else:

    ALCOHOL ↔ FOOD
         ⤡   ⤢
        WEIGHT

Abandon all hope ye who encounter this causal model.

Problem 5: Everything else

So far, I’ve focused on the most serious problem with “controlling”: It only works if influence flows in a simple unidirectional way and you already know which direction. But there are many other issues, too:

Missing causes. There might be important variables that are missing from the model. Sometimes these are things that you just failed to measure (like EXERCISE). Sometimes these are ethereal and unmeasurable (like PSYCHOLOGICAL COMMITMENT TO A HEALTHY LIFESTYLE).

Linearity. Fitting linear models assumes interactions are linear. Often there’s a feeble attempt to address this by testing what happens if you add interactions or quadratic terms. But there’s no guarantee that a quadratic model is sufficient either. And people often do weird things like compute a p-value for the quadratic term, observe that the p-value is large, and therefore conclude that the quadratic term isn’t needed. (That isn’t how p-values work.)

Coding. Everything depends on how you code variables. Like, when people say they control for “education”, what they actually did is choose some finite set of categories, e.g. (high school or less) / (some college) / (4-year degree or more). But you could break these things up differently, and when you do, the results often change.

Noisy controls. People might say they control for “diet”. But what they actually control for is “how much people claimed to eat on 5 days they were surveyed”. If they don’t remember, or they lie, or those days were unusual, what happens? Well, in the limit of pure noise, the regression will just ignore the control variables, equivalent to not controlling at all. In practice, the effect is usually somewhere in the middle, meaning that things are just partially controlled. It’s amazing how rare it is for papers to show any awareness of this issue.
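The noisy-controls issue is easy to see in a simulation. This sketch assumes a toy world of my own choosing where food is a pure confounder (ALCOHOL ← FOOD → WEIGHT), so the true effect of alcohol is zero, and where the survey reports food plus Gaussian noise. All coefficients and noise levels are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Toy world: ALCOHOL <- FOOD -> WEIGHT, so alcohol's true effect is zero.
food = rng.normal(size=N)
alcohol = 0.8 * food + rng.normal(size=N)
weight = food + rng.normal(size=N)

def alcohol_slope(weight, alcohol, *controls):
    """Coefficient on alcohol in a least-squares fit of weight."""
    X = np.column_stack([np.ones(len(weight)), alcohol, *controls])
    coef, *_ = np.linalg.lstsq(X, weight, rcond=None)
    return coef[1]

slopes = {}
for noise_sd in (0.0, 1.0, 10.0):
    # What people *report* eating: the truth plus measurement noise.
    reported_food = food + noise_sd * rng.normal(size=N)
    slopes[noise_sd] = alcohol_slope(weight, alcohol, reported_food)
    print(f"noise sd {noise_sd}: alcohol coefficient {slopes[noise_sd]:.2f}")

no_control = alcohol_slope(weight, alcohol)
print(f"no control at all: alcohol coefficient {no_control:.2f}")
```

With a perfectly measured control, the alcohol coefficient is about 0, which is the right answer here. With noise of the same size as the signal it is about 0.3, and with very heavy noise it is nearly the same as not controlling at all (about 0.49).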

Association doublethink

Of course, there’s a place for observational studies. For many things we care about, true experiments are impossible—we can’t run RCTs where we assign people to breathe different amounts of nitrogen dioxide or particulate air pollution for their entire lives. So we have to content ourselves with observational estimates, as imperfect as they are.

And of course, all the problems I’ve pointed out are well-known. This is why journals won’t let you do an observational study and then say that you’ve proven anything about causality.

So what happens? Well, many people accept this. They insist on RCTs whenever possible. (Just try getting a drug approved with observational data!) When an RCT can’t be done, they’ll try to use natural experiments or instrumental variables as an approximation. When even that’s impossible, they’ll use observational data, but obsess about the assumptions and treat the conclusions with great humility.

But many people don’t seem to accept the harsh limitations of associations and controls. And sometimes these people form entire scientific sub-communities. They train each other, accept each other’s papers, and give quotes to journalists when someone’s paper is published. Here’s what they do:

Take whatever variables they can measure, and throw them into off-the-shelf statistical packages.
Don’t worry about (or, maybe, understand) all the assumptions that are being made. Certainly, don’t discuss them.
When writing papers, never say “cause”. Instead use words like “associated” or “risk factor” exactly as if they meant “cause”.

For example, take the paper Artificial sweeteners and cancer risk: Results from the NutriNet-Santé population-based cohort study. Here are some quotes:

In this large cohort of 102,865 French adults, artificial sweeteners […] were associated with increased overall cancer risk

Our findings do not support the use of artificial sweeteners as safe alternatives for sugar in foods or beverages and provide important and novel information to address the controversies about their potential adverse health effects.

And here’s one from the press release, picked up by all the usual copypasta places and presumably read by the general public:

This large-scale, prospective study suggests that artificial sweeteners, used in many foods and beverages in France and throughout the world, may represent an increased risk factor for cancer.

What game is being played here? Are we seriously supposed to pretend that the goal is anything other than to make people think that artificial sweeteners cause cancer?

I think this is disgraceful. If you think you’ve shown something about causality, then say so explicitly. Have the courage to argue for what you think is true. Drop this gambit of bending words so you can act like you’ve proven causality, but not have to defend your conclusions.

This is bad. But I do have some sympathy for the authors. And if everyone you know is distorting words like this, it’s natural to think this is the normal way to use those words. And if everyone you know is doing research like this, it’s natural to think that this is the right way to do research. Because if it wasn’t…

Now, I like cathedrals. Beyond simply being pretty, they are a celebration of human agency, a demonstration of what dedicated people can accomplish over generations. Misleading observational studies have fewer obvious benefits.
