The Garden of Forking Paths is an idea introduced in a paper by Andrew Gelman and Eric Loken that should be understood by everyone who uses statistics and analyses data.
Context for those unfamiliar with statistics. For a long time, and in many journals even now, research would only get published if it was “statistically significant”, which usually meant that the result had a p-value of less than 5% (a threshold chosen arbitrarily). The p-statistic can be calculated from the data together with a hypothesis about the distribution of the data. This gave rise to the practice of “p-hacking” or “fishing”: looking through the data, excluding this and grouping that, recalculating the p-statistic, until one found a result with p < 5%, which was then published. Many of these results turned out to be un-reproducible by other researchers.
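To make the fishing concrete, here is a minimal sketch (entirely hypothetical data and groupings, in Python rather than the SAS or R an analyst might actually use): the outcome is pure noise, yet hunting across twenty arbitrary subgroups will quite often turn up a “significant” p below 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 400
outcome = rng.normal(size=n)              # no real effect anywhere
age_group = rng.integers(0, 4, size=n)    # four arbitrary age bands
region = rng.integers(0, 5, size=n)       # five arbitrary regions
eats_bacon = rng.integers(0, 2, size=n)   # "eats bacon and eggs" or not

best_p = 1.0
for a in range(4):
    for r in range(5):
        mask = (age_group == a) & (region == r)
        eaters = outcome[mask & (eats_bacon == 1)]
        abstainers = outcome[mask & (eats_bacon == 0)]
        if len(eaters) > 5 and len(abstainers) > 5:
            p = stats.ttest_ind(eaters, abstainers).pvalue
            best_p = min(best_p, p)

print(f"smallest p over 20 subgroups: {best_p:.3f}")  # frequently below 0.05, from pure noise
```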
In the old-school approach, a researcher is supposed to formulate an hypothesis, and run an experiment to test it. If the results of the experiment are insufficiently probable under the hypothesis, the hypothesis has to be rejected. What counts (classically) as “insufficiently probable” is a p-statistic of less than 5%. What you’re not allowed to do is throw away data you don’t like and change the hypothesis to suit the data that’s left. That’s downright dishonest. You have to take all the data, and there are complicated rules about what to do when subjects drop out of the study and other such eventualities. This is how the old-school founders worked. Much of their work was in agriculture and industry, and R A Fisher really did divide up a plot of land at an agricultural research station, treat each patch of soil differently, plant the potatoes and stand back to see what happened. He had no previous theories, and if he did, the potatoes would decide which one was better.
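By way of contrast, here is a sketch of why the old-school recipe holds up (again simulated, not from any real trial): with one hypothesis fixed in advance and a single test run on all the data, a non-existent effect gets declared significant only about 5% of the time, exactly as advertised.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
trials, false_positives = 2000, 0
for _ in range(trials):
    plot_a = rng.normal(size=50)   # yields under treatment A (no real difference)
    plot_b = rng.normal(size=50)   # yields under treatment B
    if stats.ttest_ind(plot_a, plot_b).pvalue < 0.05:
        false_positives += 1

print(f"false positive rate: {false_positives / trials:.3f}")  # close to the advertised 0.05
```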
In epidemiology, political science, social science, longitudinal health and lifestyle tracking surveys and other subjects, the experiments are neither as simple nor as immediately relevant, and may not even be possible to conduct. The procedure is often reversed: the data appears first, and the hypotheses and statistical analysis come afterwards. This is how businessmen read their monthly accounts and sales reports. Often those businessmen are expecting to see certain changes or figures, and when they don’t, they want to know why (“We doubled advertising in Cornwall, why haven’t the sales increased? What are they playing at down there?”). Researchers in social sciences and epidemiology also come bristling with pet theories, some of which they are obliged to adopt by the prevailing academic mores.
Under these circumstances, the data is scanned by very practised eyes for patterns and trends that the readers expect to find. If there seem to be no such patterns, those same eyes will look a little harder to find places where they can see the patterns they want, or at least some patterns that make sense of the lack of expected results. Researchers looking at diet know but cannot say that the less educated are less healthy and eat worse food, because they cannot afford better. So the researchers scan the data and blame bacon and eggs, or whatever else is believed to be eaten by the lower classes. This saves the researchers’ grants and jobs.
However, the next survey fails to find that eating bacon and eggs altered the health of the people who ate them at all. Though nobody will ever know, this is because, in the first sample, the people who ate bacon and eggs were mostly older unemployed English people who did not exercise, whereas in the second survey, they were mostly Romanian builders in their late twenties who also played football at the weekends.
What happens in this practised data scanning? It is a series of decisions to select these data points, and group those properties, and maybe construct a joint index of this and that variable. It may include comparing the usual summary statistics, looking at histograms, time series, scatter graphs and linear regressions, and maybe even running a quick-and-dirty logistic regression, GLM or cluster analysis. All this can be done in SAS or R, and much of it in Excel, in a few moments by a reasonable analyst. Speaking from experience, it does not feel any more sophisticated than looking at the raw numbers, and so, because familiarity breeds neutrality, all this is seen as part of the “observation process” rather than the hypothesis-formation and testing process. (Methodological aside: plenty of people still think that observation is a theory-free process that generates unambiguous “hard facts”, or that it is possible to have observations that may involve theories but are still neutral between the theories being tested, and so yield “relative hard facts”. The word has not got out far enough.)
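For illustration, here is roughly what such a first scan might look like in Python with pandas (column names invented for the purpose). Nothing below prints a p-value, yet every correlation, group mean and filtered comparison is an implicit test, and an analyst runs dozens of them before lunch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "income": rng.normal(30_000, 8_000, n),
    "bacon_and_eggs": rng.integers(0, 2, n),
    "exercise_hours": rng.exponential(2.0, n),
    "health_score": rng.normal(50, 10, n),   # pure noise: unrelated to everything above
})

print(df.describe())                                    # the usual summary statistics
print(df.corr().round(2))                               # every pairwise correlation at once
print(df.groupby("bacon_and_eggs")["health_score"].mean())                   # do the eaters look worse?
print(df[df["age"] > 50].groupby("bacon_and_eggs")["health_score"].mean())   # ...among the older ones?
```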
These decisions about data choice and variable definition are what Gelman and Loken call the “Garden of Forking Paths”. Their point is that to get the bad result about bacon-and-eggs we took one path, but we could have taken another and not found any result at all. And if we had used all the data, we would have found nothing. The error is to present the result of the data-scanning, the walk down the Forking Path, as if the whole survey provided the evidence for it, instead of a very restricted subset of the data chosen to provide exactly that result.
The Forking Paths we take through the Garden of Data in effect create idiosyncratic populations that would never be used in a classical test, or which are so specialised that it is impossible to carry over the result to the general population. The decisions that are made almost unconsciously in that practised data scanning seem to produce evidence for a conclusion, but the probability of obtaining that evidence again is minimal. That is the key point. When the old-school statisticians did their experiments on potatoes, they could be fairly sure, based on what they knew about soil and potatoes, that the exact patch of ground they chose would not matter. Another patch would yield different results, but within the expected variations. The probability that their results would be reproducible was high. When researchers walk along a Forking Path, they risk losing reproducibility and therefore any broader relevance.
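A small simulation makes the reproducibility point (all figures invented): pick whichever subgroup of one survey shows the strongest bacon-and-eggs “effect”, then look for the same effect, in the same subgroup, in a second, independent survey.

```python
import numpy as np

rng = np.random.default_rng(3)

def survey(n=600):
    # region label, whether they eat bacon and eggs, and a health score that is pure noise
    return rng.integers(0, 6, n), rng.integers(0, 2, n), rng.normal(size=n)

def effect(region, eats, health, r):
    m = region == r
    return health[m & (eats == 1)].mean() - health[m & (eats == 0)].mean()

region1, eats1, health1 = survey()
region2, eats2, health2 = survey()

# The Forking Path: keep whichever region of survey 1 shows the biggest "effect"
best_region = max(range(6), key=lambda r: abs(effect(region1, eats1, health1, r)))

print("survey 1, chosen region:", round(effect(region1, eats1, health1, best_region), 2))
print("survey 2, same region:  ", round(effect(region2, eats2, health2, best_region), 2))  # usually near zero
```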
That’s why so many attention-grabbing results are never reproduced: the evidence lying at the end of the Forking Path was itself improbable. Nobody cheated overtly; they just chose what made a nice story but didn’t then check on the probability of the evidence itself. Practised data scanning, or a good stroll through the Garden of Forking Paths, can give you a good value for P(Nice_Story | Evidence), but P(Evidence) can be almost zero, and so the joint probability P(Nice_Story and Evidence) = P(Nice_Story | Evidence) × P(Evidence) is also nearly zero, and Nice_Story really is just a fiction.
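With made-up numbers, the arithmetic looks like this:

```python
p_story_given_evidence = 0.95   # the Nice Story fits the selected data beautifully
p_evidence = 0.02               # but the chance of meeting that exact evidence again is tiny
p_story_and_evidence = p_story_given_evidence * p_evidence
print(f"{p_story_and_evidence:.3f}")   # 0.019 -- the Nice Story is a long shot
```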
The difference between outright p-hacking and practised data scanning is subtle, but it is politically important. p-hacking is clearly dishonest, and heaven forbid pharmaceutical companies should do it. Forking Paths is just, well, an understandable temptation. Gelman and Loken stress how natural a temptation it is, as if to excuse it, but of course, if it is a natural temptation, the Virtuous Analyst will take care to resist it.
What Virtuous Analysts want to know is: how does one take a pre-existing data set and avoid the Garden of Forking Paths? Isn’t that an analyst’s job? Isn’t that why businesses have all that data? Because in amongst all that dross is the gold that will double sales and profits overnight? So suppose, as a result of a thorough stroll round the Garden, I find what my manager wants to hear: that when sales of product A increase, sales of product B decrease. Product B, of course, is his, and product A belongs to a rival in the same organisation. This result holds only in the larger stores, during periods of specific staff incentives and outside the school holidays, and those stores account for 65% of sales during those periods. Everywhere else during those times there is no relationship, and in the small stores at all times there is no relationship. That’s what I tell my manager, with all the caveats. It’s his decision whether to simplify it for the higher-ups. The Virtuous Analyst does not anticipate political or commercial decisions, but leaves that to the politicians and commercial managers.
Virtue sometimes hangs on a nuance.