The reproducibility crisis! It’s shaking the very foundations of the ivory tower. Reportedly the psychology wing is already in rubble. Medical researchers are having to glue their microscopes to their benches. But down in our dusty corner of the basement, the archaeologists don’t appear to have even noticed. Why not?
The reproducibility crisis
Over-extended metaphors aside, a lot of people are talking about reproducibility in science right now. Although I’m sure there have always been murmurings, the extent of the problem was exposed in a 2005 paper in PLOS Medicine, Why Most Published Research Findings Are False. At the crux of the matter is the p value, a figure returned by a statistical test which, to explain it in terms my high school maths teacher would summarily execute me for, is the probability that your “statistically significant” finding is actually wrong. The threshold p value used in most sciences that aren’t physics is 0.05, so you would expect just 1 in 20 published results to be wrong, right? Wrong. In the PLOS Medicine paper, “metascientist” John Ioannidis pointed out that if you take into account the cumulative effect of inadequate replication, failure to publish negative results, bias in selecting hypotheses to test in the first place, and ‘p-hacking’ (wrangling your data until you get a significant result), then simple probability theory tells you that “it is more likely for a research claim to be false than true”. The problem doesn’t go away if you use a stricter threshold of significance, either, even physicists’ “5 sigma” gold standard (≈0.0000003), as explained in this video.
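The “simple probability theory” behind Ioannidis’ claim is just Bayes’ rule. Here is a minimal sketch of the calculation (the numbers are illustrative, not from the paper, and it ignores the bias terms Ioannidis also models): if only a small fraction of the hypotheses a field tests are actually true, then even well-powered tests at p < 0.05 produce more false positives than true ones.

```python
def ppv(prior: float, power: float = 0.8, alpha: float = 0.05) -> float:
    """Probability that a 'significant' result reflects a true effect,
    given the pre-study probability (prior) that the hypothesis is true.

    true_pos:  true hypotheses correctly detected (prior * power)
    false_pos: null hypotheses wrongly flagged    ((1 - prior) * alpha)
    """
    true_pos = prior * power
    false_pos = (1 - prior) * alpha
    return true_pos / (true_pos + false_pos)

# Illustrative priors: a field testing mostly safe bets vs. long shots.
for prior in (0.5, 0.1, 0.05):
    print(f"prior={prior:<4}  P(true | significant) = {ppv(prior):.2f}")
# With a prior of 0.05, P(true | significant) drops below 0.5:
# a significant claim is more likely to be false than true.
```

The punchline is that the 0.05 threshold only guarantees a 1-in-20 error rate among tests of *true null* hypotheses; how trustworthy a published “significant” result is depends just as much on how plausible the hypotheses were to begin with.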
Worrying, right? But it was all statistical arcana until somebody collected some data. Then last year, a group of psychologists reported that of 100 landmark studies in psychology, only 39% could be replicated. Inevitably, there have been questions about whether that result is itself reproducible (and so on), but it is clear that the concerns raised by John Ioannidis in 2005 have at least some substance to them. The problem appears to be particularly acute in fields like psychology and medical research, which rely so heavily on the aggregation of “significant” findings from experimental research, but it is by no means limited to them. (And perhaps they were just the first to stick their head up and take a good hard look at their methods). A survey of 1500 scientists conducted by Nature found that 52% thought there was a “significant crisis” of reproducibility. It has led to calls to fundamentally rethink the way science is done, by finding ways to incentivize replications of previously published studies, report negative results, and generally do more rigorous, transparent and statistically robust research.
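The p-hacking mechanism mentioned above is easy to demonstrate with a toy simulation (my own sketch, not from any of the papers discussed): if you test twenty independent hypotheses where there is genuinely nothing to find, the chance that *at least one* comes out “significant” at p < 0.05 is roughly 64%, not 5%.

```python
import random

random.seed(1)

def hacked_significance(n_tests: int, alpha: float = 0.05) -> bool:
    """Run n_tests independent tests of true null hypotheses
    (under the null, p values are uniform on [0, 1]) and declare
    'significance' if ANY of them falls below alpha."""
    return any(random.random() < alpha for _ in range(n_tests))

trials = 100_000
hits = sum(hacked_significance(20) for _ in range(trials))
print(f"simulated rate of >=1 false positive: {hits / trials:.3f}")
print(f"analytic rate, 1 - 0.95**20:          {1 - 0.95**20:.3f}")
```

Swapping outcome measures, subgroups, or model specifications until something crosses the threshold is statistically equivalent to running these extra tests, which is why a single significant result from a flexible analysis is weak evidence on its own.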
Yet despite the crisis’s apocalyptic proportions, archaeologists have been largely absent from the conversation about reproducibility in science. Admittedly, all the talk of experimental design, statistical power and p-values is hard to relate to. We don’t do experiments, we do excavations. And seeing a p-value in a paper at all is a sign of above-average statistical literacy for the discipline. Is archaeology reproducible at all? It is easy to dismiss the whole thing as a problem of experimental science. We archaeologists do things differently.
I’d argue that this attitude misses two important points. First, while archaeology is not an experimental science, a lot of what archaeologists do is reproducible, and therefore affected by all the problems being discussed in the wider world of science. Second, while the basic building blocks of archaeological knowledge are non-replicable observations, this does not mean we are immune to the reproducibility crisis. Rather, we should be asking ourselves, if disciplines that can interrogate their subjects again and again are getting it so wrong, how can we, who only have one shot at it, express a credible claim to knowledge?
Is archaeology reproducible?
When I was a teenager I was very keen on the scientific method. I’d read my Popper, knew that the one true path to knowledge was through the careful deduction of falsifiable hypotheses from theory, the derivation of testable predictions from hypotheses, and the subjecting of predictions to controlled and decisive experiments. It all fell apart when I started studying archaeology.
The stuff of archaeology (landscapes, sites, assemblages) is a unique and finite record of the past. That doesn’t mean we can’t be scientific, just not in the neat, hypothetico-deductive mold so fervently extolled by bright-eyed physics students. It’s hard to come up with testable predictions for the field when you have no idea what you’re going to find there. Controlled experiments are a non-starter because, as excellently put by Roger Peng, the stuff we study is “generally reluctant to be controlled by human beings”. Archaeology, like geology or astronomy, is an observational science, not an experimental one.
But archaeological research is not just a list of sites and artefacts. In order to extract understanding from our irreproducible corpus of material, we subject it to an extraordinary range of analytical methods, most of which are replicable. This is where archaeologists have been most active in promoting reproducibility, as part of a larger trend towards open science. Repositories like the ADS and journals like JOAD facilitate the sharing of “raw data” along with results. In archaeological computing and statistics, there is a push to use scripting languages like R and Python, rather than opaque graphical software, and open source over proprietary tools, to achieve reproducible analyses. And access to physical material is made easier through digital tools like MorphoSource.
These are all encouraging trends, but they lack the urgency of a “crisis”. We’re trying to make our research reproducible in principle, but is any replication actually being done? The recent tumult in psychology (a discipline that is leagues ahead of archaeology in terms of open science and statistical rigour) shows that it is not enough simply to adhere to the principles of reproducible research design; they must be put into practice.
Although many aspects of archaeological research can be repeated in this way, ultimately we are at the mercy of the finite amount of human rubbish that we can pull out of the ground. No matter how many times we measure, sample, analyse and crunch the numbers on a particular assemblage, if the assemblage itself was anomalous it will always give us an anomalous result. We can’t rerun the history that produced that material, or even the process through which we obtained it. So how can we obtain reproducible results from non-replicable observations?
A good place to start would be to look at other observational sciences. Roger Peng’s explanation of the reproducibility crisis chimed with me for that reason:
The replication crisis in science is concentrated in areas where (1) there is a tradition of controlled experimentation and (2) there is relatively little basic theory underpinning the field.
The basic idea is that psychology and medical science are particularly affected by the reproducibility crisis because they are both experimentation-heavy and lack a strong basic theory. Other fields, which have only one of those qualities or neither, are less affected:
On the bottom of Peng’s chart are fields with a unified and well-defined underlying theory, like physics and astronomy. This harks back to Ioannidis’ 2005 paper: basic theory reduces the field of possible hypotheses to test, meaning that if one is fishing for significant p values, it is from a drastically smaller pool, and one that is much more likely to contain true propositions. So far, so obvious – but few sciences are on as rock-solid a footing as physics, and archaeology is certainly not one of them!
But Peng also describes his own field, epidemiology, in the top left of the chart, as having largely been spared the reproducibility crisis, even though it is as theoretically rudderless as medicine or psychology (or archaeology). His contention is that reliance on experimentation breeds an unreasonable expectation that the results of single experiments (if well-designed and statistically validated) are true. Meanwhile, observational sciences, which are accustomed to the fact that they have little control over their observations and that single results may well be wrong, have already learned the core lesson of the reproducibility crisis: don’t trust a result that hasn’t been replicated.
This seems promising: is archaeology, as an observational science, also immune to the reproducibility crisis? It got me thinking about whether we have a tradition of “observational scepticism” in archaeology. That is, do we hold off on accepting an observation as reliable until we have seen similar observations from multiple places? I’m not sure what the answer is. Certainly I think there is a tendency for archaeologists to construct narratives of prehistory from the tales of certain iconic and well-studied sites, which may well be leading us astray. At the same time, we have a long tradition of regional synthesis and typological thinking, which should counteract the effect of some anomalous results.
Archaeology and the reproducibility crisis
Either way, I think Peng’s take on the reproducibility crisis has a lot of relevance for archaeologists. We don’t have experiments we can rerun and we won’t have an underlying basic theory any time soon. Putting more impetus behind the drive towards open science and reproducible analyses can only be a good thing; but if we are going to preserve our corner of the basement from the reproducibility crisis, we need to become (or emphasise that we already are) observational sceptics.