Robert P Crease discusses readers’ responses to his “experiment” on the nature of discovery and statistics
In June I asked readers to collaborate in an “experiment” about scientific discovery as it happens. The discovery concerned dark matter, the yet-to-be-detected, invisible substance that, researchers are convinced, makes up more than 80% of the matter in the universe. My experiment was prompted by a paper submitted last December to arXiv by members of the Cryogenic Dark Matter Search (CDMS-II). That paper announced a finding of two events, compared with 0.5 expected from background, with a confidence level of about 1.3σ or 21% (arXiv:0912.3592v1).
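To see why no-one was popping champagne corks, it helps to run the numbers. Below is a minimal back-of-the-envelope sketch in Python – not the collaboration’s actual analysis, which also has to fold in the uncertainty on the background estimate itself (one reason the quoted figure is larger than this naive one). The helper poisson_tail is hypothetical, written for illustration: treat the background as a Poisson process with a fixed mean of 0.5 and ask how often it would produce two or more events by chance alone.

```python
from math import exp, factorial

def poisson_tail(n_obs, mu):
    """Probability of seeing n_obs or more events from a Poisson
    background with fixed mean mu (illustrative helper)."""
    return 1.0 - sum(exp(-mu) * mu**k / factorial(k) for k in range(n_obs))

# Two events seen against 0.5 expected from background:
print(f"{poisson_tail(2, 0.5):.3f}")  # ~0.09: pure background does this roughly 9% of the time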
No-one, inside or outside the CDMS-II collaboration, considered this a “discovery”, even though all were sure that it – or another experiment – would eventually find dark matter. I therefore asked readers three questions. First, what would count as a discovery of dark matter? Second, what should we call the CDMS-II findings, assuming they are genuine dark-matter events? Third, what other findings in physics did readers know of that had grown into discoveries – or non-discoveries – thanks to more statistics?
The several dozen responses, sent both directly to me and via the online version of my column, were heated and illuminating.
The DAMA effect
Paul Grannis, a physicist colleague of mine at Stony Brook, gave an apparently straightforward answer: “When we see 3σ, we call it evidence; when we see 5σ, we call it a discovery.” Indeed, 5σ seems to be the rule of thumb among physicists for the confidence level required for a discovery. The convention originated in the mid-1990s, when evidence for the top quark accumulated in the data of two teams at Fermilab. In 1995 the two teams jointly announced the discovery with a confidence level of 5σ, promptly convincing the research community and establishing 5σ as the benchmark.
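For readers who want the dictionary between sigmas and probabilities: under the standard assumption of Gaussian errors, an n-sigma result corresponds to the chance that a background fluctuation alone produces an effect at least that large. A quick sketch using scipy, taking the one-sided tail that is conventional for discovery claims:

```python
from scipy.stats import norm

for n_sigma in (3, 5):
    p = norm.sf(n_sigma)  # one-sided Gaussian tail probability
    print(f"{n_sigma} sigma: p = {p:.2e}")

# 3 sigma: p = 1.35e-03  (about 1 in 700)
# 5 sigma: p = 2.87e-07  (about 1 in 3.5 million)
```

In other words, “evidence” allows background to fake the result about once in 700 tries, while “discovery” demands odds of worse than one in three million.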
Ever since, the role of statistics, and the justification of the 5σ confidence level, have been major topics in particle physics. For example, Oxford University physicist Louis Lyons organized a workshop on the subject at CERN in 2007, and he plans another one on the same topic at CERN early next year. The workshop proceedings reveal just how rich and sophisticated a field the application of statistics to physics has become (see the webpage).
In 2008 Lyons wrote an article, “Open statistical issues in particle physics” (arXiv:0811.1663v1), that included a section entitled “Why 5σ?”. While statisticians invariably say that being so stringent is overkill, Lyons writes, there are several good reasons for it. One is past experience. As he points out, “we have all too often seen interesting effects at the 3σ or 4σ level go away as more data are collected”. A second is the “look elsewhere” effect: the decisions you make in sorting the data into “bins” in a histogram may serve to concentrate fluctuations, meaning that “the chance of a 5σ fluctuation occurring somewhere in the data is much larger than it might at first appear”. Finally, physicists worry that some systematic effect may have been underestimated or even missed altogether.
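Lyons’s “look elsewhere” effect is easy to demonstrate with a toy Monte Carlo. The sketch below is an illustration only, with arbitrary parameters (1000 bins, 10 000 pseudo-experiments, a 3σ threshold): fill a histogram with pure Gaussian noise, repeat the experiment many times, and ask how often the most extreme bin clears 3σ somewhere.

```python
import numpy as np

rng = np.random.default_rng(1)
n_bins, n_trials, threshold = 1000, 10_000, 3.0

# Pure noise: no signal in any bin, in any pseudo-experiment
noise = rng.standard_normal((n_trials, n_bins))

# Fraction of experiments in which *some* bin fluctuates above threshold
p_global = (noise.max(axis=1) >= threshold).mean()
print(f"P(any of {n_bins} bins >= {threshold} sigma) ~ {p_global:.2f}")  # ~0.74
print("P(one pre-chosen bin >= 3 sigma)  = 1.35e-03")  # the "local" p-value
```

A fluctuation that would be a one-in-700 rarity in a single pre-specified bin shows up somewhere in the histogram nearly three times out of four.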
Nevertheless, 5σ is essentially arbitrary: many discoveries have been accepted at considerably lower significance, while some claims have not been accepted even at higher significance. The classic recent instance, numerous respondents reminded me, is the still-disputed claim, made several years ago by the DAMA/LIBRA experiment at the Gran Sasso National Laboratory in Italy, of evidence for the presence of dark-matter particles in the galactic halo at a confidence level of 8.2σ (arXiv:0804.2741). No-one doubts that DAMA has seen something. But the fact that other experiments have not seen anything – even though they should have if DAMA did – raises doubts about DAMA’s interpretations, as does a certain chariness by the collaboration about sharing information. The “DAMA effect” underscores that statistics alone do not make a discovery.
One factor is that the translation of sigma into a probability often involves the assumption of a normal distribution of errors. “It is by no means clear how to justify this assumption in many cases,” Charles Jenkins from CSIRO in Canberra, Australia, told me. “And it is certainly not clear that the assumption of normality applies so far out in the wings of the distribution. If we had enough data to draw a histogram and verify the error distribution out to 5σ, we probably wouldn’t be bothering with statistics!” Scientists insist on this apparently extreme level of significance, Jenkins continued, as “an insurance policy against the multitude of error sources that don’t average away quickly and give fat tails to the error distribution”. It is, he added, a cheap and cheerful way of dealing with the underlying issue, which is that ascribing the significance level to an observation requires an assumption about the error distribution. “One has to view a result as a package,” he wrote, “where the statistical interpretation is one of the things that may be wrong. As a fellow student of mine once asked in a seminar, ‘What are the errors on your errors?’.”
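Jenkins’s worry about fat tails is easy to quantify. A minimal sketch, using a Student-t distribution with three degrees of freedom as an arbitrary stand-in for an error distribution with “errors on the errors”: compare how often a 5-unit deviation occurs under each model.

```python
from scipy.stats import norm, t

x = 5.0
p_gauss = norm.sf(x)   # Gaussian tail beyond 5
p_fat = t.sf(x, df=3)  # heavy-tailed Student-t, 3 degrees of freedom
print(f"Gaussian:  {p_gauss:.1e}")  # ~2.9e-07
print(f"Student-t: {p_fat:.1e}")    # ~7.8e-03, over 10,000 times likelier
```

If the tails are even modestly fatter than Gaussian, the comforting one-in-three-million reading of a 5σ result evaporates, which is exactly why a result has to be viewed as a package.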
The critical point
Few respondents were excited about the CDMS-II results. “If you roll a die six times,” says astrophysicist Rafael Lang from Columbia University in the US, “would you be excited if you rolled the ‘four’ twice?”
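Lang’s analogy checks out. A quick sketch of the arithmetic, reading “twice” as “at least twice”: the binomial probability that six rolls of a fair die turn up a given face two or more times is about 26% – the same unremarkable ballpark as the CDMS-II background figure.

```python
from math import comb

p, n = 1/6, 6
# P(a chosen face appears at least twice in six rolls of a fair die)
p_two_plus = 1 - sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in (0, 1))
print(f"{p_two_plus:.2f}")  # ~0.26: not worth getting excited about
```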
Nearly everyone I spoke to had tales – many well known – of signals that went away, some at 3σ: proton decay, monopoles, the pentaquark, an excess of high-transverse-momentum jets at Fermilab. Several people reminded me of one case from the story of dark energy – what is believed to be causing the expansion of the universe to accelerate – when Saul Perlmutter and colleagues (Astrophys. J. 483 565) concluded from their first seven supernovae that the mass density (ΩM) of a flat universe was about 1 (and the cosmological constant ΩΛ ~ 0). This result was, however, 2.3σ away from their later answer, although the inclusion of more supernova data could have made a big systematic difference.
Some of these tales – such as the supernova case – were the result of statistical fluctuations. Others, however, were due to faulty analysis. The need to protect against the latter is, I think, the reason for the otherwise absurdly high confidence level. “The fact is, in high-energy-physics experiments, you sometimes find substantial systematic errors,” says Grannis. “The big fear is: how do I know I have thought of all sources of error?”