Reproducible exploratory analysis - Mitigating multiplicity when mining data

Identifying meaningful patterns and relationships within noisy data is a fundamental component of neuroscience research. However, multiplicity (conducting many comparisons on the same data) can produce spurious and misleading conclusions. By contrast, overly strict corrections for multiplicity can cause us to miss real effects. In this unit, learners will see how multiplicity arises, what its impacts are, and which strategies address it.

0. Why mitigate multiplicity?

In science, it’s common to test data in multiple ways. We might use our data to test multiple theories or we might test multiple features of data collected from a novel device.

Multiplicity occurs whenever we have multiple opportunities to find a result. The more tests we perform, the more likely we are to obtain at least one apparently meaningful result purely by chance - even when no real effect exists.

For example:

A researcher tests whether a treatment affects 20 different health outcomes.
A geneticist tests thousands of genes to determine which are associated with a disease.
A researcher examines several subgroups, outcomes, and time periods, then reports the most promising result.

In each case, performing more tests creates more opportunities for chance findings.
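
How quickly chance findings accumulate can be sketched with a short calculation. The sketch below assumes the tests are independent and that every null hypothesis is true, which real data rarely satisfy exactly; the function name `familywise_error_rate` is chosen here for illustration only.

```python
# Probability of at least one false positive among m independent tests,
# each run at significance level alpha, when every null hypothesis is true.

def familywise_error_rate(m: int, alpha: float = 0.05) -> float:
    """Return 1 - (1 - alpha)**m (assumes the m tests are independent)."""
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 5, 20, 100):
    rate = familywise_error_rate(m)
    print(f"{m:>4} tests: P(at least one false positive) = {rate:.3f}")
```

Under these assumptions, the 20 health outcomes in the first example would give roughly a 64% chance of at least one p-value below 0.05, even if the treatment did nothing.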

After performing these tests, we often want to make confident statements about the results. It is therefore important to address the challenges created by multiplicity.

There are at least two common mistakes when confronting multiplicity:

Too lenient. If we do nothing about multiplicity in our results, then we might mistake spurious results occurring by chance for meaningful scientific findings.
Too strict. If we address multiplicity too aggressively, then - in our quest to avoid false positive results - we might overlook real and meaningful scientific conclusions.

Unfortunately, no one-size-fits-all solution exists to address multiplicity. The appropriate approach depends on the scientific question, the number and structure of the tests, and the consequences of false-positive and false-negative conclusions.

Our goal in this unit is to practice making decisions about multiplicity and to explore the consequences of being too lenient or too strict.

5. Rat brains either predict many things or no things?

We’ve applied two strategies to confront multiplicity in our analysis: doing nothing (Part 3) and applying the Bonferroni correction (Part 4).

The two strategies give completely different results. When we do nothing (Part 3), we conclude that neurons in the rat brain spike in association with many observed signals, including rodent behavior and stock prices. However, when we correct for multiplicity using the Bonferroni correction (Part 4), we identify no associations between signals and spikes. Neither result makes sense when compared to the existing literature.

So, now what?

Again, many strategies exist to correct for multiplicity. There’s no single “right” approach.

Our first approach (do nothing) is too lenient - we allow too many false positives. And, in doing so, we draw a ridiculous scientific conclusion: action potentials in the rat brain predict prices on the stock market.
Our second approach (Bonferroni correction) is too strict - in allowing almost no false positives, we also miss real effects. And, in doing so, we find no evidence for a well-established scientific conclusion: action potentials in the rat brain encode the rodent’s position.
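
The gap between the two approaches comes down to the threshold each p-value must beat. A minimal sketch follows, using ten p-values invented for illustration (they are not the unit's data) and a hypothetical helper named `count_significant`.

```python
def count_significant(p_values, alpha=0.05, bonferroni=False):
    """Count p-values below the significance threshold.

    With bonferroni=True the threshold is alpha divided by the number of tests.
    """
    m = len(p_values)
    threshold = alpha / m if bonferroni else alpha
    return sum(p < threshold for p in p_values)

# Hypothetical p-values, invented for illustration only.
p_values = [0.0004, 0.003, 0.012, 0.021, 0.038, 0.046, 0.21, 0.47, 0.66, 0.93]

n_uncorrected = count_significant(p_values)                  # threshold 0.05
n_bonferroni = count_significant(p_values, bonferroni=True)  # threshold 0.05 / 10
print("uncorrected:", n_uncorrected)
print("Bonferroni: ", n_bonferroni)
```

With only ten tests the Bonferroni threshold is 0.005. With thousands of tests it becomes far smaller, which is one way to see how a correction of this kind can leave no significant results at all.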

Perhaps we can find an intermediate approach that allows neither too many nor too few false positives …

In what follows, we’ll implement alternatives to our choices above, and see how these choices to address multiplicity impact our results.

8. Conclusions

Exploratory analysis allows us to discover patterns in noisy data. But, when we test many possible relationships, we create a rigor problem: some results will appear significant by chance. In our example dataset, 20,000 tests at \(\alpha = 0.05\) would produce approximately 1,000 false positives (20,000 × 0.05) if all of the null hypotheses were true.
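
The expected count of false positives can be checked with a small simulation. This is a sketch under one assumption: when a null hypothesis is true, a continuous p-value is uniformly distributed on [0, 1], so a uniform random number stands in for running a real test. The function name and seed are illustrative choices, not part of the unit's analysis.

```python
import random

def simulate_null_tests(n_tests=20_000, alpha=0.05, seed=0):
    """Count simulated p-values below alpha when every null hypothesis is true.

    Each p-value is drawn uniformly from [0, 1], the distribution of a
    continuous p-value under a true null hypothesis.
    """
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(n_tests)]
    return sum(p < alpha for p in p_values)

n_false_positives = simulate_null_tests()
print("expected false positives: ", 20_000 * 0.05)
print("simulated false positives:", n_false_positives)
```

The simulated count varies with the seed but should land near 1,000, with no real effect anywhere in the simulated data.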

We applied four approaches to address the problem of multiplicity in these data.

Do nothing: This approach was too lenient. We found apparent relationships between rat brain activity and stock prices, but roughly 1,000 results with p < 0.05 are expected by chance alone, so most of the 1,104 we observed are likely spurious.
Apply the Bonferroni correction: This approach controlled the probability of making any false-positive claim, but it was too strict for these data. We found no significant associations, including no evidence for the well-established relationship between neural activity and rodent position.
Control the False Discovery Rate: The Benjamini-Hochberg procedure accepts a controlled proportion of false discoveries in exchange for greater power. It recovered associations between neural activity and rodent position and none with stock prices.
Split the data: The first half of the data generated candidate relationships, and the second half tested them. This separation of exploration from validation left three significant associations, all involving measures of rodent position.
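
The Benjamini-Hochberg procedure listed above follows three steps: rank the p-values, compute a critical value for each rank, and compare each p-value to its critical value. A minimal pure-Python sketch is shown below. The function name `benjamini_hochberg`, the FDR level `q`, and the ten p-values are all invented for illustration and are not the unit's own code or data.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the null is rejected at FDR level q."""
    m = len(p_values)
    # Step 1: rank the p-values from smallest (rank 1) to largest (rank m).
    order = sorted(range(m), key=lambda i: p_values[i])
    # Steps 2-3: the critical value for rank k is (k / m) * q; find the
    # largest rank whose p-value is at or below its critical value.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_k = rank
    # Reject every hypothesis ranked at or below largest_k.
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values, invented for illustration only.
p_values = [0.0004, 0.003, 0.012, 0.021, 0.038, 0.046, 0.21, 0.47, 0.66, 0.93]
rejected = benjamini_hochberg(p_values)
print("BH rejections:", sum(rejected))
```

For these ten invented values the procedure rejects three hypotheses. An uncorrected 0.05 threshold would flag six of them, and a Bonferroni threshold of 0.05 / 10 would flag two, so the result sits between the lenient and strict extremes.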

The rat-brain stock-market study was intentionally tongue-in-cheek. Its ridiculous conclusion nevertheless illustrates a real threat to rigor: if we search enough unrelated signals and report only the most interesting results, chance associations can tell a compelling story. A rigorous report should make the exploratory nature of the analysis clear, describe how many relationships were tested, report nonsignificant as well as significant results, and explain how multiplicity was addressed.

The early rodent-position study sits in a different scientific context. From a modern perspective, selecting 8 neurons from 76 and reporting a handful of relationships raises reasonable questions: How were those neurons and behaviors selected? How many relationships were examined? Yet that exploratory work contributed to a finding that was supported by converging evidence and eventually recognized with a Nobel Prize. The lesson is not that exploratory analysis is unscientific. Exploratory results can generate transformative hypotheses, but confidence grows when those results are transparently reported, replicated, and supported by independent evidence.

Correcting for multiplicity is therefore one part of rigorous scientific inference, not a complete solution. Other strategies developed across these units also matter: choose an adequate sample size, interpret p-values in context, and repeatedly check and refine the inference model.
