Putting the p-value in Context
Neuroscience researchers typically report p-values to express the strength of statistical evidence, but p-values on their own are not sufficient to understand the meaning and value of a scientific inference. In this unit, you will learn how to interpret the p-value, how to express the size of an effect and the uncertainty about a result, and how to interpret results at both the individual and population levels.
1 - You want to do some science; your PI just wants the p’s!
Introduction
You work in a sleep lab studying the effect of a new treatment regimen on memory consolidation during sleep.
Your lab collects an EEG biomarker of memory (sleep spindles) from N=20 human subjects.
To do so, your lab measures the power in the spindle band (9-15 Hz) twice per minute. Your lab has a reliable method to detect spindle activity; this detector is known to have small measurement errors outside of treatment. You expect it to still work during treatment, but also expect more variability in the spindle power estimates (hence more variability in the detections) during treatment.
For each subject, your lab measures spindle activity during three conditions:
Baseline: Data collection lasts 7 hours while the subject sleeps the night before the intervention. This results in 840 samples of spindle activity for each subject.
During Treatment: Data collection during a 15 minute intervention during sleep, resulting in 30 samples of spindle activity for each subject.
Post-treatment: Data collection after intervention lasts 7 hours, while the subject sleeps, resulting in 840 samples of spindle activity for each subject.
Here’s a graphical representation of the data collected from one subject:
Your PI says: “I hypothesize that some subjects will show an increase in spindle activity as a result of this treatment. Other subjects may not respond to the treatment. Conduct a hypothesis test for each subject to determine if they are responsive and report the p-values associated with each test.”
Fundamentally, p-values indicate how incompatible a data set is with a specified statistical model, but they do not express the probability that any scientific hypothesis is correct, or whether one hypothesis is more likely true than another. P-values can be part of a strong statistical argument, but on their own they do not provide a robust measure of evidence about a hypothesis. In particular, a p-value needs to be paired with a measure of effect size to describe whether an effect is scientifically meaningful.
Another important issue is that p-values are often only meaningful in cases where the scientific question has a binary (yes/no) answer. Most scientific questions require more than yes-or-no answers, but it is common to see researchers try to shoehorn their experiments into producing binary outcomes just so they can express their results using p-values. This risks throwing away useful information and decreasing the statistical power of an argument.
Based on your understanding of how to interpret p-values, are there any concerns about your PI's analysis plan to report only p-values for separate tests conducted on each subject? What other approaches could you use to evaluate the effect of treatment in the subject population?
At this point, perhaps you feel that we should just get rid of p-values entirely. Good idea! Many researchers and statisticians agree with you. But despite multiple organized efforts to downplay the use of p-values in scientific research, a focus on computing and reporting p-values has persisted.
You have a detailed discussion with your PI about the issues with focusing on p-values for this study, but your PI says: “We’re not going to be able to publish anything unless we show statistical significance so just give me the p’s!”
2 - Let’s do it: Define & compute p-values.
Before we compute p-values, let’s consider what a p-value means.
What does a p-value mean?
A p-value is used to compare two competing hypotheses. If our scientific hypothesis is that spindle activity changes during treatment relative to its baseline level of activity, we need another hypothesis to compare this to. In this case, we can hypothesize that the spindle activity does not change during treatment. This is called the null hypothesis.
Our goal is to collect data that provides evidence in favor of our scientific hypothesis over the null hypothesis. But this is not a fair fight; we start by assuming that the null hypothesis is true and only reject it after we achieve a sufficiently high bar of evidence. The p-value tells us how high a bar we have achieved.
One useful analogy is proof by contradiction. There, we assume that a hypothesis is true and show that this assumption leads to a contradiction. If we were to observe data that could not possibly occur if the null hypothesis were true, this would be definitive evidence against that hypothesis. However, it is not the case that if we observe data that is unlikely under the null hypothesis, then the null hypothesis is itself unlikely.
For example, most people would agree with the following statement, “If a person is American, they probably are not the US President.” Now imagine that we select an individual at random and they happen to be the US President. It is clearly not the case that this individual is probably not American. While the observation that this individual is the US president is unlikely under the null hypothesis that this individual is American, it is much more unlikely (or impossible) under the alternate hypothesis that they are not American.
Another useful analogy is to a prosecutor at a trial. In this analogy, the null hypothesis is akin to the hypothesis that the defendant is innocent. The court assumes that the defendant is innocent until proven guilty. The prosecutor tries to amass and present evidence to demonstrate that the defendant is guilty beyond a reasonable doubt. A strong argument needs to include evidence that would be unlikely to occur if the hypothesis that the defendant is innocent is true, and more likely to occur if the hypothesis that the defendant is guilty is true. If the prosecutor fails to provide sufficient evidence that the defendant is guilty, it doesn’t necessarily mean that the defendant is innocent.
In a statistical test, the p-value indicates how surprising our evidence would be if the null hypothesis were true. For our problem, if we’re sufficiently surprised by the observed data, then we’ll reject the null hypothesis, and conclude that we have evidence that the spindle activity changes relative to baseline.
Alternatively, if we’re not surprised by the observed data, then we’ll conclude that we lack sufficient evidence to reject the null hypothesis. There’s an important subtlety here that statisticians like to point out - when we’re testing this way, we never accept the null hypothesis. Instead, the best we can do is talk like a statistician and say things like “We fail to reject the null hypothesis”. In our court analogy, this is equivalent to finding the defendant not guilty rather than innocent, because we realize that it is possible that the defendant committed the crime but we lacked the evidence to convince a jury beyond a reasonable doubt.
Multiple factors impact the evidence we have to reject a null hypothesis. In this Unit, we’ll explore these factors and how they influence the p-values we compute.
What does p<0.05 mean?
The probability of observing the data, or something more extreme, under the null hypothesis is less than 5%. This is typically considered sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis (which posits that there is an effect or a difference). In other words, a p-value less than 0.05 suggests that the observed data is unlikely to have occurred by random chance alone, assuming the null hypothesis is true, leading researchers to reject the null hypothesis.
In our case, the null hypothesis we will first investigate is:
Null hypothesis: No difference in average spindle activity between treatment and baseline conditions.
Now, let’s load the spindle data and compute p-values to test our null hypothesis.
Let’s start by investigating the structure of the data.
All three variables consist of observations from 20 subjects (the number of columns).
During baseline: We collect 840 samples per subject.
During treatment: We collect 30 samples per subject.
After treatment: We collect 840 samples per subject.
The number of samples is the number of rows for each variable.
You might think of these variables as rectangles (or matrices) - each row indicates a sample of spindle activity, and each column indicates a subject:
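Since the course data files aren’t shown here, here’s a minimal sketch of this layout using synthetic stand-in values; the variable names `baseline`, `treatment`, and `post_treatment` are assumptions, not the course’s actual identifiers:

```python
# A sketch of the data layout with synthetic stand-in values (assumed names).
import numpy as np

rng = np.random.default_rng(seed=0)
n_subjects = 20

baseline       = rng.normal(0, 1, size=(840, n_subjects))  # 840 samples x 20 subjects
treatment      = rng.normal(0, 1, size=(30,  n_subjects))  # 30 samples  x 20 subjects
post_treatment = rng.normal(0, 1, size=(840, n_subjects))  # 840 samples x 20 subjects

print(baseline.shape, treatment.shape, post_treatment.shape)  # (840, 20) (30, 20) (840, 20)
```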
To get a sense for the data, let’s plot the spindle activity during `baseline`, `treatment`, and `post-treatment` conditions for one subject:
Here, the spindle activity has been z-scored during each recording interval relative to baseline.
So, the values we observe indicate changes relative to the mean baseline spindle activity.
Positive (negative) values indicate increases (decreases) in spindle activity relative to the baseline activity.
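As a concrete example, here’s one reasonable way to implement this z-scoring convention (a sketch, assuming the arrays from above shaped samples × subjects; the course materials may implement this differently):

```python
# Sketch: z-score each interval against per-subject baseline statistics, so
# values express changes relative to mean baseline spindle activity.
import numpy as np

def zscore_to_baseline(x, baseline):
    """Z-score x (samples x subjects) using each subject's baseline mean and std."""
    mu = baseline.mean(axis=0)  # per-subject baseline mean
    sd = baseline.std(axis=0)   # per-subject baseline standard deviation
    return (x - mu) / sd

treatment_z      = zscore_to_baseline(treatment, baseline)
post_treatment_z = zscore_to_baseline(post_treatment, baseline)
baseline_z       = zscore_to_baseline(baseline, baseline)  # per-subject mean 0 by construction
```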
Null hypothesis: No difference in average spindle activity between treatment and baseline conditions.
Is there a significant effect of treatment? Let’s now compute some p-values.
To do so, we again assume the null hypothesis:
Null hypothesis: No difference in average spindle activity between treatment and baseline conditions.
To test this hypothesis, we’ll compute a two-sample t-test.
The two-sample t-test is by far the most popular choice when comparing two distributions.
This method is used to determine whether there’s a significant difference between the means of two independent groups. It’s commonly applied to compare the average values of (continuous) variables across two different populations or conditions.
In our case, we’d like to compare the average spindle activity between treatment and baseline conditions. So, the two-sample t-test is (at first glance) a completely reasonable approach.
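Here’s a minimal sketch of one test per subject, assuming the z-scored arrays and `n_subjects` from the earlier sketches:

```python
# Sketch: one two-sample t-test per subject, comparing treatment vs. baseline.
from scipy.stats import ttest_ind

pvals = []
for s in range(n_subjects):
    t_stat, p = ttest_ind(treatment_z[:, s], baseline_z[:, s])
    pvals.append(p)
```

Note that `ttest_ind` assumes equal variances by default; given the extra variability expected during treatment, Welch’s variant (`equal_var=False`) may be more defensible.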
The list above consists of 20 p-values, one for each subject.
Each p-value indicates the probability of observing the data, or something more extreme, under the null hypothesis:
Null hypothesis: No difference in average spindle activity between treatment and baseline conditions.
Let’s print the p-values for each subject:
Let’s also plot the p-values:
Do we have evidence to reject the null hypothesis?
Maybe … if we had performed just one statistical test, we would typically reject the null hypothesis if
p < 0.05
But here we compute 20 tests (one for each subject).
When we perform multiple tests, it’s important we consider the impact of multiple comparisons. We cover this topic in detail in the Multiplicity Unit.
Here we’ll choose a specific approach to deal with multiplicity: we’ll apply a Bonferroni correction. The Bonferroni correction adjusts the significance level by dividing the desired overall significance level (here 0.05) by the number of tests performed (here 20 tests, one per subject). Doing so reduces the risk of false positives (Type I errors); for more information, see the Multiplicity Unit.
So, for our analysis of the p-values from 20 subjects, let’s compare the p-values to a stricter threshold of
p < 0.05 / 20 = 0.0025
Thresholding in this way provides a binary, yes/no answer to the question: do we have evidence that the spindle activity during treatment differs from 0?
Let’s plot the p-values versus this new threshold.
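A minimal plotting sketch, assuming the `pvals` list from the earlier sketch:

```python
# Sketch: plot per-subject p-values against the Bonferroni-corrected threshold.
import numpy as np
import matplotlib.pyplot as plt

threshold = 0.05 / 20  # Bonferroni-corrected threshold = 0.0025

plt.semilogy(np.arange(1, 21), pvals, 'o', label='p-value per subject')
plt.axhline(threshold, color='red', linestyle='--', label='0.05 / 20')
plt.xlabel('Subject')
plt.ylabel('p-value (log scale)')
plt.legend()
plt.show()

print('Significant after correction:', sum(p < threshold for p in pvals), 'of 20')
```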
Summary:
We’ve computed a p-value for each subject. After correcting for multiple comparisons, our initial results provide no evidence to reject the null hypothesis (of no difference in spindle activity between baseline and treatment).
Mini Summary & Review
We sought to answer the scientific question:
- Does the spindle activity during `treatment` differ from the `baseline` spindle activity?
To answer this question, we assumed a null hypothesis:
Null hypothesis: No difference in average spindle activity between treatment and baseline conditions.
We tested this null hypothesis for each subject, computing a p-value for each subject.
Because we computed 20 p-values (one for each subject), we corrected for multiple comparisons using a Bonferroni correction (see Multiplicity Unit).
We found no p-values small enough to reject the null hypothesis.
In other words, using our initial approach, we found no evidence that the spindle activity during `treatment` differs from `baseline`.
3 - Maybe there’s something else we can publish?
Our initial results are discouraging; we find no evidence of a change in spindle activity from baseline during treatment.
That’s disappointing. Rather than abandon our data (which took years to collect), our PI asks us to continue exploring the data.
Data exploration is common in neuroscience. In general, as practicing neuroscientists, we explore our data for interesting features.
However, when undertaking data exploration, we must make our exploration transparent (e.g., by reporting everything we explored, whether the results were significant or not).
Our PI recommends that we examine the change in spindle activity `post-treatment`. Perhaps the treatment produces a longer-term effect that manifests during the `post-treatment` period.
Exploratory vs Confirmatory Analyses & Guarding Against p-Hacking
Exploring the data for unexpected patterns is perfectly legitimate, as long as we remain transparent about what we tried.
Data exploration helps generate new hypotheses. We might notice trends, outliers, or condition-specific features that suggest where real effects could lie, to help guide future experiments.
P-hacking occurs when we repeatedly mine the data—trying different subsets, covariates, or outcomes—until something “significant” emerges. This inflates false positives and misleads follow-up studies.
To stay honest, every exploratory analysis must be clearly labeled as such. We should report exactly what we tested (e.g., “we examined spindle rate in the baseline, treatment, and post-treatment intervals”), and include both significant and nonsignificant findings.
Confirmatory analysis comes later: once exploration suggests a specific hypothesis (for example, an increase in post-treatment spindle rates), we pre-register that test or validate it in a fresh dataset. Only then do p-values carry their usual weight.
Our next step—examining post-treatment spindle activity—serves as a bridge. We explore here, but plan to follow up with a dedicated, confirmatory protocol before drawing firm conclusions.
For our new analysis, our null hypothesis now focuses on the `post-treatment` data:
Null hypothesis: No difference in mean spindle activity between `post-treatment` and `baseline` conditions.
(Select all that apply) During the `post-treatment` condition:
- we have many more samples to analyze (N=840) compared to the `treatment` condition (N=30).
- the spindle estimates are less noisy compared to the `treatment` condition.
Let’s repeat our previous analysis, but now examine the `post-treatment` spindle activity.
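A sketch of the repeated analysis, assuming the same arrays as in the earlier sketches:

```python
# Sketch: the same per-subject test, now comparing post-treatment vs. baseline.
from scipy.stats import ttest_ind

pvals_post = []
for s in range(n_subjects):
    t_stat, p = ttest_ind(post_treatment_z[:, s], baseline_z[:, s])
    pvals_post.append(p)
```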
Let’s print and plot the p-values for each subject:
Because we’ve computed 20 p-values (one for each subject), let’s again correct for multiple comparisons using a Bonferroni correction (see Multiplicity Unit).
Look at how small the p-values are `post-treatment`!
- All 20 p-values `post-treatment` are less than 0.05/20, the Bonferroni-corrected p-value threshold.
Remember that, during `treatment`, the p-values are much larger, and we find no p-values less than 0.05/20.
We find many more significant p-values `post-treatment` (20 out of 20, after Bonferroni correction).
Our results seem to reveal a new conclusion:
- In Mini 2, we found no evidence of a change in spindle activity during `treatment`.
- In this Mini, we find many very small p-values (less than 0.05/20) `post-treatment`.
More specifically, after Bonferroni correction, we find evidence of a significant change in spindle activity `post-treatment` in all 20 subjects.
The PI is very excited by our new results, which appear to upend the literature.
The PI drafts the title for a high-impact paper:
Draft paper title: Post-Treatment Paradox: Clear Human Responses, Despite Absence of Treatment Effect
But are we sure?
This is a very important question … and we haven’t fully answered it yet.
We collect many more samples `post-treatment`, and our measurements are more accurate `post-treatment` compared to during `treatment`.
Both of these features impact the evidence we collect to reject the null hypothesis.
So, are you sure about the `post-treatment` results?
Alert:
- Wait, I’m not so sure …
- Why did you ask me to review the characteristics of the data, and think about how this might impact the results?
Mini Summary & Review
We sought to answer the scientific question:
- Does the spindle activity `post-treatment` differ from the `baseline` spindle activity?
To answer this question, we assumed a null hypothesis:
Null hypothesis: No difference in average spindle activity between post-treatment and baseline conditions.
We tested this null hypothesis for each subject, computing a p-value for each subject.
Because we computed 20 p-values (one for each subject), we corrected for multiple comparisons using a Bonferroni correction (see Multiplicity Unit).
We found all of the p-values were small enough to reject the null hypothesis.
In other words, in this exploratory analysis, we found evidence that the spindle activity `post-treatment` differs from `baseline`.
This differs from our results during `treatment`, in which we found no evidence that the spindle activity differs from `baseline`.
4 - Not so fast: visualize the measured data, always.
In our previous analysis, we may have found an interesting result: spindle activity `post-treatment`, but not during `treatment`, differs from `baseline`.
Scientifically, we might conclude that our treatment has a long-lasting effect, impacting spindle activity `post-treatment`.
To reach these conclusions, we computed and compared p-values, testing specific null hypotheses for each subject.
We’ve hinted above that something isn’t right … let’s now dive in and identify what we could have done differently.
Our initial approach has focused exclusively on p-values.
P-values indicate how much evidence we have to reject a null hypothesis given the data we observe.
Let’s again plot the p-values during `treatment` and `post-treatment` for each subject:
We’ve focused on p-values to draw our scientific conclusions.
However, we’ve almost completely ignored the spindle measurements themselves!
Let’s return to the spindle activity measurements themselves, and see how these measurements relate to the p-values.
Let’s begin with an example from Subject 6.
For Subject 6, we found:
- `treatment`: p = 0.033
- `post-treatment`: p = 0.0021
From these p-values, we might expect:
- Spindle activity during `treatment` near 0 (i.e., similar to `baseline`).
- Spindle activity `post-treatment` far from 0 (i.e., different from `baseline`).
But, we find the opposite.
- Spindle activity during `treatment` far from 0 (i.e., different from `baseline`).
- Spindle activity `post-treatment` near 0 (i.e., similar to `baseline`).
Let’s make similar plots for all 20 subjects.
It’s nice to visualize all of the data, but doing so can also be overwhelming.
Let’s summarize the spindle activity for each subject by plotting the mean and the standard error of the mean.
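One way to sketch this summary plot, assuming the z-scored arrays from the earlier sketches:

```python
# Sketch: per-subject mean and standard error of the mean (SEM) of spindle
# activity, for the treatment and post-treatment conditions.
import numpy as np
import matplotlib.pyplot as plt

def mean_sem(x):
    """Per-subject mean and SEM for an array shaped (samples x subjects)."""
    return x.mean(axis=0), x.std(axis=0, ddof=1) / np.sqrt(x.shape[0])

m_treat, sem_treat = mean_sem(treatment_z)
m_post,  sem_post  = mean_sem(post_treatment_z)

subjects = np.arange(1, n_subjects + 1)
plt.errorbar(subjects - 0.1, m_treat, yerr=sem_treat, fmt='o', label='treatment')
plt.errorbar(subjects + 0.1, m_post,  yerr=sem_post,  fmt='s', label='post-treatment')
plt.axhline(0, color='gray', linestyle=':')  # baseline reference (z-scored mean)
plt.xlabel('Subject')
plt.ylabel('Spindle activity (z-scored)')
plt.legend()
plt.show()
```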
Let’s summarize what we’ve found so far:
| State | p-values | spindle activity |
|---|---|---|
| During treatment | p > 0.05/20 (not significant) | mean spindle activity > 0 |
| Post-treatment | p << 0.05/20 (significant) | mean spindle activity \(\approx\) 0 |
Something’s not adding up here …
- During `treatment`, we find no evidence of a significant change in spindle activity from baseline (i.e., the p-values are big). However, looking at the mean spindle activity, we find spindle activities that often exceed 0.
- `Post-treatment`, we find evidence of a significant change in spindle activity from baseline (i.e., the p-values are small) in each subject. However, looking at the mean spindle activity, we find those values tend to appear near 0.
So, why do the spindle activities during `treatment` often exceed 0 (i.e., exceed `baseline` spindle activity), but p > 0.05?
And why are the `post-treatment` spindle activities so near 0 (i.e., so near the `baseline` spindle activity), but p << 0.05?
I’m confused!
Alert:
These confusing conclusions occur because we’ve made two common errors:
1. We compared p-values between the `treatment` and `post-treatment` groups.
2. We focused exclusively on p-values without thinking more carefully about the data used to compute those p-values.
To resolve these confusing conclusions, let’s think more carefully about what the p-value represents.
The p-value measures the strength of evidence against the null hypothesis.
Three factors can impact the strength of evidence:
- Sample size (i.e., the number of observations).
- Effect size (i.e., bigger differences in spindle activity between conditions are easier to detect).
- Variability (or noise) in measurements (i.e., how reliably we measure spindle activity).
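To build intuition for how these factors trade off, here’s a toy simulation; the numbers below are illustrative assumptions, not the course data:

```python
# Sketch: a fixed true effect can yield a large p-value with few, noisy samples,
# while a much smaller effect can yield a tiny p-value with many, clean samples.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=1)
base = rng.normal(0.0, 1.0, size=840)  # baseline-like reference

few_noisy  = rng.normal(0.5,  2.0, size=30)   # big effect, few noisy samples
many_clean = rng.normal(0.15, 1.0, size=840)  # small effect, many clean samples

print(ttest_ind(few_noisy,  base, equal_var=False).pvalue)  # typically not significant
print(ttest_ind(many_clean, base, equal_var=False).pvalue)  # typically very small
```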
How does the sample size differ during `treatment` versus `post-treatment`? How might this impact the results?
(Select all that apply)
How does the effect size differ during `treatment` versus `post-treatment`? How might this impact the results?
(Select all that apply)
How does the variability (noise) in the measurements differ during `treatment` versus `post-treatment`? How might this impact the results?
(Select all that apply)
Conclusion / Summary / Moral:
We began with the scientific statement:
“I expect during treatment that spindle activity exceeds the baseline spindle activity.”
Our initial approach focused on computing and comparing p-values.
That’s a bad idea.
We’re not interested in comparing the evidence we have for each null hypothesis (the p-value); the evidence depends on the sample size, effect size, and measurement variability.
Instead, we’re more interested in comparing the spindle activity between conditions.
In other words, we’re interested in the effect size, not the p-value.
This observation suggests a different, improved analysis path.
We can answer the same scientific question by comparing the spindle activities between conditions, not the p-values.
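For example, here’s a sketch of one common standardized effect size, Cohen’s d (an assumption here; one of several reasonable choices), computed per subject from the arrays in the earlier sketches:

```python
# Sketch: estimate the effect size per subject, instead of only the p-value.
import numpy as np

def cohens_d(x, y):
    """Cohen's d for two independent samples x and y."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

d_treat = [cohens_d(treatment_z[:, s],      baseline_z[:, s]) for s in range(n_subjects)]
d_post  = [cohens_d(post_treatment_z[:, s], baseline_z[:, s]) for s in range(n_subjects)]
```

Comparing `d_treat` and `d_post` directly addresses the scientific question; unlike p-values, these estimates do not shrink simply because we collected more samples.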
We’ve started to see this in the plots of spindle activity at `baseline`, during `treatment`, and `post-treatment`.
For more analysis (e.g., different statistical tests and effect size estimates), continue on to the next sections.
5 - So, what went wrong?
In our initial analysis, we made a few common mistakes.
Mistake #1: Confusing p-values with effect size.
- A p-value indicates the amount of statistical evidence, not the effect size. In our analysis, we found small p-values when comparing the spindle rates at baseline versus post-treatment. However, the effect size was very small; the large number of observations increased our statistical evidence, allowing us to detect a small effect size.
Mistake #2: Comparing p-values instead of data.
- We found much smaller p-values post-treatment compared to during treatment (both computed versus the same baseline). Therefore, the effect is “stronger” post-treatment, right? WRONG! The p-values indicate that we have more statistical evidence of a difference post-treatment, not that the effect size is larger.
Mistake #3: Separate statistical tests for each subject.
- Our scientific question was, initially, focused on the impact of treatment on spindle rate at the population level. We’re not necessarily interested in this for each individual subject. [URI HELP!]
6 - One test to rule them all: an omnibus test.
(PENDING)
Do there exist subjects for which there is a significant effect? NO
- Concatenate data from all subjects (works if you believe everyone has an effect; see the sketch after this list).
- Show CI, provide associated p-values. 1a. Treatment (significant & meaningful @ population level) 1b. Post (significant & not meaningful @ population level)
- If not everyone has an effect, this dilutes the effect size / reduces power.
- Alternatively, if you believe not everyone has an effect, use a mixed-effects model.
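A minimal sketch of the “concatenate” idea in these notes, assuming the arrays from the earlier sketches (pooling assumes every subject shares the effect; if not, the pooled effect is diluted):

```python
# Sketch: pool samples across all subjects and run a single omnibus test.
# Caveat: pooling treats all samples as independent, ignoring within-subject correlation.
from scipy.stats import ttest_ind

pooled_treatment = treatment_z.ravel()  # all subjects' treatment samples
pooled_baseline  = baseline_z.ravel()   # all subjects' baseline samples

result = ttest_ind(pooled_treatment, pooled_baseline, equal_var=False)
print(f"omnibus t = {result.statistic:.2f}, p = {result.pvalue:.3g}")
```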
7 - Optional Section: LME
Estimate effect size and responders
In Intro: initial H is some people respond and some don’t
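A minimal sketch of the kind of linear mixed-effects (LME) model these notes describe, using `statsmodels`; the formula and long-format data layout are assumptions, not the course’s final analysis:

```python
# Sketch: a linear mixed-effects model with a fixed effect of condition and a
# per-subject random intercept. Adding a random slope on condition
# (re_formula="~condition") would allow subject-specific (responder vs.
# non-responder) treatment effects.
import pandas as pd
import statsmodels.formula.api as smf

# Long-format table: one row per sample, with subject and condition labels.
rows = []
for s in range(n_subjects):
    rows += [{"subject": s, "condition": "treatment", "activity": v} for v in treatment_z[:, s]]
    rows += [{"subject": s, "condition": "baseline",  "activity": v} for v in baseline_z[:, s]]
df = pd.DataFrame(rows)

model  = smf.mixedlm("activity ~ condition", df, groups=df["subject"])
result = model.fit()
print(result.summary())
```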
OMIT? 8 - Beyond p-values: estimate what you care about.
(PENDING): estimate effect size during & post, and compare.
9 - Summary
(PENDING)