Sample Size - How much data is enough for your experiment?
1 - Just Google it?
Based on the groundbreaking research previously conducted in your lab, you and your collaborators have formulated a compelling scientific hypothesis: substance \(x\) could be a genetic biomarker for longevity, potentially influencing the age at which individuals pass away. This intriguing hypothesis opens up a new frontier in our understanding of genetics and lifespan, promising significant advancements in the field.
Before we can embark on an experimental journey to test the predictive power of this novel biomarker, we must first tackle a critical step: determining the appropriate sample size for a follow-up research study. The sample size is not just a number; it is a cornerstone of experimental design that ensures our data will be robust enough to support or refute our hypothesis.
To accurately compute this sample size, we need to consider our prior beliefs and existing knowledge about substance \(x\) and its relationship to longevity. Let’s delve into the specifics. Imagine we have the following limited yet crucial pieces of information:
Distribution of Substance \(x\): The expression levels of substance \(x\) in people follow a normal distribution.
Impact on Longevity: Individuals at the high end of the expression spectrum tend to live approximately 5 years longer than those at the low end.
Given these insights, our task is to calculate a sample size that can yield statistically significant results. This endeavor will not only help us test our hypothesis with precision but also pave the way for future research that could revolutionize our understanding of genetic influences on lifespan. We will see that the sample size required to generate data that can support a scientific hypothesis depends directly on the prior beliefs and knowledge about that hypothesis. Let’s proceed with this vital calculation, knowing that the outcomes will bring us one step closer to potentially groundbreaking discoveries in genetic biomarkers and longevity.
- Don’t worry!
- The goal of this unit is to teach you to tackle this problem.
- Let’s first come up with any approach to compute a sample size, even if we’re not confident in the results.
A few possible places to start:
Take an educated guess: Perhaps you have taken part in or read about similar research before. What order of magnitude seems right for this sort of experiment?
Find a source: Sample size estimation is a common topic in introductory statistics textbooks. These often include formulas that students can use to compute sample size for specific categories of questions.
Google it: There are many web-based resources (including online calculators) that are designed to enable sample size calculations. Search engines provide a starting point for finding such resources. Doing so, you might end up at a website like this or like this.
Or, if you’d like to skip this step, we’ll suggest a sample size of 100.
Sample size calculations aren’t always easy or obvious, even for veteran researchers!
Here’s a good video of the challenge.
What types of values do you expect for each variable? What are their distributions, do you think?
How do you expect the variables to be related?
Try drawing a sketch of what you imagine a successful result might look like.
(Text) For each participant, we will collect expression levels of substance \(x\) and age at death.
(Text) I expect age at death to increase with \(x\).
(Multiple Choice) Show different plots of \(x\) versus age at death, and ask learner to select the plot most consistent with the hypothesis.
2- Underpowered experiments are doomed to failure.
Now that you’ve determined (or guessed) the sample size `N` for your experiment, let’s perform the experiment.
You collect `N` samples of data, so that you receive from each individual:
- `x` - a measure of the proposed biomarker for longevity
- `lifespan` - the individual’s age at death

Let’s start by defining our choice for `N`.
With your chosen sample size `N`, you do the experiment and collect both `x` and `lifespan` for each subject.
Let’s start by plotting the data.
- The data look like a random cloud of points.
- It’s very difficult to see the hypothesized relationship between \(x\) and lifespan.
Let’s assess the relationship between the biomarker `x` and `lifespan` beyond visual inspection.
There are many ways to do so.
Here, we’ll fit a line to the data and compute the slope.
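The course page includes an interactive code cell for this fit; if you’re reading a static copy, here’s a minimal sketch of the idea using `statsmodels`. The placeholder data below are illustrative only (the real `x` and `lifespan` come from the `N = 100` observations collected above), and the numeric values are assumptions chosen only to roughly resemble the results reported later in this unit.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder pilot data, for illustration only.
# In the course, x and lifespan are the N = 100 observations collected above.
N = 100
x = np.random.normal(loc=0.0, scale=0.7, size=N)              # biomarker expression levels
lifespan = 73 + 1.0 * x + np.random.normal(0.0, 7.0, size=N)  # age at death, in years

# Fit the line lifespan = b + m*x using the formula notation lifespan ~ 1 + x
df = pd.DataFrame({"x": x, "lifespan": lifespan})
model = smf.ols("lifespan ~ 1 + x", data=df).fit()

print(model.params["x"])    # slope estimate (m)
print(model.bse["x"])       # standard error of the slope
print(model.pvalues["x"])   # p-value for the slope
```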
- The general equation for a line is \(y = b + mx\), where \(b\) is the intercept and \(m\) is the slope.
- Here, we’re interested in the specific line `lifespan = b + m*x`.
- In the code above, we represent this equation with the notation `lifespan ~ 1 + x`. In this notation, we tell Python to estimate the outcome variable `lifespan` as a function of a constant (with label `1` in the code) and predictor `x`. Python then estimates the solution to `lifespan = b + m*x` by finding the best values for `b` (the intercept) and `m` (the slope).
- In the code above, we estimate the slope `m`, which characterizes the relationship between `lifespan` and `x`.
Now, with the line estimated, we can print the estimated slope, and its p-value.
Let’s interpret these numbers:
Slope estimate
Meaning: The slope estimate represents the change in `lifespan` for a one-unit change in the genetic biomarker `x`.
Interpretation: For every one-unit increase in the genetic biomarker `x`, the lifespan is estimated to increase by `slope_estimate` years, on average.
Standard error of slope estimate
Meaning: The standard error measures the average amount that the slope estimate varies from the true slope of the population regression line. It indicates the precision of the slope estimate.
Interpretation: A standard error of \(\sigma\), for example, suggests that the slope estimate typically differs from the true slope by about \(\sigma\) units, with a 95% confidence interval spanning roughly \(\pm 2\sigma\) around the estimate. In our case, the standard error is relatively large compared to the slope estimate. This implies there is a considerable amount of uncertainty in the estimate.
p-value
Meaning: The p-value is used to test the null hypothesis that the slope of the regression line is zero (no relationship between `x` and `lifespan`).
Interpretation: The p-value describes the probability of seeing an effect at least this large if substance \(x\) had no relation to lifespan. A p-value of 0.3, for example, is much larger than commonly used significance thresholds (e.g., 0.05). In this case, this means that there is not enough evidence to reject the null hypothesis. In other words, the data do not provide sufficient evidence to conclude that there is a statistically significant relationship between `x` and `lifespan`.
There’s much more to say about p-values. If you’re curious, check out this link.
Let’s also visualize the estimated line by plotting it with the data.
Do these results establish a significant relationship between `x` and `lifespan`?
- No. The results suggest that while there is a positive slope, indicating a potential relationship between the genetic biomarker `x` and `lifespan`, the high standard error and non-significant p-value imply that this relationship is not statistically significant. Further investigation with more data or additional variables may be needed to draw more definitive conclusions.
We’ve applied a standard approach to compute sample size `N` and performed the experiment using this sample size. We see a trend supporting the hypothesized relationship, but it’s not significant.
Why did we fail to detect a significant relationship?
What’s going on?
Choose your own adventure
Many alternatives exist to calculate sample size.
These include using our prior knowledge or, when available, using prior collected data.
Now, it’s your choice.
Choose a path to compute the sample size.
3A- I’ll resample the pilot data, Turn To Page 3A.
3B- I’ll use the pilot data to build a model, Turn To Page 3B.
3C- I’ll ignore the pilot data and use my prior knowledge, Turn To Page 3C.
3D- I’ll do nothing and stick with my current sample size choice, Turn to Page 3D.
3A- With resampling you can compute the sample size!
The data provided in Mini 2 represent one instantiation of the experiment, conducted with a sample size `N`.
Our analysis of these data did not yield evidence to support our hypothesis.
However, these data remain extremely useful for our continued investigation into sample size.
We can leverage these data to estimate the necessary sample size for a subsequent experiment.
We’ll implement a resampling procedure and systematically examine how variations in sample size `N` influence our capacity to detect a significant result.
Doing so will let us optimize our experimental design for future investigations.
Resampling procedure (Introduction)
We’re going to attempt something that seems far-fetched and magical:
- we’ll generate new data from our existing data.
To do so, we’ll implement a nonparametric bootstrap and generate pseudodata from our observed data.
Briefly, there is strong theoretical justification for the nonparametric bootstrap. The fundamental idea is that resampling the data with replacement is equivalent to sampling new pseudodata from the empirical cumulative distribution function (eCDF) of the observed data. For a large sample of independent, identically distributed random variables, the distribution of the pseudodata generated from the eCDF will be close to the true distribution of the data. Note the important caveat that the variables are independent, identically distributed; this assumption fails in many cases, such as for time series. Here, we assume that the genetic biomarker and lifespan from each subject are drawn independently from the same distribution (i.e., the values from a subject are independent, identically distributed variables).
Resampling procedure (4 steps)
Our resampling procedure consists of 4 steps:
- Choose a new sample size (call it `N_resampled`).
- Draw a new (random) set of `N_resampled` labels to index our data (biomarker \(x\) and lifespan).
- Use these indices to create new pseudodata: a resampled data set.
- Compute the relationship between the biomarker \(x\) and lifespan in our resampled data.
We’ll now describe each step. For a related example, see this video.
Resampling procedure: Step 1
Our first step is to choose a new sample size. Let’s call it `N_resampled`.

What value should we choose for `N_resampled`?
- Our original choice of sample size resulted in a positive slope estimate, indicating a potential relationship between the genetic biomarker `x` and `lifespan`. But the high standard error and non-significant p-value imply that this relationship is not statistically significant. Further investigation with more data may support more definitive conclusions.

To start, we’ll fix `N_resampled = N`, the original sample size.
In what follows, we’ll adjust this value and examine the impact.
Thus concludes Step 1 of our resampling procedure.
Resampling procedure: Step 2
Our second step is to draw a random set of `N_resampled` labels to index our data (biomarker `x` and `lifespan`).
To visualize this procedure, imagine we assign each patient in the original data set a number, from \(0\) up to \(N-1\). We then write each number on a marble and place all \(N\) marbles in an opaque bag, so that each marble carries a unique integer from \(0\) to \(N-1\).
Now, reach your hand into the bag, grab a marble, record its number, and replace the marble in the bag. We assume that each marble is equally likely to be selected at each draw (i.e., there are no special features that allow some marbles to be drawn more often). Repeat this procedure `N_resampled` times to create a list of `N_resampled` integers. Notice that after recording the drawn marble’s number, we replace it in the bag. So, we could potentially draw the same marble `N_resampled` times, although that’s extremely unlikely.
Performing this sampling with replacement procedure by hand would, of course, be extremely time consuming (e.g., who will paint integers on each marble?). Fortunately, this is the type of boring task where a computer excels.
Let’s have a look at this “marble draw”:
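If you’d like to see this step as code, here’s a minimal sketch (assuming NumPy and the sample size `N` from the original data; the course cell may differ):

```python
import numpy as np

N_resampled = N  # start with the original sample size

# The "marble draw": sample N_resampled indices from 0..N-1, with replacement
ind = np.random.randint(low=0, high=N, size=N_resampled)
print(ind)
```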
Examine the values in `ind`. What do they mean?
- There are `N_resampled` values in the vector `ind`. That’s because we draw `N_resampled` marbles.
- These are the indices to our original data set. You can think of these as numbers indicating participants in the study (e.g., Participant 1, Participant 10, Participant 102, …)
Run the code to draw `ind` again. What do you find? (i.e., is it the same or different than the first time?)
- Because we draw random sets of indices, the values in `ind` will differ each time we run the code.
We can now generate a set of random indices to our data.
We’ll use these to create pseudodata in the next step.
Thus concludes Step 2 of our resampling procedure.
Resampling procedure: Step 3
Our third step is to use these indices to generate the resampled data.
To do so, we’ll draw data from the study participants using the indices in `ind`.
Again, this is the type of boring task that computers love:
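A minimal sketch of this step, assuming the index vector `ind` from Step 2 and the original arrays `x` and `lifespan`:

```python
# Index the original data with the resampled indices to build the pseudodata
x_resampled = x[ind]
lifespan_resampled = lifespan[ind]
```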
How many values are in `x_resampled` and `lifespan_resampled`?
- There are `N_resampled` values in the resampled data. That’s because we’re using the vector `ind` to resample the data, and we drew `N_resampled` marbles.
Let’s see what those values look like, compared to our original data set.
Compare the original data (`x` and `lifespan`) with the pseudodata (`x_resampled` and `lifespan_resampled`). What do you observe? Do the pseudodata “look like” the original data?
- The pseudodata overlap the original data. That makes sense because we draw the pseudodata from the original data.
- By chance, we draw some of the original data multiple times, and other original data not at all.
We can now generate pseudodata from our original data.
We’ll assess the relationship between the resampled biomarker and resampled lifespan in the next step.
Thus concludes Step 3 of our resampling procedure.
Resampling procedure: Step 4
Our fourth step is to compute the relationship (and its statistical significance) between the resampled biomarker \(x\) and resampled lifespan.
To do so, we’ll follow the exact same approach as above.
We’ll fit a line to new resampled data and again compute the slope and significance.
- The slope estimates are similar (near 1).
- The standard error estimates are similar (near 1).
Above, we fixed `N_resampled = N`, the original sample size. Change `N_resampled` and repeat Resampling Steps 2, 3, and 4 to generate results from multiple “experiments”. Do you ever find a significant result? How often do the p-values you find reach your desired level of statistical significance? How does this depend on the value `N_resampled`?
- Yes, we can sometimes find p<0.05 in the resampled data when `N_resampled` is large (e.g., 1000).
We’ve now marched through the entire resampling procedure.
As a final step, we’ll use this resampling approach to estimate the statistical power of our original experiment and a good sample size for increased power.
Thus concludes Step 4 of our resampling procedure.
Putting it all together
We’ll now use resampling to determine a good sample size for our experiment.
To do so, let’s first review the concept of statistical power.
In the context of statistical analysis, power and sample size are interrelated concepts.
Statistical power is the probability that a test will correctly reject a false null hypothesis (i.e., detect an effect if there is one). Higher power reduces the risk of a Type II error, where a real effect is missed (failing to reject a false null hypothesis).
Our initial challenge was to compute the sample size: the number of observations or data points included in a study. Using our initial choice `N`, we failed to detect a significant relationship between biomarker `x` and lifespan. This failure may occur because:
- There is no relationship between biomarker `x` and lifespan.
- There is a relationship between biomarker `x` and lifespan, but we did not collect enough data to detect it.
We’d like to rule out scenario #2. To do so, we need to understand the statistical power of our original experiment.
Let’s now use our resampled data to compute the statistical power for our original sample size choice `N`.
We’ll do so in a few steps:
Resampling procedure to estimate the statistical power (6 steps)
- Choose a sample size (call it `N_resampled`).
- Draw a new (random) set of `N_resampled` labels to index our data.
- Use these indices to create new pseudodata: a resampled data set.
- Perform the statistical test of interest (i.e., compute a p-value).
- Repeat Steps 1-4 `K` times, saving the p-value each time.
- The statistical power is the proportion of p-values below a chosen threshold `alpha`.
That’s a lot of steps! Let’s break them down:
Resampling procedure: Steps 1-4
You’ve already done steps 1-4 when performing resampling to create pseudodata! Nothing new to see here.
Resampling procedure: Step 5
We’ve added Step 5, in which we create `K` new instances of the pseudodata. For each instance, we calculate and save the p-value corresponding to the statistical significance of the relationship between the biomarker \(x\) and lifespan in our resampled data.
At the end of Step 5, we’ll have created a vector of `K` p-values. Let’s do so now:
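Here’s one way this loop might look; a sketch that reuses the names from the steps above (the value of `K` is an assumption):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

K = 1000                 # number of resampled "experiments" (assumed value)
N_resampled = N          # sample size for each resample
p_values = np.zeros(K)

for k in range(K):
    ind = np.random.randint(0, N, size=N_resampled)       # Step 2: draw indices
    df_k = pd.DataFrame({"x": x[ind],                     # Step 3: build the pseudodata
                         "lifespan": lifespan[ind]})
    fit = smf.ols("lifespan ~ 1 + x", data=df_k).fit()    # Step 4: fit the line
    p_values[k] = fit.pvalues["x"]                        # Step 5: save the p-value
```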
Let’s investigate this list of p-values by plotting a histogram:
- P-values extend from near 0 to near 1.
- P-values are slightly more concentrated near 0, but extend to cover the entire range.
Resampling procedure: Step 6
Our last step to compute the statistical power is to determine the proportion of p-values below a chosen threshold `alpha`.
The threshold `alpha` is the probability of rejecting the null hypothesis when it is actually true. It’s conventional to set `alpha = 0.05`, which means that there is a 5% chance of committing a Type I error, which is the error of incorrectly rejecting a true null hypothesis. This value is not inherently magical or optimal in all circumstances. But it has become a convention primarily because it offers a middle ground that has been deemed acceptable by the scientific community for controlling Type I errors.
To implement Step 6, let’s compute the `statistical_power` as the proportion of times that `p_values` is less than the threshold `alpha`.
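A sketch of this computation, assuming the `p_values` vector created in Step 5:

```python
alpha = 0.05
statistical_power = np.mean(p_values < alpha)   # proportion of p-values below alpha
print(statistical_power)
```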
Examine the value of `statistical_power`. What does it mean?
- This value represents the proportion of times we drew pseudodata and detected a significant relationship between the biomarker `x` and lifespan.
The value in `statistical_power` is the statistical power of our test. It represents the proportion of times we reject the null hypothesis and declare a significant relationship between the biomarker `x` and lifespan.
To make this graphically explicit, let’s replot the histogram of p-values with a line at our threshold `alpha`.
In this plot, the statistical power is the proportion of values to the left of (i.e., smaller than) the red line.
And that’s it!
The statistical power is not a mystical quantity. It’s the probability that a test will correctly reject a false null hypothesis. And, using the data we collected, we can compute this statistical power for different choices of sample size (`N_resampled`).
For our initial choice of sample size `N`, we find a statistical power less than 0.15.
Conclusion: our original study was severely underpowered.
Using the resampling procedure, we conclude that our initial sample size `N` results in a statistical power less than 0.15.
This means that, if the effect (a relationship between biomarker `x` and lifespan) truly exists, we have less than a 15% chance of detecting it.
At this sample size, our study is severely underpowered. The statistical power is well below the recommended threshold of 0.8.
There is a high false negative risk: there is at least an 85% chance that our study will fail to detect a true effect (Type II error).
Conclusion
Our original study found no statistically significant result. However, the statistical power of our original study was low.
Therefore, we should not conclude that there is no effect - our test is likely too weak to detect one (if it exists).
Design a future experiment with enough statistical power.
We would like to perform a future experiment with enough statistical power.
In other words, we’d like our future experiment to protect against failing to detect a true effect (i.e., protect against Type II error).
It’s common to design experiments with a statistical power of 0.8. In that case, there’s a 20% chance of failing to detect a true effect (i.e., of a Type II error); that’s a big improvement compared to our original experiment, which had at least an 85% chance of failing to detect a true effect.
Our goal, therefore, is to design a future experiment with a large enough sample size `N` to have 80% power.
So, what is the smallest sample size `N` we can choose to have 80% power?
To answer this, let’s use our resampling procedure with different choices of `N_resampled`.
Run the code below with different values of `N_resampled`.
What is the smallest value of `N_resampled` that provides 80% power?
- Larger sample sizes can escalate the costs and logistical complexity of a study.
- The choice of 0.8 is considered a good trade-off between increasing precision and controlling operational constraints.
At what value of `N_resampled` does the statistical power approximately equal 0.80?
- At approximately `N_resampled` = 1000, the statistical power equals 0.80.
To answer the question above, and determine the `N_resampled` that gives approximately 80% power, you probably ran the same code over and over again, with different values of `N_resampled`.
That’s a fine approach.
In doing so, you may have gotten a sense of how the statistical power varies with `N_resampled`.
That leads to our next question:
What happens to the statistical power as `N_resampled` increases?
To answer this question, let’s compute the power at different choices of `N_resampled` and plot the results.
To do so, we’ll set `K=100` so the code runs in your browser. The plot will be jagged, but you’ll get the idea.
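A sketch of this sweep over sample sizes, reusing the pilot-data arrays `x` and `lifespan` and the original sample size `N` from above (the candidate values of `N_resampled` are assumptions for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

alpha = 0.05
K = 100                                              # small K so the code runs quickly
sample_sizes = [25, 50, 100, 250, 500, 1000, 2000]   # assumed candidate sample sizes
powers = []

for N_resampled in sample_sizes:
    p_values = np.zeros(K)
    for k in range(K):
        ind = np.random.randint(0, N, size=N_resampled)
        df_k = pd.DataFrame({"x": x[ind], "lifespan": lifespan[ind]})
        p_values[k] = smf.ols("lifespan ~ 1 + x", data=df_k).fit().pvalues["x"]
    powers.append(np.mean(p_values < alpha))

plt.plot(sample_sizes, powers, marker="o")
plt.axhline(0.8, color="red", linestyle="--")   # target power
plt.xlabel("N_resampled")
plt.ylabel("Estimated statistical power")
plt.show()
```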
We find that the statistical power increases with `N_resampled`.
Larger samples provide more information about the population, leading to more precise estimates of the population parameters. This precision reduces the standard error, making a true effect easier to distinguish from the null hypothesis of no effect. Therefore, increasing the sample size typically increases the power of a statistical test.
Choosing a statistical power of 0.8, or 80%, is a common convention in many fields of research, particularly in the social and biomedical sciences.
Statistical power is the probability of correctly rejecting a false null hypothesis, thus avoiding a Type II error. A power of 0.8 means there is a 20% chance of a Type II error (failing to detect a true effect). Setting the power at 0.8 provides a reasonable balance between the risks of Type I errors (false positives) and Type II errors (false negatives). Researchers often choose a 5% (`alpha = 0.05`) significance level for Type I errors, aiming to maintain a pragmatic yet cautious approach to declaring findings.
Increasing power beyond 0.8 generally requires larger sample sizes, which can escalate the costs and logistical complexity of a study. The choice of 0.8 is considered a good trade-off between increasing precision and controlling operational constraints.
The 0.8 level has become somewhat of a standard through historical precedent and its endorsement in statistical texts and guidelines. Researchers often follow these conventions to align with accepted practices, making their studies comparable to others in the field.
A loophole in the statistical power: increase `alpha`?
Increasing the sample size is one way to increase the statistical power.
However, other approaches exist.
For example, what if we increase `alpha`, the Type I error rate?
What happens if we increase `alpha` from 0.05 (the standard value) to 0.1?
Increasing the significance level `alpha` from 0.05 to 0.1 means we’re more willing to reject the null hypothesis. We’ll accept a higher risk of rejecting the null hypothesis when it’s actually true (i.e., a higher Type I error). At this looser threshold for significance, it’s easier to declare a result significant. But the risk of false positives increases - more results will appear significant, but more could be false.
To study the impact of increasing `alpha` on the statistical power, let’s compute it.
We’ll do so using a sample size `N_resampled = 100`. (Wait, is that a good choice? Let’s see …)
We’ll again set `K=100` so the code runs in your browser. The plot will be jagged, but you’ll get the idea.
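A sketch of this computation, reusing the pilot-data arrays `x`, `lifespan`, and `N` from above, and sweeping `alpha` at the fixed sample size:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

K = 100
N_resampled = 100
p_values = np.zeros(K)
for k in range(K):
    ind = np.random.randint(0, N, size=N_resampled)
    df_k = pd.DataFrame({"x": x[ind], "lifespan": lifespan[ind]})
    p_values[k] = smf.ols("lifespan ~ 1 + x", data=df_k).fit().pvalues["x"]

# Estimated power as a function of the significance threshold alpha
alphas = np.linspace(0.01, 0.99, 99)
power_vs_alpha = [np.mean(p_values < a) for a in alphas]

plt.plot(alphas, power_vs_alpha)
plt.axhline(0.8, color="red", linestyle="--")
plt.xlabel("alpha")
plt.ylabel("Estimated statistical power")
plt.show()
```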
We find that, at the low sample size `N_resampled = 100`, we can still achieve 80% power if we accept `alpha` \(\approx\) 0.75.
Did we find a loophole? Can we get the statistical power we want (0.8) at a low sample size?
No, we did not find a loophole.
What we’ve discovered is a fundamental tradeoff in hypothesis testing: you can increase statistical power by increasing `alpha`, but at a cost.
Statistical power depends on sample size and the significance threshold (`alpha`).
By setting `alpha = 0.75`, we’re saying:
- I’m willing to reject the null hypothesis 75% of the time even when it is true.
This results in:
- Very high false positive rate (Type I error = 75%)
- Artificially inflated power (since it’s now very easy to reject the null)
- A test that’s statistically meaningless by conventional standards
It’s like lowering the bar so much that everyone passes the test. Sure, your “pass rate” (power) goes up—but the test no longer distinguishes between those who know the material and those who don’t.
So, did we find a loophole?
No! We can always increase power by increasing `alpha`, but that destroys the credibility of our findings. At `alpha = 0.75`, there is a very high probability that any “significant” result is a false positive.
Resampling is a universal way to calculate power and sample size, but only works if the resampled data captures the effect of interest.
When preliminary data exist and these data accurately represent the effect of interest, then resampling is a powerful and universal approach to calculate power and sample size.
In this case, our preliminary data (see Mini 2) do capture the expected effect of interest - a weak positive relationship between biomarker `x` and longevity - although when the sample size `N` is too small, we fail to detect a significant relationship between these features.
However, you might imagine an alternative scenario. What if our preliminary data revealed a negative relationship between biomarker `x` and longevity? That could happen, for example, if the observations were noisy or the initial sample size `N` was small. In that case, the resampled data will not capture the effect of interest; resampling these data to calculate power and sample size is a poor choice. Instead, you might pursue alternative approaches, such as building models without the data.
In this example, we were lucky that the initial draw of a small sample size produced the expected effect. An unlucky sample may have produced (by chance) an opposite effect. In that case, resampling will not produce meaningful power/sample size results. Preliminary data are often valuable for future experimental design, but it’s important to consider how variability in a small, preliminary dataset can influence power and sample size estimates.
Turn to Page 4 Summary
3B- Build models with the data to compute the sample size!
The data provided in Mini 2 represent one instantiation of the experiment, conducted with a sample size `N`. While our analysis of these data did not yield evidence to support our hypothesis, they remain extremely useful for our continued investigation into sample size. Specifically, we can leverage these data to estimate the necessary sample size for a subsequent experiment. By estimating models of these data, we will systematically examine how variations in sample size `N` influence our capacity to detect a significant result, thus optimizing our experimental design for future investigations.
Modeling procedure (Introduction)
We’re going to attempt something that seems far-fetched and magical: we’ll generate new data from our existing data. To do so, we’ll estimate models from the existing data, and use those models to simulate new synthetic data.
Modeling procedure (4 steps)
Our modeling procedure consists of 4 steps:
- Estimate a model for biomarker `x`.
- Estimate a model of the relationship between biomarker `x` and lifespan.
- Choose a new sample size (call it `N_modeled`) and simulate data from the models.
- Compute the relationship (and its statistical significance) between the simulated biomarker \(x\) and lifespan.
We’ll now describe each step.
Modeling procedure: Step 1
Our first step is to estimate a model for biomarker `x`.
To do so, let’s begin by visualizing the biomarker `x` in a histogram:
- The histogram has most values near 0, with fewer values near +/- 2.
- It’s approximately bell-shaped.
- It looks “normal” or “Gaussian”.
We conclude that the values for biomarker `x` look approximately normally distributed.
That’s very useful, because it suggests we can model the data as normally distributed.
Normal distributions are nice because we only need to estimate two parameters: the mean and standard deviation.
Let’s compute those values in Python:
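A minimal sketch, assuming the pilot biomarker values are in a NumPy array `x`:

```python
import numpy as np

mean_x = np.mean(x)   # sample mean of the biomarker
std_x = np.std(x)     # sample standard deviation of the biomarker
print(mean_x, std_x)
```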
Examine the values of `mean_x` and `std_x`. How do they compare to the histogram of biomarker `x` plotted above?
- The mean of the distribution for biomarker `x` is very close to zero. This suggests that, on average, the values of the biomarker center around zero, consistent with our visualization of the data in the histogram.
- The standard deviation measures the dispersion or spread of the data points around the mean. A standard deviation of 0.71 indicates that the typical deviation from the mean biomarker value is about 0.71 units. For a normal distribution, we expect about 68% of the data to fall within one standard deviation of the mean (i.e., between -0.68 and 0.74), and about 95% of the data to fall within two standard deviations (i.e., between -1.39 and 1.45). Those ranges appear consistent with our visualization of the data in the histogram.
Now, with these two parameters estimated, we’ve completely specified our model of biomarker `x`.
In words, we’ll model `x` as normally distributed with mean `mean_x` = 0.03 and standard deviation `std_x` = 0.71.
In Python, it’s easy to simulate values from this model.
Let’s do so, and simulate 100 values from the model.
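For example (a sketch using the estimates above):

```python
# Simulate 100 draws from the fitted normal model of the biomarker
x_modeled = np.random.normal(loc=mean_x, scale=std_x, size=100)
print(x_modeled)
```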
Compare the simulated values `x_modeled` to the original values `x`. Do the values appear consistent?
- Yes. In `x_modeled` there are many values near 0, and few values near +/- 2.
Let’s also plot histograms for the original data `x` and the modeled data `x_modeled`.
Compare the histograms of the modeled values `x_modeled` and the original values `x`. Do the histograms appear consistent?
- Yes. In `x_modeled` there are many values near 0, and few values near +/- 2.
Thus concludes Step 1 of our modeling procedure.
We’ve modeled the biomarker `x` as normally distributed, with mean and standard deviation estimated from our observed data.
Modeling procedure: Step 2
Our second step is to estimate a model of the relationship between biomarker \(x\) and lifespan.
Thankfully, we’ve already completed this step!
In Mini 2, we modeled the relationship between biomarker \(x\) and lifespan as a line.
Here’s that code again:
As we discussed in Mini 2, the model consists of two parameters: the `slope` and `intercept`.
Let’s define and print those:
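For example, assuming the fitted results object is named `model`, as in Mini 2:

```python
intercept = model.params["Intercept"]   # estimated intercept (b)
slope = model.params["x"]               # estimated slope (m)
print(intercept, slope)
```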
Do those values make sense?
Let’s check by plotting the data with our estimated line:
Do the values of `slope` and `intercept` make sense?
- The intercept is the value of `lifespan` when `x=0`. Looking at the plot, this occurs near 73 years, consistent with the value of the intercept.
- The line has a small upward tilt. This indicates the lifespan increases a tiny bit for each unit increase in biomarker \(x\), consistent with the slope of 0.91.
To simulate the model of lifespan, we’ll need to extract one more parameter from the estimated model: the dispersion.
The dispersion parameter describes the amount of uncertainty in our ability to predict each data point. In this case, it is the residual standard deviation of the lifespan after we have tried to predict it using the expression level of substance \(x\).
Let’s get the dispersion from the estimated model:
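A sketch of this extraction, assuming the fitted results object `model` from above:

```python
import numpy as np

# Residual standard deviation: square root of the residual variance (model.scale)
dispersion = np.sqrt(model.scale)
print(dispersion)
```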
We calculate the dispersion parameter as `np.sqrt(model.scale)` from the fitted ordinary least squares (OLS) model. In the context of an OLS model, the `model.scale` attribute reflects the variance of the residuals (errors), and taking the square root of this variance gives the standard deviation.
In general the dispersion (or standard deviation) tells us how much the residuals (differences between observed and predicted values) are spread out around the mean of the residuals. A higher value indicates a greater spread, suggesting more variability in the errors.
In this case, a dispersion value of 7.22 means that, on average, the actual `lifespan` values deviate from the values predicted by our model by about 7.22 units. This indicates an average error magnitude of approximately 7 years from the predicted lifespan based on our model.
- In general, we do not expect the dispersion parameter to be 0. Our model represents a simplification of the data: we’re using a simple line to capture the relationship between biomarker \(x\) and lifespan. We do not expect this line to capture every nuance of the relationship; these are complicated biological entities with complex relationships. So, we’re happy with a non-zero dispersion.
Thus concludes Step 2 of our modeling procedure.
With the 3 estimated model parameters, we can now simulate values of lifespan from the model. We’ll do so in the next step.
Modeling procedure: Step 3
Our next step is to choose a new sample size (call it `N_modeled`) and simulate data from our models.
With our models estimated from the original data, we can now simulate realizations of the models.
To do so, we’ll evaluate these models:
x_modeled = np.random.normal(loc=mean_x, scale=std_x, size=N_modeled)
lifespan_modeled = intercept + slope * x_modeled + np.random.normal(loc=0.0, scale=dispersion, size=N_modeled)
- What variables do you recognize?
- What variables are new?
- What is each equation doing?

The first equation should look familiar … it’s our model of biomarker `x`, described in Step 1.
The second equation might also look familiar … it’s the linear model we originally estimated from the data, with an added random noise component to simulate data.
Here’s a breakdown of each term in the second equation:
- `lifespan_modeled`: This is the dependent variable in the equation, representing the predicted values of lifespan based on the model. This variable is being assigned the values calculated by the formula on the right-hand side.
- `intercept`: This is the intercept of the linear model. It represents the value of the dependent variable (`lifespan_modeled`) when the independent variable (`x_modeled`) is zero.
- `slope`: This term is the coefficient of the independent variable `x_modeled` in the linear model. It measures the amount by which `lifespan_modeled` is expected to increase for a one-unit increase in `x_modeled`. It represents the steepness or incline of the regression line.
- `x_modeled`: This is the independent variable or predictor in the model. Here we use the simulated biomarker values to predict `lifespan_modeled`. We assume the relationship between `x_modeled` and `lifespan_modeled` is linear in this model.
- `np.random.normal(loc=0.0, scale=dispersion, size=N_modeled)`: This function generates random noise added to the linear model, simulating variability in the data that is not explained by the biomarker alone:
  - `loc=0.0`: This specifies the mean of the normal distribution from which the random noise is drawn. A mean of 0 indicates that the noise is centered around zero, adding no systematic bias to the predictions, just variability.
  - `scale=dispersion`: This is the standard deviation of the normal distribution. It controls the variability of the noise added to the model. The term `dispersion` here is what we calculated above, representing the standard deviation of the residuals in the linear model.
  - `size=N_modeled`: This specifies the shape of the array of random values generated. `N_modeled` is the number of observations for which we are simulating `lifespan_modeled`; the output is a length-`N_modeled` array, where each element is a random noise value added to the corresponding model prediction.
With the estimated models in hand, it’s now simple to simulate data from the models.
To do so, we must choose a value for `N_modeled`.
Let’s start by setting `N_modeled = N`, our original sample size, and then simulate the models.
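A sketch of this simulation, using the parameters estimated above:

```python
N_modeled = N   # start with the original sample size

x_modeled = np.random.normal(loc=mean_x, scale=std_x, size=N_modeled)
lifespan_modeled = (intercept
                    + slope * x_modeled
                    + np.random.normal(loc=0.0, scale=dispersion, size=N_modeled))
```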
How many values are in `x_modeled` and `lifespan_modeled`?
- There are `N_modeled` values in the modeled data. That’s because we simulated `N_modeled` values from each model.
Let’s see what those modeled values look like, compared to our original data set.
Compare the original data (`x` and `lifespan`) with the modeled data (`x_modeled` and `lifespan_modeled`). What do you observe? Do the modeled data “look like” the original data?
- The modeled data overlap the original data. The two sets of data have approximately the same range of biomarker values (from -2 to 2) and lifespan values (from 55 to 90 years).
Thus concludes Step 3 of our modeling procedure.
We can now simulate data from our models. We’ll assess the relationship between these simulated values in the next step.
Modeling procedure: Step 4
Our fourth step is to compute the relationship (and its statistical significance) between the modeled biomarker \(x\) and modeled lifespan.
To do so, we’ll follow the same approach as above. We’ll fit the same line to new modeled data, and again compute the slope and significance.
- The slope estimates are similar (near 1).
- The standard error estimates are similar (also near 1).
Above, we set `N_modeled = N`, the original sample size. Change `N_modeled` and repeat Modeling Steps 3 and 4 to generate results from multiple “experiments”. Do you ever find a significant result? How often do the p-values you find reach your desired level of statistical significance? How does this depend on the value `N_modeled`?
- Yes, now we can sometimes find p<0.05 in the modeled data when `N_modeled` is large (e.g., 1000).
Thus concludes Step 4 of our modeling procedure.
We’ve now marched through the entire modeling procedure.
As a final step of this procedure, we’ll use this modeling approach to estimate the statistical power of our original experiment and a good sample size for increased power.
Now, let’s use this modeling approach to determine a good sample size for our experiment.
To do so, we’ll first introduce the concept of statistical power.
In the context of statistical analysis, power and sample size are closely interrelated concepts.
Statistical Power is the probability that a test will correctly reject a false null hypothesis (i.e., detect an effect if there is one). Higher power reduces the risk of a Type II error, where a real effect is missed (failing to reject a false null hypothesis).
Our initial challenge was to compute the sample size: the number of observations or data points included in a study. Our initial choice `N` was too small; with this choice, we did not detect a significant relationship between the biomarker `x` and lifespan, i.e., we did not have enough statistical power.
Using our modeling approach, we can generate simulated data with an increased sample size `N_modeled`. Doing so in the exercise above, you might have found (sometimes) a significant relationship between the simulated biomarker `x` and simulated lifespan; you might have (sometimes) found p<0.05, the arbitrary magical threshold often used to declare a significant effect. If you’d like to understand this magic, check out [LINKS TO OTHER METERS].
We can use this same modeling procedure to compute the statistical power of our test given the sample size. We’ll do so in a few steps:
Modeling procedure to estimate the statistical power (6 steps)
- Estimate a model for biomarker `x`.
- Estimate a model of the relationship between biomarker `x` and lifespan.
- Choose a new sample size (call it `N_modeled`) and simulate data from the models.
- Compute the relationship (and its statistical significance) between the simulated biomarker \(x\) and lifespan.
- Repeat Steps 1-4 `K` times, saving the p-value each time.
- The statistical power is the proportion of p-values below a chosen threshold `alpha`.
That’s a lot of steps! Let’s break them down:
Modeling procedure: Steps 1-4
You’ve already done Steps 1-4 when creating the models to generate simulated data. Nothing new to see here.
Modeling procedure: Step 5
We’ve added Step 5, in which we create `K` new instances of the modeled data. For each instance, we calculate and save the p-value corresponding to the statistical significance of the relationship between the biomarker \(x\) and lifespan in our modeled data.
At the end of Step 5, we’ll have created a vector of `K` p-values. Let’s do so now, with `N_modeled = N`.
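Here’s a sketch of what this loop might look like, assuming the model parameters (`mean_x`, `std_x`, `intercept`, `slope`, `dispersion`) estimated above and the original sample size `N`:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

K = 1000                 # number of simulated "experiments" (assumed value)
N_modeled = N            # sample size for each simulated experiment
p_values = np.zeros(K)

for k in range(K):
    x_sim = np.random.normal(loc=mean_x, scale=std_x, size=N_modeled)
    lifespan_sim = intercept + slope * x_sim + np.random.normal(0.0, dispersion, size=N_modeled)
    df_k = pd.DataFrame({"x": x_sim, "lifespan": lifespan_sim})
    p_values[k] = smf.ols("lifespan ~ 1 + x", data=df_k).fit().pvalues["x"]
```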
Let’s investigate this list of p-values by plotting a histogram:
- P-values extend from near 0 to near 1.
- P-values are slightly more concentrated near 0, but extend to cover the entire range.
Modeling procedure: Step 6
Our last step to compute the statistical power is to determine the proportion of p-values below a chosen threshold `alpha`.
The threshold `alpha` is the probability of rejecting the null hypothesis when it is actually true. It’s conventional to set `alpha = 0.05`, which means that there is a 5% chance of committing a Type I error, which is the error of incorrectly rejecting a true null hypothesis. This value is not inherently magical or optimal in all circumstances. But it has become a convention primarily because it offers a middle ground that has been deemed acceptable by the scientific community for controlling Type I errors.
To implement Step 6, let’s compute the `statistical_power` as the proportion of times that `p_values` is less than the threshold `alpha`.
Examine the value of `statistical_power`. What does it mean?
- This value represents the proportion of times we created simulated data and detected a significant relationship between the biomarker `x` and lifespan.
The value in `statistical_power` is the statistical power of our test. It represents the proportion of times we reject the null hypothesis and declare a significant relationship between the biomarker `x` and lifespan.
To make this graphically explicit, let’s replot the histogram of p-values with a line at our threshold `alpha`.
In this plot, the statistical power is the proportion of values to the left of (i.e., smaller than) the red line.
And that’s it!
The statistical power is not a mystical quantity. It’s the probability that a test will correctly reject a false null hypothesis.
And, using the data we collected, we can compute how the statistical power depends on sample size by changing the value of `N_modeled`.
To that end, let’s collect all the code, and perform one more experiment:
Set `N_modeled = N` in the code above. What is the statistical power? Does this make sense with our original conclusion in Mini 2?
- Using our original sample size (N=100), the statistical power is small, near 0.15.
- Therefore, with this sample size, we do not expect enough power to detect a significant effect.
- This is consistent with the results in Mini 2, in which we failed to detect a significant effect.
What value of `N_modeled` produces statistical power equal to 0.80?
- At approximately `N_modeled` = 1000, the statistical power equals 0.80.
Choosing a statistical power of 0.8, or 80%, is a common convention in many fields of research, particularly in the social and biomedical sciences.
Statistical power is the probability of correctly rejecting a false null hypothesis, thus avoiding a Type II error. A power of 0.8 means there is a 20% chance of a Type II error (failing to detect a true effect). Setting the power at 0.8 provides a reasonable balance between the risks of Type I errors (false positives) and Type II errors (false negatives). Researchers often choose a 5% (`alpha = 0.05`) significance level for Type I errors, aiming to maintain a pragmatic yet cautious approach to declaring findings.
Increasing power beyond 0.8 generally requires larger sample sizes, which can escalate the costs and logistical complexity of a study. The choice of 0.8 is considered a good trade-off between increasing precision and controlling operational constraints.
The 0.8 level has become somewhat of a standard through historical precedent and its endorsement in statistical texts and guidelines. Researchers often follow these conventions to align with accepted practices, making their studies comparable to others in the field.
When preliminary data exist and these data accurately represent the effect of interest, then modeling these data is a powerful and universal approach to calculate power and sample size.
In this case, our preliminary data (see Mini 2) do capture the expected effect of interest - a weak positive relationship between biomarker `x` and longevity - although when the sample size `N` is too small, we fail to detect a significant relationship between these features.
However, you might imagine an alternative scenario. What if our preliminary data revealed a negative relationship between biomarker `x` and longevity? That could happen, for example, if the observations were noisy or the initial sample size `N` was small. In that case, the modeled data will not capture the effect of interest; modeling these data to calculate power and sample size is a poor choice. Instead, you might pursue alternative approaches, such as building models without the data.
In this example, we were lucky that the initial draw of a small sample size produced the expected effect. An unlucky sample may have produced (by chance) an opposite effect. In that case, modeling these data will not produce meaningful power/sample size results. Preliminary data are often valuable for future experimental design, but it’s important to consider how variability in a small, preliminary dataset can influence power and sample size estimates.
Turn to Page 4 Summary.
3C- Build models without the data to compute the sample size!
- The previous analysis created surrogate data by resampling from a pilot study.
- This assumed that pilot data was available and that this pilot data reflects the real effect and variability we expect to see in a full experiment.
- If pilot data is not available or doesn’t reflect what we expect, another way to generate surrogate data is to simulate it directly.
In the alternative paths, we tried to use pilot data from Mini 2 to learn enough about the properties of our experiment to compute the power of our proposed analysis for different sample sizes. For this alternate approach, we will ignore everything from the pilot data. We will still generate surrogate data and compute statistical power for experiments with different sample sizes, but instead of using pilot data, we will use our prior knowledge and beliefs about the experiment to generate that surrogate data.
Let’s begin by trying to quantify our beliefs about the data:
While you chose a specific guess for the mean lifespan here, you will later investigate how changing this value affects the statistical power.
Next we need a guess for the standard deviation for the lifespan of an individual in this population.
You might be able to guess based on your knowledge of human lifespans and standard deviations.
For example, you could guess a range of lifespans that covers 95% of humans and divide the difference between the top and bottom of that range by 4 (e.g., if you guess that 95% of people die between ages 55 and 95, the guessed standard deviation is (95 - 55)/4 = 10 years). This reflects the idea that human lifespans follow an approximately normal distribution and that 95% of the data from a normal distribution falls within +/- 2 standard deviations of the mean.
While you chose a specific guess for the standard deviation here, you will later investigate how changing this value affects the statistical power.
Finally, we need to decide what effect we expect an increase in substance X to have on lifespan in our experiment.
Before we started the experiment we stated a scientific hypothesis that people in the highest range of values for substance X tend to live an average of 5 years longer than people in the lowest range.
The difference between the effect we expect when the null hypothesis is true and when our scientific hypothesis is true is called the effect size.
We can translate our statement of the scientific hypothesis in words to a mathematical statement of our hypothesis in terms of the analysis we plan to conduct in our experiment.
Specifically, we plan to compute a linear regression between the expression level of substance X and a person’s lifespan.
If we know that substance X is normally distributed in the population and we say that a person in the lowest range of values is in the 5th percentile and a person in the highest range is in the 95th percentile, then we would want the expected lifespan to increase by 5 years as the z-scored value of substance X goes from approximately -2 to +2.
So, under our scientific hypothesis, the slope would be a 5/4-year increase in lifespan per unit increase in the z-score of substance X.
We can generate surrogate data from the model representing our scientific hypothesis and compute the probability that our analysis will correctly reject the null hypothesis.
Let’s start by guessing a sample size for our analysis:
Now we’ll generate 1000 datasets, each of size `TrySampleSize`, which have the data properties we defined above.
Most importantly, for each of these surrogate datasets our scientific hypothesis is true.
Therefore, a successful statistical analysis should correctly reject the null hypothesis of no difference in lifespan as a function of substance X levels.
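A sketch of this simulation; the guessed values below (mean lifespan, standard deviation, sample size) are placeholders for whatever you chose above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Prior beliefs (placeholder guesses; substitute your own values from above)
mean_lifespan = 75.0          # guessed mean lifespan, in years
std_lifespan = 7.5            # guessed standard deviation of lifespan, in years
slope_hypothesized = 5 / 4    # hypothesized slope: years per unit z-score of substance X
TrySampleSize = 100           # candidate sample size (placeholder)

n_datasets = 1000
alpha = 0.05
p_values = np.zeros(n_datasets)

for k in range(n_datasets):
    x_sim = np.random.normal(0.0, 1.0, size=TrySampleSize)   # z-scored substance X levels
    lifespan_sim = (mean_lifespan
                    + slope_hypothesized * x_sim
                    + np.random.normal(0.0, std_lifespan, size=TrySampleSize))
    df_k = pd.DataFrame({"x": x_sim, "lifespan": lifespan_sim})
    p_values[k] = smf.ols("lifespan ~ 1 + x", data=df_k).fit().pvalues["x"]

statistical_power = np.mean(p_values < alpha)   # proportion of surrogate experiments with p < alpha
print(statistical_power)
```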
How does our statistical power depend on our choice of `TrySampleSize`?
Let’s repeat the above analysis for a range of values of `TrySampleSize`.
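For example (a sketch reusing the simulation above; the candidate sample sizes are placeholders):

```python
import matplotlib.pyplot as plt

candidate_sizes = [25, 50, 100, 200, 400, 800]   # assumed values of TrySampleSize
powers = []

for TrySampleSize in candidate_sizes:
    p_values = np.zeros(n_datasets)
    for k in range(n_datasets):
        x_sim = np.random.normal(0.0, 1.0, size=TrySampleSize)
        lifespan_sim = (mean_lifespan
                        + slope_hypothesized * x_sim
                        + np.random.normal(0.0, std_lifespan, size=TrySampleSize))
        df_k = pd.DataFrame({"x": x_sim, "lifespan": lifespan_sim})
        p_values[k] = smf.ols("lifespan ~ 1 + x", data=df_k).fit().pvalues["x"]
    powers.append(np.mean(p_values < alpha))

plt.plot(candidate_sizes, powers, marker="o")
plt.axhline(0.8, color="red", linestyle="--")
plt.xlabel("TrySampleSize")
plt.ylabel("Estimated statistical power")
plt.show()
```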
4- Summary
To start this unit, we chose sample size `N` and began with `N` observations in our original data set. Analyzing these data, you may have detected a weak - but not significant - positive relationship between the biomarker \(x\) and lifespan. These results were encouraging, and consistent with the expected result, but did not provide enough evidence to make a scientific conclusion.
To collect enough evidence to reveal the relationship in our data, we needed to choose an appropriate sample size. There are many approaches to do so.
If you chose paths 3A or 3B, then you repurposed the preliminary data to perform a sample size calculation.
Depending on the adventure you chose, you may have used the original data to:
- 3A: Resample the data to create pseudodata with different sample sizes `N_resampled`.
- 3B: Model the data to create simulated data with different sample sizes `N_modeled`.
You repeated this procedure to compute the statistical power: the proportion of times we reject the null hypothesis.
This procedure provides a more direct, intuitive approach to computing the statistical power.
If you have some data, you can compute the sample size!
If you chose path 3C, then you chose to abandon the original data and perform a sample size calculation based on the hypothesized effect (i.e., a positive relationship between biomarker \(x\) and lifespan) and knowledge of the data. Your adventure followed:
- 3C: Create simulated data with different sample sizes `TrySampleSize`.
This approach is especially useful if you do not have (or do not trust) preliminary data.
You may have explored one path, but the adventure does not end here. Try the other paths to explore alternative approaches to compute sample size.