Bootstrapped error calculations


Bootstrapping is a well-established method for estimating statistical errors (rather than, say, systematic or technical errors). The idea behind the bootstrap is to ask how much the estimate of whatever you’re interested in (e.g. the mean of your data) would have changed if you’d collected a slightly different set of data, still drawn from the same underlying population distribution. If your estimate is being substantially influenced by an outlier or some other peculiarity not representative of the data set, the bootstrap can let you know.

Say you have 100 samples; how much does your estimate of the mean depend on a particular 10 of them? What if you collected samples over several days, and one day, for whatever reason, all of your sample values from that day were a little bit larger than the rest? How much would it change your estimate of the mean if you’d not collected data that day? A large part of statistics is about answering the question, what estimates of my sample mean would I have obtained if I re-ran this experiment 100 or 1000 more times? The bootstrap attempts to answer this question by direct simulation.

The classic reference on bootstrapping and why it works is Efron and Tibshirani (1993), An introduction to the bootstrap.

Considerations

The most essential prerequisite for bootstrapping is that your data set be large enough: you really can’t bootstrap a data set with 3 numbers in it. For the kinds of single-molecule data I’ve encountered, I’ve found 50-100 samples to be fine, but the more the better.

As a corollary to the data set size, your sample also needs to span a representative range of possible measurement values. Say the underlying population you’re measuring is Gaussian distributed around a mean of 20, and you make 10 measurements that all happen to be concentrated around the value 15. (Small sample sizes–it can happen!) When you bootstrap an estimator for the mean, because your sample’s variability didn’t accurately represent the population’s variability, all the bootstrapped estimates might be near 15, and you might erroneously conclude that you have low statistical uncertainty. With a larger data set, this kind of fluke becomes exponentially unlikely.
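To make this concrete, here is a small Python sketch (standard library only; the “unlucky” sample values and the `bootstrap_se` helper are made up for illustration) showing how an unrepresentative small sample can yield a misleadingly small bootstrapped error:

```python
import random
import statistics

random.seed(0)

def bootstrap_se(data, num_bs=1000):
    """Estimate the standard error of the mean by bootstrapping:
    resample the data with replacement, take the mean of each
    resample, and return the spread of those means."""
    n = len(data)
    means = [statistics.mean(random.choices(data, k=n)) for _ in range(num_bs)]
    return statistics.stdev(means)

# An unlucky sample of 10, clustered near 15 even though the
# (hypothetical) population mean is 20:
unlucky = [14.2, 15.1, 14.8, 15.5, 14.9, 15.3, 14.6, 15.0, 15.2, 14.7]
print(statistics.mean(unlucky))   # close to 15, not 20
print(bootstrap_se(unlucky))      # small, i.e. misleadingly "confident"
```

The bootstrap can only tell you about the variability present in the sample you actually collected; it cannot detect that the whole sample landed in the wrong place.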

How do you know if you have “enough” data? Well, when in doubt, take more data and see if it changes the estimate of the mean. Alternatively, you can generate multiple sub-sets of your data, of varying size, to ask how confident your estimate of the mean would have been if you’d collected, say, half the data. That is, sub-sample your whole data set with replacement, generating, say, 100 new data sets that each have half the number of measurements of your full data set. (“With replacement” means each sub-sample can contain duplicates from your original data set.) Calculate the mean value (or whatever your parameter of interest is) for each of the 100 new half-sized data sets. How much does that mean value vary? What if your sub-sampled data sets had a quarter of the data?
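That sub-sampling check might look like the following Python sketch (standard library only; the `subsample_spread` helper and the Gaussian test data are hypothetical):

```python
import random
import statistics

def subsample_spread(data, frac, num_sets=100, seed=0):
    """Draw `num_sets` data sets with replacement, each a fraction
    `frac` of the full size, and return the spread (standard
    deviation) of their means."""
    rng = random.Random(seed)
    k = max(1, int(len(data) * frac))
    means = [statistics.mean(rng.choices(data, k=k)) for _ in range(num_sets)]
    return statistics.stdev(means)

random.seed(1)
data = [random.gauss(20, 2) for _ in range(100)]

# If the quarter-size spread is much larger than the half-size spread,
# your estimate is still limited by sample size, and more data would help.
print(subsample_spread(data, 0.5))
print(subsample_spread(data, 0.25))
```

If the spread keeps shrinking as the sub-sample size grows toward your full data set, you are still in the regime where more data noticeably improves the estimate.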

Bootstrapping time trajectories (or comparable data)

Single-molecule data often take the form of time trajectories, in which something (e.g., FRET value) is measured over time. Often these time trajectories have multiple states or events, and you may want to quantify something for each state. For example, say each time trajectory explores three different states, and you quantify the mean dwell time or mean lifetime of each state. For 100 time trajectories, each with 3 states, this would give you 3 data sets, each with 100 numbers in them.

When you bootstrap these data to estimate the error on each mean dwell time, you have to decide whether to bootstrap over dwell times, or over trajectories. That is, do you treat each of the 3 sets of dwell times independently, and bootstrap each separately? Or do you take into account that these 3 sets of numbers may not be independent from each other? What if there are correlations between dwells–e.g., if one trajectory has a really long first dwell, perhaps the second and third dwells are especially long as well. (Maybe that one molecule was just really slow, at everything!)

My choice is usually to bootstrap over trajectories, going with the principle of asking: what if I’d happened not to collect this particular trajectory as part of my data set? Or if I’d collected 5 copies of it instead of 1?

Bootstrapping with respect to trajectories requires a slight modification to a standard bootstrapping routine, as described below.

Code

If whatever software or language you’re doing data analysis in has a built-in bootstrapping function, use that. Matlab has one (called bootstrp).

The outline of a very simple bootstrapping routine would be:

(1) Load your data, or generate some random data for testing purposes:

data = randn(87,1); % a column vector, since bootstrp resamples rows

(2) Generate 1000 bootstrapped data sets (these days, with computers as fast as they are, 1000 or even 10,000 is not unreasonable. You can certainly have too few bootstrapped samples, but anything above a couple hundred should be fine. Above a reasonable threshold, increasing the number of bootstrapped data sets shouldn’t affect your estimate of the error!):

num_bs = 1000;

(3) Call Matlab’s bootstrapping function. Let’s say I’m trying to estimate the error on the sample mean:

means_bs = bootstrp(num_bs, @(x) mean(x), data);

Alternatively, to bootstrap with respect to trajectories, I first generate a resampled-with-replacement set of trajectory indices–that is, if I have 87 trajectories, I assign them each a number (usually the order they load in) and then resample the indices with replacement. Note below that instead of bootstrapping the data directly, I’m bootstrapping indices:

bootstat = bootstrp(num_bs, @(x) x, (1:length(data))');

Here, each row of bootstat will be a set of indices with which to resample data with replacement (bootstat will have size num_bs by length(data)).

Then I generate new data sets based on each row of bootstat–for example, load all the trajectories corresponding to the indices in the first row of bootstat, then extract the durations of the first dwell for each of these trajectories, and take the mean of those values. Then extract the durations of the second dwell of these trajectories and calculate the mean. Etc. Then move on to the next row of bootstat. Re-load the appropriate trajectories, extract all their first dwell durations … Eventually I would have num_bs new mean values for each dwell time.

(4) I report the standard deviation of the bootstrapped means as my estimate of the standard error on the mean:

err = std(means_bs);
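For the trajectory-based variant, steps 2-4 (resampling trajectory indices rather than individual dwell times) can be sketched end to end in Python; the three-state dwell-time data here are randomly generated stand-ins, not real measurements:

```python
import random
import statistics

# Hypothetical data: a dwell time for each of 3 states in each of 100
# trajectories. trajectories[i] = (dwell_state1, dwell_state2, dwell_state3).
random.seed(0)
trajectories = [tuple(random.expovariate(1 / tau) for tau in (2.0, 5.0, 10.0))
                for _ in range(100)]

num_bs = 1000
n = len(trajectories)
bs_means = {state: [] for state in range(3)}

for _ in range(num_bs):
    # Resample trajectory indices with replacement, then pull all three
    # dwell times from each chosen trajectory. Resampling whole
    # trajectories keeps any correlations between a molecule's dwells.
    indices = [random.randrange(n) for _ in range(n)]
    for state in range(3):
        bs_means[state].append(
            statistics.mean(trajectories[i][state] for i in indices))

# Report the spread of the bootstrapped means as the error estimate:
for state in range(3):
    err = statistics.stdev(bs_means[state])
    print(f"state {state + 1}: error estimate on mean dwell = {err:.3f}")
```

Because each resampled data set is built from whole trajectories, a “slow molecule” contributes to all three states together, just as in the original data.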

Thanks to Matt Johnson for teaching me this material, and for editing this tutorial for accuracy and clarity!