What I’ve learned about Bootstrapping (statistics)
I’ve recently been learning some new things in machine learning, and came across a small but seemingly important topic called bootstrapping. When I looked it up, there weren’t many explanations for it aside from a few places (and Wikipedia, of course), so I thought I could try to explain it as well.
Bootstrapping is based loosely on the law of large numbers, a theorem which tells us that if an experiment is performed a large number of times, the average of the results will be close to the true average. An example (shown in the following code) is rolling a die multiple times: as you roll the die over and over and keep a running average of the results, that average gets closer and closer to the expected value of 3.5.
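A minimal sketch of that experiment (the roll counts below are arbitrary choices):

```python
import random

random.seed(42)

# Roll a fair die n times and report the average of the rolls.
# As n grows, the average drifts towards the expected value of 3.5.
for n_rolls in (10, 100, 1_000, 10_000, 100_000):
    rolls = [random.randint(1, 6) for _ in range(n_rolls)]
    print(f"{n_rolls:>7} rolls -> average {sum(rolls) / n_rolls:.3f}")
```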
That brings us to bootstrapping.
Bootstrapping is a resampling method, in which smaller samples are drawn with replacement from a larger sample \(P\). For instance, from a sample of [3, 6, 10, 13, 14, 18, 33, 37, 46], we can draw [10, 18, 33], or [6, 13, 37], or even [14, 18, 18] (drawing with replacement means the same value can appear more than once). Generally, a large number of these samples is drawn, say \(n\) of them. The statistic of interest can then be calculated for each of these samples, and from those an approximation of the mean or variance can be found.
The steps can be broken down into the following (sketched in code after this list):
- Draw a random sample from the dataset \(P\), \(n\) times
- Calculate the statistic of interest (called a bootstrap statistic) for each of the \(n\) samples
- Calculate the mean and variance of the bootstrap statistics to get the approximate mean and variance for \(P\)
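A rough sketch of those steps, using the example sample from above; the number of resamples and the resample size of 3 are arbitrary choices, and the mean is used as the bootstrap statistic:

```python
import random
import statistics

random.seed(0)

P = [3, 6, 10, 13, 14, 18, 33, 37, 46]
n = 1_000  # number of bootstrap resamples (arbitrary)
k = 3      # size of each resample (arbitrary, matching the examples above)

# Step 1: draw a random sample from P, with replacement, n times
bootstrap_samples = [random.choices(P, k=k) for _ in range(n)]

# Step 2: calculate the bootstrap statistic (here, the mean) for each sample
bootstrap_means = [statistics.mean(sample) for sample in bootstrap_samples]

# Step 3: the mean and variance of the bootstrap statistics approximate
# the mean of P and the spread of that estimate
print("bootstrap estimate of the mean:", statistics.mean(bootstrap_means))
print("variance of the bootstrap means:", statistics.variance(bootstrap_means))
```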
“Why resample?” was one of the first questions I had after reading through the explanations.
As it is, the ideal situation would be sampling from a very big population, in which case no resampling would be needed. However, if only a small sample is available, it can be treated as a mini-population, from which we draw repeated smaller samples with replacement. This makes bootstrapped samples good representations of the population, as shown in the code below:
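Here is a minimal sketch of the idea; the synthetic population, the sample size of 30, and the 2,000 resamples are all arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up "true" population (the distribution choice is arbitrary)
population = rng.exponential(scale=10.0, size=1_000_000)

# The only data we actually get to see: one small sample
small_sample = rng.choice(population, size=30, replace=False)

# Treat the small sample as a mini-population and resample it with replacement
boot_means = [
    rng.choice(small_sample, size=small_sample.size, replace=True).mean()
    for _ in range(2_000)
]

print("true population mean:   ", population.mean())
print("small-sample mean:      ", small_sample.mean())
print("mean of bootstrap means:", np.mean(boot_means))
print("std of bootstrap means: ", np.std(boot_means))  # uncertainty of the estimate
```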
Uses
As shown, bootstrapping can be used to estimate statistics (and their uncertainty) from a small or otherwise insufficient sample.
The machine learning context in which I came across bootstrapping was reading about ensemble learning algorithms and random forests. In a type of random forest, a number of decision trees are created, and each is trained on a smaller sample of the dataset drawn with replacement - this is called bagging (bootstrap aggregating). The alternative, done without replacement, is called pasting. I was surprised by how effective bagging actually was, especially on a small dataset.
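A hedged sketch of that comparison using scikit-learn's BaggingClassifier, where the synthetic dataset and the hyperparameters are arbitrary choices and the bootstrap flag toggles bagging (True) versus pasting (False):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A small synthetic classification dataset (sizes are arbitrary)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, with_replacement in [("bagging", True), ("pasting", False)]:
    clf = BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=100,
        max_samples=0.5,             # each tree sees half of the training set
        bootstrap=with_replacement,  # True = bagging, False = pasting
        random_state=0,
    )
    clf.fit(X_train, y_train)
    print(name, "accuracy:", clf.score(X_test, y_test))
```

Setting max_samples to a fraction means each tree only sees part of the training data, mirroring the smaller resamples described above.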
More information on bootstrapping:
- The Wikipedia article
- A Towards Data Science post I read, much more technical than I can write
- An explanation paper from an MIT class that is also very technical