Mastering Inference Analysis with Bootstrapping in Python

Introduction to Bootstrapping Analysis

Bootstrapping inference analysis is a powerful technique, particularly useful when working with extensive and representative datasets. Recently, my supervisor asked whether we could determine the true average number of views on Instagram Reels for leading brands in North America. Relying on the mean views from just one sample would be misleading, because a single point estimate says nothing about how far it may lie from the true value. To tackle such questions, we must adopt a probabilistic approach.

Bootstrapping: A Solution

Bootstrapping comes to our aid! This method draws many resamples from the data we already have, computes a summary statistic (like the mean) for each resample, and uses the spread of those statistics to establish an interval for the answer. Drawing 10,000 resamples by hand would be daunting, but Python makes it manageable.

In addition to forming confidence intervals, bootstrapping can also facilitate significance tests. This article aims to clarify bootstrapping and its role in generating confidence intervals.

Why Choose Bootstrapping?

If your supervisor questions the use of bootstrapping, you can explain that it is a non-parametric method: it does not assume that the data follow a particular distribution. We therefore don't need to worry about normality or equal variances, as we would with a traditional t-test. It is not assumption-free, though: the sample must be representative of the population and large enough to capture its shape.

Understanding Bootstrapping

Bootstrapping involves using resampled data (with replacement) to carry out statistical inference. A simple example in Python illustrates this concept effectively:

import numpy as np

reel_views = [191, 145]

# Draw len(reel_views) values at random, with replacement (the default).
bootstrapped_video_views = np.random.choice(reel_views, size=len(reel_views))
print(f'The first sample: {bootstrapped_video_views}')

Because each draw is made with replacement, the same value can appear more than once in a resample. In this example, the code might return [191, 191], [145, 145], [191, 145], or [145, 191].
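With only two values, every possible resample can be listed explicitly, which makes the idea concrete. The sketch below is an illustration rather than part of the original analysis: it uses itertools.product to enumerate all ordered draws with replacement and prints the mean of each one. The names all_resamples and resample_means are assumptions introduced here.

```python
from itertools import product

import numpy as np

reel_views = [191, 145]

# Every ordered way of drawing len(reel_views) values with replacement.
all_resamples = list(product(reel_views, repeat=len(reel_views)))

# The mean of each resample; together these form the complete bootstrap
# distribution of the mean for this tiny dataset.
resample_means = [float(np.mean(sample)) for sample in all_resamples]

for sample, mean in zip(all_resamples, resample_means):
    print(sample, mean)
```

Each of the four resamples is equally likely, so a random draw simply picks one entry from this list. With realistic data the list is far too long to enumerate, which is why resamples are drawn at random instead.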

Generating Bootstrapped Confidence Intervals

We can estimate uncertainty through bootstrapping by calculating confidence intervals. The goal is to identify an interval for the summary statistic (mean) that is likely to contain the true mean of the population the sample was drawn from. The procedure is as follows:

Create a bootstrap sample by randomly selecting with replacement from our data (video views).
Calculate the summary statistic (mean) from this new sample and store it as a bootstrap replicate.
Repeat these steps a fixed number of times (typically 10,000).
Take the 2.5th and 97.5th percentiles of the stored replicates; these are the bounds of the 95% confidence interval.

Here's how the code looks:

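from typing import Callable

import numpy as np


def draw_bs_reps(data: list, func: Callable, size: int = 10000) -> np.ndarray:
    """Generate bootstrap replicates. Typical size = 5,000 - 10,000 repetitions."""
    # Fixed seed so the replicates are reproducible from run to run.
    rng = np.random.default_rng(seed=42)
    # Each row is one bootstrap sample: len(data) draws with replacement.
    bs_sample = rng.choice(data, size=(size, len(data)))
    # Apply the summary statistic to every row: one replicate per sample.
    bs_replicates = np.apply_along_axis(func, arr=bs_sample, axis=1)
    return bs_replicates


reel_views = [191, 145, 198, 192, 1022, ...]  # Large Representative Sample (remaining values elided)
bootstrap_replicates = draw_bs_reps(reel_views, np.mean)
# The 2.5th and 97.5th percentiles bound the central 95% of the replicates.
conf_int = np.percentile(bootstrap_replicates, [2.5, 97.5])
print(f'95% Confidence Interval: {conf_int}')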

For example, the output might show a 95% confidence interval of [137.9, 150.3]. Read this as a statement about the procedure: if we repeatedly drew new samples of views from top brands and built an interval from each one in the same way, roughly 95% of those intervals would contain the true mean.
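A quick simulation can make the "95%" concrete. The sketch below is an illustration under stated assumptions, not part of the original analysis: it invents a population with a known mean (an exponential distribution with mean 150, chosen only because view counts tend to be right-skewed), draws many independent samples from it, builds a percentile bootstrap interval from each, and counts how often the interval contains the known mean. The constants and the helper percentile_ci are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

TRUE_MEAN = 150.0     # known only because the population is simulated
SAMPLE_SIZE = 100     # observations per simulated sample
N_REPLICATES = 1000   # bootstrap replicates per sample
N_TRIALS = 200        # number of independent samples to draw


def percentile_ci(sample: np.ndarray) -> np.ndarray:
    """Return the 95% percentile bootstrap interval for the mean of sample."""
    resamples = rng.choice(sample, size=(N_REPLICATES, len(sample)))
    return np.percentile(resamples.mean(axis=1), [2.5, 97.5])


hits = 0
for _ in range(N_TRIALS):
    # Draw a fresh sample from the skewed, simulated population.
    sample = rng.exponential(scale=TRUE_MEAN, size=SAMPLE_SIZE)
    low, high = percentile_ci(sample)
    # Count the trial as a hit when the interval contains the true mean.
    hits += int(low <= TRUE_MEAN <= high)

coverage = hits / N_TRIALS
print(f'Intervals containing the true mean: {coverage:.1%}')
```

The printed share should land near 95%. With skewed data and a modest sample size it often comes out slightly lower, which is one more reason the method needs a reasonably large sample.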

Visualizing Results

Visual representation of the results can enhance understanding. When plotting bootstrap replicates, consider using ECDFs, KDE plots, or histograms. Personally, I prefer ECDFs and KDE plots because neither depends on a choice of bins: an ECDF plots every replicate, while a KDE draws a smoothed density. Plotting all 10,000 bootstrap replicates as a KDE, with purple lines at the 2.5th and 97.5th percentiles, shows the interval directly: 95% of the bootstrap means fall between the two lines.

If the interval appears wide, increasing the number of bootstrap replicates (e.g., to 20,000) won't narrow it; more replicates only make the percentile estimates more stable. To achieve a tighter interval, you'll need to collect a larger dataset.
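The difference between adding replicates and adding data can be checked numerically. The following is a minimal sketch on synthetic data, not the Reels dataset: the gamma-distributed "views", the sample sizes of 200 and 800, and the helper ci_width are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)


def ci_width(data: np.ndarray, n_replicates: int) -> float:
    """Width of the 95% percentile bootstrap interval for the mean."""
    resamples = rng.choice(data, size=(n_replicates, len(data)))
    low, high = np.percentile(resamples.mean(axis=1), [2.5, 97.5])
    return float(high - low)


# Synthetic, right-skewed "views" with a mean of roughly 150 (not real data).
small_sample = rng.gamma(shape=2.0, scale=75.0, size=200)
large_sample = rng.gamma(shape=2.0, scale=75.0, size=800)

width_10k = ci_width(small_sample, n_replicates=10_000)
width_20k = ci_width(small_sample, n_replicates=20_000)
width_more_data = ci_width(large_sample, n_replicates=10_000)

print(f'200 observations, 10,000 replicates: width {width_10k:.1f}')
print(f'200 observations, 20,000 replicates: width {width_20k:.1f}')
print(f'800 observations, 10,000 replicates: width {width_more_data:.1f}')
```

Under these assumptions, doubling the replicates should leave the width almost unchanged, while four times as much data should roughly halve it, because the width shrinks with the square root of the sample size.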

Key Insights

Bootstrapped confidence intervals assist in estimating uncertainty.
This technique requires a large and representative sample.
Bootstraps cannot magically increase the data within a sample.
To obtain narrower intervals, you must increase your sample size.

In the upcoming article in this bootstrapping series, I will explore significance tests. Thank you for reading!
