Week 2 Lab (Jupyter): Descriptive Stats + Sampling Variability + Bootstrap
Course focus: Descriptive statistics (center/spread/shape) + population vs sample + sampling variability
Lab focus: Bootstrap intuition (mean vs median) + why estimates “move” + effect of sample size (n)
In [4]:
# Cell 1 — Imports + settings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
rng = np.random.default_rng(7)
rng.normal(size=3)
Out[4]:
Example 1: Descriptive statistics (center, spread, shape)
We’ll compute: mean/median, SD/variance, IQR, five-number summary, and flag outliers using the 1.5×IQR rule.
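As a quick check of the 1.5×IQR rule before it is applied to the lab data, the sketch below works through a small hand-made array. The values in `toy` are hypothetical and chosen only to keep the arithmetic easy to follow. It also assumes NumPy's default linear-interpolation quantile definition, which may differ slightly from a textbook's hand method.

```python
import numpy as np

# Hypothetical toy data: nine evenly spaced values plus one extreme value.
toy = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

q1, q3 = np.quantile(toy, [0.25, 0.75])  # default linear interpolation
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are flagged as potential outliers.
flagged = toy[(toy < lower_fence) | (toy > upper_fence)]

print(q1, q3, iqr)               # 3.25 7.75 4.5
print(lower_fence, upper_fence)  # -3.5 14.5
print(flagged)                   # [50.]
```

Only the value 50 falls outside the fences. The rule flags candidates for inspection; it does not by itself say a point is an error.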
In [33]:
# Cell 2 — Build a dataset with skew + an outlier
n = 60
x = rng.lognormal(mean=1.0, sigma=0.6, size=n) # right-skewed
x[-1] *= 10 # inject outlier
x = pd.Series(x, name="x")
x.head(), x.describe()
Out[33]:
In [34]:
# Cell 3 — Five-number summary + IQR + outlier fences
q1 = x.quantile(0.25)
q3 = x.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
five_num = {
    "min": x.min(),
    "Q1": q1,
    "median": x.median(),
    "Q3": q3,
    "max": x.max(),
}
five_num, iqr, (lower_fence, upper_fence)
Out[34]:
In [35]:
# Cell 4 — Identify outliers + summary of center/spread
outliers = x[(x < lower_fence) | (x > upper_fence)]
summary = {
    "mean": x.mean(),
    "median": x.median(),
    "std": x.std(ddof=1),
    "var": x.var(ddof=1),
    "IQR": iqr,
    "n_outliers": len(outliers),
}
summary, outliers
Out[35]:
In [37]:
# Cell 5 — Visual diagnostics: histogram + boxplot
plt.figure()
plt.hist(x, bins=25)
plt.title("Histogram (skew + outlier)")
plt.xlabel("x"); plt.ylabel("count")
plt.show()
plt.figure()
plt.boxplot(x, vert=False, showmeans=True)
plt.title("Boxplot (shows outliers)")
plt.xlabel("x")  # horizontal boxplot: the data values run along the x-axis
plt.show()
Example 2: Sampling variability and why estimates “move”
We create a population (known distribution), then repeatedly take random samples and compute the sample mean. We compare how variability changes for different sample sizes (n).
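For a simple random sample that is small relative to the population, the SD of the sample mean is approximately σ/√n, which is why the spread of the estimates shrinks as n grows. The sketch below compares that approximation with a simulated value. The names `demo_rng`, `toy_pop`, and `results` are hypothetical and kept separate from the lab's variables, and the seed and the 2000 repetitions are arbitrary choices.

```python
import numpy as np

demo_rng = np.random.default_rng(0)  # hypothetical seed, separate from the lab's rng

# Assumed stand-in population with the same shape as the lab's lognormal.
toy_pop = demo_rng.lognormal(mean=1.0, sigma=0.6, size=50_000)
sigma = toy_pop.std(ddof=0)

results = {}
for n_demo in (10, 50, 200):
    # Empirical SE: SD of many simulated sample means.
    sample_means = np.array([
        demo_rng.choice(toy_pop, size=n_demo, replace=False).mean()
        for _ in range(2000)
    ])
    sim_se = sample_means.std(ddof=1)
    theory_se = sigma / np.sqrt(n_demo)  # ignores the finite-population correction
    results[n_demo] = (sim_se, theory_se)
    print(f"n={n_demo:>3}  simulated SE={sim_se:.3f}  sigma/sqrt(n)={theory_se:.3f}")
```

The two columns should roughly agree, and multiplying n by 4 should roughly halve the standard error.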
In [41]:
# Cell 6 — Define a population (large synthetic population)
# (Treat this as the "true" population we are sampling from.)
population = rng.lognormal(mean=1.0, sigma=0.6, size=200_000)
pop_mu = population.mean()
pop_sd = population.std(ddof=0)
pop_mu, pop_sd
Out[41]:
In [56]:
# Cell 7 — Repeated sampling experiment
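def repeated_sampling_means(pop, n, R=3000, rng=None):
    # Draw R samples of size n without replacement and record each sample mean.
    if rng is None:
        rng = np.random.default_rng(7)  # fall back to a seeded generator when none is passed
    means = np.empty(R)
    for r in range(R):
        sample = rng.choice(pop, size=n, replace=False)
        means[r] = sample.mean()
    return means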
R = 300
means_n10 = repeated_sampling_means(population, n=10, R=R, rng=rng)
means_n50 = repeated_sampling_means(population, n=50, R=R, rng=rng)
means_n200 = repeated_sampling_means(population, n=200, R=R, rng=rng)
means_n10
np.std(means_n10, ddof=1), np.std(means_n50, ddof=1), np.std(means_n200, ddof=1)
Out[56]:
In [57]:
# Cell 8 — Plot sampling distributions of the mean for different n
bins = 10
plt.figure()
plt.hist(means_n10, bins=bins)
plt.axvline(pop_mu)
plt.title("Sampling distribution of mean (n=10)")
plt.xlabel("sample mean"); plt.ylabel("count")
plt.show()
plt.figure()
plt.hist(means_n50, bins=bins)
plt.axvline(pop_mu)
plt.title("Sampling distribution of mean (n=50)")
plt.xlabel("sample mean"); plt.ylabel("count")
plt.show()
plt.figure()
plt.hist(means_n200, bins=bins)
plt.axvline(pop_mu)
plt.title("Sampling distribution of mean (n=200)")
plt.xlabel("sample mean"); plt.ylabel("count")
plt.show()
In [58]:
# Cell 9 — Quick table: variability vs n (empirical standard error)
se_table = pd.DataFrame({
    "n": [10, 50, 200],
    "SD of sample means (empirical SE)": [
        np.std(means_n10, ddof=1),
        np.std(means_n50, ddof=1),
        np.std(means_n200, ddof=1),
    ]
})
se_table
Out[58]:
Example 3: Bootstrap intuition for mean vs median (with an outlier)
Bootstrap = resample with replacement from the observed sample to approximate sampling variability.
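Because the draws are made with replacement, each bootstrap resample repeats some observations and leaves others out. For large n, roughly 1 − 1/e ≈ 63% of the original points appear at least once. The sketch below illustrates this with a hypothetical stand-in sample of integer labels; `demo_rng`, `n_obs`, and `obs` are names invented for this illustration, not part of the lab code.

```python
import numpy as np

demo_rng = np.random.default_rng(1)  # hypothetical seed for this illustration
n_obs = 1000
obs = np.arange(n_obs)  # stand-in "observed sample": each value labels one data point

# One bootstrap resample: n_obs draws WITH replacement from the observed sample.
resample = demo_rng.choice(obs, size=n_obs, replace=True)

# Fraction of the original points that appear at least once in the resample.
frac_unique = np.unique(resample).size / n_obs
print(f"distinct original points in resample: {frac_unique:.3f}")
print(f"large-n expectation 1 - 1/e:          {1 - np.exp(-1):.3f}")
```

This is also why a single extreme observation matters: some resamples contain it several times and others not at all, which spreads out the bootstrap distribution of the mean.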
In [ ]:
# Cell 10 — Bootstrap function
def bootstrap_statistic(x, stat_fn, B=5000, rng=None):
    # Resample x with replacement B times and apply stat_fn to each resample.
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x)
    n = len(x)
    stats = np.empty(B, dtype=float)
    for b in range(B):
        sample = rng.choice(x, size=n, replace=True)
        stats[b] = stat_fn(sample)
    return stats

def percentile_ci(samples, alpha=0.05):
    # Percentile interval: the alpha/2 and 1 - alpha/2 quantiles of the bootstrap statistics.
    lo = np.quantile(samples, alpha / 2)
    hi = np.quantile(samples, 1 - alpha / 2)
    return lo, hi
In [66]:
# Cell 11 — Run bootstrap for mean and median on the observed sample x
B = 5000
boot_mean = bootstrap_statistic(x.values, np.mean, B=B, rng=rng)
boot_med = bootstrap_statistic(x.values, np.median, B=B, rng=rng)
mean_ci = percentile_ci(boot_mean, alpha=0.05)
med_ci = percentile_ci(boot_med, alpha=0.05)
pd.DataFrame({
    "stat": ["mean", "median"],
    "point_estimate": [x.mean(), x.median()],
    "bootstrap_SD": [np.std(boot_mean, ddof=1), np.std(boot_med, ddof=1)],
    "CI_95_lo": [mean_ci[0], med_ci[0]],
    "CI_95_hi": [mean_ci[1], med_ci[1]],
})
Out[66]:
In [61]:
# Cell 12 — Plot bootstrap distributions
plt.figure()
plt.hist(boot_mean, bins=40)
plt.title("Bootstrap distribution: mean")
plt.xlabel("mean"); plt.ylabel("count")
plt.show()
plt.figure()
plt.hist(boot_med, bins=40)
plt.title("Bootstrap distribution: median")
plt.xlabel("median"); plt.ylabel("count")
plt.show()
In [62]:
# Cell 13 — Remove the outlier and compare stability
x_no = x.iloc[:-1] # drop the injected outlier
boot_mean_no = bootstrap_statistic(x_no.values, np.mean, B=B, rng=rng)
boot_med_no = bootstrap_statistic(x_no.values, np.median, B=B, rng=rng)
comparison = pd.DataFrame({
    "case": ["with outlier", "no outlier"],
    "mean": [x.mean(), x_no.mean()],
    "median": [x.median(), x_no.median()],
    "SD(boot mean)": [np.std(boot_mean, ddof=1), np.std(boot_mean_no, ddof=1)],
    "SD(boot median)": [np.std(boot_med, ddof=1), np.std(boot_med_no, ddof=1)],
})
comparison
Out[62]:
Student Task (deliverables)
Submit a single notebook (.ipynb) with the following:
Task A: Descriptive Stats (10 pts)
- Compute and report: mean, median, SD, IQR, five-number summary.
- Plot: histogram + boxplot.
- Identify outliers using the 1.5×IQR rule and print them.
Task B: Sampling Variability (10 pts)
- Using the provided population experiment, run repeated sampling for n = 10, 50, 200.
- Plot the sampling distributions (3 histograms).
- Make a table of the empirical standard error (SD of sample means) vs n.
- Write 3–4 sentences: Why does variability decrease when n increases?
Task C: Bootstrap Mean vs Median (10 pts)
- Bootstrap the mean and median (B=5000) for the dataset with the outlier.
- Plot both bootstrap distributions.
- Compute 95% percentile CIs for mean and median.
- Repeat after removing the outlier and compare:
  - Which statistic changes more (mean or median)?
  - Which bootstrap distribution is wider, and why?
Reflection (Bonus +2)
In one paragraph, explain the statement “Big n does not fix systematic bias” and give one real-world example.
In [63]:
# Cell 14 — Student: write answers here (replace with your text)
answers = {
    "TaskB_explanation": "WRITE YOUR 3–4 SENTENCES HERE",
    "TaskC_comparison": "WRITE YOUR COMPARISON HERE",
    "Bonus_reflection": "OPTIONAL: WRITE YOUR PARAGRAPH HERE",
}
answers
Out[63]: