Week 2 Lab (Jupyter) — Descriptive Stats + Sampling Variability + Bootstrap

Course focus: Descriptive statistics (center/spread/shape) + population vs sample + sampling variability
Lab focus: Bootstrap intuition (mean vs median) + why estimates “move” + effect of sample size (n)

In [4]:

# Cell 1 — Imports + settings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
rng.normal(size=3)

Out[4]:

array([ 0.00123015,  0.29874554, -0.27413786])

Example 1 — Descriptive statistics (center, spread, shape)

We’ll compute: mean/median, SD/variance, IQR, five-number summary, and flag outliers using the 1.5×IQR rule.
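Before applying the 1.5×IQR rule to the lab data, it may help to see it on a tiny example. The sketch below uses a made-up five-value array (`toy` and the `toy_*` names are illustrative, not part of the lab dataset) and NumPy's default linear-interpolation quantiles.

```python
import numpy as np

# Hypothetical toy data (not the lab dataset): four typical values and one extreme value
toy = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

toy_q1, toy_q3 = np.quantile(toy, [0.25, 0.75])
toy_iqr = toy_q3 - toy_q1

# Fences sit 1.5 IQRs below Q1 and above Q3
toy_lower = toy_q1 - 1.5 * toy_iqr
toy_upper = toy_q3 + 1.5 * toy_iqr

# Points outside the fences are flagged as outliers
toy_flagged = toy[(toy < toy_lower) | (toy > toy_upper)]
print(toy_q1, toy_q3, toy_iqr, (toy_lower, toy_upper), toy_flagged)
```

Here Q1 = 2 and Q3 = 4, so the fences are -1 and 7, and only the value 100 is flagged.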

In [33]:

# Cell 2 — Build a dataset with skew + an outlier
n = 60
x = rng.lognormal(mean=1.0, sigma=0.6, size=n)  # right-skewed
x[-1] *= 10  # inject outlier
x = pd.Series(x, name="x")

x.head(), x.describe()

Out[33]:

(0    3.567780
 1    1.159428
 2    1.668679
 3    1.543439
 4    4.254155
 Name: x, dtype: float64,
 count    60.000000
 mean      3.999189
 std       5.105683
 min       0.512643
 25%       1.657953
 50%       2.505819
 75%       4.258704
 max      36.847101
 Name: x, dtype: float64)

In [34]:

# Cell 3 — Five-number summary + IQR + outlier fences
q1 = x.quantile(0.25)
q3 = x.quantile(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

five_num = {
    "min": x.min(),
    "Q1": q1,
    "median": x.median(),
    "Q3": q3,
    "max": x.max(),
}
five_num, iqr, (lower_fence, upper_fence)

Out[34]:

({'min': np.float64(0.5126428223213325),
  'Q1': np.float64(1.657952864334265),
  'median': np.float64(2.5058190123666924),
  'Q3': np.float64(4.258704400008809),
  'max': np.float64(36.84710076013428)},
 np.float64(2.600751535674544),
 (np.float64(-2.2431744391775514), np.float64(8.159831703520625)))

In [35]:

# Cell 4 — Identify outliers + summary of center/spread
outliers = x[(x < lower_fence) | (x > upper_fence)]
summary = {
    "mean": x.mean(),
    "median": x.median(),
    "std": x.std(ddof=1),
    "var": x.var(ddof=1),
    "IQR": iqr,
    "n_outliers": len(outliers),
}
summary, outliers

Out[35]:

({'mean': np.float64(3.999188799394666),
  'median': np.float64(2.5058190123666924),
  'std': np.float64(5.105682866634222),
  'var': np.float64(26.067997534642245),
  'IQR': np.float64(2.600751535674544),
  'n_outliers': 4},
 13    13.336209
 25    11.128763
 45    13.372176
 59    36.847101
 Name: x, dtype: float64)

In [37]:

# Cell 5 — Visual diagnostics: histogram + boxplot
plt.figure()
plt.hist(x, bins=25)
plt.title("Histogram (skew + outlier)")
plt.xlabel("x"); plt.ylabel("count")
plt.show()

plt.figure()
plt.boxplot(x, vert=False, showmeans=True)
plt.title("Boxplot (shows outliers)")
plt.xlabel("x")  # horizontal boxplot: the data values run along the x-axis
plt.show()


Example 2 — Sampling variability: why estimates “move”

We create a population (known distribution), then repeatedly take random samples and compute the sample mean. We compare how variability changes for different sample sizes (n).

In [41]:

# Cell 6 — Define a population (large synthetic population)
# (Treat this as the "true" population we are sampling from.)
population = rng.lognormal(mean=1.0, sigma=0.6, size=200_000)
pop_mu = population.mean()
pop_sd = population.std(ddof=0)

pop_mu, pop_sd

Out[41]:

(np.float64(3.246809305332583), np.float64(2.1294506397257105))
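As a sanity check on the simulated population, the values above can be compared with the closed-form mean and SD of a lognormal distribution. This is a minimal sketch that assumes the standard lognormal moment formulas; `theory_mean` and `theory_sd` are illustrative names, not variables used elsewhere in the lab.

```python
import math

# Parameters passed to rng.lognormal in Cell 6 (mean and SD of the underlying normal)
mu, sigma = 1.0, 0.6

# Closed-form moments of a lognormal distribution:
#   mean = exp(mu + sigma^2 / 2)
#   sd   = mean * sqrt(exp(sigma^2) - 1)
theory_mean = math.exp(mu + sigma**2 / 2)
theory_sd = theory_mean * math.sqrt(math.exp(sigma**2) - 1)

print(round(theory_mean, 3), round(theory_sd, 3))
```

The simulated values (about 3.247 and 2.129) should land close to these theoretical ones. The small gap is itself sampling variability, because the "population" is a finite draw of 200,000 values.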

In [56]:

# Cell 7 — Repeated sampling experiment
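def repeated_sampling_means(pop, n, R=3000, rng=None):
    """Draw R samples of size n (without replacement) and return their means."""
    if rng is None:
        # An integer default would fail at rng.choice, so build a Generator instead
        rng = np.random.default_rng()
    means = np.empty(R)
    for r in range(R):
        sample = rng.choice(pop, size=n, replace=False)
        means[r] = sample.mean()
    return means

R = 300  # overrides the function default of 3000
means_n10  = repeated_sampling_means(population, n=10,  R=R, rng=rng)
means_n50  = repeated_sampling_means(population, n=50,  R=R, rng=rng)
means_n200 = repeated_sampling_means(population, n=200, R=R, rng=rng)

# SD of the sample means for each n (empirical standard error)
np.std(means_n10, ddof=1), np.std(means_n50, ddof=1), np.std(means_n200, ddof=1)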

Out[56]:

(np.float64(0.6432811001722105),
 np.float64(0.30330130600629723),
 np.float64(0.14896614310477774))

In [57]:

# Cell 8 — Plot sampling distributions of the mean for different n
bins = 10
plt.figure()
plt.hist(means_n10, bins=bins)
plt.axvline(pop_mu)
plt.title("Sampling distribution of mean (n=10)")
plt.xlabel("sample mean"); plt.ylabel("count")
plt.show()

plt.figure()
plt.hist(means_n50, bins=bins)
plt.axvline(pop_mu)
plt.title("Sampling distribution of mean (n=50)")
plt.xlabel("sample mean"); plt.ylabel("count")
plt.show()

plt.figure()
plt.hist(means_n200, bins=bins)
plt.axvline(pop_mu)
plt.title("Sampling distribution of mean (n=200)")
plt.xlabel("sample mean"); plt.ylabel("count")
plt.show()

In [58]:

# Cell 9 — Quick table: variability vs n (empirical standard error)
se_table = pd.DataFrame({
    "n": [10, 50, 200],
    "SD of sample means (empirical SE)": [
        np.std(means_n10, ddof=1),
        np.std(means_n50, ddof=1),
        np.std(means_n200, ddof=1),
    ]
})
se_table

Out[58]:

     n  SD of sample means (empirical SE)
0   10                           0.643281
1   50                           0.303301
2  200                           0.148966
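The empirical standard errors in this table can be checked against the textbook formula SE = σ/√n. The sketch below copies the reported numbers from Out[41] and Out[58] rather than rerunning the experiment; `reported_pop_sd` and the array names are illustrative. It assumes draws are approximately independent, which is reasonable here because each sample is tiny relative to the 200,000-value population.

```python
import numpy as np

# Values copied from Out[41] and Out[58]
reported_pop_sd = 2.1294506397257105
sizes = np.array([10, 50, 200])
empirical_se = np.array([0.643281, 0.303301, 0.148966])

# Theoretical standard error of the mean: sigma / sqrt(n)
theoretical_se = reported_pop_sd / np.sqrt(sizes)

for n, emp, theo in zip(sizes, empirical_se, theoretical_se):
    print(f"n={n:>3}  empirical={emp:.3f}  sigma/sqrt(n)={theo:.3f}")
```

The two columns should agree to within a few percent. With only R = 300 repetitions, the empirical SE is itself a noisy estimate.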

Example 3 — Bootstrap intuition: mean vs median (with an outlier)

Bootstrap = resample with replacement from the observed sample to approximate sampling variability.

In [ ]:

# Cell 10 — Bootstrap function
def bootstrap_statistic(x, stat_fn, B=5000, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x)
    n = len(x)
    stats = np.empty(B, dtype=float)
    for b in range(B):
        sample = rng.choice(x, size=n, replace=True)
        stats[b] = stat_fn(sample)
    return stats

def percentile_ci(samples, alpha=0.05):
    lo = np.quantile(samples, alpha/2)
    hi = np.quantile(samples, 1 - alpha/2)
    return lo, hi
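The loop in `bootstrap_statistic` can also be written without an explicit Python loop by drawing all resample indices at once. The function below is a hypothetical variant, not part of the lab code. It only works when `stat_fn` accepts an `axis` argument (as `np.mean` and `np.median` do), and it holds a B×n array in memory.

```python
import numpy as np

def bootstrap_statistic_vectorized(x, stat_fn, B=5000, rng=None):
    """Hypothetical vectorized bootstrap; stat_fn must accept an `axis` argument."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x)
    n = len(x)
    # Each row holds the indices of one bootstrap resample (drawn with replacement)
    idx = rng.integers(0, n, size=(B, n))
    # Apply the statistic along each row to get B bootstrap replicates
    return stat_fn(x[idx], axis=1)

# Demo on synthetic data (illustrative only, not the lab sample)
demo_rng = np.random.default_rng(0)
demo = demo_rng.lognormal(mean=1.0, sigma=0.6, size=60)
demo_boot = bootstrap_statistic_vectorized(demo, np.median, B=2000, rng=demo_rng)
demo_lo, demo_hi = np.quantile(demo_boot, [0.025, 0.975])
print(demo_boot.shape, demo_lo, demo_hi)
```

For the sample sizes in this lab the loop version is fast enough, and it also works with statistics that have no `axis` parameter.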

In [66]:

# Cell 11 — Run bootstrap for mean and median on the observed sample x
B = 5000
boot_mean = bootstrap_statistic(x.values, np.mean, B=B, rng=rng)
boot_med  = bootstrap_statistic(x.values, np.median, B=B, rng=rng)

mean_ci = percentile_ci(boot_mean, alpha=0.05)
med_ci  = percentile_ci(boot_med,  alpha=0.05)

pd.DataFrame({
    "stat": ["mean", "median"],
    "point_estimate": [x.mean(), x.median()],
    "bootstrap_SD": [np.std(boot_mean, ddof=1), np.std(boot_med, ddof=1)],
    "CI_95_lo": [mean_ci[0], med_ci[0]],
    "CI_95_hi": [mean_ci[1], med_ci[1]],
})

Out[66]:

     stat  point_estimate  bootstrap_SD  CI_95_lo  CI_95_hi
0    mean        3.999189      0.648365  2.964689  5.471043
1  median        2.505819      0.295072  2.103734  3.163472

In [61]:

# Cell 12 — Plot bootstrap distributions
plt.figure()
plt.hist(boot_mean, bins=40)
plt.title("Bootstrap distribution: mean")
plt.xlabel("mean"); plt.ylabel("count")
plt.show()

plt.figure()
plt.hist(boot_med, bins=40)
plt.title("Bootstrap distribution: median")
plt.xlabel("median"); plt.ylabel("count")
plt.show()

In [62]:

# Cell 13 — Remove the outlier and compare stability
x_no = x.iloc[:-1]  # drop the injected outlier
boot_mean_no = bootstrap_statistic(x_no.values, np.mean, B=B, rng=rng)
boot_med_no  = bootstrap_statistic(x_no.values, np.median, B=B, rng=rng)

comparison = pd.DataFrame({
    "case": ["with outlier", "no outlier"],
    "mean": [x.mean(), x_no.mean()],
    "median": [x.median(), x_no.median()],
    "SD(boot mean)": [np.std(boot_mean, ddof=1), np.std(boot_mean_no, ddof=1)],
    "SD(boot median)": [np.std(boot_med, ddof=1), np.std(boot_med_no, ddof=1)],
})
comparison

Out[62]:

           case      mean    median  SD(boot mean)  SD(boot median)
0  with outlier  3.999189  2.505819       0.663112         0.294632
1    no outlier  3.442445  2.451680       0.356388         0.291492
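One way to read this table is as relative changes caused by a single observation. The sketch below copies the point estimates from Out[62]; the variable names are illustrative.

```python
# Point estimates copied from the comparison table (Out[62])
mean_with, mean_without = 3.999189, 3.442445
median_with, median_without = 2.505819, 2.451680

# Relative change in each statistic after dropping the injected outlier
mean_change = (mean_without - mean_with) / mean_with
median_change = (median_without - median_with) / median_with

print(f"mean changes by {mean_change:.1%}, median by {median_change:.1%}")
```

Dropping one of 60 points moves the mean by roughly 14% but the median by only about 2%, which matches the difference in bootstrap SDs shown in the table.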

Student Task (deliverables)

Submit a single notebook (.ipynb) with the following:

Task A — Descriptive Stats (10 pts)

Compute and report: mean, median, SD, IQR, five-number summary.
Plot: histogram + boxplot.
Identify outliers using the 1.5×IQR rule and print them.

Task B — Sampling Variability (10 pts)

Using the provided population experiment, run repeated sampling for n = 10, 50, 200.
Plot the sampling distributions (3 histograms).
Make a table of the empirical standard error (SD of sample means) vs n.
Write 3–4 sentences: Why does variability decrease when n increases?

Task C — Bootstrap Mean vs Median (10 pts)

Bootstrap the mean and median (B=5000) for the dataset with the outlier.
Plot both bootstrap distributions.
Compute 95% percentile CIs for mean and median.
Repeat after removing the outlier and compare:
- Which statistic changes more (mean or median)?
- Which bootstrap distribution is wider, and why?

Reflection (Bonus +2)

In one paragraph: “Big n does not fix systematic bias.” Give one real-world example.
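To make the claim concrete, here is a small synthetic sketch (not a real-world example, which is still yours to supply). It assumes a hypothetical biased sampling frame in which only units below the population median can ever be sampled; all names are illustrative.

```python
import numpy as np

demo_rng = np.random.default_rng(0)
demo_population = demo_rng.lognormal(mean=1.0, sigma=0.6, size=200_000)
true_mean = demo_population.mean()

# Hypothetical biased sampling frame: only units below the population median are reachable
reachable = demo_population[demo_population < np.median(demo_population)]

biased_means = {}
for n in (10, 1_000, 50_000):
    sample = demo_rng.choice(reachable, size=n, replace=False)
    biased_means[n] = sample.mean()
    print(f"n={n:>6}  biased sample mean={biased_means[n]:.3f}  true mean={true_mean:.3f}")
```

As n grows, the sample mean settles down, but it settles on the mean of the reachable subgroup rather than the population mean. A larger n shrinks the random error and leaves the systematic error untouched.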

In [63]:

# Cell 14 — Student: write answers here (replace with your text)
answers = {
    "TaskB_explanation": "WRITE YOUR 3–4 SENTENCES HERE",
    "TaskC_comparison": "WRITE YOUR COMPARISON HERE",
    "Bonus_reflection": "OPTIONAL: WRITE YOUR PARAGRAPH HERE",
}
answers

Out[63]:

{'TaskB_explanation': 'WRITE YOUR 3–4 SENTENCES HERE',
 'TaskC_comparison': 'WRITE YOUR COMPARISON HERE',
 'Bonus_reflection': 'OPTIONAL: WRITE YOUR PARAGRAPH HERE'}