Week 2 lab: Descriptive statistics + Bootstrap (sampling variability)
Key points:
- Descriptive statistics (center, spread, shape)
- Population vs. sample
- Sampling variability
Lab focus:
- Bootstrap intuition (mean vs. median)
- Why estimates "move"
- Effect of sample size
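As a preview of the lab focus above, here is a minimal bootstrap sketch. It resamples one sample with replacement many times and records how much the mean and the median vary across resamples. The helper name `bootstrap_se`, the seed, the lognormal population, the resample count of 1000, and the two sample sizes are all illustrative assumptions, not values taken from the lab.

```python
import numpy as np

def bootstrap_se(sample, stat, n_boot=1000, rng=None):
    """Bootstrap standard error of `stat` (hypothetical helper for illustration)."""
    rng = np.random.default_rng() if rng is None else rng
    sample = np.asarray(sample)
    n = sample.size
    # Draw n_boot resamples of size n, with replacement, as rows of indices.
    idx = rng.integers(0, n, size=(n_boot, n))
    # Apply the statistic to each resample (each row).
    boot_stats = stat(sample[idx], axis=1)
    # The spread of the resampled statistics estimates the sampling variability.
    return boot_stats.std(ddof=1)

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
results = {}
for n in (30, 300):  # assumed sample sizes
    # Assumed right-skewed population.
    sample = rng.lognormal(mean=1.0, sigma=0.6, size=n)
    results[n] = {
        "se_mean": bootstrap_se(sample, np.mean, rng=rng),
        "se_median": bootstrap_se(sample, np.median, rng=rng),
    }

for n, res in results.items():
    print(f"n={n}: SE(mean)={res['se_mean']:.3f}, SE(median)={res['se_median']:.3f}")
```

Rerunning with a different seed changes the exact numbers, which is the sense in which estimates "move"; the standard errors should generally shrink as the sample size grows.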
In [8]:
# Cell 1: Imports + settings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
rng = np.random.default_rng(7)  # seeded generator so results are reproducible
rng.normal(size=3)  # quick check: draw three standard-normal values
Out[8]:
Example 1: Descriptive statistics
We will compute the mean, median, standard deviation (SD), variance, IQR, and five-number summary, and flag outliers using the $1.5 \times \mathrm{IQR}$ rule.
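To make the $1.5 \times \mathrm{IQR}$ rule concrete before applying it to the lab data, here is a small hand-checkable sketch. The toy values are made up for illustration, and the quartiles use NumPy's default linear interpolation, which is an assumption; other quantile conventions can give slightly different fences.

```python
import numpy as np

toy = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])  # made-up data with one extreme value

q1, q3 = np.percentile(toy, [25, 75])  # default linear interpolation
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside [lower_fence, upper_fence] are flagged as outliers.
flagged = toy[(toy < lower_fence) | (toy > upper_fence)]

print(q1, q3, iqr)               # 3.0 7.0 4.0
print(lower_fence, upper_fence)  # -3.0 13.0
print(flagged)                   # [100]
```

Only the value 100 falls outside the fences, so it is the single flagged point.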
In [16]:
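# Cell 2: Build a dataset with skew + an outlier
rng = np.random.default_rng(7)  # re-seed so this cell is reproducible on its own
n = 60
x = rng.lognormal(mean=1.0, sigma=0.6, size=n)  # right-skewed
x[-1] *= 10  # inflate the last value to create an outlier
x = pd.Series(x, name="x")
x.head(), x.describe()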
Out[16]:
In [24]:
# Cell 3: Five-number summary + IQR + outlier fences
q1 = x.quantile(0.25)
q2 = x.quantile(0.50)  # median
q3 = x.quantile(0.75)
iqr = q3 - q1
lowerFence = q1 - 1.5 * iqr
upperFence = q3 + 1.5 * iqr
fiveN = {
    "min": x.min(),
    "Q1": q1,
    "median": q2,
    "Q3": q3,
    "max": x.max(),
}
fiveN, iqr, lowerFence, upperFence
Out[24]:
In [31]:
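# Cell 4: Outliers + summary statistics
outliers = x[(x < lowerFence) | (x > upperFence)]  # points outside the fences
summary = {
    "mean": x.mean(),
    "median": x.median(),
    "std": x.std(ddof=1),  # sample SD (divides by n - 1)
    "var": x.var(ddof=1),  # sample variance (divides by n - 1)
    "IQR": iqr,
    "nOutliers": len(outliers),
}
summary, outliers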
Out[31]:
In [35]:
# Cell 5: Visual diagnostics - histogram + boxplot
nBins = 25
plt.figure()
plt.hist(x, bins=nBins)
plt.title("Histogram")
plt.xlabel("x")
plt.ylabel("Count")
plt.show()
plt.figure()
plt.boxplot(x, vert=False, showmeans=True)  # showmeans marks the mean alongside the median line
plt.title("Boxplot")
plt.xlabel("x")
plt.show()