<html lang="en"> <head> </head>

Week 2 lab: Descriptive statistics + Bootstrap (sampling variability)

Key points:

  • Descriptive statistics (center, spread, shape)
  • Population vs. sample
  • Sampling variability
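
To make the population vs. sample distinction concrete, here is a minimal sketch of sampling variability. The synthetic lognormal "population", its size, and the helper name `sample_means` are assumptions made for illustration; they are not part of the lab's dataset.

```python
import numpy as np

# Hypothetical right-skewed "population" (an assumption for illustration only)
rng = np.random.default_rng(0)
population = rng.lognormal(mean=1.0, sigma=0.6, size=20_000)

def sample_means(pop, n, n_samples, rng):
    """Draw n_samples samples of size n (without replacement) and return each sample's mean."""
    return np.array(
        [rng.choice(pop, size=n, replace=False).mean() for _ in range(n_samples)]
    )

means = sample_means(population, n=60, n_samples=500, rng=rng)

print("population mean:      ", population.mean())
print("range of sample means:", means.min(), "to", means.max())
print("SD of sample means:   ", means.std(ddof=1))
```

The population mean is a single fixed number, while each sample of 60 gives a slightly different estimate of it. The spread of those estimates is what "sampling variability" refers to.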

Lab focus:

  • Bootstrap intuition (mean vs median)
  • Why estimates "move"
  • Effect of sample size
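
The bootstrap idea in the lab focus can be sketched as follows: resample the observed data with replacement many times, recompute the statistic each time, and use the spread of those values as an estimate of its standard error. The helper `bootstrap_se`, the number of resamples, and the sample sizes are illustrative assumptions, not the lab's prescribed implementation.

```python
import numpy as np

def bootstrap_se(sample, stat, n_boot=2000, rng=None):
    """Estimate the standard error of `stat` by resampling `sample` with replacement."""
    rng = np.random.default_rng() if rng is None else rng
    sample = np.asarray(sample)
    n = sample.size
    # Each row of idx holds the indices of one bootstrap resample
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_stats = stat(sample[idx], axis=1)  # one statistic per resample
    return boot_stats.std(ddof=1)

rng = np.random.default_rng(7)
results = {}
for n in (20, 80, 320):
    # Synthetic right-skewed data (assumed here; any 1-D sample would do)
    sample = rng.lognormal(mean=1.0, sigma=0.6, size=n)
    results[n] = {
        "se_mean": bootstrap_se(sample, np.mean, rng=rng),
        "se_median": bootstrap_se(sample, np.median, rng=rng),
    }

for n, r in results.items():
    print(f"n={n:4d}  SE(mean)={r['se_mean']:.3f}  SE(median)={r['se_median']:.3f}")
```

Comparing the rows shows why estimates "move": the bootstrap standard error of the mean typically shrinks as the sample size grows, and the mean and median can differ in how variable they are on skewed data.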
In [8]:
# cell 1: Imports + settings
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(7) 
rng.normal(size=3)
Out[8]:
array([ 0.00123015,  0.29874554, -0.27413786])

Example 1: Descriptive statistics

We will compute the mean, median, standard deviation (SD), variance, interquartile range (IQR), and five-number summary, then flag outliers using the $1.5\times \mathrm{IQR}$ rule.
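
For reference, these are the usual textbook definitions of the quantities named above. The sample variance uses the $n-1$ divisor, which corresponds to `ddof=1`; the exact quartile values depend on the interpolation convention the software uses.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s = \sqrt{s^2}$$

$$\mathrm{IQR} = Q_3 - Q_1, \qquad \text{lower fence} = Q_1 - 1.5\,\mathrm{IQR}, \qquad \text{upper fence} = Q_3 + 1.5\,\mathrm{IQR}$$

A point is flagged as an outlier when it falls below the lower fence or above the upper fence.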

In [16]:
# Cell 2: Build a dataset with skew + outlier
rng = np.random.default_rng(7) 
n = 60 
x = rng.lognormal(mean=1.0, sigma=0.6, size=n)  # right-skewed sample
x[-1] *= 10  # inflate the last value to create an extreme outlier
x = pd.Series(x, name="x")
x.head(), x.describe()
Out[16]:
(0    2.720289
 1    3.251926
 2    2.306007
 3    1.593041
 4    2.069273
 Name: x, dtype: float64,
 count    60.000000
 mean      3.025476
 std       2.512343
 min       0.600462
 25%       1.674114
 50%       2.626124
 75%       3.280787
 max      18.123107
 Name: x, dtype: float64)
In [24]:
# Cell 3: Five-number summary + IQR + outlier fences
q1 = x.quantile(0.25)
q2 = x.quantile(0.50)
q3 = x.quantile(0.75)
iqr = q3-q1

lowerFence = q1-1.5*iqr
upperFence = q3+1.5*iqr

fiveN = {
    "min": x.min(),
    "Q1": q1,
    "median": x.median(),
    "Q3": q3,
    "max": x.max()
}
fiveN, iqr, lowerFence, upperFence
Out[24]:
({'min': np.float64(0.6004620561861885),
  'Q1': np.float64(1.6741136607158649),
  'median': np.float64(2.6261243464658732),
  'Q3': np.float64(3.2807868747809836),
  'max': np.float64(18.12310679128717)},
 np.float64(1.6066732140651188),
 np.float64(-0.7358961603818135),
 np.float64(5.690796695878662))
In [31]:
# Cell 4: Flag outliers with the 1.5*IQR fences + summary statistics
outliers = x[(x<lowerFence) | (x>upperFence)]
summary = {
    "mean": x.mean(),
    "median": x.median(),
    "std": x.std(ddof=1),
    "var": x.var(ddof=1),
    "IQR": iqr,
    "nOutliers": len(outliers)
}
summary, outliers
Out[31]:
({'mean': np.float64(3.0254764501575737),
  'median': np.float64(2.6261243464658732),
  'std': np.float64(2.512343173247777),
  'var': np.float64(6.31186822016471),
  'IQR': np.float64(1.6066732140651188),
  'nOutliers': 5},
 7      6.074679
 44     6.142882
 49     9.027269
 58     6.443769
 59    18.123107
 Name: x, dtype: float64)
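
The gap between the mean and the median above is partly driven by the artificially inflated last point. As a rough sketch of that sensitivity, the block below rebuilds the data the same way as Cell 2 and compares the statistics with and without that point. The helper `describe_robustness` and the column names are hypothetical, introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the data the same way as Cell 2 (same seed, same inflation step)
rng = np.random.default_rng(7)
values = rng.lognormal(mean=1.0, sigma=0.6, size=60)
values[-1] *= 10
x = pd.Series(values, name="x")

def describe_robustness(s):
    """Return center and spread measures for a Series."""
    return pd.Series({
        "mean": s.mean(),
        "median": s.median(),
        "std": s.std(ddof=1),
        "IQR": s.quantile(0.75) - s.quantile(0.25),
    })

comparison = pd.DataFrame({
    "with_inflated_point": describe_robustness(x),
    "without_inflated_point": describe_robustness(x.iloc[:-1]),  # drop the last value
})
print(comparison.round(3))
```

Dropping a single extreme value lowers the mean and the SD, because both depend on every observation's magnitude. The median and IQR depend only on the ordering of the central values, which is why they are described as robust.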
In [35]:
# Cell 5: Visual Diagnostics - boxplot + histogram
nBins = 25
plt.figure()
plt.hist(x, bins=nBins)
plt.title("Histogram")
plt.xlabel("x"); plt.ylabel("Counts")
plt.show()

plt.figure()
plt.boxplot(x, vert=False, showmeans=True)
plt.title("Boxplot")
plt.xlabel("x")
plt.show()
[Figure: Histogram of x, 25 bins]
[Figure: Horizontal boxplot of x with the mean marked]
</html>