Week 6 Lab — Commonly Used Distributions and the Central Limit Theorem¶
In this lab, we will:
- work with Bernoulli, Binomial, Poisson, Normal, and Exponential distributions,
- compute probabilities numerically,
- simulate random samples,
- compare theory and simulation,
- visualize the Central Limit Theorem (CLT).
We will use Python, NumPy, Matplotlib, and SciPy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom, poisson, norm, expon
np.random.seed(42)
# Bernoulli distribution
p = 0.7
print("P(X=1) =", p)
print("P(X=0) =", 1 - p)
print("Mean =", p)
print("Variance =", p * (1 - p))
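As a quick cross-check (a sketch, not required for the lab), SciPy's `bernoulli` distribution object should reproduce the same pmf values and moments we just printed by hand:

```python
from scipy.stats import bernoulli

p = 0.7
# pmf at 1 and 0 should match p and 1 - p
print("P(X=1) =", bernoulli.pmf(1, p))
print("P(X=0) =", bernoulli.pmf(0, p))
# built-in moments should match p and p*(1-p)
print("Mean =", bernoulli.mean(p))
print("Variance =", bernoulli.var(p))
```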
# Now we simulate Bernoulli observations.
# This is equivalent to repeating a success/failure experiment many times.
# For example:
# - success = component passes,
# - failure = component fails.
p = 0.7
n_samples = 1000
bernoulli_data = np.random.binomial(1, p, size=n_samples)
print("First 20 observations:")
print(bernoulli_data[:20])
print("Sample mean:", bernoulli_data.mean())
print("Sample variance:", bernoulli_data.var(ddof=1))
values, counts = np.unique(bernoulli_data, return_counts=True)
plt.figure(figsize=(6,4))
plt.bar(values, counts / n_samples, width=0.4)
plt.xticks([0, 1])
plt.xlabel("Outcome")
plt.ylabel("Relative frequency")
plt.title("Simulated Bernoulli Distribution")
plt.show()
Part 3 — Binomial Distribution¶
The Binomial distribution counts the number of successes in n independent Bernoulli trials, each with the same success probability p.
If X ~ Binomial(n, p), then:
P(X = k) = C(n,k) p^k (1-p)^(n-k)
E[X] = np
Var(X) = np(1-p)
Example:
Let X be the number of defective components in a sample of 20, where the defect probability is 0.08.
n = 20
p = 0.08
k = 2
prob_k2 = binom.pmf(k, n, p)
print(f"P(X=2) = {prob_k2:.6f}")
print("Mean =", n * p)
print("Variance =", n * p * (1 - p))
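To see that `binom.pmf` really is the formula above, we can evaluate C(n,k) p^k (1-p)^(n-k) by hand (a sketch using `math.comb`) and compare:

```python
from math import comb
from scipy.stats import binom

n, p, k = 20, 0.08, 2
# evaluate the pmf formula directly: C(n,k) p^k (1-p)^(n-k)
by_hand = comb(n, k) * p**k * (1 - p)**(n - k)
print("By hand:  ", by_hand)
print("binom.pmf:", binom.pmf(k, n, p))
```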
x = np.arange(0, n + 1)
pmf = binom.pmf(x, n, p)
plt.figure(figsize=(8,4))
plt.bar(x, pmf, width=0.7)
plt.xlabel("Number of successes")
plt.ylabel("Probability")
plt.title("Binomial Distribution: n=20, p=0.08")
plt.show()
Part 4 — Binomial Simulation¶
Now we compare the theoretical Binomial model with simulation.
n = 20
p = 0.08
n_sim = 10000
binom_sim = np.random.binomial(n, p, size=n_sim)
print("Simulated mean:", binom_sim.mean())
print("Theoretical mean:", n * p)
print("Simulated variance:", binom_sim.var(ddof=1))
print("Theoretical variance:", n * p * (1 - p))
plt.figure(figsize=(8,4))
plt.hist(binom_sim, bins=np.arange(-0.5, n + 1.5, 1), density=True)
plt.plot(x, pmf, 'o')
plt.xlabel("Number of successes")
plt.ylabel("Relative frequency / probability")
plt.title("Binomial: Simulation vs Theory")
plt.show()
Part 5 — Poisson Distribution¶
The Poisson distribution models counts of events in a fixed interval.
If X ~ Poisson(λ), then:
P(X=k) = exp(-λ) λ^k / k!
E[X] = λ
Var(X) = λ
Example:
Let X be the number of flaws per meter of cable, with average rate λ = 3.
lam = 3
print("P(X=0) =", poisson.pmf(0, lam))
print("P(X=2) =", poisson.pmf(2, lam))
print("Mean =", lam)
print("Variance =", lam)
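A common follow-up question is "at least one flaw". As a sketch, we can compute P(X ≥ 1) with the complement rule and cross-check it against SciPy's survival function `sf(k) = P(X > k)`:

```python
from scipy.stats import poisson

lam = 3
# complement rule: P(X >= 1) = 1 - P(X = 0)
p_at_least_one = 1 - poisson.pmf(0, lam)
print("P(X>=1) =", p_at_least_one)
# sf(0) = P(X > 0), which is the same event for a count variable
print("sf check:", poisson.sf(0, lam))
```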
x = np.arange(0, 15)
pmf_poisson = poisson.pmf(x, lam)
plt.figure(figsize=(8,4))
plt.bar(x, pmf_poisson, width=0.7)
plt.xlabel("Count")
plt.ylabel("Probability")
plt.title("Poisson Distribution: lambda = 3")
plt.show()
Part 6 — Poisson Simulation¶
lam = 3
n_sim = 10000
poisson_sim = np.random.poisson(lam, size=n_sim)
print("Simulated mean:", poisson_sim.mean())
print("Theoretical mean:", lam)
print("Simulated variance:", poisson_sim.var(ddof=1))
print("Theoretical variance:", lam)
plt.figure(figsize=(8,4))
plt.hist(poisson_sim, bins=np.arange(-0.5, 15.5, 1), density=True)
plt.plot(np.arange(0, 15), poisson.pmf(np.arange(0, 15), lam), 'o')
plt.xlabel("Count")
plt.ylabel("Relative frequency / probability")
plt.title("Poisson: Simulation vs Theory")
plt.show()
Part 7 — Normal Distribution¶
The Normal distribution is a continuous distribution used widely in science and engineering.
If X ~ N(μ, σ²), then:
μ is the mean and σ is the standard deviation.
Example:
Suppose a sensor output is normally distributed with mean 50 and standard deviation 4.
We compute the probability that the output is less than 58.
mu = 50
sigma = 4
prob_less_58 = norm.cdf(58, loc=mu, scale=sigma)
z_score = (58 - mu) / sigma
print("Z-score =", z_score)
print("P(X < 58) =", prob_less_58)
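The Z-score connects this problem to the standard normal: standardizing first and then using the standard normal CDF must give the same probability. A quick sketch of that equivalence:

```python
from scipy.stats import norm

mu, sigma = 50, 4
z = (58 - mu) / sigma
# the two routes to the same probability
print("P(X < 58)     =", norm.cdf(58, loc=mu, scale=sigma))
print("P(Z < 2)      =", norm.cdf(z))
```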
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 400)
pdf = norm.pdf(x, loc=mu, scale=sigma)
plt.figure(figsize=(8,4))
plt.plot(x, pdf)
plt.axvline(58, linestyle='--')
plt.xlabel("x")
plt.ylabel("Density")
plt.title("Normal Distribution")
plt.show()
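Interval probabilities work the same way, as a difference of CDF values. As an illustrative sketch (the interval μ ± σ, i.e. 46 to 54, is our choice, not part of the original example), the result should match the empirical rule's "about 68% within one standard deviation":

```python
from scipy.stats import norm

mu, sigma = 50, 4
# P(46 < X < 54) = F(54) - F(46), one standard deviation either side
p_within = norm.cdf(mu + sigma, mu, sigma) - norm.cdf(mu - sigma, mu, sigma)
print("P(46 < X < 54) =", p_within)
```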
Part 8 — Simulating Normal Data¶
mu = 50
sigma = 4
n_sim = 10000
normal_sim = np.random.normal(loc=mu, scale=sigma, size=n_sim)
print("Simulated mean:", normal_sim.mean())
print("Theoretical mean:", mu)
print("Simulated std:", normal_sim.std(ddof=1))
print("Theoretical std:", sigma)
plt.figure(figsize=(8,4))
plt.hist(normal_sim, bins=30, density=True)
plt.plot(x, pdf)
plt.xlabel("Value")
plt.ylabel("Density")
plt.title("Normal: Simulation vs Theory")
plt.show()
Part 9 — Exponential Distribution¶
The Exponential distribution is often used to model waiting times.
If X ~ Exponential(λ), then:
- mean = 1/λ,
- variance = 1/λ².
Example:
Suppose machine failures occur at rate λ = 0.5 failures per hour.
Then the average waiting time until the next failure is 1/0.5 = 2 hours.
lam = 0.5
print("Mean waiting time =", 1 / lam)
print("P(X < 2) =", expon.cdf(2, scale=1/lam))
x = np.linspace(0, 12, 400)
pdf_exp = expon.pdf(x, scale=1/lam)
plt.figure(figsize=(8,4))
plt.plot(x, pdf_exp)
plt.xlabel("Waiting time")
plt.ylabel("Density")
plt.title("Exponential Distribution")
plt.show()
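A distinctive property of the Exponential distribution is memorylessness: P(X > s + t | X > s) = P(X > t). As a sketch (the times s = 3 and t = 2 hours are illustrative choices), we can verify this numerically with the survival function:

```python
from scipy.stats import expon

lam = 0.5
s, t = 3.0, 2.0  # illustrative waiting times in hours
# conditional probability of surviving t more hours, given s hours already passed
cond = expon.sf(s + t, scale=1/lam) / expon.sf(s, scale=1/lam)
print("P(X > s+t | X > s) =", cond)
print("P(X > t)           =", expon.sf(t, scale=1/lam))
```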
Part 10 — Central Limit Theorem¶
The Central Limit Theorem states that the sampling distribution of the sample mean becomes approximately normal as the sample size increases.
This is true even when the original population is not normal, provided the sample size is sufficiently large.
We will demonstrate this using a strongly skewed population.
population = np.random.exponential(scale=2, size=100000)
plt.figure(figsize=(8,4))
plt.hist(population, bins=50, density=True)
plt.xlabel("Value")
plt.ylabel("Density")
plt.title("Original Population (Skewed)")
plt.show()
Part 11 — Sampling Distribution of the Mean¶
We now repeatedly draw samples and compute their means.
We will compare:
- sample size n = 5,
- sample size n = 30,
- sample size n = 100.
def sample_means(population, sample_size, n_reps=5000):
means = []
for _ in range(n_reps):
sample = np.random.choice(population, size=sample_size, replace=True)
means.append(sample.mean())
return np.array(means)
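As an aside, the loop above can be vectorized: since we sample with replacement, we can draw all replications at once as a 2-D array and average along each row. A sketch (the seed and `default_rng` generator are our choices, not part of the lab code):

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2, size=100000)

def sample_means_fast(population, sample_size, n_reps=5000):
    # draw an (n_reps, sample_size) array in one call, then average each row
    draws = rng.choice(population, size=(n_reps, sample_size), replace=True)
    return draws.mean(axis=1)

means_30 = sample_means_fast(population, 30)
print("Mean of sample means:", means_30.mean())
```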
means_5 = sample_means(population, 5)
means_30 = sample_means(population, 30)
means_100 = sample_means(population, 100)
plt.figure(figsize=(8,4))
plt.hist(means_5, bins=40, density=True)
plt.xlabel("Sample mean")
plt.ylabel("Density")
plt.title("Sampling Distribution of the Mean (n = 5)")
plt.show()
plt.figure(figsize=(8,4))
plt.hist(means_30, bins=40, density=True)
plt.xlabel("Sample mean")
plt.ylabel("Density")
plt.title("Sampling Distribution of the Mean (n = 30)")
plt.show()
plt.figure(figsize=(8,4))
plt.hist(means_100, bins=40, density=True)
plt.xlabel("Sample mean")
plt.ylabel("Density")
plt.title("Sampling Distribution of the Mean (n = 100)")
plt.show()
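The n = 100 histogram above should already look bell-shaped. As a visual check (a self-contained sketch that regenerates its own population, with a seeded generator of our choosing), we can overlay the Normal density the CLT predicts for this case, N(2, 4/100):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(0)
population = rng.exponential(scale=2, size=100000)
means_100 = np.array([rng.choice(population, size=100).mean() for _ in range(5000)])

# CLT prediction: means are approximately N(2, 4/100)
plt.figure(figsize=(8,4))
plt.hist(means_100, bins=40, density=True)
grid = np.linspace(means_100.min(), means_100.max(), 200)
plt.plot(grid, norm.pdf(grid, loc=2, scale=np.sqrt(4/100)))
plt.xlabel("Sample mean")
plt.ylabel("Density")
plt.title("CLT Check: Normal Curve over Sample Means (n = 100)")
plt.show()
```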
Part 12 — Numerical Check of the CLT¶
For an exponential population with scale = 2:
- population mean should be about 2,
- population variance should be about 4,
- variance of the sample mean should be about 4/n.
We compare this with simulation.
print("Population mean (approx.):", population.mean())
print("Population variance (approx.):", population.var())
print("\nFor n = 5")
print("Mean of sample means:", means_5.mean())
print("Variance of sample means:", means_5.var(ddof=1))
print("Theoretical variance approx.:", 4/5)
print("\nFor n = 30")
print("Mean of sample means:", means_30.mean())
print("Variance of sample means:", means_30.var(ddof=1))
print("Theoretical variance approx.:", 4/30)
print("\nFor n = 100")
print("Mean of sample means:", means_100.mean())
print("Variance of sample means:", means_100.var(ddof=1))
print("Theoretical variance approx.:", 4/100)
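A final sketch of what the CLT buys us in practice: if the sample means really are approximately Normal, then about 95% of them should fall within 1.96 standard errors of the population mean. The code below regenerates its own data (seed and rep counts are our choices) and checks that coverage for n = 30:

```python
import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=2, size=100000)
n = 30
means = np.array([rng.choice(population, size=n).mean() for _ in range(5000)])

# for Exponential(scale=2): mean 2, sd 2, so SE = 2 / sqrt(n)
se = 2 / np.sqrt(n)
inside = np.mean(np.abs(means - 2) < 1.96 * se)
print("Fraction within 1.96 SE:", inside)
```

The fraction should be close to, but not exactly, 0.95, since n = 30 is a modest sample size for a strongly skewed population.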
Part 13 — Final Questions¶
Answer the following after running the notebook:
- What is the difference between a Bernoulli and a Binomial random variable?
- When is the Poisson model appropriate?
- Why is the Normal model so important in applications?
- What does a Z-score represent?
- What does the Central Limit Theorem say precisely?
- Does the CLT apply to raw data or to sample means?