Week 6 Lab — Commonly Used Distributions and the Central Limit Theorem¶
In this lab, we will:
- work with Bernoulli, Binomial, Poisson, Normal, and Exponential distributions,
- compute probabilities numerically,
- simulate random samples,
- compare theory and simulation,
- visualize the Central Limit Theorem (CLT).
We will use Python, NumPy, Matplotlib, and SciPy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom, poisson, norm, expon
np.random.seed(42)
# Bernoulli distribution
p = 0.7
print("P(X=1) =", p)
print("P(X=0) =", 1 - p)
print("Mean =", p)
print("Variance =", p * (1 - p))
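As a quick cross-check (a sketch, not required for the lab), SciPy's `bernoulli` distribution object should reproduce the same pmf values and moments we just printed by hand:

```python
from scipy.stats import bernoulli

p = 0.7
# pmf at 1 and 0 should match p and 1 - p
print("P(X=1) =", bernoulli.pmf(1, p))
print("P(X=0) =", bernoulli.pmf(0, p))
# built-in moments should match p and p*(1-p)
print("Mean =", bernoulli.mean(p))
print("Variance =", bernoulli.var(p))
```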
# Now we simulate Bernoulli observations.
# This is equivalent to repeating a success/failure experiment many times.
# For example:
# - success = component passes,
# - failure = component fails.
p = 0.7
n_samples = 1000
bernoulli_data = np.random.binomial(1, p, size=n_samples)
print("First 20 observations:")
print(bernoulli_data[:20])
print("Sample mean:", bernoulli_data.mean())
print("Sample variance:", bernoulli_data.var(ddof=1))
values, counts = np.unique(bernoulli_data, return_counts=True)
plt.figure(figsize=(6,4))
plt.bar(values, counts / n_samples, width=0.4)
plt.xticks([0, 1])
plt.xlabel("Outcome")
plt.ylabel("Relative frequency")
plt.title("Simulated Bernoulli Distribution")
plt.show()
Part 3 — Binomial Distribution¶
The Binomial distribution counts the number of successes in n independent Bernoulli trials, each with the same success probability p.
If X ~ Binomial(n, p), then:
P(X = k) = C(n,k) p^k (1-p)^(n-k)
E[X] = np
Var(X) = np(1-p)
Example:
Let X be the number of defective components in a sample of 20, where the defect probability is 0.08.
n = 20
p = 0.08
k = 2
prob_k2 = binom.pmf(k, n, p)
print(f"P(X=2) = {prob_k2:.6f}")
print("Mean =", n * p)
print("Variance =", n * p * (1 - p))
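To see that `binom.pmf` really is the formula above, we can evaluate C(n,k) p^k (1-p)^(n-k) by hand (a sketch using `math.comb`) and compare:

```python
from math import comb
from scipy.stats import binom

n, p, k = 20, 0.08, 2
# evaluate the pmf formula directly: C(n,k) p^k (1-p)^(n-k)
by_hand = comb(n, k) * p**k * (1 - p)**(n - k)
print("By hand:  ", by_hand)
print("binom.pmf:", binom.pmf(k, n, p))
```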
x = np.arange(0, n + 1)
pmf = binom.pmf(x, n, p)
plt.figure(figsize=(8,4))
plt.bar(x, pmf, width=0.7)
plt.xlabel("Number of successes")
plt.ylabel("Probability")
plt.title("Binomial Distribution: n=20, p=0.08")
plt.show()
Part 4 — Binomial Simulation¶
Now we compare the theoretical Binomial model with simulation.
n = 20
p = 0.08
n_sim = 10000
binom_sim = np.random.binomial(n, p, size=n_sim)
print("Simulated mean:", binom_sim.mean())
print("Theoretical mean:", n * p)
print("Simulated variance:", binom_sim.var(ddof=1))
print("Theoretical variance:", n * p * (1 - p))
plt.figure(figsize=(8,4))
plt.hist(binom_sim, bins=np.arange(-0.5, n + 1.5, 1), density=True)
plt.plot(x, pmf, 'o')
plt.xlabel("Number of successes")
plt.ylabel("Relative frequency / probability")
plt.title("Binomial: Simulation vs Theory")
plt.show()
Part 5 — Poisson Distribution¶
The Poisson distribution models counts of events in a fixed interval.
If X ~ Poisson(λ), then:
P(X=k) = exp(-λ) λ^k / k!
E[X] = λ
Var(X) = λ
Example:
Let X be the number of flaws per meter of cable, with average rate λ = 3.
lam = 3
print("P(X=0) =", poisson.pmf(0, lam))
print("P(X=2) =", poisson.pmf(2, lam))
print("Mean =", lam)
print("Variance =", lam)
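A common follow-up question is "at least one flaw". As a sketch, we can compute P(X ≥ 1) with the complement rule and cross-check it against SciPy's survival function `sf(k) = P(X > k)`:

```python
from scipy.stats import poisson

lam = 3
# complement rule: P(X >= 1) = 1 - P(X = 0)
p_at_least_one = 1 - poisson.pmf(0, lam)
print("P(X>=1) =", p_at_least_one)
# sf(0) = P(X > 0), which is the same event for a count variable
print("sf check:", poisson.sf(0, lam))
```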
x = np.arange(0, 15)
pmf_poisson = poisson.pmf(x, lam)
plt.figure(figsize=(8,4))
plt.bar(x, pmf_poisson, width=0.7)
plt.xlabel("Count")
plt.ylabel("Probability")
plt.title("Poisson Distribution: lambda = 3")
plt.show()
Part 6 — Poisson Simulation¶
lam = 3
n_sim = 10000
poisson_sim = np.random.poisson(lam, size=n_sim)
print("Simulated mean:", poisson_sim.mean())
print("Theoretical mean:", lam)
print("Simulated variance:", poisson_sim.var(ddof=1))
print("Theoretical variance:", lam)
plt.figure(figsize=(8,4))
plt.hist(poisson_sim, bins=np.arange(-0.5, 15.5, 1), density=True)
plt.plot(np.arange(0, 15), poisson.pmf(np.arange(0, 15), lam), 'o')
plt.xlabel("Count")
plt.ylabel("Relative frequency / probability")
plt.title("Poisson: Simulation vs Theory")
plt.show()
Part 7 — Normal Distribution¶
The Normal distribution is a continuous distribution used widely in science and engineering.
If X ~ N(μ, σ²), then:
μ is the mean and σ is the standard deviation.
Example:
Suppose a sensor output is normally distributed with mean 50 and standard deviation 4.
We compute the probability that the output is less than 58.
mu = 50
sigma = 4
prob_less_58 = norm.cdf(58, loc=mu, scale=sigma)
z_score = (58 - mu) / sigma
print("Z-score =", z_score)
print("P(X < 58) =", prob_less_58)
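The Z-score connects this problem to the standard normal: standardizing first and then using the standard normal CDF must give the same probability. A quick sketch of that equivalence:

```python
from scipy.stats import norm

mu, sigma = 50, 4
z = (58 - mu) / sigma
# the two routes to the same probability
print("P(X < 58)     =", norm.cdf(58, loc=mu, scale=sigma))
print("P(Z < 2)      =", norm.cdf(z))
```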
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 400)
pdf = norm.pdf(x, loc=mu, scale=sigma)
plt.figure(figsize=(8,4))
plt.plot(x, pdf)
plt.axvline(58, linestyle='--')
plt.xlabel("x")
plt.ylabel("Density")
plt.title("Normal Distribution")
plt.show()
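Interval probabilities work the same way, as a difference of CDF values. As an illustrative sketch (the interval μ ± σ, i.e. 46 to 54, is our choice, not part of the original example), the result should match the empirical rule's "about 68% within one standard deviation":

```python
from scipy.stats import norm

mu, sigma = 50, 4
# P(46 < X < 54) = F(54) - F(46), one standard deviation either side
p_within = norm.cdf(mu + sigma, mu, sigma) - norm.cdf(mu - sigma, mu, sigma)
print("P(46 < X < 54) =", p_within)
```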
Part 8 — Simulating Normal Data¶
mu = 50
sigma = 4
n_sim = 10000
normal_sim = np.random.normal(loc=mu, scale=sigma, size=n_sim)
print("Simulated mean:", normal_sim.mean())
print("Theoretical mean:", mu)
print("Simulated std:", normal_sim.std(ddof=1))
print("Theoretical std:", sigma)
plt.figure(figsize=(8,4))
plt.hist(normal_sim, bins=30, density=True)
plt.plot(x, pdf)
plt.xlabel("Value")
plt.ylabel("Density")
plt.title("Normal: Simulation vs Theory")
plt.show()
Part 9 — Exponential Distribution¶
The Exponential distribution is often used to model waiting times.
If X ~ Exponential(λ), then:
- mean = 1/λ,
- variance = 1/λ².
Example:
Suppose machine failures occur at rate λ = 0.5 failures per hour.
Then the average waiting time until the next failure is 1/0.5 = 2 hours.
lam = 0.5
print("Mean waiting time =", 1 / lam)
print("P(X < 2) =", expon.cdf(2, scale=1/lam))
x = np.linspace(0, 12, 400)
pdf_exp = expon.pdf(x, scale=1/lam)
plt.figure(figsize=(8,4))
plt.plot(x, pdf_exp)
plt.xlabel("Waiting time")
plt.ylabel("Density")
plt.title("Exponential Distribution")
plt.show()
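A distinctive property of the Exponential distribution is memorylessness: P(X > s + t | X > s) = P(X > t). As a sketch (the times s = 3 and t = 2 hours are illustrative choices), we can verify this numerically with the survival function:

```python
from scipy.stats import expon

lam = 0.5
s, t = 3.0, 2.0  # illustrative waiting times in hours
# conditional probability of surviving t more hours, given s hours already passed
cond = expon.sf(s + t, scale=1/lam) / expon.sf(s, scale=1/lam)
print("P(X > s+t | X > s) =", cond)
print("P(X > t)           =", expon.sf(t, scale=1/lam))
```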
Part 10 — Central Limit Theorem¶
The Central Limit Theorem states that the sampling distribution of the sample mean becomes approximately normal as the sample size increases.
This is true even when the original population is not normal, provided the sample size is sufficiently large.
We will demonstrate this using a strongly skewed population.
population = np.random.exponential(scale=2, size=100000)
plt.figure(figsize=(8,4))
plt.hist(population, bins=50, density=True)
plt.xlabel("Value")
plt.ylabel("Density")
plt.title("Original Population (Skewed)")
plt.show()
Part 11 — Sampling Distribution of the Mean¶
We now repeatedly draw samples and compute their means.
We will compare:
- sample size n = 5,
- sample size n = 30,
- sample size n = 100.
def sample_means(population, sample_size, n_reps=5000):
means = []
for _ in range(n_reps):
sample = np.random.choice(population, size=sample_size, replace=True)
means.append(sample.mean())
return np.array(means)
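As an aside, the loop above can be vectorized: since we sample with replacement, we can draw all replications at once as a 2-D array and average along each row. A sketch (the seed and `default_rng` generator are our choices, not part of the lab code):

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2, size=100000)

def sample_means_fast(population, sample_size, n_reps=5000):
    # draw an (n_reps, sample_size) array in one call, then average each row
    draws = rng.choice(population, size=(n_reps, sample_size), replace=True)
    return draws.mean(axis=1)

means_30 = sample_means_fast(population, 30)
print("Mean of sample means:", means_30.mean())
```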
means_5 = sample_means(population, 5)
means_30 = sample_means(population, 30)
means_100 = sample_means(population, 100)
plt.figure(figsize=(8,4))
plt.hist(means_5, bins=40, density=True)
plt.xlabel("Sample mean")
plt.ylabel("Density")
plt.title("Sampling Distribution of the Mean (n = 5)")
plt.show()
plt.figure(figsize=(8,4))
plt.hist(means_30, bins=40, density=True)
plt.xlabel("Sample mean")
plt.ylabel("Density")
plt.title("Sampling Distribution of the Mean (n = 30)")
plt.show()
plt.figure(figsize=(8,4))
plt.hist(means_100, bins=40, density=True)
plt.xlabel("Sample mean")
plt.ylabel("Density")
plt.title("Sampling Distribution of the Mean (n = 100)")
plt.show()
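The n = 100 histogram above should already look bell-shaped. As a visual check (a self-contained sketch that regenerates its own population, with a seeded generator of our choosing), we can overlay the Normal density the CLT predicts for this case, N(2, 4/100):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(0)
population = rng.exponential(scale=2, size=100000)
means_100 = np.array([rng.choice(population, size=100).mean() for _ in range(5000)])

# CLT prediction: means are approximately N(2, 4/100)
plt.figure(figsize=(8,4))
plt.hist(means_100, bins=40, density=True)
grid = np.linspace(means_100.min(), means_100.max(), 200)
plt.plot(grid, norm.pdf(grid, loc=2, scale=np.sqrt(4/100)))
plt.xlabel("Sample mean")
plt.ylabel("Density")
plt.title("CLT Check: Normal Curve over Sample Means (n = 100)")
plt.show()
```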
Part 12 — Numerical Check of the CLT¶
For an exponential population with scale = 2:
- population mean should be about 2,
- population variance should be about 4,
- variance of the sample mean should be about 4/n.
We compare this with simulation.
print("Population mean (approx.):", population.mean())
print("Population variance (approx.):", population.var())
print("\nFor n = 5")
print("Mean of sample means:", means_5.mean())
print("Variance of sample means:", means_5.var(ddof=1))
print("Theoretical variance approx.:", 4/5)
print("\nFor n = 30")
print("Mean of sample means:", means_30.mean())
print("Variance of sample means:", means_30.var(ddof=1))
print("Theoretical variance approx.:", 4/30)
print("\nFor n = 100")
print("Mean of sample means:", means_100.mean())
print("Variance of sample means:", means_100.var(ddof=1))
print("Theoretical variance approx.:", 4/100)
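A final sketch of what the CLT buys us in practice: if the sample means really are approximately Normal, then about 95% of them should fall within 1.96 standard errors of the population mean. The code below regenerates its own data (seed and rep counts are our choices) and checks that coverage for n = 30:

```python
import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=2, size=100000)
n = 30
means = np.array([rng.choice(population, size=n).mean() for _ in range(5000)])

# for Exponential(scale=2): mean 2, sd 2, so SE = 2 / sqrt(n)
se = 2 / np.sqrt(n)
inside = np.mean(np.abs(means - 2) < 1.96 * se)
print("Fraction within 1.96 SE:", inside)
```

The fraction should be close to, but not exactly, 0.95, since n = 30 is a modest sample size for a strongly skewed population.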
Part 13 — Final Questions¶
Answer the following after running the notebook:
- What is the difference between a Bernoulli and a Binomial random variable?
- When is the Poisson model appropriate?
- Why is the Normal model so important in applications?
- What does a Z-score represent?
- What does the Central Limit Theorem say precisely?
- Does the CLT apply to raw data or to sample means?