You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

291 KiB

Raw Blame History Unescape Escape

None <html lang="en"> <head> </head>

Week 6 Lab — Commonly Used Distributions and the Central Limit Theorem¶

In this lab, we will:

work with Bernoulli, Binomial, Poisson, Normal, and Exponential distributions,
compute probabilities numerically,
simulate random samples,
compare theory and simulation,
visualize the Central Limit Theorem (CLT).

We will use Python, NumPy, Matplotlib, and SciPy.

In [2]:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom, poisson, norm, expon

np.random.seed(42)

In [3]:

# Bernoulli distribution
p = 0.7

print("P(X=1) =", p)
print("P(X=0) =", 1 - p)
print("Mean =", p)
print("Variance =", p * (1 - p))

P(X=1) = 0.7
P(X=0) = 0.30000000000000004
Mean = 0.7
Variance = 0.21000000000000002

In [4]:

# Now we simulate Bernoulli observations.  
#This is equivalent to repeating a success/failure experiment many times.

#For example:
#- success = component passes,
#- failure = component fails.

p = 0.7
n_samples = 1000

bernoulli_data = np.random.binomial(1, p, size=n_samples)

print("First 20 observations:")
print(bernoulli_data[:20])

print("Sample mean:", bernoulli_data.mean())
print("Sample variance:", bernoulli_data.var(ddof=1))

First 20 observations:
[1 0 0 1 1 1 1 0 1 0 1 0 0 1 1 1 1 1 1 1]
Sample mean: 0.712
Sample variance: 0.20526126126126118

In [5]:

values, counts = np.unique(bernoulli_data, return_counts=True)

plt.figure(figsize=(6,4))
plt.bar(values, counts / n_samples, width=0.4)
plt.xticks([0, 1])
plt.xlabel("Outcome")
plt.ylabel("Relative frequency")
plt.title("Simulated Bernoulli Distribution")
plt.show()

No description has been provided for this image

Part 3 — Binomial Distribution¶

The Binomial distribution counts the number of successes in n independent Bernoulli trials.

If X ~ Binomial(n, p), then:

P(X = k) = C(n,k) p^k (1-p)^(n-k)
E[X] = np
Var(X) = np(1-p)

Example: Let X be the number of defective components in a sample of 20, if the defect probability is 0.08.

In [6]:

n = 20
p = 0.08

k = 2
prob_k2 = binom.pmf(k, n, p)

print(f"P(X=2) = {prob_k2:.6f}")
print("Mean =", n * p)
print("Variance =", n * p * (1 - p))

P(X=2) = 0.271091
Mean = 1.6
Variance = 1.4720000000000002

In [7]:

x = np.arange(0, n + 1)
pmf = binom.pmf(x, n, p)

plt.figure(figsize=(8,4))
plt.bar(x, pmf, width=0.7)
plt.xlabel("Number of successes")
plt.ylabel("Probability")
plt.title("Binomial Distribution: n=20, p=0.08")
plt.show()

Part 4 — Binomial Simulation¶

Now we compare the theoretical Binomial model with simulation.

In [8]:

n = 20
p = 0.08
n_sim = 10000

binom_sim = np.random.binomial(n, p, size=n_sim)

print("Simulated mean:", binom_sim.mean())
print("Theoretical mean:", n * p)

print("Simulated variance:", binom_sim.var(ddof=1))
print("Theoretical variance:", n * p * (1 - p))

Simulated mean: 1.5818
Theoretical mean: 1.6
Simulated variance: 1.4380525652565255
Theoretical variance: 1.4720000000000002

In [9]:

plt.figure(figsize=(8,4))
plt.hist(binom_sim, bins=np.arange(-0.5, n + 1.5, 1), density=True)
plt.plot(x, pmf, 'o')
plt.xlabel("Number of successes")
plt.ylabel("Relative frequency / probability")
plt.title("Binomial: Simulation vs Theory")
plt.show()

Part 5 — Poisson Distribution¶

The Poisson distribution models counts of events in a fixed interval.

If X ~ Poisson(λ), then:

P(X=k) = exp(-λ) λ^k / k!
E[X] = λ
Var(X) = λ

Example: Let X be the number of flaws per meter of cable, with average rate λ = 3.

In [10]:

lam = 3

print("P(X=0) =", poisson.pmf(0, lam))
print("P(X=2) =", poisson.pmf(2, lam))
print("Mean =", lam)
print("Variance =", lam)

P(X=0) = 0.049787068367863944
P(X=2) = 0.22404180765538775
Mean = 3
Variance = 3

In [11]:

x = np.arange(0, 15)
pmf_poisson = poisson.pmf(x, lam)

plt.figure(figsize=(8,4))
plt.bar(x, pmf_poisson, width=0.7)
plt.xlabel("Count")
plt.ylabel("Probability")
plt.title("Poisson Distribution: lambda = 3")
plt.show()

Part 6 — Poisson Simulation¶

In [12]:

lam = 3
n_sim = 10000

poisson_sim = np.random.poisson(lam, size=n_sim)

print("Simulated mean:", poisson_sim.mean())
print("Theoretical mean:", lam)

print("Simulated variance:", poisson_sim.var(ddof=1))
print("Theoretical variance:", lam)

Simulated mean: 3.0051
Theoretical mean: 3
Simulated variance: 3.0315771477147715
Theoretical variance: 3

In [13]:

plt.figure(figsize=(8,4))
plt.hist(poisson_sim, bins=np.arange(-0.5, 15.5, 1), density=True)
plt.plot(np.arange(0, 15), poisson.pmf(np.arange(0, 15), lam), 'o')
plt.xlabel("Count")
plt.ylabel("Relative frequency / probability")
plt.title("Poisson: Simulation vs Theory")
plt.show()

Part 7 — Normal Distribution¶

The Normal distribution is a continuous distribution used widely in science and engineering.

If X ~ N(μ, σ²), then:

μ is the mean,
σ is the standard deviation.

Example: Suppose a sensor output is normally distributed with mean 50 and standard deviation 4. We compute the probability that the output is less than 58.

In [14]:

mu = 50
sigma = 4

prob_less_58 = norm.cdf(58, loc=mu, scale=sigma)
z_score = (58 - mu) / sigma

print("Z-score =", z_score)
print("P(X < 58) =", prob_less_58)

Z-score = 2.0
P(X < 58) = 0.9772498680518208

In [15]:

x = np.linspace(mu - 4*sigma, mu + 4*sigma, 400)
pdf = norm.pdf(x, loc=mu, scale=sigma)

plt.figure(figsize=(8,4))
plt.plot(x, pdf)
plt.axvline(58, linestyle='--')
plt.xlabel("x")
plt.ylabel("Density")
plt.title("Normal Distribution")
plt.show()

Part 8 — Simulating Normal Data¶

In [16]:

mu = 50
sigma = 4
n_sim = 10000

normal_sim = np.random.normal(loc=mu, scale=sigma, size=n_sim)

print("Simulated mean:", normal_sim.mean())
print("Theoretical mean:", mu)

print("Simulated std:", normal_sim.std(ddof=1))
print("Theoretical std:", sigma)

Simulated mean: 50.002185302479354
Theoretical mean: 50
Simulated std: 3.9891411700286694
Theoretical std: 4

In [17]:

plt.figure(figsize=(8,4))
plt.hist(normal_sim, bins=30, density=True)
plt.plot(x, pdf)
plt.xlabel("Value")
plt.ylabel("Density")
plt.title("Normal: Simulation vs Theory")
plt.show()

Part 9 — Exponential Distribution¶

The Exponential distribution is often used to model waiting times.

If X ~ Exponential(λ), then:

mean = 1/λ

Example: Suppose machine failures occur at rate λ = 0.5 failures per hour. Then the average waiting time until the next failure is 1/0.5 = 2 hours.

In [18]:

lam = 0.5

print("Mean waiting time =", 1 / lam)
print("P(X < 2) =", expon.cdf(2, scale=1/lam))

Mean waiting time = 2.0
P(X < 2) = 0.6321205588285577

In [19]:

x = np.linspace(0, 12, 400)
pdf_exp = expon.pdf(x, scale=1/lam)

plt.figure(figsize=(8,4))
plt.plot(x, pdf_exp)
plt.xlabel("Waiting time")
plt.ylabel("Density")
plt.title("Exponential Distribution")
plt.show()

Part 10 — Central Limit Theorem¶

The Central Limit Theorem states that the sampling distribution of the sample mean becomes approximately normal as the sample size increases.

This is true even when the original population is not normal, provided the sample size is sufficiently large.

We will demonstrate this using a strongly skewed population.

In [26]:

population = np.random.exponential(scale=2, size=100000)

plt.figure(figsize=(8,4))
plt.hist(population, bins=100, density=True)
plt.xlabel("Value")
plt.ylabel("Density")
plt.title("Original Population (Skewed)")
plt.show()

Part 11 — Sampling Distribution of the Mean¶

We now repeatedly draw samples and compute their means.

We will compare:

sample size n = 5
sample size n = 30
sample size n = 100

In [21]:

def sample_means(population, sample_size, n_reps=5000):
    means = []
    for _ in range(n_reps):
        sample = np.random.choice(population, size=sample_size, replace=True)
        means.append(sample.mean())
    return np.array(means)

means_5 = sample_means(population, 5)
means_30 = sample_means(population, 30)
means_100 = sample_means(population, 100)

In [22]:

plt.figure(figsize=(8,4))
plt.hist(means_5, bins=40, density=True)
plt.xlabel("Sample mean")
plt.ylabel("Density")
plt.title("Sampling Distribution of the Mean (n = 5)")
plt.show()

In [23]:

plt.figure(figsize=(8,4))
plt.hist(means_30, bins=40, density=True)
plt.xlabel("Sample mean")
plt.ylabel("Density")
plt.title("Sampling Distribution of the Mean (n = 30)")
plt.show()

In [24]:

plt.figure(figsize=(8,4))
plt.hist(means_100, bins=40, density=True)
plt.xlabel("Sample mean")
plt.ylabel("Density")
plt.title("Sampling Distribution of the Mean (n = 100)")
plt.show()

Part 12 — Numerical Check of the CLT¶

For an exponential population with scale = 2:

population mean should be about 2,
population variance should be about 4,
variance of the sample mean should be about 4/n.

We compare this with simulation.

In [25]:

print("Population mean (approx.):", population.mean())
print("Population variance (approx.):", population.var())

print("\nFor n = 5")
print("Mean of sample means:", means_5.mean())
print("Variance of sample means:", means_5.var(ddof=1))
print("Theoretical variance approx.:", 4/5)

print("\nFor n = 30")
print("Mean of sample means:", means_30.mean())
print("Variance of sample means:", means_30.var(ddof=1))
print("Theoretical variance approx.:", 4/30)

print("\nFor n = 100")
print("Mean of sample means:", means_100.mean())
print("Variance of sample means:", means_100.var(ddof=1))
print("Theoretical variance approx.:", 4/100)

Population mean (approx.): 2.0012337957358866
Population variance (approx.): 3.983317421944735

For n = 5
Mean of sample means: 2.001016205586466
Variance of sample means: 0.7806365493596817
Theoretical variance approx.: 0.8

For n = 30
Mean of sample means: 2.006761294144693
Variance of sample means: 0.12896356422754562
Theoretical variance approx.: 0.13333333333333333

For n = 100
Mean of sample means: 2.0015222052034187
Variance of sample means: 0.040359743223367446
Theoretical variance approx.: 0.04

Part 13 — Final Questions¶

Answer the following after running the notebook:

What is the difference between a Bernoulli and a Binomial random variable?
When is the Poisson model appropriate?
Why is the Normal model so important in applications?
What does a Z-score represent?
What does the Central Limit Theorem say precisely?
Does the CLT apply to raw data or to sample means?

</html>

291 KiB Raw Blame History Unescape Escape