import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (15, 8)
import numpy as np
import pandas as pd
import scipy.stats
rng = np.random.default_rng(seed=123)
$$\renewcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\text{var}} \newcommand{\sd}{\text{sd}}$$
This is here so we can use \P and \E and \var and \sd in LaTeX below.
Confidence intervals
Recall that if
- $X_1, \ldots, X_n$ are a bunch of independent samples from some distribution with mean $\mu$,
- $\bar X$ is the sample mean, and
- $S$ is the sample standard deviation,
then the difference between the sample mean and $\mu$, in units of $S/\sqrt{n}$: $$ T = \frac{\bar X - \mu}{S/\sqrt{n}} $$ has, approximately$^*$, Student's t distribution with $n-1$ degrees of freedom.
$^*$ (the approximation is better the bigger $n$ is and the closer $X_i$ is to Normal)
Rearranging, $$ \mu = \bar{X} - T \frac{S}{\sqrt{n}} $$ ... i.e., the true mean is within a few$^*$ multiples of $S/\sqrt{n}$ of the sample mean, where "a few" has the (symmetric) $t(\text{df}=n-1)$ distribution.
Therefore, if we choose $t_*$ so that $$ \P\{ - t_* \le T \le t_* \} = 95\%, $$ then $$ \P\{ \bar X - t_* S / \sqrt{n} \le \mu \le \bar X + t_* S / \sqrt{n} \} = 95\% . $$
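Spelling out that step: the event $\{-t_* \le T \le t_*\}$ is the same as $$ -t_* \le \frac{\bar X - \mu}{S/\sqrt{n}} \le t_* \qquad \Longleftrightarrow \qquad \bar X - t_* \frac{S}{\sqrt{n}} \;\le\; \mu \;\le\; \bar X + t_* \frac{S}{\sqrt{n}} , $$ obtained by multiplying through by $S/\sqrt{n}$ and then solving for $\mu$.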
Note: the random quantities in that statement are $\bar X$ and $S$, not $\mu$!
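As a quick check of the $t$ approximation, here is a sketch (not part of the analysis below; the Exponential model, sample size, seed, and number of replicates are arbitrary choices, and it uses its own generator so the random stream used later is unaffected) comparing simulated values of $T$ to the $t(\text{df}=n-1)$ density:
# sketch: simulate T = (xbar - mu) / (s / sqrt(n)) from a skewed (Exponential)
# distribution and compare its histogram to the Student's t density
rng_check = np.random.default_rng(seed=42)
n_check, mu_check = 20, 1.0
samples = rng_check.exponential(scale=mu_check, size=(10000, n_check))
tvals = ((samples.mean(axis=1) - mu_check)
         / (samples.std(axis=1, ddof=1) / np.sqrt(n_check)))
fig, ax = plt.subplots()
ax.hist(tvals, bins=60, density=True, label="simulated T")
tt = np.linspace(-5, 5, 200)
ax.plot(tt, scipy.stats.t.pdf(tt, df=n_check - 1), label=f"t(df={n_check - 1})")
ax.set_xlabel("T")
ax.legend();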
How to get a confidence interval
Suppose we have $n$ samples, with mean $\bar x$ and sample SD $s$, and that $$ \P\{ - t_* \le T \le t_* \} = \alpha, $$ where $T$ has the Student's $t$ distribution with $\text{df}=n-1$.
Then an $\alpha$-confidence interval is $$ \bar x - t_* s/\sqrt{n} \qquad \text{to} \qquad \bar x + t_* s/\sqrt{n} . $$
What does it mean? If you do a great many experiments and in each construct a 95% confidence interval for the mean, then$^*$ the true mean should lie inside about 95% of those confidence intervals.
${}^*$ if the $t$-approximation is good.
Question: Why 95%?
Question: When can you say that "the probability that the true mean is in the confidence interval is 95%"?
Question: How far outside the confidence interval do you expect $\mu$ to be, in the other cases?
Exercise: confidence intervals
I have surveyed 100 people with small puppies, chosen randomly in Eugene, and recorded how many puppy boops they had received in the last 24hrs. Here are the data:
boops = np.array([ 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3,
3, 3, 3, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 8, 8,
8, 8, 8, 9, 9, 10, 10, 10, 10, 10, 11, 11, 11, 12, 13, 13, 14,
15, 15, 15, 16, 17, 17, 17, 18, 18, 19, 19, 20, 20, 20, 21, 21, 22,
23, 23, 23, 25, 25, 26, 26, 27, 28, 28, 29, 30, 31, 31, 32, 33, 33,
33, 37, 37, 38, 41, 42, 42, 43, 44, 46, 48, 53, 53, 68, 89])
Important Question: How many boops/day, on average, does a Eugene puppy owner get? How sure are we, and how much variation is there between puppy owners? We'd like to understand the accuracy of this important estimate as carefully as possible, so be sure to validate the method you use with simulation.
Plan:
- Look at the data.
- Estimate the mean and provide a confidence interval.
- Summarize between-owner variation.
- Validate the coverage of our confidence-interval-producing procedure with simulation.
fig, ax = plt.subplots()
ax.hist(boops)
ax.set_title(f"mean = {np.mean(boops):.3}, sd = {np.std(boops, ddof=1):.3}")
ax.set_xlabel("number of boops")
ax.set_ylabel("frequency");
Estimate the mean: here we have $\bar X = 17.8$, $S=16.4$, and $n=100$, and so:
xbar = np.mean(boops)
s = np.std(boops, ddof=1)
n = 100
t_star = scipy.stats.t.ppf(.975, df=n-1)
ci = xbar - t_star * s/np.sqrt(n), xbar + t_star * s/np.sqrt(n)
print(f"A 95% CI for the mean number of boops per day (bpd) across puppy owners is from {ci[0]:.1f} to {ci[1]:.1f} bpd.")
print(f"The number of boops varies across owners from 0 to {np.max(boops)}, with a standard deviation of {s:.1f}bpd.")
A 95% CI for the mean number of boops per day (bpd) across puppy owners is from 14.5 to 21.0 bpd.
The number of boops varies across owners from 0 to 89, with a standard deviation of 16.4 bpd.
Simulation study: is our procedure for getting CIs well calibrated, i.e., do the intervals it produces actually cover the true value 95% of the time? First: pick a distribution.
Proposal: let's use the Exponential (it looks right for that histogram). We'd like the mean to be near 17.8, but since boop counts are whole numbers we'll round the simulated values, so we'll check empirically what the true mean of that rounded distribution is.
xbar, np.mean(np.round(rng.exponential(scale=xbar, size=10000000)))
(np.float64(17.79), np.float64(17.7832548))
def sim():
    # simulate one survey of n owners: rounded Exponential draws with mean xbar
    return np.round(rng.exponential(scale=xbar, size=n))
sim()
array([ 3., 8., 28., 5., 5., 4., 7., 4., 19., 0., 19., 30., 54.,
21., 7., 10., 2., 42., 38., 3., 6., 1., 2., 1., 2., 7.,
4., 3., 2., 4., 56., 13., 36., 11., 9., 2., 2., 25., 1.,
31., 53., 10., 27., 2., 16., 51., 47., 2., 28., 10., 8., 30.,
6., 5., 2., 6., 10., 34., 6., 47., 39., 5., 12., 16., 23.,
6., 22., 11., 29., 12., 2., 17., 18., 22., 12., 29., 17., 59.,
12., 17., 0., 24., 7., 23., 4., 23., 2., 27., 10., 24., 5.,
9., 1., 42., 17., 5., 1., 2., 15., 11.])
def get_ci(boops):
    # 95% t-based confidence interval for the mean of the array `boops`
    n = len(boops)
    xbar = np.mean(boops)
    s = np.std(boops, ddof=1)
    t_star = scipy.stats.t.ppf(.975, df=n-1)
    ci = xbar - t_star * s/np.sqrt(n), xbar + t_star * s/np.sqrt(n)
    return ci
# this should be the same as before
get_ci(boops)
(np.float64(14.532247400791988), np.float64(21.04775259920801))
many_cis = np.array([
get_ci(sim())
for _ in range(100)
])
many_cis.shape
(100, 2)
# code for plotting intervals
many_cis = many_cis[np.argsort(many_cis[:,0]),:]
fig, ax = plt.subplots()
for k, (x0, x1) in enumerate(many_cis):
    ax.plot(np.repeat(k, 2), (x0, x1))
ax.axhline(xbar, c='red');
many_cis = np.array([
get_ci(sim())
for _ in range(10000)
])
Here's how often the true mean falls inside our 95% CIs (if the procedure is "well calibrated", this should be about 95%):
np.mean(np.logical_and(
xbar > many_cis[:,0],
xbar < many_cis[:,1],
))
np.float64(0.9385)
Conclusion: the CIs are fairly well calibrated but slightly too narrow: in this simulation they covered the true mean roughly 94% of the time rather than 95%.
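As a rough follow-up (not in the original analysis): the $t$ approximation should get better with more samples, so we can rerun the same coverage simulation with a larger, arbitrarily chosen sample size and check that the coverage moves closer to 95%.
# follow-up sketch: same Exponential model and CI procedure, but with a larger
# (arbitrary) sample size, to see whether coverage gets closer to 95%
n_big = 400
big_cis = np.array([
    get_ci(np.round(rng.exponential(scale=xbar, size=n_big)))
    for _ in range(10000)
])
np.mean(np.logical_and(xbar > big_cis[:, 0], xbar < big_cis[:, 1]))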
Exercise: Power
The power of a statistical procedure is, roughly speaking, the probability that it correctly identifies whatever it is we're looking for, when that thing is really there.
For instance: say we'd like to find out whether people spend more on coffee on Monday than on Wednesday. So, we're going to:
- survey 25 people, writing down how much they spent on coffee on Monday and Wednesday,
- use the list of 25 differences (how much more each spent on Monday than Wednesday) to find a 95% confidence interval for the Monday-Wednesday difference,
- and if that CI does not include zero (i.e., it lies entirely above zero), we declare that we have good evidence that people do spend more on Mondays.
Question: what's our power under a realistic situation? Make up a situation in which there is a real difference, and simulate from it, to find out how often we "find good evidence" in favor of a difference.
Simulation model:
Test:
Power simulation:
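One possible way to fill these in, as a sketch with made-up numbers (the Monday-minus-Wednesday differences are taken to be Normal with a mean of 1 dollar and an SD of 5 dollars, both invented for illustration), reusing the get_ci function from above:
# hypothetical simulation model: each person's Monday-minus-Wednesday spending
# difference is Normal with mean 1 dollar and SD 5 dollars (made-up numbers)
def sim_differences(n=25, true_diff=1.0, sd=5.0):
    return rng.normal(loc=true_diff, scale=sd, size=n)

# the "test": good evidence means the 95% CI for the mean difference
# lies entirely above zero
def finds_evidence(diffs):
    lo, hi = get_ci(diffs)
    return lo > 0

# power simulation: the proportion of simulated surveys that find good evidence
power = np.mean([finds_evidence(sim_differences()) for _ in range(10000)])
power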