Motivation, and the Poisson¶
Main points: the Poisson distribution shows up when you are counting rare events. So do Exponentials, as waiting times.
Suppose that we're running a solar panel manufacturing plant. Each panel is made up of many modules, each of which may be defective. Usually only a few are: the average number of defects per panel is 1. A few defects are okay, but if a panel has more then 2 defects, the panel will not work. What proportion of panels will work?
What do we need to know? Well, let's suppose that a panel has $N$ modules, and each module is broken independently of the others. If the probability of a particular module being broken is $p$, then we must have $p = 1/N$. (Why?)
Poisson approximation¶
Let's call the number of defects $X$. So, $X$ is Binomial($N$, $\lambda/N$) with $\lambda=1$. Binomial probabilities say that $$\begin{aligned} \P\{ X = k \} &= \frac{N (N-1) \cdots (N-k+1)}{k!} \left(\frac{\lambda}{N}\right)^k \left(1-\frac{\lambda}{N}\right)^{N-k} . \end{aligned}$$ However, $N$ was arbitrary. For large $N$, $$\begin{aligned} N (N-1) \cdots (N-k+1) &\approx N^k, \\ \text{and} \qquad \left(1 - \frac{\lambda}{N}\right)^N &\approx e^{-\lambda}, \end{aligned}$$ so $$\begin{aligned} \P\{ X = k \} &\approx \frac{1}{k!} \lambda^k e^{-\lambda} , \end{aligned}$$ i.e., $X$ is approximately Poisson with mean $\lambda$.
Example: defects¶
Suppose that a few defects are okay, but if a panel has more then 2 defects, the panel will not work. What proportion of panels will work?
Solution: Let $X$ denote the number of defective modules on a randomly chosen panel. Then $X$ has a Poisson($\lambda = 1$) distribution, so $$\begin{aligned} \P\{X \le 2\} &= \P\{X = 0\} + \P\{X = 1\} + \P\{X = 2\} \\ &= e^{-\lambda} \left( 1 + \lambda + \lambda^2 / 2 \right) , \end{aligned}$$ which is here
lam = 1
p_good = np.exp(-lam) * (1 + lam + lam**2 / 2)
print(f"Proportion {p_good:.3f} good, and {1-p_good:.3f} bad")
Proportion 0.920 good, and 0.080 bad
Exercise: What proportion of good panels still have some defects? Answer with math and check it by simulation.
x = rng.poisson(lam=1, size=1000000)
a, b, c = np.sum(x == 0), np.sum(x==1), np.sum(x==2)
(b+c)/(a+b+c)
0.6006656677663269
$$ \P\{ X > 0 | X \le 2 \} = \frac{ \P\{ X > 0 \text{ and } X \le 2 \} }{ \P\{X \le 2\} } = \frac{ e^{-1} (1 + 1/2) }{ e^{-1} (1 + 1 + 1/2) } = \frac{3}{5} $$
Waiting times¶
Here's a different example. Cosmic rays from outside the solar system pass through us pretty often. Let's say we have a cosmic ray detector that plays a sound whenever a strong enough cosmic ray passes through it, which turns out to be about once every 1.2 minutes. How many do we expect in 6 minutes? How long do we have to wait between sounds?
How many in 6 minutes? By the same argument as above, the number of cosmic rays we get in $t$ minutes is Poisson with mean $t / 1.2$. So, we expect about $6 / 1.2 = 5$ cosmic rays in 6 minutes, with a standard deviation of $\sqrt{5} \approx 2.23$ cosmic rays. The probability we see zero cosmic rays is $e^{-5} = 0.0067$
How long between sounds? Well, the probability there's no rays in $t$ minutes is $\exp(-1.2t)$, i.e., if $T$ is the time we have to wait for the next ray, then $$\P\{T > t\} = e^{-1.2 t}.$$
So: $T$ is Exponentially distributed with mean 1/1.2 minutes, and so $\E[T] = 1/1.2$ and $\sd[T] = 1/1.2$.
Memorylessness¶
Fact: The Exponential distribution is memoryless, i.e., if $T$ is Exponential, then $$ \P\{ T > t + s | T > t \} = \P\{ T > s \} . $$
Proof:
$$\begin{aligned} \P\{ T > t + s | T > t \} &= \frac{ \P\{ T > t + s \} }{ \P\{ T > t \} } \\ &= \frac{ e^{-\lambda(t+s)} }{ e^{-\lambda t} } \\ &= \P\{ T > s \} . \end{aligned}$$
X = rng.exponential(scale=1/1.2, size=1000000)
np.sum(X > 2) / np.sum(X > 1), np.mean(X > 1)
(0.2992146622915139, 0.301399)
Exercise: This implies that cosmic rays don't care how long I've already been waiting for one to arrive: the probability that $T > 2$ minutes given that $T > 1$ minute is equal to the probability that $T > 1$ minute. Demonstrate this, by simulation.
Distributions: recap¶
If something happens at rate $\lambda$ per unit time, and whether it happens in each bit of time is independent of other bits of time, then:
The number of times it happens in $t$ units of time, $N(t)$, is Poisson with mean $\lambda t$; so $\E[N(t)] = \var[N(t)] = \lambda t$ and $$ \P\{ N(t) = n \} = e^{-\lambda t} \frac{(\lambda t)^n}{n!} ,$$
The time between subsequent events, $T$, is Exponential with rate $\lambda$, so $\E[T] = \sd[T] = 1/\lambda$ and $$ \P\{ T > t \} = \exp(-\lambda t) . $$
Complications: overdispersion¶
Let's go back to the solar panels: suppose that the number of defects per panel turns out to not be Poisson, and in fact the distribution of number of defects per panel looks like this:
defects
value | expected | observed | |
---|---|---|---|
0 | 0 | 3678.794412 | 5027 |
1 | 1 | 3678.794412 | 2454 |
2 | 2 | 1839.397206 | 1308 |
3 | 3 | 613.132402 | 616 |
4 | 4 | 153.283100 | 281 |
5 | 5 | 30.656620 | 168 |
6 | 6 | 5.109437 | 71 |
7 | 7 | 0.729920 | 42 |
8 | 8 | 0.091240 | 14 |
9 | 9 | 0.010138 | 8 |
10 | 10 | 0.001014 | 4 |
Overdispersion¶
What's going on? With some more investigation, we find that some panels are more error-prone than others: a better model for the number of defects per panel is that the "quality" of a panel, $R$, is drawn from an Exponential distribution, and given this quality, the number of defects is Poisson with mean $R$: $$\begin{aligned} \text{error rate: } R &\sim \text{Exponential}(1) \\ \text{number of defects: } X &\sim \text{Poisson}(R) . \end{aligned}$$
Question: What is $\E[X]$?
Well, given $R$, the mean is, well, $R$, i.e., $\E[X|R] = R$.
So, it would make sense if $\E[X] = \E[R] = 1$.
This is true; here is the "proof" from first principles: $$\begin{aligned} \E[X] &= \sum_x x \P\{X = x\} \\ &= \sum_x x \sum_r \P\{X = x, R = r\} \\ &= \sum_x x \sum_r \P\{R = r\} \P\{X = x \;|\; R = r\} \\ &= \sum_r \P\{R = r\} \sum_x x \P\{X = x \;|\; R = r\} \\ &= \sum_r \P\{R = r\} \E[X = x \;|\; R = r] \\ &= \sum_r \P\{R = r\} r \\ &= \E[R] . \end{aligned}$$
Challenge¶
Estimate the proportion of broken panels from this model by simulation.