Poisson counting¶

Peter Ralph

https://uodsci.github.io/dsci345

In [1]:

import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (15, 8)
import numpy as np
import pandas as pd
import math

rng = np.random.default_rng()

$$\renewcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\text{var}} \newcommand{\sd}{\text{sd}}$$ This is here so we can use \P and \E and \var and \sd in LaTeX below.

Motivation, and the Poisson¶

Main points: the Poisson distribution shows up when you are counting rare events. So do Exponentials, as waiting times.

Suppose that we're running a solar panel manufacturing plant. Each panel is made up of many modules, each of which may be defective. Usually only a few are: the average number of defects per panel is 1. A few defects are okay, but if a panel has more then 2 defects, the panel will not work. What proportion of panels will work?

What do we need to know? Well, let's suppose that a panel has $N$ modules, and each module is broken independently of the others. If the probability of a particular module being broken is $p$, then we must have $p = 1/N$. (Why?)

Poisson approximation¶

Let's call the number of defects $X$. So, $X$ is Binomial($N$, $\lambda/N$) with $\lambda=1$. Binomial probabilities say that $$\begin{aligned} \P\{ X = k \} &= \frac{N (N-1) \cdots (N-k+1)}{k!} \left(\frac{\lambda}{N}\right)^k \left(1-\frac{\lambda}{N}\right)^{N-k} . \end{aligned}$$ However, $N$ was arbitrary. For large $N$, $$\begin{aligned} N (N-1) \cdots (N-k+1) &\approx N^k, \\ \text{and} \qquad \left(1 - \frac{\lambda}{N}\right)^N &\approx e^{-\lambda}, \end{aligned}$$ so $$\begin{aligned} \P\{ X = k \} &\approx \frac{1}{k!} \lambda^k e^{-\lambda} , \end{aligned}$$ i.e., $X$ is approximately Poisson with mean $\lambda$.

Example: defects¶

Suppose that a few defects are okay, but if a panel has more then 2 defects, the panel will not work. What proportion of panels will work?

Solution: Let $X$ denote the number of defective modules on a randomly chosen panel. Then $X$ has a Poisson($\lambda = 1$) distribution, so $$\begin{aligned} \P\{X \le 2\} &= \P\{X = 0\} + \P\{X = 1\} + \P\{X = 2\} \\ &= e^{-\lambda} \left( 1 + \lambda + \lambda^2 / 2 \right) , \end{aligned}$$ which is here

In [2]:

lam = 1
p_good = np.exp(-lam) * (1 + lam + lam**2 / 2)
print(f"Proportion {p_good:.3f} good, and {1-p_good:.3f} bad")

Proportion 0.920 good, and 0.080 bad

Exercise: What proportion of good panels still have some defects? Answer with math and check it by simulation.

In [3]:

x = rng.poisson(lam=1, size=1000000)
a, b, c = np.sum(x == 0), np.sum(x==1), np.sum(x==2)
(b+c)/(a+b+c)

Out[3]:

np.float64(0.5987938291359399)

$$ \P\{ X > 0 | X \le 2 \} = \frac{ \P\{ X > 0 \text{ and } X \le 2 \} }{ \P\{X \le 2\} } = \frac{ e^{-1} (1 + 1/2) }{ e^{-1} (1 + 1 + 1/2) } = \frac{3}{5} $$

Waiting times¶

Here's a different example. Cosmic rays from outside the solar system pass through us pretty often. Let's say we have a cosmic ray detector that plays a sound whenever a strong enough cosmic ray passes through it, which turns out to be about once every 1.2 minutes. How many do we expect in 6 minutes? How long do we have to wait between sounds?

How many in 6 minutes? By the same argument as above, the number of cosmic rays we get in $t$ minutes is Poisson with mean $t / 1.2$. So, we expect about $6 / 1.2 = 5$ cosmic rays in 6 minutes, with a standard deviation of $\sqrt{5} \approx 2.23$ cosmic rays. The probability we see zero cosmic rays is $e^{-5} = 0.0067$

How long between sounds? Well, the probability there's no rays in $t$ minutes is $\exp(-t/1.2)$, i.e., if $T$ is the time we have to wait for the next ray, then $$\P\{T > t\} = e^{- t/1.2}.$$

So: $T$ is Exponentially distributed with mean 1/1.2 minutes, and so $\E[T] = 1/1.2$ and $\sd[T] = 1/1.2$.

Memorylessness¶

Fact: The Exponential distribution is memoryless, i.e., if $T$ is Exponential, then $$ \P\{ T > t + s | T > t \} = \P\{ T > s \} . $$

Proof:

$$\begin{aligned} \P\{ T > t + s | T > t \} &= \frac{ \P\{ T > t + s \} }{ \P\{ T > t \} } \\ &= \frac{ e^{-\lambda(t+s)} }{ e^{-\lambda t} } \\ &= \P\{ T > s \} . \end{aligned}$$

Exercise: This implies that cosmic rays don't care how long I've already been waiting for one to arrive: the probability that $T > 2$ minutes given that $T > 1$ minute is equal to the probability that $T > 1$ minute. Demonstrate this, by simulation.

In [4]:

T = rng.exponential(scale=1/1.2, size=1000000)
num_gt_two = np.sum(T > 2)
num_gt_one = np.sum(T > 1)
num_gt_one, num_gt_two
print(f"prob > 1: {num_gt_one/len(T)}; prob > 2 given > 1: {num_gt_two/num_gt_one}")

prob > 1: 0.300328; prob > 2 given > 1: 0.300990916597853

Distributions: recap¶

If something happens at rate $\lambda$ per unit time, and whether it happens in each bit of time is independent of other bits of time, then:

The number of times it happens in $t$ units of time, $N(t)$, is Poisson with mean $\lambda t$; so $\E[N(t)] = \var[N(t)] = \lambda t$ and $$ \P\{ N(t) = n \} = e^{-\lambda t} \frac{(\lambda t)^n}{n!} ,$$

The time between subsequent events, $T$, is Exponential with rate $\lambda$, so $\E[T] = \sd[T] = 1/\lambda$ and $$ \P\{ T > t \} = \exp(-\lambda t) . $$

Complications: overdispersion¶

Let's go back to the solar panels: suppose that the number of defects per panel turns out to not be Poisson, and in fact the distribution of number of defects per panel looks like this:

In [5]:

import pandas as pd

npanels = 10000
nd = rng.poisson(lam * rng.exponential(size=npanels), size=npanels)
defects = pd.DataFrame({ "value" : np.arange(11, dtype='int') })
defects["expected"] = (
    npanels * np.exp(-lam) * lam ** defects['value'] 
    /
    np.array([math.factorial(k) for k in defects['value'].values])
)
defects['observed'] = [np.sum(nd == k) for k in defects['value']]

In [6]:

defects

Out[6]:

	value	expected	observed
0	0	3678.794412	4939
1	1	3678.794412	2542
2	2	1839.397206	1266
3	3	613.132402	621
4	4	153.283100	319
5	5	30.656620	161
6	6	5.109437	76
7	7	0.729920	38
8	8	0.091240	22
9	9	0.010138	10
10	10	0.001014	4

Overdispersion¶

What's going on? With some more investigation, we find that some panels are more error-prone than others: a better model for the number of defects per panel is that the "quality" of a panel, $R$, is drawn from an Exponential distribution, and given this quality, the number of defects is Poisson with mean $R$: $$\begin{aligned} \text{error rate: } R &\sim \text{Exponential}(1) \\ \text{number of defects: } X &\sim \text{Poisson}(R) . \end{aligned}$$

Question: What is $\E[X]$?

Well, given $R$, the mean is, well, $R$, i.e., $\E[X|R] = R$.

So, it would make sense if $\E[X] = \E[R] = 1$.

This is true; here is the "proof" from first principles: $$\begin{aligned} \E[X] &= \sum_x x \P\{X = x\} \\ &= \sum_x x \sum_r \P\{X = x, R = r\} \\ &= \sum_x x \sum_r \P\{R = r\} \P\{X = x \;|\; R = r\} \\ &= \sum_r \P\{R = r\} \sum_x x \P\{X = x \;|\; R = r\} \\ &= \sum_r \P\{R = r\} \E[X \;|\; R = r] \\ &= \sum_r \P\{R = r\} r \\ &= \E[R] . \end{aligned}$$

Challenge¶

Estimate the proportion of broken panels from this model by simulation.

In [7]:

R = rng.exponential(scale=1.0, size=1000000)
X = rng.poisson(R, size=1000000)
np.mean(X > 2)

Out[7]:

np.float64(0.124823)

About 12.5% (more than without $R$)