Poisson counting¶

Peter Ralph

https://uodsci.github.io/dsci345

In [1]:
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (15, 8)
import numpy as np
import pandas as pd
import math

rng = np.random.default_rng()

$$\renewcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\text{var}} \newcommand{\sd}{\text{sd}}$$ This is here so we can use \P and \E and \var and \sd in LaTeX below.

Motivation, and the Poisson¶

Main points: the Poisson distribution shows up when you are counting rare events. So do Exponentials, as waiting times.

Suppose that we're running a solar panel manufacturing plant. Each panel is made up of many modules, each of which may be defective. Usually only a few are: the average number of defects per panel is 1. A few defects are okay, but if a panel has more then 2 defects, the panel will not work. What proportion of panels will work?

What do we need to know? Well, let's suppose that a panel has $N$ modules, and each module is broken independently of the others. If the probability of a particular module being broken is $p$, then we must have $p = 1/N$. (Why?)

Poisson approximation¶

Let's call the number of defects $X$. So, $X$ is Binomial($N$, $\lambda/N$) with $\lambda=1$. Binomial probabilities say that $$\begin{aligned} \P\{ X = k \} &= \frac{N (N-1) \cdots (N-k+1)}{k!} \left(\frac{\lambda}{N}\right)^k \left(1-\frac{\lambda}{N}\right)^{N-k} . \end{aligned}$$ However, $N$ was arbitrary. For large $N$, $$\begin{aligned} N (N-1) \cdots (N-k+1) &\approx N^k, \\ \text{and} \qquad \left(1 - \frac{\lambda}{N}\right)^N &\approx e^{-\lambda}, \end{aligned}$$ so $$\begin{aligned} \P\{ X = k \} &\approx \frac{1}{k!} \lambda^k e^{-\lambda} , \end{aligned}$$ i.e., $X$ is approximately Poisson with mean $\lambda$.

Example: defects¶

Suppose that a few defects are okay, but if a panel has more then 2 defects, the panel will not work. What proportion of panels will work?

Solution: Let $X$ denote the number of defective modules on a randomly chosen panel. Then $X$ has a Poisson($\lambda = 1$) distribution, so $$\begin{aligned} \P\{X \le 2\} &= \P\{X = 0\} + \P\{X = 1\} + \P\{X = 2\} \\ &= e^{-\lambda} \left( 1 + \lambda + \lambda^2 / 2 \right) , \end{aligned}$$ which is here

In [2]:
lam = 1
p_good = np.exp(-lam) * (1 + lam + lam**2 / 2)
print(f"Proportion {p_good:.3f} good, and {1-p_good:.3f} bad")
Proportion 0.920 good, and 0.080 bad

Exercise: What proportion of good panels still have some defects? Answer with math and check it by simulation.

In [3]:
X = rng.poisson(lam=1, size=1000000)
1 - np.sum(X == 0) / np.sum(X <= 2)
Out[3]:
np.float64(0.5997959995911293)

We know that all good panels have at most 2 defects, and ones with still good defects have 1 or 2 defects, so we want proportion of 1 or 2 defects divided by the total proportion of good panels. So, this is $$\begin{aligned} \P(\text{some defects} | \text{good}) &= 1 - \P(\text{no defects} | \text{good}) \\ &= 1 - \frac{ \P(\text{good and no defects} ) }{ \P(\text{good}) } \\ &= 1 - \frac{ \P(X = 0) }{ \P( X <= 2 )} \\ \end{aligned}$$

Waiting times¶

Here's a different example. Cosmic rays from outside the solar system pass through us pretty often. Let's say we have a cosmic ray detector that plays a sound whenever a strong enough cosmic ray passes through it, which turns out to be about once every 1.2 minutes. How many do we expect in 6 minutes? How long do we have to wait between sounds?

How many in 6 minutes? By the same argument as above, the number of cosmic rays we get in $t$ minutes is Poisson with mean $t / 1.2$. So, we expect about $6 / 1.2 = 5$ cosmic rays in 6 minutes, with a standard deviation of $\sqrt{5} \approx 2.23$ cosmic rays. The probability we see zero cosmic rays is $e^{-5} = 0.0067$

How long between sounds? Well, the probability there's no rays in $t$ minutes is $\exp(-t/1.2)$, i.e., if $T$ is the time we have to wait for the next ray, then $$\P\{T > t\} = e^{- t/1.2}.$$

So: $T$ is Exponentially distributed with mean 1/1.2 minutes, and so $\E[T] = 1/1.2$ and $\sd[T] = 1/1.2$.

Memorylessness¶

Fact: The Exponential distribution is memoryless, i.e., if $T$ is Exponential, then $$ \P\{ T > t + s | T > t \} = \P\{ T > s \} . $$

Proof:

$$\begin{aligned} \P\{ T > t + s | T > t \} &= \frac{ \P\{ T > t + s \} }{ \P\{ T > t \} } \\ &= \frac{ e^{-\lambda(t+s)} }{ e^{-\lambda t} } \\ &= \P\{ T > s \} . \end{aligned}$$

Exercise: This implies that cosmic rays don't care how long I've already been waiting for one to arrive: the probability that $T > 2$ minutes given that $T > 1$ minute is equal to the probability that $T > 1$ minute. Demonstrate this, by simulation.

In [ ]:
 

Distributions: recap¶

If something happens at rate $\lambda$ per unit time, and whether it happens in each bit of time is independent of other bits of time, then:

The number of times it happens in $t$ units of time, $N(t)$, is Poisson with mean $\lambda t$; so $\E[N(t)] = \var[N(t)] = \lambda t$ and $$ \P\{ N(t) = n \} = e^{-\lambda t} \frac{(\lambda t)^n}{n!} ,$$

The time between subsequent events, $T$, is Exponential with rate $\lambda$, so $\E[T] = \sd[T] = 1/\lambda$ and $$ \P\{ T > t \} = \exp(-\lambda t) . $$

Complications: overdispersion¶

Let's go back to the solar panels: suppose that the number of defects per panel turns out to not be Poisson, and in fact the distribution of number of defects per panel looks like this:

In [4]:
import pandas as pd

lam = 1
npanels = 10000
nd = rng.poisson(lam * rng.exponential(size=npanels), size=npanels)
defects = pd.DataFrame({ "value" : np.arange(11, dtype='int') })
defects["expected"] = (
    npanels * np.exp(-lam) * lam ** defects['value'] 
    /
    np.array([math.factorial(k) for k in defects['value'].values])
)
defects['observed'] = [np.sum(nd == k) for k in defects['value']]
In [5]:
defects
Out[5]:
value expected observed
0 0 3678.794412 5059
1 1 3678.794412 2414
2 2 1839.397206 1236
3 3 613.132402 632
4 4 153.283100 347
5 5 30.656620 155
6 6 5.109437 81
7 7 0.729920 43
8 8 0.091240 16
9 9 0.010138 6
10 10 0.001014 7

Overdispersion¶

What's going on? With some more investigation, we find that some panels are more error-prone than others: a better model for the number of defects per panel is that the "quality" of a panel, $R$, is drawn from an Exponential distribution, and given this quality, the number of defects is Poisson with mean $R$: $$\begin{aligned} \text{error rate: } R &\sim \text{Exponential}(1) \\ \text{number of defects: } X &\sim \text{Poisson}(R) . \end{aligned}$$

Question: What is $\E[X]$?

Well, given $R$, the mean is, well, $R$, i.e., $\E[X|R] = R$.

So, it would make sense if $\E[X] = \E[R] = 1$.

This is true; here is the "proof" from first principles: $$\begin{aligned} \E[X] &= \sum_x x \P\{X = x\} \\ &= \sum_x x \sum_r \P\{X = x, R = r\} \\ &= \sum_x x \sum_r \P\{R = r\} \P\{X = x \;|\; R = r\} \\ &= \sum_r \P\{R = r\} \sum_x x \P\{X = x \;|\; R = r\} \\ &= \sum_r \P\{R = r\} \E[X \;|\; R = r] \\ &= \sum_r \P\{R = r\} r \\ &= \E[R] . \end{aligned}$$

Challenge¶

Estimate the proportion of broken panels from this model by simulation.

In [6]:
X = rng.poisson(1, size=1000000)
np.mean(X > 2)
Out[6]:
np.float64(0.080245)
In [7]:
R = rng.exponential(scale=1.0, size=1000000)
X = rng.poisson(R, size=1000000)
np.mean(X > 2)
Out[7]:
np.float64(0.125142)

About 12.5% (more than without $R$)