lam = 1
npanels = 1000
nd = rng.poisson(lam * rng.exponential(size=npanels), size=npanels)
defects = pd.DataFrame({ "value" : np.arange(11, dtype='int') })
defects['observed'] = [np.sum(nd == k) for k in defects['value']]
In our Poisson example, we had data like this for the distribution of numbers of defects per solar panel. Our goal was to get a good model for this number. How did we do that?
defects
value | observed | |
---|---|---|
0 | 0 | 479 |
1 | 1 | 276 |
2 | 2 | 123 |
3 | 3 | 63 |
4 | 4 | 34 |
5 | 5 | 10 |
6 | 6 | 5 |
7 | 7 | 6 |
8 | 8 | 1 |
9 | 9 | 1 |
10 | 10 | 1 |
Well, here's what we did:
The mean number of defects per panel is 1, and they're "counts", so... maybe Poisson? With mean 1?
In other words, we used the Method of Moments:
Our next step was to say
Gee, it's more spread out ("overdispersed") than that... maybe an Exponential mixture of Poissons? (i.e., Poisson but where the mean is random, from the Exponential distribution)
Let's have a look at applying the method of moments here, too.
obs_mean = np.sum(defects['value'] * defects['observed']) / np.sum(defects['observed'])
obs_sd = np.sqrt(
np.sum((defects['value'] ** 2) * defects['observed']) / np.sum(defects['observed'])
- obs_mean ** 2
)
x = rng.poisson(obs_mean, size=np.sum(defects['observed']))
x_counts = [np.sum(x==u) for u in defects['value']]
plt.plot(defects['value'], defects['observed'], label=f'observed, mean={obs_mean}, sd={obs_sd:.2f}')
plt.xlabel("count"); plt.ylabel("frequency")
plt.plot(defects['value'], x_counts, label=f'poisson, mean={obs_mean}, sd={np.sqrt(obs_mean):.2f}')
plt.legend();
We'd like to fit two moments (mean and SD), so let's use a model with two parameters: $$\begin{aligned} \text{error rate: } R &\sim \text{Gamma}(\text{scale}=\theta, \text{shape}=k) \\ \text{number of defects: } X &\sim \text{Poisson}(R) . \end{aligned}$$ We'd like to see if we can choose $\theta$ and $k$ so the mean and SD match the observed values.
Law of Total Variance: $$ \var[X] = \E[\var[X|R]] + \var[\E[X|R]] $$
Here we have $$\begin{aligned} \var[X|R] &= R \\ \E[X|R] &= R \\ \E[R] &= \theta k \\ \var[R] &= \theta^2 k \end{aligned}$$ and so $$ \var[X] = \theta k (1 + \theta ) , $$ i.e., the variance is inflated by $(1 + \theta)$.
Okay, so we want to find $\theta$ and $k$ so that $$\begin{aligned} \E[X] = \E[R] = \theta k = \text{observed mean} \\ \sd[X] = \sqrt{\theta k (1 + \theta)} = \text{observed SD} . \end{aligned}$$
theta = obs_sd**2 / obs_mean - 1
k = obs_mean / theta
print(f"Estimates: theta = {theta:.2f}, k = {k:.2f}.")
Estimates: theta = 0.86, k = 1.16.
R = rng.gamma(shape=k, scale=theta, size=np.sum(defects['observed']))
X = rng.poisson(R, size=np.sum(defects['observed']))
X_counts = [np.sum(X==u) for u in defects['value']]
plt.plot(defects['value'], defects['observed'], label=f'observed, mean={obs_mean:.2f}, sd={obs_sd:.2f}')
plt.xlabel("count"); plt.ylabel("frequency")
plt.plot(defects['value'], X_counts, label=f'poisson, mean={np.mean(X):.2f}, sd={np.std(X):.2f}')
plt.legend();