Guess the probability¶
i.e., Maximum Likelihood with the Binomial
We've got a weird-looking coin, and we'd like to figure out the probability that it comes up Heads when it's flipped. How do we do this?
(More generally: from independent draws of a 0/1-valued random variable $Y$ with $\P\{Y = 1\} = p$, infer $p$.)
Let's say out of 20 flips it came up Heads 15 times. What's your best guess at $p$?
Well, since the number of heads, $X$, is $\text{Binomial}(20, p)$, $$ \P\{ X = 15 \} = \binom{20}{15} p^{15} (1-p)^5. $$ Let's call this $L(p)$.
Maximizing $L(p)$ is the same as maximizing $\log L(p)$, and $$ \frac{d}{dp} \log L(p) = \frac{15}{p} - \frac{5}{1-p} , $$ which is equal to 0 at $p = 3/4$. (also check: $d^2/dp^2 \log L(p) \le 0$)
Okay, so our guess is $p = 0.75$ "by maximum likelihood". How sure are we?
import math
import numpy as np
import matplotlib.pyplot as plt

def L(p):
    # likelihood of 15 heads in 20 flips, as a function of p
    C = math.factorial(20) / (math.factorial(15) * math.factorial(5))
    return C * (p ** 15) * ((1-p) ** 5)

pvals = np.linspace(0, 1, 51)
plt.plot(pvals, L(pvals))
plt.xlabel("probability of heads (p)")
plt.ylabel("likelihood")
plt.axvline(0.75, c='red');
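As a quick numerical check (a sketch using scipy's general-purpose optimizer, not part of the derivation above), we can minimize the negative log-likelihood and confirm that the maximizer agrees with $p = 3/4$:

from scipy.optimize import minimize_scalar

# maximize log L(p) = 15 log(p) + 5 log(1 - p) (dropping the constant binomial
# coefficient, which doesn't change the maximizer) by minimizing its negative
res = minimize_scalar(lambda p: -(15 * np.log(p) + 5 * np.log(1 - p)),
                      bounds=(0.001, 0.999), method="bounded")
print(res.x)  # should be very close to 0.75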
Guess the mean¶
i.e., Maximum Likelihood with the Gaussian
I weighed 5 apples off my tree yesterday; their weights were 112g, 145g, 131g, 98g, and 104g. I've got about 200 apples on the tree; what's our estimate of their total weight?
x = [112, 145, 131, 98, 104]
np.mean(x)
118.0
Well, the average weight of those was 118g, so... probably around 200 * 118g = 23600g = 23.6kg?
More formally: let's assume the distribution of the weights (across all the apples) is Normal. If the mean is $\mu$ grams and standard deviation $\sigma$ grams, then the chance* of getting an apple that weighs $x$ grams is $$ L(x; \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right) . $$ (okay, this is actually a probability density, but let's go with it)
So, the chance of getting apples weighing $x_1, \ldots, x_n$ is $$ \prod_{i=1}^n L(x_i; \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}^n} \exp\left(- \frac{\sum_{i=1}^n (x_i - \mu)^2}{2 \sigma^2} \right) . $$
So, the log-likelihood is $$ \log L(x; \mu, \sigma) = - \frac{1}{2 \sigma^2}\sum_{i=1}^n (x_i - \mu)^2 - n \log(2 \pi \sigma^2)/2 , $$ and $$\begin{aligned} \frac{d}{d\mu} \log L(x; \mu, \sigma) &= \frac{1}{2\sigma^2}\sum_{i=1}^n 2 (x_i - \mu) \\ &= \frac{n}{\sigma^2}\left( \left(\frac{1}{n} \sum_{i=1}^n x_i\right) - \mu \right) \end{aligned}$$ Notice that $\bar x = \sum_{i=1}^n x_i / n$ is the sample mean.
So, $L(x;\mu,\sigma)$ is maximized if $\mu = \bar x$. (But first: check this is a maximum, not a minimum.) Remarkably, this doesn't depend on $\sigma$!
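(That check is quick: $$ \frac{d^2}{d\mu^2} \log L(x; \mu, \sigma) = -\frac{n}{\sigma^2} < 0 , $$ so the critical point at $\mu = \bar x$ really is a maximum.)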
We've learned that:
Given independent samples from a Normal($\mu$, $\sigma$) distribution, the maximum-likelihood estimate for $\mu$ is the sample mean.
Maximizing the likelihood under a Normal model is the same as minimizing the sum of squared errors (see the quick check after this list).
The maximum likelihood estimate for the mean apple weight on my tree is 118g.
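Here's the quick check promised above, as a sketch: minimize the sum of squared errors numerically and compare the minimizer to the sample mean.

from scipy.optimize import minimize_scalar

x = np.array([112, 145, 131, 98, 104])

# the value of mu that minimizes the sum of squared errors should be the sample mean
res = minimize_scalar(lambda mu: np.sum((x - mu) ** 2))
print(res.x, np.mean(x))  # both should be (about) 118.0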
Again: how certain are we? Now, the answer depends on $\sigma$.
muvals = np.linspace(100, 140, 51)
x = np.array([112, 145, 131, 98, 104])

def normal_L(mu, sigma):
    C = np.sqrt(2 * math.pi * sigma**2)
    return np.prod(np.exp(- (x - mu)**2 / (2 * sigma**2)) / C)

sigvals = [5, 25]
fig, axes = plt.subplots(1, len(sigvals))
for sig, ax in zip(sigvals, axes):
    ax.plot(muvals, [normal_L(m, sig) for m in muvals])
    ax.axvline(np.mean(x), c='r')
    ax.set_title(f"sigma={sig}")
Exercise: Make the plots using:
from scipy.stats import norm
mu, sigma = 150, 50
print(normal_L(mu, sigma), np.prod(norm.pdf(x, loc=mu, scale=sigma)))
8.552412635759266e-12 8.552412635759269e-12
fig, (ax1, ax2) = plt.subplots(ncols=2)
xx = np.linspace(0, 300, 601)
pdf_x = norm.pdf(xx, loc=mu, scale=sigma)
cdf_x = norm.cdf(xx, loc=mu, scale=sigma)
ax1.plot(xx, pdf_x)
ax2.plot(xx, cdf_x);
Likelihood surfaces¶
So, what did we just do? We
- formulated a generative model that seems likely to fit the data, with some free parameters describing the things we want to know about,
- wrote down the likelihood (i.e., probability) of generating our actual data as a function of the parameter(s), and
- said "gee, it seems like a good guess for the parameter values is whatever makes our data look most probable".
Formally: given data $D$ and a model $M$ with parameters $\theta \in A$ and likelihood function $$ L_M(D|\theta) = \P_M\{D|\theta\} , $$ a maximum likelihood estimate of $\theta$ is $$ \theta^* = \text{argmax}_{\theta \in A}\{ L_M(D|\theta) \}, $$ i.e., $\theta^*$ is the value of the parameters that maximizes the likelihood.
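For a concrete, brute-force illustration of that argmax, here's the coin-flip example from above with the likelihood evaluated on a grid of candidate values of $p$ (just a sketch; in practice we'd use calculus or a numerical optimizer, as below):

from scipy.stats import binom

# likelihood of the data (15 heads out of 20 flips) at each candidate value of p
pgrid = np.linspace(0.01, 0.99, 99)
Lvals = binom.pmf(15, 20, pgrid)

# the MLE is the grid value that maximizes the likelihood
print(pgrid[np.argmax(Lvals)])  # should be 0.75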
Example: the gamma distribution¶
Let's say we have hourly rainfall measurements throughout a winter storm. We'd like to (a) estimate the average rainfall for the storm (across a wider region, taking our location as representative), and (b) fit a distribution to describe the hourly variation.
Let $X_i$ be the number of millimeters of rain that fell in the $i^\text{th}$ hour. Since $X_i \ge 0$, let's take $$ X_i \sim \text{Gamma}(\text{scale}=\theta, \text{shape}=k) . $$ Our goal will be to find a $\theta$ and $k$ that describe the data well. And, here's our data:
rain
array([ 17., 99., 79., 24., 32., 45., 93., 83., 113., 32., 3., 114., 71., 98., 56., 31., 95., 31., 76., 83., 70., 107., 23., 40., 168., 43., 43., 28., 20., 49., 26., 32., 10., 58., 52., 35., 81., 124., 28., 46., 25., 120., 24., 17., 57., 96., 59., 100., 33., 74., 27., 33., 33., 35., 76., 104., 31., 24., 33., 24., 73., 34., 50., 53., 50., 33., 106., 125., 44., 94., 35., 54.])
Maximum likelihood¶
We could look up the likelihood function for the Gamma distribution and try to solve for the maximum analytically. But we're only optimizing over two parameters, so it's easy to do by computer. (First, a quick reminder of how the * and ** operators unpack a tuple or dict into a function's arguments; we'll use this below to pass the optimizer's parameter vector to our log-likelihood function.)
def g(a, b):
    return a + b

ab = (2, 3)
abd = {'a': 2, 'b': 3}
g(*ab), g(**abd)
(5, 5)
from scipy.stats import gamma
def logL(theta, k):
    lpdfs = gamma.logpdf(rain, a=k, scale=theta)  # "a" is the shape parameter here
    return np.sum(lpdfs)
from scipy.optimize import minimize
max_L = minimize(lambda x: -logL(*x), x0=(24, 1.8))
mle_theta, mle_k = max_L['x']
max_L
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 348.18845385033353
        x: [ 2.118e+01  2.711e+00]
      nit: 10
      jac: [ 0.000e+00  0.000e+00]
 hess_inv: [[ 1.240e+01 -1.389e+00]
            [-1.389e+00  1.877e-01]]
     nfev: 42
     njev: 14
thetavals = dsci.pretty([10, 40], 20)
kvals = np.round(dsci.pretty([0.5, 3], 10), 2)
Lmap = np.array(
    [[logL(theta, k) for theta in thetavals] for k in kvals]
)
fig, ax = plt.subplots()
im = ax.contourf(thetavals, kvals, Lmap)
ax.scatter(mle_theta, mle_k, marker="*", s=500)
cbar = ax.figure.colorbar(im, ax=ax)
ax.set_xlabel("scale (theta)"); ax.set_ylabel("shape (k)")
ax.set_title(f"maximum at theta = {mle_theta:.2f}, k = {mle_k:.2f}");
Comparison: since a Gamma distribution has mean $k\theta$ and variance $k\theta^2$, matching these to the sample mean and variance gives the method-of-moments estimates $\theta = \text{var}/\text{mean}$ and $k = \text{mean}/\theta$. For this data:
obs_mean = np.mean(rain)
obs_sd = np.std(rain)
theta = obs_sd**2 / obs_mean
k = obs_mean / theta
print(f"MOM estimates: theta = {theta:.2f}, k = {k:.2f}.")
print(f"MLE estimates: theta = {max_L['x'][0]:.2f}, k = {max_L['x'][1]:.2f}.")
MOM estimates: theta = 20.25, k = 2.84.
MLE estimates: theta = 21.18, k = 2.71.
What about our question? Well,
print(f"The observed mean rainfall/hour is {obs_mean:.3f} mm and the MLE-predicted mean is {(mle_theta * mle_k):.3f} mm.")
The observed mean rainfall/hour is 57.417 mm and the MLE-predicted mean is 57.417 mm.
(The exact agreement is no coincidence: setting the $\theta$-derivative of the Gamma log-likelihood to zero forces $k \theta = \bar x$, so the MLE of the mean always equals the sample mean.)
Goodness-of-fit, and uncertainty¶
In summary, we've got (hopefully) good estimates of the parameters in a model that we think ought to fit the rainfall data.
Let's continue, asking
- Is it a good model for the data? and
- how certain are we about the parameter estimates?
To do this, we'll simulate data under the model, with the fitted parameters, and ask:
- Does the simulated data look like the real data?
- If we re-do inference, how far off are we from the true values?
rain
array([ 17., 99., 79., 24., 32., 45., 93., 83., 113., 32., 3., 114., 71., 98., 56., 31., 95., 31., 76., 83., 70., 107., 23., 40., 168., 43., 43., 28., 20., 49., 26., 32., 10., 58., 52., 35., 81., 124., 28., 46., 25., 120., 24., 17., 57., 96., 59., 100., 33., 74., 27., 33., 33., 35., 76., 104., 31., 24., 33., 24., 73., 34., 50., 53., 50., 33., 106., 125., 44., 94., 35., 54.])
def sim_rain():
    # simulate a dataset of the same size as the real one, under the fitted Gamma model
    # (assumes a numpy Generator rng, e.g. from np.random.default_rng(), was set up earlier)
    sim_x = rng.gamma(shape=mle_k, scale=mle_theta, size=len(rain))
    return sim_x
sim_rain()
array([ 32.15878681, 90.29979452, 120.64970166, 45.43758037, 34.19000996, 67.57743256, 39.5384326 , 64.00954305, 52.14834585, 43.9065727 , 46.80709265, 81.47900541, 68.74555455, 56.99253199, 30.63696993, 38.91539422, 95.92222192, 114.83519765, 152.17226352, 52.46765936, 29.60144015, 28.49692812, 40.3675987 , 117.91079593, 76.95123397, 26.15522173, 55.42350525, 22.43064679, 13.29220479, 28.27998446, 83.12239949, 130.14478902, 66.12441113, 97.01538698, 66.6420072 , 76.79481803, 54.89254181, 63.11907514, 10.50391286, 36.82967131, 120.10053616, 34.35975191, 21.06767437, 92.82459644, 2.84232213, 28.45267454, 57.14225809, 18.44652359, 6.60766187, 99.52041706, 94.72170703, 55.32461894, 53.48483275, 84.96605504, 79.65123689, 34.23844821, 11.63599354, 16.04474254, 36.60528762, 50.94043331, 71.47113594, 22.60486188, 36.01506343, 62.10031397, 52.62537103, 64.87274384, 9.66274055, 36.76248675, 80.39729592, 12.15988846, 31.37825464, 61.25565204])
num_reps = 4
fig, axes = plt.subplots(num_reps + 1, 1, figsize=(8, 15), sharex=True, sharey=True)
bins = dsci.pretty((0, 250), 30)
axes[0].hist(rain, bins=bins)
axes[0].set_title("real data")
for k in range(1, num_reps + 1):
    sim_x = sim_rain()
    axes[k].hist(sim_x, bins=bins)
    axes[k].set_title("simulated data")
axes[-1].set_xlabel("mm of rain/hr");
num_reps = 4
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True, sharey=True)
for k in range(num_reps):
    ax = axes.flatten()[k]
    sim_x = sim_rain()
    ax.scatter(np.sort(rain), np.sort(sim_x))
    ax.set_xlabel("real data")
    ax.set_ylabel("simulated data")
    ax.set_aspect(1)
def do_inference(rain):
    def logL(theta, k):
        lpdfs = gamma.logpdf(rain, a=k, scale=theta)  # "a" is the shape parameter here
        return np.sum(lpdfs)
    max_L = minimize(lambda x: -logL(*x), x0=(24, 1.8))
    return max_L['x']
{'estimated': do_inference(sim_rain()), 'true value used in simulation': (mle_theta, mle_k)}
{'estimated': array([21.70029262, 2.41889354]), 'true value used in simulation': (21.1799967312572, 2.71089115943792)}
Now we'll simulate a lot of datasets, all using the same choice of $\theta$ and $k$, and for each dataset use our MLE code above to re-estimate $\theta$ and $k$. Then, by looking at how widely the estimated values are spread around the true values, we can see how far off our estimates are likely to be.
num_reps = 200
estim_array = np.zeros((num_reps, 2))
for k in range(num_reps):
    estim_array[k, :] = do_inference(sim_rain())
fig, (ax0, ax1) = plt.subplots(1, 2)
ax0.hist(estim_array[:,0], bins=20)
ax0.axvline(mle_theta, label="true value", c='red')
ax0.set_xlabel("estimated values of theta")
ax0.legend()
ax1.hist(estim_array[:,1], bins=20)
ax1.axvline(mle_k, label="true value", c='red')
ax1.set_xlabel("estimated values of k")
ax1.legend();
In summary, our estimated value for theta (the scale parameter) is about 21, but with a wide range of uncertainty - it could reasonably be anything between about 12.5 and 30. For the shape parameter (k), the maximum likelihood estimate is about 2.7, with a reasonable range between about 2.0 and 4.5.
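One rough way to put numbers on those ranges (a sketch; the exact endpoints will vary from run to run) is to take the middle 95% of the simulated estimates:

# middle 95% of the estimates across the simulated datasets
theta_lo, theta_hi = np.quantile(estim_array[:, 0], [0.025, 0.975])
k_lo, k_hi = np.quantile(estim_array[:, 1], [0.025, 0.975])
print(f"theta: {theta_lo:.1f} to {theta_hi:.1f}; k: {k_lo:.2f} to {k_hi:.2f}")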
(What exactly we're doing here - getting a confidence interval - is something we'll talk about later.)