Example: Maximum Likelihood with the Binomial
We've got a weird-looking coin, and we'd like to figure out the probability that it comes up Heads when it's flipped. How do we do this?
(More generally: from independent draws of a 0/1-valued random variable $Y$ with $\P\{Y = 1\} = p$, infer $p$.)
Let's say out of 20 flips it came up Heads 15 times. What's your best guess at $p$?
Well, since the number of heads, $X$, is $\text{Binomial}(20, p)$, $$
\P\{ X = 15 \}
=
\binom{20}{15} p^{15} (1-p)^5.
$$ Let's call this $L(p)$.
Maximizing $L(p)$ is the same as maximizing $\log L(p)$, and $$ \frac{d}{dp} \log L(p) = \frac{15}{p} - \frac{5}{1-p} , $$ which is equal to 0 at $p = 15/20 = 3/4$.
Okay, so our guess is $p = 0.75$ "by maximum likelihood". How sure are we?
import math
import numpy as np
import matplotlib.pyplot as plt

def L(p):
    # binomial likelihood of seeing 15 heads out of 20 flips
    C = math.factorial(20) / (math.factorial(15) * math.factorial(5))
    return C * p**15 * (1 - p)**5

pvals = np.linspace(0, 1, 51)
plt.plot(pvals, L(pvals))
plt.xlabel("probability of heads (p)")
plt.ylabel("likelihood")
plt.axvline(0.75, c='red');
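As a quick numerical sanity check on the calculus, we can also let scipy find the maximum for us. A minimal sketch, reusing the L defined above:

from scipy.optimize import minimize_scalar

# maximize L(p) by minimizing -L(p) over (0, 1);
# this should land very close to the analytical answer, p = 0.75
res = minimize_scalar(lambda p: -L(p), bounds=(0.001, 0.999), method='bounded')
print(res.x)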
Example: Maximum Likelihood with the Gaussian
I weighed 5 apples off my tree yesterday; their weights were 112g, 145g, 131g, 98g, and 104g. I've got about 200 apples on the tree; what's our estimate of their total weight?
x = [112, 145, 131, 98, 104]
np.mean(x)
118.0
Well, the average weight of those was 118g, so... probably around 200 * 118g = 23600g = 23.6kg?
More formally: let's assume the distribution of the weights (across all the apples) is Normal. If the mean is $\mu$ grams and the standard deviation is $\sigma$ grams, then the chance of getting an apple that weighs $x$ grams is $$ L(x; \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right) . $$ (Okay, this is actually a probability density, but let's go with it.)
So, the chance of getting apples weighing $x_1, \ldots, x_n$ is
$$ \prod_{i=1}^n L(x_i; \mu, \sigma) = \frac{1}{(2 \pi \sigma^2)^{n/2}} \exp\left(- \frac{\sum_{i=1}^n (x_i - \mu)^2}{2 \sigma^2} \right) . $$ So, the log-likelihood is $$ \log L(x; \mu, \sigma) = - \frac{1}{2 \sigma^2}\sum_{i=1}^n (x_i - \mu)^2 - \frac{n}{2} \log(2 \pi \sigma^2) , $$ and $$\begin{aligned} \frac{d}{d\mu} \log L(x; \mu, \sigma) &= \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) \\ &= \frac{n}{\sigma^2} \left( \bar x - \mu \right) , \end{aligned}$$ which equals 0 at $\mu = \bar x = \frac{1}{n} \sum_{i=1}^n x_i$.
So, $L(x;\mu,\sigma)$ is maximized at $\mu = \bar x$. (But first: check for yourself that this critical point is a maximum, not a minimum.) Remarkably, this doesn't depend on $\sigma$! (We'll verify that numerically below.)
We've learned that:

- Given independent samples from a Normal($\mu$, $\sigma$) distribution, the maximum-likelihood estimate of $\mu$ is the sample mean.
- Maximizing the likelihood under a Normal model is the same as minimizing the sum of squared errors.
- The maximum-likelihood estimate of the mean apple weight on my tree is 118g.
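And, as promised, a quick numerical check that the maximizing $\mu$ doesn't depend on $\sigma$: a sketch using scipy.optimize.minimize_scalar and the log-likelihood formula above (the bounds and the values of sigma are arbitrary choices for illustration).

from scipy.optimize import minimize_scalar

x = np.array([112, 145, 131, 98, 104])

def neg_logL_mu(mu, sigma):
    # negative Gaussian log-likelihood as a function of mu, with sigma held fixed
    return np.sum((x - mu)**2) / (2 * sigma**2) + len(x) * np.log(2 * np.pi * sigma**2) / 2

for sigma in [5, 25, 100]:
    res = minimize_scalar(neg_logL_mu, bounds=(50, 200), method='bounded', args=(sigma,))
    print(f"sigma = {sigma}: best mu = {res.x:.2f}")
    # whatever sigma we fix, this should report the sample mean, 118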
Again: how certain are we? Now, the answer depends on $\sigma$.
muvals = np.linspace(100, 140, 51)
x = np.array([112, 145, 131, 98, 104])

def normal_L(mu, sigma):
    # product of Normal(mu, sigma) densities over the observed weights
    C = np.sqrt(2 * np.pi * sigma**2)
    return np.prod(np.exp(-(x - mu)**2 / (2 * sigma**2)) / C)

sigvals = [5, 25]
fig, axes = plt.subplots(1, len(sigvals))
for sig, ax in zip(sigvals, axes):
    ax.plot(muvals, [normal_L(m, sig) for m in muvals])
    ax.axvline(np.mean(x), c='r')
    ax.set_title(f"sigma={sig}")
Exercise: make the same plots using scipy.stats.norm. First, a check that norm.pdf agrees with our normal_L:
from scipy.stats import norm
mu, sigma = 150, 50
print(normal_L(mu, sigma), np.prod(norm.pdf(x, loc=mu, scale=sigma)))
8.552412635759266e-12 8.552412635759269e-12
muvals = np.linspace(100, 140, 51)
x = np.array([112, 145, 131, 98, 104])
sigvals = [5, 25]
fig, axes = plt.subplots(1, len(sigvals))
for sig, ax in zip(sigvals, axes):
    ax.plot(muvals, [np.prod(norm.pdf(x, loc=m, scale=sig)) for m in muvals])
    ax.axvline(np.mean(x), c='r')
    ax.set_title(f"sigma={sig}")
So, what did we just do? We wrote down the probability of our data as a function of the unknown parameters (the likelihood), and then found the parameter values that made that probability as large as possible.
Formally: given data $D$ and a model $M$ with parameters $\theta \in A$ and likelihood function $$ L_M(D|\theta) = \P_M\{D|\theta\} , $$ a maximum likelihood estimate of $\theta$ is $$ \theta^* = \text{argmax}_{\theta \in A}\{ L_M(D|\theta) \}, $$ i.e., $\theta^*$ is the parameter value that maximizes the likelihood.
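In code, this recipe is always the same, whatever the model: write a function computing the negative log-likelihood and hand it to an optimizer. A minimal generic sketch (the helper name max_likelihood is just for illustration):

from scipy.optimize import minimize

def max_likelihood(neg_logL, theta0):
    # generic recipe: a maximum-likelihood estimate is a minimizer
    # of the negative log-likelihood, found starting from guess theta0
    res = minimize(neg_logL, x0=theta0)
    return res.x

# e.g., the rainfall example below amounts to:
#   max_likelihood(lambda params: -logL(*params), (24, 1.8))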
Let's say we have hourly rainfall measurements throughout a winter storm. We'd like to (a) estimate the average rainfall for the storm (across a wider region, taking our location as representative), and (b) fit a distribution to describe the hourly variation.
Let $X_i$ be the number of millimeters of rain that fell in the $i^\text{th}$ hour. Since $X_i \ge 0$, let's take $$ X_i \sim \text{Gamma}(\text{scale}=\theta, \text{shape}=k) . $$ Our goal will be to find a $\theta$ and $k$ that describe the data well; since a Gamma distribution with shape $k$ and scale $\theta$ has mean $k\theta$, this will also give us an estimate of the average rainfall. And, here's our data:
rain
array([ 17., 99., 79., 24., 32., 45., 93., 83., 113., 32., 3., 114., 71., 98., 56., 31., 95., 31., 76., 83., 70., 107., 23., 40., 168., 43., 43., 28., 20., 49., 26., 32., 10., 58., 52., 35., 81., 124., 28., 46., 25., 120., 24., 17., 57., 96., 59., 100., 33., 74., 27., 33., 33., 35., 76., 104., 31., 24., 33., 24., 73., 34., 50., 53., 50., 33., 106., 125., 44., 94., 35., 54.])
We could look up the likelihood function for the Gamma distribution and try to solve for its maximum analytically. But we're only optimizing over two parameters, so it's easy to do numerically:
from scipy.stats import gamma

def logL(theta, k):
    lpdfs = gamma.logpdf(rain, a=k, scale=theta)  # "a" is the shape parameter here
    return np.sum(lpdfs)
from scipy.optimize import minimize

# maximize logL by minimizing its negative, starting from a rough initial guess
max_L = minimize(lambda params: -logL(*params), x0=(24, 1.8))
mle_theta, mle_k = max_L['x']
max_L
      fun: 348.1884538503342
 hess_inv: array([[14.27925121, -1.48033603],
       [-1.48033603,  0.18579412]])
      jac: array([0.00000000e+00, 7.62939453e-06])
  message: 'Optimization terminated successfully.'
     nfev: 33
      nit: 9
     njev: 11
   status: 0
  success: True
        x: array([21.17999438,  2.71089168])
thetavals = np.round(np.linspace(10, 40, 20), 1)
kvals = np.round(np.linspace(0.5, 3, 10), 2)
Lmap = np.array(
    [[logL(theta, k) for theta in thetavals] for k in kvals]
)
fig, ax = plt.subplots()
im = ax.imshow(Lmap)
cbar = ax.figure.colorbar(im, ax=ax)
ax.set_xlabel("scale (theta)"); ax.set_ylabel("shape (k)")
ax.set_xticks(np.arange(len(thetavals)), labels=thetavals)
ax.set_yticks(np.arange(len(kvals)), labels=kvals)
ax.set_title(f"maximum at theta = {mle_theta:.2f}, k = {mle_k:.2f}");
Comparison: the method-of-moments estimator instead matches the Gamma's theoretical mean ($k\theta$) and variance ($k\theta^2$) to the observed mean and variance. For this data:
obs_mean = np.mean(rain)
obs_sd = np.std(rain)
theta = obs_sd**2 / obs_mean   # solve mean = k * theta, variance = k * theta^2
k = obs_mean / theta
print(f"MOM estimates: theta = {theta:.2f}, k = {k:.2f}.")
print(f"MLE estimates: theta = {max_L['x'][0]:.2f}, k = {max_L['x'][1]:.2f}.")
MOM estimates: theta = 20.25, k = 2.84.
MLE estimates: theta = 21.18, k = 2.71.
What about our question? Well,
print(f"The observed mean rainfall/hour is {obs_mean:.3f} mm and the MLE-predicted mean is {(mle_theta * mle_k):.3f} mm.")
The observed mean rainfall/hour is 57.417 mm and the MLE-predicted mean is 57.417 mm.
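That the observed and MLE-predicted means agree exactly is not a coincidence: setting the $\theta$-derivative of the Gamma log-likelihood to zero forces the fitted mean, $k\theta$, to equal the sample mean: $$ \frac{\partial}{\partial \theta} \log L = \frac{\sum_{i=1}^n x_i}{\theta^2} - \frac{nk}{\theta} = 0 \qquad \Longrightarrow \qquad k\theta = \frac{1}{n} \sum_{i=1}^n x_i = \bar x . $$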
This gives a good answer. But again: how sure are we? Next, one way of approaching this question: $p$-values.