We've seen a few flavors of "linear model". First, the 'standard': $$\begin{aligned} Y &= Xb + \epsilon \\ \epsilon &\sim \text{Normal}(\text{mean}=0, \text{sd}=\sigma) , \end{aligned}$$ which is equivalent to $$\begin{aligned} Y &\sim \text{Normal}(\text{mean}= Xb, \text{sd}=\sigma) . \end{aligned}$$
Robust: $$\begin{aligned} Y &\sim \text{Cauchy}(\text{location}= Xb, \text{scale}=\sigma) . \end{aligned}$$
And, in homework, Poisson: $$\begin{aligned} Y &\sim \text{Poisson}(\text{mean}=\exp(Xb)) . \end{aligned}$$
What are these "models" for?
- understanding how signal and noise combine in your data, and how to extract the information you want
- simulating data, to test methods and build intuition
- providing a framework to describe the data (to fit the model, e.g., by maximum likelihood)
- describing how good predictions based on a fitted model are likely to be
You could, for instance, decide your data come from a Poisson model and yet still fit it by least squares (i.e., maximum likelihood with a Normal model). But: the "correct" model (i.e., one that better describes the noise distribution) should provide more accurate results.
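Here's a quick sketch illustrating that last claim; everything in it (the coefficients, sample sizes, and choice of optimizer) is invented for illustration. We repeatedly simulate from a Poisson model whose mean is linear in $x$, and compare the spread of slope estimates from least squares to those from Poisson maximum likelihood. The maximum-likelihood estimates should be somewhat less variable, since the Poisson model correctly treats observations with larger means as noisier.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng()  # random generator, used throughout

def pois_negloglik(b, x, y):
    # negative Poisson log-likelihood (up to a constant) for mean = b0 + b1 * x
    mean = b[0] + b[1] * x
    if np.any(mean <= 0):
        return np.inf  # the mean must stay positive
    return np.sum(mean - y * np.log(mean))

b_true = [0.5, 2.0]  # invented "true" coefficients
x = rng.uniform(0, 5, size=100)
ls_slopes, ml_slopes = [], []
for _ in range(500):
    y = rng.poisson(lam=b_true[0] + b_true[1] * x)
    ls_slopes.append(np.polyfit(x, y, 1)[0])  # least-squares slope
    fit = minimize(pois_negloglik, x0=[1.0, 1.0], args=(x, y), method="Nelder-Mead")
    ml_slopes.append(fit.x[1])  # maximum-likelihood slope
print("sd of least-squares slopes:", np.std(ls_slopes))
print("sd of maximum-likelihood slopes:", np.std(ml_slopes))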
In each case, we specified a distribution for the response, and said how its mean depends on a linear combination of the predictors, $Xb$.

A generalized linear model has three ingredients:

- a family: the probability distribution of the response,
- a linear predictor, $Xb$, and
- a link function connecting the two, which transforms the mean of the response into the linear predictor.
For instance: $$\begin{aligned} Y &\sim \text{Poisson}(\text{mean}=\exp(Xb)) \end{aligned}$$ is a Poisson GLM with a log link function (since $\log(\mathbb{E}[Y]) = Xb$).
For instance: $$\begin{aligned} Y &\sim \text{Normal}(\text{mean}= Xb, \text{sd}=\sigma) \end{aligned}$$ is a Normal GLM with an identity link function.
Probably the most common GLM besides Normal is the "Binomial GLM with a logistic link", i.e.: $$\begin{aligned} Y_i &\sim \text{Binomial}(\text{size}=N_i, \text{prob}=\theta_i) \\ \theta_i &= \frac{1}{1 + e^{-(Xb)_i}} . \end{aligned}$$
Here $f(x) = 1/(1 + e^{-x})$ is the logistic function, which takes the linear predictor (any real number) and gives us back a number between 0 and 1:
import matplotlib.pyplot as plt

xx = np.linspace(-6, 6, 101)
plt.plot(xx, 1/(1 + np.exp(-xx)));
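By the way, scipy provides this same function as scipy.special.expit, which is convenient and handles extreme values gracefully:

from scipy.special import expit
plt.plot(xx, expit(xx));  # the same curve as above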
Taller dandelion flowers spread their seeds farther... but are also more likely to be eaten. We survey a field before and after a herd of cows comes through, and determine that the chance that a dandelion $x$ inches high is eaten is $$ p(x) = \frac{1}{1 + e^{-(x - 5)}} . $$
xvals = np.linspace(0, 12, 61)
pvals = 1/(1 + np.exp(-(xvals - 5)))
plt.plot(xvals, pvals)
plt.xlabel("height (inches)"); plt.ylabel("probability of being eaten");
How might we determine this? Here's some hypothetical data: circles are uneaten, x's are eaten, shown along with the theoretical curve and the proportion eaten in one-inch height bins:
n = 300
x = rng.uniform(low=0, high=12, size=n)
# each dandelion is eaten with probability p(x)
eaten = (rng.random(n) < 1/(1 + np.exp(-(x - 5))))
fig, ax = plt.subplots()
[ax.axvline(k, c='lightgrey') for k in range(13)];
# x's along the top are eaten, circles along the bottom are not (with vertical jitter)
ax.scatter(x[eaten], 1.0 + rng.uniform(-.05, .05, np.sum(eaten)), marker="x")
ax.scatter(x[~eaten], rng.uniform(-.05, .05, np.sum(~eaten)), marker="o")
ax.plot(xvals, pvals)
ax.set_xlabel("height"); ax.set_ylabel("eaten?")
# proportion eaten in each one-inch bin, plotted at the bin centers
props = [np.mean(eaten[np.floor(x) == k]) for k in range(12)]
ax.scatter(np.arange(12) + 0.5, props, s=200);
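One way to "determine this" from such data is to fit a Binomial GLM with a logistic link. Here's a minimal sketch using scikit-learn (previewing the tool we'll use in the lung cancer example below); the estimates should land near the true intercept of -5 and slope of 1:

from sklearn.linear_model import LogisticRegression
# penalty=None turns off regularization (use penalty='none' on scikit-learn < 1.2)
dfit = LogisticRegression(penalty=None).fit(x.reshape(-1, 1), eaten)
print(f"intercept: {dfit.intercept_[0]:.2f}, slope: {dfit.coef_[0][0]:.2f}")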
Identify the family, link function, and linear predictor of the following GLM: $$\begin{aligned} Y &\sim \text{Poisson}(\text{mean}=(b_0 + b_1 X)^2) . \end{aligned}$$ Then, simulate 150 data points from it, and plot, taking $Y$ to be the number of misspelled words per page in an essay and $X$ to be the number of hours past 8pm.
n = 150
X = rng.uniform(low=0, high=8, size=n)
# mean = (b0 + b1 * X)^2 with b0 = sqrt(10) and b1 = 0.5: a square-root link
Y = rng.poisson(lam=(np.sqrt(10) + 0.5 * X)**2)
plt.scatter(X, Y); plt.xlabel("hours past 8pm"); plt.ylabel("misspelled words per page");
Let's build a model of lung cancer incidence, based loosely on the results of Tammemagi et al 2011. Suppose we have a study of smokers aged 50-80, for whom we find out (a) their age, (b) how many "pack-years" they smoked during their lives (which ranges from 0 to 250 but is mostly less than 50), and (c) whether they develop lung cancer over the 10-year study. The overall cancer rate is about 7% (tangential note: it is about 30x lower in non-smokers). Suppose that the probability that someone of age $a$ with $s$ total pack-years smoked develops lung cancer is $$\begin{aligned} p(a, s) = \frac{1}{1 + \exp(-(-4.1 + .02 a + .01 s))} . \end{aligned}$$ Incidence goes up with age and with pack-years.
Here is a visual depiction of the model:
aa = np.linspace(50, 80, 101)
for s in [0, 10, 30, 50, 100]:
    plt.plot(aa, 1/(1 + np.exp(-(-4.1 + .02 * aa + .01 * s))), label=f"pack-years: {s}")
plt.xlabel("age"); plt.ylabel("cancer incidence")
plt.legend();
Our plan: simulate data from this model (first ages and pack-years, then cancer status), fit the model to the simulated data, and check that we recover the relationship we started with.
First, let's pick distributions for age and pack-years:
n = 10_000
age = np.round(rng.uniform(low=50, high=80, size=n), 1)
# Gamma with shape 6 and scale 0.1 * age has mean 0.6 * age,
# so older people have on average smoked more
pack_years = np.round(rng.gamma(shape=6, scale=0.1*age), 0)
fig, (ax0, ax1) = plt.subplots(1, 2)
ax0.hist(age, bins=pretty(age, 40))  # pretty() is a bin-choosing helper from earlier in these notes
ax0.set_xlabel("age (years)"); ax0.set_ylabel("frequency")
ax1.hist(pack_years, bins=pretty(pack_years, 40))
ax1.set_xlabel("pack-years smoked"); ax1.set_ylabel("frequency");
The joint distribution:
plt.scatter(age, pack_years, s=1)
plt.xlabel("age (years)"); plt.ylabel("pack-years smoked");
Now, let's decide whether each person develops cancer, based on their age and pack-years smoked:

theta = -4.1 + .02 * age + .01 * pack_years  # the linear predictor
prob_cancer = 1/(1 + np.exp(-theta))
cancer = rng.binomial(n=1, p=prob_cancer)
# outcome (with vertical jitter) against the linear predictor
plt.scatter(theta, cancer + rng.uniform(low=-.05, high=.05, size=n), s=1);
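As a sanity check, the overall cancer rate in the simulation should come out near the 7% quoted above:

print(f"overall cancer rate: {np.mean(cancer):.3f}")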
To fit the model we could use our old friend scipy.optimize.minimize, but instead we'll use scikit-learn:
from sklearn.linear_model import LogisticRegression
X = np.vstack([age, pack_years]).T
# no regularization (use penalty='none' on scikit-learn versions before 1.2)
lfit = LogisticRegression(penalty=None).fit(X, cancer)
b_age, b_py = lfit.coef_[0]
print(f"The linear predictor is {lfit.intercept_[0]:.3} + {b_age:.3} * age + {b_py:.3} * pack_years .")
The linear predictor is -3.99 + 0.0189 * age + 0.00923 * pack_years .
Here's the fit (it's pretty good!):
aa = np.linspace(50, 80, 101)
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
for s, c in zip([0, 10, 30, 50, 100], colors):
    # solid lines: the true model; dotted lines: the fitted model's predictions
    plt.plot(aa, 1/(1 + np.exp(-(-4.1 + .02 * aa + .01 * s))), label=f"pack-years: {s}", c=c)
    plt.plot(aa, lfit.predict_proba(np.vstack([aa, np.repeat(s, len(aa))]).T)[:,1], linestyle=":", c=c)
plt.xlabel("age"); plt.ylabel("cancer incidence")
plt.legend();
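And, since we mentioned it: here's a minimal sketch of the scipy.optimize.minimize route, writing out the negative log-likelihood of the Binomial/logistic model by hand. The optimizer and starting point are arbitrary choices here, and badly scaled predictors can require centering before this works well.

from scipy.optimize import minimize

def negloglik(b):
    # b = (intercept, coefficient for age, coefficient for pack-years)
    theta = b[0] + b[1] * age + b[2] * pack_years
    # the Bernoulli log-likelihood with p = 1/(1 + exp(-theta))
    # simplifies to y * theta - log(1 + exp(theta))
    return -np.sum(cancer * theta - np.logaddexp(0, theta))

fit = minimize(negloglik, x0=[0.0, 0.0, 0.0], method="BFGS")
print(fit.x)  # should be close to the scikit-learn estimates above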