Generalized Linear Models¶

Peter Ralph

https://uodsci.github.io/dsci345

In [1]:
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (15, 8)
import numpy as np
import pandas as pd
from dsci345 import pretty

rng = np.random.default_rng()

$$\renewcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\text{var}} \newcommand{\sd}{\text{sd}} \newcommand{\cov}{\text{cov}} \newcommand{\cor}{\text{cor}}$$ This is here so we can use \P and \E and \var and \cov and \cor and \sd in LaTeX below.

Linear models: recap¶

We've seen a few flavors of "linear model". First, the 'standard': $$\begin{aligned} Y &= Xb + \epsilon \\ \epsilon &\sim \text{Normal}(\text{mean}=0, \text{sd}=\sigma) , \end{aligned}$$ which is equivalent to $$\begin{aligned} Y &\sim \text{Normal}(\text{mean}= Xb, \text{sd}=\sigma) . \end{aligned}$$

Robust: $$\begin{aligned} Y &\sim \text{Cauchy}(\text{location}= Xb, \text{scale}=\sigma) . \end{aligned}$$

And, in homework, Poisson: $$\begin{aligned} Y &\sim \text{Poisson}(\text{mean}=\exp(Xb)) . \end{aligned}$$
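To make the recap concrete, here's a minimal simulation sketch (not from the original notes; the design matrix and coefficients are made up) that draws one response vector from each of the three models, all sharing the same linear predictor $Xb$:

In [ ]:
n = 100
# made-up design matrix (intercept + one predictor) and coefficients
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=n)])
b = np.array([1.0, 0.5])
sigma = 2.0
Xb = X @ b

y_normal = rng.normal(loc=Xb, scale=sigma)            # 'standard' linear model
y_cauchy = Xb + sigma * rng.standard_cauchy(size=n)   # robust (Cauchy) model
y_poisson = rng.poisson(np.exp(Xb))                   # Poisson model with a log link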

What are these "models" for?

  • understanding how signal and noise combine in your data, and how to extract the information you want

  • simulating data, to test methods and build intuition

  • providing a framework to describe the data (to fit the model, e.g., by maximum likelihood)

  • describing how good predictions based on a fitted model are likely to be

You could, for instance, decide your data come from a Poisson model and yet still fit it by least squares (i.e., maximum likelihood with a Normal model). But: the "correct" model (i.e., one that better describes the noise distribution) should provide more accurate results.
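As a small illustration of that point, here's a sketch (again, not from the original notes; the coefficients are made up) that simulates data from the Poisson model above and then estimates the coefficients two ways: by nonlinear least squares on the mean curve $\exp(b_0 + b_1 x)$, and by fitting a Poisson GLM. Both target the same mean function, but the Poisson fit also uses the right noise model.

In [ ]:
from scipy.optimize import curve_fit
from sklearn.linear_model import PoissonRegressor

# simulate Y ~ Poisson(mean = exp(b0 + b1 x)) with made-up coefficients
n = 200
x = rng.uniform(0, 2, size=n)
b0, b1 = 0.5, 1.2
y = rng.poisson(np.exp(b0 + b1 * x))

# (1) least squares: minimize squared residuals from the mean curve
(ls_b0, ls_b1), _ = curve_fit(lambda x, b0, b1: np.exp(b0 + b1 * x), x, y, p0=[0.0, 1.0])

# (2) Poisson GLM with log link (alpha=0 turns off regularization)
pfit = PoissonRegressor(alpha=0).fit(x.reshape(-1, 1), y)

print("least squares:", ls_b0, ls_b1)
print("Poisson GLM:  ", pfit.intercept_, pfit.coef_[0])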

Generalizing¶

In each case, we said:

  1. The response, $Y$, is random with some distribution,
  2. whose mean is determined by a linear combination of the predictors, $X$.

Ingredients¶

A generalized linear model has three ingredients:

  1. a response distribution for $Y$ (the "family"),
  2. a linear predictor, $Xb$, and
  3. a link function $h( )$ connecting the linear predictor to the mean of the response, usually $h(\E[Y]) = Xb$.

For instance: $$\begin{aligned} Y &\sim \text{Poisson}(\text{mean}=\exp(Xb)) \end{aligned}$$ is a Poisson GLM with a log link function (since $\log(\E[Y]) = Xb$).

For instance: $$\begin{aligned} Y &\sim \text{Normal}(\text{mean}= Xb, \text{sd}=\sigma) . \end{aligned}$$ is a Normal GLM with an identity link function.

Logistic models¶

Probably the most common GLM besides the Normal is the "Binomial GLM with a logit link" (i.e., logistic regression): $$\begin{aligned} Y_i &\sim \text{Binomial}(\text{size}=N_i, \text{prob}=\theta_i) \\ \theta_i &= \frac{1}{1 + e^{-(Xb)_i}} . \end{aligned}$$

Here $f(x) = 1/(1 + e^{-x})$ is the logistic function (the inverse of the logit link), which takes the linear predictor (which can be any real number) and gives back a number between 0 and 1:

In [2]:
xx = np.linspace(-6, 6, 101)
plt.plot(xx, 1/(1 + np.exp(-xx)));
[figure: the logistic function]

Example: keep your head down¶

Taller dandelion flowers spread their seeds farther... but are also more likely to be eaten. We survey a field before and after a herd of cows comes through, and determine that the chance that a dandelion $x$ inches tall is eaten is $$ p(x) = \frac{1}{1 + e^{-(x - 5)}} . $$

In [3]:
xvals = np.linspace(0, 12, 61)
pvals = 1/(1 + np.exp(-(xvals - 5)))
plt.plot(xvals, pvals)
plt.xlabel("height (inches)"); plt.ylabel("probability of being eaten");
[figure: probability of being eaten vs. height]

How might we estimate this relationship from data? Here's some hypothetical data: circles are uneaten, x's are eaten, shown along with the theoretical curve and the proportion eaten in one-inch height bins (large dots):

In [4]:
n = 300
x = rng.uniform(low=0, high=12, size=n)
eaten = (rng.random(n) < 1/(1 + np.exp(-(x-5))))
fig, ax = plt.subplots()
[ax.axvline(k, c='lightgrey') for k in range(13)];
ax.scatter(x[eaten], 1.0 + rng.uniform(-.05, .05, np.sum(eaten)), marker = "x")
ax.scatter(x[~eaten], rng.uniform(-.05, .05, np.sum(~eaten)), marker = "o")
ax.plot(xvals, pvals)
ax.set_xlabel("height"); ax.set_ylabel("eaten?")
props = [np.mean(eaten[np.floor(x) == k]) for k in range(12)]
ax.scatter(np.arange(12) + 0.5, props, s=200);
[figure: simulated eaten/uneaten dandelions by height, binned proportions, and the true curve]
In [5]:
from sklearn.linear_model import LogisticRegression
X = np.vstack([x]).T
lfit = LogisticRegression(penalty=None).fit(X, eaten)
In [6]:
lfit
Out[6]:
LogisticRegression(penalty=None)
In [7]:
lfit.coef_, lfit.intercept_
Out[7]:
(array([[0.92826693]]), array([-4.18889562]))
In [8]:
x[:10], lfit.predict_proba(X)[:10]
Out[8]:
(array([10.73287297,  2.08649262, 11.65510899,  3.99240506,  8.28596169,
         5.82241048,  4.99844798,  4.26507183,  4.03432138,  6.58957577]),
 array([[0.00309745, 0.99690255],
        [0.90482924, 0.09517076],
        [0.00131822, 0.99868178],
        [0.61842725, 0.38157275],
        [0.02923583, 0.97076417],
        [0.22866664, 0.77133336],
        [0.38912343, 0.61087657],
        [0.55719123, 0.44280877],
        [0.60920424, 0.39079576],
        [0.12697327, 0.87302673]]))
In [9]:
fig, ax = plt.subplots()
xx = np.linspace(0, 12, 121)
ax.plot(xx, lfit.predict_proba(np.column_stack([xx])), label=['predicted not eaten', 'predicted eaten'])
ax.plot(xx, 1/(1 + np.exp(-(xx - 5))), label='true relationship')
ax.set_xlabel("height (in)"); ax.set_ylabel("probability of being eaten")
ax.legend();
[figure: fitted and true probability of being eaten vs. height]

Your turn¶

Identify the family, link function, and linear predictor of the following GLM: $$\begin{aligned} Y &\sim \text{Poisson}(\text{mean}=(b_0 + b_1 X)^2) . \end{aligned}$$ Then, simulate 150 data points from it, and plot, taking $Y$ to be the number of misspelled words per page in an essay and $X$ to be the number of hours past 8pm.

In [ ]:
 

Example: incidence¶

Let's build a model of lung cancer incidence, based loosely on the results of Tammemagi et al 2011. Suppose we have a study of smokers aged 50-80 years old, for whom we find out (a) their age, (b) how many "pack-years" they smoked during their life (this ranges from 0 to 250 but is mostly less than 50), and (c) whether they develop lung cancer over the 10-year study. The overall cancer rate is about 7% (tangential note: it is about 30x lower in non-smokers). Suppose that the probability that someone of age $a$ who has smoked a total of $s$ pack-years develops lung cancer is $$\begin{aligned} p(a, s) = \frac{1}{1 + \exp(-(-4.1 + .02 a + .01 s))} . \end{aligned}$$ Incidence goes up with both age and pack-years.
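For instance, under this model a 60-year-old with 30 pack-years has $p(60, 30) = 1/(1 + \exp(-(-4.1 + 0.02 \cdot 60 + 0.01 \cdot 30))) = 1/(1 + e^{2.6}) \approx 0.069$, i.e., about the 7% overall rate.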

Here is a visual depiction of the model:

In [10]:
aa = np.linspace(50, 80, 101)
for s in [0, 10, 30, 50, 100]:
    plt.plot(aa, 1/(1 + np.exp(-(-4.1 + .02 * aa + .01 * s))), label=f"pack-years: {s}")
plt.xlabel("age"); plt.ylabel("cancer incidence")
plt.legend();
[figure: cancer incidence vs. age, one curve per pack-years value]

Our plan:

  1. Simulate 10,000 data points (people) from this model.
  2. Fit the model.
  3. See how close we get to the true parameters.

First, let's pick distributions for age and pack-years:

In [11]:
n = 10_000
age = np.round(rng.uniform(low=50, high=80, size=n), 1)
pack_years = np.round(rng.gamma(shape=6, scale=0.1*age), 0)

fig, (ax0, ax1) = plt.subplots(1, 2)
ax0.hist(age, bins=pretty(age, 40))
ax0.set_xlabel("age (years)"); ax0.set_ylabel("frequency")
ax1.hist(pack_years, bins=pretty(pack_years, 40))
ax1.set_xlabel("pack-years smoked"); ax1.set_ylabel("frequency");
[figure: histograms of age and pack-years smoked]

The joint distribution:

In [12]:
plt.scatter(age, pack_years, s=1)
plt.xlabel("age (years)"); plt.ylabel("pack-years smoked");
[figure: pack-years smoked vs. age]

Now, let's decide whether they develop cancer based on age and years smoked:

In [13]:
theta = -4.1 + .02 * age + .01 * pack_years
prob_cancer =  1/(1 + np.exp(-theta))
cancer = rng.binomial(n=1, p=prob_cancer)

plt.scatter(theta, cancer + rng.uniform(low=-.05, high=.05, size=n), s=1);
[figure: cancer outcome (jittered) vs. linear predictor]

To fit the model we could use our old friend scipy.optimize.minimize, but instead we'll use scikit-learn:
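For the record, here's a minimal sketch of what the scipy.optimize.minimize route might look like (this cell is not in the original notes): write down the negative log-likelihood of the Binomial model with a logistic inverse link and minimize it over the intercept and the two coefficients.

In [ ]:
from scipy.optimize import minimize
from scipy.special import expit  # the logistic function, 1/(1 + exp(-x))

def negloglik(b):
    # b = (intercept, coefficient for age, coefficient for pack-years)
    p = expit(b[0] + b[1] * age + b[2] * pack_years)
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    # Bernoulli log-likelihood: log(p) where cancer == 1, log(1 - p) where cancer == 0
    return -np.sum(cancer * np.log(p) + (1 - cancer) * np.log(1 - p))

mle = minimize(negloglik, x0=np.zeros(3))
print(mle.x)  # should be close to the true values (-4.1, 0.02, 0.01)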

In [14]:
from sklearn.linear_model import LogisticRegression
X = np.vstack([age, pack_years]).T
lfit = LogisticRegression(penalty=None).fit(X, cancer)
b_age, b_py = lfit.coef_[0]
print(f"The linear predictor is {lfit.intercept_[0]:.3} + {b_age:.3} * age + {b_py:.3} * pack_years .")

Here's the fit (it's pretty good!):

In [15]:
aa = np.linspace(50, 80, 101)
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
for s, c in zip([0, 10, 30, 50, 100], colors):
    plt.plot(aa, 1/(1 + np.exp(-(-4.1 + .02 * aa + .01 * s))), label=f"pack-years: {s}", c=c)
    plt.plot(aa, lfit.predict_proba(np.vstack([aa, np.repeat(s, len(aa))]).T)[:,1], linestyle=":", c=c)
plt.xlabel("age"); plt.ylabel("cancer incidence")
plt.legend();
[figure: true (solid) and fitted (dotted) cancer incidence curves, by pack-years]