Homework 9: Moar linear models.¶

Instructions: Please answer the following questions and submit your work by editing this jupyter notebook and submitting it on Canvas. Questions may involve math, programming, or neither, but you should make sure to explain your work: i.e., you should usually have a cell with at least a few sentences explaining what you are doing.

Also, please be sure to always specify units of any quantities that have units, and label axes of plots (again, with units when appropriate).

In [1]:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(123)

1. Logistic hypotheticals¶

Make up a situation in which you might have data like: $$ Y_i \sim \text{Binomial}\left(1, \frac{1}{1 + e^{-a + b X_i}} \right) . $$ (In other words, $Y_i = 0$ or 1 with logistic probabilities.)

(a) Describe the situation in words, including choosing values for $a$ and $b$. Make a plot of $ \frac{1}{1 + e^{-a + b x}}$ against $x$.

(b) Simulate 1000 observations from this model (with values for $X$ drawn from some reasonable distribution).

(c) Fit a logistic linear model to your simulated data. Identify the estimates of $a$ and $b$ (they should be close to the real values!).

2. A new whale¶

In the file data/whales.csv (direct link: github) are the (average) body mass (in kg) and length (in m) of 43 modern cetacean species, from this dataset. Suppose that someone has found fossils of a new species of whale that is about 8m long, and they'd like to estimate how much it weighed, based on the length-weight relationship of modern species.

(a) Read in the data and make a plot of mass against length. Also make a plot of log(mass) against log(length).

(b) Fit a linear model to predict log(mass) with log(length), i.e., find $a$ and $b$ so that $$ \log(\text{mass}) \approx a + b \log(\text{length}) , $$ and add the resulting best-fit line to the plot of log(mass) against log(length).

(c) The predicted mass of a whale of length $\ell$ is $e^{a + b \log(\ell)}$. Add this line to the original plot of mass against length.

(d) What is the predicted weight of the 8m long whale? How far off do you estimate this prediction to be? (You can get the margin of error either by looking at the plot or computing the standard deviations of the residuals on a log scale, and transforming.)

3. Find the whales¶

Recall the word count data (direct link: github) that we used as an example for PCA: there, we constructed the array wordmat so that wordmat[i,j] contains the number of times that words[j] appears in passage i, using the following code:

In [12]:
from collections import defaultdict
import pandas as pd

pfile = open("data/passages.txt", "r")
passages = pfile.read().split("\n")[:-1]
sources = pd.read_table("data/passage_sources.tsv")
words = np.unique(" ".join(passages).split(" "))[1:]
def tabwords(x, words):
    d = defaultdict(int)
    for w in x.split(" "):
        d[w] += 1
    out = np.array([d[w] for w in words])
    return out

wordmat = np.array([tabwords(x, words) for x in passages])

We'll use this matrix to address the following problem: predict the number of times that the words "whale" or "whales" appears in a given passage, using occurrance counts of the other words. To make this a little easier, we won't use all words, just those that appear between 100 and 5,000 times:

In [13]:
wordfreq = wordmat.sum(axis=0)
midwords = np.logical_and(np.logical_and(wordfreq > 100, wordfreq < 5000),
                          ~np.isin(words, ['whale', 'whales']))
has_whales = (wordmat[:,np.where(words == "whale")[0]] + wordmat[:,np.where(words == "whales")[0]]).flatten()
wordmat = wordmat[:, midwords]
words = words[midwords]

Your goal is to use sklearn.PoissonRegressor to fit a Poisson GLM that predicts has_whales from X. To do this:

(a) Choose 90% of the data for training and the remaining 10% for testing. Write functions that fit a model with the training data, and compute the standard deviation of the residuals for the testing data from a fitted model. An argument to the first function should be alpha, which is passed on to PoissonRegressor.

(b) Fit the model with the argument alpha=0. This model should fit very poorly. Explain why.

(c) Fit models across a range of values of alpha, and display the size of their residuals (on testing data) as a table or plot. What is a good value for alpha?

(d) Describe the best-fitting value of alpha from (c): how well does it predict, which words have the largest positive and negative coefficients in the model, and what does this mean? Are there any surprises?