Homework 7: Linear models.
Instructions: Please answer the following questions and submit your work by editing this Jupyter notebook and submitting it on Canvas. Questions may involve math, programming, or neither, but you should make sure to explain your work: i.e., you should usually have a cell with at least a few sentences explaining what you are doing.
Also, please be sure to always specify units of any quantities that have units, and label axes of plots (again, with units when appropriate).
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(123)
1. All models are wrong
Suppose that you have data with heteroskedasticity, i.e., the standard deviation is not constant: $$\begin{aligned} X_i &\sim \text{Poisson}(\text{mean}=2) \\ Y_i &\sim \text{Normal}(\text{mean}= a X_i + b, \text{sd}=a X_i / 4) . \end{aligned}$$ Because of the heteroskedasticity, these data do not satisfy the usual assumptions of a linear model. However, you'd like to use the standard (least squares) linear model to analyze the data; how well does this work in practice? To answer this question:
(a) Write a function that simulates a dataset from the model above. The sample size, $n$, should be an argument to the function, as well as $a$ and $b$.
(b) Simulate one data set with $n=100$, $a=5$, and $b=1$, and fit a (standard, least squares) linear model to the data. You can use either the formula from class or scikit-learn. Plot or otherwise depict the data and predicted values from the model (e.g., the line on top of the scatter).
(c) Do the same thing as in (b) for at least 100 additional simulated datasets, then report across these datasets how well the values of $a$ and $b$ were estimated (i.e., how close the estimates were to the true values of $a=5$ and $b=1$). Summarize the results.
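Hint: as a starting point (and not a complete answer), a simulator for the model in question 1 might be structured roughly like the sketch below. The function name simulate_het, its defaults, and the reuse of the rng created at the top of the notebook are just one possible choice; you will still need to fit the linear model and repeat the simulation many times for (c).
def simulate_het(n, a, b, rng=rng):
    """Simulate n (X, Y) pairs from the heteroskedastic model above."""
    x = rng.poisson(lam=2, size=n)                   # X_i ~ Poisson(mean = 2)
    y = rng.normal(loc=a * x + b, scale=a * x / 4)   # sd grows with X_i
    return x, y

x, y = simulate_het(n=100, a=5, b=1)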
2. Eyecatchers
Colleagues in web design have proposed adding dynamic eyecatchers to your web page: specifically, animated ducks that dance across the screen. You're skeptical that this is a good idea, but run a small study: sixty-five randomly chosen visitors get shown a random number of ducks, and you record how many seconds the visitors stay on the page. Here are the data:
num_ducks = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4])
num_seconds = np.array([22.5, 10.1, 31.2, 29.4, 26.9, 39.2, 17.1, 17.1, 34. , 11.4, 0.3,
9.2, 17.9, 16.6, 11.6, 34.6, 14.3, 29.2, 11.2, 26. , 25. , 12.6,
19.6, 6.8, 14.7, 12.2, 2.1, 14. , 12.3, 15.8, 16.9, 10.3, 6.3,
9.8, 0. , 11.3, 7.4, 7.2, 5.7, 4.9, 3.4, 7.4, 2.9, 4.3,
6.3, 5.5, 7.2, 10.7, 6.5, 9. , 1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. , 1. , 1. , 1. , 1. , 1. ])
(a) Fit a linear model to these data to predict how the average amount of time spent on the site depends on the number of ducks. You may use either of the methods shown in class: scikit-learn or direct use of the formula.
(b) Plot or otherwise display the data and the relationship predicted by the model.
(c) What are your recommendations, based on this result? Be sure to give the basis for your recommendation, including the type of model and the result.
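Hint: if you go the scikit-learn route, recall the basic LinearRegression interface shown in class. The arrays x and y below are made-up illustration data (not the duck data), and this sketch is not a complete answer.
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical example data, only to illustrate the interface
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.9, 5.1, 7.2, 8.8])

X = x.reshape(-1, 1)                   # sklearn expects a 2D array of predictors
fit = LinearRegression().fit(X, y)
print(fit.intercept_, fit.coef_[0])    # intercept and slope of the fitted line
print(fit.predict(X))                  # fitted values at the observed x's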
3. More mosquitos
My kid's class all went camping, and came home with mosquito bites. Thanks to a post-trip poll, we know how many mosquito bites each of the 27 kids had. Here are the numbers:
bites = np.array([4, 5, 4, 2, 4, 8, 4, 6, 7, 5, 4, 0, 5, 7, 5, 3, 2, 0, 3, 4, 5, 3, 6, 1, 2, 3, 5])
Furthermore, a parent has measured the concentration of some volatile organics in each kid's breath. Here are the concentrations for the 27 kids (in the same order as bites, above), in units of parts per million (ppm):
odor = np.array([ 2.8, 4.4, 6.9, 2.3, 5.9, 10.2, 3.2, 7.6, 6.3, 4.5, 4.3,
0. , 8.2, 5.4, 7.6, 3.3, 3.9, 0.1, 2.7, 4.7, 2.1, 4.3,
11.3, 1.7, 2.8, 2.9, 8.5])
Our goal is to determine how odor affects the number of bites. To do this, we'll fit a Poisson model: if $Y_i$ is the number of bites the $i^\text{th}$ kid got, and $X_i$ is their "odor" value, then we want to fit: $$\begin{aligned} Y_i \sim \text{Poisson}(\text{mean}= \exp(a X_i + b)) , \end{aligned}$$ i.e., find the values of $a$ and $b$ at which this model best fits the data. To do this:
(a) Write down (in math) the negative log-likelihood function, and implement it in code: it should take $a$ and $b$ as arguments and return the negative log-likelihood of the data (i.e., of bites and odor) under the model above.
(b) Use your function from (a) and scipy.optimize.minimize to find the maximum likelihood estimates of $a$ and $b$.
(c) Show how the expected number of bites (using the MLE values of $a$ and $b$ from (b)) increases with odor, and how this compares to the observed numbers of bites (for instance, by plotting the data along with a line for the expected number).
Note: we will cover this sort of model more next week, but you already know how to do this: it's just maximum likelihood estimation for a model with two parameters.
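Hint: for part (b), the general scipy.optimize.minimize pattern looks roughly like the sketch below. The toy_negloglik function here is a made-up two-parameter objective, which you would replace with your negative log-likelihood from (a). (Recall that if $Y \sim \text{Poisson}(\lambda)$, then $P(Y = y) = \lambda^{y} e^{-\lambda} / y!$.)
import numpy as np
from scipy.optimize import minimize

# Toy objective only: a generic function of two parameters (a, b).
# For the homework, this is where your negative log-likelihood would go.
def toy_negloglik(params):
    a, b = params
    return (a - 3.0) ** 2 + (b + 1.0) ** 2

result = minimize(toy_negloglik, x0=np.array([0.0, 0.0]))
a_hat, b_hat = result.x                # parameter values minimizing the objective
print(result.success, a_hat, b_hat)    # always check that the optimizer converged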