Homework 4: Statistics!¶

Instructions: Please answer the following questions and submit your work by editing this jupyter notebook and submitting it on Canvas. Questions may involve math, programming, or neither, but you should make sure to explain your work: i.e., you should usually have a cell with at least a few sentences explaining what you are doing.

Also, please be sure to always specify units of any quantities that have units, and label axes of plots (again, with units when appropriate).

In [12]:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
rng = np.random.default_rng()

1. A lot or a little¶

You've recorded the amount of time 20 people spent looking at a particular piece of art in a museum, in seconds. Here are the values:

In [10]:

art = np.array([  5.,  19., 313.,  45.,   1.,   1.,   3., 377., 597.,  52.,  90.,
         0., 779., 107., 471., 131.,  81.,   4.,  57.,   0.])  # seconds

(a) Use the method of moments to fit a Gamma distribution to these data. Be sure to simulate from the model you've fit to check you get qualitatively similar data.

(b) The museum would like to know what proportion of people spend more than 5 minutes looking at this piece of art. Use your model fit in part (a) to provide an estimated answer to this question. Use math, and check the answer by simulation.

2. Looking at the logs¶

You've been given data on login attempts to a certain server from Russian IP addresses were observed each hour across a day. Here are the numbers:

In [39]:

logins = np.array([7,  5,  5,  7,  5,  3,  4,  5,  6,  8, 7,  8,  4,  4,  7,  7,  4,
        4,  4,  6,  8,  4,  4,  6])

Use maximum likelihood to fit a Poisson distribution to these data. To do this, you should

(a) evaluate the Poisson likelihood as a function of $\lambda$, the mean of the Poisson, across a grid of values between 0 and 10. Make a plot of these values or otherwise display them.

(b) use an optimization function (like scipy.optimize.minimize()) to find the value of $\lambda$ that maximizes the log-likelihood. (note: the minimize function works better if you use the log likelihood instead of the likelihood!)

(c) compare the data to simulations from the model you've fit. and comment on how well the model fits (for instance: are the data over- or under-dispersed relative to Poisson?).

(d) Under this model, how often do we expect an hour with more than 8 logins? Answer this question with math, and check it with simulation.

3. Modeling proportions¶

The Beta distribution can be used to model proportions: it gives random numbers between 0 and 1, and has two parameters: $\alpha$ and $\beta$. If $X \sim \text{Beta}(\alpha, \beta)$ then $$ \begin{aligned} \mathbb{E}[X] &= \frac{\alpha}{\alpha + \beta} \\ \mathbb{E}[X^2] &= \frac{\alpha (\alpha-1)}{(\alpha + \beta)(\alpha + \beta - 1)} , \end{aligned}$$ and $X$ has probability density $$ f_X(u) = \frac{ u^{\alpha - 1}(1 - u)^{\beta - 1} }{ B(\alpha, \beta) }. $$ This density can be computed (as usual) with scipy.stats.beta.pdf, or by hand; in the latter case, $B(\alpha, \beta)$ can be computed with scipy.special.beta.

Suppose we have data from many different tracts of forest of what proportion of the trees have burned, and we'd like to fit a Beta distribution to the data. These proportions are:

In [8]:

burned = np.array([
    0.04, 0.55, 0.91, 0.64, 0.83, 0.62, 0.98, 0.7, 0.36, 0.73, 0.74, 0.28, 0.35, 0.65, 0.85, 0.9, 0.94,
    0.11, 0.74, 0.48, 0.62, 0.66, 0.51, 0.79, 0.61, 0.66, 0.75, 0.86, 0.52, 0.84, 0.43, 0.61, 0.99, 0.85, 
    0.97, 0.46, 0.75, 0.61, 0.95, 0.76, 0.78, 0.89, 0.79, 0.92, 0.83, 0.84, 0.61, 0.52, 0.82, 0.87, 0.9, 
    0.58, 0.67, 0.42, 0.9, 0.4, 0.95, 0.98, 0.56, 0.94, 0.5, 0.84, 0.58, 0.91, 0.21, 0.54, 0.9, 0.64, 0.48,
    0.82, 0.77, 0.63, 0.84, 0.97, 0.77, 0.96, 0.83, 0.9, 0.96, 0.52, 0.24, 0.92, 0.11, 0.96, 0.85, 0.62, 
    0.96, 0.67, 0.87, 0.78, 0.85, 0.88, 0.88, 0.68, 0.13, 0.9, 0.94, 0.49, 0.74, 0.99
])

(a) Evaluate the likelihood surface for this data on a grid of values across the range $0.5 \le \alpha, \beta \le 4$, and make a plot or otherwise show the resulting likelihoods.

(b) Estimate the values of $\alpha$ and $\beta$ that best fit the data by maximum likelihood. (Note: scipy.optimize.minimize() may have difficulty converging; if so, try different starting locations.)

(c) Using this model, in what proportion of forest tracts do you estimate less than 20% of the trees have burned?