Final: Probability and Statistics for Data Science

Instructions: Please answer the following questions and submit your work by editing this Jupyter notebook and submitting it on Canvas. Questions may involve math, programming, or neither, but you should make sure to explain your work: i.e., you should usually have a cell with at least a few sentences explaining what you are doing.

Also, please be sure to always specify units of any quantities that have units, and label axes of plots (again, with units when appropriate).

In [1]:

import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(123)

1. Stories

Make up stories in which we might reasonably obtain multiple random numbers from each of the following distributions. Be concrete: say exactly what quantity in your story is given by the distribution, specify what the numerical values of any parameters are, and give units when appropriate. Stories should match the distribution (e.g., if the distribution produces only integers, then the quantity in your story should always be an integer).

(a) Exponential(mean=$\mu$)

(b) Poisson(mean=$\mu$)

(c) Binomial(trials=$n$, probability=$p$)

(d) Normal(mean=$\mu$, sd=$\sigma$).
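As a reminder of how these four distributions are parameterized in NumPy, here is a minimal sketch that draws a few values from each. The parameter values are hypothetical and chosen only for illustration; they are not part of the question. Note that `scale` for the exponential is the mean, and that the Poisson and binomial draws are integers.

```python
import numpy as np

rng = np.random.default_rng(123)

# Hypothetical parameter values, chosen only for illustration.
mu_exp, mu_pois, n, p, mu, sigma = 2.0, 3.0, 10, 0.25, 170.0, 8.0

draws = {
    "exponential": rng.exponential(scale=mu_exp, size=5),  # scale is the mean
    "poisson": rng.poisson(lam=mu_pois, size=5),           # non-negative integers
    "binomial": rng.binomial(n=n, p=p, size=5),            # integers from 0 to n
    "normal": rng.normal(loc=mu, scale=sigma, size=5),     # any real number
}
for name, values in draws.items():
    print(name, values)
```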

2. The three surveyors

Moe, Larry, and Curly are measuring the volume of a box (to figure out how many doughnuts will fit inside). To do this, Moe measures the width, Larry measures the length, and Curly measures the height. However, each of them makes mistakes: if the true width, length, and height are $w$, $\ell$, and $h$ respectively, then the volume they compute is $$ X = M \times L \times C, $$ where $M$, $L$, and $C$ are their three respective measurements; these are independent and $$\begin{aligned} M &\sim \text{Exponential}(\text{mean}=w) \\ L &\sim \text{Normal}(\text{mean}=\ell, \text{sd}=1.5) \\ C &\sim \text{Poisson}(\text{mean}=h) . \end{aligned}$$

In the following, suppose they have measured a box with true dimensions $w=10$, $\ell=12$, and $h=5$.

(a) Write a function to simulate from $X$, and use it to make a histogram of the distribution of $X$.

(b) Use simulation to estimate $\mathbb{E}[X]$ and $\text{sd}[X]$.

(c) Find $\mathbb{E}[X]$ and $\text{sd}[X]$ using math (i.e., using the probability rules from class). *(Hint: use the definition that $\text{var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$.)*
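One possible way to set up the simulation is sketched below. The function name `sim_volume`, its arguments, and the number of replicates are assumptions made for illustration; the three draws follow the distributions stated above and are multiplied elementwise.

```python
import numpy as np

rng = np.random.default_rng(123)

def sim_volume(size, w=10, ell=12, h=5, rng=rng):
    """Simulate `size` independent copies of X = M * L * C."""
    M = rng.exponential(scale=w, size=size)        # Moe: Exponential(mean=w)
    L = rng.normal(loc=ell, scale=1.5, size=size)  # Larry: Normal(mean=ell, sd=1.5)
    C = rng.poisson(lam=h, size=size)              # Curly: Poisson(mean=h)
    return M * L * C

# 100,000 replicates is an arbitrary choice.
x = sim_volume(100_000)
print(f"estimated mean: {x.mean():.1f}, estimated sd: {x.std():.1f}")
```

The array `x` can be passed to `plt.hist` to draw the histogram; remember to label the axes.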

3. Sleeping bats

We are studying how metabolic levels of a type of bat (specifically, Epomophorus wahlbergi) depend on ambient temperature. To do this, we have measured the oxygen consumed by 20 bats over the course of 1 hour, each measured at a different temperature. Here are the data (temperatures in degrees C, oxygen in ml per g of body weight):

In [4]:

temperature = np.array([1.4, 2.8, 1.2, 10., 0.4, 6.3, 1.2, 0.6, 0.7, 2.6, 2.4,
                   0.4 , 1. , 2.3, 2.8, 7.4, 2.6, 1.6, 9.4, 11.4])
oxygen = np.array([12.1, 10.6,  9.1,  7.4,  9.8, 13.3, 11.1, 11.3, 10.3,  9.2,  9.7,
                   9.5,  9.5, 10.8,  7.7,  7.5,  8.6, 11.5,  8.9, 14.1])

Our observation is that although the average oxygen consumed is fairly consistent across temperatures, the variability between bats differs substantially at different temperatures. So, we'd like to fit the following model: if $M$ is the oxygen consumed and $T$ is the temperature, $$ M \sim \text{Normal}(\text{mean}=a, \text{sd}=\exp(b T) ). $$ (In words: although mean oxygen consumed does not change with temperature, the standard deviation does.)

(a) Write the log-likelihood function for this model. The function should have three arguments: the tuple of parameters, $(a, b)$; the array of temperatures; and the array of oxygen consumption measurements.

(b) Estimate the values of $a$ and $b$ from the data by minimizing the negative log-likelihood.

(c) Use simulation (i.e., the "parametric bootstrap") to obtain 95% confidence intervals for your two estimates. Do this by keeping the temperatures fixed, simulating oxygen consumption from this model using the estimated values of $a$ and $b$, re-applying the estimation procedure, and reporting a 95% interval of the resulting estimates.
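The three steps above could be sketched roughly as follows. This is one possible setup, not the only one: the function names `logl` and `fit`, the Nelder-Mead optimizer, the starting guess for $b$, and the 200 bootstrap replicates are all assumptions made for illustration. The log-likelihood is the sum of Normal log-densities with mean $a$ and standard deviation $\exp(bT)$.

```python
import numpy as np
from scipy.optimize import minimize

temperature = np.array([1.4, 2.8, 1.2, 10., 0.4, 6.3, 1.2, 0.6, 0.7, 2.6, 2.4,
                        0.4, 1., 2.3, 2.8, 7.4, 2.6, 1.6, 9.4, 11.4])
oxygen = np.array([12.1, 10.6, 9.1, 7.4, 9.8, 13.3, 11.1, 11.3, 10.3, 9.2, 9.7,
                   9.5, 9.5, 10.8, 7.7, 7.5, 8.6, 11.5, 8.9, 14.1])

def logl(params, T, M):
    """Log-likelihood of M ~ Normal(mean=a, sd=exp(b * T))."""
    a, b = params
    sd = np.exp(b * T)
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(sd) - 0.5 * ((M - a) / sd) ** 2)

def fit(T, M):
    """Estimate (a, b) by minimizing the negative log-likelihood."""
    x0 = (np.mean(M), 0.1)  # starting guess; 0.1 for b is arbitrary
    res = minimize(lambda p: -logl(p, T, M), x0=x0, method="Nelder-Mead")
    return res.x

est = fit(temperature, oxygen)

# Parametric bootstrap: temperatures stay fixed, oxygen is re-simulated.
rng = np.random.default_rng(123)
boots = np.array([
    fit(temperature, rng.normal(loc=est[0], scale=np.exp(est[1] * temperature)))
    for _ in range(200)  # 200 replicates is an arbitrary choice
])
ci = np.quantile(boots, [0.025, 0.975], axis=0)  # rows: lower, upper; columns: a, b
print("estimates (a, b):", est)
print("95% intervals:\n", ci)
```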

4. One hundred and one hats

Here is a list of measurements of the head diameters of 101 adults, in cm (measured around where the typical hat sits).

In [12]:

hats = np.array([50.8, 57.8, 50.4, 53.8, 61.6, 64.6, 62.4, 65.3, 54.3, 60.1, 57.5,
       64. , 58. , 47.7, 49.2, 59.1, 43.1, 51.9, 50.3, 59.6, 59.6, 45.3,
       59. , 54. , 61.2, 57.7, 64.2, 63.4, 55.1, 55.4, 50.8, 55.9, 54.5,
       51.1, 54.3, 53.8, 50.5, 54.5, 57.1, 59.5, 43.5, 48.6, 57.7, 55.7,
       43.9, 56.5, 57. , 57.7, 52.5, 48.4, 57.4, 60.2, 59. , 54.8, 51.2,
       58. , 51.7, 63.3, 63.1, 51.6, 58. , 49.6, 54.9, 56.6, 52. , 58.7,
       55.7, 54.1, 58.9, 59.6, 59.7, 54.7, 57. , 55.3, 48.3, 57.1, 46.2,
       51.5, 57.2, 52.9, 55.5, 61.3, 56.2, 54.3, 53.7, 60.3, 63.3, 56.5,
       48.2, 55.8, 54.3, 44.1, 56. , 57. , 57.3, 53.7, 58.7, 52.4, 53.8,
       50.1, 50.1])

We are working on developing a flexible new type of hat that fits a wide range of people and would like to know what the 10% and 90% quantiles of head diameters are. For these data, those numbers are

In [13]:

np.quantile(hats, [0.1, 0.9])

Out[13]:

array([48.6, 61.3])

In other words, 10% of the head diameters are below 48.6 cm, 80% are between 48.6 cm and 61.3 cm, and 10% are above 61.3 cm.

However, we'd like to know how good these estimates are. Use the bootstrap (i.e., the usual, resampling-based bootstrap) to find 95% confidence intervals for both.
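A resampling bootstrap for quantiles might look like the sketch below, which uses the percentile method: resample the data with replacement, recompute the quantiles each time, and take the middle 95% of the results. The function name `bootstrap_quantile_ci`, the number of resamples, and the stand-in sample `demo` are all hypothetical; `demo` is simulated here only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(123)

def bootstrap_quantile_ci(x, probs, n_boot=2000, level=0.95, rng=rng):
    """Percentile-bootstrap confidence intervals for the quantiles of x."""
    x = np.asarray(x)
    boot = np.array([
        np.quantile(rng.choice(x, size=len(x), replace=True), probs)
        for _ in range(n_boot)
    ])
    alpha = (1 - level) / 2
    # Row 0 holds lower endpoints, row 1 upper endpoints; one column per quantile.
    return np.quantile(boot, [alpha, 1 - alpha], axis=0)

# Hypothetical stand-in sample; not the real measurements.
demo = rng.normal(loc=55, scale=5, size=101)
ci = bootstrap_quantile_ci(demo, [0.1, 0.9])
print(ci)
```

Calling `bootstrap_quantile_ci(hats, [0.1, 0.9])` would apply the same procedure to the measurements above.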