Final: Probability and Statistics for Data Science

Instructions: Please answer the following questions and submit your work by editing this Jupyter notebook and submitting it on Canvas. Questions may involve math, programming, or neither, but you should make sure to explain your work: i.e., you should usually have a cell with at least a few sentences explaining what you are doing.

Also, please be sure to always specify units of any quantities that have units, and label axes of plots (again, with units when appropriate).

1. Stories

Make up stories in which we might reasonably obtain multiple random numbers from each of the following distributions. Be concrete: say exactly what quantity in your story is given by the distribution, specify what the numerical values of any parameters are, and give units when appropriate. Stories should match the distribution (e.g., if the distribution produces only integers, then the quantity in your story should always be an integer).

(a) Normal(mean=$\mu$, sd=$\sigma$)

(b) Poisson(mean=$\mu$)

(c) Binomial(trials=$n$, probability=$p$)

(d) Exponential(mean=$\mu$)
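When checking that a story matches its distribution, it can help to draw a few values and look at what kind of numbers come out. The following is a minimal sketch using NumPy's random generator. The parameter values, the seed, and the variable names are arbitrary illustrations and are not tied to any particular story.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility
n = 5                                # number of draws from each distribution

# Illustrative parameter values only; a story should justify its own values.
normal_draws = rng.normal(loc=10.0, scale=2.0, size=n)    # real-valued, can be negative
poisson_draws = rng.poisson(lam=3.0, size=n)              # nonnegative integers (counts)
binomial_draws = rng.binomial(n=20, p=0.25, size=n)       # integers from 0 to 20
exponential_draws = rng.exponential(scale=4.0, size=n)    # nonnegative reals; scale is the mean

for name, draws in [("Normal", normal_draws), ("Poisson", poisson_draws),
                    ("Binomial", binomial_draws), ("Exponential", exponential_draws)]:
    print(name, draws)
```

The Poisson and Binomial draws are always integers, and the Exponential draws are never negative, so the quantity in the corresponding story should behave the same way.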

2. Random areas

The automated saw at the plywood factory has gone haywire: it's cutting random widths and heights. $X$ is the width of a piece and $Y$ is the height, with $$\begin{aligned} X &\sim \text{Uniform}(\text{min}=0, \text{max}=6 \text{ft}) \\ Y &\sim \text{Exponential}(\text{mean}=2 \text{ft}). \end{aligned}$$ Let $Z = XY$ be the area of a randomly chosen piece of plywood.

(a) Write a function to simulate from $Z$ and use it to make a histogram of the distribution of $Z$.

(b) Use simulation to estimate $\mathbb{E}[Z]$.

(c) Find $\mathbb{E}[Z]$ using math (i.e., using the probability rules from class).
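One possible approach to parts (a) and (b) is sketched below. It is a sketch under stated assumptions, not a model answer: it assumes $X$ and $Y$ are independent (the problem statement does not say so explicitly), and the function names `simulate_area` and `plot_area_histogram`, the seed, and the number of simulated pieces are illustrative choices. NumPy's exponential sampler takes a `scale` argument, which equals the mean.

```python
import numpy as np

def simulate_area(n, rng):
    """Draw n simulated plywood areas Z = X * Y, in square feet."""
    width = rng.uniform(low=0.0, high=6.0, size=n)   # X, in ft
    height = rng.exponential(scale=2.0, size=n)      # Y, in ft; scale is the mean
    return width * height                            # assumes X and Y are independent

def plot_area_histogram(z):
    """Histogram of simulated areas, with labeled axes."""
    import matplotlib.pyplot as plt  # imported here so the simulation runs without matplotlib
    fig, ax = plt.subplots()
    ax.hist(z, bins=100)
    ax.set_xlabel("area Z (square feet)")
    ax.set_ylabel("number of simulated pieces")
    return ax

rng = np.random.default_rng(seed=1)  # arbitrary seed
z = simulate_area(100_000, rng)
z_mean = z.mean()
print(f"estimated E[Z]: {z_mean:.2f} square feet")
```

Calling `plot_area_histogram(z)` draws the histogram for part (a). As a check for part (c): under the independence assumption, $\mathbb{E}[Z] = \mathbb{E}[X]\,\mathbb{E}[Y] = 3 \text{ ft} \times 2 \text{ ft} = 6 \text{ ft}^2$, so the simulated mean should land close to 6.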

3. An anemometer estimate

We are trying to understand what determines local wind speeds, to develop guidelines for siting small wind turbines. We've measured average wind speeds over a week on poles at various heights off the ground. Here are the heights (in meters) and measured speeds (in m/s):

In [4]:

import numpy as np

height = np.array([0.29, 0.57, 0.24, 2.01, 0.09, 1.26, 0.23, 0.11, 0.14, 0.52, 0.49,
                   0.  , 0.19, 0.46, 0.56, 1.47, 0.52, 0.33, 1.88, 2.28])
speed = np.array([12.1, 10.6,  9.5,  7.4,  9.8, 13.3, 11.1, 11.3, 10.3,  9.2,  9.7,
                   9.5,  9.5, 10.8,  7.7,  7.5,  8.6, 11.5,  8.9, 14.1])

Previous work has shown that the following simplified model can be good in circumstances like these: if the speed is $S$ and height is $H$, $$ S \sim \text{Normal}(\text{mean}=a, \text{sd}=\exp(b H) ). $$ (In words: although mean wind speed does not change with height, the standard deviation does.)

(a) Write the log-likelihood function for this model. The function should have three arguments: the tuple of parameters, $(a, b)$; the array of heights; and the array of speeds.

(b) Estimate the values of $a$ and $b$ from the data by minimizing the negative log-likelihood.

(c) Use simulation (i.e., the "parametric bootstrap") to obtain 95% confidence intervals for your two estimates. Do this by keeping height fixed, simulating speed from this model using the estimated values of $a$ and $b$, re-applying the estimation procedure to each simulated data set, and reporting a 95% interval of the resulting estimates.

Note: for part (b), if you want to make a surface plot to make sure your estimated values make sense, I recommend using plt.contourf instead of imshow as in previous code. (However, such a plot is strictly optional.)
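The three parts above can be sketched together as follows. This is one possible implementation under stated assumptions, not the definitive one: the helper names (`log_likelihood`, `neg_log_likelihood`, `fit`), the starting point, the choice of the Nelder-Mead method in `scipy.optimize.minimize`, the seed, and the number of bootstrap replicates are all illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

height = np.array([0.29, 0.57, 0.24, 2.01, 0.09, 1.26, 0.23, 0.11, 0.14, 0.52, 0.49,
                   0.0, 0.19, 0.46, 0.56, 1.47, 0.52, 0.33, 1.88, 2.28])   # meters
speed = np.array([12.1, 10.6, 9.5, 7.4, 9.8, 13.3, 11.1, 11.3, 10.3, 9.2, 9.7,
                  9.5, 9.5, 10.8, 7.7, 7.5, 8.6, 11.5, 8.9, 14.1])         # m/s

def log_likelihood(params, height, speed):
    """Log-likelihood of S ~ Normal(mean=a, sd=exp(b * H)), summed over observations."""
    a, b = params
    sd = np.exp(b * height)
    # log Normal density: -log(sd) - log(2 pi)/2 - ((s - a) / sd)^2 / 2, with log(sd) = b * H
    return np.sum(-b * height - 0.5 * np.log(2 * np.pi) - 0.5 * ((speed - a) / sd) ** 2)

def neg_log_likelihood(params, height, speed):
    return -log_likelihood(params, height, speed)

def fit(height, speed):
    """Estimate (a, b) by minimizing the negative log-likelihood."""
    start = (np.mean(speed), 0.0)  # assumed starting point: sample mean, constant sd of 1
    result = minimize(neg_log_likelihood, x0=start, args=(height, speed),
                      method="Nelder-Mead")
    return result.x

# Part (b): estimates from the observed data.
a_hat, b_hat = fit(height, speed)

# Part (c): parametric bootstrap with height held fixed.
rng = np.random.default_rng(seed=123)  # arbitrary seed
n_boot = 500                           # assumed number of replicates
boot_estimates = np.empty((n_boot, 2))
for i in range(n_boot):
    sim_speed = rng.normal(loc=a_hat, scale=np.exp(b_hat * height))
    boot_estimates[i] = fit(height, sim_speed)

a_ci = np.quantile(boot_estimates[:, 0], [0.025, 0.975])
b_ci = np.quantile(boot_estimates[:, 1], [0.025, 0.975])
print(f"a = {a_hat:.2f} m/s, 95% interval {a_ci}")
print(f"b = {b_hat:.2f} per meter, 95% interval {b_ci}")
```

Since $bH$ must be unitless for $\exp(bH)$ to make sense, $b$ has units of inverse meters, and $a$ has the same units as speed (m/s).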

4. Boots

Here is a list of the lengths (in cm) of 101 right boots of adults walking a popular outdoor trail on Mt. Hood.

In [12]:

boots = np.array([27.7, 23.2, 26. , 25.9, 25.3, 28.2, 23.7, 24.7, 22.6, 26.1, 26.5,
       24.3, 27.1, 25.1, 25.6, 27.3, 26.5, 24.5, 27. , 26.3, 25.2, 22.8,
       25.9, 24.3, 24.3, 23.7, 24.3, 25.5, 25.1, 28.2, 28.3, 26.4, 27.1,
       25.6, 24.8, 28.3, 28. , 23.1, 23.5, 25.8, 26.7, 24.6, 25.2, 25.1,
       27.7, 24.3, 25.8, 22.5, 26.6, 26.4, 25.9, 23.3, 23. , 25. , 26.5,
       27.8, 23.7, 24.2, 27.2, 24.1, 27.1, 25.2, 26.1, 25.3, 25.8, 24.2,
       25.1, 26.4, 26. , 27.9, 24.4, 27. , 25.7, 23.6, 26.2, 27. , 23.8,
       25.6, 26.8, 23.9, 24.3, 24.8, 24.4, 25. , 21.9, 27.8, 26.4, 26.3, 20.1,
       26.6, 26.2, 24.1, 26.8, 23.2, 23.7, 24.2, 23.6, 25.4, 27.1, 24.1, 26.2])

We are trying to design a precision boot cleaner and would like to know what the 10% and 90% quantiles of boot sizes are. For these data, those numbers are

In [13]:

np.quantile(boots, [0.1, 0.9])

Out[13]:

array([23.5, 27.3])

In other words, 10% of the boots are below 23.5 cm, 80% are between 23.5 cm and 27.3 cm, and 10% are above 27.3 cm.

However, we'd like to know how good these estimates are. Use the bootstrap (i.e., the usual, resampling-based bootstrap) to find 95% confidence intervals for both quantiles.
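A minimal sketch of the resampling bootstrap for these two quantiles follows. It uses the percentile method (taking the 2.5% and 97.5% quantiles of the bootstrap estimates), which is one common choice among several. The seed, the number of resamples, and the variable names are illustrative assumptions.

```python
import numpy as np

boots = np.array([27.7, 23.2, 26. , 25.9, 25.3, 28.2, 23.7, 24.7, 22.6, 26.1, 26.5,
       24.3, 27.1, 25.1, 25.6, 27.3, 26.5, 24.5, 27. , 26.3, 25.2, 22.8,
       25.9, 24.3, 24.3, 23.7, 24.3, 25.5, 25.1, 28.2, 28.3, 26.4, 27.1,
       25.6, 24.8, 28.3, 28. , 23.1, 23.5, 25.8, 26.7, 24.6, 25.2, 25.1,
       27.7, 24.3, 25.8, 22.5, 26.6, 26.4, 25.9, 23.3, 23. , 25. , 26.5,
       27.8, 23.7, 24.2, 27.2, 24.1, 27.1, 25.2, 26.1, 25.3, 25.8, 24.2,
       25.1, 26.4, 26. , 27.9, 24.4, 27. , 25.7, 23.6, 26.2, 27. , 23.8,
       25.6, 26.8, 23.9, 24.3, 24.8, 24.4, 25. , 21.9, 27.8, 26.4, 26.3, 20.1,
       26.6, 26.2, 24.1, 26.8, 23.2, 23.7, 24.2, 23.6, 25.4, 27.1, 24.1, 26.2])  # cm

rng = np.random.default_rng(seed=2024)  # arbitrary seed
n_boot = 10_000                         # assumed number of bootstrap resamples

# Each row is one resample of the 101 boots, drawn with replacement.
resamples = rng.choice(boots, size=(n_boot, boots.size), replace=True)

# Row 0 holds the 10% quantile of every resample, row 1 the 90% quantile.
boot_quantiles = np.quantile(resamples, [0.1, 0.9], axis=1)

# Percentile-method 95% intervals, in cm.
ci_10 = np.quantile(boot_quantiles[0], [0.025, 0.975])
ci_90 = np.quantile(boot_quantiles[1], [0.025, 0.975])
print("95% interval for the 10% quantile (cm):", ci_10)
print("95% interval for the 90% quantile (cm):", ci_90)
```

Because the data are recorded to the nearest 0.1 cm, the bootstrap quantiles take only a limited set of distinct values, so the interval endpoints change in small steps rather than smoothly as the seed or the number of resamples varies.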