Homework 4: Statistics!¶

Instructions: Please answer the following questions and submit your work by editing this jupyter notebook and submitting it on Canvas. Questions may involve math, programming, or neither, but you should make sure to explain your work: i.e., you should usually have a cell with at least a few sentences explaining what you are doing.

Also, please be sure to always specify units of any quantities that have units, and label axes of plots (again, with units when appropriate).

1. Mosquito bites¶

My kid's class all went camping, and came home with mosquito bites. Thanks to a post-trip poll, we know how many mosquito bites each of the 27 kids had. Here are the numbers:

Use maximum likelihood to fit a Poisson distribution to these data. To do this, you should

(a) make a plot of the Poisson likelihood as a function of $\lambda$, the mean of the Poisson, and

(b) use an optimization function (like scipy.optimize.minimize()) to find the value of $\lambda$ that maximizes the log-likelihood. (note: the minimize function works better if you use the log likelihood instead of the likelihood!)

(c) check your answer is sensible by comparing the distribution of the data to that expected under the model you've fit.

(d) Under this model, what proportion of kids do we expect to have zero mosquito bites? Answer this question with math, and check it with simulation.

2. Modeling proportions¶

The Beta distribution can be used to model proportions: it gives random numbers between 0 and 1, and has two parameters: $\alpha$ and $\beta$. If $X \sim \text{Beta}(\alpha, \beta)$ then $$ \begin{aligned} \mathbb{E}[X] &= \frac{\alpha}{\alpha + \beta} \\ \mathbb{E}[X^2] &= \frac{\alpha (\alpha-1)}{(\alpha + \beta)(\alpha + \beta - 1)} , \end{aligned}$$ and $X$ has probability density $$ f_X(u) = \frac{ u^{\alpha - 1}(1 - u)^{\beta - 1} }{ B(\alpha, \beta) }. $$ This density can be computed (as usual) with scipy.stats.beta.pdf, or by hand; in the latter case, $B(\alpha, \beta)$ can be computed with scipy.special.beta.

Suppose we have data from many different tracts of forest of what proportion of the trees have burned, and we'd like to fit a Beta distribution to the data. These proportions are:

(a) Make a plot of the likelihood surface for this data over the range $0.5 \le \alpha, \beta \le 4$.

(b) Estimate the values of $\alpha$ and $\beta$ that best fit the data by maximum likelihood.

(c) Using this model, in what proportion of forest tracts do you estimate less than 20% of the trees have burned?

3. Student's tea distribution¶

We are intending to do a study of how baked goods affect liquid intake: we plan to give 100 students a cup of tea along with either crumpets or fruit, and measure how many ounces of tea each student drinks in 5 minutes. Based on trial runs, we expect the amount of tea drunk per student has a SD of 3 oz. 50 students each are assigned randomly to the "crumpet" and "fruit" groups. You'll do a power analysis, finding out the probability that a study correctly identifies a difference of a certain size in mean amounts of tea drunk between the two groups.

(a) Write a function that produces a simulation of the study. The function should have two arguments: mean_crumpet and mean_fruit, which are the mean amount of tea drunk per student in the two groups, respectively. Assume that the number of ounces of tea drunk by a student is $$ \min(8, \max(0, X)), $$ where $X \sim \text{Normal}(\text{mean}=\mu, \text{sd}=3)$. (In other words, draw the number of ounces from a Normal distribution, but truncate it to fall between 0 and 8 ounces.) The function should return something of the form ((mc, sdc), (mf, sdf)), where mc and sdc are sample mean and sample SD of the "crumpet" group, respectively, and mf and sdf are sample mean and sample SD of the "fruit" group.

(b) The two-sample $t$ statistic for a pair of samples $x$ and $y$, each of size $n$, is $$ t = \frac{ \bar x - \bar y }{ \sqrt{s_x^2 + s_y^2} / \sqrt{n} }, $$ where $\bar x$ and $s_x$ are the sample mean and sample SD of the first sample, and $\bar y$ and $s_y$ are the same for the second sample. Plot a histogram of the distribution of $t$ statistics from at least 1,000 simulated studies in which both groups have mean 4 oz.

(c) Find the critical value of the distribution from (b), i.e., the value $t^*$ such that $\mathbb{P}\{ |t| < t^* \} = 0.95$ under the assumptions from part (b). (Do this by calculating the value from a $t$ distribution with $2n-2$ degrees of freedom, not by simulation.)

(d) Suppose in reality the 'crumpet' group drinks an average of 5oz of tea, while the 'fruit' group drinks an average of 3oz. Now, plot a histogram of the distribution of $t$ statistics in this case, overlaid on the histogram from (b).

(e) Find the probability that a dataset produced from the model of part (d) finds a $t$ statistic smaller than the value of $t^*$ from part (c), either by calculation or by simulation.

4. Statistics statistics¶

We have given 25 statistics students a standardized test on statistics concepts both before and after taking a statistics class. Across the class, the mean improvement was 32 points (out of 100), with an SD of 30 points. Assuming that the $t$ test is appropriate, what can we conclude from this study? Please interpret the results, including degree of uncertainty.

5. Correlation¶

Explain this joke in a way someone not in this class can understand it: xckd:552