Homework 5: Hypotheses, and hypotheticals.

Instructions: Please answer the following questions and submit your work by editing this Jupyter notebook and submitting it on Canvas. Questions may involve math, programming, or neither, but you should make sure to explain your work: i.e., you should usually have a cell with at least a few sentences explaining what you are doing.

Also, please be sure to always specify units of any quantities that have units, and label axes of plots (again, with units when appropriate).

In [4]:

import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
rng = np.random.default_rng()

1. Rain

Is it rainier in Eugene or Springfield? In data/eug_spr_rain.csv you'll find data on daily rainfall, in inches, at NOAA weather stations on Queens East Street in Eugene and on Dixie Drive in Springfield. You can read in the data as follows:

In [2]:

import pandas as pd
rain = pd.read_csv("data/eug_spr_rain.csv").set_index("date")

(a) Look at the data: make numerical or graphical summaries of the daily totals in each location, and how they relate to each other.

(b) Compute the daily difference (Eugene minus Springfield) in rainfall, and summarize that distribution.

(c) On what proportion of the days did it rain more in Eugene than Springfield? How about more in Springfield than in Eugene?

(d) Compute the $t$ statistic for the Eugene minus Springfield difference, and get a $p$-value for the two-sided test (i.e., the probability that a draw from the $t$ distribution with the appropriate degrees of freedom is larger in absolute value than the number you calculated).

(e) What is your conclusion? Write a few sentences reporting the results, including the statistical tests and real-world interpretations. Be sure to include takeaways and context (e.g., what was the average rainfall?), and address possible concerns (are any assumptions of the $t$ test likely violated?).
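For part (d), here is a minimal sketch of computing a one-sample $t$ statistic and its two-sided $p$-value by hand. The array `diffs` holds made-up numbers for illustration only, not the real rainfall data, and the names `t_stat`, `p_value`, and `check` are hypothetical. The last line cross-checks against `scipy.stats.ttest_1samp()`.

```python
import numpy as np
import scipy.stats

# Made-up daily differences in inches (illustration only, NOT the real data).
diffs = np.array([0.02, -0.01, 0.00, 0.05, -0.03, 0.04, 0.01, -0.02])

n = len(diffs)
# t statistic for the hypothesis that the mean difference is zero:
# sample mean divided by its estimated standard error.
t_stat = np.mean(diffs) / (np.std(diffs, ddof=1) / np.sqrt(n))
# Two-sided p-value: probability that a t(n-1) draw exceeds |t_stat|
# in absolute value.
p_value = 2 * scipy.stats.t.sf(np.abs(t_stat), df=n - 1)

# Cross-check with scipy's built-in one-sample t test (two-sided by default).
check = scipy.stats.ttest_1samp(diffs, popmean=0)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

With the real data, `diffs` would instead be the Eugene minus Springfield column computed in part (b).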

2. To the $t$

You now know a few facts about the $t$ distribution:

The $t$ statistic, computed from a sample of $n$ independent draws from a distribution $X$ with mean $\mu$, is approximately described by Student's $t$ distribution with $n-1$ degrees of freedom.
The previous statement is exact if $X$ is Normal.

In particular: define $t_*(n)$ so that a draw from Student's $t$ distribution with $n-1$ degrees of freedom is larger than $t_*(n)$ with probability 5%. Then, if $X_1, \ldots, X_n$ are independent draws from some distribution, and $T$ is the $t$ statistic calculated from them using the true mean of that distribution, then $\mathbb{P}(T > t_*(n)) \approx 0.05$. If $X_1, \ldots, X_n$ are draws from the Normal distribution then this is exact.
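As a rough sketch, the cutoff $t_*(n)$ that a $t$ draw exceeds with probability 0.05 can be obtained from the quantile function `scipy.stats.t.ppf()`. The helper name `t_star` and its `alpha` argument are hypothetical, introduced here only for illustration.

```python
import scipy.stats

def t_star(n, alpha=0.05):
    """Value that a Student's t draw with n - 1 degrees of freedom
    exceeds with probability alpha (the upper-tail critical value)."""
    return scipy.stats.t.ppf(1 - alpha, df=n - 1)

# The cutoff shrinks toward the Normal value (about 1.645) as n grows.
for n in (2, 10, 200):
    print(n, t_star(n))
```

The survival function gives a quick sanity check: `scipy.stats.t.sf(t_star(n), df=n - 1)` should return 0.05 up to floating-point error.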

What does that "approximately" mean? You have the tools to find out.

(a) For values of $n$ between $n=2$ and $n=200$, draw a sample of size $n$ from a Normal(mean=0, sd=1) distribution, and compute the $t$ statistic. Do this 100,000 times for each $n$ and report what proportion of the time these values are larger than $t_*(n)$. You should get values pretty close to 0.05 for all values of $n$.

(b) Do the same with an Exponential(1) distribution (remember to subtract $\mu=1$ when computing the $t$ statistic).

(c) Now, do the same as in part (a) but with $\mu=2$.

(d) Explain the practical consequence of (b) and (c) for someone who does a lot of $t$ tests: which one tells you about false positive rates, and which one tells you about statistical power?

Note: in computing the $t$ statistic, be sure to use np.std(..., ddof=1)!

Note: To get $t_*(n)$ use scipy.stats.t.ppf().
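One possible way to organize the simulation is sketched below for a single sample size and the Normal case only. The helper `t_statistics`, the choice `n = 10`, the reduced replicate count `num_reps`, and the fixed seed are all assumptions made for illustration; the assignment asks for more values of $n$ and more replicates.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(123)  # fixed seed so the sketch is reproducible

def t_statistics(samples, mu):
    """t statistic of each row of `samples`, using the true mean `mu`.

    `samples` has shape (number of replicates, n).
    """
    n = samples.shape[1]
    means = np.mean(samples, axis=1)
    sds = np.std(samples, axis=1, ddof=1)  # ddof=1 gives the sample SD
    return (means - mu) / (sds / np.sqrt(n))

n = 10             # one illustrative sample size
num_reps = 20_000  # fewer replicates than the assignment asks for

# Each row is one simulated sample of size n from Normal(mean=0, sd=1).
samples = rng.normal(loc=0, scale=1, size=(num_reps, n))
t_vals = t_statistics(samples, mu=0)

# Cutoff exceeded by a t(n-1) draw with probability 0.05.
t_crit = scipy.stats.t.ppf(0.95, df=n - 1)
prop_exceeding = np.mean(t_vals > t_crit)
print(f"n = {n}: proportion above cutoff = {prop_exceeding:.4f}")
```

Generating all replicates as rows of one array avoids a Python loop over replicates; looping over values of $n$ and swapping in a different sampling distribution are left as written in the questions.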

3. Imaginary data

Make up a situation in which we'd have measured at least 3 quantitative variables in at least 500 observations. You should have some positively correlated pairs of variables and some negatively correlated pairs. It does not have to be realistic or serious.

(a) Describe it in words.

(b) Simulate some data that looks at least roughly like what you'd expect real data to look like.

(c) Make plots of the data: histograms of each variable, and scatter plots of each pair of variables.

(d) Compute the correlations between each of your simulated variables (with np.corrcoef()) and explain why correlations are positive or negative.

Note: By "looks at least roughly like you'd expect", I mean that variables should be in real units and not totally unreasonable values. So, counts should actually be integers, weights should not be negative numbers, et cetera. For instance, if one of your variables is "number of pieces of candy obtained by a trick-or-treater", then these should be nonnegative integers, and should not be in the millions. (If it's in the thousands, that's probably not realistic, but close enough.)
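To illustrate the kind of simulation meant here, the sketch below invents a trick-or-treating scenario. Every variable name and every numerical choice (the range of hours, 30 pieces of candy per hour, the sleep model) is a made-up assumption for illustration, not a required or recommended answer.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible
num_kids = 500

# Hours spent trick-or-treating (hours): between half an hour and 3 hours.
hours_out = rng.uniform(0.5, 3.0, size=num_kids)

# Pieces of candy (count): a Poisson count whose mean grows with time out,
# so values are nonnegative integers.
candy = rng.poisson(lam=30 * hours_out)

# Sleep that night (hours): kids who stay out longer sleep less, plus noise.
# Clipped at zero so sleep can never be negative.
sleep_hours = np.clip(
    10 - hours_out + rng.normal(0, 0.5, size=num_kids), 0, None
)

# Rows of the input are variables, so this returns a 3 x 3 matrix.
corr = np.corrcoef([hours_out, candy, sleep_hours])
print(np.round(corr, 2))
```

Here candy is positively correlated with time out because its mean is built from `hours_out`, while sleep is negatively correlated with both because it decreases with `hours_out`.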