Homework 5: Hypotheses, and hypotheticals.
Instructions: Please answer the following questions and submit your work by editing this Jupyter notebook and submitting it on Canvas. Questions may involve math, programming, or neither, but you should make sure to explain your work: i.e., you should usually have a cell with at least a few sentences explaining what you are doing.
Also, please be sure to always specify units of any quantities that have units, and label axes of plots (again, with units when appropriate).
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng()
1. Rain
Is it rainier in Eugene or Springfield?
In data/eug_spr_rain.csv
you'll find data on daily rainfall, in inches,
at NOAA weather stations in Eugene on Queens East Street
and on Dixie Drive in Springfield.
You can read in the data as follows:
import pandas as pd
rain = pd.read_csv("data/eug_spr_rain.csv").set_index("date")
(a) Look at the data: make histograms of the daily totals in each location, and a scatter plot of the two locations against each other.
(b) Compute the daily difference (Eugene minus Springfield) in rainfall, and make a histogram of that.
(c) On what proportion of the days did it rain more in Eugene than Springfield? How about more in Springfield than in Eugene?
(d) Compute the $t$ statistic for the Eugene minus Springfield difference, and get a $p$-value for the two-sided test (i.e., the probability that a random variable with the $t$ distribution is larger in absolute value than the number you calculated).
(e) What is your conclusion? Write a few sentences reporting the results, including the statistical tests and real-world interpretations. Be sure to include takeaways and context (e.g., what was the average rainfall?).
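As a rough illustration of the mechanics in parts (c) and (d), here is a minimal sketch on made-up paired measurements, not the rainfall data. The names site_a and site_b, the simulated values, and the use of scipy.stats are assumptions made for this example only; the sketch also assumes the degrees of freedom are the number of paired days minus one.

```python
import numpy as np
from scipy import stats

# Hypothetical paired daily measurements in inches (invented, NOT the real data).
rng = np.random.default_rng(seed=1)
site_a = rng.exponential(scale=0.2, size=200)
site_b = np.clip(site_a + rng.normal(loc=0.0, scale=0.05, size=200), 0, None)

diff = site_a - site_b          # daily difference, in inches
n = diff.size

# Proportion of days on which site A recorded more than site B.
prop_a_more = np.mean(diff > 0)

# t statistic: mean difference divided by its standard error
# (sample SD with the n-1 denominator, over sqrt(n)).
t_stat = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))

# Two-sided p-value: P(|T| > |t_stat|) for T with a t distribution, n-1 df.
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 1)

print(f"proportion A > B: {prop_a_more:.2f}")
print(f"t = {t_stat:.3f}, two-sided p = {p_value:.3f}")
```

One way to check a hand computation like this is to compare it with scipy.stats.ttest_1samp(diff, 0.0), which should report the same statistic and p-value.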
2. Imaginary data
Make up a situation in which we'd have measured at least 3 quantitative variables in at least 500 observations. You should have some positively correlated pairs of variables and some negatively correlated pairs. It does not have to be realistic or serious.
(a) Describe it in words.
(b) Simulate some data that looks at least roughly like what you'd expect real data to look like.
(c) Make plots of the data: histograms of each variable, and scatter plots of each pair of variables.
(d) Compute the correlations between each pair of your simulated variables (with np.corrcoef()) and explain why the correlations are positive or negative.
Note: By "looks at least roughly like you'd expect", I mean that variables should be in real units and should not take totally unreasonable values. So, counts should actually be integers, weights should not be negative numbers, et cetera. For instance, if one of your variables is "number of pieces of candy obtained by a trick-or-treater", then these should be nonnegative integers, and should not be in the millions. (If it's in the thousands, that's probably not realistic, but close enough.)
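To show the general shape of such a simulation, here is a minimal sketch built on the trick-or-treater example. The scenario, the variable names (hours, candy, homework), and every numerical value are invented for illustration; your own situation should be different.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n = 500  # number of (hypothetical) trick-or-treaters

# Hours spent trick-or-treating: between half an hour and three hours.
hours = rng.uniform(0.5, 3.0, size=n)

# Pieces of candy: a nonnegative integer count whose mean grows with
# time spent (assumed 20 pieces per hour on average).
candy = rng.poisson(lam=20 * hours)

# Minutes of homework done that evening: decreases with hours out,
# plus noise, and clipped at zero so it is never negative.
homework = np.clip(90 - 25 * hours + rng.normal(0, 10, size=n), 0, None)

# np.corrcoef treats each row as one variable, so stack them as rows.
corr = np.corrcoef(np.vstack([hours, candy, homework]))
print(np.round(corr, 2))
```

In this sketch, hours and candy come out positively correlated because the expected candy count is proportional to hours, while homework is negatively correlated with both because its mean decreases as hours increases.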