import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (15, 8)
import numpy as np
import pandas as pd
from dsci345 import pretty
rng = np.random.default_rng()
$$\renewcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\text{var}} \newcommand{\sd}{\text{sd}} \newcommand{\cov}{\text{cov}} \newcommand{\cor}{\text{cor}}$$
This is here so we can use \P
and \E
and \var
and \cov
and \cor
and \sd
in LaTeX below.
Prediction and Estimation¶
Suppose I've collected data on rent in Eugene. Consider the following questions:
- How much does rent cost in the Bethel neighborhood this year?
- How much is rent changing by, per year?
Question: What would good answers to these questions look like?
We could answer these questions by
- reporting empirical averages from the actual data
- fitting a black-box predictor and asking it questions
- fitting a model with explicit terms for neighborhood and year
(Discuss pros and cons and how these would actually work.)
A third question¶
- How sure are you about those answers?
- How much does rent cost in the Bethel neighborhood this year?
Note that this has two notions of uncertainty: rents come in a range of values, but also there will be uncertainty in our estimates of that range.
Uncertainty, by math¶
Reminder: confidence intervals¶
If we have $n$ independent samples from some source, and the sample mean and SD are $\bar x$ and $s$ respectively, then a 95% confidence interval is $$ \bar x - t_n s / \sqrt{n} \qquad \text{to} \qquad \bar x + t_n s / \sqrt{n} , $$ where $t_n$ is the 97.5% quantile for the $t$ distribution with $n-1$ degrees of freedom.
Said another way: $\bar x$ is our estimate of the mean, and math gives us an estimate of the uncertainty behind our estimate.
- How much does rent cost in the Bethel neighborhood this year?
(Take the mean of the rents in Bethel this year.)
- How much is rent changing by, per year?
(Take the mean difference in rent of places that go up for rent in adjacent years.)
- How sure are you about those answers?
(Get 95% CIs for them.)
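For instance, here's a minimal sketch of computing such a $t$-based 95% interval, with a handful of made-up rents (it uses scipy, which isn't loaded above, for the $t$ quantile):
from scipy import stats

x = np.array([700., 650., 820., 540., 760., 690., 710., 600.])  # made-up rents
n = len(x)
xbar, s = x.mean(), x.std(ddof=1)     # sample mean and SD
tq = stats.t.ppf(0.975, df=n - 1)     # 97.5% quantile of the t distribution with n-1 df
(xbar - tq * s / np.sqrt(n), xbar + tq * s / np.sqrt(n))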
A model¶
For example: suppose we fit a model where the price of the $i^\text{th}$ apartment is $$\begin{aligned} p_i &= \beta_{n_i} + \beta_\text{year} y_i + \beta_\text{area} a_i + \epsilon_i \\ \epsilon_i &\sim \text{Normal}(0, \sigma) , \end{aligned}$$ where
- $p_i$ is the price (in dollars),
- $n_i$ is the neighborhood,
- $\beta_n$ is the mean price in neighborhood $n$,
- $y_i$ is the year,
- $\beta_\text{year}$ is the amount that rent changes by, per year, on average,
- $a_i$ is the area in square feet,
- $\beta_\text{area}$ is the amount that rent goes up per additional square foot, on average,
- $\epsilon_i$ is the residual, and
- $\sigma$ is how much otherwise similar apartments differ in price.
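So, for example, the model says that a 600 ft${}^2$ apartment in the river neighborhood rented in year $y$ has expected price $\beta_\text{river} + \beta_\text{year}\, y + 600\, \beta_\text{area}$, with actual prices scattered around that value with SD $\sigma$.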
- How much does rent cost in the Bethel neighborhood this year?
(Our estimate of $\beta_\text{Bethel}$ plus terms for year and area.)
- How much is rent changing by, per year?
(Our estimate of $\beta_\text{year}$.)
- How sure are you about those answers?
???
First, let's read in the data. (Note: it's fake data.) (github link)
df = pd.read_csv("data/housing.csv", index_col=0)
df
|     | hood      | year | area  | rent  |
|-----|-----------|------|-------|-------|
| 0   | river     | 2014 | 489.0 | 942.0 |
| 1   | bethel    | 2010 | 610.0 | 370.0 |
| 2   | churchill | 2017 | 385.0 | 669.0 |
| 3   | bethel    | 2012 | 757.0 | 624.0 |
| 4   | bethel    | 2019 | 504.0 | 990.0 |
| ... | ...       | ...  | ...   | ...   |
| 395 | bethel    | 2012 | 238.0 | 579.0 |
| 396 | campus    | 2016 | 287.0 | 759.0 |
| 397 | churchill | 2016 | 526.0 | 679.0 |
| 398 | campus    | 2014 | 418.0 | 868.0 |
| 399 | bethel    | 2018 | 386.0 | 983.0 |

400 rows × 4 columns
Next, let's shift the variables, so that for instance $\beta_\text{Bethel}$ is the average price of a 500 ft${}^2$ apartment in 2010 (instead of a 0 ft${}^2$ apartment in the year 0):
df['area'] -= 500
df['year'] -= 2010
Now, fit the model, using patsy to make the design matrix (see board):
import patsy
outcome, predictors = patsy.dmatrices("rent ~ 0 + hood + year + area", df)
fit = np.linalg.lstsq(predictors, outcome, rcond=None)
estimates = {
    k.removeprefix("hood[").removesuffix("]"): v   # e.g. 'hood[bethel]' -> 'bethel'
    for k, v in zip(predictors.design_info.column_names, fit[0].ravel())
}
estimates
{'bethel': np.float64(556.0088681522523), 'campus': np.float64(649.8851743401467), 'churchill': np.float64(584.0968576228599), 'river': np.float64(610.4626011683347), 'whittaker': np.float64(643.2011227181563), 'year': np.float64(31.21843122618344), 'area': np.float64(0.10063620684218222)}
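So, for instance, we estimate that a 500 ft${}^2$ apartment in Bethel rented for about \$556 in 2010, and that rents go up by about \$31 per year and about \$0.10 per additional square foot.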
np.asarray(predictors)[:10,:]
array([[   0.,    0.,    0.,    1.,    0.,    4.,  -11.],
       [   1.,    0.,    0.,    0.,    0.,    0.,  110.],
       [   0.,    0.,    1.,    0.,    0.,    7., -115.],
       [   1.,    0.,    0.,    0.,    0.,    2.,  257.],
       [   1.,    0.,    0.,    0.,    0.,    9.,    4.],
       [   0.,    0.,    0.,    0.,    1.,    5.,  -16.],
       [   0.,    0.,    0.,    1.,    0.,    6.,   93.],
       [   1.,    0.,    0.,    0.,    0.,    4., -196.],
       [   1.,    0.,    0.,    0.,    0.,    3.,  163.],
       [   0.,    1.,    0.,    0.,    0.,    2.,   81.]])
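Each row of the design matrix has five 0/1 neighborhood indicator columns (bethel, campus, churchill, river, whittaker, in that order), followed by the shifted year and area.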
df
|     | hood      | year | area   | rent  |
|-----|-----------|------|--------|-------|
| 0   | river     | 4    | -11.0  | 942.0 |
| 1   | bethel    | 0    | 110.0  | 370.0 |
| 2   | churchill | 7    | -115.0 | 669.0 |
| 3   | bethel    | 2    | 257.0  | 624.0 |
| 4   | bethel    | 9    | 4.0    | 990.0 |
| ... | ...       | ...  | ...    | ...   |
| 395 | bethel    | 2    | -262.0 | 579.0 |
| 396 | campus    | 6    | -213.0 | 759.0 |
| 397 | churchill | 6    | 26.0   | 679.0 |
| 398 | campus    | 4    | -82.0  | 868.0 |
| 399 | bethel    | 8    | -114.0 | 983.0 |

400 rows × 4 columns
Uncertainty, by simulation¶
Okay, so we'd like to know how far off our guesses of $\beta_\text{Bethel}$ or $\beta_\text{area}$ are from "the truth".
Well, one way to do this is to simulate data where we know the truth, and then see how far off our guesses are. If the simulations describe the real data well, then this should give us a good estimate.
This is sometimes known as the parametric bootstrap. (We'll meet its more famous cousin, "the" bootstrap, next.) We actually did it last class. Let's do it again!
Okay, from our "real data" we estimated a certain set of parameter values. To make the simulated datasets look as much like the real data as possible, we should simulate using those estimated parameter values.
def sim_rents(df, beta):
    # TODO: estimate sigma from the data; scale=150 is a guess at the residual SD
    sim_df = df.copy()
    sim_df['rent'] = np.round(
        np.array([beta[h] for h in sim_df.hood])   # neighborhood effect
        + beta['year'] * sim_df.year               # yearly trend
        + beta['area'] * sim_df.area               # area effect
        + rng.normal(scale=150, size=len(df))      # Normal(0, sigma) noise
    )
    return sim_df
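(About that TODO: one natural way to fill it in is to estimate $\sigma$ by the SD of the residuals of the fit; here's a sketch, reusing the outcome, predictors, and fit objects from above.)
resids = np.asarray(outcome).ravel() - np.asarray(predictors) @ fit[0].ravel()
np.std(resids)  # rough estimate of sigma; could adjust ddof for the 7 fitted coefficients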
def fit_model(df):
    # Fit the model by least squares; return {coefficient name: estimate}.
    outcome, predictors = patsy.dmatrices("rent ~ 0 + hood + year + area", df)
    fit = np.linalg.lstsq(predictors, outcome, rcond=None)
    estimates = {
        k.removeprefix("hood[").removesuffix("]"): v
        for k, v in zip(predictors.design_info.column_names, fit[0].ravel())
    }
    return estimates
estimates
{'bethel': np.float64(556.0088681522523), 'campus': np.float64(649.8851743401467), 'churchill': np.float64(584.0968576228599), 'river': np.float64(610.4626011683347), 'whittaker': np.float64(643.2011227181563), 'year': np.float64(31.21843122618344), 'area': np.float64(0.10063620684218222)}
experiments = pd.DataFrame([
fit_model(sim_rents(df, beta=estimates))
for _ in range(1000)
])
experiments
|     | bethel     | campus     | churchill  | river      | whittaker  | year      | area     |
|-----|------------|------------|------------|------------|------------|-----------|----------|
| 0   | 575.902510 | 642.604986 | 592.798342 | 608.078287 | 697.180635 | 28.779585 | 0.113843 |
| 1   | 557.178776 | 660.341787 | 586.334886 | 628.076765 | 644.009012 | 31.364434 | 0.035168 |
| 2   | 563.137220 | 668.938388 | 582.832586 | 630.124568 | 644.535599 | 32.107249 | 0.092571 |
| 3   | 544.430006 | 648.920584 | 595.158346 | 598.784393 | 629.969406 | 32.162263 | 0.145045 |
| 4   | 555.357145 | 659.746291 | 606.171082 | 628.264967 | 642.935342 | 29.826930 | 0.095766 |
| ... | ...        | ...        | ...        | ...        | ...        | ...       | ...      |
| 995 | 544.890722 | 634.565771 | 574.393009 | 595.705698 | 639.830165 | 33.709383 | 0.050875 |
| 996 | 562.275564 | 655.898478 | 587.764530 | 631.991290 | 635.503826 | 31.053371 | 0.188562 |
| 997 | 552.190036 | 666.794021 | 546.301255 | 605.447758 | 650.142317 | 31.925727 | 0.185316 |
| 998 | 587.795985 | 704.718662 | 597.207021 | 607.655886 | 653.100338 | 27.555325 | 0.115970 |
| 999 | 553.359636 | 652.144653 | 580.140778 | 664.082251 | 679.938914 | 26.670145 | 0.098013 |

1000 rows × 7 columns
plt.hist(experiments.bethel)
plt.xlabel("mean bethel rent"); plt.ylabel("frequency");
plt.axvline(estimates['bethel'], color='red');
Conclusion: The mean rent in Bethel for 500 ft${}^2$ apartments in 2010 was about \$556, with a 95% confidence interval of about \$518 to \$595.
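(Here the interval is just the middle 95% of the simulated estimates:)
np.quantile(experiments.bethel, [0.025, 0.975])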
Widths of other 95% confidence intervals are below.
estimates
{'bethel': np.float64(556.0088681522523), 'campus': np.float64(649.8851743401467), 'churchill': np.float64(584.0968576228599), 'river': np.float64(610.4626011683347), 'whittaker': np.float64(643.2011227181563), 'year': np.float64(31.21843122618344), 'area': np.float64(0.10063620684218222)}
experiments.quantile(0.975) - experiments.quantile(0.025)
bethel       77.541967
campus       85.234240
churchill    81.627153
river        81.111593
whittaker    81.389827
year         10.005483
area          0.205705
dtype: float64
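So, for instance, our estimate of about \$31/year for the yearly change in rent is good to within roughly $\pm\$5$.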
Uncertainty, by our bootstraps¶
However, sometimes we don't have (or can't be bothered to make) a good generative model to simulate from. Or we tried and it was hard to get realistic-looking data. What to do then?
Well, what we'd like to do is to draw a bunch of new datasets from wherever we got the first one.
This seems unrealistic... but our best guess at what those new datasets would look like, if we could get them, is our dataset itself!
The bootstrap¶
Suppose we have a method that estimates something. To get an estimate of the uncertainty of that estimate:
1. Obtain a new dataset, of the same size as the original, by resampling observations, with replacement, from the original dataset.
2. Apply the method to this 'bootstrapped' dataset.
3. Repeat lots of times.
Then, the middle 95% of the estimates gives you the "bootstrap 95% confidence interval".
Exercise:¶
Here are 10 numbers.
array([176, 255, 53, 94, 61, 119, 42, 109, 0, 30])
- Compute the median.
- 1,000 times, resample 10 numbers, with replacement, from this list of numbers, and compute the median.
- Make a histogram of those 1,000 medians, with the original median as a vertical line.
# IN CLASS
x = np.array([176, 255, 53, 94, 61, 119, 42, 109, 0, 30])
med_x = np.median(x)
boot_meds = [np.median(rng.choice(x, size=10, replace=True)) for _ in range(1000)]
fig, ax = plt.subplots()
ax.hist(boot_meds, bins=pretty(boot_meds, 20))
ax.axvline(med_x, c='red');
# 95% confidence interval
(np.quantile(boot_meds, 0.025), np.quantile(boot_meds, 0.975))
(np.float64(41.5), np.float64(142.5))
Here's a (very) simple method to obtain a bootstrap resampled version of a data frame:
def bootstrap(df):
    # Resample n rows, with replacement, by integer label.
    n = df.shape[0]
    return df.loc[rng.choice(n, n)]
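Note that this uses .loc with integer labels, so it relies on the data frame having the default index 0, ..., n-1 (as ours does); with an arbitrary index, something like df.iloc[rng.choice(n, n)] would be safer.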
And, here, let's apply the fit_model method above to 1000 of these.
boots = pd.DataFrame([
fit_model(bootstrap(df))
for _ in range(1000)
])
boots
|     | bethel     | campus     | churchill  | river      | whittaker  | year      | area     |
|-----|------------|------------|------------|------------|------------|-----------|----------|
| 0   | 562.472230 | 607.100486 | 565.395919 | 590.802400 | 666.817819 | 31.778218 | 0.126840 |
| 1   | 529.192742 | 640.495432 | 593.828156 | 620.172014 | 627.839688 | 30.404554 | 0.155942 |
| 2   | 574.811506 | 655.464745 | 606.884968 | 649.632581 | 665.138330 | 29.370616 | 0.024600 |
| 3   | 532.740067 | 631.887855 | 583.826112 | 618.851433 | 663.472104 | 29.516356 | 0.124679 |
| 4   | 562.512608 | 666.967288 | 598.311342 | 623.518098 | 639.886522 | 29.613112 | 0.078457 |
| ... | ...        | ...        | ...        | ...        | ...        | ...       | ...      |
| 995 | 569.668469 | 658.942641 | 574.674532 | 606.367452 | 622.423225 | 29.863527 | 0.136098 |
| 996 | 567.254152 | 662.668325 | 549.584022 | 588.496222 | 639.206063 | 34.323386 | 0.024327 |
| 997 | 559.954882 | 638.365579 | 599.215047 | 605.086299 | 663.665607 | 32.544599 | 0.088130 |
| 998 | 566.461808 | 682.573878 | 573.381698 | 613.338983 | 643.444693 | 31.993623 | 0.042107 |
| 999 | 564.783341 | 675.236614 | 576.947170 | 640.135260 | 628.659703 | 29.753753 | 0.149073 |

1000 rows × 7 columns
And, here is the distribution of estimates of $\beta_\text{Bethel}$. It is similar, but not identical, to our other distribution above (as expected).
plt.hist(boots.bethel)
plt.xlabel("mean bethel rent"); plt.ylabel("frequency");
plt.axvline(estimates['bethel'], color='red');
Conclusion: The mean rent in Bethel for 500 ft${}^2$ apartments in 2010 was about \$556, with a 95% confidence interval of about \$518 to \$595.
Widths of other 95% confidence intervals are below.
boots.quantile(0.975) - boots.quantile(0.025)
bethel       81.652334
campus       85.080632
churchill    82.110892
river        78.229027
whittaker    73.397059
year         10.116929
area          0.189463
dtype: float64
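Reassuringly, these are quite close to the widths we got from the parametric bootstrap above.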