Hello again, Uncertainty¶

Peter Ralph

https://uodsci.github.io/dsci345

In [1]:
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (15, 8)
import numpy as np
import pandas as pd
from dsci345 import pretty

rng = np.random.default_rng()

$$\renewcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\text{var}} \newcommand{\sd}{\text{sd}} \newcommand{\cov}{\text{cov}} \newcommand{\cor}{\text{cor}}$$ This is here so we can use \P and \E and \var and \cov and \cor and \sd in LaTeX below.

Prediction and Estimation¶

Suppose I've collected data on rent in Eugene. Consider the following questions:

  1. How much does rent cost in the Bethel neighborhood this year?
  2. How much is rent changing by, per year?

Question: What would good answers to these questions look like?

We could answer these questions by

  • reporting empirical averages from the actual data
  • fitting a black-box predictor and asking it questions
  • fitting a model with explicit terms for neighborhood and year

(Discuss pros and cons and how these would actually work.)

A third question¶

  1. How sure are you about those answers?
  1. How much does rent cost in the Bethel neighborhood this year?

Note that this has two notions of uncertainty: rents come in a range of values, but also there will be uncertainty in our estimates of that range.

Uncertainty, by math¶

Reminder: confidence intervals¶

If we have $n$ independent samples from some source, and the sample mean and SD are $\bar x$ and $s$ respectively, then a 95% confidence interval is $$ \bar x - t_n s / \sqrt{n} \qquad \text{to} \qquad \bar x + t_n s / \sqrt{n} , $$ where $t_n$ is the 97.5% quantile for the $t$ distribution with $n-1$ degrees of freedom.

Said another way: $\bar x$ is our estimate of the mean, and math gives us an estimate of the uncertainty behind our estimate.

  1. How much does rent cost in the Bethel neighborhood this year?

(Take the mean of the rents in Bethel this year.)

  1. How much is rent changing by, per year?

(Take the mean difference in rent of places that go up for rent in adjacent years.)

  1. How sure are you about those answers?

(Get 95% CIs for them.)

A model¶

For example: suppose we fit a model where the price of the $i^\text{th}$ apartment is $$\begin{aligned} p_i &= \beta_{n_i} + \beta_\text{year} y_i + \beta_\text{area} a_i + \epsilon_i \\ \epsilon_i &\sim \text{Normal}(0, \sigma) , \end{aligned}$$ where

  • $p_i$ is the price (in dollars)
  • $n_i$ is the neighborhood,
  • $\beta_n$ is the mean price in neighborhood $n$
  • $y_i$ is the year
  • $\beta_\text{year}$ is the amount that rent changes by, per year, on average
  • $a_i$ is the area in square feet
  • $\beta_\text{area}$ is the amount that rent goes up per additional square foot, on average
  • $\epsilon_i$ is the residual, and
  • $\sigma$ is how much otherwise similar apartments differ in price
  1. How much does rent cost in the Bethel neighborhood this year?

(Our estimate of $\beta_\text{Bethel}$ plus terms for year and area.)

  1. How much is rent changing by, per year?

(Our estimate of $\beta_\text{year}$.)

  1. How sure are you about those answers?

???

Aside: how do we fit this model?¶

We'll do it in two lines, with patsy and np.linalg.lstsq. (Reminder: since the residuals are Normal, least squares is the same as maximum likelihood.)

First, let's read in the data. (Note: it's fake data.) (github link)

In [2]:
df = pd.read_csv("data/housing.csv", index_col=0)
df
Out[2]:
hood year area rent
0 river 2014 489.0 942.0
1 bethel 2010 610.0 370.0
2 churchill 2017 385.0 669.0
3 bethel 2012 757.0 624.0
4 bethel 2019 504.0 990.0
... ... ... ... ...
395 bethel 2012 238.0 579.0
396 campus 2016 287.0 759.0
397 churchill 2016 526.0 679.0
398 campus 2014 418.0 868.0
399 bethel 2018 386.0 983.0

400 rows × 4 columns

Next, let's shift the variables, so that for instance $\beta_\text{Bethel}$ is the average price of a 500 ft${}^2$ apartment in 2010 (instead of a 0 ft${}^2$ apartment in the year 0):

In [3]:
df['area'] -= 500
df['year'] -= 2010

Now, fit the model, using patsy to make the design matrix (see board):

In [4]:
import patsy
outcome, predictors = patsy.dmatrices("rent ~ 0 + hood + year + area", df)
fit = np.linalg.lstsq(predictors, outcome, rcond=None)

estimates = { k.lstrip('hood[').rstrip("]") : v for k, v in zip(predictors.design_info.column_names, fit[0].ravel())}

estimates
Out[4]:
{'bethel': np.float64(556.0088681522523),
 'campus': np.float64(649.8851743401467),
 'churchill': np.float64(584.0968576228599),
 'river': np.float64(610.4626011683347),
 'whittaker': np.float64(643.2011227181563),
 'year': np.float64(31.21843122618344),
 'area': np.float64(0.10063620684218222)}

What's happening under the hood? Compare these:

In [5]:
np.asarray(predictors)[:10,:]
Out[5]:
array([[   0.,    0.,    0.,    1.,    0.,    4.,  -11.],
       [   1.,    0.,    0.,    0.,    0.,    0.,  110.],
       [   0.,    0.,    1.,    0.,    0.,    7., -115.],
       [   1.,    0.,    0.,    0.,    0.,    2.,  257.],
       [   1.,    0.,    0.,    0.,    0.,    9.,    4.],
       [   0.,    0.,    0.,    0.,    1.,    5.,  -16.],
       [   0.,    0.,    0.,    1.,    0.,    6.,   93.],
       [   1.,    0.,    0.,    0.,    0.,    4., -196.],
       [   1.,    0.,    0.,    0.,    0.,    3.,  163.],
       [   0.,    1.,    0.,    0.,    0.,    2.,   81.]])
In [6]:
df
Out[6]:
hood year area rent
0 river 4 -11.0 942.0
1 bethel 0 110.0 370.0
2 churchill 7 -115.0 669.0
3 bethel 2 257.0 624.0
4 bethel 9 4.0 990.0
... ... ... ... ...
395 bethel 2 -262.0 579.0
396 campus 6 -213.0 759.0
397 churchill 6 26.0 679.0
398 campus 4 -82.0 868.0
399 bethel 8 -114.0 983.0

400 rows × 4 columns

Estimating sigma¶

We'd also like to get an estimate of the residual SD. There are fancier ways to do this, but this is pretty general-purpose:

In [7]:
df['estimated_rent'] = predictors.dot(fit[0])
resids = df['rent'] - df['estimated_rent']
sigma_hat = np.std(resids)
print(f"Estimated residual SD: $ {sigma_hat:.0f}")
Estimated residual SD: $ 152

Uncertainty, by simulation¶

Okay, so we'd like to know how far off our guesses of $\beta_\text{Bethel}$ or $\beta_a$ are from "the truth".

Well, one way to do this is to simulate data where we know the truth, and then see how far off our guesses are. If the simulations describe the real data well, then this should give us a good estimate.

This is sometimes known as the parametric bootstrap. (We'll meet it's more famous cousin, "the" bootstrap, next.)

Okay, from our "real data" we estimated a certain set of estimated parameter values. To make the simulated datasets look as much like the real data as possible, we should simulate using those parameter values.

In [8]:
def sim_rents(df, beta):
    sim_df = df.copy()
    sim_df['rent'] = np.round(
        np.array([beta[h] for h in sim_df.hood]) + beta['year'] * sim_df.year
        + beta['area'] * sim_df.area
        + rng.normal(scale=sigma_hat, size=len(df))
    )
    return sim_df
In [9]:
def fit_model(df):
    outcome, predictors = patsy.dmatrices("rent ~ 0 + hood + year + area", df)
    fit = np.linalg.lstsq(predictors, outcome, rcond=None)
    estimates = { k.lstrip('hood[').rstrip("]") : v for k, v in zip(predictors.design_info.column_names, fit[0].ravel())}
    return estimates
In [10]:
estimates
Out[10]:
{'bethel': np.float64(556.0088681522523),
 'campus': np.float64(649.8851743401467),
 'churchill': np.float64(584.0968576228599),
 'river': np.float64(610.4626011683347),
 'whittaker': np.float64(643.2011227181563),
 'year': np.float64(31.21843122618344),
 'area': np.float64(0.10063620684218222)}
In [11]:
experiments = pd.DataFrame([
    fit_model(sim_rents(df, beta=estimates))
    for _ in range(1000)
])
experiments
Out[11]:
bethel campus churchill river whittaker year area
0 547.406421 670.609800 600.753279 583.564215 617.367968 33.504621 0.057606
1 530.424771 650.253583 559.307861 591.066776 641.236669 34.495097 0.034868
2 574.834840 646.729395 592.010077 609.810278 629.972488 29.652640 0.106134
3 544.165310 645.741261 534.525703 591.968196 614.144066 34.066007 0.152077
4 533.882676 661.595462 570.344993 566.978911 627.848782 35.211414 0.083685
... ... ... ... ... ... ... ...
995 565.566020 667.794611 554.681243 620.752172 643.050331 28.409680 0.167910
996 549.099752 641.988149 577.999852 587.271058 629.994196 35.448140 0.106501
997 572.786680 653.304288 586.770003 621.779600 648.146958 29.950980 0.113830
998 576.160035 652.786124 600.535508 598.441471 636.904570 30.336501 0.128517
999 565.139001 653.583382 609.157945 644.094537 660.257037 31.458917 0.001970

1000 rows × 7 columns

In [12]:
plt.hist(experiments.bethel)
plt.xlabel("mean bethel rent"); plt.ylabel("frequency");
plt.axvline(estimates['bethel'], color='red');
No description has been provided for this image

Conclusion: The mean rent in Bethel for 500 ft${}^2$ apartments in 2010 as about \$556, with a 95% confidence interval of about \$518 to \$595.

Widths of other 95% confidence intervals are below.

In [13]:
estimates
Out[13]:
{'bethel': np.float64(556.0088681522523),
 'campus': np.float64(649.8851743401467),
 'churchill': np.float64(584.0968576228599),
 'river': np.float64(610.4626011683347),
 'whittaker': np.float64(643.2011227181563),
 'year': np.float64(31.21843122618344),
 'area': np.float64(0.10063620684218222)}

Uncertainty, by our bootstraps¶

However, sometimes we don't have (or can't be bothered to make) a good generative model to simulate from. Or we tried and it was hard to get realistic-looking data. What to do then?

Well, what we'd like to do is to draw a bunch of new datasets from wherever we got the first one.

This seems unrealistic... but, our best guess at what that would be like if we could... is our dataset itself!

The bootstrap¶

Suppose we have a method that estimates something. To get an estimate of uncertainty of that estimate,

  1. Obtain a new dataset, of the same size as the original, by resampling with replacement observations from the original dataset.

  2. Apply the method to this 'bootstrapped' dataset.

  3. Repeat lots of times.

Then, the middle 95% of the estimates gives you the "bootstrap 95% confidence interval".

Exercise:¶

Here are 10 numbers.

array([176, 255,  53,  94,  61, 119,  42, 109,   0,  30])
  1. Compute the median.
  2. 1,000 times, resample 10 numbers, with replacement, from this list of numbers, and compute the median.
  3. Make a histogram of those 1,000 medians, with the original median as a vertical line.
In [14]:
# IN CLASS
x = np.array([176, 255,  53,  94,  61, 119,  42, 109,   0,  30])
boots = np.array([np.median(rng.choice(x, size=10, replace=True)) for _ in range (1000)])
In [15]:
# 95% confidence interval
(np.quantile(boots, 0.025), np.quantile(boots, 0.975))
Out[15]:
(np.float64(42.0), np.float64(147.5))

Back to the rent data¶

Here's a (very) simple method to obtain a bootstrap resampled version of a data frame:

In [16]:
def bootstrap(df):
    n = df.shape[0]
    return df.loc[rng.choice(n, n)]

And, here, let's apply thefit_model method above to 1000 of these.

In [17]:
boots = pd.DataFrame([
    fit_model(bootstrap(df))
    for _ in range(1000)
])
boots
Out[17]:
bethel campus churchill river whittaker year area
0 535.786120 618.048422 558.560808 600.017486 602.712920 35.433554 0.095852
1 603.336925 669.249377 613.899895 645.548419 662.239795 26.206914 0.169840
2 590.096060 633.026026 617.391151 596.626279 599.380174 32.467683 0.096663
3 555.780373 638.342284 581.726375 587.133166 670.136499 29.930777 0.037896
4 553.686771 673.583554 596.207321 622.284485 650.321216 32.070465 0.161060
... ... ... ... ... ... ... ...
995 543.412261 660.448845 592.409876 592.717119 635.377252 33.029347 0.205894
996 549.497815 643.699130 596.472072 617.731849 646.693757 31.016663 0.072668
997 557.108581 634.641716 574.823951 594.471707 639.087387 29.494211 0.153454
998 567.694939 643.679695 570.951787 605.229010 615.965994 33.871298 0.083341
999 536.428807 621.251545 556.084364 588.359684 633.649054 33.473752 0.183038

1000 rows × 7 columns

And, here is the distribution of estimates of $\beta_\text{Bethel}$. It is similar, but not identical, to our other distribution above (as expected).

In [18]:
plt.hist(boots.bethel)
plt.xlabel("mean bethel rent"); plt.ylabel("frequency");
plt.axvline(estimates['bethel'], color='red');
No description has been provided for this image

Conclusion: The mean rent in Bethel for 500 ft${}^2$ apartments in 2010 as about \$556, with a 95% confidence interval of about \$518 to \$595.

Widths of other 95% confidence intervals are below.

In [19]:
boots.quantile(0.975) - boots.quantile(0.025)
Out[19]:
bethel       77.582244
campus       91.176056
churchill    80.555130
river        79.382906
whittaker    80.716595
year         10.297261
area          0.196454
dtype: float64