Hello again, Uncertainty¶

Peter Ralph

https://uodsci.github.io/dsci345

In [1]:
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (15, 8)
import numpy as np
import pandas as pd
from dsci345 import pretty

rng = np.random.default_rng()

$$\renewcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\text{var}} \newcommand{\sd}{\text{sd}} \newcommand{\cov}{\text{cov}} \newcommand{\cor}{\text{cor}}$$ This is here so we can use \P and \E and \var and \cov and \cor and \sd in LaTeX below.

Prediction and Estimation¶

Suppose I've collected data on rent in Eugene. Consider the following questions:

  1. How much does rent cost in the Bethel neighborhood this year?
  2. How much is rent changing by, per year?

Question: What would good answers to these questions look like?

We could answer these questions by

  • reporting empirical averages from the actual data
  • fitting a black-box predictor and asking it questions
  • fitting a model with explicit terms for neighborhood and year

(Discuss pros and cons and how these would actually work.)

A third question¶

  3. How sure are you about those answers?
  1. How much does rent cost in the Bethel neighborhood this year?

Note that this has two notions of uncertainty: rents come in a range of values, but also there will be uncertainty in our estimates of that range.

Uncertainty, by math¶

Reminder: confidence intervals¶

If we have $n$ independent samples from some source, and the sample mean and SD are $\bar x$ and $s$ respectively, then a 95% confidence interval is $$ \bar x - t_n s / \sqrt{n} \qquad \text{to} \qquad \bar x + t_n s / \sqrt{n} , $$ where $t_n$ is the 97.5% quantile for the $t$ distribution with $n-1$ degrees of freedom.

Said another way: $\bar x$ is our estimate of the mean, and math gives us an estimate of the uncertainty behind our estimate.
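As a quick check of the formula, here's a small sketch that computes this interval for simulated data. (It uses scipy.stats for the $t$ quantile; scipy isn't loaded above, so that import is an assumption, and the sample itself is made up.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
x = rng.normal(loc=600, scale=150, size=40)  # hypothetical sample of 40 rents

n = len(x)
xbar = x.mean()
s = x.std(ddof=1)                  # sample SD
t = stats.t.ppf(0.975, df=n - 1)   # 97.5% quantile of the t distribution, n-1 df
lo, hi = xbar - t * s / np.sqrt(n), xbar + t * s / np.sqrt(n)
```

This agrees with `stats.t.interval(0.95, n - 1, loc=xbar, scale=s / np.sqrt(n))`.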

  1. How much does rent cost in the Bethel neighborhood this year?

(Take the mean of the rents in Bethel this year.)

  2. How much is rent changing by, per year?

(Take the mean difference in rent of places that go up for rent in adjacent years.)

  3. How sure are you about those answers?

(Get 95% CIs for them.)

A model¶

For example: suppose we fit a model where the price of the $i^\text{th}$ apartment is $$\begin{aligned} p_i &= \beta_{n_i} + \beta_\text{year} y_i + \beta_\text{area} a_i + \epsilon_i \\ \epsilon_i &\sim \text{Normal}(0, \sigma) , \end{aligned}$$ where

  • $p_i$ is the price (in dollars)
  • $n_i$ is the neighborhood,
  • $\beta_n$ is the mean price in neighborhood $n$
  • $y_i$ is the year
  • $\beta_\text{year}$ is the amount that rent changes by, per year, on average
  • $a_i$ is the area in square feet
  • $\beta_\text{area}$ is the amount that rent goes up per additional square foot, on average
  • $\epsilon_i$ is the residual, and
  • $\sigma$ is how much otherwise similar apartments differ in price

  1. How much does rent cost in the Bethel neighborhood this year?

(Our estimate of $\beta_\text{Bethel}$ plus terms for year and area.)

  2. How much is rent changing by, per year?

(Our estimate of $\beta_\text{year}$.)

  3. How sure are you about those answers?

???

Aside: how do we fit this model?¶

We'll do it in two lines, with patsy and np.linalg.lstsq. (Reminder: since the residuals are Normal, least squares is the same as maximum likelihood.)
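Reminder of why these are the same: with Normal residuals, the log-likelihood of the model above is $$\log L(\beta, \sigma) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n \left( p_i - \beta_{n_i} - \beta_\text{year} y_i - \beta_\text{area} a_i \right)^2 ,$$ so for any fixed $\sigma$, maximizing the likelihood over the $\beta$s is the same as minimizing the sum of squared residuals.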

First, let's read in the data. (Note: it's fake data.) (github link)

In [2]:
df = pd.read_csv("data/housing.csv", index_col=0)
df
Out[2]:
hood year area rent
0 river 2014 489.0 942.0
1 bethel 2010 610.0 370.0
2 churchill 2017 385.0 669.0
3 bethel 2012 757.0 624.0
4 bethel 2019 504.0 990.0
... ... ... ... ...
395 bethel 2012 238.0 579.0
396 campus 2016 287.0 759.0
397 churchill 2016 526.0 679.0
398 campus 2014 418.0 868.0
399 bethel 2018 386.0 983.0

400 rows × 4 columns

Next, let's shift the variables, so that for instance $\beta_\text{Bethel}$ is the average price of a 500 ft${}^2$ apartment in 2010 (instead of a 0 ft${}^2$ apartment in the year 0):

In [3]:
df['area'] -= 500
df['year'] -= 2010

Now, fit the model, using patsy to make the design matrix (see board):

In [4]:
import patsy
outcome, predictors = patsy.dmatrices("rent ~ 0 + hood + year + area", df)
fit = np.linalg.lstsq(predictors, outcome, rcond=None)

# strip the "hood[...]" wrapper from the category column names
# (removeprefix/removesuffix, not lstrip/rstrip, which strip character *sets*)
estimates = { k.removeprefix('hood[').removesuffix("]") : v for k, v in zip(predictors.design_info.column_names, fit[0].ravel())}

estimates
Out[4]:
{'bethel': np.float64(556.0088681522523),
 'campus': np.float64(649.8851743401467),
 'churchill': np.float64(584.0968576228599),
 'river': np.float64(610.4626011683347),
 'whittaker': np.float64(643.2011227181563),
 'year': np.float64(31.21843122618344),
 'area': np.float64(0.10063620684218222)}
In [5]:
np.asarray(predictors)[:10,:]
Out[5]:
array([[   0.,    0.,    0.,    1.,    0.,    4.,  -11.],
       [   1.,    0.,    0.,    0.,    0.,    0.,  110.],
       [   0.,    0.,    1.,    0.,    0.,    7., -115.],
       [   1.,    0.,    0.,    0.,    0.,    2.,  257.],
       [   1.,    0.,    0.,    0.,    0.,    9.,    4.],
       [   0.,    0.,    0.,    0.,    1.,    5.,  -16.],
       [   0.,    0.,    0.,    1.,    0.,    6.,   93.],
       [   1.,    0.,    0.,    0.,    0.,    4., -196.],
       [   1.,    0.,    0.,    0.,    0.,    3.,  163.],
       [   0.,    1.,    0.,    0.,    0.,    2.,   81.]])
In [6]:
df
Out[6]:
hood year area rent
0 river 4 -11.0 942.0
1 bethel 0 110.0 370.0
2 churchill 7 -115.0 669.0
3 bethel 2 257.0 624.0
4 bethel 9 4.0 990.0
... ... ... ... ...
395 bethel 2 -262.0 579.0
396 campus 6 -213.0 759.0
397 churchill 6 26.0 679.0
398 campus 4 -82.0 868.0
399 bethel 8 -114.0 983.0

400 rows × 4 columns

Uncertainty, by simulation¶

Okay, so we'd like to know how far off our guesses of $\beta_\text{Bethel}$ or $\beta_\text{area}$ are from "the truth".

Well, one way to do this is to simulate data where we know the truth, and then see how far off our guesses are. If the simulations describe the real data well, then this should give us a good estimate.

This is sometimes known as the parametric bootstrap. (We'll meet its more famous cousin, "the" bootstrap, next.)

Okay: from our "real data" we estimated a set of parameter values. To make the simulated datasets look as much like the real data as possible, we should simulate using those estimated values.

In [7]:
def sim_rents(df, beta):
    # sigma is hardcoded at 150 here; TODO: estimate sigma from the residuals
    sim_df = df.copy()
    sim_df['rent'] = np.round(
        np.array([beta[h] for h in sim_df.hood]) + beta['year'] * sim_df.year
        + beta['area'] * sim_df.area
        + rng.normal(scale=150, size=len(df))
    )
    return sim_df
In [8]:
def fit_model(df):
    outcome, predictors = patsy.dmatrices("rent ~ 0 + hood + year + area", df)
    fit = np.linalg.lstsq(predictors, outcome, rcond=None)
    estimates = { k.removeprefix('hood[').removesuffix("]") : v for k, v in zip(predictors.design_info.column_names, fit[0].ravel())}
    return estimates
In [9]:
estimates
Out[9]:
{'bethel': np.float64(556.0088681522523),
 'campus': np.float64(649.8851743401467),
 'churchill': np.float64(584.0968576228599),
 'river': np.float64(610.4626011683347),
 'whittaker': np.float64(643.2011227181563),
 'year': np.float64(31.21843122618344),
 'area': np.float64(0.10063620684218222)}
In [10]:
experiments = pd.DataFrame([
    fit_model(sim_rents(df, beta=estimates))
    for _ in range(1000)
])
experiments
Out[10]:
bethel campus churchill river whittaker year area
0 533.492097 641.054411 577.588954 593.179317 638.185475 32.163165 0.032569
1 574.572277 667.759187 595.797701 647.783456 705.270210 27.760569 0.107175
2 553.549603 707.172589 567.599455 625.551145 645.908753 29.276295 0.045647
3 560.001821 640.772141 568.461616 601.715683 610.548337 33.429219 0.116936
4 551.271028 664.447607 582.445337 597.355046 671.718753 27.838210 0.178589
... ... ... ... ... ... ... ...
995 579.054218 662.593979 581.442602 600.067738 640.242324 29.303955 0.068018
996 551.156364 669.155511 611.647735 642.359660 665.796143 26.749064 0.095768
997 591.980817 610.074641 606.648483 612.052551 645.979702 29.191974 0.076391
998 570.263555 634.841921 563.774140 588.337918 640.850963 35.357078 0.050328
999 582.057310 642.819219 580.076694 617.251420 630.826407 32.823066 0.011303

1000 rows × 7 columns

In [11]:
plt.hist(experiments.bethel)
plt.xlabel("mean bethel rent"); plt.ylabel("frequency");
plt.axvline(estimates['bethel'], color='red');

Conclusion: The mean rent in Bethel for 500 ft${}^2$ apartments in 2010 was about \$556, with a 95% confidence interval of about \$518 to \$595.
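The interval here is just the middle 95% of the simulated estimates. A self-contained sketch of that computation (with stand-in normal draws in place of the experiments.bethel column from above):

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for experiments.bethel: 1000 simulated estimates of beta_bethel
sims = rng.normal(loc=556, scale=20, size=1000)
lo, hi = np.quantile(sims, [0.025, 0.975])  # endpoints of the middle 95%
```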

For reference, here are the original estimates again; 95% intervals for the other parameters can be read off of experiments in the same way.

In [12]:
estimates
Out[12]:
{'bethel': np.float64(556.0088681522523),
 'campus': np.float64(649.8851743401467),
 'churchill': np.float64(584.0968576228599),
 'river': np.float64(610.4626011683347),
 'whittaker': np.float64(643.2011227181563),
 'year': np.float64(31.21843122618344),
 'area': np.float64(0.10063620684218222)}

Uncertainty, by our bootstraps¶

However, sometimes we don't have (or can't be bothered to make) a good generative model to simulate from. Or we tried and it was hard to get realistic-looking data. What to do then?

Well, what we'd like to do is to draw a bunch of new datasets from wherever we got the first one.

This seems unrealistic... but our best guess at what the source of our data looks like is the dataset itself!

The bootstrap¶

Suppose we have a method that estimates something. To get an estimate of uncertainty of that estimate,

  1. Obtain a new dataset, of the same size as the original, by resampling with replacement observations from the original dataset.

  2. Apply the method to this 'bootstrapped' dataset.

  3. Repeat lots of times.

Then, the middle 95% of the estimates gives you the "bootstrap 95% confidence interval".

Exercise:¶

Here are 10 numbers.

array([176, 255,  53,  94,  61, 119,  42, 109,   0,  30])
  1. Compute the median.
  2. 1,000 times, resample 10 numbers, with replacement, from this list of numbers, and compute the median.
  3. Make a histogram of those 1,000 medians, with the original median as a vertical line.
In [13]:
# IN CLASS
x = np.array([176, 255,  53,  94,  61, 119,  42, 109,   0,  30])
boots = np.array([np.median(rng.choice(x, size=10, replace=True)) for _ in range(1000)])
In [14]:
# 95% confidence interval
(np.quantile(boots, 0.025), np.quantile(boots, 0.975))
Out[14]:
(np.float64(36.0), np.float64(119.0))

Back to the rent data¶

Here's a (very) simple method to obtain a bootstrap resampled version of a data frame:

In [15]:
def bootstrap(df):
    # resample n rows, with replacement, from the n rows of df
    n = df.shape[0]
    return df.loc[rng.choice(n, n)]

And, here, let's apply the fit_model method above to 1000 of these.

In [16]:
boots = pd.DataFrame([
    fit_model(bootstrap(df))
    for _ in range(1000)
])
boots
Out[16]:
bethel campus churchill river whittaker year area
0 568.064604 647.906919 572.522506 628.785380 618.256371 33.622293 0.056935
1 566.397927 621.467612 586.455942 601.118356 656.145444 30.894811 0.119537
2 584.502240 613.010580 561.360131 609.489202 628.502952 31.620692 0.020535
3 560.740185 617.053438 593.109358 635.674673 632.473419 31.971218 0.189305
4 571.756057 677.722318 608.139147 614.600792 662.441502 26.206923 0.090265
... ... ... ... ... ... ... ...
995 541.233461 674.393450 588.214759 655.214241 653.555276 29.426819 0.046536
996 560.239981 613.215393 558.599596 597.414493 624.957982 38.334844 0.126020
997 541.226271 636.723186 562.281299 569.201520 633.184012 34.037509 0.146713
998 534.160013 643.611452 589.062351 578.953027 632.279690 32.189726 0.165540
999 597.568280 688.612910 628.627646 641.071724 669.827791 24.545432 0.101566

1000 rows × 7 columns

And, here is the distribution of estimates of $\beta_\text{Bethel}$. It is similar, but not identical, to our other distribution above (as expected).

In [17]:
plt.hist(boots.bethel)
plt.xlabel("mean bethel rent"); plt.ylabel("frequency");
plt.axvline(estimates['bethel'], color='red');

Conclusion: The mean rent in Bethel for 500 ft${}^2$ apartments in 2010 was about \$556, with a 95% confidence interval of about \$518 to \$595.

Widths of other 95% confidence intervals are below.

In [18]:
boots.quantile(0.975) - boots.quantile(0.025)
Out[18]:
bethel       81.873471
campus       94.305973
churchill    81.924064
river        80.584423
whittaker    79.389308
year         10.581510
area          0.201494
dtype: float64
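To get the interval endpoints themselves (rather than just the widths), take the two quantiles without differencing. A sketch with a stand-in for boots (made-up normal draws for two of the parameters):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# stand-in for boots: bootstrap replicates of two of the parameters
fake_boots = pd.DataFrame({
    "bethel": rng.normal(556, 20, size=1000),
    "year": rng.normal(31, 2.7, size=1000),
})
ci = fake_boots.quantile([0.025, 0.975])  # one column per parameter, rows = endpoints
```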