Hello again, Uncertainty¶

Peter Ralph

https://uodsci.github.io/dsci345

In [1]:
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (15, 8)
import numpy as np
import pandas as pd
from dsci345 import pretty

rng = np.random.default_rng()

$$\renewcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\text{var}} \newcommand{\sd}{\text{sd}} \newcommand{\cov}{\text{cov}} \newcommand{\cor}{\text{cor}}$$ This is here so we can use \P and \E and \var and \cov and \cor and \sd in LaTeX below.

Prediction and Estimation¶

Suppose I've collected data on rent in Eugene. Consider the following questions:

  1. How much does rent cost in the Bethel neighborhood this year?
  2. How much is rent changing by, per year?

Question: What would good answers to these questions look like?

We could answer these questions by

  • reporting empirical averages from the actual data
  • fitting a black-box predictor and asking it questions
  • fitting a model with explicit terms for neighborhood and year

(Discuss pros and cons and how these would actually work.)

A third question¶

  3. How sure are you about those answers?

Take the first question again:

  1. How much does rent cost in the Bethel neighborhood this year?

Note that this has two notions of uncertainty: rents come in a range of values, but also there will be uncertainty in our estimates of that range.

Uncertainty, by math¶

Reminder: confidence intervals¶

If we have $n$ independent samples from some source, and the sample mean and SD are $\bar x$ and $s$ respectively, then a 95% confidence interval is $$ \bar x - t_n s / \sqrt{n} \qquad \text{to} \qquad \bar x + t_n s / \sqrt{n} , $$ where $t_n$ is the 97.5% quantile for the $t$ distribution with $n-1$ degrees of freedom.

Said another way: $\bar x$ is our estimate of the mean, and math gives us an estimate of the uncertainty behind our estimate.
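For concreteness, here is a minimal sketch of that computation (the helper name mean_ci is ours, not from the original notes, and it assumes scipy is available):

from scipy import stats

def mean_ci(x):
    # t-based 95% confidence interval for the mean of a sample x
    n = len(x)
    tstar = stats.t.ppf(0.975, df=n - 1)   # 97.5% quantile, n-1 degrees of freedom
    xbar = np.mean(x)
    s = np.std(x, ddof=1)                  # sample SD
    return (xbar - tstar * s / np.sqrt(n), xbar + tstar * s / np.sqrt(n))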

  1. How much does rent cost in the Bethel neighborhood this year?

(Take the mean of the rents in Bethel this year.)

  2. How much is rent changing by, per year?

(Take the mean difference in rent between places that go up for rent in adjacent years.)

  3. How sure are you about those answers?

(Get 95% CIs for them.)
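For instance, using the mean_ci sketch above on the housing data frame df that we load below (taking 2019 as "this year", and assuming the unshifted year column):

bethel_now = df.loc[(df.hood == "bethel") & (df.year == 2019), "rent"]
mean_ci(bethel_now)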

A model¶

For example: suppose we fit a model where the price of the $i^\text{th}$ apartment is $$\begin{aligned} p_i &= \beta_{n_i} + \beta_\text{year} y_i + \beta_\text{area} a_i + \epsilon_i \\ \epsilon_i &\sim \text{Normal}(0, \sigma) , \end{aligned}$$ where

  • $p_i$ is the price (in dollars)
  • $n_i$ is the neighborhood,
  • $\beta_n$ is the mean price in neighborhood $n$
  • $y_i$ is the year
  • $\beta_\text{year}$ is the amount that rent changes by, per year, on average
  • $a_i$ is the area in square feet
  • $\beta_\text{area}$ is the amount that rent goes up per additional square foot, on average
  • $\epsilon_i$ is the residual, and
  • $\sigma$ is how much otherwise similar apartments differ in price.

  1. How much does rent cost in the Bethel neighborhood this year?

(Our estimate of $\beta_\text{Bethel}$ plus terms for year and area.)

  2. How much is rent changing by, per year?

(Our estimate of $\beta_\text{year}$.)

  3. How sure are you about those answers?

???

Aside: how do we fit this model?¶

We'll do it in two lines, with patsy and np.linalg.lstsq. (Reminder: since the residuals are Normal, least squares is the same as maximum likelihood.)
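To see why least squares is maximum likelihood here: with $\text{Normal}(0, \sigma)$ residuals, the log-likelihood of the data is $$\log L(\beta, \sigma) = -n \log\left(\sigma\sqrt{2\pi}\right) - \frac{1}{2\sigma^2} \sum_{i=1}^n \left(p_i - \beta_{n_i} - \beta_\text{year} y_i - \beta_\text{area} a_i\right)^2 , $$ so for any fixed $\sigma$, maximizing over the $\beta$'s is exactly minimizing the sum of squared residuals.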

First, let's read in the data. (Note: it's fake data.) (github link)

In [2]:
df = pd.read_csv("data/housing.csv", index_col=0)
df
Out[2]:
hood year area rent
0 river 2014 489.0 942.0
1 bethel 2010 610.0 370.0
2 churchill 2017 385.0 669.0
3 bethel 2012 757.0 624.0
4 bethel 2019 504.0 990.0
... ... ... ... ...
395 bethel 2012 238.0 579.0
396 campus 2016 287.0 759.0
397 churchill 2016 526.0 679.0
398 campus 2014 418.0 868.0
399 bethel 2018 386.0 983.0

400 rows × 4 columns

Next, let's shift the variables, so that for instance $\beta_\text{Bethel}$ is the average price of a 500 ft${}^2$ apartment in 2010 (instead of a 0 ft${}^2$ apartment in the year 0):

In [3]:
df['area'] -= 500
df['year'] -= 2010

Now, fit the model, using patsy to make the design matrix (see board):

In [4]:
import patsy
outcome, predictors = patsy.dmatrices("rent ~ 0 + hood + year + area", df)
fit = np.linalg.lstsq(predictors, outcome, rcond=None)

estimates = {
    k.removeprefix("hood[").removesuffix("]"): v
    for k, v in zip(predictors.design_info.column_names, fit[0].ravel())
}

estimates
Out[4]:
{'bethel': np.float64(556.0088681522523),
 'campus': np.float64(649.8851743401467),
 'churchill': np.float64(584.0968576228599),
 'river': np.float64(610.4626011683347),
 'whittaker': np.float64(643.2011227181563),
 'year': np.float64(31.21843122618344),
 'area': np.float64(0.10063620684218222)}
In [5]:
np.asarray(predictors)[:10,:]
Out[5]:
array([[   0.,    0.,    0.,    1.,    0.,    4.,  -11.],
       [   1.,    0.,    0.,    0.,    0.,    0.,  110.],
       [   0.,    0.,    1.,    0.,    0.,    7., -115.],
       [   1.,    0.,    0.,    0.,    0.,    2.,  257.],
       [   1.,    0.,    0.,    0.,    0.,    9.,    4.],
       [   0.,    0.,    0.,    0.,    1.,    5.,  -16.],
       [   0.,    0.,    0.,    1.,    0.,    6.,   93.],
       [   1.,    0.,    0.,    0.,    0.,    4., -196.],
       [   1.,    0.,    0.,    0.,    0.,    3.,  163.],
       [   0.,    1.,    0.,    0.,    0.,    2.,   81.]])
In [6]:
df
Out[6]:
hood year area rent
0 river 4 -11.0 942.0
1 bethel 0 110.0 370.0
2 churchill 7 -115.0 669.0
3 bethel 2 257.0 624.0
4 bethel 9 4.0 990.0
... ... ... ... ...
395 bethel 2 -262.0 579.0
396 campus 6 -213.0 759.0
397 churchill 6 26.0 679.0
398 campus 4 -82.0 868.0
399 bethel 8 -114.0 983.0

400 rows × 4 columns

Uncertainty, by simulation¶

Okay, so we'd like to know how far off our guesses of $\beta_\text{Bethel}$ or $\beta_\text{area}$ are from "the truth".

Well, one way to do this is to simulate data where we know the truth, and then see how far off our guesses are. If the simulations describe the real data well, then this should give us a good estimate.

This is sometimes known as the parametric bootstrap. (We'll meet its more famous cousin, "the" bootstrap, next.) We actually did this last class. Let's do it again!

Okay: from our "real data" we obtained a set of estimated parameter values. To make the simulated datasets look as much like the real data as possible, we should simulate using those estimates.

In [7]:
def sim_rents(df, beta):
    # TODO: estimate sigma (for now, a hard-coded scale of 150)
    sim_df = df.copy()
    # mean rent = neighborhood effect + year effect + area effect, plus Normal noise
    sim_df['rent'] = np.round(
        np.array([beta[h] for h in sim_df.hood])
        + beta['year'] * sim_df.year
        + beta['area'] * sim_df.area
        + rng.normal(scale=150, size=len(df))
    )
    return sim_df
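As a sketch of the TODO above (not from the original notes): np.linalg.lstsq also returns the residual sum of squares (when the design matrix is full rank, as it is here), from which we can estimate $\sigma$ using the predictors and fit from the model fit above:

n, p = predictors.shape             # 400 observations, 7 coefficients
rss = fit[1][0]                     # residual sum of squares from lstsq
sigma_hat = np.sqrt(rss / (n - p))  # sigma estimate, with a degrees-of-freedom correction

One could then pass scale=sigma_hat to rng.normal in sim_rents instead of the hard-coded 150.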
In [8]:
def fit_model(df):
    outcome, predictors = patsy.dmatrices("rent ~ 0 + hood + year + area", df)
    fit = np.linalg.lstsq(predictors, outcome, rcond=None)
    estimates = {
        k.removeprefix("hood[").removesuffix("]"): v
        for k, v in zip(predictors.design_info.column_names, fit[0].ravel())
    }
    return estimates
In [9]:
estimates
Out[9]:
{'bethel': np.float64(556.0088681522523),
 'campus': np.float64(649.8851743401467),
 'churchill': np.float64(584.0968576228599),
 'river': np.float64(610.4626011683347),
 'whittaker': np.float64(643.2011227181563),
 'year': np.float64(31.21843122618344),
 'area': np.float64(0.10063620684218222)}
In [10]:
experiments = pd.DataFrame([
    fit_model(sim_rents(df, beta=estimates))
    for _ in range(1000)
])
experiments
Out[10]:
bethel campus churchill river whittaker year area
0 575.902510 642.604986 592.798342 608.078287 697.180635 28.779585 0.113843
1 557.178776 660.341787 586.334886 628.076765 644.009012 31.364434 0.035168
2 563.137220 668.938388 582.832586 630.124568 644.535599 32.107249 0.092571
3 544.430006 648.920584 595.158346 598.784393 629.969406 32.162263 0.145045
4 555.357145 659.746291 606.171082 628.264967 642.935342 29.826930 0.095766
... ... ... ... ... ... ... ...
995 544.890722 634.565771 574.393009 595.705698 639.830165 33.709383 0.050875
996 562.275564 655.898478 587.764530 631.991290 635.503826 31.053371 0.188562
997 552.190036 666.794021 546.301255 605.447758 650.142317 31.925727 0.185316
998 587.795985 704.718662 597.207021 607.655886 653.100338 27.555325 0.115970
999 553.359636 652.144653 580.140778 664.082251 679.938914 26.670145 0.098013

1000 rows × 7 columns

In [11]:
plt.hist(experiments.bethel)
plt.xlabel("mean bethel rent"); plt.ylabel("frequency");
plt.axvline(estimates['bethel'], color='red');
[Figure: histogram of the simulated estimates of mean Bethel rent, with our estimate marked by a red vertical line.]

Conclusion: The mean rent in Bethel for 500 ft${}^2$ apartments in 2010 was about \$556, with a 95% confidence interval of about \$518 to \$595.

Widths of other 95% confidence intervals are below.

In [12]:
estimates
Out[12]:
{'bethel': np.float64(556.0088681522523),
 'campus': np.float64(649.8851743401467),
 'churchill': np.float64(584.0968576228599),
 'river': np.float64(610.4626011683347),
 'whittaker': np.float64(643.2011227181563),
 'year': np.float64(31.21843122618344),
 'area': np.float64(0.10063620684218222)}
In [13]:
experiments.quantile(0.975) - experiments.quantile(0.025)
Out[13]:
bethel       77.541967
campus       85.234240
churchill    81.627153
river        81.111593
whittaker    81.389827
year         10.005483
area          0.205705
dtype: float64
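The intervals themselves, not just their widths, come from the same quantiles; for instance,

experiments.quantile([0.025, 0.975])

gives the lower and upper endpoints of the 95% interval for each parameter.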

Uncertainty, by our bootstraps¶

However, sometimes we don't have (or can't be bothered to make) a good generative model to simulate from. Or we tried and it was hard to get realistic-looking data. What to do then?

Well, what we'd like to do is to draw a bunch of new datasets from wherever we got the first one.

This seems unrealistic... but our best guess at what that source looks like... is the dataset itself!

The bootstrap¶

Suppose we have a method that estimates something. To get an estimate of uncertainty of that estimate,

  1. Obtain a new dataset, of the same size as the original, by resampling with replacement observations from the original dataset.

  2. Apply the method to this 'bootstrapped' dataset.

  3. Repeat lots of times.

Then, the middle 95% of the estimates gives you the "bootstrap 95% confidence interval".
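Wrapped up as a function, the recipe might look like this minimal sketch for a 1-D array of data (bootstrap_ci is our name, not a library function):

def bootstrap_ci(data, statistic, B=1000, level=0.95):
    # resample with replacement and recompute the statistic, B times
    stats_ = np.array([
        statistic(rng.choice(data, size=len(data), replace=True))
        for _ in range(B)
    ])
    alpha = (1 - level) / 2
    return (np.quantile(stats_, alpha), np.quantile(stats_, 1 - alpha))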

Exercise:¶

Here are 10 numbers.

array([176, 255,  53,  94,  61, 119,  42, 109,   0,  30])
  1. Compute the median.
  2. 1,000 times, resample 10 numbers, with replacement, from this list of numbers, and compute the median.
  3. Make a histogram of those 1,000 medians, with the original median as a vertical line.
In [14]:
# IN CLASS
x = np.array([176, 255,  53,  94,  61, 119,  42, 109,   0,  30])
med_x = np.median(x)
In [15]:
boot_meds = [np.median(rng.choice(x, size=10, replace=True)) for _ in range(1000)]
In [16]:
fig, ax = plt.subplots()
ax.hist(boot_meds, bins=pretty(boot_meds, 20))
ax.axvline(med_x, c='red');
[Figure: histogram of the 1,000 bootstrap medians, with the original median marked by a red vertical line.]
In [17]:
# 95% confidence interval
(np.quantile(boot_meds, 0.025), np.quantile(boot_meds, 0.975))
Out[17]:
(np.float64(41.5), np.float64(142.5))

Here's a (very) simple method to obtain a bootstrap resampled version of a data frame:

In [18]:
def bootstrap(df):
    # resample n rows, with replacement (assumes the default 0..n-1 row index)
    n = df.shape[0]
    return df.loc[rng.choice(n, n)]

And, here, let's apply the fit_model method above to 1000 of these.

In [19]:
boots = pd.DataFrame([
    fit_model(bootstrap(df))
    for _ in range(1000)
])
boots
Out[19]:
bethel campus churchill river whittaker year area
0 562.472230 607.100486 565.395919 590.802400 666.817819 31.778218 0.126840
1 529.192742 640.495432 593.828156 620.172014 627.839688 30.404554 0.155942
2 574.811506 655.464745 606.884968 649.632581 665.138330 29.370616 0.024600
3 532.740067 631.887855 583.826112 618.851433 663.472104 29.516356 0.124679
4 562.512608 666.967288 598.311342 623.518098 639.886522 29.613112 0.078457
... ... ... ... ... ... ... ...
995 569.668469 658.942641 574.674532 606.367452 622.423225 29.863527 0.136098
996 567.254152 662.668325 549.584022 588.496222 639.206063 34.323386 0.024327
997 559.954882 638.365579 599.215047 605.086299 663.665607 32.544599 0.088130
998 566.461808 682.573878 573.381698 613.338983 643.444693 31.993623 0.042107
999 564.783341 675.236614 576.947170 640.135260 628.659703 29.753753 0.149073

1000 rows × 7 columns

And, here is the distribution of estimates of $\beta_\text{Bethel}$. It is similar, but not identical, to our other distribution above (as expected).

In [20]:
plt.hist(boots.bethel)
plt.xlabel("mean bethel rent"); plt.ylabel("frequency");
plt.axvline(estimates['bethel'], color='red');
[Figure: histogram of the bootstrap estimates of mean Bethel rent, with our estimate marked by a red vertical line.]

Conclusion: The mean rent in Bethel for 500 ft${}^2$ apartments in 2010 was again about \$556, with a 95% confidence interval of about \$518 to \$595.

Widths of other 95% confidence intervals are below.

In [21]:
boots.quantile(0.975) - boots.quantile(0.025)
Out[21]:
bethel       81.652334
campus       85.080632
churchill    82.110892
river        78.229027
whittaker    73.397059
year         10.116929
area          0.189463
dtype: float64
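To compare the two approaches side by side, one could tabulate both sets of widths (a sketch using the experiments and boots data frames from above):

pd.DataFrame({
    'parametric': experiments.quantile(0.975) - experiments.quantile(0.025),
    'bootstrap': boots.quantile(0.975) - boots.quantile(0.025),
})

The two methods agree closely here, as the histograms suggested.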