Hello again, Uncertainty¶

Fall 2022: Peter Ralph

https://uodsci.github.io/dsci345

Prediction and Estimation¶

Suppose I've collected data on rent in Eugene. Consider the following questions:

  1. How much does rent cost in the Bethel neighborhood this year?
  2. How much is rent changing by, per year?

Question: What would good answers to these questions look like?

We could answer these questions by

  • reporting empirical averages from the actual data
  • fitting a black-box predictor and asking it questions
  • fitting a model with explicit terms for neighborhood and year

(Discuss pros and cons and how these would actually work.)

A third question¶

  1. How sure are you about those answers?
  1. How much does rent cost in the Bethel neighborhood this year?

Note that this has two notions of uncertainty: rents come in a range of values, but also there will be uncertainty in our estimates of that range.

Uncertainty, by math¶

Reminder: confidence intervals¶

If we have $n$ independent samples from some source, and the sample mean and SD are $\bar x$ and $s$ respectively, then a 95% confidence interval is $$ \bar x - t_n s \qquad \text{to} \qquad \bar x + t_n s , $$ where $t_n$ is the 97.5% quantile for the $t$ distribution with $n-2$ degrees of freedom.

Said another way: $\bar x$ is our estimate of the mean, and math gives us an estimate of the uncertainty behind our estimate.

  1. How much does rent cost in the Bethel neighborhood this year?

(Take the mean of the rents in Bethel this year.)

  1. How much is rent changing by, per year?

(Take the mean difference in rent of places that go up for rent in adjacent years.)

  1. How sure are you about those answers?

(Get 95% CIs for them.)

A model¶

For example: suppose we fit a model where the price of the $i^\text{th}$ apartment is $$\begin{aligned} p_i &= \beta_{n_i} + \beta_\text{year} y_i + \beta_\text{area} a_i + \epsilon_i \\ \epsilon_i &\sim \text{Normal}(0, \sigma) , \end{aligned}$$ where

  • $p_i$ is the price (in dollars)
  • $n_i$ is the neighborhood,
  • $\beta_n$ is the mean price in neighborhood $n$
  • $y_i$ is the year
  • $\beta_\text{year}$ is the amount that rent changes by, per year, on average
  • $a_i$ is the area in square feet
  • $\beta_\text{area}$ is the amount that rent goes up per additional square foot, on average
  • $\epsilon_i$ is the residual, and
  • $\sigma$ is how much otherwise similar apartments differ in price
  1. How much does rent cost in the Bethel neighborhood this year?

(Our estimate of $\beta_\text{Bethel}$.)

  1. How much is rent changing by, per year?

(Our estimate of $\beta_\text{year}$.)

  1. How sure are you about those answers?

???

Aside: how do we fit this model?¶

We'll do it in two lines, with patsy and np.linalg.lstsq.

First, let's read in the data. (Note: it's fake data.) (github link)

In [2]:
df = pd.read_csv("data/housing.csv", index_col=0)
df
Out[2]:
hood year area rent
0 river 2014 489.0 942.0
1 bethel 2010 610.0 370.0
2 churchill 2017 385.0 669.0
3 bethel 2012 757.0 624.0
4 bethel 2019 504.0 990.0
... ... ... ... ...
395 bethel 2012 238.0 579.0
396 campus 2016 287.0 759.0
397 churchill 2016 526.0 679.0
398 campus 2014 418.0 868.0
399 bethel 2018 386.0 983.0

400 rows × 4 columns

Next, let's shift the variables, so that for instance $\beta_\text{Bethel}$ is the average price of a 500 ft${}^2$ apartment in 2010 (instead of a 0 ft${}^2$ apartment in the year 0):

In [3]:
df['area'] -= 500
df['year'] -= 2010

Now, fit the model, using patsy to make the design matrix (see board):

In [27]:
import patsy
outcome, predictors = patsy.dmatrices("rent ~ 0 + hood + year + area", df)
fit = np.linalg.lstsq(predictors, outcome, rcond=None)

estimates = { k.lstrip('hood[').rstrip("]") : v for k, v in zip(predictors.design_info.column_names, fit[0].ravel())}

estimates
Out[27]:
{'bethel': 556.0088681522523,
 'campus': 649.8851743401476,
 'churchill': 584.0968576228607,
 'river': 610.4626011683356,
 'whittaker': 643.2011227181572,
 'year': 31.218431226183213,
 'area': 0.10063620684218222}
In [21]:
np.asarray(predictors)[:10,:]
Out[21]:
array([[   0.,    0.,    0.,    1.,    0.,    4.,  -11.],
       [   1.,    0.,    0.,    0.,    0.,    0.,  110.],
       [   0.,    0.,    1.,    0.,    0.,    7., -115.],
       [   1.,    0.,    0.,    0.,    0.,    2.,  257.],
       [   1.,    0.,    0.,    0.,    0.,    9.,    4.],
       [   0.,    0.,    0.,    0.,    1.,    5.,  -16.],
       [   0.,    0.,    0.,    1.,    0.,    6.,   93.],
       [   1.,    0.,    0.,    0.,    0.,    4., -196.],
       [   1.,    0.,    0.,    0.,    0.,    3.,  163.],
       [   0.,    1.,    0.,    0.,    0.,    2.,   81.]])
In [22]:
df
Out[22]:
hood year area rent
0 river 4 -11.0 942.0
1 bethel 0 110.0 370.0
2 churchill 7 -115.0 669.0
3 bethel 2 257.0 624.0
4 bethel 9 4.0 990.0
... ... ... ... ...
395 bethel 2 -262.0 579.0
396 campus 6 -213.0 759.0
397 churchill 6 26.0 679.0
398 campus 4 -82.0 868.0
399 bethel 8 -114.0 983.0

400 rows × 4 columns

Uncertainty, by simulation¶

Okay, so we'd like to know how far off our guesses of $\beta_\text{Bethel}$ or $a$ are from "the truth".

Well, one way to do this is to simulate data where we know the truth, and then see how far off our guesses are. If the simulations describe the real data well, then this should give us a good estimate.

This is sometimes known as the parametric bootstrap. (We'll meet it's more famous cousin, "the" bootstrap, next.) We actually did it, last class. Let's do it again!

Okay, from our "real data" we estimated a certain set of estimated parameter values. To make the simulated datasets look as much like the real data as possible, we should simulate using those parameter values.

In [5]:
def sim_rents(n, beta):
    # TODO: change this to use existing 'hood', 'year', and 'area' columns
    hoods = ["bethel", "churchill", "whittaker", "river", "campus"]
    df = pd.DataFrame({
        "hood" : rng.choice(hoods, replace=True, size=n),
        "year" : rng.choice(np.arange(10), replace=True, size=n), # years after 2010
        "area" : np.round(rng.normal(scale=150, size=n)), # relative to 500 ft2
    })
    df['rent'] = np.round(
        np.array([beta[h] for h in df.hood]) + beta['year'] * df.year + beta['area'] * df.area
        + rng.normal(scale=150, size=n)
    )
    return df
In [6]:
def fit_model(df):
    outcome, predictors = patsy.dmatrices("rent ~ 0 + hood + year + area", df)
    fit = np.linalg.lstsq(predictors, outcome, rcond=None)
    estimates = { k.lstrip('hood[').rstrip("]") : v for k, v in zip(predictors.design_info.column_names, fit[0].ravel())}
    return estimates
In [28]:
experiments = pd.DataFrame([
    fit_model(sim_rents(n=500, beta=estimates))
    for _ in range(1000)
])
experiments
Out[28]:
bethel campus churchill river whittaker year area
0 553.129388 636.782335 566.790531 629.834955 631.315944 33.044571 0.148637
1 546.389645 653.301417 535.018897 611.376905 650.518137 33.038017 0.088296
2 558.117169 676.164650 603.882570 609.233100 631.745660 31.031249 0.027875
3 578.508743 650.071314 596.373620 629.513868 651.443396 28.992217 0.026167
4 561.942113 665.732217 606.196788 593.298654 665.981267 28.483740 0.093528
... ... ... ... ... ... ... ...
995 538.166340 604.119317 530.887685 599.281834 643.743802 34.672483 0.062561
996 561.093625 666.110372 585.613416 599.759788 649.020167 30.866896 0.064493
997 542.858757 617.818254 585.387913 608.054368 643.774205 33.102289 0.169135
998 548.272569 646.682739 591.073084 638.657332 647.991229 29.784093 0.044494
999 536.700051 651.061483 580.233709 594.350300 634.795511 31.948951 0.051544

1000 rows × 7 columns

In [30]:
plt.hist(experiments.bethel)
plt.xlabel("mean bethel rent"); plt.ylabel("frequency");
plt.axvline(estimates['bethel'], color='red')
Out[30]:
<matplotlib.lines.Line2D at 0x7fc7b0cd7940>

Conclusion: The mean rent in Bethel for 500 ft${}^2$ apartments in 2010 as about \$556, with a 95% confidence interval of about \\$517 to \$594.

Widths of other 95% confidence intervals are below.

In [13]:
experiments.quantile(0.975) - experiments.quantile(0.025)
Out[13]:
bethel       77.119173
campus       69.725638
churchill    69.460268
river        70.343363
whittaker    71.668009
year          9.279153
area          0.181946
dtype: float64

Uncertainty, by our bootstraps¶

However, sometimes we don't have (or can't be bothered to make) a good generative model to simulate from. Or we tried and it was hard to get realistic-looking data. What to do then?

Well, what we'd like to do is to draw a bunch of new datasets from wherever we got the first one.

This seems unrealistic... but, our best guess at what that would be like if we could... is our dataset itself!

The bootstrap¶

Suppose we have a method that estimates something. To get an estimate of uncertainty of that estimate,

  1. Obtain a new dataset, of the same size as the original, by resampling with replacement observations from the original dataset.

  2. Apply the method to this 'bootstrapped' dataset.

  3. Repeat lots of times.

Then, the middle 95% of the estimates gives you the "bootstrap 95% confidence interval".

Exercise:¶

Here are 10 numbers.

array([176, 255,  53,  94,  61, 119,  42, 109,   0,  30])
  1. Compute the median.
  2. 1,000 times, resample 10 numbers, with replacement, from this list of numbers, and compute the median.
  3. Make a histogram of those 1,000 medians, with the original median as a vertical line.
In [45]:
x = np.array([176, 255,  53,  94,  61, 119,  42, 109,   0,  30])
obs_median = np.median(x)

def boot_med(x, num_boots):
    out = np.zeros(num_boots)
    for k in range(num_boots):
        bx = rng.choice(x, size=len(x))
        out[k] = np.median(bx)
    return out

boot_med(x, 10)
Out[45]:
array([ 42. ,  57. ,  73.5,  94. , 101.5,  94. ,  77.5,  41.5,  57. ,
        53. ])
In [48]:
boot_reps = boot_med(x, 1000)
plt.hist(boot_reps)
plt.title(
    f"95% CI for the median: {np.quantile(boot_reps, 0.025)} to {np.quantile(boot_reps, 0.975)}"
);

Here's a (very) simple method to obtain a bootstrap resampled version of a data frame:

In [9]:
def bootstrap(df):
    n = df.shape[0]
    return df.loc[rng.choice(n, n)]

And, here, let's apply thefit_model method above to 1000 of these.

In [10]:
boots = pd.DataFrame([
    fit_model(bootstrap(df))
    for _ in range(1000)
])
boots
Out[10]:
bethel campus churchill river whittaker year area
0 566.178533 673.454074 572.670839 650.981769 631.659294 29.871761 0.052712
1 569.876770 661.566126 573.540746 602.964247 618.121608 32.437806 0.153979
2 560.932447 646.536876 584.596285 646.603953 660.974635 29.749610 0.081730
3 547.287149 643.902694 579.080515 608.663384 603.393940 31.501915 0.045602
4 539.682865 646.525045 609.888808 627.118967 657.588515 30.051460 0.180413
... ... ... ... ... ... ... ...
995 539.602579 659.099466 572.510073 600.792865 610.673226 31.952523 0.120829
996 548.591574 601.780609 571.831343 631.447685 648.958472 31.437289 0.178433
997 547.403821 644.033426 582.261111 635.851331 669.104292 31.075540 -0.007286
998 583.384967 653.739912 577.262960 619.867603 650.517642 30.868396 0.096619
999 549.247441 621.678332 571.637133 575.658515 632.486326 34.983206 0.056221

1000 rows × 7 columns

And, here is the distribution of estimates of $\beta_\text{Bethel}$. It is similar, but not identical, to our other distribution above (as expected).

In [49]:
plt.hist(boots.bethel)
plt.xlabel("mean bethel rent"); plt.ylabel("frequency");
plt.axvline(estimates['bethel'], color='red');

Conclusion: The mean rent in Bethel for 500 ft${}^2$ apartments in 2010 as about \$556, with a 95% confidence interval of about \\$516 to \$595.

Widths of other 95% confidence intervals are below.

In [17]:
boots.quantile(0.975) - experiments.quantile(0.025)
Out[17]:
bethel       79.772687
campus       80.884375
churchill    76.273014
river        75.963004
whittaker    72.911259
year          9.611364
area          0.196990
dtype: float64