When there's too many right answers

Peter Ralph

https://uodsci.github.io/dsci345

In [1]:
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (15, 8)
import numpy as np
import pandas as pd
import patsy
from dsci345 import pretty

rng = np.random.default_rng(seed=123)

$$\renewcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\text{var}} \newcommand{\sd}{\text{sd}} \newcommand{\cov}{\text{cov}} \newcommand{\cor}{\text{cor}}$$ This is here so we can use \P and \E and \var and \cov and \cor and \sd in LaTeX below.

In [2]:
x = rng.uniform(high=10, size=101)
x.sort()
y = 10 * (1.2 * (x < np.pi) + 2 * np.logical_and(x > np.pi, x < 8) + (10 - x) * (x > 8)) - 12
y += rng.normal(scale=2, size=len(x))
df = pd.DataFrame({'t': x, 'y': y})

A motivating problem: interpolation

Suppose we've got a time series of noisy observations like the following, and we'd like to infer the underlying signal:

In [3]:
plt.scatter(df['t'], df['y'])
plt.xlabel("time (t)"); plt.ylabel("response (y)");
[figure: scatter plot of the noisy observations, response (y) against time (t)]

Does the right answer look like this?

In [4]:
plt.scatter(df['t'], df['y'])
plt.plot(df['t'], df['y'])
plt.xlabel("time (t)"); plt.ylabel("response (y)");
[figure: the same scatter with consecutive points joined by line segments]

Or like this?

In [5]:
ey = 10 * (1.2 * (x < np.pi) + 2 * np.logical_and(x > np.pi, x < 8) + (10 - x) * (x > 8)) - 12
plt.plot(df['t'], ey)
plt.scatter(df['t'], df['y'])
plt.xlabel("time (t)"); plt.ylabel("response (y)");
[figure: the scatter with the true underlying piecewise signal overlaid]

Or what?

In [6]:
plt.plot(df['t'], np.convolve(ey, np.ones(11)/11)[5:-5])
plt.scatter(df['t'], df['y'])
plt.xlabel("time (t)"); plt.ylabel("response (y)");
[figure: the scatter with a smoothed (moving-average) version of the signal overlaid]

A quick introduction to splines

What we'd like to do is to fit a model like $$ y_i = \beta_1 f_1(t_i) + \cdots + \beta_k f_k(t_i) + \epsilon_i , $$ where $f_1(t), \ldots, f_k(t)$ are "nice smooth functions".
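In matrix form, if we let $F$ be the matrix with entries $F_{ij} = f_j(t_i)$, this is just the linear model $y = F \beta + \epsilon$: once we've computed $F$, everything we already know about least squares carries over.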

A good way to get a bunch of "nice smooth functions" is from a "spline basis", like this one (thanks, patsy!):

In [7]:
s = np.hstack([
    (10/k) * patsy.dmatrix(f"bs(t, df={k}, degree=3, include_intercept=True) - 1", df)
    for k in (4, 8, 12, 16, 24, 32, 48, 64)
])
 
plt.plot(df['t'], s);
[figure: the spline basis functions plotted against t]

By taking linear combinations of these functions we can make a pretty wide range of curves. Here are a few randomly chosen ones:

In [8]:
coefs = rng.normal(size=(s.shape[1], 4))   # one coefficient per basis function, for four random curves
coefs.shape, s.shape, df.shape             # sanity check of the dimensions
In [9]:
plt.plot(df['t'], s.dot(coefs));
[figure: four random linear combinations of the spline basis functions]

Well, we know how to fit that model! Least squares, here we come!

In [10]:
sfit, _, _, _ = np.linalg.lstsq(s, y, rcond=None)

plt.scatter(df['t'], df['y']);
plt.plot(df['t'], s.dot(sfit));
[figure: the unpenalized least-squares spline fit overlaid on the data]

Hm - that looks a bit too... jagged? Overfit, a bit?

Too many knobs

One problem is that there are too many degrees of freedom: we're trying to infer 208 parameters (4 + 8 + 12 + 16 + 24 + 32 + 48 + 64 basis functions) from only 101 data points.

But, all those knobs make our inferred curves nice and flexible! Which ones do we need?

One solution would be to reduce the number of knobs (i.e., use fewer basis functions), as sketched below.
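For instance, here's a rough sketch of that first option (not the route we'll take below), assuming the same `df` and `y` as above: build a single small spline basis, with an arbitrarily chosen df=8, and fit it by ordinary least squares.

s_small = patsy.dmatrix("bs(t, df=8, degree=3, include_intercept=True) - 1", df)
small_fit, _, _, _ = np.linalg.lstsq(s_small, y, rcond=None)
plt.scatter(df['t'], df['y'])
plt.plot(df['t'], np.asarray(s_small).dot(small_fit));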

Another solution is to adjust our expectations: never mind, we don't actually want the best possible solution, we just want a pretty good one, please? And, we'd like it to be reasonable?

Regularization, again

Recall we're fitting this model: $$ y_i = \beta_1 f_1(t_i) + \cdots + \beta_k f_k(t_i) + \epsilon_i , $$ which suggests finding $\beta$ to minimize the loss function $$ \sum_i \left( y_i - \sum_{j=1}^k \beta_j f_j(t_i) \right)^2 . $$

To "encourage smoothness" we might add to this a penalty that depends on the wiggliness of the functions, say $$ + \alpha \sum_j \beta_j^2 \int f_j''(t)^2 dt . $$

We don't have those second derivatives easily available here, so I've cheated a little in how I set up the functions (that's roughly what the $10/k$ scaling in the basis construction is for: wigglier basis functions are scaled down, so they need bigger coefficients, which a penalty on $\beta$ then discourages), and we can just add a "ridge" penalty that does roughly the same thing: $$ + \alpha \sum_j \beta_j^2 . $$
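For concreteness, here is a minimal sketch of solving this penalized least squares problem directly, assuming the same `s` and `y` as above. It ignores the intercept (which sklearn's Ridge, used below, handles by centering), so it won't match those fits exactly; `ridge_by_hand` and `beta_hand` are just illustrative names.

def ridge_by_hand(S, y, alpha):
    # minimize sum_i (y_i - (S @ b)_i)^2 + alpha * sum_j b_j^2 ;
    # setting the gradient to zero gives (S^T S + alpha I) b = S^T y
    k = S.shape[1]
    return np.linalg.solve(S.T @ S + alpha * np.eye(k), S.T @ y)

beta_hand = ridge_by_hand(s, y, alpha=2.0)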

Here are the results of using ridge regularization at several different strengths. Which looks best?

In [11]:
from sklearn.linear_model import Ridge

plt.scatter(df['t'], df['y'])
plt.plot(df['t'], s.dot(sfit), label='unpenalized')
for a in [0.1, 2, 50, 1000]:
    rfit = Ridge(alpha=a).fit(s, y)
    rpred = s.dot(rfit.coef_) + rfit.intercept_
    plt.plot(df['t'], rpred, label=f'ridge (alpha={a})');
plt.legend();
[figure: ridge fits at several values of alpha, together with the unpenalized fit, overlaid on the data]

How do we find a good strength of regularization? Cross-validation!

In [12]:
def do_xval(alpha, test):
    # fit on the training folds, then compute RMSE on the held-out test fold
    rfit = Ridge(alpha=alpha).fit(s[~test,:], y[~test])
    rpred = s.dot(rfit.coef_) + rfit.intercept_
    return np.sqrt(np.mean((y[test] - rpred[test])**2))

def xval(alpha, folds):
    return np.mean([do_xval(alpha, folds==j) for j in np.unique(folds)])

folds = np.repeat(np.arange(10), 11)[:101]
rng.shuffle(folds)

avals = np.linspace(0.2, 5, 31)
rmse = np.array([xval(a, folds) for a in avals])
a_min = avals[np.argmin(rmse)]

plt.plot(avals, rmse);
plt.scatter(a_min, rmse[np.argmin(rmse)])
plt.xlabel("alpha"); plt.ylabel("root mean squared testing error");
[figure: cross-validated root mean squared error as a function of alpha, with the minimum marked]
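(As an aside, scikit-learn can automate this search with RidgeCV; a quick sketch, assuming the same `s`, `y`, and `avals` as above. Since the folds are chosen differently, the selected `alpha_` should only be in the same ballpark as our `a_min`.)

from sklearn.linear_model import RidgeCV

# pick the best alpha from the candidates by 10-fold cross-validation
cvfit = RidgeCV(alphas=avals, cv=10).fit(s, y)
cvfit.alpha_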

The winner:

In [13]:
plt.scatter(df['t'], df['y'])
plt.plot(df['t'], s.dot(sfit), label='unpenalized')
rfit = Ridge(alpha=a_min).fit(s, y)
rpred = s.dot(rfit.coef_) + rfit.intercept_
plt.plot(df['t'], rpred, label=f'ridge (alpha={a_min:.4})');
plt.legend();
[figure: the ridge fit at the cross-validation-chosen alpha, compared to the unpenalized fit]

What happened there?

We wanted a flexible model, where we didn't have to pre-specify a specific, simple form for the answer.

But, "flexible" meant there were lots of good solutions, and the best solutions were too close (suffered from overfitting). (The inference problem is ill-posed.)

So, we had to be clever about how to choose a reasonable solution, out of the many good ones.

This same tension between flexibility and overfitting is common to many methods in statistics and machine learning.