Our first linear model was motivated by the probability model $y_i = x_i \cdot \beta + \epsilon_i$, with $\epsilon_i \sim \text{Normal}(\text{mean}=0, \text{sd}=\sigma)$, which led us to choose $\beta$ by minimizing the sum of squared errors, $\sum_i (y_i - x_i \cdot \beta)^2$.
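To see the connection explicitly: under the Normal model the log-likelihood of the data is
$$\log \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - x_i \cdot \beta)^2}{2\sigma^2}\right) = -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i \cdot \beta)^2 - n \log\left(\sqrt{2\pi}\,\sigma\right),$$
so for any fixed $\sigma$, maximizing the likelihood over $\beta$ is exactly minimizing the sum of squared errors.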
A natural question is: how much do the results depend on the "distributional assumptions", i.e., the "Normal( )"?
The Cauchy distribution is another probability distribution: it can take any value (like the Normal), its histogram looks like a hump (like the Normal), and it has both "location" and "scale" parameters (like the Normal). Let's have a look at the model $y_i = x_i \cdot \beta + \epsilon_i$, with $\epsilon_i \sim \text{Cauchy}(\text{loc}=0, \text{scale}=\sigma)$.
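One place the two distributions differ a lot is in the tails: Cauchy draws occasionally land very far from the center. Here's a quick illustrative sketch (the seed is chosen arbitrarily; this isn't part of the analysis below) comparing the largest values seen in samples from the two:

```python
import numpy as np

rng = np.random.default_rng(123)  # arbitrary seed, just for illustration
normal_draws = rng.standard_normal(10_000)
cauchy_draws = rng.standard_cauchy(10_000)

# the Normal sample stays within a few units of zero,
# while the Cauchy sample typically contains some enormous values
print("largest |Normal draw|:", np.abs(normal_draws).max())
print("largest |Cauchy draw|:", np.abs(cauchy_draws).max())
```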
First we'll simulate from the model and see how well the usual method works.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()

n = 200
k = 2
x = rng.uniform(size=(n, k))                   # n observations of k predictors
beta = [10, -2]                                # true coefficients
y = x.dot(beta) + rng.standard_cauchy(size=n)  # linear signal plus Cauchy noise
from sklearn import linear_model
line_reg = linear_model.LinearRegression(fit_intercept=True)
line_reg.fit(x, y)
line_reg.coef_
array([ 0.04686807, -9.46323921])
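Those estimates are nowhere near the true coefficients of 10 and -2. And this isn't just one unlucky draw: here's a quick illustrative check, re-simulating the noise a few times and refitting; with Cauchy errors the least-squares coefficients typically jump around from run to run.

```python
# illustrative check: refit least squares on fresh Cauchy noise several times;
# the estimated coefficients typically vary wildly between runs
for _ in range(5):
    y_sim = x.dot(beta) + rng.standard_cauchy(size=n)
    print(linear_model.LinearRegression(fit_intercept=True).fit(x, y_sim).coef_)
```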
y_hat = line_reg.predict(x)
fig, ax = plt.subplots()
ax.scatter(y, y_hat)
ax.set_xlim(-20, 20)
ax.set_xlabel("observed value")
ax.set_ylabel("predicted value")
ax.set_title(f"MSE = {np.median((y - y_hat)**2):.2f}")
ax.axline((0,0), slope=1);
What to do? Well, the Cauchy density with scale $\sigma$ is $f(u) = \frac{1}{\pi \sigma \left(1 + (u/\sigma)^2\right)}$.
So, the likelihood of the data is $\prod_{i=1}^n f(y_i - \hat y_i)$, and the log-likelihood, with $\hat y = X\beta$, is $-\sum_{i=1}^n \log\left(1 + \left((y_i - x_i \cdot \beta)/\sigma\right)^2\right) - n \log(\pi \sigma)$. So, we can estimate $\beta$ and $\sigma$ by minimizing the negative of this:
from scipy.optimize import minimize
def logl(u):
    # negative log-likelihood of the Cauchy model; u = (beta_1, ..., beta_k, sigma)
    beta = u[:-1]
    sigma = u[-1]
    yhat = x.dot(beta)
    return np.sum(np.log(1 + ((y - yhat) / sigma)**2)) + n * np.log(np.pi * sigma)

res = minimize(logl, [1, 1, 1])
est_beta = res['x'][:-1]
est_sigma = res['x'][-1]
new_yhat = x.dot(est_beta)
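Before plotting, let's compare the two sets of estimates side by side (recall the true coefficients are 10 and -2):

```python
print("least squares estimate:", line_reg.coef_)
print("Cauchy MLE estimate:   ", est_beta)
print("estimated scale sigma: ", est_sigma)
```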
fig, ax = plt.subplots()
ax.scatter(y, new_yhat)
ax.set_xlim(-20, 20)
ax.set_xlabel("observed value")
ax.set_ylabel("predicted value")
ax.set_title(f"MSE = {np.median((y - new_yhat)**2):.2f}")
ax.axline((0,0), slope=1);
We adjusted the loss function: a more general way of describing what we're doing is that we're trying to minimize the loss $\sum_i L(y_i - \hat y_i)$, where $L(\cdot)$ is a function that quantifies how "bad" an "error" of a given size is.
The standard linear model used $L(u) = u^2$; the Cauchy model led us to use $L(u) = \log(1 + u^2)$.
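To make that general description concrete, here's a small sketch (the function name and structure are mine, not from any particular library) that fits $\beta$ by numerically minimizing $\sum_i L(y_i - \hat y_i)$ for whatever loss function you hand it:

```python
from scipy.optimize import minimize

def fit_with_loss(x, y, loss):
    """Fit coefficients beta by minimizing sum_i loss(y_i - x_i . beta)."""
    def total_loss(beta):
        return np.sum(loss(y - x.dot(beta)))
    k = x.shape[1]
    return minimize(total_loss, np.zeros(k))['x']

# the two losses discussed above
squared_loss = lambda u: u**2
cauchy_loss = lambda u: np.log(1 + u**2)

fit_with_loss(x, y, squared_loss)  # ~ the least-squares fit (no intercept)
fit_with_loss(x, y, cauchy_loss)   # ~ the Cauchy fit, with sigma fixed at 1
```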
What's the difference? Well, here are the PDFs of the two:
from scipy.stats import norm, cauchy
fig, ax = plt.subplots()
xvals = np.linspace(-5, 5, 101)
ax.plot(xvals, norm.pdf(xvals), label='Normal')
ax.plot(xvals, cauchy.pdf(xvals), label='Cauchy')
ax.legend();
Those don't look too different. But have a look at the (negative) log PDF (which is what we use as a loss function):
from scipy.stats import norm, cauchy
fig, ax = plt.subplots()
xvals = np.linspace(-5, 5, 101)
ax.plot(xvals, -1 * norm.logpdf(xvals), label='Normal')
ax.plot(xvals, -1 * cauchy.logpdf(xvals), label='Cauchy')
ax.legend();
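The difference is in how fast they grow: for large errors the Normal loss grows like $u^2/2$, while the Cauchy loss grows only like $2\log|u|$. Under squared-error loss a single huge residual can dominate the whole fit, which is exactly what the Cauchy-generated outliers did to the least-squares estimates above; the log loss penalizes a huge error only a little more than a moderate one, so the fit isn't dragged around by it.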