Let's look at how newborn (i.e., "neonate") body mass is related to adult body mass, across a bunch of mammal species (in the order Carnivora, including dogs, cats, bears, weasels, and seals). (data link)
carnivores = pd.read_csv("data/carnivora_sizes.csv").set_index("Binomial")
carnivores
Family | NeonateBodyMass_g | AdultBodyMass_g | AgeatEyeOpening_d | WeaningAge_d | |
---|---|---|---|---|---|
Binomial | |||||
Canis aureus | Canidae | 211.82 | 9658.70 | 7.50 | 61.30 |
Canis latrans | Canidae | 200.01 | 11989.10 | 11.94 | 43.71 |
Canis lupus | Canidae | 412.31 | 31756.51 | 14.01 | 44.82 |
Canis mesomelas | Canidae | 177.20 | 8247.30 | NaN | 34.10 |
Callorhinus ursinus | Otariidae | 5354.80 | 55464.82 | 0.00 | 108.69 |
... | ... | ... | ... | ... | ... |
Vulpes lagopus | Canidae | 69.17 | 3584.37 | 15.03 | 49.50 |
Vulpes velox | Canidae | 39.94 | 2088.00 | 12.50 | 47.08 |
Vulpes vulpes | Canidae | 100.49 | 4820.36 | 14.01 | 50.71 |
Vulpes zerda | Canidae | 28.04 | 1317.13 | 16.95 | 65.56 |
Zalophus californianus | Otariidae | 6347.89 | 137194.86 | 0.00 | 319.01 |
145 rows × 5 columns
plt.scatter(carnivores.AdultBodyMass_g, carnivores.NeonateBodyMass_g, s=100);
plt.xlabel("adult body mass"); plt.ylabel("neonate body mass");
But - there's some "substructure"!
colors = {f: c for f, c in zip(carnivores.Family.unique(), plt.cm.tab20.colors)}
for f in colors:
sub = carnivores[carnivores.Family == f]
plt.scatter(sub.AdultBodyMass_g, sub.NeonateBodyMass_g, color=colors[f], s=100, label=f)
plt.legend();
Let's fit a linear model and see what it's predictions are like:
from sklearn.linear_model import LinearRegression as lm
fit1 = lm().fit(carnivores[['AdultBodyMass_g']], carnivores.NeonateBodyMass_g)
for f in colors:
sub = carnivores[carnivores.Family == f]
plt.scatter(sub.AdultBodyMass_g, sub.NeonateBodyMass_g, color=colors[f], s=100, label=f)
pred = pd.DataFrame({ "AdultBodyMass_g": np.linspace(0, 1.6e6, 21) })
pred['NeonateBodyMass_g'] = fit1.predict(pred)
plt.plot(pred.AdultBodyMass_g, pred.NeonateBodyMass_g)
plt.xlabel("adult mass"); plt.ylabel("neonate mass"); plt.legend();
Two good generic diagnostic plots are:
What's the idea?
Here's data that are nonlinear but we don't know it:
X = np.array([np.linspace(-4, 4, 101)]).T
y = rng.normal(loc=X**2, scale=1)
xy_lm = lm().fit(X, y)
yhat = xy_lm.predict(X)
plt.scatter(X[:,0], y)
plt.plot(X[:,0], yhat);
And, this shows up in the fit-versus-residuals plot:
resids = y - yhat
plt.scatter(yhat, resids);
plt.axhline(0, color='red')
plt.xlabel("fitted values"); plt.ylabel("residuals");
Solution: add a quadratic term!
XX2 = np.hstack([X, X**2])
xy_lm = lm().fit(XX2, y)
yhat = xy_lm.predict(XX2)
plt.scatter(X[:,0], y); plt.plot(X[:,0], yhat); plt.xlabel("x"); plt.ylabel('y');
resids = y - yhat
plt.scatter(yhat, resids);
plt.axhline(0, color='red')
plt.xlabel("fitted values"); plt.ylabel("residuals");
Let's look at this in our data:
carnivores['PredictedNeonateMass'] = fit1.predict(
carnivores[['AdultBodyMass_g']]
)
carnivores['residual'] = carnivores['NeonateBodyMass_g'] - carnivores['PredictedNeonateMass']
carnivores
Family | NeonateBodyMass_g | AdultBodyMass_g | AgeatEyeOpening_d | WeaningAge_d | PredictedNeonateMass | residual | logAdultBodyMass_g | logNeonateBodyMass_g | |
---|---|---|---|---|---|---|---|---|---|
Binomial | |||||||||
Canis aureus | Canidae | 211.82 | 9658.70 | 7.50 | 61.30 | 916.205446 | -704.385446 | 9.175614 | 5.355737 |
Canis latrans | Canidae | 200.01 | 11989.10 | 11.94 | 43.71 | 1003.630922 | -803.620922 | 9.391753 | 5.298367 |
Canis lupus | Canidae | 412.31 | 31756.51 | 14.01 | 44.82 | 1745.209715 | -1332.899715 | 10.365853 | 6.021775 |
Canis mesomelas | Canidae | 177.20 | 8247.30 | NaN | 34.10 | 863.256460 | -686.056460 | 9.017641 | 5.177279 |
Callorhinus ursinus | Otariidae | 5354.80 | 55464.82 | 0.00 | 108.69 | 2634.632251 | 2720.167749 | 10.923504 | 8.585749 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Vulpes lagopus | Canidae | 69.17 | 3584.37 | 15.03 | 49.50 | 688.325602 | -619.155602 | 8.184338 | 4.236567 |
Vulpes velox | Canidae | 39.94 | 2088.00 | 12.50 | 47.08 | 632.188948 | -592.248948 | 7.643962 | 3.687378 |
Vulpes vulpes | Canidae | 100.49 | 4820.36 | 14.01 | 50.71 | 734.694042 | -634.204042 | 8.480604 | 4.610058 |
Vulpes zerda | Canidae | 28.04 | 1317.13 | 16.95 | 65.56 | 603.269588 | -575.229588 | 7.183210 | 3.333632 |
Zalophus californianus | Otariidae | 6347.89 | 137194.86 | 0.00 | 319.01 | 5700.752923 | 647.137077 | 11.829158 | 8.755878 |
145 rows × 9 columns
No particular pattern (?), but certainly: heteroskedasticity!
plt.scatter(carnivores.PredictedNeonateMass, carnivores.residual)
plt.axhline(0, color='red');
plt.xlabel("predicted neonate body mass"); plt.ylabel("residual of neonate body mass");
Why is heteroskedasticity a problem? Above, we see that small body masses are all very close to the predicted line. Indeed they have to be! A 5g weasel can't have a 100kg baby. So: these small animals are not affecting the fit of that line very much at all. Maybe that's okay - if we're interested in a relationship that holds only for big animals, for instance - but we should dig more.
One reason for having bad model fit is that you aren't looking at the data through the right lens.
Most commonly: if the data vary by percent changes, then using standard methods (that assume additive changes) won't do so great. But, the logarithm turns percent changes into additive changes!
Let's try it out.
for f in colors:
sub = carnivores[carnivores.Family == f]
plt.scatter(sub.AdultBodyMass_g, sub.NeonateBodyMass_g, color=colors[f], s=100, label=f)
plt.xscale("log"); plt.yscale("log") # log-log axes
plt.xlabel("adult mass"); plt.ylabel("neonate mass"); plt.legend();
Okay, let's now fit a model to log(adult) and log(neonate) mass:
carnivores['logAdultBodyMass_g'] = np.log(carnivores['AdultBodyMass_g'])
carnivores['logNeonateBodyMass_g'] = np.log(carnivores['NeonateBodyMass_g'])
fit2 = lm().fit(carnivores[['logAdultBodyMass_g']], carnivores.logNeonateBodyMass_g)
carnivores
Family | NeonateBodyMass_g | AdultBodyMass_g | AgeatEyeOpening_d | WeaningAge_d | PredictedNeonateMass | residual | logAdultBodyMass_g | logNeonateBodyMass_g | |
---|---|---|---|---|---|---|---|---|---|
Binomial | |||||||||
Canis aureus | Canidae | 211.82 | 9658.70 | 7.50 | 61.30 | 916.205446 | -704.385446 | 9.175614 | 5.355737 |
Canis latrans | Canidae | 200.01 | 11989.10 | 11.94 | 43.71 | 1003.630922 | -803.620922 | 9.391753 | 5.298367 |
Canis lupus | Canidae | 412.31 | 31756.51 | 14.01 | 44.82 | 1745.209715 | -1332.899715 | 10.365853 | 6.021775 |
Canis mesomelas | Canidae | 177.20 | 8247.30 | NaN | 34.10 | 863.256460 | -686.056460 | 9.017641 | 5.177279 |
Callorhinus ursinus | Otariidae | 5354.80 | 55464.82 | 0.00 | 108.69 | 2634.632251 | 2720.167749 | 10.923504 | 8.585749 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Vulpes lagopus | Canidae | 69.17 | 3584.37 | 15.03 | 49.50 | 688.325602 | -619.155602 | 8.184338 | 4.236567 |
Vulpes velox | Canidae | 39.94 | 2088.00 | 12.50 | 47.08 | 632.188948 | -592.248948 | 7.643962 | 3.687378 |
Vulpes vulpes | Canidae | 100.49 | 4820.36 | 14.01 | 50.71 | 734.694042 | -634.204042 | 8.480604 | 4.610058 |
Vulpes zerda | Canidae | 28.04 | 1317.13 | 16.95 | 65.56 | 603.269588 | -575.229588 | 7.183210 | 3.333632 |
Zalophus californianus | Otariidae | 6347.89 | 137194.86 | 0.00 | 319.01 | 5700.752923 | 647.137077 | 11.829158 | 8.755878 |
145 rows × 9 columns
for f in colors:
sub = carnivores[carnivores.Family == f]
plt.scatter(sub.logAdultBodyMass_g, sub.logNeonateBodyMass_g, color=colors[f], s=100, label=f)
pred2 = pd.DataFrame({
"logAdultBodyMass_g": np.linspace(np.min(carnivores.logAdultBodyMass_g),
np.max(carnivores.logAdultBodyMass_g), 201)
})
pred2['logNeonateBodyMass_g'] = fit2.predict(pred2)
plt.plot(np.log(pred.AdultBodyMass_g), np.log(pred.NeonateBodyMass_g), label='orig fit')
plt.plot(pred2.logAdultBodyMass_g, pred2.logNeonateBodyMass_g, label='log fit')
plt.xlabel("log adult mass"); plt.ylabel("log neonate mass"); plt.legend();
/usr/lib/python3/dist-packages/pandas/core/arraylike.py:364: RuntimeWarning: divide by zero encountered in log result = getattr(ufunc, method)(*inputs, **kwargs)
for f in colors:
sub = carnivores[carnivores.Family == f]
fit_sub = lm().fit(sub[['logAdultBodyMass_g']], sub.logNeonateBodyMass_g)
plt.scatter(sub.logAdultBodyMass_g, sub.logNeonateBodyMass_g, color=colors[f], s=100, label=f)
pred_sub = pd.DataFrame({
"logAdultBodyMass_g": np.linspace(np.min(sub.logAdultBodyMass_g),
np.max(sub.logAdultBodyMass_g), 201)
})
pred_sub['logNeonateBodyMass_g'] = fit_sub.predict(pred_sub)
plt.plot(pred_sub.logAdultBodyMass_g, pred_sub.logNeonateBodyMass_g, color=colors[f])
plt.legend();
carnivores[carnivores.Family == "Mustelidae"].plot.scatter("AdultBodyMass_g", "NeonateBodyMass_g")
<AxesSubplot: xlabel='AdultBodyMass_g', ylabel='NeonateBodyMass_g'>
carnivores[carnivores.Family == "Mustelidae"].plot.scatter("logAdultBodyMass_g", "logNeonateBodyMass_g")
<AxesSubplot: xlabel='logAdultBodyMass_g', ylabel='logNeonateBodyMass_g'>
Next steps:
non_na = carnivores[~np.isnan(carnivores.AgeatEyeOpening_d)]
non_na = non_na[~np.isnan(non_na.WeaningAge_d)]
fit3 = lm().fit(non_na[['logAdultBodyMass_g', "AgeatEyeOpening_d", "WeaningAge_d"]], non_na.logNeonateBodyMass_g)
non_na['predicted'] = fit3.predict(non_na[['logAdultBodyMass_g', "AgeatEyeOpening_d", "WeaningAge_d"]])
non_na['resid'] = non_na.logNeonateBodyMass_g - non_na['predicted']
for f in colors:
sub = non_na[non_na.Family == f]
plt.scatter(sub.predicted, sub.resid, color=colors[f], s=100, label=f)
plt.axhline(0, color='red')
plt.legend();