Exercise: Estimating Elephants¶

$$\renewcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\text{var}} \newcommand{\sd}{\text{sd}} \newcommand{\cov}{\text{cov}} \newcommand{\cor}{\text{cor}}$$ This is here so we can use \P and \E and \var and \cov and \cor and \sd in LaTeX below.

Elephants, at birth, are about 1m long measured along their backs, and grow about 10cm/year for the first 20 years, although elephants of the same age differ by 10-20% or so (see Trimble et al). Their rate of growth is also affected by health (e.g., food availability and parasite load). How well can we estimate the age of juvenile elephants (between 10-20 years old) based on their lengths in aerial photographs? Does it help much to take into account food availability?

To see how well we expect this to work, let's simulate some data.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(seed=123)
In [2]:
n = 100
age = rng.uniform(low=10, high=20, size=n) # in years

What about food availability? Let's measure food availability as a percentage of 'optimal', and suppose that for each 10% that food drops from this point, average elephant size goes down by .15 * .25 m, on average (15% of a standard deviation).

In [3]:
food = rng.gamma(shape=10, scale=8, size=n) # in percent
mean_length = 1 + .1 * age  - .15 *.25 * (100 - food)/10 # in m
length = rng.normal(loc=mean_length, scale=0.25, size=n)
In [4]:
plt.hist(food);
No description has been provided for this image
In [5]:
plt.scatter(age, mean_length)
plt.scatter(age, length, c=food)
plt.xlabel("age (years)"); plt.ylabel("length (m)")
plt.axline((10, 2), slope=.1) # mean size at 100% food
plt.colorbar(label='food avail');
No description has been provided for this image

The inference problem¶

We'd like to infer age based on length. The line that minimizes mean squared error has slope $\sd(Y)/\sd(X) \times \cor[X, Y]$, and has the right mean.

Make the length-vs-age plot and add this line.

In [6]:
np.corrcoef(age, length)
Out[6]:
array([[1.        , 0.66506441],
       [0.66506441, 1.        ]])
In [7]:
a = (np.std(age, ddof=1) / np.std(length, ddof=1) 
     * np.corrcoef(age, length)[0,1])
b = np.mean(age) - a * np.mean(length)
age_hat = a * length + b
In [8]:
a, b
Out[8]:
(np.float64(4.704069567707924), np.float64(3.464646893346048))
In [9]:
plt.scatter(length, age, c=food)
plt.scatter(length, age_hat, c='red')
plt.ylabel("age (years)"); plt.xlabel("length (m)")
plt.axline((2, 10), slope=10) # mean size at 100% food
plt.colorbar(label='food avail');
No description has been provided for this image

How can we answer the question "how well can we estimate age, based on length"? One answer to this is the root mean squared error. Compute this.

In [10]:
resid = age - age_hat
print(f"RMSE = {np.sqrt(np.mean(resid**2))} years")
RMSE = 2.1087153735840225 years

Multivariate inference¶

Now let's use food availability also!

Recall that $b$ solves $$ (x^T x) a = x^T y .$$ (and don't forget the intercept!)

In [11]:
(x.T).dot(x).shape
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 (x.T).dot(x).shape

NameError: name 'x' is not defined
In [12]:
x = np.column_stack([
    np.ones(n),
    length,
    food
])
a = np.linalg.solve((x.T).dot(x), (x.T).dot(age))
In [13]:
a
Out[13]:
array([ 4.5508148 ,  5.12404589, -0.02594002])
In [14]:
age_hat2 = x.dot(a)
In [15]:
resid2 = age - age_hat2
print(f"RMSE = {np.sqrt(np.mean(resid2**2))} years")
RMSE = 2.0013955204713203 years