# Exercise: Estimating Elephants

Elephants, at birth, are about 1m long measured along their backs,
and grow about 10cm/year for the first 20 years,
although elephants of the same age differ in length by 10-20% or so
(see [Trimble et al.](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026614)).
Their rate of growth is also affected
by health (e.g., food availability and parasite load).
How well can we estimate the age of juvenile elephants (between 10 and 20 years old)
based on their lengths in aerial photographs?
Does it help much to take into account food availability?

To see how well we expect this to work, let's simulate some data.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng()

First, let's draw some ages: maybe... uniform?

In [None]:
n = 100
age = rng.uniform(low=10, high=20, size=n) # in years

Next: food availability.
We're supposing (somewhat unrealistically) that each of these elephants lives in a different location,
and each location has a general average food availability.
Let's measure food availability as a percentage relative to some reference,
and sample this from a Gamma:

In [None]:
food = rng.gamma(shape=10, scale=8, size=n) # in percent; should have a mean of 80
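
As a quick sanity check on the comment above: a Gamma distribution with a given `shape` and `scale` has mean `shape * scale` and SD `sqrt(shape) * scale`. The sketch below compares these to a large sample; the seed, the sample size, and the name `food_check` are arbitrary choices made for this illustration, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(123)  # arbitrary seed, only for reproducibility
shape, scale = 10, 8

# A large sample, so the sample moments sit close to the theoretical ones
food_check = rng.gamma(shape=shape, scale=scale, size=100_000)

theory_mean = shape * scale          # 80
theory_sd = np.sqrt(shape) * scale   # about 25.3

print(f"sample mean: {food_check.mean():.1f} (theory: {theory_mean})")
print(f"sample SD:   {food_check.std():.1f} (theory: {theory_sd:.1f})")
```

So most locations should have food availability somewhere between roughly 30% and 130% of the reference.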

Now **your turn:** simulate lengths, given age and food availability.
("TODO" to be decided on in class.)
To do this, use:
1. a mean length of 2m for 10-year-old elephants;
2. the growth rate cited above (10cm/year);
3. the SD cited above (around 10-20% of length), so an SD of **TODO**m;
4. a slope so that for each **TODO**% that food availability differs from 100%, mean length changes by **TODO**m.

In [None]:
mean_length = TODO # should be a function of age and food
length = rng.normal(TODO)
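
One way the pieces could fit together is sketched below. The values `sd_length = 0.3` and `food_slope = 0.01` are placeholders assumed purely for illustration (they are not the values decided on in class), and the helper `mean_length_at` is a hypothetical name; only the 2m baseline and the 10cm/year growth rate come from the text above.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility
n = 100
age = rng.uniform(low=10, high=20, size=n)    # in years
food = rng.gamma(shape=10, scale=8, size=n)   # in percent

# ASSUMED placeholder values, standing in for the TODOs:
sd_length = 0.3     # SD of length, in m
food_slope = 0.01   # change in mean length (m) per percentage point of food

def mean_length_at(age, food):
    """Mean length (m): 2m at age 10, plus 0.1m/year, plus a food effect."""
    return 2.0 + 0.1 * (age - 10) + food_slope * (food - 100)

mean_length = mean_length_at(age, food)
length = rng.normal(loc=mean_length, scale=sd_length)
```

Writing the mean as a function makes it easy to check special cases, e.g. that a 10-year-old at 100% food has mean length 2m.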

**Add to this scatterplot** the line giving mean length as a function of age
for an elephant at 100% food (use `plt.axline( )`):

In [None]:
plt.scatter(age, mean_length)
plt.scatter(age, length, c=food)
plt.xlabel("age (years)"); plt.ylabel("length (m)")
plt.colorbar();
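
For reference, `plt.axline` (or the `Axes.axline` method) draws an infinite line through a given point with a given slope. The sketch below draws only the line implied by the numbers in the text: it passes through (age 10, length 2m) and rises 0.1m per year. The axis limits, color, and label are arbitrary choices for this illustration.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Mean length at 100% food: through the point (10 years, 2.0 m), slope 0.1 m/year
line = ax.axline((10, 2.0), slope=0.1, color="black",
                 label="mean length at 100% food")

# axline does not set the view limits itself, so choose them explicitly
ax.set_xlim(10, 20)
ax.set_ylim(1, 4)
ax.set_xlabel("age (years)")
ax.set_ylabel("length (m)")
ax.legend()
```

In the notebook, the same `axline` call can be added to the cell with the scatterplots so the line is drawn over the points.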

## The inference problem

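We'd like to infer age based on length.
If $X$ is length and $Y$ is age,
the line predicting $Y$ from $X$ that minimizes mean squared error
has slope $\operatorname{sd}(Y)/\operatorname{sd}(X) \times \operatorname{cor}[X, Y]$,
and passes through the point of means, (mean of $X$, mean of $Y$).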

**Add this to the plot.**

How can we answer the question "how *well* can we estimate age, based on length"?
One answer to this is the *root mean squared error*.
**Compute this.**
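
A minimal sketch of both steps (the line and its root mean squared error) follows. It re-simulates data so that it runs on its own; the food slope of 0.01 m per percentage point and the SD of 0.3 m are assumed placeholders, not the in-class values, so the printed numbers are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, only for reproducibility
n = 100
age = rng.uniform(low=10, high=20, size=n)    # in years
food = rng.gamma(shape=10, scale=8, size=n)   # in percent
# ASSUMED placeholder simulation (food slope 0.01 m per point, SD 0.3 m)
length = rng.normal(loc=2.0 + 0.1 * (age - 10) + 0.01 * (food - 100), scale=0.3)

# Predict age (Y) from length (X): slope = sd(Y) / sd(X) * cor[X, Y]
r = np.corrcoef(length, age)[0, 1]
slope = np.std(age) / np.std(length) * r
# The line passes through the point of means
intercept = np.mean(age) - slope * np.mean(length)

# Root mean squared error of the predicted ages, in years
predicted_age = intercept + slope * length
rmse = np.sqrt(np.mean((age - predicted_age) ** 2))

print(f"predicted age = {intercept:.2f} + {slope:.2f} * length")
print(f"RMSE = {rmse:.2f} years")
```

The RMSE is in the same units as age, so it reads directly as a typical size of the error in estimated age. For comparison, simply guessing the mean age for every elephant would have an RMSE equal to `np.std(age)`.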

## Multivariate inference

Now let's use food availability also!

Recall that the vector of least-squares coefficients $b$ solves
$$ (x^T x) b = x^T y .$$
(and don't forget the intercept!)

**(1)** Construct the matrix `x`: this should be an array with three columns: `age`, `food`, and a column that's all `1` (for the intercept).

**(2)** Construct the matrix $x^T x$ (use `x.T` and `np.matmul( )` or the `.dot( )` method of an array):

**(3)** Construct the vector $x^T y$ (what is $y$?):

**(4)** Solve the equation $(x^T x) b = x^T y$ (use `np.linalg.solve( )`):
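
Steps (1) through (4) can be sketched as below. This version takes $y$ to be `length`, on the reasoning that `age` is already a column of `x`; treat that as an assumption of the sketch. The data are re-simulated with placeholder values (food slope 0.01 m per percentage point, SD 0.3 m) that are assumed only for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed, only for reproducibility
n = 100
age = rng.uniform(low=10, high=20, size=n)    # in years
food = rng.gamma(shape=10, scale=8, size=n)   # in percent
# ASSUMED placeholder simulation (food slope 0.01 m per point, SD 0.3 m)
length = rng.normal(loc=2.0 + 0.1 * (age - 10) + 0.01 * (food - 100), scale=0.3)

# (1) Design matrix with columns: age, food, intercept
x = np.column_stack([age, food, np.ones(n)])

# (2) The 3 x 3 matrix x^T x
xtx = np.matmul(x.T, x)

# (3) The length-3 vector x^T y, here with y = length
xty = x.T.dot(length)

# (4) Solve (x^T x) b = x^T y for the coefficients b
b = np.linalg.solve(xtx, xty)

print(f"age slope: {b[0]:.3f}, food slope: {b[1]:.4f}, intercept: {b[2]:.2f}")
```

With these placeholder values the simulation's true coefficients are 0.1 (age), 0.01 (food), and 0 (intercept, since 2.0 - 0.1 * 10 - 0.01 * 100 = 0), so the estimates in `b` should land near those, up to sampling noise.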