For random variables $X$ and $Y$, $$\begin{aligned} \cov[X, Y] &= \E[(X - \E[X])(Y - \E[Y])] \\ &= \E[XY] - \E[X] \E[Y] , \end{aligned}$$ and $$ \text{cor}[X, Y] = \frac{\cov[X, Y]}{\sd[X] \sd[Y]} . $$
Properties:
If $a$ is nonrandom and $Z$ is another random variable, then $$\begin{aligned} \text{(constants)} \qquad & \cov[a, X] = 0 \\ \text{(bilinearity)} \qquad & \cov[a X + Y, Z] = a \; \cov[X, Z] + \cov[Y, Z] . \end{aligned}$$
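These properties are easy to check by simulation; here's a minimal sketch (the constant $a = 3$ and the choice of Normal variables are arbitrary):

import numpy as np

rng = np.random.default_rng()
a = 3.0
X, Y, Z = rng.normal(size=(3, 100_000))
# bilinearity: cov[aX + Y, Z] = a cov[X, Z] + cov[Y, Z]
lhs = np.cov(a * X + Y, Z)[0, 1]
rhs = a * np.cov(X, Z)[0, 1] + np.cov(Y, Z)[0, 1]
print(lhs, rhs)   # agree up to floating point: the sample covariance is bilinear, too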
Let $Z_1$, $Z_2$, and $Z_3$ be independent Normal(0, 1), and $$\begin{aligned} X &= \frac{1}{\sqrt{2}}\left( Z_1 + Z_2 \right) \\ Y &= \frac{1}{\sqrt{2}}\left( Z_2 + Z_3 \right) . \end{aligned}$$ Then $$\begin{aligned} X &\sim \text{Normal}(0, 1) , \\ Y &\sim \text{Normal}(0, 1) , \end{aligned}$$ and $$\begin{aligned} \cov[X, Y] &= \frac{1}{2} \E[(Z_1 + Z_2)(Z_2 + Z_3)] \\ &= \frac{1}{2} \E[Z_2^2] = \frac{1}{2} , \end{aligned}$$ since the cross terms have mean zero by independence and $\E[Z_2^2] = 1$.
import matplotlib.pyplot as plt

Z = rng.normal(size=(3, 2000))   # rows are samples of Z1, Z2, Z3
X = ( Z[0,:] + Z[1,:] ) / np.sqrt(2)
Y = ( Z[1,:] + Z[2,:] ) / np.sqrt(2)
fig, ax = plt.subplots()
ax.scatter(X, Y); ax.set_aspect(1)
ax.set_xlabel("X"); ax.set_ylabel("Y");
Modify the example so that $\cov[X, Y] = - 1/2$ (and so the plot tilts the other way).
Z = rng.normal(size=(3, 2000))
X = ( Z[0,:] + Z[1,:] ) / np.sqrt(2)
Y = ( Z[2,:] - Z[1,:] ) / np.sqrt(2)
fig, ax = plt.subplots()
ax.scatter(X, Y); ax.set_aspect(1)
ax.set_xlabel("X"); ax.set_ylabel("Y");
print(f"Covariance: {np.cov(X, Y)}")
Covariance: [[ 1.03834308 -0.5120863 ]
 [-0.5120863   1.03530641]]

And indeed, $\cov[X, Y] = \frac{1}{2} \E[(Z_1 + Z_2)(Z_3 - Z_2)] = -\frac{1}{2} \E[Z_2^2] = -\frac{1}{2}$, since all the cross terms have mean zero.
Suppose we have a bunch of data, like $$ \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1k} \\ X_{21} & X_{22} & \cdots & X_{2k}\\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & \cdots & \cdots &X_{nk} \end{bmatrix} $$ where $$\begin{aligned} X_{i \cdot} &= \text{(one observation)} \\ X_{\cdot j} &= \text{(one variable)} . \end{aligned}$$
How do we describe relationships between the variables?
The sample covariance matrix of a dataset is $$ C_{jk} = \cov[X_{\cdot j}, X_{\cdot k}] $$ (and the sample covariance is computed just like the sample variance).
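For example, here is a sketch of the sample covariance of two columns (x and y are hypothetical names for any two columns of the data):

def sample_cov(x, y):
    # same (n - 1) denominator as the usual sample variance
    n = len(x)
    return np.sum((x - np.mean(x)) * (y - np.mean(y))) / (n - 1)

This agrees with the off-diagonal entry of np.cov(x, y).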
Say we have measurements of lots of trees, for: age, height, number of leaves, and number of other trees nearby.
Which correlations will be positive? Negative?
Here is a hopefully interesting but not terribly realistic simulation:
n = 2000
age = rng.exponential(scale=100, size=n)
# height has mean equal to age, with more variability for older trees
height = rng.gamma(shape=np.sqrt(age), scale=np.sqrt(age), size=n)
# leaf number scales with height
leaves = np.round(rng.normal(loc=height * 1000, scale=height * 400, size=n))
# taller trees have fewer neighbors
others = rng.poisson(25 * np.pi / 4 * np.exp(-height / 100))
X = np.column_stack([age, height, leaves, others])
np.corrcoef(X.T)
array([[ 1.        ,  0.93766606,  0.80667024, -0.73721968],
       [ 0.93766606,  1.        ,  0.8734101 , -0.74948589],
       [ 0.80667024,  0.8734101 ,  1.        , -0.64564577],
       [-0.73721968, -0.74948589, -0.64564577,  1.        ]])
fig, axes = plt.subplots(4, 4)
names = ['age', 'height', 'leaves', 'others']
for i in range(4):
    for j in range(4):
        ax = axes[i][j]
        if i == j:
            ax.hist(X[:, i], bins=pretty(X[:, i], 40))
            ax.set_xlabel(names[i])
        else:
            ax.scatter(X[:, j], X[:, i])
            ax.set_xlabel(names[j])
            ax.set_ylabel(names[i])
plt.tight_layout()
Exercise: Come up with at least three variables that are correlated, including at least one negative correlation, and check your neighbor agrees.
However: how can we produce a particular covariance matrix? Here's one way...
First, recall some facts about the Normal: if $$ X_i \sim \text{Normal}(\text{mean}=\mu_i, \text{sd}=\sigma_i), \qquad 1 \le i \le k, $$ are independent, and $a$ and $b$ are nonrandom, then:
$X_1 + \cdots + X_k$ is Normal, with mean $\mu_1 + \cdots + \mu_k$ and SD $\sqrt{\sigma_1^2 + \cdots + \sigma_k^2}$.
$a X_i + b$ is Normal($a \mu_i + b$, $|a| \sigma_i$). (For a quick simulation check of the first fact, see below.)
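Here's that check, a minimal sketch with arbitrarily chosen means and SDs:

mu = np.array([1.0, -2.0, 0.5])     # arbitrary means
sigma = np.array([1.0, 2.0, 0.5])   # arbitrary SDs
S = rng.normal(loc=mu, scale=sigma, size=(100_000, 3)).sum(axis=1)
print(S.mean(), mu.sum())                     # both close to -0.5
print(S.std(), np.sqrt(np.sum(sigma ** 2)))   # both close to sqrt(5.25)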
So: if $Z_1, \ldots, Z_k$ are independent Normal(0,1) and $A$ is an $k \times k$ matrix, then $$ X = AZ $$ is a $k$-dimensional random variable and $$ X \sim \text{Normal}(0, A A^T) .$$
In other words, $$ \cov[ X_i, X_j ] = \cov[(AZ)_i, (AZ)_j] = \left(A A^T \right)_{ij} = \sum_\ell A_{i \ell} A_{j \ell} . $$
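We can check this numerically, too; here's a sketch with an arbitrary matrix $A$ (the particular entries don't matter):

A = np.array([[1.0, 0.0], [2.0, 3.0]])   # an arbitrary 2x2 matrix
Z = rng.normal(size=(2, 100_000))        # each column is one draw of Z
X = A @ Z                                # so each column of X is one draw of AZ
print(np.cov(X))    # close to A @ A.T
print(A @ A.T)      # [[1, 2], [2, 13]]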
So, here's a recipe to simulate from the multivariate Normal $$ X \sim \text{Normal}(\text{mean}=a, \text{cov}=C) : $$ find a matrix $A$ with $A A^T = C$ (for instance, by Cholesky decomposition), simulate $Z$ as a vector of independent Normal(0, 1)s, and let $X = a + A Z$.
Simulate 1,000 draws from a Normal with mean 0 and covariance matrix $$ C = \begin{bmatrix} 1 & 0.8 & -0.4 \\ 0.8 & 1 & -0.8 \\ -0.4 & -0.8 & 1 \end{bmatrix} . $$
C = [[1, .8, -.4], [.8, 1, -.8], [-.4, -.8, 1]]
A = np.linalg.cholesky(C)        # lower-triangular A with A @ A.T == C
Z = rng.normal(size=(1000, 3))   # each row is an independent standard Normal vector
X = Z.dot(A.T)                   # so each row of X is one draw of AZ
fig, axes = plt.subplots(1, 3)
for (i, j), ax in zip([(0, 1), (0, 2), (1, 2)], axes):
    ax.scatter(X[:, i], X[:, j])
    ax.set_xlabel(i); ax.set_ylabel(j)
np.cov(X.T)
array([[ 0.97467511,  0.77923922, -0.39828824],
       [ 0.77923922,  0.99191073, -0.80625511],
       [-0.39828824, -0.80625511,  1.01761391]])
Simulate 1,000 draws from a Normal with mean $\mu = (10, 20, 30)$ and covariance matrix $$ C = \begin{bmatrix} 1 & -1 & -1 \\ -1 & 3 & -3 \\ -1 & -3 & 10 \end{bmatrix} . $$ What is its correlation matrix?
C = [[1, -1, -1], [-1, 3, -3], [-1, -3, 10]]
A = np.linalg.cholesky(C)
X = rng.normal(size=(1000, 3)).dot(A.T) + np.array([10, 20, 30])
fig, axes = plt.subplots(1, 3)
for (i, j), ax in zip([(0, 1), (0, 2), (1, 2)], axes):
    ax.scatter(X[:, i], X[:, j])
    ax.set_xlabel(i); ax.set_ylabel(j)
np.cov(X.T)
array([[ 1.0502733 , -1.10324987, -0.99719679],
       [-1.10324987,  3.17132212, -2.93232584],
       [-0.99719679, -2.93232584,  9.83556831]])
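To get the correlation matrix, divide each covariance by the product of the corresponding SDs: $\text{cor}_{ij} = C_{ij} / \sqrt{C_{ii} C_{jj}}$. A quick computation, plus an empirical check from the simulated draws:

C = np.array([[1, -1, -1], [-1, 3, -3], [-1, -3, 10]])
sds = np.sqrt(np.diag(C))
print(C / np.outer(sds, sds))   # theoretical correlation matrix
print(np.corrcoef(X.T))         # empirical version from the simulation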