Correlation and covariance¶

Peter Ralph

https://uodsci.github.io/dsci345

In [1]:
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (15, 8)
import numpy as np
import pandas as pd
from dsci345 import pretty
rng = np.random.default_rng()

$$\renewcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\text{var}} \newcommand{\sd}{\text{sd}} \newcommand{\cov}{\text{cov}}$$ This is here so we can use \P and \E and \var and \cov and \sd in LaTeX below.

Covariance, and correlation¶

Definitions¶

For random variables $X$ and $Y$, $$\begin{aligned} \cov[X, Y] &= \E[(X - \E[X])(Y - \E[Y])] \\ &= \E[XY] - \E[X] \E[Y] , \end{aligned}$$ and $$ \text{cor}[X, Y] = \frac{ \cov[X, Y] }{ \sd[X] \sd[Y] } $$
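As a quick sanity check on these definitions (a minimal sketch; the variables and seed here are illustrative), we can estimate both quantities from simulated draws and compare with NumPy's built-ins:

```python
import numpy as np

rng = np.random.default_rng(123)
n = 100_000
x = rng.normal(size=n)
y = x + rng.normal(size=n)   # y shares a component with x, so cov > 0

# covariance from the definition: E[XY] - E[X] E[Y]
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)
# correlation: cov / (sd * sd)
cor_xy = cov_xy / (np.std(x) * np.std(y))

print(cov_xy)                    # near cov[X, Y] = 1
print(cor_xy)                    # near 1 / sqrt(2) ≈ 0.707
print(np.corrcoef(x, y)[0, 1])   # NumPy agrees
```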

Properties:

If $a$ is nonrandom and $Z$ is another random variable, then $$\begin{aligned} \text{(constants)} \qquad & \cov[a, X] = 0 \\ \text{(bilinearity)} \qquad & \cov[a X + Y, Z] = a \; \cov[X, Z] + \cov[Y, Z] . \end{aligned}$$
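These properties can be checked numerically; here's a sketch of the bilinearity identity using simulated, deliberately correlated variables (the names and the constant $a = 3$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
z = rng.normal(size=n)
x = z + rng.normal(size=n)   # correlated with z: cov[X, Z] = 1
y = z + rng.normal(size=n)   # also correlated with z: cov[Y, Z] = 1
a = 3.0

def cov(u, v):
    # sample version of E[UV] - E[U] E[V]
    return np.mean(u * v) - np.mean(u) * np.mean(v)

lhs = cov(a * x + y, z)
rhs = a * cov(x, z) + cov(y, z)
print(lhs, rhs)   # equal up to float roundoff; both near a*1 + 1 = 4
```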

Example¶

Let $Z_1$, $Z_2$, and $Z_3$ be independent Normal(0, 1), and $$\begin{aligned} X &= \frac{1}{\sqrt{2}}\left( Z_1 + Z_2 \right) \\ Y &= \frac{1}{\sqrt{2}}\left( Z_2 + Z_3 \right) . \end{aligned}$$ Then $$\begin{aligned} X &\sim \text{Normal}(0, 1) , \\ Y &\sim \text{Normal}(0, 1) , \end{aligned}$$ and $$\begin{aligned} \cov[X, Y] &= \frac{1}{2} \E[(Z_1 + Z_2)(Z_2 + Z_3)] \\ &= \frac{1}{2} . \end{aligned}$$
(The middle term $\E[Z_2^2] = 1$ is all that survives, since the other cross terms have mean zero by independence.)

In [2]:
Z = rng.normal(size=(3, 2000))
X = ( Z[0,:] + Z[1,:] ) / np.sqrt(2)
Y = ( Z[1,:] + Z[2,:] ) / np.sqrt(2)
for k in range(10):
    print(np.round(X[k]), np.round(Y[k]))
-1.0 -0.0
1.0 1.0
0.0 -1.0
2.0 1.0
2.0 1.0
-1.0 -0.0
-1.0 0.0
-0.0 0.0
0.0 0.0
-0.0 2.0

Plotting $X$ and $Y$ against each other, we get a roughly elliptical cloud that goes from lower-left to upper-right.

In [3]:
# Cov Plot 1
fig, ax = plt.subplots()
ax.scatter(X, Y); ax.set_aspect(1)
ax.set_xlabel("X"); ax.set_ylabel("Y");
In [4]:
fig, ax = plt.subplots()
ax.scatter(Z[0,:], Z[1,:]); ax.set_aspect(1)
ax.set_xlabel("Z1"); ax.set_ylabel("Z2");
In [5]:
np.corrcoef(X, Y)
Out[5]:
array([[1.        , 0.50856107],
       [0.50856107, 1.        ]])

Exercise:¶

Modify the example so that $\cov[X, Y] = - 1/2$ (and so the plot tilts the other way).

In [6]:
X = ( Z[0,:] + Z[1,:] ) / np.sqrt(2)
Y = ( - Z[1,:] + Z[2,:] ) / np.sqrt(2)
np.corrcoef(X,Y)
Out[6]:
array([[ 1.        , -0.47777883],
       [-0.47777883,  1.        ]])

Multivariate data¶

Suppose we have a bunch of data, like $$ \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1k} \\ X_{21} & X_{22} & \cdots & X_{2k}\\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{nk} \end{bmatrix} $$ where $$\begin{aligned} X_{i \cdot} &= \text{(one observation)} \\ X_{\cdot j} &= \text{(one variable)} . \end{aligned}$$

How do we describe relationships between the variables?

The sample covariance matrix of a dataset is $$ C_{jk} = \cov[X_{\cdot j}, X_{\cdot k}] $$ (where the sample covariance is computed just like the sample variance, dividing by $n-1$).
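Concretely, the sample covariance matrix can be computed by centering each column and averaging products of columns; a sketch with made-up data, checked against `np.cov` (which expects variables in *rows*, hence the transpose):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))   # n = 500 observations of k = 3 variables
n, k = X.shape

Xc = X - X.mean(axis=0)         # center each variable (column)
C = (Xc.T @ Xc) / (n - 1)       # sample covariance, dividing by n - 1

print(np.round(C, 3))
print(np.allclose(C, np.cov(X.T, ddof=1)))   # True
```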

Example¶

Say we have measurements of lots of trees, for:

  1. age,
  2. height,
  3. number of leaves, and
  4. number of other trees within 5m.

Which correlations will be positive? Negative?

Exercise: make a model for this situation.

In [7]:
n = 1000
age = rng.gamma(shape=4, scale=200, size=n)

fig, ax = plt.subplots()
ax.hist(age)
Out[7]:
(array([ 91., 275., 248., 164., 120.,  61.,  24.,   8.,   6.,   3.]),
 array([  71.20486809,  332.48106486,  593.75726164,  855.03345841,
        1116.30965518, 1377.58585195, 1638.86204872, 1900.1382455 ,
        2161.41444227, 2422.69063904, 2683.96683581]),
 <BarContainer object of 10 artists>)
In [8]:
def h(a):
    u = a / 300
    return 30 * np.sqrt(u)

avals = np.linspace(0, 2000, 101)
hvals = h(avals)

fig, ax = plt.subplots()
ax.plot(avals, hvals)
ax.set_xlabel("age (y)")
ax.set_ylabel("height (m)");
In [9]:
# np.fmin takes the elementwise minimum; np.fmax (used below) the maximum
np.fmin(1, [0,23,5,1])
Out[9]:
array([0, 1, 1, 1])
In [10]:
expected_height = h(age)
height = np.fmax(1, rng.normal(
    loc=expected_height,
    scale=expected_height/20,
    size=n
))

fig, ax = plt.subplots()
ax.scatter(age, height)
ax.plot(avals, hvals, label='expected relationship', c='red')
ax.legend()
ax.set_xlabel("age (y)")
ax.set_ylabel("height (m)");
In [11]:
leaves = rng.poisson(lam=height, size=n)
others = rng.poisson(
    lam = 20 * np.exp(-height/50),
    size=n,
)
In [12]:
X = np.column_stack([age, height, leaves, others])
In [13]:
X.shape
Out[13]:
(1000, 4)
In [14]:
fig, axes = plt.subplots(4, 4)
names = ['age', 'height', 'leaves', 'others']
for i in range(4):
    for j in range(4):
        ax = axes[i][j]
        if i == j:
            ax.hist(X[:,i], bins=pretty(X[:,i], 40)); ax.set_xlabel(names[i])
        else:
            ax.scatter(X[:,j], X[:,i]); ax.set_xlabel(names[j]); ax.set_ylabel(names[i])
            
plt.tight_layout()
In [15]:
np.corrcoef(X.T)
Out[15]:
array([[ 1.        ,  0.96805361,  0.86236693, -0.54441135],
       [ 0.96805361,  1.        ,  0.89097346, -0.57344857],
       [ 0.86236693,  0.89097346,  1.        , -0.50928994],
       [-0.54441135, -0.57344857, -0.50928994,  1.        ]])

The Multivariate Normal distribution¶

However: how can we produce a particular covariance matrix? Here's one way...

First, some facts about the Normal: if $$ X_i \sim \text{Normal}(\text{mean}=\mu_i, \text{sd}=\sigma_i), \qquad 1 \le i \le k, $$ are independent, and $a$ and $b$ are nonrandom, then:

  1. $X_1 + \cdots + X_k$ is Normal, with mean $\mu_1 + \cdots + \mu_k$ and SD $\sqrt{\sigma_1^2 + \cdots + \sigma_k^2}$.

  2. $a X_i + b$ is Normal($a \mu_i + b$, $|a| \sigma_i$).

In other words, sums of independent Normals are Normal, with means and variances adding.
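A quick simulation illustrates both facts (the particular means, SDs, and constants here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
x1 = rng.normal(loc=1.0, scale=2.0, size=n)
x2 = rng.normal(loc=3.0, scale=1.5, size=n)

# fact 1: X1 + X2 is Normal(1 + 3, sqrt(2**2 + 1.5**2)) = Normal(4, 2.5)
s = x1 + x2
print(s.mean(), s.std())   # near 4 and 2.5

# fact 2: a*X1 + b is Normal(a*mu + b, |a|*sigma)
a, b = 2.0, 5.0
t = a * x1 + b
print(t.mean(), t.std())   # near 2*1 + 5 = 7 and 2*2 = 4
```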

So: if $Z_1, \ldots, Z_k$ are independent Normal(0,1) and $A$ is an $k \times k$ matrix, then $$ X = AZ $$ is a $k$-dimensional random variable and $$ X \sim \text{Normal}(0, A A^T) .$$

In other words, $$ \cov[ X_i, X_j ] = \cov[(AZ)_i, (AZ)_j] = \left(A A^T \right)_{ij} = \sum_\ell A_{i \ell} A_{j \ell} . $$

So, here's a recipe to simulate from the multivariate Normal: $$ X \sim \text{Normal}(\text{mean}=a, \text{cov}=C) .$$

  1. Let $A$ be the Cholesky factor of $C$ (so $C = A A^T$).
  2. Choose $Z$ to be a vector of independent Normal(0, 1).
  3. Let $X = a + AZ$.

Example:¶

Simulate 1,000 draws from a Normal with mean 0 and covariance matrix $$ C = \begin{bmatrix} 1 & 0.8 & -0.4 \\ 0.8 & 1 & -0.8 \\ -0.4 & -0.8 & 1 \end{bmatrix} . $$

In [16]:
n = 1000
C = np.array([[1, 0.8, -0.4],
             [0.8, 1, -0.8],
             [-0.4, -0.8, 1]])
A = np.linalg.cholesky(C)
A
Out[16]:
array([[ 1.       ,  0.       ,  0.       ],
       [ 0.8      ,  0.6      ,  0.       ],
       [-0.4      , -0.8      ,  0.4472136]])
In [17]:
np.matmul(A, A.T)
Out[17]:
array([[ 1. ,  0.8, -0.4],
       [ 0.8,  1. , -0.8],
       [-0.4, -0.8,  1. ]])
In [18]:
# one draw
Z = rng.normal(loc=0, scale=1, size=3)
X = np.matmul(A, Z)
X
Out[18]:
array([-0.31098471, -1.20605552,  1.02279543])
In [19]:
Z = rng.normal(loc=0, scale=1, size=(3, n))
X = np.matmul(A, Z)
print("shape", X.shape)
print("covmat:", np.cov(X, ddof=1))
shape (3, 1000)
covmat: [[ 1.03920227  0.78215934 -0.3714866 ]
 [ 0.78215934  0.96968754 -0.79000986]
 [-0.3714866  -0.79000986  1.02308682]]
In [20]:
fig, (ax0, ax1, ax2) = plt.subplots(1, 3)
ax0.scatter(X[0,:], X[1,:])
ax1.scatter(X[0,:], X[2,:])
ax2.scatter(X[1,:], X[2,:])
Out[20]:
<matplotlib.collections.PathCollection at 0x7f5e2439ca40>

Exercise:¶

Simulate 1,000 draws from a Normal with mean $\mu = (10, 20, 30)$ and covariance matrix $$ C = \begin{bmatrix} 1 & -1 & -1 \\ -1 & 3 & -3 \\ -1 & -3 & 10 \end{bmatrix} . $$ What is its correlation matrix?

In [ ]:
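One possible approach, following the recipe above (a sketch of a solution, not the only one):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
mu = np.array([10.0, 20.0, 30.0])
C = np.array([[ 1., -1., -1.],
              [-1.,  3., -3.],
              [-1., -3., 10.]])

A = np.linalg.cholesky(C)       # C = A A^T
Z = rng.normal(size=(3, n))     # independent Normal(0, 1)
X = mu[:, np.newaxis] + A @ Z   # each column is one draw

print(np.round(np.cov(X, ddof=1), 2))   # close to C

# the correlation matrix rescales C by the standard deviations:
D = np.diag(1 / np.sqrt(np.diag(C)))
print(np.round(D @ C @ D, 3))
```

The correlation matrix can also be read off from $C$ directly, since $\text{cor}_{ij} = C_{ij} / \sqrt{C_{ii} C_{jj}}$, which is exactly what the rescaling by $D$ computes.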