Instructions: Please answer the following questions and submit your work by editing this jupyter notebook and submitting it on Canvas. Questions may involve math, programming, or neither, but you should make sure to explain your work: i.e., you should usually have a cell with at least a few sentences explaining what you are doing.
Also, please be sure to always specify units of any quantities that have units, and label axes of plots (again, with units when appropriate).
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng()  # random number generator used for the simulations below
Consider the matrix $$ A = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 \end{bmatrix} . $$ Let $Z$ be a vector of five independent draws from the Normal(0, 1) distribution. What is the covariance matrix of $$ X = A Z ? $$ Explain, and check by simulation.
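For the simulation check, one possible approach is a sketch like the following (using the `rng` from the setup cell above): draw many independent replicates of $Z$, form $X = AZ$ for each, and compare the empirical covariance matrix of the simulated $X$'s to whatever your theoretical derivation predicts.

A = np.tril(np.ones((5, 5)))   # A: 5x5 lower-triangular matrix of ones
n_reps = 100_000
Z = rng.standard_normal((n_reps, 5))   # each row: five independent Normal(0, 1) draws
X = Z @ A.T                            # each row: one draw of X = A Z
print(np.round(np.cov(X, rowvar=False), 2))  # empirical covariance; compare to your derivation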
Consider the following model: $$\begin{aligned} U &= \begin{cases} 0 \qquad &\text{with probability 1/2} \\ 1 \qquad &\text{with probability 1/2} \end{cases} \\ X_j &\sim \text{Normal}\left( \text{mean}= U, \text{sd}=7/5 \right) \qquad \text{for}\quad 1 \le j \le 50 . \end{aligned}$$ (In words: $X$ is a 50-dimensional vector of independent draws from a Normal distribution; these draws all have the same mean, $U$; and this mean is either 0 or 1, with probability 1/2 each.)
(a) Simulate 1,000 independent samples from this model; the result should be an array of shape `(1000, 50)`.
(Note: each row should have its own, independent, simulated value of $U$.)
Treat this as a matrix of data with 1,000 observations of 50 variables. A sketch of one way to do the simulation appears below.
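A minimal simulation sketch (variable names here are illustrative, not required), relying on the fact that `rng.normal` broadcasts an array of means against the requested output shape:

n_obs, n_vars = 1000, 50
U = rng.integers(0, 2, size=n_obs)   # one independent U per row: 0 or 1 with probability 1/2 each
X = rng.normal(loc=U[:, np.newaxis], scale=7/5, size=(n_obs, n_vars))
print(X.shape)  # (1000, 50)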
(b) Plot some of these "variables" against each other, colored by the value of $U$.
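For instance (a sketch, assuming `X` and `U` as simulated above; the choice of variable pairs is arbitrary):

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (i, j) in zip(axes, [(0, 1), (2, 10), (5, 40)]):
    ax.scatter(X[:, i], X[:, j], c=U, cmap="coolwarm", s=5)  # color = value of U
    ax.set_xlabel(f"variable {i}")
    ax.set_ylabel(f"variable {j}")
fig.tight_layout()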
(c) Carry out principal components analysis for these data, and show the scree plot, the positions of the 1,000 data points on the first two PCs, and the loadings of the 50 variables on these two PCs.
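One possible way to produce those three plots, sketched here with scikit-learn (one tool among several; the code from class would work too):

from sklearn.decomposition import PCA

pca = PCA()
scores = pca.fit_transform(X)   # shape (1000, 50): positions of the data points on the PCs

plt.plot(np.arange(1, n_vars + 1), pca.explained_variance_, "o-")   # scree plot
plt.xlabel("principal component")
plt.ylabel("variance explained")
plt.show()

plt.scatter(scores[:, 0], scores[:, 1], c=U, cmap="coolwarm", s=5)  # points on the first two PCs
plt.xlabel("position on PC1")
plt.ylabel("position on PC2")
plt.show()

plt.scatter(pca.components_[0], pca.components_[1])   # components_ has shape (number of PCs, 50)
plt.xlabel("loading on PC1")
plt.ylabel("loading on PC2")
plt.show()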
In class, we did PCA on word count data from passages from three books.
The passages are in the file `data/passages.txt`, and the sources of each passage are in `data/passage_sources.tsv`. You may use the same code from class to read in the data and produce the matrix of word counts (you'll want `wordmat`, `words`, and `sources`).
(a) Divide each row of the word count matrix by its standard deviation, so that each row has standard deviation equal to 1.
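For example (a sketch, assuming `wordmat` is a NumPy array whose rows are the rows the question refers to):

row_sds = wordmat.std(axis=1)                        # standard deviation of each row
wordmat_scaled = wordmat / row_sds[:, np.newaxis]    # each row now has standard deviation 1
# (any rows with standard deviation 0, if present, would need to be dropped first)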
(b) Find the first three PCs of this matrix, as we did in class, but using scikit-learn, and plot them. Your plots should look similar to, but not the same as, those from class, since scikit-learn's implementation differs somewhat and omits PC1.
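A minimal sketch of the scikit-learn call (assuming the scaled matrix from part (a)):

from sklearn.decomposition import PCA

pca = PCA(n_components=3)
pcs = pca.fit_transform(wordmat_scaled)   # scores on the first three PCs
print(pcs.shape, pca.components_.shape)   # the shapes tell you which axis indexes words and which indexes PCs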
(c) Show the lists of the fifty words with the largest positive loadings and the fifty words with the largest negative loadings on PC3. Speculate about what makes a passage have a large or a small score on PC3.
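One way to pull out those word lists (a sketch, assuming `loadings_pc3` is a vector with one PC3 loading per entry of `words`; which output that is depends on the orientation of your matrix):

order = np.argsort(loadings_pc3)   # sorts from most negative to most positive
print("most negative loadings:", [words[i] for i in order[:50]])
print("most positive loadings:", [words[i] for i in order[-50:][::-1]])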
Note: part of this question is to figure out how the output of another method maps onto what we discussed in class. Big clues are provided by the sizes (shapes) of the various outputs.