In [None]:
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (15, 8)
import numpy as np
import pandas as pd

rng = np.random.default_rng()

# Fun with Words (text analysis with PCA)

Download:
- [passages.txt](https://github.com/UOdsci/dsci345/raw/main/class_material/slides/data/passages.txt)
- [passage_sources.tsv](https://github.com/UOdsci/dsci345/raw/main/class_material/slides/data/passage_sources.tsv)

and put them in the subdirectory `data/` of wherever this notebook is.

The file `passages.txt` contains short passages from a few different books.

Can we identify the author of each passage?

The true sources of the passages are in `passage_sources.tsv`.

## Turn the data into a matrix

In [None]:
from collections import defaultdict

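# read the whole file and split it into lines;
# [:-1] drops the final element (whatever follows the last newline)
with open("data/passages.txt", "r") as pfile:
    passages = pfile.read().split("\n")[:-1]
sources = pd.read_table("data/passage_sources.tsv")
# sorted array of the distinct space-separated tokens; [1:] skips the first one
words = np.unique(" ".join(passages).split(" "))[1:]

def tabwords(x, words):
    # count how many times each token appears in the passage x ...
    d = defaultdict(int)
    for w in x.split(" "):
        d[w] += 1
    # ... and return the counts in the same order as `words`
    out = np.array([d[w] for w in words])
    return out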

wordmat = np.array([tabwords(x, words) for x in passages])

**(1)** Look at the data: what is `passages`? How about `wordmat`?
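To get a feel for the shape of things before looking at the real data, here is a small sketch of the same word-counting step on two made-up passages. The names `toy_passages`, `toy_words`, and `toy_wordmat` are hypothetical and not part of the notebook; `tabwords` is copied from above.

```python
from collections import defaultdict
import numpy as np

# two made-up passages (NOT from passages.txt)
toy_passages = ["the cat sat", "the dog sat on the cat"]
# sorted array of distinct words across all toy passages
toy_words = np.unique(" ".join(toy_passages).split(" "))

def tabwords(x, words):
    # count occurrences of each word in x, ordered as in `words`
    d = defaultdict(int)
    for w in x.split(" "):
        d[w] += 1
    return np.array([d[w] for w in words])

# one row per passage, one column per word
toy_wordmat = np.array([tabwords(x, toy_words) for x in toy_passages])
print(toy_words)          # ['cat' 'dog' 'on' 'sat' 'the']
print(toy_wordmat.shape)  # (2, 5)
print(toy_wordmat)
```

Each row of the matrix is a passage and each column is a word, so the entry in row `i`, column `j` is the number of times word `j` appears in passage `i`.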

## Do PCA

This time we "do PCA" by finding the "singular value decomposition"
(SVD) of the data matrix,
because `scipy.sparse.linalg.svds` lets us *only*
find the PCs we're interested in:
finding *all* would take waaaay too long.
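As a rough check of this idea, the sketch below uses a small made-up random matrix (not the passage data) to show that `svds` with `k=3` recovers the same three largest singular values as a full SVD, and that their squares are the top eigenvalues of `a.T @ a`. The matrix `a` and the other variable names here are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(123)
a = rng.normal(size=(30, 12))  # made-up data: 30 rows, 12 columns

# only the top 3 singular triplets; sort them largest-first ourselves
u, s, vt = svds(a, k=3)
order = np.argsort(s)[::-1]
u, s, vt = u[:, order], s[order], vt[order, :]

# the full SVD returns all 12 singular values, largest first
full_s = np.linalg.svd(a, compute_uv=False)
print(np.allclose(s, full_s[:3]))

# squared singular values are the largest eigenvalues of a.T @ a
top_evals = np.linalg.eigvalsh(a.T @ a)[::-1][:3]
print(np.allclose(s**2, top_evals))
```

On a matrix this small the full decomposition is cheap, so the comparison is easy to do; the savings only matter when there are many columns.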

In [None]:
from scipy.sparse.linalg import svds
# center and scale the data
x = wordmat - np.mean(wordmat, axis=1)[:,np.newaxis]
x /= np.std(x, axis=1)[:, np.newaxis]
pcs, evals, evecs = svds(x, k=4)
eord = np.argsort(evals)[::-1]
evals = evals[eord]
evecs = evecs[eord,:]
pcs = pcs[:,eord]

**(2)** Describe in words what each of the lines in the code above does.

## The loadings

Here are the "loadings" of the PCs:

In [None]:
loadings = pd.DataFrame(evecs.T, columns=[f"PC{k}" for k in range(1,5)], index=words)
loadings

## The PCs

The principal components are the columns of `pcs` (e.g., the first PC is `pcs[:,0]`).

**(3)** Make scatterplots of the position of each passage on PCs 1, 2, and 3 against each other,
colored by the `source` of the passage.

## What do the PCs "mean"?

For instance, here are the 20 words that most strongly move a passage in the positive direction on PC1:
```
loadings.sort_values("PC1").tail(20)
```
and the 20 that most strongly move a passage in the negative direction:
```
loadings.sort_values("PC1").head(20)
```
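To look at several PCs at once, one option is to wrap the two calls above in a small helper. This is only a sketch: `extreme_words` is a hypothetical function, and `toy_loadings` is a made-up table standing in for the real `loadings`.

```python
import pandas as pd

# made-up loadings for four words on two PCs (NOT real results)
toy_loadings = pd.DataFrame(
    {"PC1": [0.5, -0.7, 0.1, 0.2], "PC2": [-0.1, 0.3, 0.9, -0.4]},
    index=["whale", "love", "sea", "dear"],
)

def extreme_words(loadings, pc, n=2):
    # hypothetical helper: the n most positive and n most negative words on one PC
    s = loadings[pc].sort_values()
    most_positive = list(s.tail(n).index[::-1])  # largest loading first
    most_negative = list(s.head(n).index)        # most negative loading first
    return most_positive, most_negative

for pc in toy_loadings.columns:
    pos, neg = extreme_words(toy_loadings, pc)
    print(pc, "positive:", pos, "negative:", neg)
```

With the real data, the same call would be `extreme_words(loadings, "PC1", n=20)`, and likewise for the other PCs.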

**(4)** Come up with interpretations of PC1, PC2, PC3, and PC4.