Working with text

Author

Peter Ralph

Published

February 9, 2026

A bit more with the Federalist Papers

We’re not yet done here, because we’d still like to know: Who wrote the disputed papers?

So far, we’ve found that it’s not “completely obvious” (i.e., attribution doesn’t pop out of an unguided dimension reduction for Hamilton vs Madison, although it does for Jay).

The data, again

import json, re, collections
import pandas as pd
import numpy as np
import plotnine as p9

with open("data/federalist.json", 'r') as f:
    text = [json.loads(line) for line in f]

info = pd.DataFrame(
    { k: [t[k] for t in text] for k in ['author', 'date', 'title', 'paper_id', 'venue']}
).assign(length = [len(t['text'].split(" ")) for t in text])

spaCy

But this time

…we’ll use spaCy, installable via pip and available under the MIT license.

Here’s a guide to using spaCy.

Note: spaCy is currently only available with Python <= 3.13.

Let’s look at:

What this looks like in practice

First, install spaCy (Python <= 3.13!) and import it:

import spacy

Then, download a pre-trained model:

!python -m spacy download en_core_web_sm

And use it:

nlp = spacy.load("en_core_web_sm")
doc = nlp(text[0]['text'])
doc[:100]
To the People of the State of New York:

AFTER an unequivocal experience of the inefficacy of the
subsisting federal government, you are called upon to deliberate on
a new Constitution for the United States of America. The subject
speaks its own importance; comprehending in its consequences
nothing less than the existence of the UNION, the safety and welfare
of the parts of which it is composed, the fate of an empire in many
respects the most interesting in the world. It has been frequently

What this gets us

tl;dr: lots of stuff

for token in doc[:10]:
    print(f"{token.text}\t{token.pos_}\t{token.dep_}")
To      ADP     prep
the     DET     det
People  NOUN    pobj
of      ADP     prep
the     DET     det
State   PROPN   pobj
of      ADP     prep
New     PROPN   compound
York    PROPN   pobj
:       PUNCT   punct

Back to the data

Which words are used differently?

Mosteller and Wallace found words they thought might be used differently (e.g., “upon”) and then used these to construct a statistical test. Let’s have a look.

But wait: what is “a word”? For instance: “right” can be several parts of speech.

Plan:

  1. Count up how many times each “word” (defined to be a (text, part-of-speech) pair) appears in each author’s writing.

  2. Pull out the ones that differ a lot.

  3. Visualize the texts using just those words.
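Steps 1 and 2 of the plan can be sketched on a tiny example. The tokens below are made up and hand-tagged purely for illustration (in the real analysis the lemmas and tags come from spaCy), and the authors "A" and "B" are hypothetical. The point is only that ("right", "NOUN") and ("right", "ADJ") are counted as different "words".

```python
import collections

# Made-up, hand-tagged tokens: (lemma, part of speech, author).
tokens = [
    ("right", "NOUN", "A"), ("right", "NOUN", "A"),
    ("right", "ADJ", "A"), ("upon", "ADP", "A"),
    ("right", "ADJ", "B"), ("right", "ADJ", "B"),
    ("upon", "ADP", "B"), ("upon", "ADP", "B"),
]

# Step 1: count each (lemma, part-of-speech) pair separately for each author.
counts = collections.defaultdict(collections.Counter)
for lemma, pos, author in tokens:
    counts[author][(lemma, pos)] += 1

# Step 2: compare relative frequencies between the two authors.
totals = {author: sum(c.values()) for author, c in counts.items()}
all_words = set(counts["A"]) | set(counts["B"])
diff = {
    w: counts["A"][w] / totals["A"] - counts["B"][w] / totals["B"]
    for w in all_words
}

# The pair whose relative frequency differs the most between authors.
most_different = max(diff, key=lambda w: abs(diff[w]))
print(most_different, diff[most_different])
```

Here the difference in relative frequency is just a stand-in for "differs a lot"; any other measure of difference could be swapped in at step 2.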

Counting the words

First, set up the list of words:

# Parse every essay with spaCy (this can take a little while).
docs = [nlp(t['text']) for t in text]
# A "word" is a (lowercased lemma, part-of-speech) pair; skip punctuation and whitespace.
wordlist = [
    (token.lemma_.lower(), token.pos_) for doc in docs for token in doc
    if not (token.is_punct or token.is_space)
]
counts = collections.Counter(wordlist)
words = list(counts.keys())
worddf = pd.DataFrame({
    "word" : [ w[0] for w in words ],
    "pos" : [ w[1] for w in words ],
    # look up stop-word status in the pipeline's shared vocabulary
    "is_stop" : [ nlp.vocab[w].is_stop for w, _ in words ],
    "freq" : [ counts[w] for w in words ],
}).sort_values("freq", ascending=False).reset_index(drop=True)
worddf
      word            pos    is_stop  freq
0     the             DET    True     17684
1     of              ADP    True     11796
2     be              AUX    True     8360
3     and             CCONJ  True     5080
4     in              ADP    True     4443
...   ...             ...    ...      ...
7610  relinquishment  NOUN   False    1
7611  colorable       NOUN   False    1
7612  thenceforth     VERB   False    1
7613  verbal          ADJ    False    1
7614  adverting       PROPN  False    1

7615 rows × 4 columns

Next, count by author:

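def count_words(doc, worddf):
    # Count each (lemma, part-of-speech) pair in one parsed essay...
    counts = collections.Counter([
        (token.lemma_.lower(), token.pos_) for token in doc
    ])
    # ...and return the counts in the same row order as worddf
    # (pairs that never occur in this essay get 0).
    return np.array([ counts[(w,c)] for (w,c) in zip(worddf['word'], worddf['pos']) ])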

worddf['hamilton'] = 0
worddf['madison'] = 0

for auth, doc in zip(info['author'], docs):
    if auth in ("HAMILTON", "MADISON"):
        worddf[auth.lower()] += count_words(doc, worddf)

worddf['Hf'] = worddf['hamilton'] / worddf['hamilton'].sum()
worddf['Mf'] = worddf['madison'] / worddf['madison'].sum()
worddf
word pos is_stop freq hamilton madison Hf Mf
0 the DET True 17684 10403 4101 0.091813 0.099873
1 of ADP True 11796 7309 2451 0.064506 0.059690
2 be AUX True 8360 4872 1886 0.042998 0.045931
3 and CCONJ True 5080 2702 1211 0.023847 0.029492
4 in ADP True 4443 2819 865 0.024879 0.021066
... ... ... ... ... ... ... ... ...
7610 relinquishment NOUN False 1 1 0 0.000009 0.000000
7611 colorable NOUN False 1 1 0 0.000009 0.000000
7612 thenceforth VERB False 1 1 0 0.000009 0.000000
7613 verbal ADJ False 1 1 0 0.000009 0.000000
7614 adverting PROPN False 1 1 0 0.000009 0.000000

7615 rows × 8 columns

Here’s what we get:

(
    worddf
    .query("hamilton > 5 and madison > 5")
    >>
    p9.ggplot(p9.aes(x='hamilton', y='madison', color='pos'))
    + p9.geom_point()
    + p9.scale_x_log10()
    + p9.scale_y_log10()
)

Interlude: \(z\)-scores

Which words are “most different”?

Recall the chi-squared test for contingency tables. The key part of this is:

\[\begin{aligned} z_{ij} = \frac{\text{observed}_{ij} - \text{expected}_{ij}}{\sqrt{\text{expected}_{ij}}} . \end{aligned}\]

chalkboard interlude
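To make the formula concrete, here is a minimal sketch on a made-up 2 × 2 table (rows are words, columns are authors; the counts are invented for illustration). It computes the expected counts from the row and column totals, then the \(z\)-scores, and sums their squares to get the chi-squared statistic.

```python
import numpy as np

# Made-up contingency table: rows are words, columns are authors.
observed = np.array([
    [30.0, 10.0],
    [20.0, 40.0],
])

row_totals = observed.sum(axis=1, keepdims=True)   # n_{i.}
col_totals = observed.sum(axis=0, keepdims=True)   # n_{.j}
n_total = observed.sum()

# Expected counts if word choice were independent of author.
expected = row_totals * col_totals / n_total

# z_ij = (observed - expected) / sqrt(expected)
z = (observed - expected) / np.sqrt(expected)

# The chi-squared statistic is the sum of the squared z-scores.
chi2 = np.sum(z ** 2)
print(expected)
print(z)
print(chi2)
```

With this sign convention, a positive \(z_{ij}\) means that author \(j\) uses word \(i\) more often than independence would predict.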

Compute \(z\) scores

\[\begin{aligned} \text{expected}_{ij} = n_{i \cdot} \times \frac{ n_{\cdot j} }{ n_\text{total} } \end{aligned}\]

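# Note the sign: these columns are (expected - observed) / sqrt(expected),
# the negative of the formula above, so a positive value means the
# word is used LESS often than expected.
worddf['He'] = worddf['freq'] * worddf['hamilton'].sum() / worddf['freq'].sum()
worddf['Hz'] = (worddf['He'] - worddf['hamilton']) / np.sqrt(worddf['He'].values)
worddf['Me'] = worddf['freq'] * worddf['madison'].sum() / worddf['freq'].sum()
worddf['Mz'] = (worddf['Me'] - worddf['madison']) / np.sqrt(worddf['Me'].values)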
worddf
word pos is_stop freq hamilton madison Hf Mf He Hz Me Mz
0 the DET True 17684 10403 4101 0.091813 0.099873 10515.018986 1.092412 3810.600489 -4.704345
1 of ADP True 11796 7309 2451 0.064506 0.059690 7013.976700 -3.522688 2541.836879 1.801724
2 be AUX True 8360 4872 1886 0.042998 0.045931 4970.909224 1.402875 1801.437463 -1.992363
3 and CCONJ True 5080 2702 1211 0.023847 0.029492 3020.600342 5.796951 1094.653386 -3.516539
4 in ADP True 4443 2819 865 0.024879 0.021066 2641.836087 -3.446850 957.390747 2.985959
... ... ... ... ... ... ... ... ... ... ... ... ...
7610 relinquishment NOUN False 1 1 0 0.000009 0.000000 0.594606 -0.525729 0.215483 0.464201
7611 colorable NOUN False 1 1 0 0.000009 0.000000 0.594606 -0.525729 0.215483 0.464201
7612 thenceforth VERB False 1 1 0 0.000009 0.000000 0.594606 -0.525729 0.215483 0.464201
7613 verbal ADJ False 1 1 0 0.000009 0.000000 0.594606 -0.525729 0.215483 0.464201
7614 adverting PROPN False 1 1 0 0.000009 0.000000 0.594606 -0.525729 0.215483 0.464201

7615 rows × 12 columns
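A quick check of the sign convention, using only numbers copied from the first row ("the") of the table above: plugging the observed and expected counts into (observed − expected)/√expected gives values with the same magnitude as the reported Hz and Mz but the opposite sign. So in these tables a positive value means the author uses the word less than expected. The function name below is just for this sketch.

```python
import math

# Values copied from the row for "the" in the table above.
observed_h, expected_h, reported_hz = 10403, 10515.018986, 1.092412
observed_m, expected_m, reported_mz = 4101, 3810.600489, -4.704345

def pearson_residual(observed, expected):
    """(observed - expected) / sqrt(expected), as in the formula."""
    return (observed - expected) / math.sqrt(expected)

z_h = pearson_residual(observed_h, expected_h)
z_m = pearson_residual(observed_m, expected_m)

# Compare with the reported columns: same size, opposite sign.
print(z_h, reported_hz)
print(z_m, reported_mz)
```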

A few \(z\)-scores exceed 5 in absolute value, but most are much smaller. Unsurprisingly, Hz and Mz are negatively correlated:

(
    worddf.query("hamilton > 5 and madison > 1")
    >>
    p9.ggplot(p9.aes(x="Hz", y="Mz", size="freq"))
    + p9.geom_point(alpha=0.25)
)

Which words are “different”? Let’s take an arbitrary cutoff:

sub_words = (
    worddf
    .query("freq > 50 and (abs(Hz) > 3 or abs(Mz) > 3)")
    .copy()  # an independent copy, so that adding columns later is safe
)
sub_words
word pos is_stop freq hamilton madison Hf Mf He Hz Me Mz
0 the DET True 17684 10403 4101 0.091813 0.099873 10515.018986 1.092412 3810.600489 -4.704345
1 of ADP True 11796 7309 2451 0.064506 0.059690 7013.976700 -3.522688 2541.836879 1.801724
3 and CCONJ True 5080 2702 1211 0.023847 0.029492 3020.600342 5.796951 1094.653386 -3.516539
4 in ADP True 4443 2819 865 0.024879 0.021066 2641.836087 -3.446850 957.390747 2.985959
6 to PART True 3859 2522 698 0.022258 0.016999 2294.585969 -4.747501 831.548704 4.631224
... ... ... ... ... ... ... ... ... ... ... ... ...
334 term NOUN False 70 26 28 0.000229 0.000682 41.622446 2.421504 15.083807 -3.325669
373 executive NOUN False 63 23 33 0.000203 0.000804 37.460201 2.362595 13.575426 -5.271992
388 elect VERB False 59 14 14 0.000124 0.000341 35.081776 3.559315 12.713494 -0.360811
440 faction NOUN False 51 26 21 0.000229 0.000511 30.324925 0.785378 10.989630 -3.019664
444 accord VERB False 51 20 22 0.000177 0.000536 30.324925 1.874939 10.989630 -3.321317

67 rows × 12 columns

And, back to our original plot:

sub_words['ha'] = np.where(sub_words['Hf'] > sub_words['Mf'], "left", "right")
(
    worddf
    .query("hamilton > 5 and madison > 1")
    >>
    p9.ggplot(p9.aes(x='hamilton', y='madison', color='pos'))
    + p9.geom_point()
    + p9.scale_x_log10()
    + p9.scale_y_log10()
    + p9.geom_text(data=sub_words, mapping=p9.aes(x="hamilton", y="madison", label="word", ha='ha'))
)

PCA again?

Now, does PCA with the “distinguishing words” actually distinguish the authors?

wordmat = np.array([
    count_words(doc, sub_words)
    for doc in docs
])
wordmat.shape
(85, 67)
import sklearn.decomposition
X = wordmat.astype("float")
X /= X.sum(axis=1)[:,np.newaxis]
skpca = sklearn.decomposition.PCA(n_components=4).fit(X)
skpcs = pd.concat([info, pd.DataFrame(
    skpca.transform(X),
    columns=[f"PC{k+1}" for k in range(skpca.n_components_)]
)], axis=1)
skloadings = (
    pd.DataFrame(skpca.components_.T, index=sub_words.loc[:,['word','pos']], columns=[f"PC{k+1}" for k in range(skpca.n_components_)])
        .reset_index(names='variable')
)

… kinda?

skpcs >> p9.ggplot(p9.aes(x="PC1", y="PC2", color='author')) + p9.geom_point()
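For reference, here is a rough sketch of what the PCA step does, written with plain numpy on a made-up count matrix (the numbers are invented for illustration). Rows are turned into proportions, columns are centered, and a singular value decomposition gives the scores and loadings. Up to the sign of each component, this should agree with what `sklearn.decomposition.PCA` returns.

```python
import numpy as np

# Made-up counts: rows are documents, columns are words.
counts = np.array([
    [8, 1, 1],
    [7, 2, 1],
    [2, 7, 1],
    [1, 8, 1],
], dtype=float)

# Convert counts to within-document proportions, so that long and
# short documents are comparable.
X = counts / counts.sum(axis=1, keepdims=True)

# PCA: center each column, then take the singular value decomposition.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * s                       # document coordinates on each PC
loadings = Vt                        # row k holds PC(k+1)'s weights on the words
explained = s ** 2 / np.sum(s ** 2)  # proportion of variance per PC

print(np.round(scores[:, 0], 3))
print(np.round(explained, 3))
```

In this toy example the first two documents land on one side of PC1 and the last two on the other, because they differ mainly in how they split between the first two words.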

A more direct approach

Instead, let’s use the \(z\)-scores like “loadings” to construct a per-essay \(z\)-score:

# Per-essay score: sum over the selected words of (count in essay) x (word's z-score).
info['Hz'] = wordmat @ sub_words["Hz"].to_numpy()
info['Mz'] = wordmat @ sub_words["Mz"].to_numpy()
info >> p9.ggplot(p9.aes(x="Hz", y="Mz", color="author")) + p9.geom_point()
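The per-essay score is just a weighted sum: each essay's count of a word is multiplied by that word's \(z\)-score, and the products are added up. A minimal sketch with made-up counts and made-up weights (none of these numbers come from the Federalist data):

```python
import numpy as np

# Made-up counts: rows are essays, columns are three selected words.
word_counts = np.array([
    [5, 0, 2],   # essay A
    [1, 4, 0],   # essay B
])

# Made-up per-word z-scores for one author, used as weights.
z_weights = np.array([2.0, -3.0, 0.5])

# Each essay's score is the weighted sum of its word counts.
essay_scores = word_counts @ z_weights
print(essay_scores)
```

Because raw counts are used, longer essays will tend to get scores of larger magnitude.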

Conclusion

John Jay’s word usage is different enough to drive a PC that separates his work from the others. However, differences in word choice between Hamilton and Madison are not strong enough to completely separate them in a PCA. Surprisingly, this holds even when restricting to words that differ strongly in frequency between the two. A \(z\)-type score built from the words that differ most between them places the disputed essays closer to Madison, but groups them with the known co-authored essays. Perhaps these were also co-authored? We need more historical context, and a more sophisticated statistical framework.

One more tidbit

Are there words that both Hamilton and Madison use, but in different ways?

n = sub_words.value_counts("word")
n.index[n > 1]
Index(['executive', 'to'], dtype='str', name='word')

Hamilton seems to use “executive” as a proper noun much more than Madison:

sub_words.query("word.isin(['executive', 'to'])")
word pos is_stop freq hamilton madison Hf Mf He Hz Me Mz ha
6 to PART True 3859 2522 698 0.022258 0.016999 2294.585969 -4.747501 831.548704 4.631224 left
7 to ADP True 3184 2051 622 0.018101 0.015148 1893.226671 -3.626037 686.097713 2.447088 left
230 executive ADJ False 102 33 54 0.000291 0.001315 60.649849 3.550405 21.979261 -6.830065 right
296 executive PROPN False 79 71 7 0.000627 0.000170 46.973903 -3.505540 17.023153 2.429318 left
373 executive NOUN False 63 23 33 0.000203 0.000804 37.460201 2.362595 13.575426 -5.271992 right