Working with text

Author

Peter Ralph

Published

February 9, 2026

A bit more with the Federalist Papers

We’re not yet done here, because we’d still like to know: Who wrote the disputed papers?

So far, we’ve found that it’s not “completely obvious” (i.e., attribution doesn’t pop out of an unguided dimension reduction for Hamilton vs Madison, although it does for Jay).

The data, again

import json, re, collections
import pandas as pd
import numpy as np
import plotnine as p9

with open("data/federalist.json", 'r') as f:
    text = [json.loads(line) for line in f]

info = pd.DataFrame(
    { k: [t[k] for t in text] for k in ['author', 'date', 'title', 'paper_id', 'venue']}
).assign(length = [len(t['text'].split(" ")) for t in text])

spaCy

But this time

…we’ll use spaCy, installable via pip and available under the MIT license.

Here’s a guide to using spaCy.

Note: spaCy is currently only available for Python <= 3.13.

Let’s look at:

What this looks like in practice

First, install spaCy (python <= 3.13!)

import spacy

Then, download a pre-trained model:

!python -m spacy download en_core_web_sm

And use it:

nlp = spacy.load("en_core_web_sm")
doc = nlp(text[0]['text'])
doc[:100]

To the People of the State of New York:

AFTER an unequivocal experience of the inefficacy of the
subsisting federal government, you are called upon to deliberate on
a new Constitution for the United States of America. The subject
speaks its own importance; comprehending in its consequences
nothing less than the existence of the UNION, the safety and welfare
of the parts of which it is composed, the fate of an empire in many
respects the most interesting in the world. It has been frequently

What this gets us

tl;dr: lots of stuff

for token in doc[:10]:
    print(f"{token.text}\t{token.pos_}\t{token.dep_}")

To  ADP prep
the DET det
People  NOUN    pobj
of  ADP prep
the DET det
State   PROPN   pobj
of  ADP prep
New PROPN   compound
York    PROPN   pobj
:   PUNCT   punct

Back to the data

Which words are used differently?

Mosteller and Wallace found words they thought might be used differently (e.g., “upon”) and then used these to construct a statistical test. Let’s have a look.

But wait: what is “a word”? For instance: “right” can be several parts of speech.

Plan:

Count up how many times each “word” (defined to be a (text, part-of-speech) pair) appears in each author’s writing.
Pull out the ones that differ a lot.
Visualize the texts using just those words.
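
The first step of this plan can be sketched without spaCy. The snippet below uses a tiny hand-tagged token list (made up for illustration, not taken from the Federalist Papers) to show that keying a `Counter` on (text, part-of-speech) pairs keeps the noun “right” separate from the adjective “right”.

```python
import collections

# Hypothetical hand-tagged tokens: (author, lemma, part of speech).
# In the real analysis the lemma and the tag come from spaCy.
tagged = [
    ("HAMILTON", "right", "NOUN"),
    ("HAMILTON", "right", "NOUN"),
    ("HAMILTON", "right", "ADJ"),
    ("MADISON", "right", "ADJ"),
    ("MADISON", "upon", "SCONJ"),
]

# One Counter per author, keyed on the (lemma, pos) pair.
by_author = collections.defaultdict(collections.Counter)
for author, lemma, pos in tagged:
    by_author[author][(lemma, pos)] += 1

for author, counts in by_author.items():
    print(author, dict(counts))
```

Looking up a pair an author never used, such as `by_author["MADISON"][("right", "NOUN")]`, returns zero rather than raising an error, which is convenient when building the count table.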

Counting the words

First, set up the list of words:

docs = [nlp(t['text']) for t in text]
wordlist = [
    (token.lemma_.lower(), token.pos_) for doc in docs for token in doc
    if not (token.is_punct or token.is_space)
]
counts = collections.Counter(wordlist)
words = list(counts.keys())
worddf = pd.DataFrame({
    "word" : [ w[0] for w in words ],
    "pos" : [ w[1] for w in words ],
    # look up stop words in the pipeline's vocabulary rather than via a leftover `doc`
    "is_stop" : [ nlp.vocab[w].is_stop for w, _ in words ],
    "freq" : [ counts[w] for w in words ],
}).sort_values("freq", ascending=False).reset_index(drop=True)
worddf

	word	pos	is_stop	freq
0	the	DET	True	17684
1	of	ADP	True	11796
2	be	AUX	True	8360
3	and	CCONJ	True	5080
4	in	ADP	True	4443
...	...	...	...	...
7610	relinquishment	NOUN	False	1
7611	colorable	NOUN	False	1
7612	thenceforth	VERB	False	1
7613	verbal	ADJ	False	1
7614	adverting	PROPN	False	1

7615 rows × 4 columns

Next, count by author:

def count_words(doc, worddf):
    counts = collections.Counter([
        (token.lemma_.lower(), token.pos_) for token in doc
    ])
    return np.array([ counts[(w,c)] for (w,c) in zip(worddf['word'], worddf['pos']) ])

worddf['hamilton'] = 0
worddf['madison'] = 0

for auth, doc in zip(info['author'], docs):
    if auth in ("HAMILTON", "MADISON"):
        worddf[auth.lower()] += count_words(doc, worddf)

worddf['Hf'] = worddf['hamilton'] / worddf['hamilton'].sum()
worddf['Mf'] = worddf['madison'] / worddf['madison'].sum()
worddf

	word	pos	is_stop	freq	hamilton	madison	Hf	Mf
0	the	DET	True	17684	10403	4101	0.091813	0.099873
1	of	ADP	True	11796	7309	2451	0.064506	0.059690
2	be	AUX	True	8360	4872	1886	0.042998	0.045931
3	and	CCONJ	True	5080	2702	1211	0.023847	0.029492
4	in	ADP	True	4443	2819	865	0.024879	0.021066
...	...	...	...	...	...	...	...	...
7610	relinquishment	NOUN	False	1	1	0	0.000009	0.000000
7611	colorable	NOUN	False	1	1	0	0.000009	0.000000
7612	thenceforth	VERB	False	1	1	0	0.000009	0.000000
7613	verbal	ADJ	False	1	1	0	0.000009	0.000000
7614	adverting	PROPN	False	1	1	0	0.000009	0.000000

7615 rows × 8 columns

Here’s what we get:

(
    worddf
    .query("hamilton > 5 and madison > 5")
    >>
    p9.ggplot(p9.aes(x='hamilton', y='madison', color='pos'))
    + p9.geom_point()
    + p9.scale_x_log10()
    + p9.scale_y_log10()
)

Interlude: \(z\)-scores

Which words are “most different”?

Recall the chi-squared test for contingency tables. The key part of this is:

\[\begin{aligned} z_{ij} = \frac{\text{observed}_{ij} - \text{expected}_{ij}}{\sqrt{\text{expected}_{ij}}} . \end{aligned}\]

chalkboard interlude
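
As a stand-in for the chalkboard, here is a minimal sketch of that formula on a made-up 2×2 table (the counts are invented for illustration). Rows are words, columns are authors, and the expected count for a cell is taken to be its row total times its column total, divided by the grand total.

```python
import math

# Invented counts: rows are words, columns are authors.
observed = [
    [30, 10],   # word A: author 1, author 2
    [70, 90],   # word B: author 1, author 2
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence of word and author.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# z_ij = (observed_ij - expected_ij) / sqrt(expected_ij)
z = [
    [(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

# The chi-squared statistic is the sum of the squared z-scores.
chi2 = sum(zij ** 2 for row in z for zij in row)
print(z, chi2)
```

Cells with a large positive or negative entry in `z` are the ones that contribute most to the chi-squared statistic.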

Compute \(z\) scores

\[\begin{aligned} \text{expected}_{ij} = n_{i \cdot} \times \frac{ n_{\cdot j} }{ n_\text{total} } \end{aligned}\]

worddf['He'] = worddf['freq'] * worddf['hamilton'].sum() / worddf['freq'].sum()
worddf['Hz'] = (worddf['He'] - worddf['hamilton']) / np.sqrt(worddf['He'].values)
worddf['Me'] = worddf['freq'] * worddf['madison'].sum() / worddf['freq'].sum()
worddf['Mz'] = (worddf['Me'] - worddf['madison']) / np.sqrt(worddf['Me'].values)
worddf

	word	pos	is_stop	freq	hamilton	madison	Hf	Mf	He	Hz	Me	Mz
0	the	DET	True	17684	10403	4101	0.091813	0.099873	10515.018986	1.092412	3810.600489	-4.704345
1	of	ADP	True	11796	7309	2451	0.064506	0.059690	7013.976700	-3.522688	2541.836879	1.801724
2	be	AUX	True	8360	4872	1886	0.042998	0.045931	4970.909224	1.402875	1801.437463	-1.992363
3	and	CCONJ	True	5080	2702	1211	0.023847	0.029492	3020.600342	5.796951	1094.653386	-3.516539
4	in	ADP	True	4443	2819	865	0.024879	0.021066	2641.836087	-3.446850	957.390747	2.985959
...	...	...	...	...	...	...	...	...	...	...	...	...
7610	relinquishment	NOUN	False	1	1	0	0.000009	0.000000	0.594606	-0.525729	0.215483	0.464201
7611	colorable	NOUN	False	1	1	0	0.000009	0.000000	0.594606	-0.525729	0.215483	0.464201
7612	thenceforth	VERB	False	1	1	0	0.000009	0.000000	0.594606	-0.525729	0.215483	0.464201
7613	verbal	ADJ	False	1	1	0	0.000009	0.000000	0.594606	-0.525729	0.215483	0.464201
7614	adverting	PROPN	False	1	1	0	0.000009	0.000000	0.594606	-0.525729	0.215483	0.464201

7615 rows × 12 columns
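
As a sanity check, the first row of this table can be reproduced by hand. The sketch below uses the values shown for “the” (total count 17684, Hamilton count 10403) and assumes that Hamilton's share of all counted words is 0.594606, which is the `He` value shown for a word with `freq` equal to 1. The score is computed under both sign conventions: expected minus observed, as in the code that built the table, and observed minus expected, as in the formula.

```python
import math

freq = 17684               # total count of ("the", DET) across all essays
observed = 10403           # count in Hamilton's essays
hamilton_share = 0.594606  # He for a word with freq == 1, read off the table

expected = freq * hamilton_share

# Sign convention of the code that built the table: expected - observed.
z_code = (expected - observed) / math.sqrt(expected)
# Sign convention of the formula: observed - expected.
z_formula = (observed - expected) / math.sqrt(expected)

print(round(expected, 1), round(z_code, 3), round(z_formula, 3))
```

Up to rounding of the assumed share, `z_code` agrees with the `Hz` value in the first row.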

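A few \(z\) scores exceed 5 in absolute value, but most are smaller, and the two authors’ scores are negatively correlated (unsurprisingly). Note that the code above computes expected minus observed, the reverse of the formula, so a positive score here means the author uses the word less often than expected: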

(
    worddf.query("hamilton > 5 and madison > 1")
    >>
    p9.ggplot(p9.aes(x="Hz", y="Mz", size="freq"))
    + p9.geom_point(alpha=0.25)
)

Which words are “different”? Let’s take an arbitrary cutoff:

sub_words = (
    worddf
    .query("freq > 50 and (abs(Hz) > 3 or abs(Mz) > 3)")
    .copy()  # own copy, so a column can be added to it later
)
sub_words

	word	pos	is_stop	freq	hamilton	madison	Hf	Mf	He	Hz	Me	Mz
0	the	DET	True	17684	10403	4101	0.091813	0.099873	10515.018986	1.092412	3810.600489	-4.704345
1	of	ADP	True	11796	7309	2451	0.064506	0.059690	7013.976700	-3.522688	2541.836879	1.801724
3	and	CCONJ	True	5080	2702	1211	0.023847	0.029492	3020.600342	5.796951	1094.653386	-3.516539
4	in	ADP	True	4443	2819	865	0.024879	0.021066	2641.836087	-3.446850	957.390747	2.985959
6	to	PART	True	3859	2522	698	0.022258	0.016999	2294.585969	-4.747501	831.548704	4.631224
...	...	...	...	...	...	...	...	...	...	...	...	...
334	term	NOUN	False	70	26	28	0.000229	0.000682	41.622446	2.421504	15.083807	-3.325669
373	executive	NOUN	False	63	23	33	0.000203	0.000804	37.460201	2.362595	13.575426	-5.271992
388	elect	VERB	False	59	14	14	0.000124	0.000341	35.081776	3.559315	12.713494	-0.360811
440	faction	NOUN	False	51	26	21	0.000229	0.000511	30.324925	0.785378	10.989630	-3.019664
444	accord	VERB	False	51	20	22	0.000177	0.000536	30.324925	1.874939	10.989630	-3.321317

67 rows × 12 columns

And, back to our original plot, now with the selected words labeled:

sub_words['ha'] = np.where(sub_words['Hf'] > sub_words['Mf'], "left", "right")
(
    worddf
    .query("hamilton > 5 and madison > 1")
    >>
    p9.ggplot(p9.aes(x='hamilton', y='madison', color='pos'))
    + p9.geom_point()
    + p9.scale_x_log10()
    + p9.scale_y_log10()
    + p9.geom_text(data=sub_words, mapping=p9.aes(x="hamilton", y="madison", label="word", ha='ha'))
)

PCA again?

Now, does PCA with the “distinguishing words” actually distinguish the authors?

wordmat = np.array([
    count_words(doc, sub_words)
    for doc in docs
])
wordmat.shape

(85, 67)

import sklearn.decomposition
X = wordmat.astype("float")
X /= X.sum(axis=1)[:,np.newaxis]
skpca = sklearn.decomposition.PCA(n_components=4).fit(X)
skpcs = pd.concat([info, pd.DataFrame(
    skpca.transform(X),
    columns=[f"PC{k+1}" for k in range(skpca.n_components_)]
)], axis=1)
skloadings = (
    pd.DataFrame(skpca.components_.T, index=sub_words.loc[:,['word','pos']], columns=[f"PC{k+1}" for k in range(skpca.n_components_)])
        .reset_index(names='variable')
)

… kinda?

skpcs >> p9.ggplot(p9.aes(x="PC1", y="PC2", color='author')) + p9.geom_point()
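
For readers who want to see what the PCA step above is doing, here is a minimal sketch on a made-up count matrix using only NumPy. It converts each row to proportions, centers the columns, and takes the singular value decomposition; the result should agree with `sklearn.decomposition.PCA` up to the sign of each component. The matrix `toy_counts` is invented for illustration.

```python
import numpy as np

# Invented counts: 4 "essays" (rows) by 3 "words" (columns).
toy_counts = np.array([
    [10, 2, 8],
    [12, 1, 7],
    [3, 9, 8],
    [2, 11, 7],
], dtype=float)

# Convert counts to within-essay proportions so essay length drops out.
X = toy_counts / toy_counts.sum(axis=1, keepdims=True)

# PCA: center each column, then take the singular value decomposition.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * s                     # essay coordinates on each PC
loadings = Vt.T                    # word weights for each PC
explained = s**2 / np.sum(s**2)    # share of variance per PC

print(scores[:, 0], explained)
```

In this toy example the first two essays favor the first word and the last two favor the second, so PC1 puts the two pairs on opposite sides of zero. Whether real essays separate this cleanly depends on how consistent each author's word usage is.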

A more direct approach

Instead, let’s use the \(z\)-scores like “loadings” to construct a per-essay \(z\)-score:

info['Hz'] = wordmat.dot(sub_words.loc[:,["Hz"]])
info['Mz'] = wordmat.dot(sub_words.loc[:,["Mz"]])
info >> p9.ggplot(p9.aes(x="Hz", y="Mz", color="author")) + p9.geom_point()
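
To make the arithmetic concrete: each essay's score is the sum, over the selected words, of the word's count in that essay times the word's \(z\)-score. The sketch below uses invented counts and invented \(z\)-scores. It also shows a variant that divides by the number of counted words; that normalization is an assumption of this sketch and is not applied above, where raw counts are used.

```python
import numpy as np

# Invented counts: 3 essays (rows) by 2 words (columns).
toy_counts = np.array([
    [4, 0],
    [1, 3],
    [8, 0],
])
# Invented per-word z-scores, playing the role of sub_words["Hz"].
toy_z = np.array([2.0, -1.5])

# Per-essay score: weight each word count by that word's z-score and sum.
raw_score = toy_counts @ toy_z

# Variant (not used above): divide by the number of counted words in the
# essay, so that long essays do not get large scores just by being long.
per_word_score = raw_score / toy_counts.sum(axis=1)

print(raw_score, per_word_score)
```

The first and third toy essays have the same word proportions but different lengths, so their raw scores differ while their per-word scores match.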

Conclusion

John Jay’s word usage is different enough to drive a PC that separates his work from the others. However, differences in word choice between Hamilton and Madison are not strong enough to completely separate them in a PCA. Surprisingly, this holds even when restricting to words that differ strongly in frequency between Hamilton and Madison. Constructing a \(z\)-type score from the words that differ most between the two authors places the disputed essays closer to Madison, but groups them with the known co-authored essays. Perhaps these were also co-authored? We need more historical context, and a more sophisticated statistical framework.

One more tidbit

Are there words that both Hamilton and Madison use, but in different ways?

n = sub_words.value_counts("word")
n.index[n > 1]

Index(['executive', 'to'], dtype='str', name='word')

Hamilton seems to use “executive” as a proper noun much more than Madison does:

sub_words.query("word.isin(['executive', 'to'])")

	word	pos	is_stop	freq	hamilton	madison	Hf	Mf	He	Hz	Me	Mz	ha
6	to	PART	True	3859	2522	698	0.022258	0.016999	2294.585969	-4.747501	831.548704	4.631224	left
7	to	ADP	True	3184	2051	622	0.018101	0.015148	1893.226671	-3.626037	686.097713	2.447088	left
230	executive	ADJ	False	102	33	54	0.000291	0.001315	60.649849	3.550405	21.979261	-6.830065	right
296	executive	PROPN	False	79	71	7	0.000627	0.000170	46.973903	-3.505540	17.023153	2.429318	left
373	executive	NOUN	False	63	23	33	0.000203	0.000804	37.460201	2.362595	13.575426	-5.271992	right