What's the smallest number of words you need to know to learn them all?
May 12, 2025
What’s the smallest number of words you need to know to be able to learn them all?
Setup
-
Download postprocessed data extracted from the latest Wiktionary dump.
mkdir -p data/raw_input wget -P data/raw_input https://kaikki.org/dictionary/English/words/kaikki.org-dictionary-English-words.jsonl -
Construct the dictionary graph adjacency lists.
pip install -r requirements.txt python constuct_graph.py
Some graph statistics
=== Structure ===
# Nodes: 998669
# Edges: 7547019
Average in-degree: 7.56
Average out-degree: 7.56
# Root nodes (zero in-degree): 38
# Leaf nodes (zero out-degree): 597909
Top 5 words with most incoming edges:
draw: 700 predecessors
drop: 469 predecessors
run: 465 predecessors
break: 428 predecessors
line: 426 predecessors
Top 5 words with most outgoing edges:
of: 714003 successors
plural: 265469 successors
A: 204612 successors
the: 191037 successors
a: 147549 successors
=== Cycles ===
# Self-loops: 24035
Graph is acyclic (ignoring self-loops): False
=== Components ===
# Strongly connected components: 928098
Largest SCC size: 66368
# Weakly connected components: 6
=== Connectivity Metrics ===
Graph density: 0.00000757
Avg reachability (forward, from sample of 100 nodes): 0.0600
Baselines
Randomly pick a candidate grounding set (random + mandatory)
Call the graph .
There are some words that necessarily have to be in any valid grounding set, namely
- words with empty definitions i.e. vertices with in-degree = 0.
- words with self-referential definitions i.e. vertices which have self-loops.
If you collect these into a set and go on to randomly pick vertices from , call that set and compute the empirical probability of constituting a valid grounding for , you get a line plot for the success probability as a function of .
So under this heuristic, even when you drop down to 970,000 ( 97.1%) vertices, you’re already dead in the water. This is the number to beat.
Iteratively insert random (yet) unknown vertices into the candidate grounding set (random + mandatory + iterative)
Initialize the grounding set to . Then alternate between
- Computing the set of all “known” words reachable from the current grounding set, and
- Adding a randomly chosen hitherto unknown vertex to the grounding set.
…until you obtain a valid grounding set.
Wow, this works so much better! This greedy strategy finds a grounding set of size 500,623 (50.1%).
Greedily pick vertices with the highest out-degrees (greedy + mandatory)
One simple baseline that could plausibly work better is to simply greedily pick vertices with the highest out-degrees until you first reach a valid grounding set (post including ). The intuition here is that you’re choosing to include vertices in your grounding set that have maximal utility in that they enable maximal marginal reachability.
Okay, that’s pretty good! This greedy strategy finds a grounding set of size 389,576 (39.0%).
Greedily pick vertices with the most unknown successors
greedy + mandatory + iterative
One obvious flaw of the previous baseline is that it can end up wasting precious grounding set space by including words whose successors are already all/mostly known. The out-degree heuristic values are computed once at the start and never updated.
What happens if instead, you maintain counts not simply of the number of successors (i.e. out-degree) each vertex has, but of the number of unknown successors they each have at any point. You can then do something slightly cleverer than the previous baseline by alternating between 1) Computing the set of all “known” words reachable from the current grounding set, and 2) Adding the unknown vertex with the highest number of unknown successors to the grounding set.
WHAT. This tiny tweak now finds a grounding set of size 36,255 (3.6%). How is that even possible? Any valid grounding set must be a superset of . And by itself contains 24,000 words…
greedy + empty + iterative
Also, if you start with an empty set as initial grounding instead of (i.e. greedy + empty + iterative), you do almost as well.
You find a grounding set of size 38,669 (3.9%).
Results
Call the set of undefined and self-referential words .
| Baseline | Grounding set size | % words retained |
|---|---|---|
| do nothing | 998,669 | 100.0 |
| random + mandatory | 970,000 | 97.1 |
| random + mandatory + iterative | 500,623 | 50.1 |
| greedy + mandatory | 389,576 | 39.0 |
| greedy + empty + iterative | 38,669 | 3.9 |
| greedy + mandatory + iterative | 36,255 | 3.6 |