← All posts

What's the smallest number of words you need to know to learn them all?

May 12, 2025

What’s the smallest number of words you need to know to be able to learn them all?

Setup

  1. Download postprocessed data extracted from the latest Wiktionary dump.

    mkdir -p data/raw_input
    wget -P data/raw_input https://kaikki.org/dictionary/English/words/kaikki.org-dictionary-English-words.jsonl
  2. Construct the dictionary graph adjacency lists.

    pip install -r requirements.txt
    python constuct_graph.py

Some graph statistics

=== Structure ===
# Nodes: 998669
# Edges: 7547019
Average in-degree: 7.56
Average out-degree: 7.56
# Root nodes (zero in-degree): 38
# Leaf nodes (zero out-degree): 597909

Top 5 words with most incoming edges:
  draw: 700 predecessors
  drop: 469 predecessors
  run: 465 predecessors
  break: 428 predecessors
  line: 426 predecessors

Top 5 words with most outgoing edges:
  of: 714003 successors
  plural: 265469 successors
  A: 204612 successors
  the: 191037 successors
  a: 147549 successors

=== Cycles ===
# Self-loops: 24035
Graph is acyclic (ignoring self-loops): False

=== Components ===
# Strongly connected components: 928098
Largest SCC size: 66368
# Weakly connected components: 6

=== Connectivity Metrics ===
Graph density: 0.00000757
Avg reachability (forward, from sample of 100 nodes): 0.0600

Baselines

Randomly pick a candidate grounding set (random + mandatory)

Call the graph G=(V,E)G = (V, E).

There are some words that necessarily have to be in any valid grounding set, namely

  1. words with empty definitions i.e. vertices with in-degree = 0.
  2. words with self-referential definitions i.e. vertices which have self-loops.

If you collect these into a set BB and go on to randomly pick kk vertices from VBV \setminus B, call that set AkA_k and compute the empirical probability of AkBA_k \cup B constituting a valid grounding for GG, you get a line plot for the success probability as a function of AkB=B+k|A_k \cup B| = |B| + k.

So under this heuristic, even when you drop down to 970,000 (\approx 97.1%) vertices, you’re already dead in the water. This is the number to beat.

Iteratively insert random (yet) unknown vertices into the candidate grounding set (random + mandatory + iterative)

Initialize the grounding set to BB. Then alternate between

  1. Computing the set of all “known” words reachable from the current grounding set, and
  2. Adding a randomly chosen hitherto unknown vertex to the grounding set.

…until you obtain a valid grounding set.

Wow, this works so much better! This greedy strategy finds a grounding set of size 500,623 (50.1%).

Greedily pick vertices with the highest out-degrees (greedy + mandatory)

One simple baseline that could plausibly work better is to simply greedily pick vertices with the highest out-degrees until you first reach a valid grounding set (post including BB). The intuition here is that you’re choosing to include vertices in your grounding set that have maximal utility in that they enable maximal marginal reachability.

Okay, that’s pretty good! This greedy strategy finds a grounding set of size 389,576 (39.0%).

Greedily pick vertices with the most unknown successors

greedy + mandatory + iterative

One obvious flaw of the previous baseline is that it can end up wasting precious grounding set space by including words whose successors are already all/mostly known. The out-degree heuristic values are computed once at the start and never updated.

What happens if instead, you maintain counts not simply of the number of successors (i.e. out-degree) each vertex has, but of the number of unknown successors they each have at any point. You can then do something slightly cleverer than the previous baseline by alternating between 1) Computing the set of all “known” words reachable from the current grounding set, and 2) Adding the unknown vertex with the highest number of unknown successors to the grounding set.

WHAT. This tiny tweak now finds a grounding set of size 36,255 (3.6%). How is that even possible? Any valid grounding set must be a superset of BB. And BB by itself contains \approx 24,000 words…

greedy + empty + iterative

Also, if you start with an empty set as initial grounding instead of BB (i.e. greedy + empty + iterative), you do almost as well.

You find a grounding set of size 38,669 (3.9%).

Results

Call the set of undefined and self-referential words BVB \subset V.

BaselineGrounding set size% words retained
do nothing998,669100.0
random + mandatory\approx 970,000\approx 97.1
random + mandatory + iterative500,62350.1
greedy + mandatory389,57639.0
greedy + empty + iterative38,6693.9
greedy + mandatory + iterative36,2553.6

Code: github.com/anirudhajith/dictionary-completeness