Introduction to radix (best cognate-tree grower, pre-α, dormant)

My old project, called radix, is still the best cognate-forest grower in the world, but unfortunately it's nowhere near good enough. It uses wiktionary entries to tell you what words are related to what other words through etymological descent. This is a very-quickly-written better-than-nothing partial state-share about the project, in case anyone's interested in solving the problem of cognate-forest-growing.

1. Takeaways

  • Use abstract regexes to infer links between words via patterns in wiktionary-ese.
  • Prune the display trees aggressively when feasible without eliding too many interesting cognates. Merge redundant subtrees.
  • When printing actual results, don't store and retrieve a root word's full set of descendants. Instead, store several copies, each restricted to just the descendants needed to display results for one specific language.
  • Precompute the full global transitive closure once, in one big go, before serving many results over time.
  • Use subuniverses to do rapid end-to-end testing of semantically inert changes to complex code.

2. Annotated table of contents

The section "4. Introduction" will say what radix is for, what it is, and a bit of what's wrong with it.

The section "5. The state of radix" will say a bit about where radix is as a project.

The subsections within "6. Some of the core ideas in radix" will describe more of how radix works, including problems, with the aim of giving some insights for anyone who wants to make a radix-like system.

4. Introduction

4.1. Why I made radix

I like etymologies of words, and I like knowing about many cognates / doublets, even distant ones. I think this gives rich texture to language. When you start learning the morphemes (meaningful chunks) that make up words, and learning the cognates of morphemes, you also start seeing them everywhere. The word "phenomenon" starts with the morpheme "pheno-" meaning "appearance"; that morpheme also shows up with a similar meaning in "epiphany", "phenotype", "diaphanous"—and more distantly and cryptically, in "fantasy", "phase", "phantom", "emphasis", "fancy", "beacon", "photon", "photograph", "Tiffany", "favor", and many more words.

Wiktionary.org is an amazing resource for learning about words that are cognate with other words. But, clicking around wiktionary can be very laborious. For example, the Proto-Indo-European entry for *bʰeh₂-, meaning "to shine" or "to appear", links to several descendant words. At any given time, I might be interested in only a subset of these; e.g. I'm usually not interested in Celtic or Indo-Iranian roots, simply because those would be unlikely to have descendants in the few languages I'm at all familiar with. If you're going through lots and lots of wiktionary entries, then the process of scanning and clicking through, looking for words you recognize, becomes very very time-consuming.

Clicking around sometimes doesn't even work at all, because sometimes the PIE entry might not link forward in time to a descendant word, even though the descendant word does link backward in time to the PIE root; you'd never learn about that descendant if you were just clicking around starting from a different descendant. As an example, currently, the wiktionary entry for Ancient Greek "φημί" (web archive snapshot, as this will eventually get fixed), which means "to speak", does not link to "βλάσφημος", the ancestor of "blasphemy" ("deceiving-speech"). If you started from "prophetic" and clicked around, you'd get to "φημί" but you wouldn't get to "blasphemy". (And likewise "φημί" does not link, through a chain, to "prophetic".)

But that would have been a cool relationship to know about! How can we do better?

I'll briefly mention etymonline.com, which is also an amazing resource, like wiktionary. For English cognate-finding, that's probably the best current resource in many ways. However, etymonline is largely a labor of love, or something, by one guy named Doug. It is limited in scope, and only extends to other languages insofar as they include ancestor words of English words, or as isolated links from English entries. You couldn't use etymonline to efficiently search for cognates from another language; at best you might find an English cognate. Etymonline is also going to be fundamentally less complete than wiktionary over time, as wiktionary has many editors and draws from many sources. (As a random example, wiktionary links "focus", uncertainly, back to the same PIE root as φαίνω; etymonline simply says the Latin is "of unknown origin". Which is not at all a criticism of etymonline; it is just aimed at a different purpose, sometimes at the cost of showing all the plausible hypotheses.)

[IDK where to put this, but just noting that I'm arguably misusing the word "cognate". Cognates are sometimes supposed to derive in whole directly from common ancestor words. Instead I'm talking about... I don't know the real term, if any. Maybe "derivational family" or "root cognates", or my coinage "coetymons". In my defense, for example Meelen et al. call these "weak cognates" and give a typology[1].]

4.2. The idea of radix

You can find a lot of close and distant cognates just by clicking around wiktionary. You could even in theory bridge some missing links, by guessing words that might be cognate. We could imagine someone thinking about the word "phenotype", and then wondering spontaneously whether it is cognate with "phenomenon", and going to the wiktionary page and finding that indeed they share the "φαίνω" ("to appear") root.

But, we have computers. Computers can do things by themselves. We can tell our computers to click around for us.

That's what radix does, basically. It traverses the graph structure of links between wiktionary entries for different words. It infers which words are etymological ancestors or descendants of other words. Then it tries to display these (often large) structures, to show you what words are related to the word you started with. Here's what we get with "phenomenon":

That's a lot of info. One of the most important elements is the pastward trunk. This shows the ancestors of the word we started with. Rightward is pastward, e.g. you can see that English phenomenon points pastward to Latin phaenomenon, which in turn came from Ancient Greek φαινόμενον, which at its main root came from Proto-Indo-European *bʰeh₂-:

4.3. Problems with radix

The above is an especially clean pastward trunk. Very often radix unfortunately produces much messier and incorrect pastward trees:

Middle English is a little bit the bane of my existence; you can see the little < symbol, showing that at that point the pastward ancestry tracing becomes ambiguous between the two unrelated roots of "weave". One of them is cognate with "web" and means making interlaced fibers; the other is cognate with "vibrate" (and possibly "veer"), and means to wander or move in a wavy path:

But messiness is the least of the problems with the current draft of radix. Much more serious is that radix shows many many incorrect cognates. For example, here's the beginning of the radix entry for "value":

The pastward trunk, at least, is pretty neat! And correct! However, if you look near the bottom, you can see that it says PIE *h₂welh₁-: wild;. It's claiming that the starting English word "value" ultimately comes from PIE "*h₂welh₁-", and then another etymological descendant of that (reconstructed) ancient root word is the modern English word "wild".

If anyone were to actually use radix in its current form, they should always check with wiktionary (by Cmd+clicking on the word in question). In this case, the "wild" entry states that its root is "Proto-Indo-European *h₂welh₁- (“hair, wool, grass, ear (of corn), forest”)". Well, now we see the problem: there are actually two PIE roots with the same string "*h₂welh₁-" as their reconstruction. One means "to rule", and that's where "value" comes from; the other means "hair, wool", and that's where "wild" comes from.

This is not necessarily a problem for radix. In fact, by clicking on the PIE root in radix's pastward trunk, we get info about what radix thinks of that word:

The important part here is that radix is only including sense 1 ("to rule") and not sense 2 ("hair, wool"). It successfully traced that back from the starting word "value". The real issue is that "wild" points pastward to "*h₂welh₁-", but radix doesn't know how to tell which sense of "*h₂welh₁-" is being pointed to (presumably, the "wool" sense fails to point futureward to "wild").

This is maybe an over-involved example, I don't know. I guess I just want to communicate that... there's a ton of complexity here, there's currently a ton of errors in radix results, and also there's a ton of room for feasible improvement.

Before I link the thing, if you saw this post in a context where it is maybe currently being viewed simultaneously by many people, try not to spam it too much please, especially on words with big cognate trees. Also, please note that bug reports are not at all helpful; I greatly appreciate your care, but I'm not currently trying to fix the site at all. (Note: IDK if I wrote this anywhere, but, you can press d to toggle definitions on and off, and press l to toggle the language tags.) Ok, that said, you can try it here (please don't hug too hard): radix.ink. If you click around there's more info.

5. The state of radix

In short, the state is "in pre-alpha, in a deep freeze, no current plans to thaw". But if some person or small group were seriously interested in reviving it or learning from it, I would at least be very happy to talk. Feel free to reach out at my gmail address: radixtsvibt

You can find some version of the codebase here, roughly where I left off: https://github.com/tsvibt/public_radix_sep_2025/tree/main

Basically, I worked on it a ton a couple years ago. Now I'm too busy. The current codebase is stuck in development hell on some branch of trying to make the big preprocessing sweep be more efficient (with multiprocessing). That's plausibly not even worth the complexity, at least the way I'm currently doing it.

Even before the freeze, there were many serious problems with the results that radix was able to give. I think to be really useful, the quality of the results would have to be greatly improved. I also think it is feasible to greatly improve the quality of the results. There are several ways to do that, such as:

  • Fix many large and small known bugs / mistakes.
  • Improve the methods used to infer etymological links between words from the text of wiktionary entries. This is a large area.
  • Improve the methods used to prune the cognate poset for displaying.

The first thing to do, though, would be to thaw the project, or redo it better in some other form. This would be a big project because the problem is inherently complex (as the domain, words and history of words, is itself complex and uncertain), and because radix is out of date (e.g. various standards in wiktionary will have shifted over time).

6. Some of the core ideas in radix

In this section I'm going to describe some of the main ideas that went into radix in its current (frozen) form. The hope here is twofold:

  • In case anyone wants to thaw the radix codebase itself, this will help to explain some of the concepts involved.
  • In case anyone wants to make a successor project, they'll have some more information about the design constraints and possibilities (though not necessarily reliable info).

Because of time constraints, I'm not going to introduce what etymology is, how it works, what wiktionary is, how that works, and other important facts like that. There is a bit of information here (please don't hug too hard if it's being hugged) and on other information pages that you can click on along the top bar, e.g. this discusses more improvements.

6.1. General principles

Here are general principles that I had in mind while developing radix, and that I'd recommend for any project like this.

  • Improve wiktionary. Wiktionary should be improved for its own sake; and also, one of the best ways to improve results of a radix-like system is to improve wiktionary. Mainly this means simply adding more entries (with correct information), correcting mistakes (e.g. links that improperly indicate etymonic relationships), distinguishing different senses of words, adding more information (e.g. finding etymonic links in scholarly sources and putting them in articles), etc. There's also system-level work that needs doing—which you'd have to ask the wiktionary people about.
  • Give good results if wiktionary is good. In many cases, radix tries to go a bit beyond what wiktionary explicitly states about etymonic relationships, e.g. with abstract regex inference on wiktionary etymology sections or with inferred sense disambiguation for links. However, a more basic and more important goal should be to make it so that if, hypothetically, wiktionary were perfect, then radix would also be perfect.
  • Display information usefully. It's one thing to gather all the words that wiktionary says are cognate. It's another thing to present that to the user in an actually useful way. That requires pruning and user settings to narrow focus, lest you show gigantic redundant cognate trees.
  • Err on the side of yes displaying a word. This is a choice of design goal, but my suggestion is to show more words that might plausibly be cognates. The reason is that, unless radix is perfect, there will definitely be mistakes of exclusion or inclusion, probably both (since wiktionary itself has downright errors incorrectly indicating etymonic descent in templates); so the user probably has to "verify" cognates anyway (at least, read the wiktionary entries). For the use case of finding interesting cognates, it's better to show hypotheses than to exclude them, and let the user sift. But of course if you can fairly confidently exclude a large swath of words, do so, as that reduces noise.
  • Serve results quickly. Other systems that try to traverse wiktionary, among other problems, are slow. Since finding interesting cognates is often an iterative / interactive investigation, slowness can cripple that process.
  • Focus on etymology. There are many things that a word-browser could do, such as discuss pronunciation, synonymy, etc. These are interesting (and are somewhat addressed by wiktionary already); but they are largely not relevant to the core task: finding (broadsense) cognates that are distant but are implied by known etymonic links.
  • Be language-neutral eventually. The current version of radix is very Indo-European centric (other languages are mostly excluded) and somewhat English centric (the default display settings emphasize English words, and etymonline results are compared). Partly that's because English is what I'm familiar with; partly that's because wiktionary is most complete for English; and partly that's because the Indo-European language family is a rich, old, varied, well-attested, well-studied set of languages with many interestingly divergent lines of phonological and semantic development. But in the longer term a radix-like system should be language-neutral. This would suggest, for example, not putting too much effort into features that don't generalize across languages.
  • Rough and ready. There are many edge cases and difficult design questions. For example, some etymologies given on wiktionary (taken from scholars) indicate "uncertain" and give a "maybe" etymology. Should you count those? Or do some probabilistic notation? Or what? These are reasonable questions, but trying to include uncertainty is a whole separate can of worms, and there's enough complication even if you ignore that whole dimension. So I suggest ignoring as many problems as possible, instead focusing on the core functionality of finding and displaying as many real cognates as possible and excluding or at least labeling as many false cognates as possible.

6.2. The graph structure: Pre-orders

6.2.1. The basics: futureward order

The basic structure we're working with is etymonic descent. We would say that English sing is futureward of Proto-Indo-European *sengʷʰ-:

That means "sing" descends from that PIE word etymologically. In other words, if you played history in reverse, you'd see people using the word "sing" and speaking English; then, as you go backward in time, you'd see people speaking Middle English and using the word "singen"; eventually you'd see people speaking some kind of Proto-Germanic language, and using a word like *singwaną; and then further back, using some word like *sengʷʰ-; and even further back, yet another version of this word, in a language that no one today knows about; and if you went even further back, you'd eventually hear someone making up that word.

We would equivalently say that PIE *sengʷʰ- is pastward of English sing.

Wiktionary doesn't lay out this whole structure of futureward etymonic descent explicitly. Entries for different words may or may not declare some or all of their direct or indirect pastward ancestors and some of their direct or indirect futureward descendants (and sometimes even some of their cognates). This means we often have to infer etymonic descent. A simple example is transitivity of etymonic descent. You often have an English word $W_e$ which says that it descends from a Latin word $W_l$, but doesn't say what PIE word it descends from; and the Latin word $W_l$ says it descends from a PIE word $W_p$. So then we would infer that $W_e$ descends from $W_p$.

The overall strategy we use to find cognates of a word $W$ has two steps (there's a toy code sketch just after this list):

  1. Find the etymonic pastward ancestors $W_1, W_2, ...$ of $W$.
  2. Find the etymonic futureward descendants of $W_1, W_2, ...$. All of these are (broadsense) cognates of $W$.
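
To make that concrete, here's a toy sketch in python (the graph data here is made up and tiny; also, as discussed in "6.2.8. Inferring links with the transitive closure" below, real wiktionary links are asymmetric, so this kind of purely local traversal only works once the graph has already been completed):

    # pastward[w] = the words that w directly descends from (toy data).
    pastward = {
        "en:phenomenon": {"la:phaenomenon"},
        "la:phaenomenon": {"grc:φαινόμενον"},
        "grc:φαινόμενον": {"ine-pro:*bʰeh₂-"},
        "en:fantasy": {"grc:φαντασία"},
        "grc:φαντασία": {"ine-pro:*bʰeh₂-"},
    }
    # futureward is the inverse relation.
    futureward = {}
    for w, ancs in pastward.items():
        for a in ancs:
            futureward.setdefault(a, set()).add(w)

    def closure(start, links):
        """All words reachable from start by following links (inclusive)."""
        seen, stack = {start}, [start]
        while stack:
            for nxt in links.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def cognates(w):
        # Step 1: pastward ancestors. Step 2: their futureward descendants.
        ancestors = closure(w, pastward)
        return set().union(*(closure(a, futureward) for a in ancestors))

    print(cognates("en:phenomenon"))  # includes "en:fantasy"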

For example, if we start with "phenomenon", radix gives this pastward trunk (trunk, as in tree—going downward on the tree is going pastward):

In this case, at least what is displayed is simple: just one PIE root. (Some stuff is omitted here—the "-menon" part has its own ancestry.) Then if we follow the futureward descent of words from this pastward trunk, we get something like:

There are many words in many languages that descend from these roots. Too many to usefully display, often. So instead we just show the descendants from a few chosen languages, and their ancestors. (You can slightly customize what languages radix shows you, in settings.)

Some more notions, given a word $W$ (there's a code sketch of these after the list):

  • Futurewardset. This is the set of words that are futureward of $W$. Technically, we consider a word futureward and pastward of itself. This is mathematically simpler, and is standard practice for studying orderings; for example, that way the futurewardset of the futurewardset of $W$ is equal to the futurewardset of $W$.
  • Strictly futureward. This means "is futureward of $W$ but not pastward of $W$". (As discussed below, there can be cycles in practice even though conceptually there shouldn't be cycles, so this is not equivalent to "is futureward of $W$ but not equal to $W$".)
  • Equivalent. This means "is both futureward and pastward of $W$". $W$ is always equivalent to itself. (Since there can be cycles, sometimes other words are equivalent to $W$ in the ordering.) [Note: this is different from how I use "order-equivalent" below to discuss diamonds.]
  • Immediately futureward. Saying "word $Y$ is immediately futureward of $W$" means "$Y$ is strictly futureward of $W$, and there is no word $X$ that is strictly between $W$ and $Y$ (i.e. strictly futureward of $W$ and pastward of $Y$)". This is important because when we display cognate forests, we want to show immediate descent by putting $Y$ right next to $W$; depth-first traversal order for a cognate forest is determined by stepping through the immediate futureward relation.
  • Pastwardset, strictly pastward, immediately pastward. (Likewise, mutatis mutandis.)
  • Futuremost. This is a word without any words that are strictly futureward of it. It is a leaf of the etymonic forest; it may be a modern word in a modern language, or a word in an ancient language that died out, or a word that just didn't get transmitted, e.g. because the concept was obsolete or because it got replaced by another word.
  • Pastmost. This is a word without any words that are strictly pastward of it. This is a "root" word. For words in Indo-European languages, often a pastmost is a Proto-Indo-European word; but often not, e.g. because the ancestry of a word in, say, Latin, is unknown; or because the word comes from a non-Indo-European language such as Arabic or Hebrew or many other languages (e.g. for regional words like "orangutan"). The pastmosts of $W$ are the (known) oldest roots of $W$.

(Note: so that it's stated somewhere, we assume that "is pastward" and "is futureward" are exact inverses of each other. That is, $W$ is futureward of $X$ if and only if $X$ is pastward of $W$.)

(The notion of "word" here is problematic; see "6.3. The problem of sense disambiguation" below.)
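
Here's a code sketch of these notions, for concreteness (assuming, hypothetically, that we already have fw[w], the full futurewardset of each word w, inclusive of w itself; words is the whole universe):

    def pastwardset(w, words, fw):
        return {x for x in words if w in fw[x]}

    def strictly_futureward(x, w, fw):
        """x is futureward of w but not pastward of w."""
        return x in fw[w] and w not in fw[x]

    def equivalent(x, w, fw):
        return x in fw[w] and w in fw[x]

    def immediately_futureward(y, w, words, fw):
        """y is strictly futureward of w with nothing strictly between."""
        return strictly_futureward(y, w, fw) and not any(
            strictly_futureward(x, w, fw) and strictly_futureward(y, x, fw)
            for x in words)

    def futuremost(w, words, fw):
        return not any(strictly_futureward(x, w, fw) for x in words)

    def pastmost(w, words, fw):
        return not any(strictly_futureward(w, x, fw) for x in words)

    # Tiny toy example: PIE *sengʷʰ- > Middle English singen > English sing.
    words = {"ine-pro:*sengʷʰ-", "enm:singen", "en:sing"}
    fw = {"ine-pro:*sengʷʰ-": set(words),
          "enm:singen": {"enm:singen", "en:sing"},
          "en:sing": {"en:sing"}}
    assert immediately_futureward("enm:singen", "ine-pro:*sengʷʰ-", words, fw)
    assert pastmost("ine-pro:*sengʷʰ-", words, fw) and futuremost("en:sing", words, fw)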

6.2.2. What real etymologies can look like

In the simplest form, the etymology of a word is linear: the word directly descends from an ancestor word in an earlier language; that ancestor word in turn descends from an earlier word; and so on, back to PIE. E.g. "radix":

Not all English words go back to PIE. The humble rock gets lost in Latin:

Some words go back to non-Indo-European languages:

Some words, like compounds, have multiple roots, because at some point two words or morphemes got combined into one word:

(You have a horse-seamonster in your head haha.)

Because of compound words (in a broad sense, including combinations of any two morphemes), the overall graph structure of all words under etymonic descent is not a tree. Two different pastmost words (i.e. words without etymonic ancestors) may share a descendant, formed by compounding or other sort of derivation. We get a picture of a forest, where each tree grows from a pastmost root, and inosculation (h/t Rafe) is widespread.

6.2.3. Pruning ancestor morphemes with many descendants

Some morphemes are highly combinable, such as "pro-", so many words have those morphemes in their ancestry:

We can't load up the descendants of PIE *per- every time someone asks about "propane" or "profane" or "protractor" or "prospect" or "problem" or "propose" or "proper" or etc. etc.; so what we do is, we write down "ine-pro, *per-" in a file (ine-pro is the wiktionary language code for PIE), and we look at that file when we're constructing the pastward trunk, and we exclude those words, and we also exclude words whose only ancestors (i.e. pastward words) are excluded this way. Those are the greyed-out words.
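
As a sketch (the file format is as just described; everything else here is made up for illustration):

    # One "langcode, gram" pair per line, e.g. the PIE ancestor of "pro-".
    EXCLUSION_FILE_CONTENTS = "ine-pro, *per-\n"

    def load_excluded(text=EXCLUSION_FILE_CONTENTS):
        excluded = set()
        for line in text.splitlines():
            if line.strip():
                lang, gram = (s.strip() for s in line.split(",", 1))
                excluded.add((lang, gram))
        return excluded

    def keep_in_trunk(word, ancestors_of, excluded):
        """Drop excluded roots, and words all of whose ancestors are excluded."""
        if word in excluded:
            return False
        ancs = ancestors_of(word)
        return not ancs or any(a not in excluded for a in ancs)

    excluded = load_excluded()
    ancestors_of = lambda w: {("ine-pro", "*per-")} if w == ("la", "pro-") else set()
    print(keep_in_trunk(("la", "pro-"), ancestors_of, excluded))  # False: greyed out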

6.2.4. Dealing with diamonds and duplication

An annoying problem arises when you can analyze a word in its current form, or follow its ancestry and then analyze it. For example, is the English word "democracy" composed of two English morphemes "demo-" and "-cracy"? Or is it a descendant of Ancient Greek δημοκρατία, which is itself from "δῆμος" and "κράτος"? Well, it's kinda both? IDK. Here's what we have:

It's not great, not terrible. Could be worse. The way this is displayed, we basically traverse the poset depth first. We also check if we've already included the present word in our traversal; if so, we add this word, but instead of continuing to traverse depth first, we just add ▲ indicating "this word already appeared in the tree, above". If we did not do that, the tree could be considerably larger. For example, this one might be about twice as big:

(Incidentally, that one illustrates a further potential wrinkle, which is that we might want to merge words.)

Could be better, too. The ideal would be to somehow nicely render everything compactly and non-redundantly, where each word shows up once. But note that this is strictly a poset, i.e. it is a poset but is not a tree (because for example "democracy" flows to "δῆμος" both via "δημοκρατέομαι" and separately via "δημοκρατία" and separately via English "demo-"). Here's the "democracy" pastward trunk again:

Note how we have AGk δημοκρατία; δημοκρατέομαι after the Latin. We do this by merging those two Ancient Greek words. What we do is, before we traverse the preorder, we check to see if some sets of words are order-equivalent (i.e. they have the same set of strict ancestors and strict descendants within our pastward trunk preorder). If they are, then we merge them and display them together. To say it another way, these elements form a sort of "diamond". It would be a big waste of space (lines, especially) to separately show each of those elements coming from their shared ancestor. (Maybe we only do this if the words are in the same language.) It would be even better if we could merge the word on the bottom line in the pastward trunk, because it is basically equivalent.

There's probably ways to do this better than radix does, but you can at least see some of the issues. In theory you'd want to really deduplicate things, so each word shows up once, and use some fancy graph layout thing; but my experience was that this was too much hassle, too hard to control and predict, and maybe most importantly, not compact enough in terms of layout. But, could be good.
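
As a sketch of the traversal just described (simplified; the callables stand in for radix's precomputed maps, and in real radix the equivalence signatures are computed within the displayed trunk, with more conditions):

    def display(words, immediate_fw, strict_past, strict_fut, indent=0, seen=None):
        """Depth-first print of a preorder, merging order-equivalent words
        onto one line and marking repeats with ▲."""
        seen = set() if seen is None else seen
        groups = {}
        for w in words:
            # Order-equivalence signature: same strict ancestors and descendants.
            sig = (frozenset(strict_past(w)), frozenset(strict_fut(w)))
            groups.setdefault(sig, []).append(w)
        for group in groups.values():
            label = "; ".join(sorted(group))
            if any(w in seen for w in group):
                print("  " * indent + label + " ▲")  # already appeared above
                continue
            print("  " * indent + label)
            seen.update(group)
            children = sorted(set().union(*(immediate_fw(w) for w in group)))
            display(children, immediate_fw, strict_past, strict_fut, indent + 1, seen)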

Because of diamonds and duplication, even the pastward etymonic descent of one word is not a tree. (Amusingly, there's another reason for this: autodoublets.)

This points to using, not trees, but partially ordered sets (posets). This suggests using the metaphor of an anastomotic river delta, rather than a tree or forest:

6.2.5. High leaf-count

Many entries on wiktionary have many descendants in a boring way. For example, here's the table of conjugations of French chanter:

Every single one of those conjugations has its own wiktionary entry which links back to "chanter". Across all the different words like this in all the languages, that adds up to a ton of largely-redundant information.

We don't want to totally exclude these words. Partly, that would be an ad hoc complication. Also, you definitely still want to represent all these derived and inflected words separately, in the underlying etymonic graph structure. For example, sometimes a modern English word might come from a conjugated Old French word in an opaque way; in this case you want to see that whole line of etymonic descent, so you can understand how the word changed over time. In any case, there's no great reason not to include them at least in the underlying graph, which is trying to be as close as possible to the true anastomotic river of etymonic descent.

Early versions of radix simply displayed all these conjugations in the straightforward way. This was hugely distracting and pointless, making all the cognate display trees maybe 2-5x bigger than they had to be, for basically no benefit.

The current version of radix hides these words, mainly using a simple trick: If there's a word $W$ with an immediate futureward word $X$ in the same language, and there's no word $Y$ strictly futureward of $X$ that's in a "spotlighted language" (by default radix spotlights a few languages—English, German, etc.), then we hide $X$. This excludes conjugations. It also excludes compounding. E.g. here's part of the futureward display tree for German Haus:

Note that English house does not display (to the right, as futureward words, i.e. etymonic descendants) a big list of compounds such as {courthouse, birdhouse, boathouse, doghouse, greenhouse, guesthouse, lighthouse, longhouse, outhouse, playhouse, poorhouse, safehouse, schoolhouse, storehouse, warehouse, whorehouse}, even though wiktionary does have entries for those words and they are linked to "house". (It does display bringhouse, because that apparently routes through another language.) Now, the list of compounds is interesting! And, radix does allow you to start with "longhouse" and discover "house"; "longhouse" is fully represented internally. But, for the purposes of finding interesting cognates, you don't need to see these compounds.
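
Here's the rule as code, under my reading of it (a sketch; lang, imm_past, fw_strict are hypothetical stand-ins for radix's real lookups):

    def hidden(x, lang, imm_past, fw_strict, spotlighted):
        """Hide x if it hangs off a same-language word and no spotlighted-
        language word is strictly futureward of it."""
        same_lang_step = any(lang(w) == lang(x) for w in imm_past(x))
        has_spotlit_descendant = any(lang(y) in spotlighted for y in fw_strict(x))
        return same_lang_step and not has_spotlit_descendant

    # E.g. "courthouse" is immediately futureward of English "house" (same
    # language) and has no further descendants, so it's hidden; a conjugated
    # Old French word that leads on to a modern English word is not.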

6.2.6. Etymonic cycles

A cycle would be if multiple words are claimed to be descended from each other, in a circle. For example, a 2-cycle would be if $W$ says its root is $X$, and $X$ says its root is $W$. Broadly speaking, in the true underlying graph of actual etymonic descent, cycles should not appear.

For several reasons, there can be cycles in our computed graph of etymonic descent. Mainly this is due to errors. See the radix warning page for types of errors. One major type of error is errors or omissions in wiktionary. For example, some links may be simply improperly labeled, claiming etymonic descent incorrectly. Another example is ambiguous senses. Another major type of error is errors in radix; for example, an inferred link (see next subsection) might be inferred incorrectly. All of these errors actually happen.

Yet another type of error is when the scholarly work on the etymons of words is unsure about a word's origin. In that case, there could in theory be hypothesized links pointing in opposite directions (as words and languages are not momentary objects, but rather extended through time).

Even conceptually, the true underlying etymonic graph might actually have cycles, in a certain sense. For example, words can influence the form of other words, e.g. through hyperregularization. Does this count as true etymonic descent? I would say it does, or at least some cases are compelling—one word leaves a visible, quasi-semantically-meaningful morphological mark on another word. If we count this, and if there were cases of mutual influence between words, there would be a genuine cycle.

Therefore, we cannot actually use posets, technically speaking. That is, we cannot assume anti-symmetry—we cannot assume that if $W$ is pastward from $X$, and $X$ is pastward from $W$, then $W$ and $X$ are the same word.

Instead, we only assume that etymonic descent forms a preorder.

6.2.7. Inferring links with abstract regexes

How do we know when a sense is a pastward ancestor or futureward descendant of another word? The main way is just by reading off this information from wiktionary templates such as inherited. See "how" on the radix site. However, this gives a pretty incomplete picture. How do we do better?

As usual, the first and best way to improve the ability of a radix-like system to know about links between words, is to add that information to wiktionary itself.

That said, there are two more major methods that radix uses.

The first major additional method is inference. In this method, we read the text given by wiktionary—mainly from the etymology sections—and infer when that text strongly implies a pastward link between two words, even if the template itself does not assert a pastward link.

Example: English photon, where radix gives:

Now, how does radix know that Ancient Greek φῶς comes from Ancient Greek φάος? If we click on Ancient Greek φῶς in radix, and look towards the bottom of the popup, we see this:

This is a report of all the reasons radix thinks this pastward link exists. It only gives one reason. What is happening in this "reason"? It's saying that the reason is an "Inference" (specifically the one defined in src/inference/big_regex.py). That means it's not just reading the wiktionary template information. It's also then inferring a temporal link. It's a bit convoluted and very ad hoc, so not worth too much detail, but basically this inference rule is saying, look at this text in the etymology section:

From {{inh|el|gkm|φωτία}}, from {{inh|el|grc|φῶς}}, variant of {{m|grc|φᾰ́ος||light}},

Now, the {{m| template definitely cannot be taken as a pastward or futureward link in general. BUT (according to this specific rule), if we have an {{inh| template followed by an {{m| template with ", [something] of" in between, we assume (with a bunch of hard-coded exceptions) that the {{m| template is pastward of the {{inh| template. This is actually the only reason that radix knows Ancient Greek φῶς is futureward from Ancient Greek φάος. For example, the etymology section for Ancient Greek φῶς starts:

Contracted from {{m|grc|φάος}}.

We can't infer pastwardness from {{m| templates. (Another approach would be adding a rule for "Contracted from [template]".)

As an example of unimplemented inferences, here's the pastward trunk for English desire:

This is messy partly because radix doesn't know that Middle English desire also comes from Old French desirrer, and that Old French desirer also comes from Latin desidero. Could radix know that automatically? The etymology section for English desire looks like this:

Just reading that, it's fairly clear that e.g. Old French desirrer and Old French desirer are being treated as equivalent here. I'd say that this is enough evidence to automatically assume that there is at least one sense of each of those Old French words which is a descendant of Latin desidero.

(Note: currently, wiktionary explains desirer as an "alternative form of desirrer (“to desire”)". My guess is that radix simply doesn't recognize those types of links; that's another straightforward way to fix this missing inference.)

What does the underlying wiktionary text look like? At the moment it looks like:

From {{inh|en|enm|desir}}, {{m|enm|desire|pos=noun}} and {{m|enm|desiren|pos=verb}}, from {{der|en|fro|desirer}}, {{m|fro|desirrer}}, from {{der|en|la|dēsīderō|t=to long for, desire, feel the want of, miss, regret}}, apparently from {{m|la|de-}} + {{m|la|sidus}} (in the phrase ''de sidere'', "from the stars") in connection with astrological hopes. Compare {{m|en|consider}} and {{m|en|desiderate}}. Displaced native {{ncog|ang|wilnung||desire}} and {{m|ang|wilnian||to desire}}.

Since radix already parses this somewhat semantically, what remains is to analyze this etymology to extract more links. This is probably doable. (In fact, the current codebase of radix might already have a draft implementation—just, not included in the older version on the site.)

What's powering all this is abstract regexes. The idea of an abstract regular expression is that you do regular expressions, but instead of the primitive elements being characters tested by identity, you have the primitive elements being anything tested by any predicate you've provided. This way we can use our separate parser for wiktionary text, which produces a list of tokens (like [piece of text, wiktionary template with XYZ info, text, template, text, template]), and then operate on that with regexes that can recognize stuff like

A definite pastward template such as inherited or derived; followed by some text that starts with "," and ends with "from "; followed by a template that is either a definite pastward template, or m

Or that kind of thing. Fairly powerful in this context. You can see the inferences used in src/inference/refs_links.py. (Note that each one of these required substantial testing to see whether they picked up wrong things; several were discarded, as I recall it.)
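
To illustrate the core idea, here's a from-scratch toy version (this is not radix's actual engine; the token shapes and template names here are simplified):

    # Primitives are predicates on tokens; combinators build up patterns.
    def pred(p):                      # match one token satisfying p
        def m(toks, i):
            if i < len(toks) and p(toks[i]):
                yield i + 1
        return m

    def seq(*ms):                     # match the sub-patterns in order
        def m(toks, i):
            def go(k, i):
                if k == len(ms):
                    yield i
                else:
                    for j in ms[k](toks, i):
                        yield from go(k + 1, j)
            return go(0, i)
        return m

    def star(sub):                    # zero or more repetitions (unused below)
        def m(toks, i):
            yield i
            for j in sub(toks, i):
                if j > i:
                    yield from m(toks, j)
        return m

    # Tokens here are ("text", s) or ("template", name, args).
    is_past_tmpl = pred(lambda t: t[0] == "template" and t[1] in ("inh", "der"))
    linking_text = pred(lambda t: t[0] == "text"
                        and t[1].strip().startswith(",") and t[1].endswith("from "))
    target_tmpl = pred(lambda t: t[0] == "template" and t[1] in ("inh", "der", "m"))
    rule = seq(is_past_tmpl, linking_text, target_tmpl)

    toks = [("template", "inh", ["el", "gkm", "φωτία"]),
            ("text", ", from "),
            ("template", "inh", ["el", "grc", "φῶς"])]
    print(list(rule(toks, 0)))  # [3]: the pattern matches → infer a pastward link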

6.2.8. Inferring links with the transitive closure

When we have inferred that $W$ is futureward from $X$ and $X$ is futureward from $Y$, then we can also infer that $W$ is futureward from $Y$, even if that isn't explicitly stated anywhere.

Thus, we can take the transitive closure of the "is futureward" relation. All the relationships inferred this way should be correct, if the input relations are correct. If the input relations are incorrect, we will make a bunch of false inferences.

In practice, in my experience, this does happen, but not so much that results are degraded beyond use. Egregious errors in wiktionary are present but rare, and in fact radix is a good way of finding certain kinds of errors in wiktionary. You browse radix; check for anomalies, like implausible cognates; investigate why; usually it's because of a problem with radix; but sometimes it's an error in wiktionary, which can then be corrected. (Anomalies could also be automatically detected, e.g. by searching for time reversals, i.e. cases where radix thinks that an English word is pastward of an Ancient Greek word, or similar.)

Because links are NOT necessarily symmetric within wiktionary—i.e. sometimes $W$ says "I come from $X$" but $X$ does not say "$W$ comes from me"—we CANNOT compute the transitive closure in a local manner. We cannot just traverse the graph starting from one word and following its links. That does not work. We MUST aggregate information from the entire graph. This creates significant implementation challenges, which I'll discuss below in 6.4. How radix is implemented.
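
As a toy sketch of the global step (real radix does this at a scale where memory and storage become the whole problem; see "6.4. How radix is implemented"):

    def all_futurewardsets(words, links):
        """links: (ancestor, descendant) pairs gathered from ALL entries,
        normalized to point futureward. Naive fixpoint; fine for toy scale."""
        fw = {w: {w} for w in words}          # reflexive by convention
        children = {}
        for anc, desc in links:
            children.setdefault(anc, set()).add(desc)
        changed = True
        while changed:
            changed = False
            for w in words:
                new = set(fw[w])
                for c in children.get(w, ()):
                    new |= fw[c]
                if new != fw[w]:
                    fw[w], changed = new, True
        return fw

    # "el:φωτία comes from grc:φῶς" might only be stated on one entry, and
    # "grc:φῶς comes from grc:φάος" only inferred from another; globally:
    links = {("grc:φάος", "grc:φῶς"), ("grc:φῶς", "el:φωτία")}
    fw = all_futurewardsets({"grc:φάος", "grc:φῶς", "el:φωτία"}, links)
    assert "el:φωτία" in fw["grc:φάος"]       # inferred transitively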

For now I'll just note one major design decision in radix. We actually have two graphs. As discussed in the next section "6.3. The problem of sense disambiguation", we have to track senses and links between senses. However, we don't want to do this at first. That's because:

  • It's easier to do sense disambiguation if you already have access to the full transitive closure of grams under futureward ordering. You have a small structure, the gramwise-pastwardset of a starting sense, within which to compute heuristics that guess about the senses meant by various links.
  • Sense disambiguation is hard, so we want to experiment with it more; it is therefore more fluid; therefore we want to precompute as much as possible without having to lock in a sense disambiguation method. In order to do that, we need a more expansive notion of links; and using grams (letter-string plus language) as the basic element works well.
  • Maybe other reasons, e.g. storage efficiency.

For these reasons, radix uses a two-layered system. First we infer many relations between grams and compute the transitive closure. Then, only after we've built that large structure, we extract a secondary structure of "futureward wordposets" for all the pastmost "words". (Which is to say: We find all the senses that don't have ancestor senses, and for each of those we compute the futureward preorder of senses that might come from that sense—which is some subset of the gramwise futurewardset of the gram of that pastmost word.)

6.3. The problem of sense disambiguation

6.3.1. What's a word?

A word is, like, a string of letters, right? WRAWNG.

That's not what a word is. First of all, a word has a language. English voyage and French voyage are definitely not the same word.

Second of all, a word has different senses. What exactly a "different sense" is might be unclear. (E.g. are "rake" the verb vs. "rake" the noun different senses? What about "mouse" (mammal) vs. "mouse" (computer device)? IDK.) But the important thing for us is the etymonic senses. So verbs and nouns are the same sense, and multiple definitions listed together under one etymology are all part of the same sense. However, "weave" and "weave" are two different senses! It is not uncommon that a single gram (string of letters paired with a language) has multiple etymonic senses. For example, the string "lead" in English has two etymonically unrelated meanings: one is about a metal, the other is about going ahead in front of followers.

So what we want is a preorder on senses, where a sense is a string of letters in a language with a specific etymological ancestry.

(In the code of radix, I actually use "Gram" to mean "string of letters plus a language", and "Word" to mean "a gram, plus a number specifying the sense". See src/classes.py.)
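
So, roughly (a simplified sketch, loosely following src/classes.py; the real classes have more to them):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Gram:
        lang: str   # wiktionary language code, e.g. "en", "grc", "ine-pro"
        text: str   # the string of letters, e.g. "lead", "*h₂welh₁-"

    @dataclass(frozen=True)
    class Word:
        gram: Gram
        sense: int  # which etymonic sense of the gram (see "6.3.2" below)

    lead_metal = Word(Gram("en", "lead"), 1)   # the metal
    lead_guide = Word(Gram("en", "lead"), 2)   # going ahead of followers
    assert lead_metal != lead_guide            # same gram, different senses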

To a large extent, you can recognize different senses in wiktionary entries syntactically: Basically, sections with the header "Root" or "Etymology" are new senses. The actual rules are more complicated and also not formally enforced in wiktionary. This problem is addressed (in a very ad hoc, incomplete, janky manner) by src/parsearticle.py.

But this is not the whole story when our theory hits the messy reality of wiktionary. There are several problems. One problem is that sometimes the same string of letters has two Etymology entries for the same language, BUT really one of them just comes directly from the other. Another problem is that many entries in wiktionary don't exist or are empty or incomplete; in particular, there may be some missing senses. This means that if two etymologically unrelated words point to the same string of letters in the same language, and there is no entry for that language, then we are almost completely out of luck, by default; we cannot tell which ancestors of the collider correspond to which descendants of the collider.

Yet another issue is that, as I may have mentioned, Middle English in particular is Bad:

Really bad:

(Oh my God.)

There's probably some excuse, like "at least we were even trying to write things down and figure out spelling, sorry if it wasn't magically already standardized" or something. But anyway, this makes one think that multiple different grams (strings of letters) are kinda the same sense.

This problem of multiple grams for one sense exacerbates the much worse problem of ambiguous senses.

6.3.2. Ambiguous senses

The existence of multiple senses leads to a design choice, which is that the fundamental unit—the elements of the grand etymonic preorder on all words—is senses, not grams or strings of letters. A sense is a gram, plus a specification of which sense.

(In the current form of radix, the specification is just a number; 1,2,.. are the explicit senses demarcated by an "Etymology" or "Root" section in a wiktionary entry, 0 is a default sense for entry contents not inside one of those containers, and -1 is for when there is no wiktionary entry but a gram is linked to by another entry. Since wiktionary does not usually explicitly specify senses, radix makes a guess at what senses there are.)

The problem with ambiguous senses is that it leads to many false cognates. For example, the section "4.3. Problems with radix" above gave the example where English value points to PIE *h₂welh₁-, and English wild points to a different sense of the same gram PIE *h₂welh₁-. Because of this collision, when we're tracing forward from the pastward trunk of "value", we include "wild" because we can't automatically tell which sense of PIE *h₂welh₁- "wild" is pointing to. In accordance with "6.1. General principles", we want to show "wild" as cognate to "value" (meaning really, "here's a thing that MIGHT be cognate with value, check it yourself"), if we can't tell confidently that it IS NOT cognate with "value".

6.3.3. Addressing ambiguous senses

In accordance with the general principle of improving wiktionary, I should note that wiktionary does have a mechanism for demarcating senses within entries for single grams, and specifying which sense is pointed to by a pastward or futureward link from another entry. So the "right" way to fix any given instance of this problem is to add the information to wiktionary. (And actually add the ability to incorporate that information to radix, which I had not yet done.)

That said, there's a few other things we can do. Suppose that English wild links pastward to PIE *h₂welh₁-, but does not specify a senseid that it is linking to. This happens to currently be the case; if you click edit, you can see the specification of the "wild" entry, and there you can see (until it's edited) that the entry gives this link:

{{der|en|ine-pro|*h₂welh₁-|t=hair, wool, grass, ear (of corn), forest}}

To translate, that means:

  • [der] the sense of the present entry derives (pastward; maybe indirectly, through other senses in other languages) from the word specified by this link
  • [en] the sense of the present entry is an English sense
  • [ine-pro] the sense of the pastward word is a Proto-Indo-European sense
  • [*] (this is a reconstructed word, not attested in any known text)
  • [h₂welh₁-] (the text of the linked pastward word)
  • [t=] (translation into English)

How could we tell which sense of PIE *h₂welh₁- this link points at? One thing that radix does is to check if one of the senses of PIE *h₂welh₁- links (directly or indirectly) futureward to English wild. If so, we can try to exclude the other sense of PIE *h₂welh₁- from consideration as being the target of this particular link. This is kinda complicated to implement, and it might not actually work so well, but in my tests it seems to be ok at cutting down on incorrect cognates, hopefully without missing too many real ones. (Also, radix does a more complicated thing, that we needn't discuss; see src/traversing.py.)
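
As a sketch of that first check (simplified relative to src/traversing.py; the callables are stand-ins):

    def candidate_senses(source, target_gram, senses_of, fw_senses):
        """Senses of target_gram that a pastward link from source might mean.
        senses_of(gram) lists its senses; fw_senses(sense) is that sense's
        explicitly-known futureward set."""
        senses = senses_of(target_gram)
        confirmed = [s for s in senses if source in fw_senses(s)]
        # If some sense is explicitly known to lead futureward to source,
        # exclude the others; if none is, keep all senses (err on the side
        # of displaying possible cognates).
        return confirmed or senses

    # In the "wild" example, presumably neither sense of *h₂welh₁- explicitly
    # points futureward to "wild", so both senses stay candidates, and the
    # false cognate under "value" survives.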

Even with that heuristic, in order to not shut down too many potential cognates, we always assume that a pastward link to a gram refers to at least one of the available senses in wiktionary. This is a problem if there are multiple senses of the pastward gram, but the wiktionary entry for that gram lists only one sense or doesn't exist at all. In that case, we're nearly forced into a collision and false cognates.

What else can we do?

In some ideal system, we would at least notice when it seems like there are missing senses causing collisions. Then these anomalies could be raised to the attention of editors, if the automatic detection is good enough to not just be spam. In some cases you could even fabricate senses automatically, though I would guess that that strategy would end up being more complexity than it's worth (and instead you should just improve wiktionary itself).

One fairly straightforward idea is to use the meaning of the word. The example link above provides a translation of the target pastward word:

|t=hair, wool, grass, ear (of corn), forest}}

Looking at the two wiktionary entries for PIE *h₂welh₁-, it's clear which one of the two is being targeted by this link from "wild". Even if a link template doesn't provide a translation, you might still be able to infer which sense is linked by the definition of the linking word.

Implementing this would be quite nontrivial, but doable. You could try just checking overlap of words, maybe restricting to content words; or you could try some word-embedding-related thing. Using LLMs is a significant possibility. However, since there are millions of wiktionary entries and tens or hundreds of millions of links, you probably don't want to do a naive implementation. Instead you could use cheap rules to deal with obvious confident cases, and otherwise use a tutoring scaffold where the LLM makes guesses about the target sense, and the human supervises.
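
A sketch of the cheap-rules end of that spectrum, as naive content-word overlap (all names here are made up):

    STOP = {"the", "a", "of", "to", "or", "in", "for", "and"}

    def content_words(text):
        return {w.strip("(),;").lower() for w in text.split()} - STOP - {""}

    def best_sense(link_gloss, sense_glosses):
        """sense_glosses: {sense_id: gloss}. Pick the unique best overlap,
        else None (meaning: don't disambiguate on this evidence)."""
        scores = {sid: len(content_words(link_gloss) & content_words(gloss))
                  for sid, gloss in sense_glosses.items()}
        top = max(scores.values(), default=0)
        winners = [sid for sid, sc in scores.items() if sc == top]
        return winners[0] if top > 0 and len(winners) == 1 else None

    print(best_sense("hair, wool, grass, ear (of corn), forest",
                     {1: "to rule", 2: "hair, wool"}))  # 2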

6.4. How radix is implemented

There's too much detail, much of which is irrelevant anyway and much of which I don't remember, to go into everything. As a general note, I'll say that I learned a lot working on this—and conversely, this project includes a lot of major design mistakes, partly as holdovers and partly as things I wouldn't know how to do better even today. (E.g. handrolled caching, powerful but ad hoc testing, weird SQL nonsense, handling all sorts of weirdness like language codes and line noise in raw entries, etc.) The code (in a broken state) is here: https://github.com/tsvibt/public_radix_sep_2025/tree/main/src

I'll just give a couple sketches. The overall structure, conceptually, is as follows:

  1. Get a wiktionary xml dump of all articles on (English) wiktionary
  2. Extract the big graph (preorder) of words related by pastward/futureward relationships
  3. When given a word $W$, get all the words pastward of $W$ and get all the words futureward of any of those words
  4. Restrict these structures to the relevant words
  5. Print out these restricted structures

Most of this work is done in one big day-long precomputation step. This step passes over the entire wiktionary (sort of) several times, building up the information we'll need later in order to serve results fast, and storing that information in a big (huge, >30GB) SQL database. The code for this big precomputation is in src/precomputing. This is one way in to understanding the code base a little bit. In a bit more detail:

  • Parse the xml into wiktionary entries for words. {A_xmldump_redirects.py, B_xmldump_separate.py, C_AB_xmls_parse.py}

    • We discard many articles, e.g. articles that describe how wiktionary works; we only want word entries.
    • We try to guess (in a handrolled ad hoc way) where the word and sense boundaries are.
  • Go through the entries to get links between words. {D_C_parsed_allmentioners.py, E_C_parse_linksrefs.py, F_E_linksrefs_pastwardsfuturewards.py}

    • This is the main place where we're actually "reading" articles. We extract links between words and index those links to the words (rather than having them buried opaquely in wiktionary entries).
    • This is where we use the meaning of wiktionary templates and where we use abstract regexes.
  • Compute the big graph (preorder) of grams related by pastward/futureward relationships. {Q_E_futurelinks_wordgramrefs.py, G_F_grampastmosts.py, H_G_gram_immediatefutures.py, I_G_gram_strictfutures.py, M_I_gram_lang_allfutures.py, N_QGI_gram_coverings.py}

    • This is where we compute the transitive closure, which is much of the computational difficulty. We compute and store full pastwardsets and futurewardsets. In a sense this is very inefficient because of the space consumption and time taken to write. On the other hand, I think it's faster than recomputing closures of things. But I'm not actually positive about that. I did try to test it, but I may have misunderstood the meaning of some tests; e.g. the computer's automatic caching behavior and memory paging behavior is, to me, unpredictable and confusing and different between different runs of the same program.
    • We also compute useful structures such as a map from grams to the grams that are immediately futureward of that gram.
  • Compute structures about words, i.e. grams with sense numbers attached. {K_C_parsed_realgivenwords.py, O_KN_word_pastwardwordposet.py, P_O_pastmostword_futurewordposet.py, Q_E_futurelinks_wordgramrefs.py}

    • We do various shenanigans to infer which senses point to which senses.
    • We don't store futureward sense preorders for every word. Because we are doing shenanigans, unlike for the global gram futureward preorder, the sense preorder for a sense in the middle of a sense preorder is not necessarily just the order restriction to the futurewardset of that sense. Instead we compute all the pastmost senses, and compute the futureward sense preorders of those pastmost senses.

After all this precomputation, the system is ready to serve requests. Given a starting word (and other settings), we compute and print radix's guess at the relevant pastward sense preorder of that word, and the relevant futureward sense preorder of the words in the pastward preorder. The code for this is in src/printing.

There's various ways that we prune the preorder. See the above section "The graph structure: Pre-orders".

6.5. Shattering preorders

Listed in "6.1. General principles" is the goal of serving results fast, because this materially affects the usefulness of a system like this for someone who's trying to understand many cognates for several words. What are some of the challenges with quickly serving results?

Basically, at least the way radix is currently set up, and assuming I recall correctly, the main issue for performance is that the preorder objects are really big. There are two basic reasons for this:

  1. A single pastmost root word might have very many descendants—sometimes tens or possibly hundreds of thousands, across many languages.
  2. The full way of representing these structures, including a mapping from each word to its full futurewardset, is very very redundant (something like quadratic in the number of words).

I don't actually recall whether or not you can avoid the problem in 2. You might be able to get away with only retrieving the immediate-orders; but I'm not sure, and I think that requires some complicated precomputations. Anyway, this section is about problem 1.

One main thing you do to address problem 1 is to prune before storage. See e.g. "High leaf-count" above on pruning.

What else can we do?

First of all, why is the futureward preorder so big? The main issue is just that there are very many languages, even restricting to the Indo-European family. (See here; click on "Family tree" to see hundreds of Indo-European languages.) Most of these languages will be only very thinly represented, but still. This means the out-degree of many words, especially PIE words, can be very large; e.g. PIE *bʰer- has dozens of derived descendants, some of which themselves have many descendants (e.g. Latin fero leading to {transfer, refer, infer, confer, etc.} or Ancient Greek φέρω leading to {metaphor, phosphorus, periphery, etc.}).

A first-draft idea would be to simply exclude, from the outset, all words outside of some small set of interest. But this doesn't even work, because

  1. you have to include the ancestors of words from your languages of interest; and
  2. you want to infer links between those words from the etymology sections of words from languages that aren't in your special set.

So, you want to do the precomputation anyway with all the languages.

A second-draft idea for reducing the size would be to precompute the whole large graph, but then only store and serve the preorders that you get by restricting to the languages of interest. This proposal can work. However, it is not language-neutral, which goes against a principle mentioned above in "6.1. General principles".

Also importantly, it's simply a much less useful tool, even for a very English-centric user. Personally I am not infrequently genuinely interested in seeing only results for English; and I am not infrequently genuinely interested in seeing results for English and Latin and German and Ancient Greek. Therefore, a better solution would be to somehow serve results quickly for whatever combination of languages is asked for.

This leads to the method of shattering the preorder, and then only serving the relevant shards. The idea is basically that for each root (i.e. pastmost word), we compute the futureward preorder. Then, for each language that we want to make available, we compute the restriction of the preorder to words in that language and their ancestors. (This happens during precomputation in src/precomputing/M_I_gram_lang_allfutures.py.)

This leads to somewhat redundant storage, in that, for example, a PIE root's futureward preorder when restricted to English might significantly overlap with the same for German. However, any given language's preorder tends to be much smaller than the full preorder. Since, at least for me, it's usually not that useful to request a display for many languages at once, the speed gains are substantial.
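
A sketch of the shard extraction (a toy version of what happens during precomputation in M_I_gram_lang_allfutures.py, as I recall it):

    def shard(root_fw, language, lang_of, pastward_within):
        """Restrict a root's futureward preorder to one language's words plus
        their ancestors within that preorder. root_fw: all words in the
        root's futureward preorder; pastward_within(w): w's direct ancestors."""
        keep = {w for w in root_fw if lang_of(w) == language}
        stack = list(keep)
        while stack:
            for a in pastward_within(stack.pop()):
                if a in root_fw and a not in keep:
                    keep.add(a)
                    stack.append(a)
        return keep

    # Serving "English and German" then merges shard(root, "en", ...) with
    # shard(root, "de", ...), each much smaller than the full preorder.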

Note that there are other strategies for going even faster. For example, you could cache the entire radix page, including the large futureward preorder display. But this is brittle to settings, e.g. selecting different languages. It would not be efficient to store the whole page for every set of languages that a user wants to see displayed. (But it would be good to store this for the most common few combinations.)

To illustrate:

The newer languages extend all the way back because the wedges represent the pastward closure, within the PIE root's futureward preorder, of the newer language's words. (This diagram is not accurate or to scale, it's just trying to illustrate shattering haha.)

Now, we could display this whole thing. Or, we could just pull the restricted preorders necessary to display results for English and German only. Then we get:

It's a lot smaller, and it's a lot smaller still if you account for the fact that we're excluding many more of the modern languages (which tend to have many wiktionary entries).

This does add computational burden, in that we have to merge the preorders. But this is fairly fast. The main cost is actually reading from disk, IIRC. This may mean that the storage and/or retrieval method is hopelessly slow; but I did tinker with that a fair amount.

6.6. End-to-end testing

A major issue with radix-like systems is silent bugs: failures that don't announce themselves. You can run a giant day-long precomputation, and then find out that you messed up some logic; it didn't make an invalid program, it just failed to compute good results. (I assume this is the same in any programming task that's about squeezing answers from data.)

To some extent you have to bite the bullet and manually sanity check things. This is especially true when changing the actual intended semantics, e.g. adding a new rule for inferring links between words. This changes most of the datastructures that radix produces. In theory you could build up a set of modular tests, like that word $W$ should be known to be pastward of $V$, and that $X$ should be known to not be futureward of $Z$. But, I haven't done that.

A next-best partial solution for testing is end-to-end testing. The idea here is to check if the full output of the system is the same before and after a change. This does not help with intended semantic changes, but it does help with a large class of edits, especially refactors. For example, I used this to improve performance in many ways, while being fairly confident I wasn't messing up any of the semantics.

At least two challenges came up with setting up end-to-end testing.

One was determinism. For the idea to work, different runs have to always give the same answer. The main issue here was the fact that python sets don't preserve order. Mostly this is fine because preorders encoded as downsets don't care about how the downsets are stored. However, when we convert to a list for printing, we can get non-determinism (i.e. the result depends on python's hashseed for that session). To fix this, we make sure to sort at some point before traversing.
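
Concretely, the fix is of this shape:

    # Iterating a set of strings can differ between runs (it depends on the
    # session's hash seed), so sort before any order-sensitive traversal:
    children = {"singen", "songe", "sang"}
    print(list(children))    # order may differ across runs
    print(sorted(children))  # always ['sang', 'singen', 'songe']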

The major problem is precomputation time. If I'm modifying the methods for final printing, I can just test their outputs against the previous outputs, no problem. But if I'm modifying the core code that infers links, computes the global preorder, infers senses, etc., then large chunks of the database could be affected. To fully test that code would require running the entire giant computation again. That's infeasible.

Instead, the idea is to make "subuniverses". Basically, we take one or a small set of words (say, $W$); we get their pastward-futureward preorders; and then we take a "halo" around the set of words in those preorders, which comprises all words that mention any of the words in that big set. Then we dump all these words, including the halo, plus their raw wiktionary entry articles, into a new (much smaller) xml file.

The code that orchestrates this is here: scripts/make_subuniverse.py

That xml file can be used as the starting point for a whole new precompute run, followed by testing the outputs of printing methods. If the change is supposed to be semantically inert, then when we pass the original starting word $W$ to be printed, the result should be exactly the same. We created a little "subuniverse" comprising everything that affects $W$, so, just from the perspective of $W$, everything is the same. (The results for other words, even ones in the subuniverse, may have changed.)

The point of expanding to the halo of all mentioners is to be able to notice when the change we make affects how words get marked as linked to other words. If we didn't do this, then a modified rule of inference might, when run on the full universe, say that actually it turns out $V$ does link pastward to $W$. But the subuniverse was constructed back when $V$ did not link to $W$; without the mentioner halo, $V$ would not have been included in the subuniverse, so we would not notice the difference. This would be a failure of our end-to-end testing.
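
A sketch of the construction (simplified relative to scripts/make_subuniverse.py; the callables stand in for radix's real lookups):

    def subuniverse(start, pastwardset, futurewardset, mentioners, article):
        """Collect everything that could affect results for `start`, plus the
        mentioner halo, and return the raw articles to dump to a small xml."""
        core = set()
        for a in pastwardset(start):
            core |= futurewardset(a)       # start's pastward-futureward preorder
        halo = set()
        for w in core:
            halo |= mentioners(w)          # every entry mentioning a core word
        return {w: article(w) for w in core | halo}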

7. Conclusion

I think the current form of radix is useful for my niche application of finding distant non-obvious cognates more easily than just clicking around wiktionary. But, it could be so much better, with a bunch more work. Again I'd be happy to chat with anyone who might be seriously interested in making a high-quality radix-like system. You can reach me at my gmail address: radixtsvibt

I think the current minimally working version of radix demonstrates that something like this is feasible, albeit difficult. I hope I've described enough to also communicate that there's lots of room for significant improvement.

I think that words are important because they help you think well (and unsuitable words make you think mistakenly). When I learn about the connections between words I see the stars ("desire", "de-sidere", "sidereal", "of the stars") and I see the future of human thought. I'd like there to be a high-quality 'scope for words.


  1. Meelen, Marieke, Nathan W. Hill, and Hannes Fellner. "What Are Cognates?". Papers in Historical Phonology 7 (2022). https://doi.org/10.2218/pihph.7.2022.7405.