Tracing languages back to their roots often presents challenges like missing words, scant evidence and countless questions. Researchers in Carnegie Mellon University’s School of Computer Science hope to tackle some of these challenges using computer models that help reconstruct ancient languages.
Even though they’re no longer used, the protolanguages that spawned modern languages can provide a window into the cultural past. They shed light on how people moved, how they related to one another, and how different groups came into contact in ways that can’t be gleaned from archaeological artifacts or population genetics.
To reconstruct missing or incomplete ancestral languages, David Mortensen, an assistant research professor in the Language Technologies Institute, worked with SCS undergraduate Liang (Leon) Lu and University of Southern California student Peirong Xie to create a semisupervised computer model that can derive protolanguages from their modern forms based on a small set of examples. Their work, "Semisupervised Neural Protolanguage Reconstruction," received a Best Paper Award at the 2024 Association for Computational Linguistics conference in Bangkok.
"For most language families in the world, their protolanguages have not been reconstructed," Lu said.
Not all ancestor languages are known. For example, Proto-Germanic, the common ancestor language of German, Swedish and Icelandic, is one of the many missing languages that historical linguists have been trying to reconstruct.
"As far as we know, nobody ever wrote down Proto-Germanic, but it’s the shared ancestor of all these languages," Mortensen said. "Protolanguage reconstruction is about piecing together those languages."
Mortensen’s lab first set out to train computer models to reconstruct the final few missing words of a protolanguage. The model relied on modern words and their known ancestors, also called protoforms, to fill in the gaps. It needed a lot of human guidance, and it wasn’t scalable or all that useful to historical linguists.
"Languages have family trees. Latin split into Spanish, French, Portuguese, Italian and Romanian. English comes from Middle English, which comes from Old English, which comes from Proto-West Germanic," Mortensen said. "You could recover or reconstruct what the ancestor languages were like by comparing the descendant languages and assuming a consistent set of changes."
When Lu joined the lab, he proposed designing a model that could reconstruct protoforms from a small foundation of guiding examples. The current computer science junior envisioned much less human labeling. He also noticed that when translating between a language like French and its Latin ancestor, models trained to predict modern words from protoforms performed better than models trained to predict protoforms from modern words. He aimed to create a new tool that could work well in both directions.
"That was an important realization that informed us and pushed us to do something with the architecture to implement this observation," Lu said. "We could somehow make use of the fact that going forward in time is easier to model."
The neural network architecture Lu designed now includes one model for going backward in time and another model for going forward. This neural network can predict the correct ancestor words from modern terms and accurately transform the ancestor words back into their modern equivalents, similar to how human linguists approach reconstructing protolanguages. Accounting for the differences in the direction of translation allows the computer to predict ancient word forms more accurately. Lu tested the tool on both Romance languages and Sinitic languages to show that the approach can be generalized to multiple language families.
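For a concrete picture of that two-direction design, the sketch below pairs two character-level sequence-to-sequence models: one maps daughter forms backward in time to a protoform, the other maps a protoform forward in time to a modern (reflex) form. It is an illustrative outline, not the authors' released code; the class structure, vocabulary size and hyperparameters are assumptions.

```python
# A minimal sketch (not the authors' released code) of the two-direction idea:
# one seq2seq model runs "backward in time," reconstructing a protoform from its
# modern descendants, and a second runs "forward in time," predicting a modern
# (reflex) form from a protoform. Architecture details here are assumptions.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Character-level GRU encoder-decoder."""
    def __init__(self, vocab_size: int, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src_ids: torch.Tensor, tgt_ids: torch.Tensor) -> torch.Tensor:
        # Encode the source character sequence into a summary state.
        _, state = self.encoder(self.embed(src_ids))
        # Decode the target sequence conditioned on that state (teacher forcing).
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)
        return self.out(dec_out)  # logits over the character vocabulary

VOCAB = 64  # placeholder size for a shared character/phoneme inventory

# Backward in time: concatenated daughter forms -> protoform.
reconstruction_model = Seq2Seq(VOCAB)
# Forward in time: protoform -> one daughter (reflex) form.
reflex_model = Seq2Seq(VOCAB)
```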
This semisupervised approach provides the most practical solution for historical linguists trying to reconstruct a protolanguage from its descendants in a new language family. The model can start working with only a few hundred example translations provided by a human linguist. Then it can predict protoforms for words that the linguist hasn’t worked out yet. This tool has the potential to ease the total workload of the human linguist, although it does need to be trained separately for each language family.
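The semisupervised part can be pictured as a training step like the one below: cognate sets that come with a linguist-supplied protoform contribute an ordinary supervised loss, while unlabeled sets contribute a round-trip signal in which a candidate protoform is decoded and the forward model is asked to regenerate the attested modern forms. This is a schematic of the general idea under stated assumptions, not the paper's exact objective; the decoding helper, loss weighting and padding conventions are illustrative.

```python
# Schematic semisupervised training step (an illustration, not the published
# objective). `recon` and `reflex` are seq2seq models with the interface from
# the sketch above: model(source_ids, target_input_ids) -> logits.
import torch
import torch.nn.functional as F

PAD, BOS, MAX_LEN = 0, 1, 30  # assumed padding/start tokens and length cap

def seq_loss(logits, targets):
    """Token-level cross-entropy, ignoring padding positions."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), ignore_index=PAD
    )

def greedy_decode(model, src_ids):
    """Decode a candidate protoform one character at a time (no beam search)."""
    out = torch.full((src_ids.size(0), 1), BOS, dtype=torch.long, device=src_ids.device)
    for _ in range(MAX_LEN):
        next_ids = model(src_ids, out)[:, -1].argmax(dim=-1, keepdim=True)
        out = torch.cat([out, next_ids], dim=1)
    return out

def training_step(labeled, unlabeled, recon, reflex, weight=0.5):
    daughters, protoform = labeled  # padded token-id tensors for labeled cognate sets
    # Supervised loss on the few hundred sets with a linguist-provided protoform.
    supervised = seq_loss(recon(daughters, protoform[:, :-1]), protoform[:, 1:])

    # Round-trip signal on unlabeled sets: reconstruct a candidate protoform,
    # then require the forward (reflex) model to regenerate the modern forms.
    with torch.no_grad():
        candidate = greedy_decode(recon, unlabeled)
    consistency = seq_loss(reflex(candidate, unlabeled[:, :-1]), unlabeled[:, 1:])
    return supervised + weight * consistency
```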
Learn more about the research and access the models on the project’s website.