Mind Matters Reporting on Natural and Artificial Intelligence
Unique scribe library full of old and valuable manuscripts

Surprising Ways AI Can Help Recover Lost Languages

Researchers into lost languages hail the new technologies as a golden age for discovery

When an apparently indecipherable manuscript from a lost language turns up, AI can help. But first, how is a language born and how does it die (or get lost)?

We really don’t know how human language was born. Theories abound, but all we know for sure is that it is unique. In a 2017 paper in BMC Biology, evolutionary biologist Mark Pagel states flatly, “Human language is unique among all forms of animal communication.” In that open-access paper, he cuts short the widely popularized claims for chimpanzee language:

Most ape sign language, for example, is concerned with requests for food. The trained chimpanzee Nim Chimpsky’s longest recorded ‘utterance’, when translated from sign language, was ‘give orange me give eat orange me eat orange give me eat orange give me you’

Pagel, M., “Q&A: What is human language, when did it evolve and why should we care?” at BMC Biol 15, 64 (2017). https://doi.org/10.1186/s12915-017-0405-3 https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-017-0405-3 (open access)

Nim Chimpsky could have accomplished the same task much more simply by pointing to an orange or a picture of one. If humans had not taught him sign language, he would likely have done so and left it at that.

So apes don’t have language but humans always have it. That is, all human groups express themselves through language. And all languages can, in principle, handle a full load of complex concepts (with considerable borrowing of technical terms from each other, of course).

Languages die out when people stop speaking them. Today’s global communication threatens languages with very small numbers of speakers. For example, Papua New Guinea has 7 million people and 856 of the world’s roughly 6,000 recognized languages. Most Papua New Guinean languages have fewer than 1,000 speakers. With masses of educational materials available in English, English is, not surprisingly, the main language of its school system. Thus some of its local languages may be found among the 43% of languages said to be endangered, in the sense that they are likely to become extinct in the near future.

Can extinct languages be brought back? Yes, but it takes a lot of documentation, a powerful incentive, and a great deal of work. Classical Hebrew, for example, was brought back from millennia ago by Jewish people living in Europe who were anxious to carve out an identity of their own. It is now the national language of Israel, with eight million speakers. But the Jewish people had the Hebrew scriptures and plenty of other preserved written work to guide them in their effort.

Many other languages have flourished and then gone extinct without ever being recorded. Sometimes we know their names and a few surviving words. And now there’s a new wrinkle: Languages that were once written down have begun to be rescued from near-total oblivion by a computer algorithm.

There was once a state in what is now western Azerbaijan (see red outline on map), known today as Caucasian Albania (387–706 AD). The inhabitants spoke a northeast Caucasian language whose alphabet was rediscovered in 1937 in a 15th-century Armenian-language manuscript. Short inscriptions on candlesticks, tiles, vessels, etc., were found in later years but nothing more substantial until 2003, when a manuscript was uncovered using new computer technology.

The manuscript was found at St. Catherine’s Monastery on Mount Sinai in Egypt, the traditional setting for the giving of the Ten Commandments. The monastery houses the world’s oldest continuously operating library, in use since the sixth century AD, in the afterglow of the Roman Empire, and it holds a vast collection of ancient manuscripts.

The writing in Caucasian Albanian had actually been rubbed out. It was found under other writing on a parchment (writing material made from animal skins). Parchments were expensive and time-consuming to produce; thus they were frequently rubbed out and reused, in which case they were called palimpsests:

Manuscripts with multiple layers of writing are known as palimpsests, and there are about 130 of them at St. Catherine’s Monastery, according to the website of the Early Manuscript Electronic Library, which has been leading the initiative to uncover the original texts. As Richard Gray explains in the Atlantic, with the rise of Islam in the 7th century, Christian sites in the Sinai Desert began to disappear, and Saint Catherine’s found itself in relative isolation. Monks turned to reusing older parchments when supplies at the monastery ran scarce.

Brigit Katz, “The Lost Languages Discovered in One of the World’s Oldest Continuously Run Libraries” at Smithsonian Magazine (September 4, 2017)

These writings exist only as “faint scratches and flecks of ink beneath more recent writing,” as Gray describes them. But new technology now offers a chance at decipherment.

To uncover the palimpsests’ secret texts, researchers photographed thousands of pages multiple times, illuminating each page with different-colored lights. They also photographed the pages with light shining onto them from behind, or from an oblique angle, which helped “highlight tiny bumps and depressions in the surface,” Gray writes. They then fed the information into a computer algorithm, which is able to distinguish the more recent texts from the originals.

Brigit Katz, “The Lost Languages Discovered in One of the World’s Oldest Continuously Run Libraries” at Smithsonian Magazine (September 4, 2017)
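The articles quoted here do not spell out the algorithm itself, so the following is only a toy illustration of the general idea behind multispectral imaging: an erased ink can be nearly invisible under one kind of light yet still darken the page under another, and comparing the two photographs makes it stand out. The band names ("visible" and "uv"), the reflectance values, the threshold, and the function names are all assumptions made up for this sketch, not details from the St. Catherine’s project.

```python
# Toy sketch of band differencing on a simulated palimpsest page.
# All reflectance values are hypothetical (0 = dark, 1 = bright), not measured data.
PARCHMENT = {"visible": 0.90, "uv": 0.85}  # bare page: bright in both bands
OVERTEXT = {"visible": 0.10, "uv": 0.10}   # later ink: dark in both bands
UNDERTEXT = {"visible": 0.85, "uv": 0.45}  # erased ink: faint in visible light, darker under UV


def render(layout, band):
    """Simulate one photograph of the page under a single band of light."""
    table = {".": PARCHMENT, "O": OVERTEXT, "u": UNDERTEXT}
    return [[table[ch][band] for ch in row] for row in layout]


def undertext_mask(visible, uv, threshold=0.2):
    """Flag pixels that are bright in visible light but clearly darker under UV."""
    return [
        [(v - u) > threshold for v, u in zip(vis_row, uv_row)]
        for vis_row, uv_row in zip(visible, uv)
    ]


# "." is bare parchment, "O" is the later writing, "u" is the erased writing.
layout = [
    "..OO..",
    ".uuOu.",
    "..OO..",
]

visible = render(layout, "visible")
uv = render(layout, "uv")
mask = undertext_mask(visible, uv)

# Draw the recovered layer: only the erased strokes should remain.
recovered = ["".join("u" if hit else "." for hit in row) for row in mask]
for row in recovered:
    print(row)
```

In this simulation the later writing drops out because it is equally dark in both photographs, while the erased strokes survive the comparison. Where later ink sits directly on top of an erased stroke, nothing is recovered, which is one reason real projects photograph each page many times under different lighting.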

Could a team of dedicated human researchers have done all this without a computer? Possibly, but it would take them much, much longer to get through all the calculations than it takes a special-purpose algorithm. (See, for example, the AI analysis of a burnt-up scroll.)

Since 2011, 74 palimpsests have been photographed, and the researchers have found 108 pages with previously lost poems in Greek and the oldest-known example of a recipe attributed to Hippocrates, the ancient Greek physician. All date from between roughly the 4th and 12th centuries AD, a span that includes the period when the Caucasian Albanian kingdom flourished.

Attention has focused on the lost languages:

Over five years, the researchers gathered 30 terabytes of images from 74 palimpsests—totaling 6,800 pages. In some cases, the erased texts have increased the known vocabulary of a language by up to 50 percent, giving new hope to linguists trying to decipher them. One of the languages to reemerge from the parchments is Caucasian Albanian, which was spoken by a Christian kingdom in what is now modern day Azerbaijan. Almost all written records from the kingdom were lost in the 8th and 9th century when its churches were destroyed.

Richard Gray, “The Invisible Poems Hidden in One of the World’s Oldest Libraries” at The Atlantic (August 9, 2017)

The new technology has enabled scholars to recover some words, such as “net” and “fish,” though the massive task of decoding all the rest remains. To reduce the risk from vandalism or terrorism, the researchers have put the now-160 palimpsest texts online.

Considering how the new technology has energized their field, it’s no wonder the researchers hail a “new golden age of discovery.” Some become quite thoughtful too:

Another dead language to be found in the palimpsests is one used by some of the earliest Christian communities in the Middle East. Known as Christian Palestinian Aramaic, it is a strange mix of Syriac and Greek that died out in the 13th century. Some of the earliest versions of the New Testament were written in this language. “This was an entire community of people who had a literature, art, and spirituality,” says [librarian Michael] Phelps. “Almost all of that has been lost, yet their cultural DNA exists in our culture today. These palimpsest texts are giving them a voice again and letting us learn about how they contributed to who we are today.”

Richard Gray, “The Invisible Poems Hidden in One of the World’s Oldest Libraries” at The Atlantic (August 9, 2017)

Will the world ever run out of languages? Probably not. Both adults and children can patch together or even invent new languages to communicate with others when no common language exists. The river of human language changes a great deal as it flows but it does not run dry.

Note: The map of Caucasian Albania is courtesy of Abu Zarr (own work), CC BY-SA 3.0. Original source: The Cambridge Ancient History, volume XIV, Late Antiquity: Empire and Successors, A.D. 425-600, chapter 22b, “Armenia in the fifth and sixth century” by R.W. Thomson, page 662 (R.W. Thomson is retired Calouste Gulbenkian Professor of Armenian Studies at Oxford University). Original map: the map of Armenia and its neighbours on page 666.

Note 2: “Caucasian Albania” is not related to Albania today. The term “albania” is Latin for “a mountainous region.”
