Building a TTS Engine for Mundari, one of India’s most vulnerable languages

Speech data

March 2023

The Challenge

A lost language echoes a lost culture, and it reflects the loss of invaluable knowledge. Of the estimated 7,111 known living languages in the world today, nearly half are in danger of extinction and are likely to disappear within this century. Many indigenous languages spoken in the tribal-dominated regions of India are endangered, including Mundari. According to the latest census, Mundari is still the primary language of over one million speakers across Bihar, Odisha, Jharkhand and West Bengal. Yet no public Automatic Speech Recognition (ASR) model exists for the language.

GIZ (Deutsche Gesellschaft für Internationale Zusammenarbeit) reached out to Karya Inc. to build the first public ASR model for Mundari and help revitalize the endangered Munda languages and the centuries-old cultures that speak through them.

The Solution

Partnering with language experts at IIT-Kharagpur and Microsoft Research, the Karya team reached out to Mundari speakers in 15 villages. We specifically looked for villagers who could speak both Mundari and Hindi. Our workers translated 60,000 Hindi sentences into Mundari, building the largest public dataset of Mundari sentences. Two speakers (a man and a woman) were then selected to record the entire corpus and help build the first text-to-speech (TTS) engine in Mundari. The entire project was conducted remotely.