Collecting and annotating a corpus of 500,000 text sentences in 5 Indian languages for the Bill and Melinda Gates Foundation

Text data

March 2023

The Challenge

Karya won a USD 2 million grant from the Gates Foundation to collect and annotate over 500,000 sentences in Hindi, Telugu, Malayalam, Marathi, and Bengali. The primary goal of this research project is to explore whether digital microwork through women-centric user communities can help identify and mitigate gender biases in AI technologies. Natural language understanding (NLU) models are core to many language technologies. Existing NLU models carry gender biases, and these biases can have real-world social and economic consequences for women users.

The Solution

To reduce this gender bias, Karya 1) collected over 500,000 text sentences directly from 30,000 women in rural India, sentences that better capture their language, syntax, rhetoric, and priorities, and 2) built a counteractive corpus that actively reverses gender-stereotype biases in existing models. We hired low-income women as data collectors to increase gender inclusivity and project impact. Our workers annotated the corpora and identified bias, marking the specific parts of sentences that carried a gender bias.