Collecting and annotating a corpus of 500,000 text sentences in 5 Indian languages for the Bill and Melinda Gates Foundation

The Challenge

Karya won a USD 2 million grant from the Gates Foundation to collect and annotate over 500,000 sentences in Hindi, Telugu, Malayalam, Marathi and Bengali. The primary goal of this research project is to explore whether digital microwork through women-centric user communities can help identify and mitigate the gender biases that exist in AI technologies. Natural language understanding (NLU) models are core to many language technologies, but existing NLU models carry gender biases, and these biases can have real-world social and economic consequences for women users.

The Solution

To reduce this gender bias, Karya 1) collected over 500,000 text sentences directly from 30,000 women in rural India, sentences that better capture their language, syntax, rhetoric, and priorities, and 2) built a counteractive corpus that actively reverses gender-stereotype biases in existing models. We hired low-income women as data collectors to increase gender inclusivity and project impact. Our workers annotated the corpora to identify bias, marking the specific part of each sentence that carried a gender bias.
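As a rough illustration of what marking "the specific part" of a sentence could look like as data, here is a minimal Python sketch. It is not Karya's actual schema or tooling: the class names, fields, and the English example sentence are all hypothetical, chosen only to show a span-level annotation stored as character offsets.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BiasSpan:
    """A flagged stretch of text, stored as character offsets (hypothetical format)."""
    start: int  # inclusive offset into the sentence
    end: int    # exclusive offset into the sentence


@dataclass
class AnnotatedSentence:
    """One collected sentence plus any spans a worker marked as gender-biased."""
    text: str
    language: str
    spans: List[BiasSpan] = field(default_factory=list)

    def mark(self, phrase: str) -> None:
        """Record the first occurrence of `phrase` as a biased span."""
        start = self.text.find(phrase)
        if start == -1:
            raise ValueError(f"phrase not found in sentence: {phrase!r}")
        self.spans.append(BiasSpan(start, start + len(phrase)))

    def flagged_text(self) -> List[str]:
        """Return the marked substrings, in the order they were added."""
        return [self.text[s.start:s.end] for s in self.spans]


# Invented English placeholder; the real corpus is in Hindi, Telugu,
# Malayalam, Marathi and Bengali.
example = AnnotatedSentence(text="A good wife stays at home.", language="en")
example.mark("stays at home")
print(example.flagged_text())
```

Storing offsets rather than copies of the flagged words keeps each annotation tied to an exact position in the sentence, which matters when the same phrase appears more than once.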

Related

Building the largest annotated text dataset in Odia for the healthcare, banking and agriculture domains

Conversational Data and Call Center Data Collection (2400 hours)