Building and annotating a corpus of over 1 million images of English newspapers for Microsoft India

Image data

November 2021

The Challenge

Microsoft India wanted to collect 1 million images of English newspaper pages and textbooks to improve its OCR technology. To ensure diversity within the dataset, every participant was asked to submit pages from as many different books and newspapers as possible.

The Solution

Karya workers are able to work in English. For textbooks, our team partnered with local libraries, and our workers (based in Kolkata and three surrounding villages) submitted the 1 million images in record time. Our client was delighted. With their permission, we are sharing a message from the client below.

We were then asked to replicate this process for newspapers. Local libraries rarely stock newspapers, and it proved difficult to find homes that kept a large collection of them. Instead, our team partnered with hundreds of local 'kabadiwallas' (junk dealers), giving them a way to supplement their daily incomes while helping us with our data collection exercise.

Once the images were collected, our workers used Karya's Android app to annotate key parts of each image, such as the headline, the byline, and any image inserts.
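As an illustration of what one annotated page might look like as data, here is a minimal sketch in Python. It is not Karya's actual format: the class names, the field names, the pixel-box representation, and the label set (limited to the three examples named above) are all assumptions made for this example.

```python
from dataclasses import dataclass, field

# Hypothetical label set: only the three examples named in the text.
# The real annotation scheme may include more region types.
LABELS = {"headline", "byline", "image_insert"}


@dataclass
class Region:
    """One labelled rectangle on a page image, in pixel coordinates (assumed)."""
    label: str
    x: int
    y: int
    width: int
    height: int

    def __post_init__(self):
        # Reject labels outside the assumed set and degenerate boxes.
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label}")
        if self.width <= 0 or self.height <= 0:
            raise ValueError("region must have positive width and height")


@dataclass
class PageAnnotation:
    """All labelled regions for a single collected image."""
    image_id: str
    regions: list = field(default_factory=list)

    def add(self, label, x, y, width, height):
        self.regions.append(Region(label, x, y, width, height))

    def labels(self):
        # Distinct labels present on this page, in alphabetical order.
        return sorted({r.label for r in self.regions})


# Example: a page with a headline and a byline marked.
page = PageAnnotation(image_id="page-0001")
page.add("headline", 40, 30, 900, 120)
page.add("byline", 40, 160, 300, 40)
```

Validating each region as it is created keeps malformed boxes and misspelled labels out of the corpus at the point of entry, which matters when many workers contribute annotations independently.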