From Concept to Completion: Building a Nationwide Speech Data Repository

“Think big, start small, then scale or fail fast.” These words by Mats Lederhausen encapsulate our journey in establishing a nationwide natural language speech data repository in India. When I was air-dropped into this project as the Program Lead, there was no team and there were no processes, just a bold mission: to curate 150,000 hours of open-source natural language data from 750 districts across India, breaking linguistic barriers and empowering Large Language Models. Here’s an account of how we laid the foundation for a scalable program to make this vision a reality.

First, we defined success criteria for the pilot phase. Rather than relying on assumptions about the linguistic landscape, we opted for a flexible approach that embraced the diversity of languages and dialects across India. Our goal was to capture natural speech, preserving nuances like tone and local dialect. We brainstormed multiple approaches. Could we ask locals to sing a folk song? No: singing wouldn’t capture the natural tone of speech we were aiming for. Asking individuals to read a pre-written text wouldn’t serve our purpose either. Our goal wasn’t to capture rehearsed voices but to preserve the natural nuances of speech, and we didn’t want to exclude people who couldn’t read in a given language. We therefore rejected scripted prompts in favor of a more organic approach: collecting spoken responses to images. We identified 200 categories of images commonly found in neighborhoods, ranging from libraries to markets. The images fell into two groups: generic images representing locales across the country, and specific images unique to each neighborhood. The concept was simple: individuals would look at an image and respond naturally to what they saw, enabling us to capture authentic speech from any native speaker.

We engaged two seasoned vendors specializing in on-the-ground data collection and recruited a skilled machine learning engineer. Tasked with developing models for data processing, conducting rigorous checks, and implementing signal processing techniques, our machine learning engineer ensured the integrity of the collected data. Yet a major challenge remained: how could we validate the authenticity of the speech data obtained from our vendors at the project’s outset? The solution was to create an in-house data validation team to check at least 30% of the collected data.
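As a rough illustration of the 30% check, here is a minimal sketch of how recordings might be drawn for manual review. It assumes each recording is a record with a `district` field and that sampling is done per district so every district is covered; the function name, the field names, and the per-district strategy are all hypothetical and not taken from the project itself.

```python
import math
import random
from collections import defaultdict


def sample_for_validation(clips, fraction=0.30, seed=0):
    """Select at least `fraction` of the clips from each district for review.

    `clips` is a list of dicts with a "district" key (an assumed schema).
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible

    # Group clips by district so each district is sampled separately.
    by_district = defaultdict(list)
    for clip in clips:
        by_district[clip["district"]].append(clip)

    selected = []
    for district, items in sorted(by_district.items()):
        # Round up so the share never drops below the requested fraction.
        k = math.ceil(fraction * len(items))
        selected.extend(rng.sample(items, k))
    return selected


# Hypothetical data: 10 clips from district "A" and 15 from district "B".
clips = [{"id": i, "district": "A" if i < 10 else "B"} for i in range(25)]
picked = sample_for_validation(clips)
```

Rounding up per district means the overall share reviewed can sit slightly above 30%, which matches the "at least" wording of the target.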

The next hurdle was sourcing native speakers from 80 districts to form this team, a task requiring strategic outreach and community engagement. Relying on network effects, we reached out to educators in local schools and used social media platforms such as Facebook groups. Our pitch highlighted the importance of preserving local languages and the potential for participants to earn additional income by joining our data validation team. Recognizing the need to expand our program team, we hired ten program associates and three program managers. These team members were trained to engage with native speakers and to equip them with the skills needed to validate our collected data effectively.

We finalized seven critical questions for our data validation team to answer. The initial phase began with collecting responses from data validators across 80 districts via Google Sheets. This Google Sheets interface served as our project’s minimum viable product: it allowed us to start the validation process promptly and gather valuable user feedback on its effectiveness. We quickly realized that many participants found the interface intimidating, which led us to pivot to a simpler, more accessible solution.

In addressing this, we had to keep in mind the technological literacy of our participants, most of whom came from remote villages in India. Their use of technology was largely limited to basic mobile applications, such as messaging services. Consequently, we developed an intuitive WhatsApp chatbot tailored to their needs, fostering greater engagement and smoother interactions throughout the project. This adaptation was pivotal in bridging the digital divide and enabling participation from these remote communities.

The chatbot was then integrated with a system that not only displayed responses to the program management team but also automatically tallied the submissions from each data validator. This integration streamlined operations by 40%, eliminating manual data entry and sparing the program management team the laborious task of summarizing inputs by hand. Simplified backend checks made oversight smoother and the process more efficient. Thus, we laid a robust foundation for this nationwide project, culminating in the open-sourcing of 10,000 hours of speech data.
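The automatic tallying described above can be pictured with a small sketch. It assumes each chatbot submission arrives as a record carrying a `validator_id`; that field name, the sample identifiers, and the function itself are illustrative assumptions, not the project's actual backend.

```python
from collections import Counter


def tally_submissions(submissions):
    """Count how many responses each data validator has submitted.

    `submissions` is a list of dicts with a "validator_id" key (assumed schema).
    """
    return Counter(s["validator_id"] for s in submissions)


# Hypothetical submissions received from the chatbot.
submissions = [
    {"validator_id": "v001", "clip_id": "c1"},
    {"validator_id": "v002", "clip_id": "c2"},
    {"validator_id": "v001", "clip_id": "c3"},
]
counts = tally_submissions(submissions)
```

A per-validator count like this is the kind of summary that would otherwise have to be compiled by hand from a shared spreadsheet.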

To view the latest progress of this project, please visit: https://vaani.iisc.ac.in/
