Your AI solutions are only as good as the data you feed it! We live in a world that is being reshaped by AI every passing minute. But the scary part is that the data powering these AI models do not represent the whole population. Many segments of society are underrepresented. When I set out to build the largest dataset of Indian dialects for Project Vaani, I wanted to ensure this dataset was inclusive and enabled the documentation of languages that are dying and the voices of the minority.
Setting the Right Foundation
We created a foundational document for data collection agencies to ensure the project’s success. The contract rewarded inclusivity and accuracy, not just scale. Vendors were required to:
- Ensure gender parity
- Collect a diverse range of voices from all age groups
- Capture tribal languages
- Approach each district without pre-existing biases about the number of languages spoken
In a country with at least 19,500 dialects and 121 languages, we didn’t want to assume there were only a certain number of languages and dialects. Our district-focused approach ensured comprehensive coverage.
Early Challenges
Gender Imbalance
Despite our efforts, an early batch of data had over 85% male voices. We had clearly incentivized the collection of data from all genders, so what went wrong? Upon investigation, we discovered that since most data collectors were men, women were uncomfortable engaging with them. To address this, we started identifying and training women data collectors. Gradually, we saw an increase in the number of women contributing their voices to this repository. Today, the dataset contains 56% female audio.
Safety Concerns
Our objective was to reach every district in the Indian subcontinent, including some of the most remote areas. However, we encountered unexpected resistance in certain regions, where our data collectors faced serious threats for their efforts to engage with local communities. While we anticipated difficulties in these areas, the safety of our team has always been our top priority. We promptly took action to protect the safety of the data collectors and have since adopted alternative methods to continue the project without compromising anyone’s safety.
Here are some approaches that I adopted to implement the project in high-risk areas:
- Partnered with local organizations, and NGOs that have established trust and relationships within the area. They provided valuable insights into local dynamics and helped navigate sensitive situations.
- Conducted thorough risk assessments to identify potential threats and vulnerabilities. This includes understanding the motivations and actions of local terror groups.
- Developed a comprehensive security plan in collaboration with security experts and local authorities. This plan included protocols for data collectors, communication channels, and emergency procedures.
- Implemented techniques for anonymous data collection to protect the identities of participants and data collectors.
- Utilized remote data collection methods where possible, such as mobile apps or phone interviews, to minimize physical presence in high-risk areas.
- Trained local community members to assist with data collection, reducing the need for external personnel.
- Maintained flexibility in project timelines and methodologies to respond to changing security situations.
By tackling these problems head-on during the initial phase of the project, we established a solid foundation to scale and curate an all-inclusive natural language dataset.
Key Learnings:
Here are the key learnings from this experience:
- Begin with Inclusivity: Start thinking about inclusivity from day one. Identify gaps early and improve continuously.
- Diverse Participation: Employ individuals from diverse backgrounds at every level of data curation.
- Inclusive Collection Process: Design a data collection process that welcomes people from all walks of life.
- User-Friendly Tools: Use tools that are simple and intuitive for everyone to ensure accessibility.
To follow the latest progress on the project, visit Vaani Project.