Automatic annotation of voice forum content for rural users and evaluation of relevance
Abstract
Voice forums are an effective intervention medium through which marginalized communities can access information in a structured and localized manner. Users actively contribute by posting questions and responses as audio messages, thereby enriching the voice forum content. Building an audio library from voice forums to disseminate information requires significant manual effort in analyzing and curating the data. This is one of the key impediments to the successful use of voice forums for knowledge dissemination and training. In this paper, we explore the effectiveness of automated approaches for analyzing and curating voice forum content in Hindi, a language native to northern India. We study the use of standard techniques such as topic modeling and extractive summarization on Hindi speech transcripts (with a word error rate, WER, of 67%) to cluster audios thematically and to create summaries of individual audios, respectively. These curated audios are used to build an IVR-based library for community health workers in rural India. We evaluated the relevance of, and preference for, the automated annotations in a field trial. We find that perceived relevance differed between human and automatically generated annotations, but the automatically generated summaries were still found useful for accessing the voice forum audios.