Temporally Aligning Long Audio Interviews with Questions: A Case Study in Multimodal Data Integration
The problem of audio-to-text alignment has seen a significant amount of
research using complete supervision during training. However, this is typically
not in the context of long audio recordings wherein the text being queried does
not appear verbatim within the audio file. This work is a collaboration with a
non-governmental organization called CARE India that collects long audio health
surveys from young mothers residing in rural parts of Bihar, India. Given a
question drawn from a questionnaire that is used to guide these surveys, we aim
to locate where the question is asked within a long audio recording. This is of
great value to African and Asian organizations that would otherwise have to
painstakingly go through long and noisy audio recordings to locate questions
(and answers) of interest. Our proposed framework, INDENT, uses a
cross-attention-based model and prior information on the temporal ordering of
sentences to learn speech embeddings that capture the semantics of the
underlying spoken text. These learnt embeddings are used to retrieve the
corresponding audio segment based on text queries at inference time. We
empirically demonstrate that our model outperforms text-based heuristics (an
improvement in R-avg of about 3%). We also
show that noisy ASR output, generated using state-of-the-art ASR models for Indian
languages, yields better results when used in place of speech. INDENT, trained
only on Hindi data, is able to cater to all languages supported by the
(semantically) shared text space. We illustrate this empirically on 11 Indic
languages.
Comment: Work accepted in IJCAI-23, AI and Social Good Track
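
The retrieval step described in the abstract (using learnt embeddings to find the audio segment that matches a text query) can be illustrated with a minimal sketch. This is not INDENT's implementation: the function names `rank_segments` and `recall_at_k` are hypothetical, the embeddings are toy vectors, and the use of cosine similarity is an assumption, since the abstract does not state which similarity measure is used. Recall at k is shown as one common retrieval metric; the abstract does not define how R-avg is computed.

```python
import numpy as np

def rank_segments(query_emb, segment_embs):
    """Return segment indices sorted by decreasing cosine similarity to the query.

    Assumption: query and segments live in a shared embedding space and are
    compared with cosine similarity (not confirmed by the source).
    """
    q = np.asarray(query_emb, dtype=float)
    s = np.asarray(segment_embs, dtype=float)
    q = q / np.linalg.norm(q)
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    sims = s @ q  # one cosine similarity per audio segment
    return np.argsort(-sims, kind="stable")

def recall_at_k(rankings, gold, k):
    """Fraction of queries whose correct segment appears in the top k results."""
    hits = [g in list(r[:k]) for r, g in zip(rankings, gold)]
    return float(np.mean(hits))

# Toy example: three audio segments and two text queries in a 2-D space.
segments = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.2, 0.8]]
gold = [0, 1]  # index of the segment where each question is actually asked

rankings = [rank_segments(q, segments) for q in queries]
print(recall_at_k(rankings, gold, k=1))
```

In a real setting the segment embeddings would come from the speech (or ASR transcript) encoder and the query embeddings from the text encoder; here both are hand-written placeholders.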