Deep multiple instance learning for foreground speech localization in ambient audio from wearable devices
Authors
Alexander F. Danvers
Rajat Hebbar
Matthias R. Mehl
Suzanne A. Moseley
Shrikanth Narayanan
Pavlos Papadopoulos
Angelina J. Polsinelli
Ramon Reyes
David A. Sbarra
Publication date
1 January 2021
Publisher
Springer Science and Business Media LLC
Abstract
In recent years, machine learning techniques have produced state-of-the-art results on several audio-related tasks. The success of these approaches is largely due to access to large open-source datasets and increased computational resources. However, these methods often fail to generalize well to tasks from real-life scenarios because of domain mismatch. One such task is foreground speech detection from wearable audio devices. Interfering factors such as dynamically varying environmental conditions, including background speakers, TV, or radio audio, make foreground speech detection a challenging task. Moreover, obtaining precise moment-to-moment annotations of audio streams for analysis and model training is time-consuming and costly. In this work, we use multiple instance learning (MIL) to facilitate the development of such models using annotations available at a lower time resolution (coarsely labeled). We show how MIL can be applied to localize foreground speech in coarsely labeled audio and report both bag-level and instance-level results. We also study different pooling methods and how they can be adapted to the densely distributed events observed in our application. Finally, we show improvements from using speech activity detection embeddings as features for foreground detection. © 2021, The Author(s).
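The abstract outlines the general MIL recipe (scoring short audio segments as instances, pooling them into a bag-level prediction trained against coarse clip-level labels) without specifying the authors' architecture. The sketch below is a rough, illustrative PyTorch rendering of that recipe only; the feature dimension, the small scoring MLP, and the particular pooling options are assumptions for illustration, not the paper's model.

```python
# Minimal MIL sketch (illustrative only -- not the paper's model).
# Each "bag" is a coarsely labeled audio clip split into short segments
# (instances); only the bag-level label is available at training time.
import torch
import torch.nn as nn

class MILForegroundDetector(nn.Module):
    def __init__(self, feature_dim=128, pooling="mean"):
        super().__init__()
        # Hypothetical per-segment scorer over precomputed features
        # (e.g. speech activity detection embeddings).
        self.scorer = nn.Sequential(
            nn.Linear(feature_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )
        self.pooling = pooling

    def forward(self, bag):
        # bag: (num_segments, feature_dim) per-segment feature vectors
        instance_scores = torch.sigmoid(self.scorer(bag)).squeeze(-1)
        if self.pooling == "max":        # event present if any segment fires
            bag_score = instance_scores.max()
        elif self.pooling == "mean":     # softer pooling, suited to densely distributed events
            bag_score = instance_scores.mean()
        else:                            # softmax-weighted (attention-like) pooling
            weights = torch.softmax(instance_scores, dim=0)
            bag_score = (weights * instance_scores).sum()
        # bag_score trains against the coarse label; instance_scores localize speech
        return bag_score, instance_scores

# Usage: train the bag score against the coarse (clip-level) label,
# then threshold instance scores at test time to localize foreground speech.
model = MILForegroundDetector(feature_dim=128, pooling="mean")
bag = torch.randn(30, 128)               # e.g. 30 one-second segments
bag_score, instance_scores = model(bag)
loss = nn.functional.binary_cross_entropy(bag_score, torch.tensor(1.0))
```

The choice of pooling operator is the main design lever the abstract highlights: max pooling suits rare events, while mean or softmax-weighted pooling spreads gradient across many segments, which matters when foreground speech is densely distributed within a clip.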
Available Versions
The University of Arizona (sustaining member)
oai:repository.arizona.edu:101...
Last updated on 20/03/2021
IUPUIScholarWorks
oai:scholarworks.iupui.edu:180...
Last updated on 19/05/2022