2,951 research outputs found
Using multiple visual tandem streams in audio-visual speech recognition
The method called the "tandem approach" in speech recognition has been shown to increase performance by using classifier posterior probabilities as observations in a hidden Markov model. We study the effect of using visual tandem features in audio-visual speech recognition with a novel setup that uses multiple classifiers to obtain multiple visual tandem features. We adopt the multi-stream hidden Markov model approach, in which visual tandem features from two different classifiers are treated as additional streams in the model. Our experiments show that using multiple visual tandem features improves recognition accuracy in various noise conditions. In addition, to handle asynchrony between audio and visual observations, we employ coupled hidden Markov models and obtain improved performance compared to the synchronous model.
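As a rough illustration of the tandem idea described above (not the paper's actual pipeline), the sketch below builds observation vectors by appending per-frame classifier posteriors, from two hypothetical visual classifiers, to acoustic features; all dimensions and the random "logits" are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    # Convert raw classifier scores into posterior probabilities per frame.
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy data: T frames of acoustic features and logits from two visual classifiers.
T, D, C = 5, 13, 4                       # frames, acoustic dim, number of classes
acoustic = rng.normal(size=(T, D))
visual_logits_a = rng.normal(size=(T, C))
visual_logits_b = rng.normal(size=(T, C))

# Tandem features: each classifier's posteriors become an extra observation stream,
# concatenated with the acoustic features before HMM training/decoding.
stream_a = softmax(visual_logits_a)
stream_b = softmax(visual_logits_b)
observations = np.concatenate([acoustic, stream_a, stream_b], axis=1)

print(observations.shape)  # (5, 21): 13 acoustic + 4 + 4 posterior dims
```

In a multi-stream HMM the streams would typically keep separate emission models with stream weights rather than being flatly concatenated; the concatenation here is only to show what the combined observation looks like.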
Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
Examining the Difference Between Asynchronous and Synchronous Training
For my project, I chose to do a thesis so that it would better help me in the future in case I wanted to pursue a PhD. My thesis so far has been to develop software that helps POD sites train their volunteers more effectively in case of an emergency. We have already collected some data for our research from a test POD site that was constructed. We recorded the amount of time it took each volunteer to get an individual actor through the line, depending on whether they learned from a teacher or from my software. The data helped demonstrate how beneficial teaching via software could be: no information was missing, and there was a greater retention rate. Currently I work at Lowes as a customer service administrator, mainly so I can interact with customers every day and better understand how to communicate the information my software would provide. The general area my research has taken so far is emergency preparedness, and I would like to continue in this direction until other opportunities arise.
uC: Ubiquitous Collaboration Platform for Multimodal Team Interaction Support
A human-centered computing platform that improves teamwork and transforms the "human-computer interaction experience" for distributed teams is presented. This Ubiquitous Collaboration, or uC ("you see"), platform's objective is to transform distributed teamwork (i.e., work occurring when teams of workers and learners are geographically dispersed and often interacting at different times). It achieves this goal through a multimodal team interaction interface realized through a reconfigurable open architecture. The approach taken is to integrate: (1) an intuitive speech- and video-centric multi-modal interface to augment more conventional methods (e.g., mouse, stylus and touch), (2) an open and reconfigurable architecture supporting information gathering, and (3) a machine intelligent approach to analysis and management of heterogeneous live and stored sensor data to support collaboration. The system will transform how teams of people interact with computers by drawing on both the virtual and physical environment.
Multi-Scale Attention for Audio Question Answering
Audio question answering (AQA), a widely used proxy task for exploring scene understanding, has gained increasing attention. AQA is challenging because it requires comprehensive temporal reasoning over events at different scales in an audio scene. However, existing methods mostly extend structures from the visual question answering task to audio in a simple pattern and may not perform well when perceiving a fine-grained audio scene. To this end, we present a Multi-scale Window Attention Fusion Model (MWAFM) consisting of an asynchronous hybrid attention module and a multi-scale window attention module. The former is designed to aggregate unimodal and cross-modal temporal contexts, while the latter captures sound events of varying lengths and their temporal dependencies for a more comprehensive understanding. Extensive experiments demonstrate that the proposed MWAFM can effectively exploit temporal information to facilitate AQA in fine-grained scenes.
Code: https://github.com/GeWu-Lab/MWAFM
Comment: Accepted by InterSpeech 202
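To make the multi-scale window idea concrete, here is a minimal sketch (not the MWAFM implementation) of self-attention restricted to non-overlapping windows of several sizes, with the per-scale outputs fused by simple averaging in place of a learned fusion; the window sizes and feature dimensions are illustrative assumptions.

```python
import numpy as np

def window_attention(x, win):
    # Self-attention computed only within non-overlapping windows of size `win`,
    # so each frame attends to temporally nearby frames at that scale.
    T, d = x.shape
    out = np.zeros_like(x)
    for s in range(0, T, win):
        w = x[s:s + win]
        scores = w @ w.T / np.sqrt(d)
        attn = np.exp(scores - scores.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)
        out[s:s + win] = attn @ w
    return out

def multi_scale_attention(x, wins=(2, 4, 8)):
    # Fuse outputs from several window sizes by averaging; a learned fusion
    # (as in the paper) would replace this mean.
    return np.mean([window_attention(x, w) for w in wins], axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))   # 8 audio frames, 16-dim features
y = multi_scale_attention(x)
print(y.shape)  # (8, 16)
```

Small windows emphasize short sound events, while large windows capture longer dependencies; combining the scales is what lets the model reason over events of varying length.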