2,951 research outputs found

    Using multiple visual tandem streams in audio-visual speech recognition

    The method known as the "tandem approach" in speech recognition has been shown to increase performance by using classifier posterior probabilities as observations in a hidden Markov model. We study the effect of using visual tandem features in audio-visual speech recognition with a novel setup that uses multiple classifiers to obtain multiple visual tandem features. We adopt the multi-stream hidden Markov model approach, in which visual tandem features from two different classifiers are treated as additional streams in the model. Our experiments show that using multiple visual tandem features improves recognition accuracy under various noise conditions. In addition, to handle asynchrony between audio and visual observations, we employ coupled hidden Markov models and obtain improved performance compared to the synchronous model.
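    As a rough illustration of the ideas in this abstract (not the authors' implementation; the feature names, PCA projection, and stream weights below are assumptions), the sketch logs classifier posteriors to form a compact visual tandem stream and scores a multi-stream HMM state as a weighted sum of per-stream log-likelihoods.

```python
# Minimal sketch: tandem features from classifier posteriors, plus
# multi-stream combination of log-likelihoods with per-stream weights.
import numpy as np

def tandem_features(posteriors, pca_matrix, eps=1e-10):
    """Log the classifier posteriors and project them to a lower-dimensional stream."""
    log_post = np.log(posteriors + eps)   # posterior probabilities -> log domain
    return log_post @ pca_matrix          # decorrelate / reduce dimensionality

def multistream_log_likelihood(stream_loglikes, stream_weights):
    """Multi-stream HMM state score: weighted sum of per-stream log-likelihoods,
    i.e. the log of the product of stream likelihoods raised to their weights."""
    return sum(w * ll for ll, w in zip(stream_loglikes, stream_weights))

# Example: one audio stream plus two visual tandem streams from different classifiers.
audio_ll, visual1_ll, visual2_ll = -42.0, -55.3, -51.7
score = multistream_log_likelihood([audio_ll, visual1_ll, visual2_ll],
                                   stream_weights=[0.6, 0.2, 0.2])
```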

    Articulatory features for robust visual speech recognition

    Full text link

    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
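    To make the ASR-plus-IR combination concrete, here is a minimal sketch of an SCR-style pipeline under assumed inputs (the transcripts are hypothetical and stand in for the output of a real ASR engine; a production system would use a full IR stack).

```python
# Minimal sketch: index ASR transcripts with an inverted index,
# then retrieve spoken documents by boolean AND over query terms.
from collections import defaultdict

transcripts = {  # doc_id -> ASR transcript (assumed already recognized)
    "ep01": "speech retrieval combines recognition and indexing",
    "ep02": "conversational speech is harder to recognize",
}

inverted_index = defaultdict(set)
for doc_id, text in transcripts.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

def search(query):
    """Return documents containing every query term."""
    hits = [inverted_index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*hits) if hits else set()

print(search("speech retrieval"))  # -> {'ep01'}
```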

    Examining the Difference Between Asynchronous and Synchronous Training

    For my project, I chose to write a thesis so that it would better prepare me for the future in case I want to pursue a PhD. My thesis so far has been to develop software that helps POD sites train their volunteers for an emergency. We have already collected data for our research from a test POD site that was constructed, recording how long it took each volunteer to get an individual actor through the line depending on whether they were trained by a teacher or by my software. The data suggested how beneficial teaching via software could be, since no information was missing and retention was higher. I currently work at Lowes as a customer service administrator, mostly so that I interact with customers every day and better understand how to communicate the kind of information I would present in my software. The general area my research has taken so far is emergency preparedness, and I would like to continue heading in this direction until other opportunities arise.

    Overcoming asynchrony in Audio-Visual Speech Recognition

    Full text link

    Multi-Scale Attention for Audio Question Answering

    Audio question answering (AQA), widely used as a proxy task for exploring scene understanding, has received growing attention. AQA is challenging because it requires comprehensive temporal reasoning over events at different scales in an audio scene. However, existing methods mostly extend structures from the visual question answering task to audio in a simple pattern and may not perform well when perceiving a fine-grained audio scene. To this end, we present a Multi-scale Window Attention Fusion Model (MWAFM) consisting of an asynchronous hybrid attention module and a multi-scale window attention module. The former is designed to aggregate unimodal and cross-modal temporal contexts, while the latter captures sound events of varying lengths and their temporal dependencies for a more comprehensive understanding. Extensive experiments demonstrate that the proposed MWAFM can effectively explore temporal information to facilitate AQA in fine-grained scenes. Code: https://github.com/GeWu-Lab/MWAFM. Comment: Accepted by InterSpeech 202
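    A rough sketch of the multi-scale window attention idea described above (not the released MWAFM code at the linked repository): self-attention is applied inside temporal windows of several lengths and the per-scale outputs are fused. The window sizes, dimensions, and the simple averaging fusion are illustrative assumptions.

```python
# Sketch: attention restricted to fixed-length temporal windows at several scales,
# with the scale outputs fused by averaging.
import torch
import torch.nn as nn

class MultiScaleWindowAttention(nn.Module):
    def __init__(self, dim, num_heads=4, window_sizes=(4, 8, 16)):
        super().__init__()
        self.window_sizes = window_sizes
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                        # x: (batch, time, dim)
        b, t, d = x.shape
        outputs = []
        for w in self.window_sizes:
            pad = (-t) % w                       # pad so the sequence splits into whole windows
            xp = nn.functional.pad(x, (0, 0, 0, pad))
            win = xp.reshape(b * (xp.shape[1] // w), w, d)
            out, _ = self.attn(win, win, win)    # attention within each window only
            outputs.append(out.reshape(b, -1, d)[:, :t])
        return torch.stack(outputs).mean(dim=0)  # fuse scales (here: simple average)

feats = torch.randn(2, 50, 128)                  # e.g. audio frame embeddings
fused = MultiScaleWindowAttention(128)(feats)    # (2, 50, 128)
```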