Research on the utilization of pattern recognition techniques to identify and classify objects in video data Technical progress report, 31 Jan. - 31 May 1967
Pattern recognition techniques for extracting information from video data and for reducing the amount of data needed to convey this information - decision mechanisms and property filter
Studies on noise robust automatic speech recognition
Noise in everyday acoustic environments such as cars, traffic environments, and cafeterias remains one of the main challenges in automatic speech recognition (ASR). As a research theme, it has received wide attention in conferences and scientific journals focused on speech technology. This article collection reviews both the classic and novel approaches suggested for noise robust ASR. The articles are literature reviews written for the spring 2009 seminar course on noise robust automatic speech recognition (course code T-61.6060) held at TKK
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes
Audio-visual speaker separation
Communication using speech is often an audio-visual experience. Listeners hear what is being uttered by speakers and also see the corresponding facial movements and other gestures. This thesis is an attempt to exploit this bimodal (audio-visual) nature of speech for speaker separation. In addition to the audio speech features, visual speech features are used to achieve the task of speaker separation. An analysis of the correlation between audio and visual speech features is carried out first. This correlation is then used to estimate clean audio features from visual features using Gaussian Mixture Models (GMMs) and Maximum a Posteriori (MAP) estimation. Three methods of speaker separation are proposed that use the estimated clean audio features. Firstly, the estimated clean audio features are used to construct a Wiener filter to separate the mixed speech at various signal-to-noise ratios (SNRs) into target and competing speakers. The Wiener filter gains are modified in several ways in search of improvements in the quality and intelligibility of the extracted speech. Secondly, the estimated clean audio features are used to develop a visually-derived binary masking method for speaker separation. The estimated audio features are used to compute time-frequency binary masks that identify the regions where the target speaker dominates. These regions are retained and form the estimate of the target speaker’s speech. Experimental results compare the visually-derived binary masks with ideal binary masks and show a useful level of accuracy. The effectiveness of the visually-derived binary mask for speaker separation is then evaluated through estimates of speech quality and speech intelligibility, which show substantial gains over the original mixture. Thirdly, the estimated clean audio features and the visually-derived Wiener filtering are used to modify the operation of an effective audio-only method of speaker separation, namely the soft mask method, to allow visual speech information to improve the separation task. Experimental results are presented that compare the proposed audio-visual speaker separation with the audio-only method using both speech quality and intelligibility metrics. Finally, a detailed comparison is made of the proposed and existing methods of speaker separation using objective and subjective measures
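
The two masking ideas named in the abstract above, a time-frequency binary mask and a Wiener-style gain, can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the 0 dB threshold, and the toy power values are assumptions made for illustration. In the thesis the clean audio features are estimated from visual features, whereas here the target and competing-speaker powers are simply given, which corresponds to the ideal-mask case.

```python
import numpy as np

EPS = 1e-12  # small constant to avoid division by zero and log of zero


def binary_mask(target_power, interferer_power, threshold_db=0.0):
    """Return 1.0 in time-frequency cells where the target dominates, else 0.0.

    threshold_db is a hypothetical parameter: the local target-to-interferer
    ratio (in dB) a cell must exceed to be kept.
    """
    local_snr_db = 10.0 * np.log10((target_power + EPS) / (interferer_power + EPS))
    return (local_snr_db > threshold_db).astype(float)


def wiener_gain(target_power, interferer_power):
    """Wiener-style gain per cell: target power over total power, in [0, 1]."""
    return target_power / (target_power + interferer_power + EPS)


# Toy power spectrograms (frequency bins x time frames); the values are made up.
target = np.array([[4.0, 0.1], [0.5, 9.0]])
interferer = np.array([[1.0, 2.0], [0.5, 1.0]])

mask = binary_mask(target, interferer)
gain = wiener_gain(target, interferer)

# Simplification: treat the mixture power as the sum of the two source powers.
mixture_power = target + interferer
masked_estimate = mask * mixture_power   # keep only target-dominated cells
wiener_estimate = gain * mixture_power   # soft weighting of every cell
```

The binary mask makes a hard keep-or-discard decision per cell, while the Wiener-style gain scales each cell by the estimated fraction of target power; cells where the two speakers are equally strong are discarded by the mask at a 0 dB threshold but kept at half weight by the gain.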