Geometrical-based lip-reading using template probabilistic multi-dimension dynamic time warping
By identifying lip movements and characterizing their associations with speech sounds, the performance of speech recognition systems can be improved, particularly when operating in noisy environments. In this paper, we present a geometrical-based automatic lip-reading system that extracts the lip region from images using conventional techniques, while the contour itself is extracted using a novel combination of border following and convex hull approaches. Classification is carried out using an enhanced dynamic time warping technique that can operate in multiple dimensions, together with a template probability technique that compensates for differences in the way words are uttered in the training set. The performance of the new system has been assessed on recognition of the English digits 0 to 9 as available in the CUAVE database. The experimental results obtained with the new approach compared favorably with those of existing lip-reading approaches, achieving a word recognition accuracy of up to 71% with the visual information being obtained from estimates of lip height, width and their ratio.
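As a rough illustration of the classification step described above, the following is a minimal sketch of dynamic time warping over multi-dimensional feature vectors (for example height, width and ratio per frame), followed by nearest-template classification. It is not the paper's method: the template probability technique is not reproduced, and the function names, the Euclidean local distance and the nearest-template rule are all assumptions made for illustration.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Accumulated DTW cost between two sequences of feature vectors.

    Each element is a tuple such as (height, width, ratio); the local
    distance is Euclidean (an assumption, not taken from the paper).
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three admissible predecessor paths.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def classify(query, templates):
    """Return the label whose stored template is closest to the query.

    templates maps a label (e.g. a digit) to a list of example sequences.
    """
    best_label, best_cost = None, float("inf")
    for label, examples in templates.items():
        for example in examples:
            c = dtw_distance(query, example)
            if c < best_cost:
                best_label, best_cost = label, c
    return best_label
```

Because the warping path may repeat frames, an utterance spoken more slowly than its template can still align with zero cost, which is the property that makes DTW attractive for words uttered at different rates.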
LIP-READING VIA DEEP NEURAL NETWORKS USING HYBRID VISUAL FEATURES
Lip-reading is the visual interpretation of a speaker's lip movements during speech. Experiments over many years have revealed that speech intelligibility increases if visual facial information becomes available. This effect becomes more apparent in noisy environments. Automating this process raises several challenges, such as the coarticulation phenomenon, the choice of visual unit type, feature diversity and the inter-speaker dependency of the features. While efforts have been made to overcome these challenges, a flawless lip-reading system remains under investigation. This paper seeks a lip-reading model whose processing blocks are combined and arranged efficiently to extract highly discriminative visual features. In particular, it highlights the application of a properly structured Deep Belief Network (DBN)-based recognizer. Multi-speaker (MS) and speaker-independent (SI) tasks are performed on the CUAVE database, and phone recognition rates (PRRs) of 77.65% and 73.40% are achieved, respectively. The best word recognition rates (WRRs) achieved in the MS and SI tasks are 80.25% and 76.91%, respectively. The resulting accuracies demonstrate that the proposed method outperforms the conventional Hidden Markov Model (HMM) and competes well with state-of-the-art visual speech recognition work.
A novel lip geometry approach for audio-visual speech recognition
By identifying lip movements and characterizing their associations with speech sounds, the performance of speech recognition systems can be improved, particularly when operating in noisy environments. Various methods have been studied by research groups around the world in recent years to incorporate lip movements into speech recognition; however, exactly how best to incorporate the additional visual information is still not known. This study aims to extend the knowledge of relationships between visual and speech information, specifically using lip geometry information because of its robustness to head rotation and the smaller number of features required to represent movement. A new method has been developed to extract lip geometry information, to perform classification and to integrate visual and speech modalities. This thesis makes several contributions. First, this work presents a new method to extract lip geometry features using the combination of a skin colour filter, a border following algorithm and a convex hull approach. The proposed method was found to improve lip shape extraction performance compared to existing approaches. Lip geometry features including height, width, ratio, area, perimeter and various combinations of these features were evaluated to determine which performs best when representing speech in the visual domain. Second, a novel template matching technique able to adapt to dynamic differences in the way words are uttered by speakers has been developed, which determines the best fit of an unseen feature signal to those stored in a database template. Third, following an evaluation of integration strategies, a novel method has been developed based on an alternative decision fusion strategy, in which the outcome from the visual and speech modality is chosen by measuring the quality of audio based on kurtosis and skewness analysis and driven by white noise confusion. Finally, the performance of the new methods introduced in this work is evaluated using the CUAVE and LUNA-V data corpora under a range of different signal-to-noise ratio conditions using the NOISEX-92 dataset.
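The decision fusion idea in the third contribution can be sketched as follows. This is only an illustration under stated assumptions: the thresholds, the function names and the rule itself (treat the audio as usable when its sample distribution is clearly non-Gaussian, on the premise that added white noise pulls skewness and excess kurtosis toward zero) are hypothetical and are not taken from the thesis.

```python
def skewness_and_kurtosis(samples):
    """Return (skewness, excess kurtosis) of a list of audio samples."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    if m2 == 0:
        # Constant signal: the moments are undefined, report neutral values.
        return 0.0, 0.0
    m3 = sum((x - mean) ** 3 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def choose_modality(samples, kurt_threshold=1.0, skew_threshold=0.5):
    """Pick which recognizer's decision to trust for this utterance.

    Both thresholds are hypothetical placeholders; in practice they
    would have to be tuned on held-out noisy data.
    """
    skew, kurt = skewness_and_kurtosis(samples)
    audio_looks_clean = kurt > kurt_threshold or abs(skew) > skew_threshold
    return "audio" if audio_looks_clean else "visual"
```

A hard switch like this keeps the two recognizers independent; a weighted combination of their scores would be a different design and is not what this sketch shows.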
A Multimodal Sensor Fusion Architecture for Audio-Visual Speech Recognition
A key requirement for developing any innovative system in a
computing environment is to integrate a sufficiently friendly
interface with the average end user. Accurate design of such a
user-centered interface, however, means more than just the
ergonomics of the panels and displays. It also requires that
designers precisely define what information to use and how, where,
and when to use it. Recent advances in user-centered design of
computing systems have suggested that multimodal integration can
provide different types and levels of intelligence to the user
interface. The work of this thesis aims at improving speech
recognition-based interfaces by making use of the visual modality
conveyed by the movements of the lips.
Designing a good visual front end is a major part of this framework.
For this purpose, this work derives the optical flow fields for
consecutive frames of people speaking. Independent Component
Analysis (ICA) is then used to derive basis flow fields. The
coefficients of these basis fields comprise the visual features of
interest. It is shown that using ICA on optical flow fields yields
better classification results than the traditional approaches based
on Principal Component Analysis (PCA). In fact, ICA can capture
higher order statistics that are needed to understand the motion of
the mouth. This is because lip movement is inherently complex, as it
involves large image velocities, self occlusion
(due to the appearance and disappearance of the teeth) and a lot of
non-rigidity.
Another issue that is of great interest to audio-visual speech
recognition systems designers is the integration (fusion) of the
audio and visual information into an automatic speech recognizer.
For this purpose, a reliability-driven sensor fusion scheme is
developed. A statistical approach is developed to account for the
dynamic changes in reliability. This is done in two steps. The first
step derives suitable statistical reliability measures for the
individual information streams. These measures are based on the
dispersion of the N-best hypotheses of the individual stream
classifiers. The second step finds an optimal mapping between the
reliability measures and the stream weights that maximizes the
conditional likelihood. For this purpose, genetic algorithms are
used.
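The two-step fusion scheme described above can be sketched roughly as follows. The dispersion formula (mean pairwise difference of the sorted N-best log-likelihoods) is one common choice and is an assumption here, and the logistic mapping from reliability to stream weight is a simple stand-in for the mapping that the thesis optimizes with genetic algorithms. All function and parameter names are hypothetical.

```python
import math


def nbest_dispersion(log_likelihoods):
    """Mean pairwise difference between N-best hypothesis scores.

    A large value means the best hypothesis stands out clearly, which
    is taken as a sign that the stream is reliable.
    """
    scores = sorted(log_likelihoods, reverse=True)
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two hypotheses")
    total = sum(
        scores[i] - scores[j] for i in range(n) for j in range(i + 1, n)
    )
    return 2.0 * total / (n * (n - 1))


def audio_weight(audio_disp, visual_disp, slope=1.0):
    """Map the two reliabilities to an audio stream weight in (0, 1).

    The logistic form and the slope are illustrative assumptions.
    """
    return 1.0 / (1.0 + math.exp(-slope * (audio_disp - visual_disp)))


def fuse(audio_scores, visual_scores, weight):
    """Weighted sum of per-word log-likelihoods from both streams."""
    return {
        word: weight * audio_scores[word]
        + (1.0 - weight) * visual_scores[word]
        for word in audio_scores
    }
```

For example, when the audio N-best list is nearly flat (low dispersion) and the visual list has a clear winner, the weight falls below 0.5 and the fused decision leans on the visual stream.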
The issues addressed are challenging problems that are central to
developing an audio-visual speech recognition framework that can
maximize the information gathered about the words uttered and minimize
the impact of noise.
A motion-based approach for audio-visual automatic speech recognition
The research work presented in this thesis introduces novel approaches for both visual
region of interest extraction and visual feature extraction for use in audio-visual
automatic speech recognition. In particular, the speaker's movement that occurs
during speech is used to isolate the mouth region in video sequences, and motion-based
features obtained from this region are used to provide new visual features for
audio-visual automatic speech recognition. The mouth region extraction approach
proposed in this work is shown to give superior performance compared with existing
colour-based lip segmentation methods. The new features are obtained from three
separate representations of motion in the region of interest, namely the difference in
luminance between successive images, block matching based motion vectors and
optical flow. The new visual features are found to improve visual-only and audio-visual
speech recognition performance when compared with the commonly-used
appearance feature-based methods.
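The simplest of the three motion representations mentioned above, the difference in luminance between successive images, can be sketched as follows. Frames are represented as nested lists of luminance values, and reducing each difference image to a single mean value is an illustrative assumption; the thesis does not specify its feature computation at this level of detail.

```python
def frame_difference(prev_frame, next_frame):
    """Per-pixel absolute luminance difference between two frames."""
    return [
        [abs(b - a) for a, b in zip(row_prev, row_next)]
        for row_prev, row_next in zip(prev_frame, next_frame)
    ]


def motion_energy(sequence):
    """Mean absolute luminance change for each pair of successive frames.

    Returns one value per frame transition, so a sequence of k frames
    yields k - 1 values.
    """
    energies = []
    for prev_frame, next_frame in zip(sequence, sequence[1:]):
        diff = frame_difference(prev_frame, next_frame)
        pixel_count = sum(len(row) for row in diff)
        energies.append(sum(sum(row) for row in diff) / pixel_count)
    return energies
```

Regions where such differences stay high during speech are natural candidates for the mouth region, which is the intuition behind using speaker movement for region-of-interest extraction.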
In addition, a novel approach is proposed for visual feature extraction from either the
discrete cosine transform or discrete wavelet transform representations of the mouth
region of the speaker. In this work, the image transform is explored from a new
viewpoint of data discrimination, in contrast to the more conventional data
preservation viewpoint. The main findings of this work are that audio-visual
automatic speech recognition systems using the new features extracted from the
frequency bands selected according to their discriminatory abilities generally
outperform those using features designed for data preservation.
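To make the contrast between discrimination and preservation concrete, the sketch below ranks transform coefficients by a Fisher-style ratio of between-class to within-class scatter and keeps the top k. The Fisher ratio is a swapped-in criterion chosen for illustration; the thesis's actual selection measure is not stated in this abstract, and the function names and data layout are hypothetical.

```python
def fisher_ratio(values_by_class):
    """Between-class over within-class scatter for one coefficient."""
    all_values = [v for vals in values_by_class.values() for v in vals]
    grand_mean = sum(all_values) / len(all_values)
    between, within = 0.0, 0.0
    for vals in values_by_class.values():
        class_mean = sum(vals) / len(vals)
        between += len(vals) * (class_mean - grand_mean) ** 2
        within += sum((v - class_mean) ** 2 for v in vals)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def select_coefficients(samples, labels, k):
    """Indices of the k coefficients that best separate the classes.

    samples is a list of coefficient vectors (e.g. DCT coefficients of
    the mouth region) and labels gives the class of each vector.
    """
    scored = []
    for index in range(len(samples[0])):
        by_class = {}
        for vector, label in zip(samples, labels):
            by_class.setdefault(label, []).append(vector[index])
        scored.append((fisher_ratio(by_class), index))
    scored.sort(reverse=True)
    return sorted(index for _, index in scored[:k])
```

A preservation-oriented scheme would instead keep the coefficients with the most energy; in the toy case where a high-energy coefficient varies equally within every class, the ratio above scores it near zero and it is dropped.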
To establish the noise robustness of the new features proposed in this work, their
performance has been studied in the presence of a range of different types of noise and at
various signal-to-noise ratios. In these experiments, the audio-visual automatic speech
recognition systems based on the new approaches were found to give superior
performance both to audio-visual systems using appearance-based features and to
audio-only speech recognition systems.
Discovering Dynamic Visemes
This thesis introduces a set of new, dynamic units of visual speech which are learnt
using computer vision and machine learning techniques. Rather than clustering
phoneme labels as is done traditionally, the visible articulators of a speaker are
tracked and automatically segmented into short, visually intuitive speech gestures
based on the dynamics of the articulators. The segmented gestures are clustered
into dynamic visemes, such that movements relating to the same visual function
appear within the same cluster. Speech animation can then be generated on any
facial model by mapping a phoneme sequence to a sequence of dynamic visemes,
and stitching together an example of each viseme in the sequence. Dynamic visemes
model coarticulation and maintain the dynamics of the original speech, so simple
blending at the concatenation boundaries ensures a smooth transition. The efficacy
of dynamic visemes for computer animation is formally evaluated both objectively
and subjectively, and compared with traditional phoneme to static lip-pose interpolation.