253 research outputs found
Lip Localization Algorithm Using Gabor Filters
This paper describes a lip localization algorithm within a still image frame for subsequent tracking and audiovisual speech recognition processing. A Gabor filter-based feature space is promoted as a means to localize lips within an image based on shape. This filtered space is shown to effectively differentiate facial features, including lips, from their backgrounds and to bound the full extent of the lips within a face-classified region of interest. Extensive training and test sets are used to justify design decisions and performance
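As a rough illustration of the kind of filter this abstract refers to, the following minimal sketch builds the real part of a standard 2-D Gabor kernel and a small bank of orientations with NumPy. The function names, parameter defaults, and choice of four orientations are assumptions made for illustration; the abstract does not state the paper's actual filter parameters or how the responses are combined.

```python
import numpy as np


def gabor_kernel(size=21, wavelength=8.0, theta=0.0, sigma=4.0, gamma=0.5, psi=0.0):
    """Real part of a 2-D Gabor kernel on a (size, size) grid (size assumed odd)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate the coordinates so the sinusoidal carrier runs along angle theta.
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    # Gaussian envelope, squeezed along y_r by the aspect ratio gamma.
    envelope = np.exp(-(x_r ** 2 + (gamma * y_r) ** 2) / (2.0 * sigma ** 2))
    # Cosine carrier with the given wavelength and phase offset.
    carrier = np.cos(2.0 * np.pi * x_r / wavelength + psi)
    return envelope * carrier


def gabor_bank(orientations=4, **kwargs):
    """Kernels at evenly spaced orientations in [0, pi); illustrative values only."""
    thetas = [k * np.pi / orientations for k in range(orientations)]
    return [gabor_kernel(theta=t, **kwargs) for t in thetas]
```

Convolving a grayscale face region with each kernel in such a bank would give one response map per orientation, which is one common way to build a Gabor feature space.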
Automatic Visual Speech Recognition
Intelligent Systems; Electrical Engineering, Mathematics and Computer Science
Adaptive threshold optimisation for colour-based lip segmentation in automatic lip-reading systems
A thesis submitted to the Faculty of Engineering and the Built Environment,
University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for
the degree of Doctor of Philosophy.
Johannesburg, September 2016
Having survived the ordeal of a laryngectomy, the patient must come to terms with
the resulting loss of speech. With recent advances in portable computing power,
automatic lip-reading (ALR) may become a viable approach to voice restoration. This
thesis addresses the image processing aspect of ALR, and focuses on three contributions
to colour-based lip segmentation.
The first contribution concerns the colour transform to enhance the contrast
between the lips and skin. This thesis presents the most comprehensive study to
date by measuring the overlap between lip and skin histograms for 33 different
colour transforms. The hue component of HSV obtains the lowest overlap of 6.15%,
and results show that selecting the correct transform can increase the segmentation
accuracy by up to three times.
The second contribution is the development of a new lip segmentation algorithm
that utilises the best colour transforms from the comparative study. The algorithm
is tested on 895 images and achieves percentage overlap (OL) of 92.23% and segmentation
error (SE) of 7.39%.
The third contribution focuses on the impact of the histogram threshold on the
segmentation accuracy, and introduces a novel technique called Adaptive Threshold
Optimisation (ATO) to select a better threshold value. The first stage of ATO
incorporates -SVR to train the lip shape model. ATO then uses feedback of shape
information to validate and optimise the threshold. After applying ATO, the SE
decreases from 7.65% to 6.50%, corresponding to an absolute improvement of 1.15 pp
or relative improvement of 15.1%. While this thesis concerns lip segmentation in
particular, ATO is a threshold selection technique that can be used in various
segmentation applications.
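The histogram-overlap comparison described in this abstract can be sketched as follows. This is a minimal illustration that assumes overlap is measured as the intersection of two normalised histograms; the thesis may define the measure differently. The function name, bin count, and value range are hypothetical.

```python
import numpy as np


def histogram_overlap(lip_values, skin_values, bins=64, value_range=(0.0, 1.0)):
    """Overlap of two normalised histograms, from 0.0 (disjoint) to 1.0 (identical).

    Assumes both inputs are non-empty 1-D samples of one colour component
    (for example hue scaled to [0, 1]) taken from labelled lip and skin pixels.
    """
    lip_hist, _ = np.histogram(lip_values, bins=bins, range=value_range)
    skin_hist, _ = np.histogram(skin_values, bins=bins, range=value_range)
    # Normalise counts to probabilities so differing sample sizes do not matter.
    lip_p = lip_hist / lip_hist.sum()
    skin_p = skin_hist / skin_hist.sum()
    # Histogram intersection: probability mass shared by the two distributions.
    return float(np.minimum(lip_p, skin_p).sum())
```

Under this assumed definition, a lower value means the colour component separates lip pixels from skin pixels more cleanly, which is the sense in which a transform with low overlap is preferable for thresholding.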
Articulatory features for robust visual speech recognition
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004. Includes bibliographical references (p. 99-105). This thesis explores a novel approach to visual speech modeling. Visual speech, or a sequence of images of the speaker's face, is traditionally viewed as a single stream of contiguous units, each corresponding to a phonetic segment. These units are defined heuristically by mapping several visually similar phonemes to one visual phoneme, sometimes referred to as a viseme. However, experimental evidence shows that phonetic models trained from visual data are not synchronous in time with acoustic phonetic models, indicating that visemes may not be the most natural building blocks of visual speech. Instead, we propose to model the visual signal in terms of the underlying articulatory features. This approach is a natural extension of feature-based modeling of acoustic speech, which has been shown to increase robustness of audio-based speech recognition systems. We start by exploring ways of defining visual articulatory features: first in a data-driven manner, using a large, multi-speaker visual speech corpus, and then in a knowledge-driven manner, using the rules of speech production. Based on these studies, we propose a set of articulatory features, and describe a computational framework for feature-based visual speech recognition. Multiple feature streams are detected in the input image sequence using Support Vector Machines, and then incorporated in a Dynamic Bayesian Network to obtain the final word hypothesis. Preliminary experiments show that our approach increases viseme classification rates in visually noisy conditions, and improves visual word recognition through feature-based context modeling. By Ekaterina Saenko. S.M.
Automated Speaker Independent Visual Speech Recognition: A Comprehensive Survey
Speaker-independent VSR is a complex task that involves identifying spoken
words or phrases from video recordings of a speaker's facial movements. Over
the years, there has been a considerable amount of research in the field of VSR
involving different algorithms and datasets to evaluate system performance.
These efforts have resulted in significant progress in developing effective VSR
models, creating new opportunities for further research in this area. This
survey provides a detailed examination of the progression of VSR over the past
three decades, with a particular emphasis on the transition from
speaker-dependent to speaker-independent systems. We also provide a
comprehensive overview of the various datasets used in VSR research and the
preprocessing techniques employed to achieve speaker independence. The survey
covers the works published from 1990 to 2023, thoroughly analyzing each work
and comparing them on various parameters. This survey provides an in-depth
analysis of the evolution of speaker-independent VSR systems from 1990 to 2023. It
outlines the development of VSR systems over time and highlights the need to
develop end-to-end pipelines for speaker-independent VSR. The pictorial
representation offers a clear and concise overview of the techniques used in
speaker-independent VSR, thereby aiding in the comprehension and analysis of
the various methodologies. The survey also highlights the strengths and
limitations of each technique and provides insights into developing novel
approaches for analyzing visual speech cues. Overall, this comprehensive review
provides insights into the current state of the art in speaker-independent VSR and
highlights potential areas for future research.
Audio-coupled video content understanding of unconstrained video sequences
Unconstrained video understanding is a difficult task. The main aim of this thesis is to
recognise the nature of objects, activities and environment in a given video clip using
both audio and video information. Traditionally, audio and video information has not
been applied together to solve such complex tasks, and for the first time we propose,
develop, implement and test a new framework of multi-modal (audio and video) data
analysis for context understanding and labelling of unconstrained videos.
The framework relies on feature selection techniques and introduces a novel algorithm
(PCFS) that is faster than the well-established SFFS algorithm. We use the framework for
studying the benefits of combining audio and video information in a number of different
problems. We begin by developing two independent content recognition modules. The
first one is based on image sequence analysis alone, and uses a range of colour, shape,
texture and statistical features from image regions with a trained classifier to recognise
the identity of objects, activities and environment present. The second module uses audio
information only, and recognises activities and environment. Both of these approaches
are preceded by detailed pre-processing to ensure that correct video segments containing
both audio and video content are present, and that the developed system can be made
robust to changes in camera movement, illumination, random object behaviour etc. For
both audio and video analysis, we use a hierarchical approach of multi-stage
classification such that difficult classification tasks can be decomposed into simpler and
smaller tasks.
When combining both modalities, we compare fusion techniques at different levels of
integration and propose a novel algorithm that combines advantages of both feature and
decision-level fusion. The analysis is evaluated on a large amount of test data comprising
unconstrained videos collected for this work. Finally, we propose a decision correction
algorithm which shows that further steps towards combining multi-modal classification
information effectively with semantic knowledge generate the best possible results.
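To make the two fusion levels named in this abstract concrete, here is a minimal sketch of generic feature-level and decision-level fusion. It is not the thesis's novel hybrid algorithm or its PCFS feature selection, neither of which the abstract specifies. The function names and the weighted-average rule are assumptions for illustration.

```python
import numpy as np


def feature_level_fusion(audio_features, video_features):
    """Concatenate per-clip feature vectors so a single classifier sees both modalities."""
    return np.concatenate([audio_features, video_features], axis=-1)


def decision_level_fusion(audio_probs, video_probs, audio_weight=0.5):
    """Weighted average of class probabilities from two separately trained classifiers.

    audio_weight is a hypothetical tuning parameter in [0, 1].
    """
    audio_probs = np.asarray(audio_probs, dtype=float)
    video_probs = np.asarray(video_probs, dtype=float)
    fused = audio_weight * audio_probs + (1.0 - audio_weight) * video_probs
    # Renormalise so each row of fused scores sums to one.
    return fused / fused.sum(axis=-1, keepdims=True)
```

Feature-level fusion lets one classifier model interactions between modalities, while decision-level fusion keeps the audio and video modules independent and merges only their outputs.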
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes.
LIP-READING VIA DEEP NEURAL NETWORKS USING HYBRID VISUAL FEATURES
Lip-reading is typically understood as visually interpreting a speaker's lip movements during speech. Experiments over many years have revealed that speech intelligibility increases when visual facial information is available, and this effect becomes more apparent in noisy environments. Automating this process raises challenges such as the coarticulation phenomenon, the type of visual units, feature diversity, and inter-speaker dependency. While efforts have been made to overcome these challenges, a flawless lip-reading system is still under investigation. This paper seeks a lip-reading model in which processing blocks are efficiently combined and arranged to extract highly discriminative visual features. Here, the application of a properly structured Deep Belief Network (DBN)-based recognizer is highlighted. Multi-speaker (MS) and speaker-independent (SI) tasks are performed on the CUAVE database, and phone recognition rates (PRRs) of 77.65% and 73.40% are achieved, respectively. The best word recognition rates (WRRs) achieved in the MS and SI tasks are 80.25% and 76.91%, respectively. The resulting accuracies demonstrate that the proposed method outperforms the conventional Hidden Markov Model (HMM) and competes well with state-of-the-art visual speech recognition works.