61 research outputs found

    Visual Speech and Speaker Recognition

    This thesis presents a learning-based approach to speech recognition and person recognition from image sequences. An appearance-based model of the articulators is learned from example images and is used to locate, track, and recover visual speech features. A major difficulty in model-based approaches is to develop a scheme which is general enough to account for the large appearance variability of objects without sacrificing specificity. The method described here decomposes the lip shape and the intensities in the mouth region into weighted sums of basis shapes and basis intensities, respectively, using a Karhunen-Loève expansion. The intensities deform with the shape model to provide shape-independent intensity information. This information is used in image search, which is based on a similarity measure between the model and the image. Visual speech features can be recovered from the tracking results and represent shape and intensity information. A speechreading (lip-reading) system is presented which models these features by Gaussian distributions and their temporal dependencies by hidden Markov models. The models are trained using the EM algorithm, and speech recognition is performed based on maximum posterior probability classification. It is shown that, besides speech information, the recovered model parameters also contain person-dependent information, and a novel method for person recognition based on these features is presented. Talking persons are represented by spatio-temporal models which describe the appearance of the articulators and their temporal changes during speech production. Two different topologies for speaker models are described: Gaussian mixture models and hidden Markov models. The proposed methods were evaluated for lip localisation, lip tracking, speech recognition, and speaker recognition on an isolated digit database of 12 subjects and on a continuous digit database of 37 subjects. The techniques were found to achieve good performance for all tasks listed above. For an isolated digit recognition task, the speechreading system outperformed previously reported systems and performed slightly better than untrained human speechreaders.
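    The Karhunen-Loève expansion of lip shapes described above can be sketched as follows. This is a minimal illustration on synthetic data; the array sizes, the number of retained basis shapes, and the landmark layout are assumptions for the example, not the thesis's actual values.

    ```python
    import numpy as np

    # Hypothetical training set: 50 lip shapes, each 20 landmark points (x, y) flattened.
    rng = np.random.default_rng(0)
    shapes = rng.normal(size=(50, 40))

    # Karhunen-Loeve expansion: centre the data and keep the leading
    # right singular vectors as basis shapes (principal components).
    mean_shape = shapes.mean(axis=0)
    centred = shapes - mean_shape
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    n_basis = 5
    basis = vt[:n_basis]            # basis shapes

    # Any shape is approximated as the mean plus a weighted sum of basis shapes.
    weights = centred @ basis.T     # projection onto the basis
    reconstructed = mean_shape + weights @ basis

    # With all components the reconstruction would be exact; with 5 the residual
    # is bounded by the energy in the discarded singular values.
    err = np.linalg.norm(shapes - reconstructed) / np.linalg.norm(shapes)
    print(weights.shape)
    ```

    The intensity model in the thesis is built the same way, with basis intensities replacing basis shapes.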

    Towards Speaker Independent Continuous Speechreading

    This paper describes recent speechreading experiments for a speaker-independent continuous digit recognition task. Visual feature extraction is performed by a lip tracker which recovers information about the lip shape and about the grey-level intensity around the mouth. These features are used to train visual word models using continuous density HMMs. Results show that the method generalises well to new speakers and that the recognition rate varies strongly across digits, as expected given the high visual confusability of certain words.
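    Scoring an observation sequence under a continuous density (Gaussian-emission) HMM, as the visual word models do, can be sketched with the forward algorithm. All model parameters below are illustrative toy values, not the paper's trained models.

    ```python
    import numpy as np

    def log_forward(obs, log_pi, log_A, means, variances):
        """Log-likelihood of an observation sequence under a Gaussian-emission HMM."""
        def log_b(x):  # per-state diagonal-Gaussian log density
            return -0.5 * np.sum(np.log(2 * np.pi * variances)
                                 + (x - means) ** 2 / variances, axis=1)
        alpha = log_pi + log_b(obs[0])
        for x in obs[1:]:
            alpha = log_b(x) + np.array(
                [np.logaddexp.reduce(alpha + log_A[:, j]) for j in range(len(log_pi))])
        return np.logaddexp.reduce(alpha)

    # Two toy 3-state left-to-right word models over 2-D visual features.
    log_pi = np.log(np.array([1.0, 1e-10, 1e-10]))
    A = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.6, 0.4],
                  [0.0, 0.0, 1.0]])
    log_A = np.log(A + 1e-12)
    means_a = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
    means_b = means_a + 5.0         # a competing word model
    var = np.ones((3, 2))

    rng = np.random.default_rng(1)
    obs = means_a + 0.1 * rng.normal(size=(3, 2))  # sequence drawn near model A
    score_a = log_forward(obs, log_pi, log_A, means_a, var)
    score_b = log_forward(obs, log_pi, log_A, means_b, var)
    print(score_a > score_b)  # the better-matching word model wins
    ```

    Recognition then amounts to picking the word model with the highest score.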

    Off-Line Cursive Script Recognition Based on Continuous Density HMM

    A system for off-line cursive script recognition is presented. A new normalization technique, based on statistical methods, compensates for the variability of writing style. The key problem of segmentation is avoided by applying a sliding window to the handwritten words. A feature vector is extracted from each frame isolated by the window. The feature vectors are used as observations in letter-oriented continuous density HMMs that perform the recognition. Feature extraction and modeling techniques are illustrated. To allow comparison of results, the system was trained and tested using the same data and experimental conditions as in other published works. The performance of the system is evaluated in terms of character and word (with and without lexicon) recognition rates. Results comparable to those of more complex systems have been achieved.
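    The sliding-window step can be sketched as below: a window moves left to right over a binarised word image and a small feature vector is extracted from each frame, so no explicit character segmentation is needed. The window size, step, and the two features are illustrative assumptions, not the paper's actual feature set.

    ```python
    import numpy as np

    def frame_features(word_img, win=4, step=2):
        """One feature vector per window position (frame) of a binarised word image."""
        h, w = word_img.shape
        frames = []
        for x0 in range(0, w - win + 1, step):
            frame = word_img[:, x0:x0 + win]
            col = frame.mean(axis=1)                  # vertical profile of ink density
            ink = col.sum()
            centre = (np.arange(h) * col).sum() / ink if ink > 0 else h / 2.0
            frames.append([ink, centre / h])          # two illustrative features
        return np.array(frames)

    # Synthetic 16x40 "word image" with random ink.
    word = (np.random.default_rng(2).random((16, 40)) > 0.7).astype(float)
    obs = frame_features(word)
    print(obs.shape)  # one observation vector per frame, ready for the HMMs
    ```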

    Using the Multi-Stream Approach for Continuous Audio-Visual Speech Recognition

    The Multi-Stream automatic speech recognition approach was investigated in this work as a framework for Audio-Visual data fusion and speech recognition. This method offers several potential advantages for such a task. In particular, it allows for synchronous decoding of continuous speech while still permitting some asynchrony between the visual and acoustic information streams. First, the Multi-Stream formalism is briefly recalled. Then, building on these motivations, experiments on the M2VTS multimodal database are presented and discussed. To our knowledge, these are the first experiments on multi-speaker continuous Audio-Visual Speech Recognition (AVSR). It is shown that the Multi-Stream approach can yield improved Audio-Visual speech recognition performance when the acoustic signal is corrupted by noise, as well as for clean speech.
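    The combination rule at the heart of multi-stream fusion can be sketched as a reliability-weighted log-linear combination of per-stream scores. The scores and weights below are made-up numbers for illustration; how the weights are set and where the combination happens in decoding follow the multi-stream formalism, not this two-line sketch.

    ```python
    def combined_score(log_lik_audio, log_lik_video, w_audio):
        """Weighted log-linear fusion; w_audio is lowered when the audio is noisy."""
        return w_audio * log_lik_audio + (1.0 - w_audio) * log_lik_video

    # Illustrative per-stream log-likelihoods for one word hypothesis.
    clean = combined_score(-10.0, -12.0, w_audio=0.8)   # trust the audio stream
    noisy = combined_score(-30.0, -12.0, w_audio=0.3)   # audio degraded, weight shifted
    print(clean, noisy)
    ```

    Shifting weight to the visual stream is what lets the fused system degrade gracefully under acoustic noise.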

    A Survey of Text Detection and Recognition in Images and Videos

    A survey of text detection and recognition in images and videos, covering state-of-the-art methods and systems.

    Recognition of Asymmetric Facial Action Unit Activities and Intensities

    Most automatic facial expression analysis systems try to analyze emotion categories. However, psychologists argue that there is no straightforward way to classify emotions from facial expressions. Instead, they propose FACS (the Facial Action Coding System), a de facto standard for categorizing facial actions independently of emotional categories. We describe a system that recognizes asymmetric FACS Action Unit activities and intensities without the use of markers. Facial expression extraction is achieved by difference images that are projected into a subspace using either PCA or ICA, followed by nearest neighbor classification. Experiments show that this holistic approach achieves a recognition performance comparable to marker-based facial expression analysis systems or human FACS experts for a single-subject database recorded under controlled conditions.
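    The holistic pipeline can be sketched as: difference images (expression minus neutral face) projected into a PCA subspace, then nearest-neighbour classification. Image sizes, the 3-D subspace, and the training labels are assumptions for the example (the paper also considers ICA in place of PCA).

    ```python
    import numpy as np

    rng = np.random.default_rng(3)
    neutral = rng.random((8, 8))
    # Six synthetic difference images with (hypothetical) Action Unit labels.
    train_diffs = np.stack([rng.random((8, 8)) - neutral for _ in range(6)]).reshape(6, -1)
    labels = ["AU1", "AU1", "AU4", "AU4", "AU12", "AU12"]

    # PCA subspace from the training difference images.
    mean = train_diffs.mean(axis=0)
    _, _, vt = np.linalg.svd(train_diffs - mean, full_matrices=False)
    proj = vt[:3]
    train_codes = (train_diffs - mean) @ proj.T

    def classify(diff_img):
        """Project a difference image into the subspace, return the nearest label."""
        code = (diff_img.reshape(-1) - mean) @ proj.T
        return labels[int(np.argmin(np.linalg.norm(train_codes - code, axis=1)))]

    result = classify(train_diffs[2].reshape(8, 8))  # a training sample maps to its own label
    print(result)
    ```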

    Optimal Parameterization of Point Distribution Models

    We address the problem of determining the optimal model complexity for shape modeling. This complexity is a compromise between model specificity and generality. We show that the error of a model can be split into two components, the model error and the fitting error, of which the first can be used to optimize the model complexity based on the specific application. This strategy improves over traditional approaches, where the model complexity is determined only by vague heuristics or trial and error. A method for determining optimal active shape models is proposed and its efficiency is validated in several experiments. Furthermore, this method gives an indication of the range of valid shape parameters and of whether additional training data would further reduce the number of shape parameters.
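    The error split can be illustrated numerically: for t retained modes, the residual on training shapes plays the role of the fitting error, while the residual on held-out shapes probes the model error (generality). The data here is synthetic and the specific error definitions are a simplification of the paper's analysis.

    ```python
    import numpy as np

    rng = np.random.default_rng(4)
    train = rng.normal(size=(30, 10))   # synthetic training shape vectors
    test = rng.normal(size=(10, 10))    # held-out shapes

    mean = train.mean(axis=0)
    _, s, vt = np.linalg.svd(train - mean, full_matrices=False)

    def residual(data, t):
        """Relative reconstruction error when only the first t modes are kept."""
        c = data - mean
        return np.linalg.norm(c - (c @ vt[:t].T) @ vt[:t]) / np.linalg.norm(c)

    fit_err = [residual(train, t) for t in range(1, 10)]
    model_err = [residual(test, t) for t in range(1, 10)]
    # The fitting error always decreases as modes are added; the held-out error
    # flattens once extra modes only encode training noise, which is the signal
    # for the optimal model complexity.
    print(fit_err[0] > fit_err[-1])
    ```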

    Illumination-robust Pattern Matching Using Distorted Color Histograms

    It is argued that global illumination should be modeled separately from other factors that change the appearance of objects. The effects of intensity variations of the global illumination are discussed, and constraints are deduced that restrict the shape of a function mapping the histogram of a template to the histogram of an image location. This approach is illustrated for simple pattern matching and for a combination with a PCA (Eigenface) model of the grey-level appearance.
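    The core idea can be sketched with a simple histogram-matching step: a global illumination change acts approximately as a monotonic mapping of grey levels, so mapping the template's histogram onto the candidate region's histogram before scoring removes the illumination difference. The rank-based matching below is one standard realisation, not necessarily the paper's exact distortion function.

    ```python
    import numpy as np

    def match_histogram(template, target):
        """Monotonically remap template grey levels onto the target's distribution."""
        ranks = np.argsort(np.argsort(template.ravel()))
        return np.sort(target.ravel())[ranks].reshape(template.shape)

    rng = np.random.default_rng(5)
    template = rng.random((6, 6))
    lit = np.clip(0.6 * template + 0.2, 0.0, 1.0)  # simulated global intensity change

    matched = match_histogram(template, lit)
    # A monotone illumination change preserves pixel rank order, so after
    # matching, the template agrees with the re-lit region.
    print(float(np.abs(matched - lit).max()))
    ```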

    Automatic Facial Expression Analysis: A Survey

    Over the last decade, automatic facial expression analysis has become an active research area with potential applications in areas such as more engaging human-computer interfaces, talking heads, image retrieval, and human emotion analysis. Facial expressions reflect not only emotions but also other mental activities, social interaction, and physiological signals. In this survey we introduce the most prominent automatic facial expression analysis methods and systems presented in the literature. Facial motion and deformation extraction approaches, as well as classification methods, are discussed with respect to issues such as face normalization, facial expression dynamics, and facial expression intensity, but also with regard to their robustness towards environmental changes.