3 research outputs found

    Continuous Audio-Visual Speech Recognition

    We address the problems of robust lip tracking, visual speech feature extraction, and sensor integration for audio-visual speech recognition applications. An appearance-based model of the articulators, which represents linguistically important features, is learned from example images and is used to locate, track, and recover visual speech information. We tackle the problem of jointly modelling the temporal behaviour of the acoustic and visual speech signals by applying Multi-Stream hidden Markov models. This approach allows different temporal topologies and levels of stream integration, and hence enables temporal dependencies to be modelled more accurately. The system has been evaluated on a continuously spoken digit recognition task with 37 subjects.
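
    As an illustration of the stream integration mentioned above, the following is a minimal sketch of a state-synchronous multi-stream combination, assuming diagonal-covariance Gaussian emission densities and fixed stream exponent weights; the function names, weights, and feature dimensions are illustrative assumptions, not taken from the paper.

    import numpy as np

    def gaussian_log_likelihood(x, mean, var):
        """Log-density of x under a diagonal-covariance Gaussian."""
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

    def multistream_state_score(audio_obs, visual_obs, state, weights=(0.7, 0.3)):
        """Combine audio and visual stream log-likelihoods for one HMM state
        with exponent weights (weighted sum in the log domain)."""
        la = gaussian_log_likelihood(audio_obs, state["audio_mean"], state["audio_var"])
        lv = gaussian_log_likelihood(visual_obs, state["visual_mean"], state["visual_var"])
        return weights[0] * la + weights[1] * lv

    # Illustrative usage with made-up dimensions (13-dim acoustic, 10-dim visual features).
    state = {
        "audio_mean": np.zeros(13), "audio_var": np.ones(13),
        "visual_mean": np.zeros(10), "visual_var": np.ones(10),
    }
    score = multistream_state_score(np.random.randn(13), np.random.randn(10), state)
    print(score)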

    Statistical chromaticity models for lip tracking with B-splines

    A method for lip tracking intended to support personal verification is presented in this paper. Lip contours are represented by means of quadratic B-splines. The lips are automatically localised in the original image and an elliptic B-spline is generated to initialise tracking. Lip localisation exploits grey-level gradient projections as well as chromaticity models to find the lips within an automatically segmented region corresponding to the face area. Tracking proceeds by estimating new lip contour positions according to a statistical chromaticity model for the lips. The current tracker implementation follows a deterministic second-order model for the spline motion based on a Lagrangian formulation of contour dynamics. The method has been tested on the M2VTS database [1]; lips were accurately tracked on sequences of more than a hundred frames. (Int. Conf. on Audio- and Video-based Biometric Person Authentication, Crans Montana, Switzerland, 1997.)
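
    As a rough sketch of the two ingredients named in the abstract, the code below assumes a closed uniform quadratic B-spline for the contour and a two-dimensional Gaussian over intensity-normalised (r, g) chromaticities for the lip colour model; all function names, control points, and parameter values are illustrative assumptions rather than the paper's implementation.

    import numpy as np

    def quadratic_bspline(control_points, n_samples=96):
        """Sample a closed uniform quadratic B-spline from (N, 2) control points."""
        cp = np.asarray(control_points, dtype=float)
        n = len(cp)
        pts = []
        for i in range(n):
            p0, p1, p2 = cp[i], cp[(i + 1) % n], cp[(i + 2) % n]
            for t in np.linspace(0.0, 1.0, n_samples // n, endpoint=False):
                b0 = 0.5 * (1 - t) ** 2          # uniform quadratic basis functions
                b1 = 0.5 + t * (1 - t)
                b2 = 0.5 * t ** 2
                pts.append(b0 * p0 + b1 * p1 + b2 * p2)
        return np.array(pts)

    def lip_chromaticity_log_likelihood(rgb_pixels, mean_rg, cov_rg):
        """Log-likelihood of pixels under a Gaussian (r, g) chromaticity model."""
        rgb = np.asarray(rgb_pixels, dtype=float)
        rg = rgb[:, :2] / (rgb.sum(axis=1, keepdims=True) + 1e-9)
        diff = rg - mean_rg
        inv = np.linalg.inv(cov_rg)
        quad = np.einsum("ij,jk,ik->i", diff, inv, diff)
        return -0.5 * (quad + np.log(np.linalg.det(cov_rg)) + 2 * np.log(2 * np.pi))

    # Illustrative usage: an 8-control-point contour and a made-up lip colour model.
    contour = quadratic_bspline([[10, 5], [20, 0], [30, 5], [35, 10],
                                 [30, 15], [20, 20], [10, 15], [5, 10]])
    ll = lip_chromaticity_log_likelihood([[180, 90, 80], [120, 110, 100]],
                                         np.array([0.45, 0.30]),
                                         np.array([[0.002, 0.0], [0.0, 0.002]]))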

    Visual Speech and Speaker Recognition

    This thesis presents a learning-based approach to speech recognition and person recognition from image sequences. An appearance-based model of the articulators is learned from example images and is used to locate, track, and recover visual speech features. A major difficulty in model-based approaches is to develop a scheme which is general enough to account for the large appearance variability of objects but which does not lack specificity. The method described here decomposes the lip shape and the intensities in the mouth region into weighted sums of basis shapes and basis intensities, respectively, using a Karhunen-Loève expansion. The intensities deform with the shape model to provide shape-independent intensity information. This information is used in image search, which is based on a similarity measure between the model and the image.

    Visual speech features can be recovered from the tracking results and represent shape and intensity information. A speechreading (lip-reading) system is presented which models these features by Gaussian distributions and their temporal dependencies by hidden Markov models. The models are trained using the EM algorithm, and speech recognition is performed based on maximum posterior probability classification. It is shown that, besides speech information, the recovered model parameters also contain person-dependent information, and a novel method for person recognition based on these features is presented. Talking persons are represented by spatio-temporal models which describe the appearance of the articulators and their temporal changes during speech production. Two different topologies for speaker models are described: Gaussian mixture models and hidden Markov models.

    The proposed methods were evaluated for lip localisation, lip tracking, speech recognition, and speaker recognition on an isolated digit database of 12 subjects and on a continuous digit database of 37 subjects. The techniques were found to achieve good performance for all tasks listed above. For an isolated digit recognition task, the speechreading system outperformed previously reported systems and performed slightly better than untrained human speechreaders.
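
    The shape-and-intensity decomposition described above is essentially a Karhunen-Loève (PCA) expansion. Below is a minimal sketch of learning such a basis from example vectors and projecting a new example onto it; the dimensions and random stand-in data are hypothetical and are not taken from the thesis.

    import numpy as np

    def karhunen_loeve_basis(samples, n_components):
        """Learn a KL (PCA) basis: the sample mean plus the leading eigenvectors
        of the covariance, obtained via SVD of the centred data matrix."""
        mean = samples.mean(axis=0)
        centred = samples - mean
        _, _, vt = np.linalg.svd(centred, full_matrices=False)
        return mean, vt[:n_components]

    def project(x, mean, basis):
        """Model parameters (weights of the basis vectors) for one example."""
        return basis @ (x - mean)

    def reconstruct(weights, mean, basis):
        """Approximate an example as the mean plus a weighted sum of basis vectors."""
        return mean + weights @ basis

    # Illustrative usage: 200 training shapes of 40 contour points (x, y),
    # flattened to 80-dim vectors and reduced to 10 shape parameters.
    shapes = np.random.randn(200, 80)
    mean, basis = karhunen_loeve_basis(shapes, n_components=10)
    w = project(shapes[0], mean, basis)
    approx = reconstruct(w, mean, basis)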