874 research outputs found
Cued Speech Gesture Recognition: A First Prototype Based on Early Reduction
International audienceCued Speech is a specific linguistic code for hearing-impaired people. It is based on both lip reading and manual gestures. In the context of THIMP (Telephony for the Hearing-IMpaired Project), we work on automatic cued speech translation. In this paper, we only address the problem of automatic cued speech manual gesture recognition. Such a gesture recognition issue is really common from a theoretical point of view, but we approach it with respect to its particularities in order to derive an original method. This method is essentially built around a bioinspired method called early reduction. Prior to a complete analysis of each image of a sequence, the early reduction process automatically extracts a restricted number of key images which summarize the whole sequence. Only the key images are studied from a temporal point of view with lighter computation than the complete sequenc
Geometrical-based lip-reading using template probabilistic multi-dimension dynamic time warping
By identifying lip movements and characterizing their associations with speech sounds, the performance of speech recognition systems can be improved, particularly when operating in noisy environments. In this paper, we present a geometrical-based automatic lip reading system that extracts the lip region from images using conventional techniques, but the contour itself is extracted using a novel application of a combination of border following and convex hull approaches. Classification is carried out using an enhanced dynamic time warping technique that has the ability to operate in multiple dimensions and a template probability technique that is able to compensate for differences in the way words are uttered in the training set. The performance of the new system has been assessed in recognition of the English digits 0 to 9 as available in the CUAVE database. The experimental results obtained from the new approach compared favorably with those of existing lip reading approaches, achieving a word recognition accuracy of up to 71% with the visual information being obtained from estimates of lip height, width and their ratio
Facial Emotion Recognition Feature Extraction: A Survey
Facial emotion recognition is a process based on facial expression to automatically recognize individual emotion expression. Automatic recognition refers to creating computer systems that are able to simulate human natural ability of detection, analysis, and determination of emotion by facial expression. Human natural recognition uses various points of observation to make decision or conclusion on emotion expressed by the present person in front. Facial features efficiently extracted aid in improving the classifier performance and application efficiency. Many feature extraction methods based on shape, texture, and other local features are proposed in the literature, and this chapter will review them. This chapter will survey some recent and formal feature expression methods from video and image products and classify them according to their efficiency and application
3D Shape Descriptor-Based Facial Landmark Detection: A Machine Learning Approach
Facial landmark detection on 3D human faces has had numerous applications in the literature
such as establishing point-to-point correspondence between 3D face models which is itself a
key step for a wide range of applications like 3D face detection and authentication, matching,
reconstruction, and retrieval, to name a few.
Two groups of approaches, namely knowledge-driven and data-driven approaches, have been
employed for facial landmarking in the literature. Knowledge-driven techniques are the
traditional approaches that have been widely used to locate landmarks on human faces. In
these approaches, a user with sucient knowledge and experience usually denes features to
be extracted as the landmarks. Data-driven techniques, on the other hand, take advantage
of machine learning algorithms to detect prominent features on 3D face models. Besides
the key advantages, each category of these techniques has limitations that prevent it from
generating the most reliable results.
In this work we propose to combine the strengths of the two approaches to detect facial
landmarks in a more ecient and precise way. The suggested approach consists of two phases.
First, some salient features of the faces are extracted using expert systems. Afterwards,
these points are used as the initial control points in the well-known Thin Plate Spline (TPS)
technique to deform the input face towards a reference face model. Second, by exploring and
utilizing multiple machine learning algorithms another group of landmarks are extracted.
The data-driven landmark detection step is performed in a supervised manner providing an
information-rich set of training data in which a set of local descriptors are computed and used
to train the algorithm. We then, use the detected landmarks for establishing point-to-point
correspondence between the 3D human faces mainly using an improved version of Iterative
Closest Point (ICP) algorithms. Furthermore, we propose to use the detected landmarks for
3D face matching applications
A system for recognizing human emotions based on speech analysis and facial feature extraction: applications to Human-Robot Interaction
With the advance in Artificial Intelligence, humanoid robots start to interact with ordinary people based on the growing understanding of psychological processes. Accumulating evidences in Human Robot Interaction (HRI) suggest that researches are focusing on making an emotional communication between human and robot for creating a social perception, cognition, desired interaction and sensation.
Furthermore, robots need to receive human emotion and optimize their behavior to help and interact with a human being in various environments. The most natural way to recognize basic emotions is extracting sets of features from human speech, facial expression and body gesture. A system for recognition of emotions based on speech analysis and facial features extraction can have interesting applications in Human-Robot Interaction. Thus, the Human-Robot Interaction ontology explains how the knowledge of these fundamental sciences is applied in physics (sound analyses), mathematics (face detection and perception), philosophy theory (behavior) and robotic science context.
In this project, we carry out a study to recognize basic emotions (sadness, surprise, happiness, anger, fear and disgust). Also, we propose a methodology and a software program for classification of emotions based on speech analysis and facial features extraction.
The speech analysis phase attempted to investigate the appropriateness of using acoustic (pitch value, pitch peak, pitch range, intensity and formant), phonetic (speech rate) properties of emotive speech with the freeware program PRAAT, and consists of generating and analyzing a graph of speech signals. The proposed architecture investigated the appropriateness of analyzing emotive speech with the minimal use of signal processing algorithms. 30 participants to the experiment had to repeat five sentences in English (with durations typically between 0.40 s and 2.5 s) in order to extract data relative to pitch (value, range and peak) and rising-falling intonation. Pitch alignments (peak, value and range) have been evaluated and the results have been compared with intensity and speech rate.
The facial feature extraction phase uses the mathematical formulation (B\ue9zier curves) and the geometric analysis of the facial image, based on measurements of a set of Action Units (AUs) for classifying the emotion. The proposed technique consists of three steps: (i) detecting the facial region within the image, (ii) extracting and classifying the facial features, (iii) recognizing the emotion. Then, the new data have been merged with reference data in order to recognize the basic emotion.
Finally, we combined the two proposed algorithms (speech analysis and facial expression), in order to design a hybrid technique for emotion recognition. Such technique have been implemented in a software program, which can be employed in Human-Robot Interaction.
The efficiency of the methodology was evaluated by experimental tests on 30 individuals (15 female and 15 male, 20 to 48 years old) form different ethnic groups, namely: (i) Ten adult European, (ii) Ten Asian (Middle East) adult and (iii) Ten adult American.
Eventually, the proposed technique made possible to recognize the basic emotion in most of the cases
Modelling talking human faces
This thesis investigates a number of new approaches for visual speech
synthesis using data-driven methods to implement a talking face.
The main contributions in this thesis are the following. The accuracy
of shared Gaussian process latent variable model (SGPLVM)
built using the active appearance model (AAM) and relative spectral
transform-perceptual linear prediction (RASTAPLP) features is improved
by employing a more accurate AAM. This is the first study
to report that using a more accurate AAM improves the accuracy of
SGPLVM. Objective evaluation via reconstruction error is performed
to compare the proposed approach against previously existing methods.
In addition, it is shown experimentally that the accuracy of AAM
can be improved by using a larger number of landmarks and/or larger
number of samples in the training data.
The second research contribution is a new method for visual speech
synthesis utilising a fully Bayesian method namely the manifold relevance
determination (MRD) for modelling dynamical systems through
probabilistic non-linear dimensionality reduction. This is the first time
MRD was used in the context of generating talking faces from the
input speech signal. The expressive power of this model is in the ability
to consider non-linear mappings between audio and visual features
within a Bayesian approach. An efficient latent space has been learnt
iii
Abstract iv
using a fully Bayesian latent representation relying on conditional nonlinear
independence framework. In the SGPLVM the structure of the
latent space cannot be automatically estimated because of using a maximum
likelihood formulation. In contrast to SGPLVM the Bayesian approaches
allow the automatic determination of the dimensionality of the
latent spaces. The proposed method compares favourably against several
other state-of-the-art methods for visual speech generation, which
is shown in quantitative and qualitative evaluation on two different
datasets.
Finally, the possibility of incremental learning of AAM for inclusion
in the proposed MRD approach for visual speech generation is
investigated. The quantitative results demonstrate that using MRD in
conjunction with incremental AAMs produces only slightly less accurate
results than using batch methods. These results support a way of
training this kind of models on computers with limited resources, for
example in mobile computing.
Overall, this thesis proposes several improvements to the current
state-of-the-art in generating talking faces from speech signal leading
to perceptually more convincing results
High-quality face capture, animation and editing from monocular video
Digitization of virtual faces in movies requires complex capture setups and extensive manual work to produce superb animations and video-realistic editing. This thesis pushes the boundaries of the digitization pipeline by proposing automatic algorithms for high-quality 3D face capture and animation, as well as photo-realistic face editing. These algorithms reconstruct and modify faces in 2D videos recorded in uncontrolled scenarios and illumination. In particular, advances in three main areas offer solutions for the lack of depth and overall uncertainty in video recordings. First, contributions in capture include model-based reconstruction of detailed, dynamic 3D geometry that exploits optical and shading cues, multilayer parametric reconstruction of accurate 3D models in unconstrained setups based on inverse rendering, and regression-based 3D lip shape enhancement from high-quality data. Second, advances in animation are video-based face reenactment based on robust appearance metrics and temporal clustering, performance-driven retargeting of detailed facial models in sync with audio, and the automatic creation of personalized controllable 3D rigs. Finally, advances in plausible photo-realistic editing are dense face albedo capture and mouth interior synthesis using image warping and 3D teeth proxies. High-quality results attained on challenging application scenarios confirm the contributions and show great potential for the automatic creation of photo-realistic 3D faces.Die Digitalisierung von Gesichtern zum Einsatz in der Filmindustrie erfordert komplizierte Aufnahmevorrichtungen und die manuelle Nachbearbeitung von Rekonstruktionen, um perfekte Animationen und realistische Videobearbeitung zu erzielen. Diese Dissertation erweitert vorhandene Digitalisierungsverfahren durch die Erforschung von automatischen Verfahren zur qualitativ hochwertigen 3D Rekonstruktion, Animation und Modifikation von Gesichtern. Diese Algorithmen erlauben es, Gesichter in 2D Videos, die unter allgemeinen Bedingungen und unbekannten Beleuchtungsverhältnissen aufgenommen wurden, zu rekonstruieren und zu modifizieren. Vor allem Fortschritte in den folgenden drei Hauptbereichen tragen zur Kompensation von fehlender Tiefeninformation und der allgemeinen Mehrdeutigkeit von 2D Videoaufnahmen bei. Erstens, Beiträge zur modellbasierten Rekonstruktion von detaillierter und dynamischer 3D Geometrie durch optische Merkmale und die Shading-Eigenschaften des Gesichts, mehrschichtige parametrische Rekonstruktion von exakten 3D Modellen mittels inversen Renderings in allgemeinen Szenen und regressionsbasierter 3D Lippenformverfeinerung mittels qualitativ hochwertigen Daten. Zweitens, Fortschritte im Bereich der Computeranimation durch videobasierte Gesichtsausdrucksübertragung und temporaler Clusterbildung, Übertragung von detaillierten Gesichtsmodellen, deren Mundbewegung mit Ton synchronisiert ist, und die automatische Erstellung von personalisierten "3D Face Rigs". Schließlich werden Fortschritte im Bereich der realistischen Videobearbeitung vorgestellt, welche auf der dichten Rekonstruktion von Hautreflektionseigenschaften und der Mundinnenraumsynthese mittels bildbasierten und geometriebasierten Verfahren aufbauen. Qualitativ hochwertige Ergebnisse in anspruchsvollen Anwendungen untermauern die Wichtigkeit der geleisteten Beiträgen und zeigen das große Potential der automatischen Erstellung von realistischen digitalen 3D Gesichtern auf
Automated Facial Anthropometry Over 3D Face Surface Textured Meshes
The automation of human face measurement means facing mayor technical and technological challenges. The use of 3D scanning technology is widely accepted in the scientific community and it offers the possibility of developing non-invasive measurement techniques. However, the selection of the points that form the basis of the measurements is a task that still requires human intervention. This work introduces digital image processing methods for automatic localization of facial features. The first goal was to examine different ways to represent 3D shapes and to evaluate whether these could be used as representative features of facial attributes, in order to locate them automatically. Based on the above, a non-rigid registration procedure was developed to estimate dense point-to-point correspondence between two surfaces. The method is able to register 3D models of faces in the presence of facial expressions. Finally, a method that uses both shape and appearance information of the surface, was designed for automatic localization of a set of facial features that are the basis for determining anthropometric ratios, which are widely used in fields such as ergonomics, forensics, surgical planning, among othersResumen : La automatización de la medición del rostro humano implica afrontar grandes desafíos técnicos y tecnológicos. Una alternativa de solución que ha encontrado gran aceptación dentro de la comunidad científica, corresponde a la utilización de tecnología de digitalización 3D con lo cual ha sido posible el desarrollo de técnicas de medición no invasivas. Sin embargo, la selección de los puntos que son la base de las mediciones es una tarea que aún requiere de la intervención humana. En este trabajo se presentan métodos de procesamiento digital de imágenes para la localización automática de características faciales. Lo primero que se hizo fue estudiar diversas formas de representar la forma en 3D y cómo estas podían contribuir como características representativas de los atributos faciales con el fin de poder ubicarlos automáticamente. Con base en lo anterior, se desarrolló un método para la estimación de correspondencia densa entre dos superficies a partir de un procedimiento de registro no rígido, el cual se enfocó a modelos de rostros 3D en presencia de expresiones faciales. Por último, se plantea un método, que utiliza tanto información de la forma como de la apariencia de las superficies, para la localización automática de un conjunto de características faciales que son la base para determinar índices antropométricos ampliamente utilizados en campos tales como la ergonomía, ciencias forenses, planeación quirúrgica, entre otrosDoctorad
Visual Speech Recognition
In recent years, Visual speech recognition has a more concentration, by researchers, than the past. Because of the leakage of the visual processing of the Arabic vocabularies recognition, we start to search in this field. Audio speech recognition concerned with the acoustic characteristic of the signal, but there are many situations that the audio signal is weak of not exist, and this will be a point in Chapter 2. The visual recognition process focuses on the features extracted from video of the speaker. These features are to be classified using several techniques. The most important feature to be extracted is motion. By segmenting motion of the lips of the speaker, an algorithm has manipulate it in such away to recognize the word which is said. But motion segmentation is not the only problem facing the speech recognition process, segmenting the lips itself is an early step in the speech recognition process, so, to segment lips motion we have to segment lips first, a new approach for lip segmentation is proposed in this thesis. Sometimes, motion feature needs another feature to support in recognition the spoken word. So in our thesis another new algorithm is proposed to use motion segmentation by using the Abstract Difference Image from an image series, supported by correlation for registering images in the image series, to recognize ten words in the Arabic language, the words are from “one” to “ten” in Arabic language. The algorithm also uses the HU-Invariant set of features to describe the Abstract Difference Image, and uses a three different recognition methods to recognize the words. The CLAHE method as a filtering technique is used by our algorithm to manipulate lighting problems. Our algorithm based on extracting the differences details from a series of images to recognize the word, achieved an overall results 55.8%, it is an adequate result for our algorithm when integrated in an audio-visual system
- …