2 research outputs found

    Low latency and tight resources viseme recognition from speech using an artificial neural network

    We present a speech-driven, real-time viseme recognition system based on an Artificial Neural Network (ANN). A Multi-Layer Perceptron (MLP) provides a light and responsive framework adapted to the target application (i.e., animating the lips of an avatar on a multi-task, set-top-box-style platform with embedded resources and latency constraints). Several improvements to this system are studied, such as training data selection, network size, training set size, and the choice of acoustic unit to recognize. All variants are compared to a baseline system; the combined improvements achieve a recognition rate of 64.3% for a set of 18 visemes and 70.8% for 9 visemes. We then propose a system that trades off recognition performance, resource requirements, and latency constraints. A scalable variant is also described. A minimal sketch of the kind of frame-level classifier described here is given below.
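    The following sketch illustrates a frame-level MLP viseme classifier of the kind the abstract describes. It assumes pre-computed acoustic feature vectors and integer viseme labels; the feature dimension, hidden-layer size, and placeholder data are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of a frame-level MLP viseme classifier, assuming
# pre-computed acoustic features (e.g., MFCCs with context frames) and
# integer viseme labels; all sizes and data below are placeholders.
import numpy as np
from sklearn.neural_network import MLPClassifier

N_FRAMES, FEAT_DIM, N_VISEMES = 5000, 39, 18  # assumed sizes

# Stand-in data; a real system would extract features from speech frames.
rng = np.random.default_rng(0)
X = rng.normal(size=(N_FRAMES, FEAT_DIM)).astype(np.float32)
y = rng.integers(0, N_VISEMES, size=N_FRAMES)

# A single small hidden layer keeps the model light, in line with the
# paper's goal of low latency on resource-constrained platforms.
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=50)
clf.fit(X, y)

# At run time each incoming frame is classified independently, so the
# decision latency is bounded by one forward pass per frame.
print(clf.predict(X[:10]))
```

    A smaller hidden layer or fewer context frames would further reduce the per-frame cost, which is the kind of size/performance tradeoff the abstract explores.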

    Hidden Markov Models for Visual Speech Synthesis in Limited Data

    This work presents a new approach for estimating control points (facial locations that control movement) to allow the artificial generation of video with apparent mouth movement (visual speech) time-synced with recorded audio. First, Hidden Markov Models (HMMs) are estimated for each visual speech category (viseme) present in stored video data, where a category is defined as the mouth movement corresponding to a given sound and where the visemes are further categorized as trisemes (a viseme in the context of the previous and following visemes). Next, a decision tree is used to cluster and relate states in the HMMs that are similar in a contextual and statistical sense. The tree is also used to estimate HMMs that generate sequences of visual speech control points for trisemes not occurring in the stored data. An experiment is described that evaluates the effect of several algorithm variables, and a statistical analysis is presented that establishes appropriate levels for each variable by minimizing the error between the desired and estimated control points. The analysis indicates that the error is lowest when the process is conducted with three-state, left-to-right, no-skip HMMs trained using short-duration dynamic features, a high log-likelihood threshold, and a low outlier threshold. Also, comparisons of mouth shapes generated from the artificial control points and the true control points (estimated from video not used to train the HMMs) indicate that the process provides accurate estimates for most trisemes tested in this work. The research presented here thus establishes a useful method for synthesizing realistic audio-synchronized video facial features.
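    As an illustration of the HMM topology the abstract identifies as best (three-state, left-to-right, no-skip), the sketch below fits such a model to control-point trajectories for a single triseme using hmmlearn. The feature dimension and training data are placeholder assumptions, and the decision-tree state clustering described in the abstract is not shown.

```python
# Minimal sketch: fit a 3-state left-to-right, no-skip HMM to control-point
# trajectories for one triseme, then sample a synthetic trajectory.
# CTRL_DIM and the training sequences are assumed placeholders.
import numpy as np
from hmmlearn.hmm import GaussianHMM

CTRL_DIM = 8  # e.g., coordinates of a few mouth control points (assumed)
rng = np.random.default_rng(0)

# Stand-in training data: several short trajectories for the same triseme.
seqs = [rng.normal(size=(rng.integers(8, 15), CTRL_DIM)) for _ in range(20)]
X = np.concatenate(seqs)
lengths = [len(s) for s in seqs]

# Left-to-right, no-skip topology: each state may only stay or advance to
# the next state. Excluding 's' and 't' from init_params preserves the
# hand-set structure at the start of training; zero transitions stay zero
# under EM re-estimation.
hmm = GaussianHMM(n_components=3, covariance_type="diag",
                  n_iter=20, init_params="mc")
hmm.startprob_ = np.array([1.0, 0.0, 0.0])
hmm.transmat_ = np.array([[0.5, 0.5, 0.0],
                          [0.0, 0.5, 0.5],
                          [0.0, 0.0, 1.0]])
hmm.fit(X, lengths)

# Generate a control-point sequence for this triseme from the trained model.
traj, states = hmm.sample(12)
print(traj.shape, states)
```

    In the approach described above, one such model would exist per triseme (or per decision-tree cluster of triseme states), and unseen trisemes would be synthesized from the clustered states rather than from their own training data.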