Audio-to-Visual Speech Conversion using Deep Neural Networks
We study the problem of mapping from acoustic to visual speech with the goal of automatically generating accurate, perceptually natural speech animation from an audio speech signal. We present a sliding window deep neural network that learns a mapping from a window of acoustic features to a window of visual features from a large audio-visual speech dataset. Overlapping visual predictions are averaged to generate continuous, smoothly varying speech animation. We outperform a baseline HMM inversion approach in both objective and subjective evaluations and perform a thorough analysis of our results.
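The averaging of overlapping window predictions described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `overlap_average`, the `hop` parameter, and the array layout are assumptions, and the network that produces the per-window predictions is left out.

```python
import numpy as np

def overlap_average(window_preds, hop=1):
    """Average overlapping per-window predictions into one frame sequence.

    window_preds has shape (n_windows, window_len, n_features); window i is
    assumed to cover frames [i * hop, i * hop + window_len).
    """
    window_preds = np.asarray(window_preds, dtype=float)
    n_windows, window_len, n_features = window_preds.shape
    if hop > window_len:
        raise ValueError("hop must not exceed window_len, or frames are left uncovered")
    n_frames = (n_windows - 1) * hop + window_len
    total = np.zeros((n_frames, n_features))
    count = np.zeros((n_frames, 1))
    for i in range(n_windows):
        start = i * hop
        total[start:start + window_len] += window_preds[i]
        count[start:start + window_len] += 1  # how many windows cover each frame
    return total / count
```

Each output frame is the mean of every window prediction that covers it, so frames near the start and end of the sequence are averaged over fewer windows than frames in the middle.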
Expressive Modulation of Neutral Visual Speech
The need for animated graphical models of the human face is commonplace in
the movie, video game and television industries, appearing in everything from
low budget advertisements and free mobile apps, to Hollywood blockbusters
costing hundreds of millions of dollars. Generative statistical models of
animation attempt to address some of the drawbacks of industry standard
practices such as labour intensity and creative inflexibility.
This work describes one such method for transforming speech animation curves
between different expressive styles. Beginning with the assumption that
expressive speech animation is a mix of two components, a high-frequency
speech component (the content) and a much lower-frequency expressive
component (the style), we use Independent Component Analysis (ICA) to
identify and manipulate these components independently of one another. Next
we learn how the energy for different speaking styles is distributed in terms of
the low-dimensional independent components model. Transforming the
speaking style involves projecting new animation curves into the low-dimensional
ICA space, redistributing the energy in the independent
components, and finally reconstructing the animation curves by inverting the
projection.
We show that a single ICA model can be used for separating multiple expressive
styles into their component parts. Subjective evaluations show that viewers can
reliably identify the expressive style generated using our approach, and that they
have difficulty distinguishing transformed animated expressive speech from the
equivalent ground truth.
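The project, redistribute, reconstruct sequence in this abstract can be sketched as below. This is only an illustration under stated assumptions: the `unmixing` and `mixing` matrices and the `mean` are taken to come from an ICA model fitted elsewhere, "energy" is interpreted as the mean squared value of each independent component, and the function name `redistribute_energy` is hypothetical. The abstract does not specify how the authors measure or redistribute energy.

```python
import numpy as np

def redistribute_energy(curves, unmixing, mixing, mean, target_energy):
    """Rescale independent components to a target per-component energy.

    curves:        (n_frames, n_dims) animation curves
    unmixing:      (n_components, n_dims) projection into the ICA space (assumed given)
    mixing:        (n_dims, n_components) inverse projection (assumed given)
    mean:          (n_dims,) mean removed before projection
    target_energy: (n_components,) desired mean squared value per component
    """
    curves = np.asarray(curves, dtype=float)
    sources = (curves - mean) @ unmixing.T           # project into the ICA space
    energy = np.mean(sources ** 2, axis=0)           # current energy per component
    scale = np.sqrt(np.asarray(target_energy) / np.maximum(energy, 1e-12))
    return (sources * scale) @ mixing.T + mean       # invert the projection
```

When `target_energy` equals the current energy of the components and `mixing` inverts `unmixing`, the curves come back unchanged, which is a convenient sanity check.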
Example Based Caricature Synthesis
The likeness of a caricature to the original face image is an essential and often overlooked part of caricature
production. In this paper we present an example-based caricature synthesis technique, consisting of shape
exaggeration, relationship exaggeration, and optimization for likeness. Rather than relying on a large training set
of caricature face pairs, our shape exaggeration step is based on only one or a small number of examples of facial
features. The relationship exaggeration step introduces two definitions which facilitate global facial feature
synthesis. The first is the T-Shape rule, which describes the relative relationship between the facial elements in an
intuitive manner. The second is the so-called proportions, which characterize the facial features in proportional
form. Finally, we introduce a likeness metric based on the Modified Hausdorff Distance
(MHD), which allows us to optimize the configuration of facial elements, maximizing likeness while satisfying a
number of constraints. The effectiveness of our algorithm is demonstrated with experimental results.
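For reference, the standard Modified Hausdorff Distance between two point sets can be computed as in the sketch below. The function name `modified_hausdorff` is an assumption, and the abstract says only that the likeness metric is based on MHD, so the metric used in the paper may differ from this plain form.

```python
import numpy as np

def modified_hausdorff(a, b):
    """Modified Hausdorff Distance between point sets a (n, d) and b (m, d)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise Euclidean distances, shape (n, m).
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    d_ab = dists.min(axis=1).mean()  # mean distance from each point of a to its nearest in b
    d_ba = dists.min(axis=0).mean()  # mean distance from each point of b to its nearest in a
    return max(d_ab, d_ba)
```

Taking the mean of nearest-neighbour distances, rather than the maximum used by the classical Hausdorff distance, makes the measure less sensitive to a single outlying point.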
Hidden Markov Models for Visual Speech Synthesis in Limited Data
This work presents a new approach for estimating control points (facial locations that control movement) to allow the artificial generation of video with apparent mouth movement (visual speech) time-synced with recorded audio. First, Hidden Markov Models (HMMs) are estimated for each visual speech category (viseme) present in stored video data, where a category is defined as the mouth movement corresponding to a given sound and where the visemes are further categorized as trisemes (a viseme in the context of previous and following visemes). Next, a decision tree is used to cluster and relate states in the HMMs that are similar in a contextual and statistical sense. The tree is also used to estimate HMMs that generate sequences of visual speech control points for trisemes not occurring in the stored data. An experiment is described that evaluates the effect of several algorithm variables, and a statistical analysis is presented that establishes appropriate levels for each variable by minimizing the error between the desired and estimated control points. The analysis indicates that the error is lowest when the process is conducted with three-state, left-to-right, no-skip HMMs trained using short-duration dynamic features, a high log-likelihood threshold, and a low outlier threshold. Also, comparisons of mouth shapes generated from the artificial control points and the true control points (estimated from video not used to train the HMMs) indicate that the process provides accurate estimates for most trisemes tested in this work. The research presented here thus establishes a useful method for synthesizing realistic audio-synchronized video facial features.
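The three-state, left-to-right, no-skip topology mentioned above constrains the HMM transition matrix: each state may only repeat or advance to the next state. A minimal sketch of such a matrix follows; the function name `left_to_right_transitions` and the `stay_prob` value of 0.6 are illustrative assumptions, not figures from the work, and in practice the probabilities would be estimated from training data.

```python
import numpy as np

def left_to_right_transitions(n_states=3, stay_prob=0.6):
    """Transition matrix for a left-to-right HMM with no skip transitions."""
    trans = np.zeros((n_states, n_states))
    for s in range(n_states - 1):
        trans[s, s] = stay_prob            # self-loop: remain in the current state
        trans[s, s + 1] = 1.0 - stay_prob  # advance to the next state only
    trans[-1, -1] = 1.0                    # final state is absorbing in this sketch
    return trans
```

All entries below the diagonal are zero (no backward moves), and all entries more than one step above it are zero (no skips).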
Fault diagnosis of rolling element bearing based on wavelet kernel principle component analysis-coupled hidden Markov model
Applying a hidden Markov model (HMM) separately to two channel signals from the same data collection point yields different descriptions. Moreover, a wrong fault diagnosis may result, because a single channel signal cannot describe the fault feature information comprehensively. In theory, two channel signals collected from the same data collection point contain much more fault information than a single channel signal, but coupling may occur between the two channels. The coupled hidden Markov model (CHMM) is an improved form of the HMM that can efficiently fuse the information of two channel signals from the same data collection point, so CHMM can give a much more reliable diagnosis than HMM. Accordingly, a fault diagnosis method for rolling element bearings based on wavelet kernel principal component analysis (WKPCA)-CHMM is proposed. First, WKPCA is used to extract fault feature vectors, which increases the efficiency of the proposed method. Then CHMM is applied to the extracted fault feature vectors, and a satisfactory fault diagnosis result is obtained. The feasibility and advantages of the proposed method are verified through experiment.
Multimodal Based Audio-Visual Speech Recognition for Hard-of-Hearing: State of the Art Techniques and Challenges
Multimodal Integration (MI) is the study of merging the knowledge acquired by the nervous system using sensory modalities such as speech, vision, touch, and gesture. The applications of MI span Audio-Visual Speech Recognition (AVSR), Sign Language Recognition (SLR), Emotion Recognition (ER), Biometrics Applications (BMA), Affect Recognition (AR), Multimedia Retrieval (MR), etc. Fusions of modalities such as hand gesture and face, or lip and hand position, are the sensory modalities mainly used in developing multimodal systems for the hearing impaired. This paper gives an overview of the multimodal systems available in the literature on hearing-impaired studies. It also discusses some of the studies related to hearing-impaired acoustic analysis. It is observed that far fewer algorithms have been developed for hearing-impaired AVSR than for normal hearing. Thus, the study of audio-visual speech recognition systems for the hearing impaired is in high demand for people who are trying to communicate in natively spoken languages. This paper also highlights the state-of-the-art techniques in AVSR and the challenges faced by researchers in developing AVSR systems.