
    Parallel Reference Speaker Weighting for Kinematic-Independent Acoustic-to-Articulatory Inversion

    Acoustic-to-articulatory inversion, the estimation of articulatory kinematics from an acoustic waveform, is a challenging but important problem. Accurate estimation of articulatory movements has the potential for significant impact on our understanding of speech production, on our capacity to assess and treat pathologies in a clinical setting, and on speech technologies such as computer aided pronunciation assessment and audio-video synthesis. However, because of the complex and speaker-specific relationship between articulation and acoustics, existing approaches for inversion do not generalize well across speakers. As acquiring speaker-specific kinematic data for training is not feasible in many practical applications, this remains an important and open problem. This paper proposes a novel approach to acoustic-to-articulatory inversion, Parallel Reference Speaker Weighting (PRSW), which requires no kinematic data for the target speaker and a small amount of acoustic adaptation data. PRSW hypothesizes that acoustic and kinematic similarities are correlated and uses speaker-adapted articulatory models derived from acoustically derived weights. The system was assessed using a 20-speaker data set of synchronous acoustic and Electromagnetic Articulography (EMA) kinematic data. Results demonstrate that by restricting the reference group to a subset consisting of speakers with strong individual speaker-dependent inversion performance, the PRSW method is able to attain kinematic-independent acoustic-to-articulatory inversion performance nearly matching that of the speaker-dependent model, with an average correlation of 0.62 versus 0.63. This indicates that given a sufficiently complete and appropriately selected reference speaker set for adaptation, it is possible to create effective articulatory models without kinematic training data
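
The weighting idea in this abstract can be illustrated with a minimal sketch. It assumes each reference speaker is summarized by one acoustic feature vector and one articulatory mean vector, and it uses normalized inverse-distance weights as a stand-in; the abstract does not say how PRSW estimates its weights, so the function names and the weighting rule below are hypothetical.

```python
import math


def acoustic_weights(target_acoustic, reference_acoustics):
    """Return one weight per reference speaker, summing to 1.

    Assumption: weights are normalized inverse Euclidean distances in
    acoustic space. This is an illustrative stand-in, not the estimation
    procedure used by PRSW.
    """
    eps = 1e-9  # avoids division by zero when a reference matches exactly
    inverse = [1.0 / (math.dist(target_acoustic, ref) + eps)
               for ref in reference_acoustics]
    total = sum(inverse)
    return [value / total for value in inverse]


def adapt_articulatory_mean(weights, reference_articulatory_means):
    """Combine the reference speakers' articulatory means with the
    acoustically derived weights (no target kinematic data is used)."""
    dim = len(reference_articulatory_means[0])
    return [sum(w * ref[d] for w, ref in zip(weights, reference_articulatory_means))
            for d in range(dim)]


# Hypothetical toy data: two reference speakers, 2-D features.
weights = acoustic_weights([1.0, 2.0], [[1.0, 2.1], [3.0, 4.0]])
adapted = adapt_articulatory_mean(weights, [[0.0, 10.0], [4.0, 2.0]])
```

The point of the sketch is only that the weights are computed from acoustic data and then reused on the articulatory side, which is what makes the approach independent of target-speaker kinematics.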

    Speaker Independent Acoustic-to-Articulatory Inversion

    Acoustic-to-articulatory inversion, the determination of articulatory parameters from acoustic signals, is a difficult but important problem for many speech processing applications, such as automatic speech recognition (ASR) and computer aided pronunciation training (CAPT). In recent years, several approaches have been successfully implemented for speaker dependent models with parallel acoustic and kinematic training data. However, in many practical applications inversion is needed for new speakers for whom no articulatory data is available. In order to address this problem, this dissertation introduces a novel speaker adaptation approach called Parallel Reference Speaker Weighting (PRSW), based on parallel acoustic and articulatory Hidden Markov Models (HMM). This approach uses a robust normalized articulatory space and palate referenced articulatory features combined with speaker-weighted adaptation to form an inversion mapping for new speakers that can accurately estimate articulatory trajectories. The proposed PRSW method is evaluated on the newly collected Marquette electromagnetic articulography - Mandarin Accented English (EMA-MAE) corpus using 20 native English speakers. Cross-speaker inversion results show that given a good selection of reference speakers with consistent acoustic and articulatory patterns, the PRSW approach gives good speaker independent inversion performance even without kinematic training data

    THREE DIMENSIONAL MODELING AND ANIMATION OF FACIAL EXPRESSIONS

    Facial expression and animation are important aspects of the 3D environment featuring human characters. These animations are frequently used in many kinds of applications and there have been many efforts to increase the realism. Three aspects are still stimulating active research: the detailed subtle facial expressions, the process of rigging a face, and the transfer of an expression from one person to another. This dissertation focuses on the above three aspects. A system for freely designing and creating detailed, dynamic, and animated facial expressions is developed. The presented pattern functions produce detailed and animated facial expressions. The system produces realistic results with fast performance, and allows users to directly manipulate it and see immediate results. Two unique methods for generating real-time, vivid, and animated tears have been developed and implemented. One method is for generating a teardrop that continually changes its shape as the tear drips down the face. The other is for generating a shedding tear, which is a kind of tear that seamlessly connects with the skin as it flows along the surface of the face, but remains an individual object. The methods both broaden CG and increase the realism of facial expressions. A new method to automatically set the bones on facial/head models to speed up the rigging process of a human face is also developed. To accomplish this, vertices that describe the face/head as well as relationships between each part of the face/head are grouped. The average distance between pairs of vertices is used to place the head bones. To set the bones in the face with multi-density, the mean value of the vertices in a group is measured. The time saved with this method is significant. A novel method to produce realistic expressions and animations by transferring an existing expression to a new facial model is developed. 
The approach is to transform the source model into the target model, which then has the same topology as the source model. The displacement vectors are calculated. Each vertex in the source model is mapped to the target model. The spatial relationships of each mapped vertex are constrained
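
The displacement-based transfer described above can be sketched roughly as follows. The sketch assumes a vertex correspondence between source and target is already available as a dictionary, and it omits the spatial-relationship constraints the abstract mentions; all names are hypothetical.

```python
def transfer_expression(src_neutral, src_expr, tgt_neutral, vertex_map):
    """Apply per-vertex displacement vectors from a source face to a target face.

    src_neutral, src_expr: lists of (x, y, z) vertices for the source model
        in its neutral and expressive poses.
    tgt_neutral: list of (x, y, z) vertices for the target neutral pose.
    vertex_map: dict mapping a source vertex index to a target vertex index
        (assumed to be given; the abstract does not specify how it is built).
    """
    # Start from a copy of the target neutral pose.
    tgt_expr = [list(vertex) for vertex in tgt_neutral]
    for src_idx, tgt_idx in vertex_map.items():
        # Displacement of this vertex between the neutral and expressive poses.
        displacement = [e - n for e, n in zip(src_expr[src_idx], src_neutral[src_idx])]
        tgt_expr[tgt_idx] = [t + d for t, d in zip(tgt_neutral[tgt_idx], displacement)]
    return tgt_expr
```

Unmapped target vertices keep their neutral positions in this sketch; a fuller method would also constrain neighbouring vertices, as the abstract indicates.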

    Reoccurring patterns in hierarchical protein materials and music: The power of analogies

    Complex hierarchical structures composed of simple nanoscale building blocks form the basis of most biological materials. Here we demonstrate how analogies between seemingly different fields enable the understanding of general principles by which functional properties in hierarchical systems emerge, similar to an analogy learning process. Specifically, natural hierarchical materials like spider silk exhibit properties comparable to classical music in terms of their hierarchical structure and function. As a comparative tool here we apply hierarchical ontology logs (olog) that follow a rigorous mathematical formulation based on category theory to provide an insightful system representation by expressing knowledge in a conceptual map. We explain the process of analogy creation, draw connections at several levels of hierarchy and identify similar patterns that govern the structure of the hierarchical systems silk and music and discuss the impact of the derived analogy for nanotechnology. Comment: 13 pages, 3 figures

    Expressive Modulation of Neutral Visual Speech

    The need for animated graphical models of the human face is commonplace in the movie, video game and television industries, appearing in everything from low budget advertisements and free mobile apps, to Hollywood blockbusters costing hundreds of millions of dollars. Generative statistical models of animation attempt to address some of the drawbacks of industry standard practices such as labour intensity and creative inflexibility. This work describes one such method for transforming speech animation curves between different expressive styles. Beginning with the assumption that expressive speech animation is a mix of two components, a high-frequency speech component (the content) and a much lower-frequency expressive component (the style), we use Independent Component Analysis (ICA) to identify and manipulate these components independently of one another. Next we learn how the energy for different speaking styles is distributed in terms of the low-dimensional independent components model. Transforming the speaking style involves projecting new animation curves into the low-dimensional ICA space, redistributing the energy in the independent components, and finally reconstructing the animation curves by inverting the projection. We show that a single ICA model can be used for separating multiple expressive styles into their component parts. Subjective evaluations show that viewers can reliably identify the expressive style generated using our approach, and that they have difficulty in distinguishing transformed animated expressive speech from the equivalent ground-truth

    Investigating 3D Visual Speech Animation Using 2D Videos

    Lip motion accuracy is of paramount importance for speech intelligibility, especially for users who are hard of hearing or foreign language learners. Furthermore, generating a high level of realism in lip movements is required for the game and film production industries. This thesis focuses on the mapping of tracked lip motions of front-view 2D videos of a real speaker to a synthetic 3D head. A data-driven approach is used based on a 3D morphable model (3DMM) built using 3D synthetic head poses. 3DMMs have been widely used for tasks such as face recognition and the detection of facial expressions and lip motions in 2D videos. However, factors that may influence the resulting animation, such as the required facial landmarks for the mapping process, the amount of data for constructing the 3DMM, and differences in facial features between real faces and 3D faces, have not yet been investigated. Therefore, this research centers on investigating the impact of these factors on the final 3D lip motions. The thesis explores how different sets of facial features used in the mapping process influence the resulting 3D motions. Five sets of facial features are used for mapping the real faces to the corresponding 3D faces. The results show that the inclusion of eyebrows, eyes, nose, and lips improves the 3D lip motions, while face contour features (i.e. the outside boundary of the front view of the face) restrict the face’s mesh, distorting the resulting animation. This thesis also investigates how using different amounts of data when constructing the 3DMM affects the 3D lip motions. The results show that using a wider range of synthetic head poses for different phoneme intensities to create a 3DMM, as well as a combination of front- and side-view photographs of real speakers to produce initial neutral 3D synthetic head poses, provides better animation results compared to ground truth data consisting of front- and side-view 2D videos of real speakers. 
The thesis also investigates the impact of differences and similarities in facial features between real speakers and the 3DMMs on the resulting 3D lip motions by mapping between non-similar faces based on differences and similarities in vertical mouth height and mouth width. The objective and user test results show that mapping 2D videos of real speakers with low vertical mouth heights to 3D heads that correspond to real speakers with high vertical mouth heights, or vice versa, generates poorer 3D lip motions. It is thus important that this is considered when using a 2D recording of a real actor’s lip movements to control a 3D synthetic character

    Augmented Reality

    Augmented Reality (AR) is a natural development from virtual reality (VR), which was developed several decades earlier. AR complements VR in many ways. Because the user can see both real and virtual objects simultaneously, AR is far more intuitive, but it is not completely detached from human factors and other restrictions. AR applications also take less time and effort to build, because the entire virtual scene and environment do not have to be constructed. In this book, several new and emerging application areas of AR are presented and divided into three sections. The first section contains applications in outdoor and mobile AR, such as construction, restoration, security and surveillance. The second section deals with AR in medicine, biology, and the human body. The third and final section contains a number of new and useful applications in daily living and learning

    Inter-speaker speech variability assessment using statistical deformable models from 3.0 Tesla magnetic resonance images

    The morphological and dynamic characterisation of the vocal tract during speech production has been gaining attention, motivated by recent improvements in magnetic resonance (MR) imaging, namely the use of higher magnetic fields such as 3.0 Tesla. In this work, the automatic study of the vocal tract from 3.0 Tesla MR images was assessed through the application of statistical deformable models. The primary goal was the analysis of the shape of the vocal tract during the articulation of European Portuguese sounds, followed by the evaluation of the results concerning the automatic segmentation, i.e. identification of the vocal tract in new MR images. With regard to speech production, this is the first attempt to automatically characterise and reconstruct the vocal tract shape from 3.0 Tesla MR images by using deformable models, in particular active shape and active appearance models. The results demonstrate the adequacy and advantage of these deformable models for the automatic analysis of 3.0 Tesla MR images, in order to extract the vocal tract shape and assess the articulatory movements involved. Such results are needed, for example, for a better understanding of speech production, particularly in patients suffering from articulatory disorders, and for building enhanced speech synthesizer models.

    FLNet: Landmark Driven Fetching and Learning Network for Faithful Talking Facial Animation Synthesis

    Talking face synthesis has been widely studied with both appearance-based and warping-based methods. Previous works mostly use a single face image as a source, and generate novel facial animations by merging in another person's facial features. However, some facial regions such as the eyes or teeth, which may be hidden in the source image, cannot be synthesized faithfully and stably. In this paper, we present a landmark driven two-stream network to generate faithful talking facial animation, in which more facial details are created, preserved and transferred from multiple source images instead of a single one. Specifically, we propose a network consisting of a learning stream and a fetching stream. The fetching sub-net directly learns to attentively warp and merge facial regions from five source images with distinctive landmarks, while the learning pipeline renders facial organs from the training face space to compensate. Extensive experiments demonstrate that the proposed method outperforms baseline algorithms both quantitatively and qualitatively. Code is available at https://github.com/kgu3/FLNet_AAAI2020. Comment: Accepted by AAAI 202