
    Relating Objective and Subjective Performance Measures for AAM-based Visual Speech Synthesizers

    We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that uses acoustic features as input, and one that uses a phonetic transcription as input. Both synthesizers are trained on the same data, and performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g. a single syllable) of ground-truth visual speech is incorrect, we find that the subjective score for the entire sequence is lower than that of sequences generated by our synthesizers. This observation motivates further consideration of an often ignored issue: to what extent are subjective measures correlated with objective measures of performance? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicators of viewer perception of quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp of synthesized visual speech parameters to the respective ground-truth parameters is a better indicator of subjective quality.
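    For illustration, the following is a minimal sketch of the kind of measure the abstract favours: the accumulated cost of a dynamic time warp aligning synthesized parameter trajectories to the ground truth. The Euclidean frame distance and path-length normalisation used here are assumptions for the example, not the paper's exact formulation.

```python
import numpy as np

def dtw_cost(synth, truth):
    """Cost of a dynamic time warp aligning two parameter trajectories.

    synth, truth: arrays of shape (T, D) holding per-frame AAM parameter
    vectors. Returns the accumulated Euclidean cost of the optimal warp,
    normalised by path length so sequences of different durations stay
    comparable.
    """
    T1, T2 = len(synth), len(truth)
    # Pairwise frame-to-frame distances.
    dist = np.linalg.norm(synth[:, None, :] - truth[None, :, :], axis=-1)
    acc = np.full((T1 + 1, T2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # insertion
                acc[i, j - 1],      # deletion
                acc[i - 1, j - 1],  # match
            )
    return acc[T1, T2] / (T1 + T2)

# Example: two 10-dimensional parameter tracks of different lengths.
rng = np.random.default_rng(0)
ground_truth = rng.normal(size=(120, 10))
synthesised = rng.normal(size=(110, 10))
print(dtw_cost(synthesised, ground_truth))
```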

    Predicting Hungarian sound durations for continuous speech

    Direct measurements show that a number of factors influence the final value of sound durations in continuous speech. On the segmental level it is mainly the articulatory movements that constitute the important influencing factors, while on the suprasegmental level accent, syllabic stress, within-word position, the preceding and following syllables, and finally utterance position may influence the final sound durations. The problem of predicting sound durations can therefore be described as a multivariable function in which the effect of each variable cannot easily be determined with good accuracy. It is difficult to separate the effects of the individual factors, i.e., to model this function, by making direct measurements on the speech signal. A model has been constructed and realized in which three well-defined levels operate separately. The first (the segmental level) separates the effect of articulation from the other factors; the second and third levels relate to the suprasegmental level of speech.
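    As a rough illustration of the multivariable nature of the problem, the sketch below combines a segmental base duration with multiplicative suprasegmental factors. The base values and factor weights are invented for the example; they are not measured Hungarian data and this is not the three-level model described in the paper.

```python
# Hypothetical intrinsic (segmental) durations in milliseconds.
BASE_MS = {"a": 90, "t": 70, "s": 100}

# Hypothetical multiplicative suprasegmental factors.
FACTORS = {
    "stressed_syllable": 1.15,   # lengthening under syllabic stress
    "word_final": 1.10,          # word-final lengthening
    "utterance_final": 1.30,     # utterance-final lengthening
}

def predict_duration(phone, **context):
    """Combine the segmental base with whichever suprasegmental factors apply."""
    duration = BASE_MS[phone]
    for name, active in context.items():
        if active:
            duration *= FACTORS[name]
    return duration

# A stressed, utterance-final /a/:
print(predict_duration("a", stressed_syllable=True, utterance_final=True))
```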

    Data-Driven Critical Tract Variable Determination for European Portuguese

    Technologies such as real-time magnetic resonance imaging (RT-MRI) can provide valuable information to evolve our understanding of the static and dynamic aspects of speech by contributing to the determination of which articulators are essential (critical) in producing specific sounds and how (gestures). While a visual analysis and comparison of imaging data or vocal tract profiles can already provide relevant findings, the sheer amount of available data demands, and can strongly profit from, unsupervised data-driven approaches. Recent work in this regard has asserted the possibility of determining critical articulators from RT-MRI data by considering a representation of vocal tract configurations based on landmarks placed on the tongue, lips, and velum, yielding meaningful results for European Portuguese (EP). Advancing this previous work to obtain a characterization of EP sounds grounded in Articulatory Phonology, which is important for exploring critical gestures and advancing, for example, articulatory speech synthesis, entails the consideration of a novel set of tract variables. To this end, this article explores critical variable determination considering a vocal tract representation aligned with Articulatory Phonology and the Task Dynamics framework. The overall results, obtained considering data for three EP speakers, show the applicability of this approach and are consistent with existing descriptions of EP sounds.
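    One common data-driven heuristic for criticality is that a tract variable is tightly controlled (low variance) during the production of a sound for which it is critical. The sketch below illustrates that idea on assumed inputs; the variable layout, threshold, and test are illustrative and are not the article's actual method.

```python
import numpy as np

def critical_variables(values, labels, phone, ratio_threshold=0.4):
    """Flag tract variables whose variance for a given phone is much lower
    than their overall variance, a common heuristic for criticality.

    values: array (N_frames, N_variables) of tract-variable measurements
            (e.g. lip aperture, tongue-tip constriction degree).
    labels: length-N_frames sequence of phone labels, one per frame.
    Returns the indices of variables judged critical for `phone`.
    """
    values = np.asarray(values)
    labels = np.asarray(labels)
    phone_values = values[labels == phone]
    overall_var = values.var(axis=0)
    phone_var = phone_values.var(axis=0)
    return np.where(phone_var / overall_var < ratio_threshold)[0]

# Toy data: variable 0 is tightly controlled for "p", variable 1 is free.
rng = np.random.default_rng(1)
labels = np.array(["p"] * 50 + ["a"] * 50)
var0 = np.concatenate([rng.normal(0, 0.05, 50), rng.normal(0, 1.0, 50)])
var1 = rng.normal(0, 1.0, 100)
print(critical_variables(np.column_stack([var0, var1]), labels, "p"))
```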

    Augmented Reality

    Augmented Reality (AR) is a natural development from virtual reality (VR), which was developed several decades earlier, and it complements VR in many ways. Because the user can see real and virtual objects simultaneously, AR is far more intuitive, although it is not completely free of human factors and other restrictions. AR applications also require less time and effort to build, because there is no need to construct the entire virtual scene and environment. In this book, several new and emerging application areas of AR are presented and divided into three sections. The first section contains applications in outdoor and mobile AR, such as construction, restoration, security, and surveillance. The second section deals with AR applied to medicine, biology, and the human body. The third and final section contains a number of new and useful applications in daily living and learning.

    Speech animation using electromagnetic articulography as motion capture data

    Electromagnetic articulography (EMA) captures the position and orientation of a number of markers, attached to the articulators, during speech. As such, it performs the same function for speech that conventional motion capture, acquired with optical modalities and a long-time staple technique of the animation industry, does for full-body movements. In this paper, EMA data is processed from a motion-capture perspective and applied to the visualization of an existing multimodal corpus of articulatory data, creating a kinematic 3D model of the tongue and teeth by adapting a conventional motion-capture-based animation paradigm. This is accomplished using off-the-shelf, open-source software. Such an animated model can then be easily integrated into multimedia applications as a digital asset, allowing the analysis of speech production in an intuitive and accessible manner. The processing of the EMA data, its co-registration with 3D data from vocal tract magnetic resonance imaging (MRI) and dental scans, and the modeling workflow are presented in detail, and several issues are discussed.
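    Co-registering EMA coil positions with MRI-derived geometry typically reduces to estimating a rigid transform between corresponding reference landmarks. The sketch below shows a standard Kabsch/Procrustes solution under that assumption; it is a generic illustration, not the paper's specific workflow.

```python
import numpy as np

def rigid_align(source, target):
    """Least-squares rigid transform (rotation + translation) mapping
    `source` points onto `target` points (Kabsch/Procrustes solution).

    source, target: arrays of shape (N, 3) of corresponding landmarks,
    e.g. EMA reference coils and their locations in the MRI volume.
    Returns (R, t) such that source @ R.T + t approximates target.
    """
    src_mean = source.mean(axis=0)
    tgt_mean = target.mean(axis=0)
    H = (source - src_mean).T @ (target - tgt_mean)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_mean - R @ src_mean
    return R, t

# Toy example: recover a known rotation and offset.
rng = np.random.default_rng(2)
ema_refs = rng.normal(size=(4, 3))
angle = np.pi / 6
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
mri_refs = ema_refs @ R_true.T + np.array([1.0, -2.0, 0.5])
R, t = rigid_align(ema_refs, mri_refs)
print(np.allclose(ema_refs @ R.T + t, mri_refs))
```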

    Registration and statistical analysis of the tongue shape during speech production

    This thesis analyzes the human tongue shape during speech production. First, a semi-supervised approach is derived for estimating the tongue shape from volumetric magnetic resonance imaging data of the human vocal tract. Results of this extraction are used to derive parametric tongue models. Next, a framework is presented for registering sparse motion capture data of the tongue by means of such a model. This method makes it possible to generate full three-dimensional animations of the tongue. Finally, a multimodal and statistical text-to-speech system is developed that is able to synthesize audio and synchronized tongue motion from text. (Funded by the German Research Foundation.)
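    To illustrate how a parametric model can register sparse motion-capture data, the sketch below fits the weights of a hypothetical linear (PCA-style) tongue model to a handful of observed coil positions by regularised least squares. All names, shapes, and data are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def fit_model_to_landmarks(mean_shape, components, landmark_idx, landmarks, reg=1e-3):
    """Least-squares fit of a linear (PCA-style) tongue model to sparse points.

    mean_shape:   (P, 3) mean vertex positions of the tongue mesh.
    components:   (K, P, 3) shape basis; a weight vector w of length K gives
                  the full shape  mean_shape + sum_k w[k] * components[k].
    landmark_idx: indices of the mesh vertices corresponding to the coils.
    landmarks:    (L, 3) observed coil positions for one frame.
    Returns the fitted weights and the reconstructed full mesh.
    """
    K = components.shape[0]
    # Restrict the basis to the observed vertices and solve a ridge-regularised
    # linear system for the weights.
    A = components[:, landmark_idx, :].reshape(K, -1).T        # (3L, K)
    b = (landmarks - mean_shape[landmark_idx]).reshape(-1)     # (3L,)
    w = np.linalg.solve(A.T @ A + reg * np.eye(K), A.T @ b)
    full = mean_shape + np.tensordot(w, components, axes=1)
    return w, full

# Toy example with a random 2-component model over 200 vertices and 4 coils.
rng = np.random.default_rng(3)
mean_shape = rng.normal(size=(200, 3))
components = rng.normal(size=(2, 200, 3))
true_w = np.array([0.5, -1.2])
coil_idx = [10, 50, 120, 180]
coils = (mean_shape + np.tensordot(true_w, components, axes=1))[coil_idx]
w, _ = fit_model_to_landmarks(mean_shape, components, coil_idx, coils)
print(np.round(w, 3))   # approximately recovers true_w
```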
    • 

    corecore