
    Relating Objective and Subjective Performance Measures for AAM-based Visual Speech Synthesizers

    We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that uses acoustic features as input, and one that uses a phonetic transcription as input. Both synthesizers are trained on the same data, and performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g. a single syllable) of ground-truth visual speech is incorrect, we find that the subjective score for the entire sequence is lower than for sequences generated by our synthesizers. This observation motivates further consideration of an often-ignored question: to what extent are subjective measures of performance correlated with objective measures? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicators of viewer-perceived quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp of synthesized visual speech parameters to the respective ground-truth parameters is a better indicator of subjective quality.
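
    The paper's proposed indicator, the cost of a dynamic time warp (DTW) between synthesized and ground-truth parameter trajectories, is straightforward to compute. The following minimal sketch assumes NumPy arrays of per-frame parameter vectors and a Euclidean frame distance; the function name and toy data are illustrative, not the paper's exact setup.

        import numpy as np

        def dtw_cost(synth, truth):
            """Cost of a dynamic time warp aligning two sequences of
            parameter vectors (rows = frames), using Euclidean frame distance."""
            n, m = len(synth), len(truth)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    d = np.linalg.norm(synth[i - 1] - truth[j - 1])
                    D[i, j] = d + min(D[i - 1, j],      # insertion
                                      D[i, j - 1],      # deletion
                                      D[i - 1, j - 1])  # match
            return D[n, m]

        # Toy example: a degraded, resampled copy of a ground-truth sequence.
        rng = np.random.default_rng(0)
        truth = rng.standard_normal((50, 4))            # 50 frames of 4-D parameters
        synth = truth[::2] + 0.1 * rng.standard_normal((25, 4))
        print(dtw_cost(synth, truth))                   # lower cost = closer to ground truth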

    Learning-based 3D human motion capture and animation synthesis

    Realistic virtual human avatars are a crucial element in a wide range of applications, from 3D animated movies to emerging AR/VR technologies. However, producing believable 3D motion for such avatars is widely known to be a challenging task. A traditional 3D human motion generation pipeline consists of several stages, each requiring expensive equipment and skilled human labor, limiting its use outside the entertainment industry despite its massive potential benefits. This thesis explores alternative solutions that reduce the complexity of the traditional 3D animation pipeline. To this end, it presents several novel ways to perform 3D human motion capture, synthesis, and control, focusing on learning-based methods that bypass the critical bottlenecks of the classical animation approach. First, a new 3D pose estimation method for in-the-wild monocular images is proposed, eliminating the need for the multi-camera setup of a traditional motion capture system. Second, the thesis explores several data-driven designs for believable 3D human motion synthesis and control that can potentially reduce the need for manual animation. In particular, the problem of speech-driven 3D gesture synthesis is chosen as the case study due to its uniquely ambiguous nature. Improved motion generation quality is achieved by introducing a novel adversarial objective that rates the difference between real and synthetic data. A novel motion generation strategy is also introduced that combines a classical database search algorithm with a powerful deep learning method, resulting in greater motion control variation than purely predictive counterparts. Furthermore, this thesis contributes a new way of collecting a large-scale 3D motion dataset through learning-based monocular estimation methods. This result demonstrates the promising capability of learning-based monocular approaches and the prospect of combining these learning-based modules into an integrated 3D animation framework. The presented learning-based solutions open the possibility of democratizing the traditional 3D animation pipeline by enabling it with low-cost equipment, e.g., a single RGB camera. Finally, this thesis discusses the potential further integration of these learning-based approaches to advance 3D animation technology.
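
    The abstract does not specify the adversarial objective beyond its rating of real versus synthetic motion, but the general shape of such an objective is standard. The sketch below is a generic GAN-style formulation in PyTorch; the discriminator architecture, tensor shapes, and names are illustrative assumptions, not the thesis's actual design.

        import torch
        import torch.nn as nn

        class MotionDiscriminator(nn.Module):
            """Scores the plausibility of a motion clip: a GRU over pose frames
            followed by a linear head producing one realism logit per clip."""
            def __init__(self, pose_dim=72, hidden=256):
                super().__init__()
                self.gru = nn.GRU(pose_dim, hidden, batch_first=True)
                self.head = nn.Linear(hidden, 1)

            def forward(self, motion):       # motion: (batch, frames, pose_dim)
                _, h = self.gru(motion)      # h: (layers, batch, hidden)
                return self.head(h[-1])      # (batch, 1) realism logits

        disc = MotionDiscriminator()
        bce = nn.BCEWithLogitsLoss()
        real = torch.randn(8, 120, 72)       # stand-ins for real motion-capture clips
        fake = torch.randn(8, 120, 72)       # stand-ins for synthesized clips

        # The discriminator learns to separate real from synthetic motion ...
        d_loss = bce(disc(real), torch.ones(8, 1)) + bce(disc(fake), torch.zeros(8, 1))
        # ... while the generator is trained to make its output score as real.
        g_loss = bce(disc(fake), torch.ones(8, 1))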

    Classification of soundscapes of urban public open spaces

    It is increasingly acknowledged by landscape architects and urban planners that the soundscape contributes significantly to the perception of urban public open spaces. Describing and classifying this impact, however, remains a challenge. This article presents a hierarchical classification method that distinguishes between backgrounded and foregrounded, disruptive and supportive, and finally calming and stimulating soundscapes. This four-class classification is applied to a growing collection of immersive audio-visual recordings of sound environments from around the world that can be explored using virtual reality playback. To validate the proposed methodology, an experiment involving 40 participants and 50 soundscape stimuli collected in urban public open spaces worldwide was conducted. The experiment showed that (1) virtual reality headset reproduction based on affordable spatial audio with 360-degree video recordings was perceived as ecologically valid in terms of realism and immersion; (2) the proposed classification method results in well-separated classes; (3) membership of these classes could be explained by physical parameters of both sound and vision. Moreover, models based on a limited number of acoustical indicators were constructed that could correctly classify a soundscape into each of the four proposed categories, with an accuracy exceeding 88% on an independent dataset.
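
    The hierarchical decision described above (backgrounded versus foregrounded, then disruptive versus supportive, then calming versus stimulating) can be made explicit in a few lines. In this sketch the three binary judgments are passed in as booleans; deriving them from acoustical and visual indicators is the article's actual contribution and is not reproduced here.

        from enum import Enum

        class Soundscape(Enum):
            BACKGROUNDED = "backgrounded"
            DISRUPTIVE = "disruptive"
            CALMING = "calming"
            STIMULATING = "stimulating"

        def classify(foregrounded: bool, supportive: bool, stimulating: bool) -> Soundscape:
            """Hierarchical four-class soundscape classification."""
            if not foregrounded:
                return Soundscape.BACKGROUNDED   # sound recedes behind other percepts
            if not supportive:
                return Soundscape.DISRUPTIVE     # foregrounded and unwanted
            return Soundscape.STIMULATING if stimulating else Soundscape.CALMING

        print(classify(foregrounded=True, supportive=True, stimulating=False))  # Soundscape.CALMING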

    Selective attention and speech processing in the cortex

    In noisy and complex environments, human listeners must segregate the mixture of sound sources arriving at their ears and selectively attend to a single source, thereby solving a computationally difficult problem known as the cocktail party problem. However, the neural mechanisms underlying these computations are still largely a mystery. Oscillatory synchronization of neuronal activity between cortical areas is thought to play a crucial role in facilitating information transmission between spatially separated populations of neurons, enabling the formation of functional networks. In this thesis, we seek to analyze and model the functional neuronal networks underlying attention to speech stimuli, and we find that the Frontal Eye Fields play a central 'hub' role in the auditory spatial attention network in a cocktail party experiment. We use magnetoencephalography (MEG) to measure neural signals with high temporal precision while sampling from the whole cortex. However, several methodological issues arise when undertaking functional connectivity analysis with MEG data. Specifically, volume conduction of electrical and magnetic fields in the brain complicates the interpretation of results. We compare several approaches through simulations and analyze the trade-offs among various measures of neural phase-locking in the presence of volume conduction. We use these insights to study functional networks in a cocktail party experiment. We then construct a linear dynamical system model of neural responses to ongoing speech. Using this model, we are able to correctly predict which of two speakers is being attended by a listener. We then apply this model to data from a task in which people attended to stories accompanied by synchronous or scrambled videos of the speakers' faces, to explore how the presence of visual information modifies the underlying neuronal mechanisms of speech perception. This model allows us to probe neural processes as subjects listen to long stimuli, without the need for a trial-based experimental design. We model the neural activity with latent states, and model the neural noise spectrum and functional connectivity with multivariate autoregressive dynamics, along with impulse responses for external stimulus processing. We also develop a new regularized Expectation-Maximization (EM) algorithm to fit this model to electroencephalography (EEG) data.
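
    As an illustration of the volume-conduction trade-off analyzed in the thesis, the sketch below contrasts two common phase-locking measures: the classic phase-locking value, which zero-lag (volume-conducted) coupling inflates, and its imaginary-part variant, which discards zero-lag contributions at the cost of sensitivity to genuine zero-lag coupling. These are representative measures, not necessarily the exact set compared in the thesis.

        import numpy as np
        from scipy.signal import hilbert

        def phase_locking_value(x, y):
            """Magnitude of the mean phase-difference vector (classic PLV)."""
            dphi = np.angle(hilbert(x)) - np.angle(hilbert(y))
            return np.abs(np.mean(np.exp(1j * dphi)))

        def imaginary_plv(x, y):
            """Imaginary part of the mean phase-difference vector; insensitive
            to the zero-lag coupling produced by volume conduction."""
            dphi = np.angle(hilbert(x)) - np.angle(hilbert(y))
            return np.abs(np.imag(np.mean(np.exp(1j * dphi))))

        # Two noisy 10 Hz signals coupled at a 90-degree lag: both measures
        # respond; at zero lag only the classic PLV would remain high.
        t = np.arange(0, 2, 1 / 500.0)
        x = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.randn(t.size)
        y = np.sin(2 * np.pi * 10 * t - np.pi / 2) + 0.5 * np.random.randn(t.size)
        print(phase_locking_value(x, y), imaginary_plv(x, y))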

    Facial Expression Analysis under Partial Occlusion: A Survey

    Automatic machine-based Facial Expression Analysis (FEA) has made substantial progress in the past few decades, driven by its importance for applications in psychology, security, health, entertainment, and human-computer interaction. The vast majority of completed FEA studies are based on non-occluded faces collected in a controlled laboratory environment. Automatic expression recognition tolerant to partial occlusion remains less understood, particularly in real-world scenarios. In recent years, efforts to investigate techniques for handling partial occlusion in FEA have increased, and the time is right for a comprehensive review of these developments and of the state of the art. This survey provides such a review of recent advances in dataset creation, algorithm development, and investigations of the effects of occlusion critical for robust performance in FEA systems. It outlines existing challenges in overcoming partial occlusion and discusses possible opportunities in advancing the technology. To the best of our knowledge, it is the first FEA survey dedicated to occlusion and aimed at promoting better informed and benchmarked future work.
    Comment: Authors' pre-print of the article accepted for publication in ACM Computing Surveys (accepted on 02-Nov-2017).

    Participant responses to virtual agents in immersive virtual environments

    This thesis is concerned with interaction between people and virtual humans in the context of highly immersive virtual environments (VEs). Empirical studies have shown that virtual humans (agents) with even minimal behavioural capabilities can have a significant emotional impact on participants of immersive virtual environments (IVEs), to the extent that they have been used in studies of mental health issues such as social phobia and paranoia. This thesis focuses on understanding how participants respond to the behaviour of virtual humans, rather than to their visual appearance. Three main research questions are addressed. First, what are the key nonverbal behavioural cues used to portray a specific psychological state? Second, to what extent is the underlying state of a virtual human recognisable through the display of a key set of cues inferred from the behaviour of real humans? Finally, to what degree does a perceived psychological state in a virtual human invoke responses from participants in immersive virtual environments that are similar to those observed in the physical world? These research questions were investigated through four experiments. The first experiment focused on the impact of visual fidelity and behavioural complexity on participant responses by implementing a model of gaze behaviour in virtual humans. The study concluded that participants expected more life-like behaviours from more visually realistic virtual humans. The second experiment investigated the detrimental effects on participant responses of interacting with virtual humans of low behavioural complexity. The third experiment investigated differences in participants' responses to virtual humans perceived to be in varying emotional states, portrayed using postural and facial cues. Results indicated that posture plays an important role in the portrayal of affect; however, the behavioural model used in the study did not fully cover the qualities of body movement associated with the emotions studied. The final experiment focused on the portrayal of affect through qualities of body movement such as the speed of gestures. The effectiveness of the virtual humans was gauged through a variety of participant responses, including subjective reports and objective physiological and behavioural measures. The results show that participants are affected by and respond to virtual humans in a significant manner, provided that an appropriate behavioural model is used.