Expressive Modulation of Neutral Visual Speech
Animated graphical models of the human face are commonplace in the film,
video-game, and television industries, appearing in everything from
low-budget advertisements and free mobile apps to Hollywood blockbusters
costing hundreds of millions of dollars. Generative statistical models of
animation attempt to address some of the drawbacks of industry-standard
practices, such as labour intensity and creative inflexibility.
This work describes one such method for transforming speech animation curves
between different expressive styles. Beginning with the assumption that
expressive speech animation is a mix of two components, a high-frequency
speech component (the content) and a much lower-frequency expressive
component (the style), we use Independent Component Analysis (ICA) to
identify and manipulate these components independently of one another. Next
we learn how the energy for different speaking styles is distributed in terms of
the low-dimensional independent components model. Transforming the
speaking style involves projecting new animation curves into the low-dimensional
ICA space, redistributing the energy in the independent
components, and finally reconstructing the animation curves by inverting the
projection.
We show that a single ICA model can be used for separating multiple expressive
styles into their component parts. Subjective evaluations show that viewers can
reliably identify the expressive style generated using our approach, and that they
have difficulty in identifying transformed animated expressive speech from the
equivalent ground-truth.
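The project/redistribute/invert pipeline described above can be sketched with plain linear algebra. In this minimal sketch an SVD basis stands in for the ICA unmixing matrix (a real system would use something like FastICA), and the per-component energy gains are hypothetical placeholders, not learned style statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "animation curves": 200 frames x 6 tracked facial parameters.
X = rng.normal(size=(200, 6))
Xc = X - X.mean(axis=0)

# Stand-in unmixing basis: SVD components (the paper uses ICA instead).
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
S = Xc @ Vt.T                  # project curves into the low-dimensional space

# Redistribute energy: hypothetical per-component gains for a target style.
gains = np.array([1.0, 1.0, 1.8, 0.6, 1.0, 1.0])
S_styled = S * gains

# Invert the projection to reconstruct restyled animation curves.
X_styled = S_styled @ Vt + X.mean(axis=0)
print(X_styled.shape)          # (200, 6)
```

With all gains equal to 1 the round trip reproduces the input exactly, which is a useful sanity check before redistributing energy.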
Multimodal Content Analysis for Effective Advertisements on YouTube
The rapid advances in e-commerce and Web 2.0 technologies have greatly
increased the impact of commercial advertisements on the general public. As a
key enabling technology, a multitude of recommender systems exist that
analyze user features and browsing patterns to recommend appealing
advertisements to users. In this work, we seek to study the attributes
that characterize an effective advertisement and recommend a useful
set of features to aid the designing and production processes of commercial
advertisements. We analyze the temporal patterns from multimedia content of
advertisement videos including auditory, visual and textual components, and
study their individual roles and synergies in the success of an advertisement.
The objective of this work is then to measure the effectiveness of an
advertisement, and to recommend a useful set of features to advertisement
designers to make it more successful and approachable to users. Our proposed
framework employs the signal processing technique of cross modality feature
learning where data streams from different components are employed to train
separate neural network models and are then fused together to learn a shared
representation. Subsequently, a neural network model trained on this joint
feature embedding representation is utilized as a classifier to predict
advertisement effectiveness. We validate our approach using subjective ratings
from a dedicated user study, the sentiment strength of online viewer comments,
and a viewer opinion metric of the ratio of the Likes and Views received by
each advertisement from an online platform.
Comment: 11 pages, 5 figures, ICDM 201
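The cross-modality fusion described above can be sketched as a single forward pass: one encoder per modality, a fusion layer over the concatenated codes, and a classifier head. All layer sizes and weights below are hypothetical, untrained placeholders to illustrate the data flow, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

# Hypothetical per-modality feature vectors for one advertisement.
audio, visual, text = rng.normal(size=40), rng.normal(size=64), rng.normal(size=30)

# Separate (untrained) encoder per modality, each mapping to 16 dims.
W_a, W_v, W_t = (rng.normal(size=(16, d)) * 0.1 for d in (40, 64, 30))
h = np.concatenate([relu(W_a @ audio), relu(W_v @ visual), relu(W_t @ text)])

# Fusion layer maps the 48-dim concatenation to a shared 8-dim representation.
W_f = rng.normal(size=(8, 48)) * 0.1
shared = relu(W_f @ h)

# Classifier head: probability the advertisement is "effective".
w_c = rng.normal(size=8)
p_effective = 1.0 / (1.0 + np.exp(-(w_c @ shared)))
print(round(float(p_effective), 3))
```

In the actual framework the encoders are neural networks trained jointly so that the shared embedding captures cross-modal structure; this sketch only shows where each stream enters and where fusion happens.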
Automatic Emotion Recognition: Quantifying Dynamics and Structure in Human Behavior.
Emotion is a central part of human interaction, one that has a huge influence on its overall tone and outcome. Today's human-centered interactive technology can greatly benefit from automatic emotion recognition, as the extracted affective information can be used to measure, transmit, and respond to user needs. However, developing such systems is challenging due to the complexity of emotional expressions and their dynamics, in terms of the inherent multimodality between audio and visual expressions as well as the mixed factors of modulation that arise when a person speaks. To overcome these challenges, this thesis presents data-driven approaches that can quantify the underlying dynamics in audio-visual affective behavior.

The first set of studies lays the foundation and central motivation of this thesis. We discover that it is crucial to model complex non-linear interactions between audio and visual emotion expressions, and that dynamic emotion patterns can be used in emotion recognition. Next, the understanding of the complex characteristics of emotion from the first set of studies leads us to examine multiple sources of modulation in audio-visual affective behavior. Specifically, we focus on how speech modulates facial displays of emotion. We develop a framework that uses speech signals, which alter the temporal dynamics of individual facial regions, to temporally segment and classify facial displays of emotion. Finally, we present methods to discover regions of emotionally salient events in given audio-visual data. We demonstrate that different modalities, such as the upper face, lower face, and speech, express emotion with different timings and time scales, varying for each emotion type. We further extend this idea into another aspect of human behavior: human action events in videos. We show how transition patterns between events can be used for automatically segmenting and classifying action events.

Our experimental results on audio-visual datasets show that the proposed systems not only improve performance, but also provide descriptions of how affective behaviors change over time. We conclude this dissertation with the future directions that will innovate three main research topics: machine adaptation for personalized technology, human-human interaction assistant systems, and human-centered multimedia content analysis.
PhD thesis, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies.
http://deepblue.lib.umich.edu/bitstream/2027.42/133459/1/yelinkim_1.pd
Mixed-EVC: Mixed Emotion Synthesis and Control in Voice Conversion
Emotional voice conversion (EVC) traditionally targets the transformation of
spoken utterances from one emotional state to another, with previous research
mainly focusing on discrete emotion categories. This paper departs from the
norm by introducing a novel perspective: a nuanced rendering of mixed emotions
and enhancing control over emotional expression. To achieve this, we propose a
novel EVC framework, Mixed-EVC, which only leverages discrete emotion training
labels. We construct an attribute vector that encodes the relationships among
these discrete emotions, which is predicted using a ranking-based support
vector machine and then integrated into a sequence-to-sequence (seq2seq) EVC
framework. Mixed-EVC not only learns to characterize the input emotional style
but also quantifies its relevance to other emotions during training. As a
result, users have the ability to assign these attributes to achieve their
desired rendering of mixed emotions. Objective and subjective evaluations
confirm the effectiveness of our approach in terms of mixed emotion synthesis
and control while surpassing traditional baselines in the conversion of
discrete emotions from one to another.
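The attribute-vector idea can be illustrated with a toy computation. Here the weight matrix stands in for the trained ranking-based SVM scoring functions, the softmax normalization is an assumption for illustration, and the embedding dimension is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
emotions = ["neutral", "happy", "sad", "angry"]

# Hypothetical acoustic embedding of an input utterance.
x = rng.normal(size=12)

# Stand-in ranking functions: one weight vector per emotion (the paper
# trains rank-SVMs; random weights here only illustrate the shapes).
W = rng.normal(size=(4, 12))
scores = W @ x

# Attribute vector: normalized relevance of the utterance to each emotion.
attr = np.exp(scores - scores.max())
attr /= attr.sum()

# A user can instead hand-assign attributes to render a mixed emotion,
# e.g. 70% happy with 30% sad, before conditioning the seq2seq decoder.
user_attr = np.array([0.0, 0.7, 0.3, 0.0])
print(dict(zip(emotions, np.round(attr, 3))))
```

The key point is that the attribute vector, whether predicted or user-assigned, becomes an extra conditioning input to the seq2seq conversion model.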
10 years of BAWLing into affective and aesthetic processes in reading: what are the echoes?
Reading is not only “cold” information processing, but involves affective and
aesthetic processes that go far beyond what current models of word
recognition, sentence processing, or text comprehension can explain. To
investigate such “hot” reading processes, standardized instruments that
quantify both psycholinguistic and emotional variables at the sublexical,
lexical, inter-, and supralexical levels (e.g., phonological iconicity, word
valence, arousal-span, or passage suspense) are necessary. One such
instrument, the Berlin Affective Word List (BAWL) has been used in over 50
published studies demonstrating effects of lexical emotional variables on all
relevant processing levels (experiential, behavioral, neuronal). In this
paper, we first present new data from several BAWL studies. Together, these
studies examine various views on affective effects in reading arising from
dimensional (e.g., valence) and discrete emotion features (e.g., happiness),
or embodied cognition features like smelling. Second, we extend our
investigation of the complex issue of affective word processing to words
characterized by a mixture of affects. These words entail positive and
negative valence, and/or features making them beautiful or ugly. Finally, we
discuss tentative neurocognitive models of affective word processing in the
light of the present results, raising new issues for future studies.
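Norm-based affective scoring of the kind the BAWL enables can be sketched as a simple lookup-and-average; the valence/arousal values below are invented placeholders, not real BAWL entries, and the function name is hypothetical:

```python
# Invented placeholder norms -- NOT actual BAWL values.
norms = {
    "sun":   {"valence":  2.1, "arousal": 2.4},
    "grave": {"valence": -2.3, "arousal": 3.1},
    "table": {"valence":  0.1, "arousal": 1.5},
}

def passage_affect(words):
    """Average valence/arousal over the words found in the norm list."""
    hits = [norms[w] for w in words if w in norms]
    if not hits:
        return None
    n = len(hits)
    return {k: sum(h[k] for h in hits) / n for k in ("valence", "arousal")}

print(passage_affect(["the", "sun", "and", "the", "grave"]))
# -> valence ≈ -0.1, arousal = 2.75
```

Supralexical variables such as passage suspense require models beyond this word-level averaging, but the lexical lookup is the building block.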
Sound synthesis for communicating nonverbal expressive cues
Non-verbal sounds (NVS) constitute an appealing communicative channel for transmitting a message during a dialog. They offer two main benefits: they are not tied to any particular language, and they can convey a message in a short time. NVS have been used successfully in robotics, cell phones, and science-fiction films. However, there are few in-depth studies on how to model NVS. For instance, most systems for NVS expression are ad hoc solutions that focus on communicating the most prominent emotion. Only a small number of papers have proposed a more general model or dealt directly with the expression of pure communicative acts, such as affirmation, denial, or greeting.

In this paper we propose a system, referred to as the sonic expression system (SES), that is able to generate NVS on the fly by adapting the sound to the context of the interaction. The system is designed to be used by social robots during human-robot interaction. It is based on a model that combines several acoustic features from the amplitude, frequency, and time spaces.

To evaluate the capabilities of the system, nine categories of communicative acts were created. By means of an online questionnaire, 51 participants classified the utterances according to their meaning: agreement, hesitation, denial, hush, question, summon, encouragement, greeting, and laughing. The results showed how very different NVS generated by our SES can be used for communication.
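One simple way amplitude/frequency/time parameters can be turned into a sound is a pitch glide under an amplitude envelope. This toy generator only illustrates the idea; the function, its parameters, and the "question-like" mapping are assumptions, not the SES model:

```python
import numpy as np

SR = 16000  # sample rate, Hz

def nvs_tone(duration, f_start, f_end, amp=0.5):
    """Pitch glide with a smooth amplitude envelope -- a toy stand-in
    for amplitude/frequency/time parameters of a nonverbal sound."""
    t = np.linspace(0.0, duration, int(SR * duration), endpoint=False)
    freq = np.linspace(f_start, f_end, t.size)   # linear pitch contour
    phase = 2 * np.pi * np.cumsum(freq) / SR     # integrate frequency
    env = np.sin(np.pi * t / duration)           # fade in and out
    return amp * env * np.sin(phase)

# A rising contour is a common cue for a question-like utterance.
question = nvs_tone(0.4, 300.0, 600.0)
print(question.shape)   # (6400,)
```

Falling contours, short repeated bursts, or irregular rhythms could be mapped to other communicative acts in the same parameter space.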
Stress recognition from speech signal
This doctoral thesis focuses on the development of algorithms for detecting psychological stress in the speech signal. The novelty of the work lies in two different analyses of the speech signal: the analysis of vowel polygons and the analysis of glottal pulses. A series of experiments showed that both fundamental analyses can be used for psychological stress detection in speech. The best results were obtained with the Closing-To-Opening phase ratio feature under the Top-To-Bottom criterion, combined with a suitably chosen classifier; stress detection based on this analysis can be regarded as language- and phoneme-independent, which the results confirm, reaching accuracies of up to 95% in some cases. All experiments were performed on a newly created Czech database containing real stress, and some experiments were also carried out on the English stress database SUSAS. The variety of potentially effective approaches to stress recognition in speech suggests that combining them could achieve very high recognition accuracy, or that they could be used to detect other speaker states, which has yet to be tested and verified on appropriate databases.
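The Closing-To-Opening phase ratio can be illustrated on a single idealized glottal pulse: the ratio of the closing-phase duration (after the flow peak) to the opening-phase duration (before it). This is a toy estimate on a synthetic pulse; a real implementation would first extract glottal pulses, e.g. via inverse filtering:

```python
import numpy as np

def closing_to_opening_ratio(pulse):
    """Toy Closing-To-Opening phase ratio for one glottal pulse:
    samples after the flow peak divided by samples before it."""
    peak = int(np.argmax(pulse))
    opening = peak                  # samples from onset to peak
    closing = len(pulse) - peak     # samples from peak to closure
    return closing / opening

# Hypothetical asymmetric pulse: slow opening phase, fast closing phase.
t = np.linspace(0.0, 1.0, 100, endpoint=False)
pulse = np.where(t < 0.8, t / 0.8, (1.0 - t) / 0.2)
print(round(closing_to_opening_ratio(pulse), 2))  # 0.25
```

Because the ratio depends only on pulse shape, not on the language or phoneme being spoken, features of this kind are plausible candidates for language-independent stress cues.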