Affective social anthropomorphic intelligent system
Human conversational style is characterised by sense of humor, personality, and tone of voice. These characteristics have become essential for conversational intelligent virtual assistants. However, most state-of-the-art intelligent virtual assistants (IVAs) fail to interpret the affective semantics of human voices. This research proposes an anthropomorphic intelligent system that can hold a proper human-like conversation with emotion and personality. A voice style transfer method is also proposed to map the attributes of a specific emotion. Initially, the frequency-domain data (Mel-spectrogram) is created by converting the temporal audio waveform; it comprises discrete patterns for audio features such as notes, pitch, rhythm, and melody. A collateral CNN-Transformer-Encoder is used to predict seven different affective states from the voice. The voice is also fed in parallel to DeepSpeech, an RNN model that generates the text transcription from the spectrogram. The transcribed text is then passed to a multi-domain conversation agent that uses blended skill talk, a transformer-based retrieve-and-generate strategy, and beam-search decoding to produce an appropriate textual response. For voice synthesis and style transfer, the system learns an invertible mapping of data to a latent space that can be manipulated, and it generates each Mel-spectrogram frame conditioned on the previous frames. Finally, the waveform is generated from the spectrogram using WaveGlow. The outcomes of the studies we conducted on the individual models were promising. Furthermore, users who interacted with the system provided positive feedback, demonstrating the system's effectiveness.
Comment: Multimedia Tools and Applications (2023)
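As a rough illustration of the first stage of this pipeline, the sketch below computes a log-scaled Mel-spectrogram from a raw waveform with librosa. The sample rate, FFT size, hop length, and number of Mel bands are illustrative assumptions, not the authors' settings, and the input file name is hypothetical.

```python
# Illustrative sketch of the waveform-to-Mel-spectrogram step described above.
# All parameter values (sr, n_fft, hop_length, n_mels) are assumptions, not
# the configuration used by the authors.
import librosa
import numpy as np

def waveform_to_mel(path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """Load an audio file and return a log-scaled Mel-spectrogram (n_mels x frames)."""
    y, sr = librosa.load(path, sr=sr)                    # temporal audio wave data
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)          # frequency-domain representation

# mel = waveform_to_mel("utterance.wav")  # hypothetical input file
# The resulting audio/spectrogram would then be fed in parallel to the emotion
# classifier and the speech-to-text model.
```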
Continuous Interaction with a Virtual Human
Attentive Speaking and Active Listening require that a Virtual Human be capable of simultaneous perception/interpretation and production of communicative behavior. A Virtual Human should be able to signal its attitude and attention while it is listening to its interaction partner, and be able to attend to its interaction partner while it is speaking – and modify its communicative behavior on-the-fly based on what it perceives from its partner. This report presents the results of a four-week summer project that was part of eNTERFACE’10. The project resulted in progress on several aspects of continuous interaction, such as scheduling and interrupting multimodal behavior, automatic classification of listener responses, generation of response-eliciting behavior, and models for appropriate reactions to listener responses. A pilot user study was conducted with ten participants. In addition, the project yielded a number of deliverables that have been released for public access.
Building and Designing Expressive Speech Synthesis
We know there is something special about speech. Our voices are not just a means of communicating. They also give a deep impression of who we are and what we might know. They can betray our upbringing, our emotional state, our state of health. They can be used to persuade and convince, to calm and to excite. As speech systems enter the social domain they are required to interact, support and mediate our social relationships with 1) each other, 2) with digital information, and, increasingly, 3) with AI-based algorithms and processes. Socially Interactive Agents (SIAs) are at the forefront of research and innovation in this area.
There is an assumption that in the future “spoken language will provide a natural conversational interface between human beings and so-called intelligent systems.” [Moore 2017, p. 283]. A considerable amount of previous research has tested this assumption with mixed results. However, as has been pointed out, “voice interfaces have become notorious for fostering frustration and failure” [Nass and Brave 2005, p. 6]. It is within this context, between our exceptional and intelligent human use of speech to communicate and interact with other humans, and our desire to leverage this means of communication for artificial systems, that the technology often termed expressive speech synthesis uncomfortably falls. Uncomfortably, because it is often overshadowed by issues in interactivity and the underlying intelligence of the system, which is something that emerges from the interaction of many of the components in a SIA. This is especially true of what we might term conversational speech, where decoupling how things are spoken from when and to whom they are spoken can seem an impossible task. This is an even greater challenge in evaluation and in characterising full systems which have made use of expressive speech. Furthermore, when designing an interaction with a SIA, we must consider not only how SIAs should speak but also how much, and whether they should even speak at all.
These considerations cannot be ignored. Any speech synthesis that is used in the context of an artificial agent will have a perceived accent, a vocal style, an underlying emotion and an intonational model. Dimensions like accent and personality (cross-speaker parameters), as well as vocal style, emotion and intonation during an interaction (within-speaker parameters), need to be built into the design of a synthetic voice. Even a default or neutral voice has to consider these same expressive speech synthesis components. Such design parameters have a strong influence on how effectively a system will interact, how it is perceived, and its assumed ability to perform a task or function. To ignore these is to blindly accept a set of design decisions that ignores the complex effect speech has on the user’s successful interaction with a system. Thus expressive speech synthesis is a key design component in SIAs.
This chapter explores the world of expressive speech synthesis, aiming to act as a starting point for those interested in the design, building and evaluation of such artificial speech. The debates and literature within this topic are vast and fundamentally multidisciplinary in focus, covering a wide range of disciplines such as linguistics, pragmatics, psychology, speech and language technology, robotics and human-computer interaction (HCI), to name a few.
It is not our aim to synthesise these areas but to give a scaffold and a starting point for the reader by exploring the critical dimensions and decisions they may need to consider when choosing to use expressive speech. To do this, the chapter explores the building of expressive synthesis, highlighting key decisions and parameters as well as emphasising future challenges in expressive speech research and development. Yet, before these are expanded upon, we must first try to define what we actually mean by expressive speech.
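To make the cross-speaker / within-speaker distinction above concrete, one might separate the fixed voice design from the per-utterance expressive state, as in the purely illustrative Python sketch below; the field names and example values are assumptions for illustration, not a scheme proposed in this chapter.

```python
# Purely illustrative sketch of the design-parameter split discussed above:
# cross-speaker parameters fix the voice identity once, within-speaker
# parameters vary per utterance during an interaction. Names and values
# are assumptions, not taken from the chapter.
from dataclasses import dataclass

@dataclass(frozen=True)
class VoiceDesign:                # cross-speaker parameters: chosen once per agent
    accent: str                   # e.g. a regional accent
    personality: str              # e.g. "warm", "formal"

@dataclass
class UtteranceStyle:             # within-speaker parameters: set per utterance
    vocal_style: str              # e.g. "conversational", "read"
    emotion: str                  # e.g. "neutral", "excited"
    intonation: str               # e.g. "rising", "falling"

agent_voice = VoiceDesign(accent="Irish English", personality="warm")
reply_style = UtteranceStyle(vocal_style="conversational", emotion="neutral",
                             intonation="falling")
```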
Expression of gender in the human voice: investigating the “gender code”
We can easily and reliably identify the gender of an unfamiliar interlocutor over
the telephone. This is because our voice is “sexually dimorphic”: men typically speak
with a lower fundamental frequency (F0 - lower pitch) and lower vocal tract resonances
(ΔF – “deeper” timbre) than women. While the biological bases of these differences are
well understood, and mostly down to size differences between men and women, very
little is known about the extent to which we can play with these differences to
accentuate or de-emphasise our perceived gender, masculinity and femininity in a range
of social roles and contexts.
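As a minimal sketch of how these two cues might be quantified, the code below estimates mean F0 with librosa's pYIN implementation and computes ΔF as the average spacing between formant frequencies supplied by an external formant tracker; the parameter choices and the rough reference ranges in the comments are assumptions for illustration, not the measurement protocol of this thesis.

```python
# Minimal sketch of measuring the two sexually dimorphic cues discussed above.
# Parameter values and the reference ranges in the comments are assumptions,
# not the thesis's measurement protocol.
import librosa
import numpy as np

def mean_f0(path):
    """Estimate mean fundamental frequency (F0, Hz) of a recording with pYIN."""
    y, sr = librosa.load(path, sr=None)
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)
    return float(np.nanmean(f0))            # unvoiced frames are NaN and ignored

def formant_spacing(formants_hz):
    """Estimate ΔF (Hz) as the mean spacing between consecutive formants F1..Fn.

    Formant frequencies would come from a separate formant tracker (e.g. Praat)."""
    return float(np.mean(np.diff(sorted(formants_hz))))

# Typical adult male voices tend to show lower values on both measures than
# typical adult female voices (roughly 100-130 Hz vs 200-220 Hz mean F0 as a
# textbook ballpark; these figures are illustrative, not results from the thesis).
```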
The general aim of this thesis is to investigate the behavioural basis of gender
expression in the human voice in both children and adults. More specifically, I
hypothesise that, on top of the biologically determined sexual dimorphism, humans use
a “gender code” consisting of vocal gestures (global F0 and ΔF adjustments) aimed at
altering the gender attributes conveyed by their voice. In order to test this hypothesis, I
first explore how acoustic variation of sexually dimorphic acoustic cues (F0 and ΔF)
relates to physiological differences in pre-pubertal speakers (vocal tract length) and
adult speakers (body height and salivary testosterone levels), and show that voice
gender variation cannot be solely explained by static, biologically determined
differences in vocal apparatus and body size of speakers. Subsequently, I show that both
children and adult speakers can spontaneously modify their voice gender by lowering
(raising) F0 and ΔF to masculinise (feminise) their voice, a key ability for the
hypothesised control of voice gender. Finally, I investigate the interplay between voice
gender expression and social context in relation to cultural stereotypes. I report that
listeners spontaneously integrate stereotypical information in the auditory and visual
domain to make stereotypical judgments about children’s gender and that adult actors
manipulate their gender expression in line with stereotypical gendered notions of
homosexuality. Overall, this corpus of data supports the existence of a “gender code” in
human nonverbal vocal communication. This “gender code” provides not only a
methodological framework with which to empirically investigate variation in voice
gender and its role in expressing gender identity, but also a unifying theoretical
structure to understand the origins of such variation from both evolutionary and social
perspectives.