Building and Designing Expressive Speech Synthesis

Author: Leigh Clark
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2021

We know there is something special about speech. Our voices are not just a means of communicating. They also give a deep impression of who we are and what we might know. They can betray our upbringing, our emotional state, our state of health. They can be used to persuade and convince, to calm and to excite. As speech systems enter the social domain they are required to interact, support and mediate our social relationships with 1) each other, 2) digital information, and, increasingly, 3) AI-based algorithms and processes. Socially Interactive Agents (SIAs) are at the forefront of research and innovation in this area. There is an assumption that in the future “spoken language will provide a natural conversational interface between human beings and so-called intelligent systems” [Moore 2017, p. 283]. A considerable amount of previous research work has tested this assumption with mixed results. However, as has been pointed out, “voice interfaces have become notorious for fostering frustration and failure” [Nass and Brave 2005, p. 6]. It is within this context, between our exceptional and intelligent human use of speech to communicate and interact with other humans, and our desire to leverage this means of communication for artificial systems, that the technology often termed expressive speech synthesis uncomfortably falls. Uncomfortably, because it is often overshadowed by issues in interactivity and the underlying intelligence of the system, which is something that emerges from the interaction of many of the components in a SIA. This is especially true of what we might term conversational speech, where decoupling how things are spoken from when and to whom they are spoken can seem an impossible task. This is an even greater challenge in evaluation and in characterising full systems which have made use of expressive speech.
Furthermore, when designing an interaction with a SIA, we must not only consider how SIAs should speak but how much, and whether they should even speak at all. These considerations cannot be ignored. Any speech synthesis that is used in the context of an artificial agent will have a perceived accent, a vocal style, an underlying emotion and an intonational model. Dimensions like accent and personality (cross-speaker parameters) as well as vocal style, emotion and intonation during an interaction (within-speaker parameters) need to be built into the design of a synthetic voice. Even a default or neutral voice has to consider these same expressive speech synthesis components. Such design parameters have a strong influence on how effectively a system will interact, how it is perceived and its assumed ability to perform a task or function. To ignore these is to blindly accept a set of design decisions that ignores the complex effect speech has on the user’s successful interaction with a system. Thus, expressive speech synthesis is a key design component in SIAs. This chapter explores the world of expressive speech synthesis, aiming to act as a starting point for those interested in the design, building and evaluation of such artificial speech. The debates and literature within this topic are vast and are fundamentally multidisciplinary in focus, covering a wide range of disciplines such as linguistics, pragmatics, psychology, speech and language technology, robotics and human-computer interaction (HCI), to name a few. It is not our aim to synthesise these areas but to give a scaffold and a starting point for the reader by exploring the critical dimensions and decisions they may need to consider when choosing to use expressive speech. To do this, the chapter explores the building of expressive synthesis, highlighting key decisions and parameters as well as emphasising future challenges in expressive speech research and development.
Yet, before these are expanded upon, we must first try to define what we actually mean by expressive speech.

Cronfa at Swansea University

Multilingual and Multimodal Corpus-Based Text-to-Speech System - PLATTOS -

Author: Izidor Mlakar
Matej Rojc
Publication venue: 'IntechOpen'
Publication date: 21/06/2011

IntechOpen

Digital library of University of Maribor

Expressing Robot Personality through Talking Body Language

Author: Lazkano Ortega Elena
Martínez Otzeta José María
Rodríguez Rodríguez Igor
Zabala Cristóbal Unai
Publication venue: 'MDPI AG'
Publication date: 19/05/2021

Social robots must master the nuances of human communication as a means to convey an effective message and generate trust. It is well known that non-verbal cues are very important in human interactions, and therefore a social robot should produce body language coherent with its discourse. In this work, we report on a system that endows a humanoid robot with the ability to adapt its body language according to the sentiment of its speech. A combination of talking beat gestures with emotional cues such as eye lighting, body posture, or voice intonation and volume permits a rich variety of behaviors. The developed approach is not purely reactive, and it makes it easy to assign a kind of personality to the robot. We present several videos of the robot in two different scenarios, showing discrete and histrionic personalities. This work has been partially supported by the Basque Government (IT900-16 and Elkartek 2018/00114) and the Spanish Ministry of Economy and Competitiveness (RTI 2018-093337-B-100, MINECO/FEDER, EU).

Multidisciplinary Digital Publishing Institute

Archivo Digital para la Docencia y la Investigación

Recruitment of Language-, Emotion- and Speech-Timing Associated Brain Regions for Expressing Emotional Prosody: Investigation of Functional Neuroanatomy with fMRI

Author: Agnieszka Jazdzyk
Manuela Stets
Rachel L. C. Mitchell
Sonja A. Kotz
Publication venue: 'Frontiers Media SA'
Publication date: 01/01/2016

We aimed to progress understanding of prosodic emotion expression by establishing brain regions active when expressing specific emotions, those activated irrespective of the target emotion, and those whose activation intensity varied depending on individual performance. BOLD contrast data were acquired whilst participants spoke nonsense words in happy, angry or neutral tones, or performed jaw movements. Emotion-specific analyses demonstrated that when expressing angry prosody, activated brain regions included the inferior frontal and superior temporal gyri, the insula, and the basal ganglia. When expressing happy prosody, the activated brain regions also included the superior temporal gyrus, insula, and basal ganglia, with additional activation in the anterior cingulate. Conjunction analysis confirmed that the superior temporal gyrus and basal ganglia were activated regardless of the specific emotion concerned. Nevertheless, disjunctive comparisons between the expression of angry and happy prosody established that anterior cingulate activity was significantly higher for angry prosody than for happy prosody production. Degree of inferior frontal gyrus activity correlated with the ability to express the target emotion through prosody. We conclude that expressing prosodic emotions (vs. neutral intonation) requires generic brain regions involved in comprehending numerous aspects of language, emotion-related processes such as experiencing emotions, and in the time-critical integration of speech information.

University of Essex Research Repository

Maastricht University Research Portal

Crossref

Directory of Open Access Journals

Frontiers - Publisher Connector

PubMed Central

King's Research Portal

Adapting Prosody in a Text-to-Speech System

Author: Caglayan Erdem
Janez Stergar
Publication venue: 'IntechOpen'
Publication date: 02/11/2010

IntechOpen

SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer

Author: Choi Yong-Hoon
Hong Seongho
Kim Daegyeom
Publication date: 19/07/2023

Expressive speech synthesis models are trained by adding corpora with diverse speakers, various emotions, and different speaking styles to the dataset, in order to control various characteristics of speech and generate the desired voice. In this paper, we propose a style control (SC) VALL-E model based on the neural codec language model (called VALL-E), which follows the structure of the generative pretrained transformer 3 (GPT-3). The proposed SC VALL-E takes text sentences and prompt audio as input and is designed to generate controllable speech, not by simply mimicking the characteristics of the prompt audio but by controlling the attributes to produce diverse voices. We identify tokens in the style embedding matrix of the newly designed style network that represent attributes such as emotion, speaking rate, pitch, and voice intensity, and design a model that can control these attributes. To evaluate the performance of SC VALL-E, we conduct comparative experiments with three representative expressive speech synthesis models: global style token (GST) Tacotron2, variational autoencoder (VAE) Tacotron2, and original VALL-E. We measure word error rate (WER), F0 voiced error (FVE), and F0 gross pitch error (F0GPE) as evaluation metrics to assess the accuracy of generated sentences. To compare the quality of synthesized speech, we measure comparative mean opinion score (CMOS) and similarity mean opinion score (SMOS). To evaluate the style control ability of the generated speech, we observe the changes in F0 and mel-spectrogram by modifying the trained tokens. When using prompt audio that is not present in the training data, SC VALL-E generates a variety of expressive sounds and demonstrates competitive performance compared to the existing models. Our implementation, pretrained models, and audio samples are located on GitHub.
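
The abstract above names word error rate (WER) as one of its accuracy metrics but does not say how it was computed. As an illustration only, WER is commonly defined as the word-level edit distance between a reference transcript and a recognized transcript of the synthesized audio, divided by the number of reference words. The sketch below follows that common definition; the function name `word_error_rate` and the whitespace tokenization are assumptions for this example, not the authors' implementation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split, with no text normalization applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of six reference words is dropped in the hypothesis.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

In practice the hypothesis would come from running a speech recognizer on the synthesized audio, and both strings would usually be normalized (case, punctuation) before scoring.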

arXiv.org e-Print Archive