587 research outputs found
Advanced Content and Interface Personalization through Conversational Behavior and Affective Embodied Conversational Agents
Conversation is becoming one of the key interaction modes in HMI. As a result, conversational agents (CAs) have become an important tool in various everyday scenarios. From Apple and Microsoft to Amazon, Google, and Facebook, all have adopted their own variations of CAs. The CAs range from chatbots and 2D, cartoon-like implementations of talking heads to fully articulated embodied conversational agents performing interaction in various contexts. Recent studies in the field of face-to-face conversation show that the most natural way to implement interaction is through synchronized verbal and co-verbal signals (gestures and expressions). Namely, co-verbal behavior represents a major source of discourse cohesion. It regulates communicative relationships and may support or even replace verbal counterparts. It effectively retains the semantics of the information and gives a certain degree of clarity to the discourse. In this chapter, we present a model of generation and realization of more natural machine-generated output
Expressivity in TTS from Semantics and Pragmatics
In this paper we present ongoing work to produce an expressive TTS reader that can be used both in text and dialogue applications. The system, called SPARSAR, has been used to read (English) poetry so far, but it can now be applied to any text. The text is fully analyzed both at the phonetic and phonological level and at the syntactic and semantic level. In addition, the system has access to a restricted list of typical pragmatically marked phrases and expressions that are used to convey specific discourse functions and speech acts and need specialized intonational contours. The text is transformed into a poem-like structure, where each line corresponds to a Breath Group that is semantically and syntactically consistent. Stanzas correspond to paragraph boundaries. Analogical parameters are related to ToBI theoretical indices, but their number is doubled. In this paper, we concentrate on short stories and fables
Controllable Emphasis with zero data for text-to-speech
We present a scalable method to produce high-quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech consists in increasing the predicted duration of the emphasized word. We show that this is significantly better than spectrogram modification techniques, improving naturalness and testers' correct identification of the emphasized word in a sentence on a reference female en-US voice. We show that this technique significantly closes the gap to methods that require explicit recordings. The method proved to be scalable and preferred in all four languages tested (English, Spanish, Italian, German), for different voices and multiple speaking styles.
Comment: In proceedings of the 12th Speech Synthesis Workshop (SSW) 202
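The duration-based emphasis described above can be sketched as follows. The function name, the frame-level representation, and the 1.5x scale factor are illustrative assumptions, not values taken from the paper:

```python
def emphasize_durations(phoneme_durations, word_spans, target_word_idx, scale=1.5):
    """Scale the predicted phoneme durations of one word to create emphasis.

    phoneme_durations: per-phoneme durations in frames (duration model output).
    word_spans: list of (start, end) phoneme index ranges, one per word.
    target_word_idx: index of the word to emphasize.
    scale: duration multiplier (the 1.5 here is an assumed example value).
    """
    start, end = word_spans[target_word_idx]
    return [
        d * scale if start <= i < end else d
        for i, d in enumerate(phoneme_durations)
    ]

# Example: emphasize the second word of a two-word utterance.
durs = [5, 7, 6, 8, 4]           # predicted frames per phoneme
spans = [(0, 2), (2, 5)]         # word 0 = phonemes 0-1, word 1 = phonemes 2-4
print(emphasize_durations(durs, spans, 1))  # [5, 7, 9.0, 12.0, 6.0]
```

The stretched durations are then fed to the acoustic model in place of the predicted ones, which is what makes the approach annotation-free.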
Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis
The spontaneous behavior that often occurs in conversations makes speech more human-like compared to reading-style speech. However, synthesizing spontaneous-style speech is challenging due to the lack of high-quality spontaneous datasets and the high cost of labeling spontaneous behavior. In this paper, we propose a semi-supervised pre-training method to increase the amount of spontaneous-style speech and spontaneous behavior labels. In the process of semi-supervised learning, both text and speech information are considered for detecting spontaneous behavior labels in speech. Moreover, a linguistic-aware encoder is used to model the relationship between the sentences in the conversation. Experimental results indicate that our proposed method achieves superior expressive speech synthesis performance, with the ability to model spontaneous behavior in spontaneous-style speech and predict reasonable spontaneous behavior from text.
Comment: Accepted by INTERSPEECH 202
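A common building block of semi-supervised labeling like the above is confidence-thresholded pseudo-labeling: a detector trained on the small labeled subset labels the unlabeled speech, and only confident predictions are kept for the next training round. The function below is a minimal sketch of that filtering step; the detector interface and the 0.9 threshold are assumptions, not the paper's exact design:

```python
def pseudo_label(detector, unlabeled, threshold=0.9):
    """Keep only confident pseudo-labels for spontaneous-behavior detection.

    detector: callable mapping an utterance to P(behavior present).
    unlabeled: iterable of utterances (any representation the detector accepts).
    Returns (utterance, bool_label) pairs for confident predictions only;
    uncertain utterances are dropped rather than mislabeled.
    """
    confident = []
    for utt in unlabeled:
        prob = detector(utt)
        if prob >= threshold or prob <= 1 - threshold:
            confident.append((utt, prob >= threshold))
    return confident

# Toy example: utterances represented directly by their detector score.
print(pseudo_label(lambda u: u, [0.95, 0.5, 0.05]))
# [(0.95, True), (0.05, False)]
```

The retained pairs would then be added to the labeled pool before retraining, growing the spontaneous-style training data iteratively.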
Context-aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis
Recent advances in text-to-speech have significantly improved the expressiveness of synthesized speech. However, it is still challenging to generate speech with a contextually appropriate and coherent speaking style for multi-sentence text in audiobooks. In this paper, we propose a context-aware coherent speaking style prediction method for audiobook speech synthesis. To predict the style embedding of the current utterance, a hierarchical transformer-based context-aware style predictor with a mixture attention mask is designed, considering both text-side context information and speech-side style information of previous speech. Based on this, we can generate long-form speech with coherent style and prosody sentence by sentence. Objective and subjective evaluations on a Mandarin audiobook dataset demonstrate that our proposed model can generate speech with a more expressive and coherent speaking style than baselines, for both single-sentence and multi-sentence tests.
Comment: Accepted by ICASSP 202
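One way to read the "mixture attention mask" idea is that, when predicting the style of the current sentence, the predictor may attend to the text of every sentence in the context window but only to the speech-side style of sentences already synthesized. The sketch below illustrates that masking pattern; the key layout and visibility rule are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def context_mask(n_sentences, current):
    """Boolean attention mask for predicting the style of sentence `current`.

    Keys are laid out as [text_0..text_{n-1}, style_0..style_{n-1}].
    Text context is visible for every sentence in the window; speech-side
    style context is visible only for already-synthesized sentences
    (index < current), since the current sentence's speech does not exist yet.
    """
    text_visible = np.ones(n_sentences, dtype=bool)
    style_visible = np.arange(n_sentences) < current
    return np.concatenate([text_visible, style_visible])

# Predicting sentence 2 of a 4-sentence window:
# all 4 text keys visible, style keys visible only for sentences 0 and 1.
print(context_mask(4, 2))
```

Applying such a mask sentence by sentence is what allows long-form synthesis to stay stylistically coherent without peeking at speech that has not been generated.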
Perception of prosodic variation for speech synthesis using an unsupervised discrete representation of F0
In English, prosody adds a broad range of information to segment sequences, from information structure (e.g. contrast) to stylistic variation (e.g. expression of emotion). However, when learning to control prosody in text-to-speech voices, it is not clear what exactly the control is modifying. Existing research on discrete representation learning for prosody has demonstrated high naturalness, but no analysis has been performed on what these representations capture, or whether they can generate meaningfully distinct variants of an utterance. We present a phrase-level variational autoencoder with a multi-modal prior, using the mode centres as "intonation codes". Our evaluation establishes which intonation codes are perceptually distinct, finding that the intonation codes from our multi-modal latent model were significantly more distinct than a baseline using k-means clustering. We carry out a follow-up qualitative study to determine what information the codes are carrying. Most commonly, listeners commented on the intonation codes having a statement or question style. However, many other affect-related styles were also reported, including: emotional, uncertain, surprised, sarcastic, passive-aggressive, and upset.
Comment: Published at the 10th ISCA International Conference on Speech Prosody (SP2020)
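At inference time, using mode centres as intonation codes amounts to picking (or assigning) the nearest centre in the latent space. The sketch below shows that nearest-code assignment step; the 3-dimensional contours and the hand-written code vectors are toy assumptions, whereas in the paper the centres come from the learned multi-modal prior:

```python
import numpy as np

def assign_intonation_code(latent, mode_centres):
    """Map a phrase-level latent (e.g. an F0-derived representation) to the
    index of its nearest intonation code by Euclidean distance.

    mode_centres: array of shape (K, D), one row per mode centre.
    latent: array of shape (D,).
    """
    distances = np.linalg.norm(mode_centres - latent, axis=1)
    return int(np.argmin(distances))

# Toy 3-point "contours": values loosely standing in for start/mid/end pitch.
codes = np.array([
    [0.0, 0.5, 1.0],   # rising (question-like)
    [1.0, 0.5, 0.0],   # falling (statement-like)
    [0.5, 1.0, 0.5],   # mid peak
])
print(assign_intonation_code(np.array([0.9, 0.6, 0.1]), codes))  # 1
```

The perceptual study's k-means baseline differs only in where the centres come from: cluster centroids rather than the modes of a learned latent prior.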
Towards Hierarchical Prosodic Prominence Generation in TTS Synthesis
We address the problem of identification (from text) and generation of pitch accents in HMM-based English TTS synthesis. We show, through a large-scale perceptual test, that a large improvement of the binary discrimination between pitch-accented and non-accented words has no effect on the quality of the speech generated by the system. On the other hand, adding a third accent type that emphatically marks words conveying "contrastive" focus (automatically identified from text) produces beneficial effects on the synthesized speech. These results support the accounts of prosodic prominence that consider the prosodic patterns of utterances as hierarchically structured, and point out the limits of a flattening of such structure resulting from a simple accent/non-accent distinction. Index Terms: speech synthesis, HMM, pitch accents, focus detection
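The move from a binary to a three-way prominence scheme can be stated very compactly; the labeling function below is a minimal sketch of that hierarchy (the integer encoding and function name are illustrative assumptions):

```python
def accent_label(is_accented, is_contrastive):
    """Three-way prominence label for a word, from text-derived features.

    0 = unaccented, 1 = pitch accent, 2 = emphatic accent (contrastive focus).
    Contrastive focus outranks the plain accent/non-accent distinction,
    reflecting the hierarchical view of prosodic prominence.
    """
    if is_contrastive:
        return 2
    return 1 if is_accented else 0

# A binary scheme would collapse labels 1 and 2, flattening the hierarchy.
print([accent_label(a, c) for a, c in
       [(False, False), (True, False), (True, True)]])  # [0, 1, 2]
```

In an HMM-based system, each label would select a different context-dependent model set at synthesis time.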
RRL: A Rich Representation Language for the Description of Agent Behaviour in NECA
In this paper, we describe the Rich Representation Language (RRL) which is used in the NECA system. The NECA system generates interactions between two or more animated characters. The RRL is a formal framework for representing the information that is exchanged at the interfaces between the various NECA system modules