An ongoing review of speech emotion recognition
Recognition of the user's emotional state is becoming a key feature in advanced Human-Computer Interfaces (HCI). A key source of emotional information is spoken expression, which may be part of the interaction between the human and the machine. Speech emotion recognition (SER) is a very active area of research that involves the application of current machine learning and neural network tools. This ongoing review covers recent and classical approaches to SER reported in the literature. This work has been carried out with the support of project PID2020-116346GB-I00 funded by the Spanish MICIN.
Hidden Markov Models
Hidden Markov Models (HMMs), although known for decades, are widely used today and are still under active development. This book presents theoretical issues and a variety of HMM applications in speech recognition and synthesis, medicine, neuroscience, computational biology, bioinformatics, seismology, environmental protection and engineering. I hope that the reader will find this book useful and helpful for their own research.
Identifying prosodic prominence patterns for English text-to-speech synthesis
This thesis proposes to improve and enrich the expressiveness of English Text-to-Speech (TTS) synthesis by identifying and generating natural patterns of prosodic prominence.
In most state-of-the-art TTS systems, predicting from text the prosodic prominence relations between the words of an utterance relies on features that only loosely account for the combined effects of syntax, semantics, word informativeness and salience on prosodic prominence.
To improve prosodic prominence prediction, we first follow the classic approach in which prosodic prominence patterns are flattened into binary sequences of pitch-accented and pitch-unaccented words. We propose and motivate statistical and syntactic-dependency-based features that complement the most predictive features proposed in previous work on automatic pitch accent prediction, and we show their utility on both read and spontaneous speech.
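As a minimal sketch of the flattening step, assuming ToBI-style accent labels (the label inventory and the name `flatten_prominence` are hypothetical, not taken from the thesis), every word carrying any pitch accent becomes 1 and every other word becomes 0, so distinctions between accent types are discarded:

```python
from typing import Optional, Sequence

# Hypothetical ToBI-style pitch accent inventory (not the thesis's own label set).
PITCH_ACCENTS = {"H*", "!H*", "L*", "L+H*", "L+!H*", "L*+H", "L*+!H", "H+!H*"}

def flatten_prominence(labels: Sequence[Optional[str]]) -> list[int]:
    """Collapse per-word accent labels into 1 (pitch-accented) or 0 (unaccented).

    `None`, or any label outside PITCH_ACCENTS, marks a word without a pitch accent.
    """
    return [1 if label in PITCH_ACCENTS else 0 for label in labels]

words = ["Marianna", "made", "the", "marmalade"]
labels = ["L+H*", None, None, "H*"]
print(list(zip(words, flatten_prominence(labels))))
# → [('Marianna', 1), ('made', 0), ('the', 0), ('marmalade', 1)]
```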
Different accentuation patterns can be associated with the same sentence. Such variability raises the question of how to evaluate pitch accent predictors when more than one pattern is acceptable. We carry out a study of the variability of prosodic symbols on a speech corpus in which different speakers read the same text, and we propose an information-theoretic definition of the optionality of symbolic prosodic events that leads to a novel evaluation metric in which prosodic variability is incorporated as a factor affecting prediction accuracy. We additionally propose a method to take advantage of the optionality of prosodic events in unit-selection speech synthesis.
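The abstract does not spell out the thesis's definition, so the following is only a hypothetical illustration: one information-theoretic way to quantify the optionality of a binary accent event is the entropy of its labels across speakers reading the same text, and one way to fold that variability into evaluation is to credit each predicted label by the fraction of speakers who produced it. Both `optionality` and `expected_agreement` are illustrative names and metrics, not the author's, and the speaker labels are invented toy data.

```python
import math
from typing import Sequence

def optionality(labels: Sequence[int]) -> float:
    """Binary entropy (in bits) of one word's accent labels across speakers.

    0.0: all speakers agree (the accent is obligatory or absent for everyone);
    1.0: speakers split evenly (the accent is fully optional).
    """
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_agreement(pred: Sequence[int], speakers: Sequence[Sequence[int]]) -> float:
    """Mean over words of the fraction of speakers whose label matches the prediction.

    `speakers[s][i]` is speaker s's binary label for word i. A prediction on an evenly
    split word earns 0.5 either way, while one on an obligatory word earns 0 or 1.
    """
    total = 0.0
    for i, label in enumerate(pred):
        column = [spk[i] for spk in speakers]
        total += sum(1 for lab in column if lab == label) / len(column)
    return total / len(pred)

# Toy labels for 4 speakers reading the same 4-word sentence (invented for illustration).
speakers = [
    [1, 0, 0, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
print([round(optionality(col), 3) for col in zip(*speakers)])  # → [0.0, 0.811, 1.0, 0.0]
print(expected_agreement([1, 0, 1, 1], speakers))               # → 0.8125
```

Under this toy metric, a predictor is penalized fully only on words where speakers agree, and only partially on words whose accent is optional.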
To better account for the tight links between the prosodic prominence of a word and the discourse/sentence context, part of this thesis goes beyond the accent/no-accent dichotomy and is devoted to a novel task, the automatic detection of contrast, where contrast is understood as an Information Structure relation that ties two words that explicitly contrast with each other. This task is mainly motivated by the fact that contrastive words tend to be prosodically marked with particularly prominent pitch accents.
The identification of contrastive word pairs is achieved by combining lexical information, syntactic information (which mainly aims to identify the syntactic parallelism that often activates contrast) and semantic information (mainly drawn from the WordNet semantic lexicon) within a Support Vector Machine classifier.
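A toy sketch of the feature-combination idea, assuming dependency-parsed tokens and substituting a tiny hand-written hypernym table for WordNet (`Token`, `TOY_HYPERNYMS`, `pair_features` and the four training pairs are all hypothetical). The classifier is a from-scratch linear soft-margin SVM trained by full-batch subgradient descent on the regularized hinge loss, standing in for whatever SVM implementation the thesis actually used:

```python
from dataclasses import dataclass

@dataclass
class Token:
    lemma: str
    pos: str     # part-of-speech tag
    deprel: str  # dependency relation to the token's head

# Hypothetical stand-in for WordNet: lemma -> set of hypernym labels.
TOY_HYPERNYMS = {
    "cat": {"animal"}, "dog": {"animal"},
    "red": {"color"}, "blue": {"color"},
    "eat": {"consume"}, "table": {"furniture"},
}

def pair_features(a: Token, b: Token) -> list[float]:
    """Lexical, syntactic and semantic features for a candidate contrastive pair."""
    distinct = float(a.lemma != b.lemma)       # lexical: contrast needs two distinct words
    same_pos = float(a.pos == b.pos)           # lexical: part-of-speech match
    parallel = float(a.deprel == b.deprel)     # syntactic: crude proxy for parallelism
    hyper_a = TOY_HYPERNYMS.get(a.lemma, set())
    hyper_b = TOY_HYPERNYMS.get(b.lemma, set())
    semantic = float(bool(hyper_a & hyper_b))  # semantic: the words share a hypernym
    return [distinct, same_pos, parallel, semantic, 1.0]  # constant 1.0 acts as the bias

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * ||w||^2 + mean hinge loss by full-batch subgradient descent.

    Labels in `y` must be +1 (contrastive) or -1 (not contrastive).
    """
    n = len(X)
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [lam * wj for wj in w]
        for xi, yi in zip(X, y):
            if yi * sum(wj * xj for wj, xj in zip(w, xi)) < 1:  # inside the margin
                for j, xj in enumerate(xi):
                    grad[j] -= yi * xj / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

pairs = [
    (Token("cat", "NOUN", "nsubj"), Token("dog", "NOUN", "nsubj"), 1),
    (Token("red", "ADJ", "amod"), Token("blue", "ADJ", "amod"), 1),
    (Token("cat", "NOUN", "nsubj"), Token("eat", "VERB", "root"), -1),
    (Token("dog", "NOUN", "nsubj"), Token("table", "NOUN", "obj"), -1),
]
X = [pair_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
w = train_linear_svm(X, y)
print([predict(w, x) for x in X])
```

A real system would query WordNet itself and would likely use a library SVM (possibly with a non-linear kernel); this sketch also regularizes the bias term, a simplification.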
Once we have identified patterns of prosodic prominence, we propose methods to incorporate such information in TTS synthesis and test its impact on the naturalness of synthetic speech through a number of large-scale perceptual experiments. The results of these experiments cast some doubt on the utility of a simple accent/no-accent distinction in Hidden Markov Model-based speech synthesis, while highlighting the importance of contrastive accents.
- …