Expressive Speech Synthesis for Critical Situations
The presence of appropriate acoustic cues of affective features in synthesized speech can be a prerequisite for proper evaluation of the semantic content by the message recipient. In recent work the authors have focused on expressive speech synthesis capable of generating naturally sounding synthetic speech at various levels of arousal. Automatic information and warning systems can be used to inform, warn, instruct, and navigate people in dangerous, critical situations, and to increase the effectiveness of crisis management and rescue operations. One of the activities within the EU SF project CRISIS was called "Extremely expressive (hyper-expressive) speech synthesis for urgent warning messages generation". It was aimed at the research and development of speech synthesizers with high naturalness and intelligibility, capable of generating messages with various expressive loads. The synthesizers will be applicable for generating public alert and warning messages in case of fires, floods, state security threats, etc. Early warning for the situations mentioned above can be provided thanks to fire and flood spread forecasting; the modeling thereof is covered by other activities of the CRISIS project. The most important component needed to build such a synthesizer is an expressive speech database, and an original method is proposed to create one. The current version of the expressive speech database is introduced, and first experiments with expressive synthesizers developed with this database are presented and discussed.
StressDat – Database of Speech Under Stress in Slovak
The paper describes the methodology for creating a Slovak database of speech under stress, along with pilot observations. While the relationship between stress and speech characteristics can be utilized in a wide range of speech technology applications, its research suffers from a lack of suitable databases, particularly for conversational speech. We propose a novel procedure for recording acted speech in the actors' homes using their own smartphones. We describe both the collection of speech material under three levels of stress and the subsequent annotation of stress levels in this material. First observations suggest reasonable inter-annotator agreement, as well as interesting avenues for studying the relationship between the intended stress levels and those perceived in speech.
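The abstract reports "reasonable inter-annotator agreement" on the stress-level labels. A standard way to quantify agreement between two annotators while correcting for chance is Cohen's kappa; the sketch below is a minimal stdlib-only illustration with hypothetical labels (the actual StressDat annotation scheme and agreement statistic are not specified in the abstract):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Proportion of items where the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c]
                   for c in set(labels_a) | set(labels_b)) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical stress annotations (0 = neutral, 1 = mild, 2 = high stress).
ann1 = [0, 1, 2, 2, 1, 0, 2, 1]
ann2 = [0, 1, 2, 1, 1, 0, 2, 2]
print(round(cohens_kappa(ann1, ann2), 3))  # prints 0.619
```

Values around 0.4–0.6 are commonly read as moderate agreement, which is typical for perceptual ratings of stress in speech.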
Mapping Discrete Emotions in the Dimensional Space: An Acoustic Approach
A frequently used procedure for examining the relationship between categorical and dimensional descriptions of emotions is to ask subjects to place verbal expressions representing emotions in a continuous multidimensional emotional space. This work chooses a different approach. It aims at creating a system that predicts the values of Activation and Valence (AV) directly from the sound of emotional speech utterances, without the use of their semantic content or any other additional information. The system uses X-vectors to represent the sound characteristics of an utterance and a Support Vector Regressor to estimate the AV values. The system is trained on a pool of three publicly available databases with dimensional annotation of emotions, and the quality of the regression is evaluated on the test sets of the same databases. The mapping of categorical emotions to the dimensional space is tested on another pool of eight categorically annotated databases. The aim of the work was to test whether, in each unseen database, the predicted values of Valence and Activation place emotion-tagged utterances in the AV space in accordance with expectations based on Russell's circumplex model of affective space. Due to the great variability of speech data, clusters of emotions form overlapping clouds; their average locations can be represented by centroids. A hypothesis on the position of these centroids is formulated and evaluated, and the system's ability to separate the emotions is assessed by measuring the distances between the centroids. It can be concluded that the system works as expected and that the positions of the clusters follow the hypothesized rules. Although the variance in individual measurements is still very high and the overlap of the emotion clusters is large, the AV coordinates predicted by the system lead to an observable separation of the emotions in accordance with the hypothesis. Knowledge from the training databases can therefore be used to predict AV coordinates for unseen data of various origins. This could be used, for example, to detect high levels of stress or depression. With the appearance of more dimensionally annotated training data, systems predicting emotional dimensions from speech sound will become more robust and usable in practical applications such as call centers, avatars, robots, information-providing systems, and security applications.
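The evaluation described above represents each emotion by the centroid of its predicted (Activation, Valence) points and measures separation as the distance between centroids. The stdlib-only sketch below illustrates that step on hypothetical AV predictions on a -1..1 scale; the actual feature extraction (X-vectors) and regression (SVR) from the paper are not reproduced here:

```python
import math
from collections import defaultdict

def emotion_centroids(predictions):
    """Average the predicted (activation, valence) points per emotion label."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for label, (act, val) in predictions:
        s = sums[label]
        s[0] += act
        s[1] += val
        s[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}

def centroid_distance(c1, c2):
    """Euclidean distance between two centroids in the AV plane."""
    return math.dist(c1, c2)

# Hypothetical per-utterance AV predictions, grouped by tagged emotion.
preds = [
    ("anger",   ( 0.8, -0.6)), ("anger",   ( 0.7, -0.7)),
    ("joy",     ( 0.6,  0.7)), ("joy",     ( 0.5,  0.8)),
    ("sadness", (-0.6, -0.5)), ("sadness", (-0.7, -0.6)),
]
cents = emotion_centroids(preds)
print(round(centroid_distance(cents["anger"], cents["joy"]), 4))  # prints 1.4142
```

Under Russell's circumplex, anger and joy share high activation but differ in valence, so their centroids should separate mainly along the valence axis, as in this toy example.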