Expressive Speech Synthesis for Critical Situations
The presence of appropriate acoustic cues of affective features in synthesized speech can be a prerequisite for the proper evaluation of the semantic content by the message recipient. In recent work, the authors have focused on expressive speech synthesis capable of generating naturally sounding synthetic speech at various levels of arousal. Automatic information and warning systems can be used to inform, warn, instruct and navigate people in dangerous, critical situations, and to increase the effectiveness of crisis management and rescue operations. One of the activities within the EU SF project CRISIS was called "Extremely expressive (hyper-expressive) speech synthesis for urgent warning messages generation". It was aimed at the research and development of speech synthesizers with high naturalness and intelligibility, capable of generating messages with various expressive loads. The synthesizers will be applicable to generating public alert and warning messages in case of fires, floods, state security threats, etc. Early warning for the situations mentioned above can be provided thanks to fire and flood spread forecasting; modeling thereof is covered by other activities of the CRISIS project. The most important part needed for building the synthesizer is the expressive speech database. An original method is proposed to create such a database. The current version of the expressive speech database is introduced, and the first experiments with expressive synthesizers developed with this database are presented and discussed.
Prediction of Stress Level from Speech – from Database to Regressor
The term stress can designate a number of situations and affective reactions. This work focuses on the immediate stress reaction caused by, for example, threat, danger, fear, or great concern. Could measuring stress from speech be a viable, fast and non-invasive method? The article describes the development of a system predicting stress from voice, from the creation of the database and the preparation of the training data to the design and testing of the regressor. StressDat, an acted database of speech under stress in Slovak, was designed. Having published the methodology during its development in [1], this work describes the final form, annotation, and basic acoustic analyses of the data. Utterances representing various stress-inducing scenarios were acted at three intended stress levels. The annotators used a "stress thermometer" to rate the perceived stress in each utterance on a scale from 0 to 100, yielding data with a resolution suitable for training the regressor. Several regressors were trained, tested and compared. On the test set, stress estimation works well (R² = 0.72, Concordance Correlation Coefficient = 0.83), but practical application will require much larger volumes of specific training data. StressDat has been made publicly available.
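The evaluation described above can be illustrated with a minimal sketch: a regressor is fitted on per-utterance acoustic feature vectors against 0–100 "stress thermometer" labels, and scored with R² and the Concordance Correlation Coefficient. The features and labels below are synthetic stand-ins, not StressDat data, and the choice of an SVR is an illustrative assumption rather than the regressor actually used.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import r2_score

def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient: agreement of two sequences."""
    mx, my = y_true.mean(), y_pred.mean()
    vx, vy = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mx) * (y_pred - my))
    return 2 * cov / (vx + vy + (mx - my) ** 2)

# Synthetic stand-in: 500 utterances, 20 acoustic features, labels in [0, 100].
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
w = rng.normal(size=20)
y = np.clip(50 + 15 * (X @ w) / np.sqrt(20) + rng.normal(scale=5, size=500), 0, 100)

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

reg = SVR(kernel="rbf", C=10.0).fit(X_train, y_train)
pred = reg.predict(X_test)
print(f"R2  = {r2_score(y_test, pred):.2f}")
print(f"CCC = {ccc(y_test, pred):.2f}")
```

Reporting CCC alongside R² is useful here because CCC also penalizes systematic offset and scale errors in the predictions, not only unexplained variance.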
Speaker Authorization for Air Traffic Control Security
The number of incidents in which unauthorized persons break into frequencies used by Air Traffic Controllers (ATCOs) and give false instructions to pilots, or transmit fake emergency calls, is a permanent and apparently growing threat. One of the measures against such attacks could be to use automatic speaker recognition on the voice radio channel to disclose the potential unauthorized speaker. This work describes the solution for a speaker authorization system in the Security of Air Transport Infrastructures of Europe (SATIE) project, presents the architecture of the system, gives details on training and testing procedures, analyses the influence of the number of authorized persons on the system's performance, and describes how the system was adapted to work on the radio channel.
Weaknesses of Voice Biometrics: Sensitivity of Speaker Verification to Emotional Arousal
In our series of experiments we study the weaknesses of voice biometric systems and try to find solutions to improve their robustness. The acoustic features that represent human voices in current automatic speaker verification systems change significantly when a person's emotional arousal deviates from the neutral state. Speech templates of a given speaker used for enrollment are generally recorded in a neutral emotional state using "normal" speech effort. Therefore, speaking with higher or lower voice tension causes a mismatch between training and testing, resulting in a higher number of verification errors. The acoustic cues of increased emotional arousal in speech are highly non-specific: they are similar to those of Lombard speech, warning and insisting voice, emergency voice, extreme acute stress, shouting, and emotions such as anger, fear, hate, and many others. As the available spontaneous emotional speech databases do not cover the full range of emotional arousal for individual voices and do not have enough utterances per speaker, we decided to use our acted CRISIS database containing speech utterances at six levels of tense emotional arousal per speaker. The sensitivity of a state-of-the-art i-vector based speaker recognizer with PLDA scoring to arousal mismatch was validated. The speaker verification system was successfully implemented in the online "Speaker authorization" module developed within the European project Global ATM Security Management (GAMMA). It has been observed that at extreme arousal levels the reliability of verification decreases. Mixed enrollments with various levels of arousal were used to create more robust models and have shown a promising improvement in verification reliability compared to the baseline.
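The mixed-enrollment idea above can be sketched in a few lines. The sketch below uses simple cosine scoring of speaker embeddings instead of the i-vector/PLDA pipeline actually used in the work, and the embeddings are random toy vectors; the point is only how averaging enrollment utterances at different arousal levels reduces the mismatch with an aroused test utterance.

```python
import numpy as np

def l2norm(v):
    return v / np.linalg.norm(v)

def cosine_score(enrol, test):
    """Cosine similarity between an enrollment model and a test embedding."""
    return float(l2norm(enrol) @ l2norm(test))

def enroll(embeddings):
    """Average several utterance embeddings into one speaker model.
    Mixing utterances recorded at different arousal levels here is what
    makes the model more robust to arousal mismatch at test time."""
    return np.mean([l2norm(e) for e in embeddings], axis=0)

# Toy embeddings: a "neutral" voice and an aroused variant of the same speaker.
rng = np.random.default_rng(1)
neutral = rng.normal(size=128)
aroused = neutral + 0.8 * rng.normal(size=128)  # arousal shifts the embedding

model_neutral = enroll([neutral])           # baseline: neutral-only enrollment
model_mixed = enroll([neutral, aroused])    # mixed-arousal enrollment

print("neutral-only model vs aroused test:", cosine_score(model_neutral, aroused))
print("mixed model vs aroused test:       ", cosine_score(model_mixed, aroused))
```

With a fixed decision threshold, the higher score of the mixed model against the aroused test utterance is exactly the kind of improvement over the neutral-only baseline that the abstract reports.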
Enhancing Air Traffic Management Security by Means of Conformance Monitoring and Speech Analysis
This document describes the concept of an air traffic management security system and current validation activities. The system uses speech analysis techniques to verify speaker authorization and to measure the stress level within the air-ground voice communication between pilots and air traffic controllers on the one hand, and monitors the current air traffic situation on the other. The purpose of this system is to close an existing security gap using this multi-modal approach. First validation results are discussed at the end of this article.
Mapping Discrete Emotions in the Dimensional Space: An Acoustic Approach
A frequently used procedure to examine the relationship between categorical and dimensional descriptions of emotions is to ask subjects to place verbal expressions representing emotions in a continuous multidimensional emotional space. This work chooses a different approach. It aims at creating a system predicting the values of Activation and Valence (AV) directly from the sound of emotional speech utterances, without the use of their semantic content or any other additional information. The system uses X-vectors to represent the sound characteristics of an utterance and a Support Vector Regressor for the estimation of the AV values. The system is trained on a pool of three publicly available databases with dimensional annotation of emotions. The quality of regression is evaluated on the test sets of the same databases. The mapping of categorical emotions to the dimensional space is tested on another pool of eight categorically annotated databases. The aim of the work was to test whether, in each unseen database, the predicted values of Valence and Activation would place emotion-tagged utterances in the AV space in accordance with expectations based on Russell's circumplex model of affective space. Due to the great variability of speech data, clusters of emotions create overlapping clouds. Their average locations can be represented by centroids. A hypothesis on the position of these centroids is formulated and evaluated. The system's ability to separate the emotions is evaluated by measuring the distances between the centroids. It can be concluded that the system works as expected and the positions of the clusters follow the hypothesized rules. Although the variance in individual measurements is still very high and the overlap of emotion clusters is large, it can be stated that the AV coordinates predicted by the system lead to an observable separation of the emotions in accordance with the hypothesis. Knowledge from training databases can therefore be used to predict the AV coordinates of unseen data of various origins. This could be used to detect high levels of stress or depression. With the appearance of more dimensionally annotated training data, systems predicting emotional dimensions from speech sound will become more robust and usable in practical applications in call centers, avatars, robots, information-providing systems, security applications, and the like.
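The centroid-based evaluation described above reduces to a small computation: average the predicted AV coordinates per emotion tag, then measure inter-centroid distances. The sketch below uses made-up predicted coordinates for two emotions (anger as high-activation, sadness as low-activation, both low-valence, following Russell's circumplex), not output of the actual X-vector/SVR system.

```python
import numpy as np

def emotion_centroids(av_points, labels):
    """Average (activation, valence) location per emotion tag."""
    return {lab: av_points[labels == lab].mean(axis=0) for lab in set(labels)}

def centroid_distance(c1, c2):
    """Euclidean distance between two centroids in the AV plane."""
    return float(np.linalg.norm(c1 - c2))

# Hypothetical predicted AV coordinates (activation, valence) in [-1, 1].
rng = np.random.default_rng(2)
anger = rng.normal(loc=[0.7, -0.6], scale=0.3, size=(200, 2))    # high A, low V
sadness = rng.normal(loc=[-0.5, -0.5], scale=0.3, size=(200, 2))  # low A, low V
av = np.vstack([anger, sadness])
labels = np.array(["anger"] * 200 + ["sadness"] * 200)

cents = emotion_centroids(av, labels)
d = centroid_distance(cents["anger"], cents["sadness"])
print("anger centroid   :", cents["anger"].round(2))
print("sadness centroid :", cents["sadness"].round(2))
print(f"separation       : {d:.2f}")
```

Even though the two point clouds overlap heavily, as the abstract notes for real data, the centroids remain clearly separated along the activation axis, which is the property the hypothesis test relies on.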