Expressive Speech Synthesis for Critical Situations
The presence of appropriate acoustic cues of affective features in synthesized speech can be a prerequisite for the proper evaluation of the semantic content by the message recipient. In recent work, the authors have focused on expressive speech synthesis capable of generating naturally sounding synthetic speech at various levels of arousal. Automatic information and warning systems can be used to inform, warn, instruct and navigate people in dangerous, critical situations, and to increase the effectiveness of crisis management and rescue operations. One of the activities within the EU SF project CRISIS was called "Extremely expressive (hyper-expressive) speech synthesis for urgent warning messages generation". It was aimed at the research and development of speech synthesizers with high naturalness and intelligibility, capable of generating messages with various expressive loads. The synthesizers will be applicable to generating public alert and warning messages in case of fires, floods, state security threats, etc. Early warning for the situations mentioned above can be provided thanks to fire and flood spread forecasting; modeling thereof is covered by other activities of the CRISIS project. The most important part needed for building the synthesizer is the expressive speech database. An original method is proposed to create such a database. The current version of the expressive speech database is introduced, and the first experiments with expressive synthesizers developed with this database are presented and discussed.
Prediction of Stress Level from Speech – from Database to Regressor
The term stress can designate a number of situations and affective reactions. This work focuses on the immediate stress reaction caused by, for example, threat, danger, fear, or great concern. Could measuring stress from speech be a viable, fast and non-invasive method? The article describes the development of a system predicting stress from voice, from the creation of the database and the preparation of the training data to the design and testing of the regressor. StressDat, an acted database of speech under stress in Slovak, was designed. Having published the methodology during its development in [1], this work describes the final form, annotation, and basic acoustic analyses of the data. Utterances representing various stress-inducing scenarios were acted at three intended stress levels. The annotators used a "stress thermometer" to rate the perceived stress in each utterance on a scale from 0 to 100, yielding data with a resolution suitable for training the regressor. Several regressors were trained, tested and compared. On the test set, stress estimation works well (R² = 0.72, Concordance Correlation Coefficient = 0.83), but practical application will require much larger volumes of specific training data. StressDat has been made publicly available.
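The evaluation described above can be illustrated with a minimal sketch: a regressor is fitted on per-utterance acoustic feature vectors against 0–100 "stress thermometer" labels, and scored with R² and the Concordance Correlation Coefficient. The features and labels below are synthetic stand-ins, not StressDat data, and the choice of an SVR is an illustrative assumption rather than the regressor actually used.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import r2_score

def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient: agreement of two sequences."""
    mx, my = y_true.mean(), y_pred.mean()
    vx, vy = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mx) * (y_pred - my))
    return 2 * cov / (vx + vy + (mx - my) ** 2)

# Synthetic stand-in: 500 utterances, 20 acoustic features, labels in [0, 100].
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
w = rng.normal(size=20)
y = np.clip(50 + 15 * (X @ w) / np.sqrt(20) + rng.normal(scale=5, size=500), 0, 100)

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

reg = SVR(kernel="rbf", C=10.0).fit(X_train, y_train)
pred = reg.predict(X_test)
print(f"R2  = {r2_score(y_test, pred):.2f}")
print(f"CCC = {ccc(y_test, pred):.2f}")
```

Reporting CCC alongside R² is useful here because CCC also penalizes systematic offset and scale errors in the predictions, not only unexplained variance.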
Speaker Authorization for Air Traffic Control Security
The number of incidents in which unauthorized persons break into frequencies used by Air Traffic Controllers (ATCOs) and give false instructions to pilots, or transmit fake emergency calls, is a permanent and apparently growing threat. One of the measures against such attacks could be to use automatic speaker recognition on the voice radio channel to disclose the potential unauthorized speaker. This work describes the solution for a speaker authorization system in the Security of Air Transport Infrastructures of Europe (SATIE) project, presents the architecture of the system, gives details on training and testing procedures, analyses the influence of the number of authorized persons on the system's performance, and describes how the system was adapted to work on the radio channel.
Weaknesses of Voice Biometrics: Sensitivity of Speaker Verification to Emotional Arousal
In our series of experiments we study the weaknesses of voice biometric systems and try to find solutions to improve their robustness. The acoustic features that represent human voices in current automatic speaker verification systems change significantly when a person's emotional arousal deviates from the neutral state. Speech templates of a given speaker used for enrollment are generally recorded in a neutral emotional state using "normal" speech effort. Therefore, speaking with higher or lower voice tension causes a mismatch between training and testing, resulting in a higher number of verification errors. The acoustic cues of increased emotional arousal in speech are highly non-specific: they are similar to those of Lombard speech, warning and insisting voice, emergency voice, extreme acute stress, shouting, and emotions such as anger, fear, hate, and many others. As the available spontaneous emotional speech databases do not cover the full range of emotional arousal for individual voices and do not have enough utterances per speaker, we decided to use our acted CRISIS database containing speech utterances at six levels of tense emotional arousal per speaker. The sensitivity of a state-of-the-art i-vector based speaker recognizer with PLDA scoring to arousal mismatch was validated. The speaker verification system was successfully implemented in the online "Speaker authorization" module developed within the European project Global ATM Security Management (GAMMA). It has been observed that at extreme arousal levels the reliability of verification decreases. Mixed enrollments with various levels of arousal were used to create more robust models and have shown a promising improvement in verification reliability compared to the baseline.
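The mixed-enrollment idea above can be sketched in a few lines. The sketch below uses simple cosine scoring of speaker embeddings instead of the i-vector/PLDA pipeline actually used in the work, and the embeddings are random toy vectors; the point is only how averaging enrollment utterances at different arousal levels reduces the mismatch with an aroused test utterance.

```python
import numpy as np

def l2norm(v):
    return v / np.linalg.norm(v)

def cosine_score(enrol, test):
    """Cosine similarity between an enrollment model and a test embedding."""
    return float(l2norm(enrol) @ l2norm(test))

def enroll(embeddings):
    """Average several utterance embeddings into one speaker model.
    Mixing utterances recorded at different arousal levels here is what
    makes the model more robust to arousal mismatch at test time."""
    return np.mean([l2norm(e) for e in embeddings], axis=0)

# Toy embeddings: a "neutral" voice and an aroused variant of the same speaker.
rng = np.random.default_rng(1)
neutral = rng.normal(size=128)
aroused = neutral + 0.8 * rng.normal(size=128)  # arousal shifts the embedding

model_neutral = enroll([neutral])           # baseline: neutral-only enrollment
model_mixed = enroll([neutral, aroused])    # mixed-arousal enrollment

print("neutral-only model vs aroused test:", cosine_score(model_neutral, aroused))
print("mixed model vs aroused test:       ", cosine_score(model_mixed, aroused))
```

With a fixed decision threshold, the higher score of the mixed model against the aroused test utterance is exactly the kind of improvement over the neutral-only baseline that the abstract reports.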
Enhancing Air Traffic Management Security by Means of Conformance Monitoring and Speech Analysis
This document describes the concept of an air traffic management security system and current validation activities. The system uses speech analysis techniques to verify speaker authorization and to measure the stress level within the air-ground voice communication between pilots and air traffic controllers on the one hand, and monitors the current air traffic situation on the other. The purpose of this system is to close an existing security gap using this multi-modal approach. First validation results are discussed at the end of this article.
Mapping Discrete Emotions in the Dimensional Space: An Acoustic Approach
A frequently used procedure to examine the relationship between categorical and dimensional descriptions of emotions is to ask subjects to place verbal expressions representing emotions in a continuous multidimensional emotional space. This work chooses a different approach. It aims at creating a system predicting the values of Activation and Valence (AV) directly from the sound of emotional speech utterances, without the use of their semantic content or any other additional information. The system uses X-vectors to represent the sound characteristics of an utterance and a Support Vector Regressor for the estimation of the AV values. The system is trained on a pool of three publicly available databases with dimensional annotation of emotions. The quality of regression is evaluated on the test sets of the same databases. The mapping of categorical emotions to the dimensional space is tested on another pool of eight categorically annotated databases. The aim of the work was to test whether, in each unseen database, the predicted values of Valence and Activation would place emotion-tagged utterances in the AV space in accordance with expectations based on Russell's circumplex model of affective space. Due to the great variability of speech data, clusters of emotions create overlapping clouds. Their average locations can be represented by centroids. A hypothesis on the position of these centroids is formulated and evaluated. The system's ability to separate the emotions is evaluated by measuring the distances between the centroids. It can be concluded that the system works as expected and the positions of the clusters follow the hypothesized rules. Although the variance in individual measurements is still very high and the overlap of emotion clusters is large, it can be stated that the AV coordinates predicted by the system lead to an observable separation of the emotions in accordance with the hypothesis. Knowledge from training databases can therefore be used to predict the AV coordinates of unseen data of various origins. This could be used to detect high levels of stress or depression. With the appearance of more dimensionally annotated training data, systems predicting emotional dimensions from speech sound will become more robust and usable in practical applications in call centers, avatars, robots, information-providing systems, security applications, and the like.
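The centroid-based evaluation described above reduces to a small computation: average the predicted AV coordinates per emotion tag, then measure inter-centroid distances. The sketch below uses made-up predicted coordinates for two emotions (anger as high-activation, sadness as low-activation, both low-valence, following Russell's circumplex), not output of the actual X-vector/SVR system.

```python
import numpy as np

def emotion_centroids(av_points, labels):
    """Average (activation, valence) location per emotion tag."""
    return {lab: av_points[labels == lab].mean(axis=0) for lab in set(labels)}

def centroid_distance(c1, c2):
    """Euclidean distance between two centroids in the AV plane."""
    return float(np.linalg.norm(c1 - c2))

# Hypothetical predicted AV coordinates (activation, valence) in [-1, 1].
rng = np.random.default_rng(2)
anger = rng.normal(loc=[0.7, -0.6], scale=0.3, size=(200, 2))    # high A, low V
sadness = rng.normal(loc=[-0.5, -0.5], scale=0.3, size=(200, 2))  # low A, low V
av = np.vstack([anger, sadness])
labels = np.array(["anger"] * 200 + ["sadness"] * 200)

cents = emotion_centroids(av, labels)
d = centroid_distance(cents["anger"], cents["sadness"])
print("anger centroid   :", cents["anger"].round(2))
print("sadness centroid :", cents["sadness"].round(2))
print(f"separation       : {d:.2f}")
```

Even though the two point clouds overlap heavily, as the abstract notes for real data, the centroids remain clearly separated along the activation axis, which is the property the hypothesis test relies on.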