
    Automatic speaker recognition: modelling, feature extraction and effects of clinical environment

    Speaker recognition is the task of establishing the identity of an individual based on his/her voice. It has significant potential as a convenient biometric method for telephony applications and does not require sophisticated or dedicated hardware. Speaker recognition is typically achieved by two-stage signal processing: training and testing. The training process calculates speaker-specific feature parameters from the speech. The features are used to generate statistical models of different speakers. In the testing phase, speech samples from unknown speakers are compared with the models and classified. Current state-of-the-art speaker recognition systems use the Gaussian mixture model (GMM) technique in combination with the expectation maximization (EM) algorithm to build the speaker models. The most frequently used features are the Mel frequency cepstral coefficients (MFCC). This thesis investigated areas of possible improvement in the field of speaker recognition. The identified drawbacks of current speaker recognition systems included slow convergence rates of the modelling techniques and the sensitivity of the features to changes caused by speaker aging, use of alcohol and drugs, changing health conditions, and mental state. The thesis proposed a new method of deriving the GMM parameters, called the EM-ITVQ algorithm. The EM-ITVQ showed a significant improvement in equal error rates and higher convergence rates when compared to the classical GMM based on the EM method. It was demonstrated that features based on the nonlinear model of speech production (TEO-based features) provided better performance compared to the conventional MFCC features. For the first time, the effect of clinical depression on speaker verification rates was tested. It was demonstrated that speaker verification results deteriorate if the speakers are clinically depressed. The deterioration was demonstrated using conventional (MFCC) features. The thesis also showed that when the MFCC features are replaced with features based on the nonlinear model of speech production (TEO-based features), the detrimental effect of clinical depression on speaker verification rates can be reduced.
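    For orientation, the classical GMM-EM baseline that the thesis compares against can be sketched in a few lines of Python. This is a minimal illustration only, not the proposed EM-ITVQ algorithm; the file names, sampling rate, mixture size, and decision threshold are assumptions, with librosa and scikit-learn standing in for the MFCC-extraction and EM-training steps.

```python
# Minimal sketch of the classical GMM-EM speaker-verification baseline:
# MFCC frames per speaker -> one GMM per speaker -> log-likelihood scoring.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Load audio and return frame-level MFCC vectors (frames x coeffs)."""
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_speaker_model(train_paths, n_components=32):
    """Fit a diagonal-covariance GMM with EM on all training MFCC frames."""
    X = np.vstack([mfcc_features(p) for p in train_paths])
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    return gmm.fit(X)

def verify(gmm, test_path, threshold=-45.0):
    """Accept the identity claim if the mean frame log-likelihood under the
    claimed speaker's model exceeds a threshold (tuned on held-out data)."""
    score = gmm.score(mfcc_features(test_path))  # mean log-likelihood per frame
    return score > threshold, score

# Hypothetical usage (file names are placeholders):
# model = train_speaker_model(["alice_01.wav", "alice_02.wav"])
# accepted, score = verify(model, "unknown.wav")
```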

    Detection of clinical depression in adolescents using acoustic speech analysis

    Clinical depression is a major risk factor in suicides and is associated with high mortality rates, making it one of the leading causes of death worldwide every year. Symptoms of depression often first appear during adolescence, at a time when the voice is changing in both males and females, suggesting that specific studies of these phenomena in adolescent populations are warranted. The properties of acoustic speech have previously been investigated as possible cues for depression in adults. However, these studies were restricted to small populations of patients, and the speech recordings were made during patients' clinical interviews or fixed-text reading sessions. A collaborative effort with the Oregon Research Institute (ORI), USA allowed the development of a new speech corpus consisting of a large sample of 139 adolescents (46 males and 93 females) divided into two groups (68 clinically depressed and 71 controls). The speech recordings were made during naturalistic interactions between adolescents and parents. Instead of covering a plethora of acoustic features, this study draws on knowledge from speech science and groups the acoustic features into five categories that relate to the physiological and perceptual areas of the speech production mechanism: prosodic, cepstral, spectral, glottal, and Teager energy operator (TEO) based features. The effectiveness of these acoustic feature categories in detecting adolescent depression was measured. The salient feature categories were determined by testing the categories and their combinations within a binary classification framework. Consistent with previous studies, it was observed that there are strong gender-related differences in classification accuracy, and that the glottal features provide an important enhancement of the classification accuracy when combined with other types of features. An important new contribution of this thesis was the observation that the TEO-based features significantly outperformed the prosodic, cepstral, spectral, and glottal features and their combinations. An investigation into the possible reasons for the strong performance of the TEO features pointed to the importance of nonlinear mechanisms associated with glottal flow formation as possible cues for depression.
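    The category-comparison protocol described above can be sketched as follows, assuming the per-recording feature matrices for each category have already been extracted. The RBF-kernel SVM and 5-fold cross-validation are illustrative stand-ins; the thesis's actual classifier and validation scheme may differ.

```python
# Sketch of the feature-category comparison: each acoustic category
# (prosodic, cepstral, spectral, glottal, TEO) is a precomputed matrix of
# per-recording features, and every category combination is scored with
# cross-validated binary classification (depressed vs. control).
from itertools import combinations
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC  # stand-in classifier, not the thesis's choice

def rank_category_combinations(categories, labels, cv=5):
    """categories: dict name -> (n_recordings x n_features) array.
    Returns (combination, mean cross-validated accuracy), best first."""
    results = []
    for r in range(1, len(categories) + 1):
        for combo in combinations(sorted(categories), r):
            X = np.hstack([categories[c] for c in combo])
            clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
            acc = cross_val_score(clf, X, labels, cv=cv).mean()
            results.append((combo, acc))
    return sorted(results, key=lambda t: -t[1])
```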

    Stress and emotion recognition in natural speech in the work and family environments

    Speech stress and emotion recognition and classification technology has the potential to provide significant benefits to national and international industry and to society in general. The accuracy of automatic stress and emotion recognition in speech relies heavily on the discriminating power of the characteristic features. This work introduced and examined a number of new linear and nonlinear feature extraction methods for automatic detection of stress and emotion in speech. The proposed linear feature extraction methods included features derived from speech spectrograms (SS-CB/BARK/ERB-AE, SS-AF-CB/BARK/ERB-AE, SS-LGF-OFS, SS-ALGF-OFS, SS-SP-ALGF-OFS and SS-sigma-pi), wavelet packets (WP-ALGF-OFS) and the empirical mode decomposition (EMD-AER). The proposed nonlinear feature extraction methods were based on the results of recent laryngological studies and nonlinear modelling of the phonation process. The proposed nonlinear features included the area under the TEO autocorrelation envelope based on different spectral decompositions (TEO-DWT, TEO-WP, TEO-PWP-S and TEO-PWP-G), as well as features representing the spectral energy distribution of speech (AUSEES) and of the glottal waveform (AUSEEG). The proposed features were compared with features based on the classical linear model of speech production, including F0, formants, MFCC and glottal time/frequency parameters. Two classifiers, GMM and KNN, were tested for consistency. The experiments used speech under actual stress from the SUSAS database (7 speakers; 3 female and 4 male) and speech with five naturally expressed emotions (neutral, anger, anxious, dysphoric and happy) from the ORI corpora (71 speakers; 27 female and 44 male). The nonlinear features clearly outperformed all the linear features. The classification results were consistent with the nonlinear model of the phonation process, indicating that the harmonic structure and the spectral distribution of the glottal energy provide the most important cues for stress and emotion recognition in speech. The study also investigated whether automatic emotion recognition can determine differences in emotion expression between parents of depressed adolescents and parents of non-depressed adolescents, and whether there are differences in emotion expression between mothers and fathers in general. The experimental results indicated that parents of depressed adolescents produce stronger, more exaggerated expressions of affect than parents of non-depressed children, and that females in general provide expressions of affect that are easier to discriminate (more exaggerated) than those of males.
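    The core of the TEO-based features above is the discrete Teager energy operator and the area under the autocorrelation of its output. A minimal per-band sketch is given below; the wavelet/wavelet-packet band decompositions (TEO-DWT, TEO-WP, etc.) are omitted, and taking the absolute autocorrelation as a proxy for its envelope is a simplification of this author's choosing, not necessarily the thesis's exact definition.

```python
# Sketch of the Teager Energy Operator (TEO) and an "area under the TEO
# autocorrelation envelope" feature for one frequency band of a speech frame.
import numpy as np

def teager_energy(x):
    """Discrete TEO: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def area_under_teo_autocorr(x):
    """Normalised autocorrelation of the TEO profile over non-negative
    lags, then its discrete area (|.| used as a simple envelope proxy)."""
    psi = teager_energy(x)
    psi = psi - psi.mean()
    ac = np.correlate(psi, psi, mode="full")[len(psi) - 1:]
    ac = ac / (ac[0] + 1e-12)   # normalise so the lag-0 value is 1
    return np.abs(ac).sum()     # discrete area under the envelope
```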

    Detecting depression using convolutional networks applied to a correlation structure

    In this study, we investigated the possibility of automatically detecting depression using convolutional neural networks applied to a special correlation structure extracted from the speech signal. Depression is one of the most widespread treatable psychiatric illnesses of our time. The severity of depression strongly affects the quality of life of the individual suffering from it and, in extreme cases, can lead to suicide. It is therefore crucial that the illness be recognised at an early stage so that the person can receive appropriate treatment; however, diagnosing depression requires expertise, which makes automatic indication of its possible presence important. In this article, we present a method that can recognise depression from the speech signal purely through spectral features, using convolutional neural networks. We show how the accuracy of depression detection changes with the use of different acoustic-phonetic features and with changes to the correlation structure. Using this method, we were able to separate healthy and depressed subjects with 84% accuracy based on their speech samples.
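    A heavily hedged reconstruction of such a pipeline is sketched below: frame-level spectral features are turned into a delay-embedded correlation matrix, which a small convolutional network classifies. The delay grid, layer sizes, and input dimensions are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch: correlation structure of spectral feature trajectories
# fed to a small CNN for binary depressed/healthy classification.
import numpy as np
import torch
import torch.nn as nn

def correlation_structure(feats, delays=(1, 3, 5, 7)):
    """feats: (n_frames x n_channels) spectral features (e.g. MFCCs).
    For each delay, stacks every channel with its delayed copy and takes
    the correlation matrix of the stacked trajectories."""
    rows = [np.corrcoef(np.vstack([feats[:-d, :].T, feats[d:, :].T]))
            for d in delays]
    return np.stack(rows)  # (n_delays x 2C x 2C) multi-channel "image"

class DepressionCNN(nn.Module):
    """Tiny CNN over the correlation-structure image; sizes are arbitrary."""
    def __init__(self, in_ch=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 2),  # depressed vs. healthy logits
        )

    def forward(self, x):
        return self.net(x)
```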

    Multimodal analysis of verbal and nonverbal behaviour on the example of clinical depression

    Clinical depression is a common mood disorder that may last for long periods, vary in severity, and impair an individual's ability to cope with daily life. Depression affects 350 million people worldwide and is therefore considered a burden not only on a personal and social level, but also on an economic one. Depression is the fourth most significant cause of suffering and disability worldwide and is predicted to become the leading cause by 2020. Although treatment of depression disorders has proven effective in most cases, misdiagnosis of depressed patients is a common barrier, not only because depression manifests itself in different ways, but also because clinical interviews and self-reported history are currently the only means of diagnosis, which risks a range of subjective biases from either the patient's report or the clinician's judgment. While automatic affective state recognition has become an active research area in the past decade, methods for mood disorder detection, such as depression, are still in their infancy. Building on advances in affective sensing techniques, the long-term goal is to develop an objective multimodal system that supports clinicians during the diagnosis and monitoring of clinical depression. This dissertation investigates the most promising characteristics of depression that can be "heard" and "seen" by a computer system for the task of detecting depression objectively. Using audio-video recordings of a clinically validated Australian depression dataset, several experiments are conducted to characterise depression-related patterns from verbal and nonverbal cues. Of particular interest in this dissertation is the exploration of the speech style, speech prosody, eye activity, and head pose modalities. Statistical analysis and automatic classification of the extracted cues are investigated. In addition, multimodal fusion methods for these modalities are examined to increase the accuracy and confidence of detecting depression. These investigations result in a proposed system that detects depression in a binary manner (i.e. depressed vs. non-depressed) using temporal depression behavioural cues. The proposed system: (1) uses audio-video recordings to investigate verbal and nonverbal modalities, (2) extracts functional features from verbal and nonverbal modalities over each subject's entire segments, (3) pre- and post-normalises the extracted features, (4) selects features using the T-test, (5) classifies depression in a binary manner (i.e. severely depressed vs. healthy controls), and finally (6) fuses the individual modalities. The proposed system was validated for scalability and usability using generalisation experiments. Close studies were made of American and German depression datasets individually, and then in combination with the Australian one. Applying the proposed system to the three datasets showed remarkably high classification results: up to 95% average recall for the individual sets and 86% for the three combined. A strong implication is that the proposed system can generalise to datasets recorded under quite different conditions, such as collection procedure and task, depression diagnosis test and scale, and cultural and language background. High performance was found consistently in speech prosody and eye activity in both the individual and combined datasets, with head pose features somewhat less remarkable.
    There are strong indications that the extracted features are robust to large variations in recording conditions. Furthermore, once the modalities were combined, the classification results improved substantially. The modalities are therefore shown both to correlate with and complement each other, working in tandem as an innovative system for the diagnosis of depression across large variations in population and procedure.
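    Steps (4)-(6) of the numbered pipeline above can be sketched as follows. The SVM classifier, the 0.05 significance level, and probability averaging as the fusion rule are assumptions used for illustration; the dissertation's concrete choices are not specified in this abstract.

```python
# Sketch of T-test feature selection, per-modality classification, and
# late (decision-level) fusion by averaging class probabilities.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.svm import SVC

def ttest_select(X, y, alpha=0.05):
    """Keep features whose means differ significantly between the two
    classes (independent two-sample t-test, p < alpha)."""
    _, p = ttest_ind(X[y == 1], X[y == 0], axis=0)
    return p < alpha  # boolean mask of selected feature columns

def fuse_modalities(train_mats, y_train, test_mats):
    """train_mats / test_mats: one feature matrix per modality, rows
    aligned across modalities. Returns the fused depression score
    (mean class-1 probability) per test sample."""
    probs = []
    for X_tr, X_te in zip(train_mats, test_mats):
        mask = ttest_select(X_tr, y_train)
        clf = SVC(probability=True).fit(X_tr[:, mask], y_train)
        probs.append(clf.predict_proba(X_te[:, mask])[:, 1])
    return np.mean(probs, axis=0)
```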

    Models and analysis of vocal emissions for biomedical applications

    This book of proceedings collects the papers presented at the 3rd International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, MAVEBA 2003, held 10-12 December 2003 in Firenze, Italy. The workshop is organised every two years and aims to stimulate contacts between specialists active in research and industrial developments in the area of voice analysis for biomedical applications. The scope of the workshop includes all aspects of voice modelling and analysis, ranging from fundamental research to all kinds of biomedical applications and related established and advanced technologies.

    Noise reduction in industry based on virtual instrumentation

    This paper discusses the reduction of background noise in an industrial environment to extend human-machine interaction. In the Industry 4.0 era, the mass deployment of voice control (speech recognition) in various industrial applications is possible, especially in relation to augmented reality (such as hands-free control via voice commands). As Industry 4.0 relies heavily on radio-frequency technologies, some brief insight into this problem is provided, including the Internet of Things (IoT) and 5G deployment. This study was carried out in cooperation with the industrial partner Brose CZ spol. s.r.o., where sound recordings were made to produce a dataset. The experimental environment comprised three workplaces with background noise above 100 dB, consisting of a laser welder, a magnetic welder, and a press. A virtual device was developed from the dataset in order to test selected commands with a commercial speech recognizer from Microsoft. We tested a hybrid algorithm for noise reduction and its impact on voice command recognition efficiency. Using virtual devices, the study was carried out on a large set of speakers (20 participants; 10 men and 10 women). The experiments included a large number of repetitions (100 for each command under different noise conditions). Statistical results confirmed the efficiency of the tested algorithms. Laser welding environment efficiency was 27% before filtering, 76% using the least mean square (LMS) algorithm, and 79% using LMS + independent component analysis (ICA). Magnetic welding environment efficiency was 24% before filtering, 70% with LMS, and 75% with LMS + ICA. The press workplace showed no recognition success before filtering, 52% with LMS, and 54% with LMS + ICA.
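    A minimal sketch of the kind of LMS adaptive noise canceller evaluated above is shown below. The filter length and step size are illustrative, not the paper's settings, and a noise-only reference channel is assumed to be available.

```python
# Minimal LMS adaptive noise canceller: a reference noise signal is
# filtered to predict the noise component in the primary (speech + noise)
# channel; the prediction error is the noise-reduced speech estimate.
import numpy as np

def lms_cancel(primary, noise_ref, n_taps=64, mu=0.005):
    """primary: speech + noise; noise_ref: correlated noise-only reference.
    Returns the error signal e, i.e. the cleaned speech estimate."""
    w = np.zeros(n_taps)                       # adaptive filter weights
    e = np.zeros(len(primary), dtype=float)
    for n in range(n_taps, len(primary)):
        x = noise_ref[n - n_taps:n][::-1]      # most recent samples first
        y = w @ x                              # predicted noise sample
        e[n] = primary[n] - y                  # cleaned output sample
        w += mu * e[n] * x                     # LMS weight update
    return e
```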

    Automatic Emotion Recognition from Mandarin Speech


    Models and Analysis of Vocal Emissions for Biomedical Applications

    The MAVEBA Workshop proceedings, published on a biennial basis, collect the scientific papers presented as both oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, as well as biomedical engineering methods for the analysis of voice signals and images, as a support to clinical diagnosis and the classification of vocal pathologies.