217 research outputs found

    Jitter and Shimmer measurements for speaker diarization

    Jitter and shimmer voice quality features have been successfully used to characterize speaker voice traits and detect voice pathologies. Jitter and shimmer measure variations in the fundamental frequency and amplitude of a speaker's voice, respectively. Due to their nature, they can be used to assess differences between speakers. In this paper, we investigate the usefulness of these voice quality features in the task of speaker diarization. The combination of voice quality features with the conventional spectral features, Mel-Frequency Cepstral Coefficients (MFCC), is addressed within the framework of the Augmented Multi-party Interaction (AMI) corpus, a set of multi-party, spontaneous speech recordings. Both sets of features are independently modeled using mixtures of Gaussians and fused at the score likelihood level. The experiments carried out on the AMI corpus show that incorporating jitter and shimmer measurements into the baseline spectral features decreases the diarization error rate in most of the recordings.
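As an illustration of how such measurements are typically computed (not the authors' implementation), the following minimal sketch calculates relative local jitter and shimmer, assuming per-cycle pitch periods and peak amplitudes have already been extracted from a voiced segment:

```python
import numpy as np

def local_jitter(periods):
    """Relative local jitter: mean absolute difference between consecutive
    pitch periods, normalized by the mean period."""
    periods = np.asarray(periods, dtype=float)
    return np.abs(np.diff(periods)).mean() / periods.mean()

def local_shimmer(amplitudes):
    """Relative local shimmer: mean absolute difference between consecutive
    peak amplitudes, normalized by the mean amplitude."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.abs(np.diff(amplitudes)).mean() / amplitudes.mean()

# Toy usage: pitch periods (seconds) and peak amplitudes of one voiced segment.
periods = [0.0100, 0.0102, 0.0099, 0.0101, 0.0100]
amps = [0.81, 0.79, 0.82, 0.80, 0.78]
print(local_jitter(periods), local_shimmer(amps))
```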

    Discriminative features for GMM and i-vector based speaker diarization

    Speaker diarization has received considerable research attention over the last decade. Among the different domains of speaker diarization, diarization in the meeting domain is the most challenging: it usually contains spontaneous speech and is, for example, susceptible to reverberation. The appropriate selection of speech features is one of the factors that affect the performance of speaker diarization systems. Mel Frequency Cepstral Coefficients (MFCC) are the most widely used short-term speech features in speaker diarization. Other factors that affect the performance of speaker diarization systems are the techniques employed to perform both speaker segmentation and speaker clustering. In this thesis, we have proposed the use of jitter and shimmer long-term voice-quality features for both Gaussian Mixture Modeling (GMM) and i-vector based speaker diarization systems. The voice-quality features are used together with state-of-the-art short-term cepstral and long-term speech features. The long-term features consist of prosody and Glottal-to-Noise Excitation ratio (GNE) descriptors. Firstly, the voice-quality, prosodic and GNE features are stacked in the same feature vector. Then, they are fused with the cepstral coefficients at the score likelihood level for both the proposed GMM and i-vector based speaker diarization systems. For the proposed GMM based speaker diarization system, independent HMM models are estimated from the short-term and long-term speech feature sets. The fusion of the short-term descriptors with the long-term ones in speaker segmentation is carried out by linearly weighting the log-likelihood scores of Viterbi decoding. In the case of speaker clustering, the fusion of the short-term cepstral features with the long-term ones is carried out by linearly fusing the Bayesian Information Criterion (BIC) scores corresponding to these feature sets. For the proposed i-vector based speaker diarization system, the speaker segmentation is carried out exactly as in the GMM based system. However, the speaker clustering technique is based on the recently introduced factor analysis paradigm. Two sets of i-vectors are extracted from the speaker segmentation hypothesis. Whilst the first i-vector is extracted from the short-term cepstral features, the second one is extracted from the voice-quality, prosody and GNE descriptors. Then, the cosine-distance and Probabilistic Linear Discriminant Analysis (PLDA) scores of the i-vectors are linearly weighted to obtain a fused similarity score. Finally, the fused score is used as the speaker clustering distance. We have also proposed the use of delta dynamic features for speaker clustering. The motivation for using deltas in clustering is that delta dynamic features capture the transitional characteristics of the speech signal, which contain speaker-specific information not captured by the static cepstral coefficients. The delta features are used together with the short-term static cepstral coefficients and the long-term speech features (i.e., voice-quality, prosody and GNE) for both GMM and i-vector based speaker diarization systems. The experiments have been carried out on the Augmented Multi-party Interaction (AMI) meeting corpus.
The experimental results show that the use of voice-quality, prosodic, GNE and delta dynamic features improves the performance of both GMM and i-vector based speaker diarization systems.
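To make the score-level fusion concrete, the following is a minimal sketch of the linear weighting described above, assuming frame-level log-likelihoods (for segmentation) and cluster-pair BIC scores (for clustering) have already been computed from the two feature streams; the weight alpha is a hypothetical tuning parameter, not a value from the thesis:

```python
import numpy as np

def fuse_log_likelihoods(ll_cepstral, ll_long_term, alpha=0.9):
    """Linearly weight frame-level Viterbi log-likelihoods obtained from the
    models trained on cepstral features and on the stacked long-term
    (voice-quality, prosody, GNE) features."""
    return alpha * np.asarray(ll_cepstral) + (1.0 - alpha) * np.asarray(ll_long_term)

def fuse_bic_scores(bic_cepstral, bic_long_term, alpha=0.9):
    """Apply the same linear weighting to the BIC scores of a candidate
    cluster pair when deciding whether to merge it."""
    return alpha * bic_cepstral + (1.0 - alpha) * bic_long_term
```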

    The use of long-term features for GMM- and i-vector-based speaker diarization systems

    Several factors contribute to the performance of speaker diarization systems. For instance, the appropriate selection of speech features is one of the key aspects that affect speaker diarization systems. Other factors include the techniques employed to perform both segmentation and clustering. While the static mel frequency cepstral coefficients are the most widely used features in speech-related tasks, including speaker diarization, several studies have shown the benefits of augmenting them with additional speech features. In this work, we have proposed and assessed the use of voice-quality features (i.e., jitter, shimmer, and Glottal-to-Noise Excitation ratio) within the framework of speaker diarization. These acoustic attributes are employed together with the state-of-the-art short-term cepstral and long-term prosodic features. Additionally, the use of delta dynamic features is also explored separately for both the segmentation and bottom-up clustering sub-tasks. The combination of the different feature sets is carried out at several levels. At the feature level, the long-term speech features are stacked in the same feature vector. At the score level, the short- and long-term speech features are independently modeled and fused at the score likelihood level. Various feature combinations have been applied to both Gaussian mixture modeling and i-vector-based speaker diarization systems. The experiments have been carried out on the Augmented Multi-party Interaction meeting corpus. The best result, in terms of diarization error rate, is obtained by using i-vector-based cosine-distance clustering together with a signal parameterization consisting of a combination of static cepstral coefficients, delta, voice-quality, and prosodic features. The best result shows about 24% relative diarization error rate improvement compared to the baseline system, which is based on Gaussian mixture modeling and short-term static cepstral coefficients.
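The clustering step of the best-performing configuration relies on cosine scoring of i-vectors. As a rough illustration (not the authors' code), the sketch below fuses the cosine scores of the cepstral i-vectors and of the long-term-feature i-vectors for a pair of clusters; the weight alpha is again a hypothetical tuning parameter:

```python
import numpy as np

def cosine_score(w1, w2):
    """Cosine similarity between two i-vectors."""
    return float(np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2)))

def fused_cluster_distance(cep_a, cep_b, long_a, long_b, alpha=0.8):
    """Distance between two clusters: one minus the linearly weighted
    combination of the cosine scores computed from the cepstral i-vectors
    (cep_*) and the long-term-feature i-vectors (long_*)."""
    fused = alpha * cosine_score(cep_a, cep_b) + (1.0 - alpha) * cosine_score(long_a, long_b)
    return 1.0 - fused
```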

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals and the methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems, and in other applications that operate in real-world environments, such as mobile communication services and smart homes.

    USING DEEP LEARNING-BASED FRAMEWORK FOR CHILD SPEECH EMOTION RECOGNITION

    Biological signals of the body through which human emotion can be detected abound, including heart rate, facial expressions, movement of the eyelids and dilation of the pupils, body posture, skin conductance, and even the speech we make. Speech emotion recognition research started some three decades ago, and the popular Interspeech Emotion Challenge has helped to propagate this research area. However, most speech emotion recognition research is focused on adults, and there is very little research on child speech. This dissertation describes the development and evaluation of a child speech emotion recognition framework. The higher-level components of the framework are designed to sort and separate speech based on the speaker's age, ensuring that the focus is only on speech produced by children. The framework uses Baddeley's Theory of Working Memory to model a Working Memory Recurrent Network that can process and recognize emotions from speech. Baddeley's Theory of Working Memory offers one of the best explanations of how the human brain holds and manipulates temporary information, which is crucial for developing neural networks that learn effectively. Experiments were designed and performed to answer the research questions, evaluate the proposed framework, and benchmark its performance against other methods. Satisfactory results were obtained from the experiments, and in many cases our framework was able to outperform other popular approaches. This study has implications for various applications of child speech emotion recognition, such as child abuse detection and child learning robots.
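The dissertation's Working Memory Recurrent Network is not specified in the abstract; as a rough point of reference only, a conventional recurrent classifier over frame-level speech features looks like the sketch below (PyTorch, with feature dimension, hidden size, and number of emotion classes chosen arbitrarily):

```python
import torch
import torch.nn as nn

class RecurrentEmotionClassifier(nn.Module):
    """Plain LSTM baseline over frame-level speech features (e.g., MFCCs);
    a generic stand-in, not the Working Memory Recurrent Network itself."""
    def __init__(self, n_features=40, hidden_size=128, n_emotions=6):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, n_emotions)

    def forward(self, x):            # x: (batch, frames, n_features)
        _, (h, _) = self.lstm(x)     # final hidden state summarizes the utterance
        return self.out(h[-1])       # emotion class logits
```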

    Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge

    More than a decade has passed since research on automatic recognition of emotion from speech became a new field of research in line with its 'big brothers', speech and speaker recognition. This article attempts to provide a short overview of where we are today, how we got there, and what this can tell us about where to go next and how we might get there. In the first part, we address the basic phenomenon, reflecting on the last fifteen years and commenting on databases, modelling and annotation, the unit of analysis, and prototypicality. We then shift to automatic processing, including discussions on features, classification, robustness, evaluation, and implementation and system integration. From there we go to the first comparative challenge on emotion recognition from speech, the INTERSPEECH 2009 Emotion Challenge, organised by (part of) the authors, including the description of the Challenge's database, Sub-Challenges, participants and their approaches, the winners, the fusion of results, and the actual lessons learnt, before we finally address the ever-lasting problems and promising future attempts. Schuller, B., Batliner, A., Steidl, S., Seppi, D., "Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge", Speech Communication, vol. 53, no. 9-10, pp. 1062-1087, November 2011.

    Development of the structure of a speaker emotional state recognition system

    Modern approaches to the automated recognition of emotions and certain psychological states of a person from their voice are reviewed. A structure is proposed for an emotion identification system that uses preprocessing of the audio signal (noise reduction and segmentation by participant) together with sets of acoustic, prosodic, and extralinguistic speech characteristics to build the feature description. The results of numerous studies indicate that these characteristics need to be used together.

    Emotion Recognition using Deep Learning and Novel Data Augmentation Techniques

    Emotion recognition is quite important for various applications related to human-computer interaction or to understanding the user's mood in specific tasks. In general, a person's emotion is recognized by analyzing facial expressions, gestures, body posture, speech, or physiological parameters such as those obtained from electroencephalograms, electrocardiograms, etc. However, in many cases the visual information is not available or appropriate, while the measurement of physiological parameters is difficult and requires specialized, expensive equipment. As a result, speech is probably the best alternative. The typical machine learning techniques used for this purpose extract a set of linguistic features from the data, which are then used to train supervised learning models. In this thesis, a Convolutional Neural Network (CNN) is proposed which, unlike traditional approaches, detects only the important features of the raw data fed into it. It is worth noting that the architecture of a CNN is analogous to the connectivity of the neurons of the human brain and is inspired by the organization of the visual cortex. Three audio datasets are used (EMOVO, SAVEE, Emo-DB), from which spectrograms are extracted and used as inputs to the neural network.
For optimal performance of the algorithm, data augmentation techniques beyond the usual addition of noise are applied to the original data, such as shifting the audio signal and changing its pitch or speed. Finally, methods against overfitting, such as dropout, are applied, together with local response normalization layers, whose operation is inspired by the lateral inhibition of neurons in the human brain. Our approach outperformed previous similar studies, although the model is not shown to be independent of the language of the audio signals.
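As an illustration of this kind of pipeline (not the thesis code), the sketch below extracts a log-mel spectrogram, applies the types of augmentation mentioned above, and defines a small CNN with local response normalization and dropout; layer sizes, augmentation magnitudes, and the number of emotion classes are assumptions:

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

def augment(y, sr):
    """Yield simple augmented copies of a waveform: additive noise,
    time shift, pitch shift, and speed change (values are illustrative)."""
    yield y + 0.005 * np.random.randn(len(y))            # additive noise
    yield np.roll(y, sr // 10)                            # shift by 100 ms
    yield librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    yield librosa.effects.time_stretch(y, rate=1.1)

def log_mel_spectrogram(y, sr):
    """Log-scaled mel spectrogram used as the CNN input image."""
    spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    return librosa.power_to_db(spec)

class EmotionCNN(nn.Module):
    """Small CNN over spectrograms with local response normalization and
    dropout, in the spirit of the thesis; layer sizes are guesses."""
    def __init__(self, n_emotions=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.LocalResponseNorm(size=5), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.LocalResponseNorm(size=5), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Dropout(0.5), nn.Linear(32, n_emotions),
        )

    def forward(self, x):            # x: (batch, 1, n_mels, frames)
        return self.classifier(self.features(x))
```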