14 research outputs found

    Speech data analysis for semantic indexing of video of simulated medical crises.

    Get PDF
    The Simulation for Pediatric Assessment, Resuscitation, and Communication (SPARC) group within the Department of Pediatrics at the University of Louisville, was established to enhance the care of children by using simulation based educational methodologies to improve patient safety and strengthen clinician-patient interactions. After each simulation session, the physician must manually review and annotate the recordings and then debrief the trainees. The physician responsible for the simulation has recorded 100s of videos, and is seeking solutions that can automate the process. This dissertation introduces our developed system for efficient segmentation and semantic indexing of videos of medical simulations using machine learning methods. It provides the physician with automated tools to review important sections of the simulation by identifying who spoke, when and what was his/her emotion. Only audio information is extracted and analyzed because the quality of the image recording is low and the visual environment is static for most parts. Our proposed system includes four main components: preprocessing, speaker segmentation, speaker identification, and emotion recognition. The preprocessing consists of first extracting the audio component from the video recording. Then, extracting various low-level audio features to detect and remove silence segments. We investigate and compare two different approaches for this task. The first one is threshold-based and the second one is classification-based. The second main component of the proposed system consists of detecting speaker changing points for the purpose of segmenting the audio stream. We propose two fusion methods for this task. The speaker identification and emotion recognition components of our system are designed to provide users the capability to browse the video and retrieve shots that identify ”who spoke, when, and the speaker’s emotion” for further analysis. For this component, we propose two feature representation methods that map audio segments of arbitary length to a feature vector with fixed dimensions. The first one is based on soft bag-of-word (BoW) feature representations. In particular, we define three types of BoW that are based on crisp, fuzzy, and possibilistic voting. The second feature representation is a generalization of the BoW and is based on Fisher Vector (FV). FV uses the Fisher Kernel principle and combines the benefits of generative and discriminative approaches. The proposed feature representations are used within two learning frameworks. The first one is supervised learning and assumes that a large collection of labeled training data is available. Within this framework, we use standard classifiers including K-nearest neighbor (K-NN), support vector machine (SVM), and Naive Bayes. The second framework is based on semi-supervised learning where only a limited amount of labeled training samples are available. We use an approach that is based on label propagation. Our proposed algorithms were evaluated using 15 medical simulation sessions. The results were analyzed and compared to those obtained using state-of-the-art algorithms. We show that our proposed speech segmentation fusion algorithms and feature mappings outperform existing methods. We also integrated all proposed algorithms and developed a GUI prototype system for subjective evaluation. This prototype processes medical simulation video and provides the user with a visual summary of the different speech segments. It also allows the user to browse videos and retrieve scenes that provide answers to semantic queries such as: who spoke and when; who interrupted who? and what was the emotion of the speaker? The GUI prototype can also provide summary statistics of each simulation video. Examples include: for how long did each person spoke? What is the longest uninterrupted speech segment? Is there an unusual large number of pauses within the speech segment of a given speaker

    Reconnaissance automatique du locuteur par des GMM à grande marge

    Get PDF
    Depuis plusieurs dizaines d'années, la reconnaissance automatique du locuteur (RAL) fait l'objet de travaux de recherche entrepris par de nombreuses équipes dans le monde. La majorité des systèmes actuels sont basés sur l'utilisation des Modèles de Mélange de lois Gaussiennes (GMM) et/ou des modèles discriminants SVM, i.e., les machines à vecteurs de support. Nos travaux ont pour objectif général la proposition d'utiliser de nouveaux modèles GMM à grande marge pour la RAL qui soient une alternative aux modèles GMM génératifs classiques et à l'approche discriminante état de l'art GMM-SVM. Nous appelons ces modèles LM-dGMM pour Large Margin diagonal GMM. Nos modèles reposent sur une récente technique discriminante pour la séparation multi-classes, qui a été appliquée en reconnaissance de la parole. Exploitant les propriétés des systèmes GMM utilisés en RAL, nous présentons dans cette thèse des variantes d'algorithmes d'apprentissage discriminant des GMM minimisant une fonction de perte à grande marge. Des tests effectués sur les tâches de reconnaissance du locuteur de la campagne d'évaluation NIST-SRE 2006 démontrent l'intérêt de ces modèles en reconnaissance.Most of state-of-the-art speaker recognition systems are based on Gaussian Mixture Models (GMM), trained using maximum likelihood estimation and maximum a posteriori (MAP) estimation. The generative training of the GMM does not however directly optimize the classification performance. For this reason, discriminative models, e.g., Support Vector Machines (SVM), have been an interesting alternative since they address directly the classification problem, and they lead to good performances. Recently a new discriminative approach for multiway classification has been proposed, the Large Margin Gaussian mixture models (LM-GMM). As in SVM, the parameters of LM-GMM are trained by solving a convex optimization problem. However they differ from SVM by using ellipsoids to model the classes directly in the input space, instead of half-spaces in an extended high-dimensional space. While LM-GMM have been used in speech recognition, they have not been used in speaker recognition (to the best of our knowledge). In this thesis, we propose simplified, fast and more efficient versions of LM-GMM which exploit the properties and characteristics of speaker recognition applications and systems, the LM-dGMM models. In our LM-dGMM modeling, each class is initially modeled by a GMM trained by MAP adaptation of a Universal Background Model (UBM) or directly initialized by the UBM. The models mean vectors are then re-estimated under some Large Margin constraints. We carried out experiments on full speaker recognition tasks under the NIST-SRE 2006 core condition. The experimental results are very satisfactory and show that our Large Margin modeling approach is very promising

    Phonetic aware techniques for Speaker Verification

    Get PDF
    The goal of this thesis is to improve current state-of-the-art techniques in speaker verification (SV), typically based on âidentity-vectorsâ (i-vectors) and deep neural network (DNN), by exploiting diverse (phonetic) information extracted using various techniques such as automatic speech recognition (ASR). Different speakers span different subspaces within a universal acoustic space, usually modelled by âuniversal background modelâ. The speaker-specific subspace depends on the speakerâs voice characteristics, but also on the verbalised text of a speaker. In current state-of-the-art SV systems, i-vectors are extracted by applying a factor analysis technique to obtain low dimensional speaker-specific representation. Furthermore, DNN output is also employed in a conventional i-vector framework to model phonetic information embedded in the speech signal. This thesis proposes various techniques to exploit phonetic knowledge of speech to further enrich speaker characteristics. More specifically, the techniques proposed in this thesis are applied to various SV tasks, namely, text-independent and text-dependent SV. For text-independent SV task, several ASR systems are developed and applied to compute phonetic posterior probabilities, subsequently exploited to enhance the speaker-specific information included in i-vectors. These approaches are then extended for text-dependent SV task, exploiting temporal information in a principled way, i.e., by using dynamic time warping applied on speaker informative vectors. Finally, as opposed to train DNN with phonetic information, DNN is trained in an end-to-end fashion to directly discriminate speakers. The baseline end-to-end SV approach consists of mapping a variable length speech segment to a fixed dimensional speaker vector by estimating the mean of hidden representations in DNN structure. We improve upon this technique by computing a distance function between two utterances which takes into account common phonetic units. The whole network is optimized by employing a triplet-loss objective function. The proposed approaches are evaluated on commonly used datasets such as NIST SRE 2010 and RSR2015. Significant improvements are observed over the baseline systems on both the text-dependent and text-independent SV tasks by applying phonetic knowledge

    Reconocimiento automático de locutor e idioma mediante caracterización acústica de unidades lingüísticas

    Full text link
    Tesis doctoral inédita leída en la Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones . Fecha de lectura: 30-06-201

    Speech Recognition

    Get PDF
    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes

    Session varaibility compensation in automatic speaker and language recognition

    Full text link
    Tesis doctoral inédita. Universidad Autónoma de Madrid, Escuela Politécnica Superior, octubre de 201

    Subspace Gaussian Mixture Models for Language Identification and Dysarthric Speech Intelligibility Assessment

    Get PDF
    En esta Tesis se ha investigado la aplicación de técnicas de modelado de subespacios de mezclas de Gaussianas en dos problemas relacionados con las tecnologías del habla, como son la identificación automática de idioma (LID, por sus siglas en inglés) y la evaluación automática de inteligibilidad en el habla de personas con disartria. Una de las técnicas más importantes estudiadas es el análisis factorial conjunto (JFA, por sus siglas en inglés). JFA es, en esencia, un modelo de mezclas de Gaussianas en el que la media de cada componente se expresa como una suma de factores de dimensión reducida, y donde cada factor representa una contribución diferente a la señal de audio. Esta factorización nos permite compensar nuestros modelos frente a contribuciones indeseadas presentes en la señal, como la información de canal. JFA se ha investigado como clasficador y como extractor de parámetros. En esta última aproximación se modela un solo factor que representa todas las contribuciones presentes en la señal. Los puntos en este subespacio se denominan i-Vectors. Así, un i-Vector es un vector de baja dimensión que representa una grabación de audio. Los i-Vectors han resultado ser muy útiles como vector de características para representar señales en diferentes problemas relacionados con el aprendizaje de máquinas. En relación al problema de LID, se han investigado dos sistemas diferentes de acuerdo al tipo de información extraída de la señal. En el primero, la señal se parametriza en vectores acústicos con información espectral a corto plazo. En este caso, observamos mejoras de hasta un 50% con el sistema basado en i-Vectors respecto al sistema que utilizaba JFA como clasificador. Se comprobó que el subespacio de canal del modelo JFA también contenía información del idioma, mientras que con los i-Vectors no se descarta ningún tipo de información, y además, son útiles para mitigar diferencias entre los datos de entrenamiento y de evaluación. En la fase de clasificación, los i-Vectors de cada idioma se modelaron con una distribución Gaussiana en la que la matriz de covarianza era común para todos. Este método es simple y rápido, y no requiere de ningún post-procesado de los i-Vectors. En el segundo sistema, se introdujo el uso de información prosódica y formántica en un sistema de LID basado en i-Vectors. La precisión de éste estaba por debajo de la del sistema acústico. Sin embargo, los dos sistemas son complementarios, y se obtuvo hasta un 20% de mejora con la fusión de los dos respecto al sistema acústico solo. Tras los buenos resultados obtenidos para LID, y dado que, teóricamente, los i-Vectors capturan toda la información presente en la señal, decidimos usarlos para la evaluar de manera automática la inteligibilidad en el habla de personas con disartria. Los logopedas están muy interesados en esta tecnología porque permitiría evaluar a sus pacientes de una manera objetiva y consistente. En este caso, los i-Vectors se obtuvieron a partir de información espectral a corto plazo de la señal, y la inteligibilidad se calculó a partir de los i-Vectors obtenidos para un conjunto de palabras dichas por el locutor evaluado. Comprobamos que los resultados eran mucho mejores si en el entrenamiento del sistema se incorporaban datos de la persona que iba a ser evaluada. No obstante, esta limitación podría aliviarse utilizando una mayor cantidad de datos para entrenar el sistema.In this Thesis, we investigated how to effciently apply subspace Gaussian mixture modeling techniques onto two speech technology problems, namely automatic spoken language identification (LID) and automatic intelligibility assessment of dysarthric speech. One of the most important of such techniques in this Thesis was joint factor analysis (JFA). JFA is essentially a Gaussian mixture model where the mean of the components is expressed as a sum of low-dimension factors that represent different contributions to the speech signal. This factorization makes it possible to compensate for undesired sources of variability, like the channel. JFA was investigated as final classiffer and as feature extractor. In the latter approach, a single subspace including all sources of variability is trained, and points in this subspace are known as i-Vectors. Thus, one i-Vector is defined as a low-dimension representation of a single utterance, and they are a very powerful feature for different machine learning problems. We have investigated two different LID systems according to the type of features extracted from speech. First, we extracted acoustic features representing short-time spectral information. In this case, we observed relative improvements with i-Vectors with respect to JFA of up to 50%. We realized that the channel subspace in a JFA model also contains language information whereas i-Vectors do not discard any language information, and moreover, they help to reduce mismatches between training and testing data. For classification, we modeled the i-Vectors of each language with a Gaussian distribution with covariance matrix shared among languages. This method is simple and fast, and it worked well without any post-processing. Second, we introduced the use of prosodic and formant information with the i-Vectors system. The performance was below the acoustic system but both were found to be complementary and we obtained up to a 20% relative improvement with the fusion with respect to the acoustic system alone. Given the success in LID and the fact that i-Vectors capture all the information that is present in the data, we decided to use i-Vectors for other tasks, specifically, the assessment of speech intelligibility in speakers with different types of dysarthria. Speech therapists are very interested in this technology because it would allow them to objectively and consistently rate the intelligibility of their patients. In this case, the input features were extracted from short-term spectral information, and the intelligibility was assessed from the i-Vectors calculated from a set of words uttered by the tested speaker. We found that the performance was clearly much better if we had available data for training of the person that would use the application. We think that this limitation could be relaxed if we had larger databases for training. However, the recording process is not easy for people with disabilities, and it is difficult to obtain large datasets of dysarthric speakers open to the research community. Finally, the same system architecture for intelligibility assessment based on i-Vectors was used for predicting the accuracy that an automatic speech recognizer (ASR) system would obtain with dysarthric speakers. The only difference between both was the ground truth label set used for training. Predicting the performance response of an ASR system would increase the confidence of speech therapists in these systems and would diminish health related costs. The results were not as satisfactory as in the previous case, probably because an ASR is a complex system whose accuracy can be very difficult to be predicted only with acoustic information. Nonetheless, we think that we opened a door to an interesting research direction for the two problems

    A speaker classification framework for non-intrusive user modeling : speech-based personalization of in-car services

    Get PDF
    Speaker Classification, i.e. the automatic detection of certain characteristics of a person based on his or her voice, has a variety of applications in modern computer technology and artificial intelligence: As a non-intrusive source for user modeling, it can be employed for personalization of human-machine interfaces in numerous domains. This dissertation presents a principled approach to the design of a novel Speaker Classification system for automatic age and gender recognition which meets these demands. Based on literature studies, methods and concepts dealing with the underlying pattern recognition task are developed. The final system consists of an incremental GMM-SVM supervector architecture with several optimizations. An extensive data-driven experiment series explores the parameter space and serves as evaluation of the component. Further experiments investigate the language-independence of the approach. As an essential part of this thesis, a framework is developed that implements all tasks associated with the design and evaluation of Speaker Classification in an integrated development environment that is able to generate efficient runtime modules for multiple platforms. Applications from the automotive field and other domains demonstrate the practical benefit of the technology for personalization, e.g. by increasing local danger warning lead time for elderly drivers.Die Sprecherklassifikation, also die automatische Erkennung bestimmter Merkmale einer Person anhand ihrer Stimme, besitzt eine Vielzahl von Anwendungsmöglichkeiten in der modernen Computertechnik und Künstlichen Intelligenz: Als nicht-intrusive Wissensquelle für die Benutzermodellierung kann sie zur Personalisierung in vielen Bereichen eingesetzt werden. In dieser Dissertation wird ein fundierter Ansatz zum Entwurf eines neuartigen Sprecherklassifikationssystems zur automatischen Bestimmung von Alter und Geschlecht vorgestellt, welches diese Anforderungen erfüllt. Ausgehend von Literaturstudien werden Konzepte und Methoden zur Behandlung des zugrunde liegenden Mustererkennungsproblems entwickelt, welche zu einer inkrementell arbeitenden GMM-SVM-Supervector-Architektur mit diversen Optimierungen führen. Eine umfassende datengetriebene Experimentalreihe dient der Erforschung des Parameterraumes und zur Evaluierung der Komponente. Weitere Studien untersuchen die Sprachunabhängigkeit des Ansatzes. Als wesentlicher Bestandteil der Arbeit wird ein Framework entwickelt, das alle im Zusammenhang mit Entwurf und Evaluierung von Sprecherklassifikation anfallenden Aufgaben in einer integrierten Entwicklungsumgebung implementiert, welche effiziente Laufzeitmodule für verschiedene Plattformen erzeugen kann. Anwendungen aus dem Automobilbereich und weiteren Domänen demonstrieren den praktischen Nutzen der Technologie zur Personalisierung, z.B. indem die Vorlaufzeit von lokalen Gefahrenwarnungen für ältere Fahrer erhöht wird

    Selected Topics in Audio-based Recommendation of TV Content

    Get PDF
    corecore