56 research outputs found

    Source Coding Optimization for Distributed Average Consensus

    Consensus is a common method for computing a function of the data distributed among the nodes of a network. Of particular interest is distributed average consensus, whereby the nodes iteratively compute the sample average of the data stored at all the nodes of the network using only near-neighbor communications. In real-world scenarios, these communications must undergo quantization, which introduces distortion to the internode messages. In this thesis, a model for the evolution of the network state statistics at each iteration is developed under the assumptions of Gaussian data and additive quantization error. It is shown that minimization of the communication load in terms of aggregate source coding rate can be posed as a generalized geometric program, for which an equivalent convex optimization can efficiently solve for the global minimum. Optimization procedures are developed for rate-distortion-optimal vector quantization, uniform entropy-coded scalar quantization, and fixed-rate uniform quantization. Numerical results demonstrate the performance of these approaches. For small numbers of iterations, the fixed-rate optimizations are verified using exhaustive search. Comparison to the prior art suggests competitive performance under certain circumstances but strongly motivates the incorporation of more sophisticated coding strategies, such as differential, predictive, or Wyner-Ziv coding. Comment: Master's Thesis, Electrical Engineering, North Carolina State University
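
    As a rough illustration of the setting described in this abstract, the sketch below runs average consensus in which nodes exchange uniformly quantized states. It is not the thesis's model or optimization: the update rule x <- x + (W - I) q(x), the 4-node ring, the weight matrix W, the step size, and all function names are assumptions chosen only to show how quantization noise enters the iteration.

```python
import numpy as np

def quantize(x, step):
    """Uniform scalar quantizer: round each entry to the nearest multiple of `step`."""
    return step * np.round(np.asarray(x, dtype=float) / step)

def quantized_average_consensus(x0, W, step, iters):
    """Iterate x <- x + (W - I) q(x), where q is the quantizer.

    Nodes exchange only quantized states; each node corrects with its own
    quantized value, so a doubly stochastic W preserves the network average.
    """
    x = np.asarray(x0, dtype=float).copy()
    identity = np.eye(len(x))
    for _ in range(iters):
        x = x + (W - identity) @ quantize(x, step)
    return x

# Toy 4-node ring (assumed topology): each node weights itself and
# its two neighbors by 1/3, which makes W symmetric and doubly stochastic.
W = np.array([[1, 1, 0, 1],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 1, 1]]) / 3.0
x0 = [1.0, 2.0, 3.0, 4.0]  # illustrative initial data, true mean 2.5
x = quantized_average_consensus(x0, W, step=0.01, iters=50)
print(x, x.mean())
```

    With a coarser `step` the nodes still agree on the average only up to an error on the order of the step size, which is the rate-versus-distortion trade-off the abstract refers to.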

    Efficient speaker recognition for mobile devices

    Emotion Recognition from Speech with Acoustic, Non-Linear and Wavelet-based Features Extracted in Different Acoustic Conditions

    ABSTRACT: In recent years, there has been great progress in automatic speech recognition. The challenge now is not only to recognize the semantic content of speech but also its so-called "paralinguistic" aspects, including the emotions and the personality of the speaker. This research work aims to develop a methodology for automatic emotion recognition from speech signals in non-controlled noise conditions. For that purpose, different sets of acoustic, non-linear, and wavelet-based features are used to characterize emotions in different databases created for such purpose


    This thesis describes research into effective voice biometrics (speaker recognition) under mismatched noise conditions. Over the last two decades, this class of biometrics has been the subject of considerable research due to its various applications in such areas as telephone banking, remote access control and surveillance. One of the main challenges associated with the deployment of voice biometrics in practice is that of undesired variations in speech characteristics caused by environmental noise. Such variations can in turn lead to a mismatch between the corresponding test and reference material from the same speaker. This is found to adversely affect the performance of speaker recognition in terms of accuracy. To address the above problem, a novel approach is introduced and investigated. The proposed method is based on minimising the noise mismatch between reference speaker models and the given test utterance, and involves a new form of Test-Normalisation (T-Norm) for further enhancing matching scores under the aforementioned adverse operating conditions. Through experimental investigations, based on the two main classes of speaker recognition (i.e. verification and open-set identification), it is shown that the proposed approach can significantly improve the performance accuracy under mismatched noise conditions. In order to further improve the recognition accuracy in severe mismatch conditions, an approach to enhancing the above stated method is proposed. This enhancement, which involves providing a closer adjustment of the reference speaker models to the noise condition in the test utterance, is shown to considerably increase the accuracy in extreme cases of noisy test data. Moreover, to tackle the computational burden associated with the use of the enhanced approach with open-set identification, an efficient algorithm for its realisation in this context is introduced and evaluated.
The thesis presents a detailed description of the research undertaken, describes the experimental investigations and provides a thorough analysis of the outcomes
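
    For orientation, the sketch below shows conventional T-Norm score normalisation, in which a raw match score is standardised using the scores the same test utterance obtains against a cohort of impostor models. This is the textbook form only; the new form of T-Norm proposed in the thesis is not described in the abstract and is not reproduced here. The function name, the `eps` guard, and the toy scores are assumptions.

```python
import numpy as np

def t_norm(raw_score, cohort_scores, eps=1e-12):
    """Standard T-Norm: (score - cohort mean) / cohort standard deviation.

    `cohort_scores` are the scores of the same test utterance against a
    set of impostor (cohort) speaker models.
    """
    cohort = np.asarray(cohort_scores, dtype=float)
    mu = cohort.mean()
    sigma = max(cohort.std(), eps)  # guard against a degenerate cohort
    return (raw_score - mu) / sigma

# Toy numbers for illustration only (not results from the thesis).
normalised = t_norm(2.0, [0.0, 1.0, 2.0])
print(normalised)
```

    Because the cohort statistics are computed from the test utterance itself, noise that shifts all scores for that utterance is partly cancelled, which is why T-Norm is a natural starting point under mismatched conditions.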

    Privacy-Protecting Techniques for Behavioral Data: A Survey

    Our behavior (the way we talk, walk, or think) is unique and can be used as a biometric trait. It also correlates with sensitive attributes like emotions. Hence, techniques to protect individuals' privacy against unwanted inferences are required. To consolidate knowledge in this area, we systematically reviewed applicable anonymization techniques. We taxonomize and compare existing solutions regarding privacy goals, conceptual operation, advantages, and limitations. Our analysis shows that some behavioral traits (e.g., voice) have received much attention, while others (e.g., eye-gaze, brainwaves) are mostly neglected. We also find that the evaluation methodology of behavioral anonymization techniques can be further improved

    Subspace Gaussian Mixture Models for Language Identification and Dysarthric Speech Intelligibility Assessment

    In this Thesis, we investigated how to efficiently apply subspace Gaussian mixture modeling techniques to two speech technology problems, namely automatic spoken language identification (LID) and automatic intelligibility assessment of dysarthric speech. One of the most important of these techniques was joint factor analysis (JFA). JFA is essentially a Gaussian mixture model where the mean of each component is expressed as a sum of low-dimension factors that represent different contributions to the speech signal. This factorization makes it possible to compensate for undesired sources of variability, like the channel. JFA was investigated as final classifier and as feature extractor. In the latter approach, a single subspace including all sources of variability is trained, and points in this subspace are known as i-Vectors. Thus, one i-Vector is defined as a low-dimension representation of a single utterance, and i-Vectors are a very powerful feature for different machine learning problems. We have investigated two different LID systems according to the type of features extracted from speech. First, we extracted acoustic features representing short-time spectral information. In this case, we observed relative improvements with i-Vectors with respect to JFA of up to 50%. We realized that the channel subspace in a JFA model also contains language information whereas i-Vectors do not discard any language information, and moreover, they help to reduce mismatches between training and testing data. For classification, we modeled the i-Vectors of each language with a Gaussian distribution with covariance matrix shared among languages. This method is simple and fast, and it worked well without any post-processing. Second, we introduced the use of prosodic and formant information with the i-Vectors system. 
The performance was below that of the acoustic system, but both were found to be complementary, and we obtained up to a 20% relative improvement with the fusion with respect to the acoustic system alone. Given the success in LID and the fact that i-Vectors capture all the information that is present in the data, we decided to use i-Vectors for other tasks, specifically the assessment of speech intelligibility in speakers with different types of dysarthria. Speech therapists are very interested in this technology because it would allow them to objectively and consistently rate the intelligibility of their patients. In this case, the input features were extracted from short-term spectral information, and the intelligibility was assessed from the i-Vectors calculated from a set of words uttered by the tested speaker. We found that the performance was much better when training data from the person who would use the application was available. We think that this limitation could be relaxed if we had larger databases for training. However, the recording process is not easy for people with disabilities, and it is difficult to obtain large datasets of dysarthric speakers open to the research community. Finally, the same system architecture for intelligibility assessment based on i-Vectors was used for predicting the accuracy that an automatic speech recognizer (ASR) system would obtain with dysarthric speakers. The only difference between the two was the ground-truth label set used for training. Predicting the performance of an ASR system would increase the confidence of speech therapists in these systems and would reduce health-related costs. The results were not as satisfactory as in the previous case, probably because an ASR is a complex system whose accuracy can be very difficult to predict from acoustic information alone. Nonetheless, we think that we opened a door to an interesting research direction for the two problems
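
    The classification step mentioned in this abstract (one Gaussian per language with a covariance matrix shared among languages) can be sketched as follows. This is a minimal illustration under assumptions, not the thesis's implementation: the function names, the equal class priors, the small ridge term, and the 2-dimensional toy vectors standing in for i-Vectors are all hypothetical.

```python
import numpy as np

def fit_shared_cov_gaussians(X, y, ridge=1e-6):
    """Estimate one mean per class and a single pooled covariance matrix."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)  # sorted unique labels
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    # Center every sample on its own class mean, then pool the scatter.
    centered = X - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / len(X)
    cov += ridge * np.eye(X.shape[1])  # assumed regulariser for invertibility
    return classes, means, np.linalg.inv(cov)

def predict(X, classes, means, precision):
    """Pick the class with the smallest squared Mahalanobis distance.

    With a shared covariance and equal priors (an assumption here), this is
    the same as picking the class with the highest Gaussian log-likelihood.
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - means[None, :, :]  # shape (n, k, d)
    d2 = np.einsum("nkd,de,nke->nk", diff, precision, diff)
    return classes[np.argmin(d2, axis=1)]

# Toy 2-D vectors standing in for i-Vectors of two languages (0 and 1).
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]
model = fit_shared_cov_gaussians(X_train, y_train)
labels = predict([[0.2, 0.3], [5.5, 5.2]], *model)
print(labels)
```

    Sharing the covariance keeps the number of parameters small and makes the decision boundaries linear, which is consistent with the abstract's remark that the method is simple and fast.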

    Advanced Biometrics with Deep Learning

    Biometrics, such as fingerprint, iris, face, hand print, hand vein, speech and gait recognition, have become commonplace as a means of identity management in various applications. Biometric systems follow a typical pipeline that is composed of separate preprocessing, feature extraction, and classification stages. Deep learning, as a data-driven representation learning approach, has been shown to be a promising alternative to conventional data-agnostic and handcrafted preprocessing and feature extraction for biometric systems. Furthermore, deep learning offers an end-to-end learning paradigm to unify preprocessing, feature extraction, and recognition, based solely on biometric data. This Special Issue has collected 12 high-quality, state-of-the-art research papers that deal with challenging issues in advanced biometric systems based on deep learning. The 12 papers can be divided into 4 categories according to biometric modality, namely face biometrics, medical electronic signals (EEG and ECG), voice print, and others
