101 research outputs found

    Optimizing laryngeal pathology detection by using combined cepstral features

    ABSTRACT: Several diseases, which can be organic or neurological, affect human voice quality. Acoustic analysis of voice features can be used as a complementary and noninvasive tool for the diagnosis of laryngeal pathologies. The reliability and effectiveness of the discrimination process depend on appropriate acoustic feature extraction. This work presents a parametric method based on cepstral features to discriminate pathological voices of speakers affected by vocal fold edema and paralysis from healthy voices. Cepstral, weighted cepstral, delta cepstral, and weighted delta cepstral coefficients are obtained from speech signals. In the classification process, vector quantization is carried out individually for each feature, associated with a distortion measurement. The goal is to evaluate the performance of a classifier based on the individual and combined cepstral features. The average, the product, and the weighted average are the combination strategies applied, yielding a multiple classifier that is more efficient than each individual technique. To assess the accuracy of the system, 153 speech files of the sustained vowel /ah/ (53 healthy, 44 vocal fold edema, and 56 paralysis) from the Disordered Voice Database of the Massachusetts Eye and Ear Infirmary (MEEI) are used. Results show that the employed parameters are complementary and can be used to detect vocal disorders caused by the presence of vocal fold pathologies.
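
The three combination strategies named in this abstract (average, product, weighted average) can be illustrated with a minimal Python sketch. It assumes each feature stream yields one vector-quantization distortion per class, with lower values meaning a better match; the function names, the input layout, and the example weights are hypothetical and are not taken from the paper, which may normalize or weight the scores differently.

```python
from math import prod


def combine_distortions(distortions, strategy="average", weights=None):
    """Combine per-feature distortions into one score per class.

    distortions[f][c] is the (assumed) VQ distortion of feature stream f
    against the codebook of class c; lower means a closer match.
    """
    n_features = len(distortions)
    n_classes = len(distortions[0])
    combined = []
    for c in range(n_classes):
        column = [distortions[f][c] for f in range(n_features)]
        if strategy == "average":
            combined.append(sum(column) / n_features)
        elif strategy == "product":
            combined.append(prod(column))
        elif strategy == "weighted":
            if weights is None or len(weights) != n_features:
                raise ValueError("weighted strategy needs one weight per feature")
            total = sum(w * d for w, d in zip(weights, column))
            combined.append(total / sum(weights))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return combined


def classify(distortions, strategy="average", weights=None):
    """Return the index of the class with the lowest combined distortion."""
    combined = combine_distortions(distortions, strategy, weights)
    return min(range(len(combined)), key=combined.__getitem__)


# Hypothetical distortions: three feature streams, two classes.
example = [[0.2, 0.5], [0.6, 0.3], [0.1, 0.4]]
print(classify(example, "average"))
print(classify(example, "product"))
print(classify(example, "weighted", weights=[1, 4, 1]))
```

In this toy input the second feature stream disagrees with the other two, so a weighted average that trusts it more can change the decision. That is the kind of complementarity a combined classifier is meant to exploit.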

    Intra- and Inter-database Study for Arabic, English, and German Databases: Do Conventional Speech Features Detect Voice Pathology?

    A large population around the world has voice complications. Various approaches for subjective and objective evaluation have been suggested in the literature. The subjective approach depends strongly on the experience and area of expertise of the clinician, and human error cannot be neglected. The objective, or automatic, approach, on the other hand, is noninvasive. Automatic systems can provide complementary information that may help a clinician in the early screening of a voice disorder. They can also be deployed in remote areas, where a general practitioner can use them and refer the patient to a specialist to avoid complications that may be life threatening. Many automatic systems for disorder detection have been developed by applying different types of conventional speech features, such as linear prediction coefficients, linear prediction cepstral coefficients, and Mel-frequency cepstral coefficients (MFCCs). This study aims to ascertain whether conventional speech features detect voice pathology reliably, and whether they can be correlated with voice quality. To investigate this, an automatic detection system based on MFCCs was developed, and three different voice disorder databases were used. The experimental results suggest that the accuracy of the MFCC-based system varies from database to database. The intra-database detection rate ranges from 72% to 95%, and the inter-database detection rate from 47% to 82%. The study concludes that conventional speech features are not correlated with voice, and hence are not reliable in pathology detection.

    Discriminative features for GMM and i-vector based speaker diarization

    Speaker diarization has received considerable research attention over the last decade. Among the different domains of speaker diarization, the meeting domain is the most challenging. It usually contains spontaneous speech and is, for example, susceptible to reverberation. The appropriate selection of speech features is one of the factors that affect the performance of speaker diarization systems. Mel Frequency Cepstral Coefficients (MFCC) are the most widely used short-term speech features in speaker diarization. Other factors that affect performance are the techniques employed for speaker segmentation and speaker clustering. In this thesis, we propose the use of jitter and shimmer long-term voice-quality features for both Gaussian Mixture Modeling (GMM) and i-vector based speaker diarization systems. The voice-quality features are used together with state-of-the-art short-term cepstral and long-term speech features. The long-term features consist of prosody and Glottal-to-Noise Excitation ratio (GNE) descriptors. First, the voice-quality, prosodic, and GNE features are stacked in the same feature vector. Then, they are fused with the cepstral coefficients at the score likelihood level in both the proposed GMM and i-vector based speaker diarization systems. For the proposed GMM based system, independent HMM models are estimated from the short-term and long-term speech feature sets. In speaker segmentation, the short-term descriptors are fused with the long-term ones by linearly weighting the log-likelihood scores of Viterbi decoding. In speaker clustering, the short-term cepstral features are fused with the long-term ones by linearly fusing the Bayesian Information Criterion (BIC) scores corresponding to these feature sets.
For the proposed i-vector based speaker diarization system, speaker segmentation is carried out exactly as in the GMM based system described above. However, the speaker clustering technique is based on the recently introduced factor analysis paradigm. Two sets of i-vectors are extracted from the speaker segmentation hypothesis. Whilst the first i-vector is extracted from short-term cepstral features, the second is extracted from the voice-quality, prosody, and GNE descriptors. Then, the cosine-distance and Probabilistic Linear Discriminant Analysis (PLDA) scores of the i-vectors are linearly weighted to obtain a fused similarity score. Finally, the fused score is used as the speaker clustering distance. We also propose the use of delta dynamic features for speaker clustering. The motivation for using deltas in clustering is that delta dynamic features capture the transitional characteristics of the speech signal, which contain speaker-specific information. This information is not captured by the static cepstral coefficients. The delta features are used together with the short-term static cepstral coefficients and long-term speech features (i.e., voice-quality, prosody, and GNE) in both the GMM and i-vector based speaker diarization systems. The experiments were carried out on the Augmented Multi-party Interaction (AMI) meeting corpus. The experimental results show that the use of voice-quality, prosody, GNE, and delta dynamic features improves the performance of both GMM and i-vector based speaker diarization systems.
Postprint (published version)
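
The linear score fusion described in this abstract can be sketched as follows. This is a minimal Python illustration only: it assumes a single fusion weight `alpha` and uses cosine similarity alone. The PLDA scoring mentioned in the abstract is omitted, and the function names, the vectors, and the weight value are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def fuse_scores(score_short, score_long, alpha):
    """Linearly weight a short-term score against a long-term score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * score_short + (1.0 - alpha) * score_long


def fused_similarity(short_a, short_b, long_a, long_b, alpha):
    """Fuse the similarities of two clusters' short- and long-term vectors."""
    return fuse_scores(
        cosine_similarity(short_a, short_b),
        cosine_similarity(long_a, long_b),
        alpha,
    )


# Hypothetical 2-dimensional vectors standing in for i-vectors.
print(fused_similarity([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], alpha=0.75))
```

With `alpha` close to 1 the cepstral stream dominates, and lower values give more influence to the voice-quality, prosodic, and GNE stream. The same weighted-sum form applies to the log-likelihood and BIC fusion mentioned for the GMM based system, with those scores in place of the cosine similarities.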

    A Comprehensive Survey of Automatic Dysarthric Speech Recognition

    Automatic dysarthric speech recognition (DSR) is crucial for many human-computer interaction systems that enable people to interact with machines in a natural way. This article presents a comprehensive survey of recent advances in automatic DSR using machine learning (ML) and deep learning (DL) paradigms. It focuses on the methodology, databases, evaluation metrics, and major findings of previous approaches. The survey presents the challenges related to DSR, such as individual variability, limited training data, contextual understanding, articulation variability, vocal quality changes, and speaking rate variations. From the literature, it identifies the gaps between existing and previous work on DSR and outlines future directions for improving DSR.

    Entropies from Markov Models as Complexity Measures of Embedded Attractors

    ABSTRACT: This paper addresses the problem of measuring complexity from embedded attractors as a way to characterize changes in the dynamical behavior of different types of systems with quasi-periodic behavior by observing their outputs. To measure the stability of the attractor's trajectories over time, the paper proposes three new entropy estimators derived from a Markov model of the embedded attractor. The proposed estimators are compared with traditional nonparametric entropy measures, such as approximate entropy, sample entropy, and fuzzy entropy, which take into account only the spatial dimension of the trajectory. The method uses an unsupervised algorithm to find the principal curve, which is considered the “profile trajectory” and serves to fit the Markov model. The new entropy measures are evaluated using three synthetic experiments and three datasets of physiological signals. In terms of consistency and discrimination capability, the results show that the proposed measures perform better than the other entropy measures used for comparison.
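
For context, sample entropy, one of the traditional baselines named in this abstract, can be sketched in plain Python. This is not the Markov-model estimator the paper proposes. It assumes an absolute tolerance `r` with a non-strict threshold (in practice `r` is often set relative to the standard deviation of the series) and uses the Chebyshev distance between templates; the function name and the example series are hypothetical.

```python
import math


def sample_entropy(series, m=2, r=0.2):
    """Sample entropy: -ln(A / B), where B counts matching template pairs
    of length m and A counts matching pairs of length m + 1."""
    n = len(series)
    if n <= m + 1:
        raise ValueError("series is too short for the chosen m")

    def count_matches(length):
        # Use the first n - m start positions for both lengths so that
        # the two counts are taken over the same number of templates.
        templates = [series[i:i + length] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                distance = max(abs(a - b) for a, b in zip(templates[i], templates[j]))
                if distance <= r:
                    count += 1
        return count

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        return math.inf  # no matches: the estimate is undefined, reported as infinity
    return -math.log(a / b)


# A strictly periodic series is fully predictable, so its sample entropy is zero.
print(sample_entropy([0, 1] * 10, m=2, r=0.1))
# Breaking the pattern at the end raises the estimate above zero.
print(sample_entropy([0, 1, 0, 1, 0, 1, 0, 1, 0, 0], m=2, r=0.1))
```

The measure compares only how close templates lie to one another in the embedding space, which is the limitation the abstract attributes to the traditional nonparametric estimators.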

    Analysis and detection of human emotion and stress from speech signals

    Ph.D (Doctor of Philosophy)

    Time series classification methodology using reproducing kernel Hilbert spaces embedding

    Time series classification is a fundamental task in machine learning and pattern recognition because of its many applications in the state of the art, such as stock market analysis, medicine, sensor networks, scientific experiments on moving objects, biology, and shape classification. Most data-driven models assume that the observations are independent and identically distributed. However, under this assumption certain discriminative factors may be overlooked.