2,013 research outputs found

    Jitter and Shimmer measurements for speaker diarization

    Get PDF
    Jitter and shimmer voice quality features have been successfully used to characterize speaker voice traits and detect voice pathologies. Jitter and shimmer measure variations in the fundamental frequency and amplitude of speaker's voice, respectively. Due to their nature, they can be used to assess differences between speakers. In this paper, we investigate the usefulness of these voice quality features in the task of speaker diarization. The combination of voice quality features with the conventional spectral features, Mel-Frequency Cepstral Coefficients (MFCC), is addressed in the framework of Augmented Multiparty Interaction (AMI) corpus, a multi-party and spontaneous speech set of recordings. Both sets of features are independently modeled using mixture of Gaussians and fused together at the score likelihood level. The experiments carried out on the AMI corpus show that incorporating jitter and shimmer measurements to the baseline spectral features decreases the diarization error rate in most of the recordings.Peer ReviewedPostprint (published version

    Speaker segmentation and clustering

    Get PDF
    This survey focuses on two challenging speech processing topics, namely: speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments based on speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. Concerning speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken, the algorithm advantages and disadvantages are indicated, insight to the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering. © 2007 Elsevier B.V. All rights reserved

    Development of a Speaker Diarization System for Speaker Tracking in Audio Broadcast News: a Case Study

    Get PDF
    A system for speaker tracking in broadcast-news audio data is presented and the impacts of the main components of the system to the overall speaker-tracking performance are evaluated. The process of speaker tracking in continuous audio streams involves several processing tasks and is therefore treated as a multistage process. The main building blocks of such system include the components for audio segmentation, speech detection, speaker clustering and speaker identification. The aim of the first three processes is to find homogeneous regions in continuous audio streams that belong to one speaker and to join each region of the same speaker together. The task of organizing the audio data in this way is known as speaker diarization and plays an important role in various speech-processing applications. In our case the impact of speaker diarization was assessed in a speaker-tracking system by performing a comparative study of how each of the component influenced the overall speaker-detection results. The evaluation experiments were performed on broadcast-news audio data with a speaker-tracking system, which was capable of detecting 41 target speakers. We implemented several different approaches in each component of the system and compared their performances by inspecting the final speaker-tracking results. The evaluation results indicate the importance of the audio-segmentation and speech-detection components, while no significant improvement of the overall results was achieved by additionally including a speaker-clustering component to the speaker-tracking system

    Discriminative features for GMM and i-vector based speaker diarization

    Get PDF
    Speaker diarization has received several research attentions over the last decade. Among the different domains of speaker diarization, diarization in meeting domain is the most challenging one. It usually contains spontaneous speech and is, for example, susceptible to reverberation. The appropriate selection of speech features is one of the factors that affect the performance of speaker diarization systems. Mel Frequency Cepstral Coefficients (MFCC) are the most widely used short-term speech features in speaker diarization. Other factors that affect the performance of speaker diarization systems are the techniques employed to perform both speaker segmentation and speaker clustering. In this thesis, we have proposed the use of jitter and shimmer long-term voice-quality features both for Gaussian Mixture Modeling (GMM) and i-vector based speaker diarization systems. The voice-quality features are used together with the state-of-the-art short-term cepstral and long-term speech ones. The long-term features consist of prosody and Glottal-to-Noise excitation ratio (GNE) descriptors. Firstly, the voice-quality, prosodic and GNE features are stacked in the same feature vector. Then, they are fused with cepstral coefficients at the score likelihood level both for the proposed Gaussian Mixture Modeling (GMM) and i-vector based speaker diarization systems. For the proposed GMM based speaker diarization system, independent HMM models are estimated from the short-term and long-term speech feature sets. The fusion of the short-term descriptors with the long-term ones in speaker segmentation is carried out by linearly weighting the log-likelihood scores of Viterbi decoding. In the case of speaker clustering, the fusion of the short-term cepstral features with the long-term ones is carried out by linearly fusing the Bayesian Information Criterion (BIC) scores corresponding to these feature sets. For the proposed i-vector based speaker diarization system, the speaker segmentation is carried out exactly the same as in the previously mentioned GMM based speaker diarization system. However, the speaker clustering technique is based on the recently introduced factor analysis paradigm. Two set of i-vectors are extracted from the speaker segmentation hypothesis. Whilst the first i-vector is extracted from short-term cepstral features, the second one is extracted from the voice quality, prosody and GNE descriptors. Then, the cosine-distance and Probabilistic Linear Discriminant Analysis (PLDA) scores of i-vectors are linearly weighted to obtain a fused similarity score. Finally, the fused score is used as speaker clustering distance. We have also proposed the use of delta dynamic features for speaker clustering. The motivation for using deltas in clustering is that delta dynamic features capture the transitional characteristics of the speech signal which contain speaker specific information. This information is not captured by the static cepstral coefficients. The delta features are used together with the short-term static cepstral coefficients and long-term speech features (i.e., voice-quality, prosody and GNE) both for GMM and i-vector based speaker diarization systems. The experiments have been carried out on Augmented Multi-party Interaction (AMI) meeting corpus. The experimental results show that the use of voice-quality, prosody, GNE and delta dynamic features improve the performance of both GMM and i-vector based speaker diarization systems.La diarización del altavoz ha recibido varias atenciones de investigación durante la última década. Entre los diferentes dominios de la diarización del hablante, la diarización en el dominio del encuentro es la más difícil. Normalmente contiene habla espontánea y, por ejemplo, es susceptible de reverberación. La selección apropiada de las características del habla es uno de los factores que afectan el rendimiento de los sistemas de diarización de los altavoces. Los Coeficientes Cepstral de Frecuencia Mel (MFCC) son las características de habla de corto plazo más utilizadas en la diarización de los altavoces. Otros factores que afectan el rendimiento de los sistemas de diarización del altavoz son las técnicas empleadas para realizar tanto la segmentación del altavoz como el agrupamiento de altavoces. En esta tesis, hemos propuesto el uso de jitter y shimmer características de calidad de voz a largo plazo tanto para GMM y i-vector basada en sistemas de diarización de altavoces. Las características de calidad de voz se utilizan junto con el estado de la técnica a corto plazo cepstral y de larga duración de habla. Las características a largo plazo consisten en la prosodia y los descriptores de relación de excitación Glottal-a-Ruido (GNE). En primer lugar, las características de calidad de voz, prosódica y GNE se apilan en el mismo vector de características. A continuación, se fusionan con coeficientes cepstrales en el nivel de verosimilitud de puntajes tanto para los sistemas de diarización de altavoces basados ¿¿en el modelo Gaussian Mixture Modeling (GMM) como en los sistemas basados ¿¿en i-vector. . Para el sistema de diarización de altavoces basado en GMM propuesto, se calculan modelos HMM independientes a partir de cada conjunto de características. En la segmentación de los altavoces, la fusión de los descriptores a corto plazo con los de largo plazo se lleva a cabo mediante la ponderación lineal de las puntuaciones log-probabilidad de decodificación Viterbi. En la agrupación de altavoces, la fusión de las características cepstrales a corto plazo con las de largo plazo se lleva a cabo mediante la fusión lineal de las puntuaciones Bayesian Information Criterion (BIC) correspondientes a estos conjuntos de características. Para el sistema de diarización de altavoces basado en un vector i, la fusión de características se realiza exactamente igual a la del sistema basado en GMM antes mencionado. Sin embargo, la técnica de agrupación de altavoces se basa en el paradigma de análisis de factores recientemente introducido. Dos conjuntos de i-vectores se extraen de la hipótesis de segmentación de altavoz. Mientras que el primer vector i se extrae de características espectrales a corto plazo, el segundo se extrae de los descriptores de calidad de voz apilados, prosódicos y GNE. A continuación, las puntuaciones de coseno-distancia y Probabilistic Linear Discriminant Analysis (PLDA) entre i-vectores se ponderan linealmente para obtener una puntuación de similitud fundida. Finalmente, la puntuación fusionada se utiliza como distancia de agrupación de altavoces. También hemos propuesto el uso de características dinámicas delta para la agrupación de locutores. La motivación para el uso de deltas en la agrupación es que las características dinámicas delta capturan las características de transición de la señal de voz que contienen información específica del locutor. Esta información no es capturada por los coeficientes cepstrales estáticos. Las características delta se usan junto con los coeficientes cepstrales estáticos a corto plazo y las características de voz a largo plazo (es decir, calidad de voz, prosodia y GNE) tanto para sistemas de diarización de altavoces basados en GMM como en sistemas i-vector. Los resultados experimentales sobre AMI muestran que el uso de calidad vocal, prosódica, GNE y dinámicas delta mejoran el rendimiento de los sistemas de diarización de altavoces basados en GMM e i-vector.Postprint (published version

    Meta-learning with Latent Space Clustering in Generative Adversarial Network for Speaker Diarization

    Full text link
    The performance of most speaker diarization systems with x-vector embeddings is both vulnerable to noisy environments and lacks domain robustness. Earlier work on speaker diarization using generative adversarial network (GAN) with an encoder network (ClusterGAN) to project input x-vectors into a latent space has shown promising performance on meeting data. In this paper, we extend the ClusterGAN network to improve diarization robustness and enable rapid generalization across various challenging domains. To this end, we fetch the pre-trained encoder from the ClusterGAN and fine-tune it by using prototypical loss (meta-ClusterGAN or MCGAN) under the meta-learning paradigm. Experiments are conducted on CALLHOME telephonic conversations, AMI meeting data, DIHARD II (dev set) which includes challenging multi-domain corpus, and two child-clinician interaction corpora (ADOS, BOSCC) related to the autism spectrum disorder domain. Extensive analyses of the experimental data are done to investigate the effectiveness of the proposed ClusterGAN and MCGAN embeddings over x-vectors. The results show that the proposed embeddings with normalized maximum eigengap spectral clustering (NME-SC) back-end consistently outperform Kaldi state-of-the-art z-vector diarization system. Finally, we employ embedding fusion with x-vectors to provide further improvement in diarization performance. We achieve a relative diarization error rate (DER) improvement of 6.67% to 53.93% on the aforementioned datasets using the proposed fused embeddings over x-vectors. Besides, the MCGAN embeddings provide better performance in the number of speakers estimation and short speech segment diarization as compared to x-vectors and ClusterGAN in telephonic data.Comment: Submitted to IEEE/ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSIN
    corecore