110 research outputs found

    Jitter and Shimmer measurements for speaker diarization

    Jitter and shimmer voice-quality features have been used successfully to characterize speaker voice traits and to detect voice pathologies. Jitter and shimmer measure variations in the fundamental frequency and amplitude of a speaker's voice, respectively. By their nature, they can be used to assess differences between speakers. In this paper, we investigate the usefulness of these voice-quality features for speaker diarization. The combination of voice-quality features with the conventional spectral features, Mel-Frequency Cepstral Coefficients (MFCC), is evaluated on the Augmented Multi-party Interaction (AMI) corpus, a set of multi-party, spontaneous speech recordings. The two feature sets are modeled independently using Gaussian mixtures and fused at the score likelihood level. Experiments on the AMI corpus show that adding jitter and shimmer measurements to the baseline spectral features decreases the diarization error rate in most of the recordings. Peer reviewed. Postprint (published version).
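
As a rough illustration of what the abstract above means by jitter and shimmer, the sketch below computes one common "local" variant of each: the mean absolute difference between consecutive pitch periods (or peak amplitudes), divided by the mean period (or amplitude). The function names and the choice of the local variant are assumptions made for this example; the abstract does not say which jitter and shimmer measurements the authors extract.

```python
def relative_perturbation(values):
    """Mean absolute difference of consecutive values, divided by the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    mean_value = sum(values) / len(values)
    if mean_value == 0:
        raise ValueError("mean value must be non-zero")
    diffs = [abs(b - a) for a, b in zip(values[:-1], values[1:])]
    return (sum(diffs) / len(diffs)) / mean_value


def local_jitter(periods):
    """Cycle-to-cycle variation of the pitch period (hypothetical helper)."""
    return relative_perturbation(periods)


def local_shimmer(amplitudes):
    """Cycle-to-cycle variation of the peak amplitude (hypothetical helper)."""
    return relative_perturbation(amplitudes)


if __name__ == "__main__":
    # Made-up pitch periods (seconds) and peak amplitudes for one voiced segment.
    periods = [0.0100, 0.0102, 0.0099, 0.0101]
    amplitudes = [0.80, 0.78, 0.82, 0.79]
    print("jitter:", local_jitter(periods))
    print("shimmer:", local_shimmer(amplitudes))
```

A perfectly periodic signal gives zero for both measures; larger values indicate more cycle-to-cycle irregularity.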

    Discriminative features for GMM and i-vector based speaker diarization

    Speaker diarization has received considerable research attention over the last decade. Among the different domains of speaker diarization, the meeting domain is the most challenging: it usually contains spontaneous speech and is, for example, susceptible to reverberation. The appropriate selection of speech features is one of the factors that affect the performance of speaker diarization systems. Mel Frequency Cepstral Coefficients (MFCC) are the most widely used short-term speech features in speaker diarization. Other factors that affect performance are the techniques employed for speaker segmentation and speaker clustering. In this thesis, we propose the use of jitter and shimmer long-term voice-quality features for both Gaussian Mixture Modeling (GMM) and i-vector based speaker diarization systems. The voice-quality features are used together with state-of-the-art short-term cepstral and long-term speech features. The long-term features consist of prosody and Glottal-to-Noise Excitation ratio (GNE) descriptors. First, the voice-quality, prosodic and GNE features are stacked in the same feature vector. Then, they are fused with the cepstral coefficients at the score likelihood level in both the proposed GMM and i-vector based speaker diarization systems. For the proposed GMM based system, independent HMM models are estimated from the short-term and long-term speech feature sets. In speaker segmentation, the short-term descriptors are fused with the long-term ones by linearly weighting the log-likelihood scores of Viterbi decoding. In speaker clustering, the short-term cepstral features are fused with the long-term ones by linearly combining the Bayesian Information Criterion (BIC) scores corresponding to these feature sets.
For the proposed i-vector based speaker diarization system, speaker segmentation is carried out exactly as in the GMM based system described above. However, the speaker clustering technique is based on the recently introduced factor analysis paradigm. Two sets of i-vectors are extracted from the speaker segmentation hypothesis: the first from the short-term cepstral features, and the second from the voice-quality, prosody and GNE descriptors. Then, the cosine-distance and Probabilistic Linear Discriminant Analysis (PLDA) scores of the i-vectors are linearly weighted to obtain a fused similarity score. Finally, the fused score is used as the speaker clustering distance. We have also proposed the use of delta dynamic features for speaker clustering. The motivation for using deltas in clustering is that they capture the transitional characteristics of the speech signal, which contain speaker-specific information that is not captured by the static cepstral coefficients. The delta features are used together with the short-term static cepstral coefficients and the long-term speech features (i.e., voice-quality, prosody and GNE) in both the GMM and i-vector based speaker diarization systems. The experiments have been carried out on the Augmented Multi-party Interaction (AMI) meeting corpus. The experimental results show that the use of voice-quality, prosody, GNE and delta dynamic features improves the performance of both GMM and i-vector based speaker diarization systems. Postprint (published version).
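
The score-level fusion described in this abstract can be sketched as a weighted sum of per-stream scores, and the i-vector comparison as a cosine similarity. This is a minimal illustration only: the function names, the constraint that the two weights sum to one, and the example numbers are assumptions, not details taken from the thesis.

```python
import math


def fuse_scores(short_term_score, long_term_score, weight):
    """Linearly fuse two scores; `weight` goes to the short-term stream.

    Assumes the two weights sum to one (an assumption of this sketch).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * short_term_score + (1.0 - weight) * long_term_score


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("vectors must be non-zero")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


if __name__ == "__main__":
    # Made-up log-likelihoods of one segment under the two feature streams.
    print(fuse_scores(-42.0, -55.0, weight=0.9))

    # Made-up i-vectors for two clusters, one pair per feature stream.
    cepstral_sim = cosine_similarity([0.2, 0.5, -0.1], [0.1, 0.6, 0.0])
    long_term_sim = cosine_similarity([0.3, -0.2], [0.4, -0.1])
    print(fuse_scores(cepstral_sim, long_term_sim, weight=0.8))
```

The same `fuse_scores` helper would apply to BIC or PLDA scores; in practice the weight would be tuned on development data.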

    The use of long-term features for GMM- and i-vector-based speaker diarization systems

    Several factors contribute to the performance of speaker diarization systems. For instance, the appropriate selection of speech features is one of the key aspects that affect speaker diarization systems. Other factors include the techniques employed to perform both segmentation and clustering. While the static mel frequency cepstral coefficients are the most widely used features in speech-related tasks including speaker diarization, several studies have shown the benefits of augmenting regular speech features with the static ones. In this work, we propose and assess the use of voice-quality features (i.e., jitter, shimmer, and Glottal-to-Noise Excitation ratio) within the framework of speaker diarization. These acoustic attributes are employed together with state-of-the-art short-term cepstral and long-term prosodic features. Additionally, the use of delta dynamic features is explored separately for the segmentation and bottom-up clustering sub-tasks. The different feature sets are combined at several levels. At the feature level, the long-term speech features are stacked in the same feature vector. At the score level, the short- and long-term speech features are independently modeled and fused at the score likelihood level. Various feature combinations have been applied to both Gaussian mixture modeling and i-vector-based speaker diarization systems. The experiments have been carried out on the Augmented Multi-party Interaction meeting corpus. The best result, in terms of diarization error rate, is obtained using i-vector-based cosine-distance clustering together with a signal parameterization consisting of a combination of static cepstral coefficients, delta, voice-quality, and prosodic features. 
The best result shows about 24% relative diarization error rate improvement compared to the baseline system, which is based on Gaussian mixture modeling and short-term static cepstral coefficients. Peer reviewed. Postprint (published version).
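
The delta dynamic features mentioned in this abstract are commonly computed with a regression over neighbouring frames. The sketch below uses that standard formula; the window size of 2 and the edge handling (repeating the first and last frame) are assumptions for illustration, since the abstract does not give the authors' settings.

```python
def delta_features(frames, window=2):
    """Regression-based deltas for a sequence of feature vectors.

    frames: list of equal-length lists, one static feature vector per frame.
    window: number of frames on each side (assumed value, not from the paper).
    Edges are handled by repeating the first and last frame.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, num_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


if __name__ == "__main__":
    # Made-up two-dimensional static features for five frames.
    static = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]]
    for static_row, delta_row in zip(static, delta_features(static)):
        # Static and delta coefficients stacked into one vector per frame.
        print(static_row + delta_row)
```

A coefficient that rises steadily gets a constant positive delta away from the edges, while a constant coefficient gets a delta of zero, which is the transitional information the static features alone do not carry.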

    Prosodic and other Long-Term Features for Speaker Diarization


    Predicting continuous conflict perception with Bayesian Gaussian processes

    Conflict is one of the most important phenomena of social life, but it is still largely neglected by the computing community. This work proposes an approach that detects common conversational social signals (loudness, overlapping speech, etc.) and predicts the conflict level perceived by human observers in continuous, non-categorical terms. The proposed regression approach is fully Bayesian and adopts Automatic Relevance Determination to identify the social signals that most influence the outcome of the prediction. The experiments are performed on the SSPNet Conflict Corpus, a publicly available collection of 1430 clips extracted from televised political debates (roughly 12 hours of material for 138 subjects in total). The results show that it is possible to achieve a correlation close to 0.8 between actual and predicted conflict perception.

    Detection and handling of overlapping speech for speaker diarization

    For the last several years, speaker diarization has been attracting substantial research attention as one of the spoken language technologies applied to improve, or enrich, recording transcriptions. Recordings of meetings, compared to other domains, exhibit increased complexity due to the spontaneity of speech, reverberation effects, and the presence of overlapping speech. Overlapping speech refers to situations in which two or more speakers are speaking simultaneously. In meeting data, a substantial portion of the errors of conventional speaker diarization systems can be ascribed to speaker overlaps, since usually only one speaker label is assigned per segment. Furthermore, simultaneous speech included in training data can eventually lead to corrupt single-speaker models and thus to a worse segmentation. This thesis concerns the detection of overlapping speech segments and its further application to the improvement of speaker diarization performance. We propose the use of three spatial cross-correlation-based parameters for overlap detection on distant microphone channel data. Spatial features from different microphone pairs are fused by means of principal component analysis, linear discriminant analysis, or a multi-layer perceptron. In addition, we investigate the possibility of employing long-term prosodic information. The most suitable subset from a set of candidate prosodic features is determined in two steps. First, a ranking according to the mRMR criterion is obtained, and then a standard hill-climbing wrapper approach is applied to determine the optimal number of features. The novel spatial and prosodic parameters are used in combination with spectral-based features suggested previously in the literature. In experiments conducted on AMI meeting data, we show that the newly proposed features do contribute to the detection of overlapping speech, especially on data originating from a single recording site. 
In speaker diarization, for segments including detected speaker overlap, a second speaker label is picked, and such segments are also discarded from model training. The proposed overlap labeling technique is integrated in Viterbi decoding, a part of the diarization algorithm. During system development it was discovered that it is favorable to optimize overlap exclusion and overlap labeling independently with respect to the overlap detection system. We report improvements over the baseline diarization system on both single- and multi-site AMI data. Preliminary experiments with NIST RT data show DER improvement on the RT '09 meeting recordings as well. The addition of beamforming and a TDOA feature stream to the baseline diarization system, which was aimed at improving the clustering process, results in slightly higher effectiveness of the overlap labeling algorithm. A more detailed analysis of the overlap exclusion behavior reveals large differences in improvement between individual meeting recordings as well as between various settings of the overlap detection operating point. However, high performance variability across different recordings is also typical of the baseline diarization system, without any overlap handling.

    Review of Research on Speech Technology: Main Contributions From Spanish Research Groups

    In the last two decades, there has been a significant increase in research on speech technology in Spain, mainly due to a higher level of funding from European, Spanish and local institutions and to a growing interest in these technologies for developing new services and applications. This paper provides a review of the main areas of speech technology addressed by research groups in Spain, their main contributions in recent years and their main current focus of interest. The description is organized into five main areas: audio processing including speech, speaker characterization, speech and language processing, text-to-speech conversion and spoken language applications. This paper also introduces the Spanish Network of Speech Technologies (RTTH, Red Temática en Tecnologías del Habla) as the research network that includes almost all the researchers working in this area, presenting some figures, its objectives and the main activities it has carried out in recent years.

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and the methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications that are able to operate in real-world environments, such as mobile communication services and smart homes.

    Advances in Subspace-based Solutions for Diarization in the Broadcast Domain

    The motivation for this thesis is the need for robust solutions to the diarization problem. Diarization techniques should add value to the growing amount of available multimedia data by accurately discriminating the speakers present in the audio signal. Unfortunately, until recently this kind of technology was viable only under restricted conditions, and therefore remained far from a general solution. The reasons behind the limited performance of diarization systems are manifold. The first cause to consider is the high complexity of human speech production, in particular the physiological processes that embed speaker-discriminative characteristics in the speech signal. This complexity makes the inverse process, estimating those characteristics from the audio, an inefficient task with current state-of-the-art techniques. Consequently, approximations must be used instead. Modeling efforts have produced increasingly elaborate models, although they do not seek the ultimate physiological explanation of the speech signal. Instead, these models learn relationships between acoustic signals from a large set of training data. The development of approximate models in turn gives rise to a second reason: domain variability. Because the relationships are learned from a specific training set, any domain change that alters the acoustic conditions with respect to the training data affects the assumed relationships and can cause consistent failures in the systems. Our contribution to diarization technologies has focused on the broadcast domain.
This domain is still a complex environment for diarization systems, in which no simplification of the task can be assumed. Therefore, efficient modeling of the audio must be developed to extract speaker information and to infer the corresponding labeling. In addition, the presence of multiple acoustic conditions, due to the different programs and/or genres in the domain, requires techniques able to adapt the knowledge acquired in a scenario where information is available to environments where such information is limited or simply unavailable. To this end, the work carried out throughout the thesis has focused on three subtasks: speaker characterization, clustering, and model adaptation. The first subtask seeks to model an audio fragment so as to obtain accurate representations of the speakers involved, bringing out their discriminative properties. In this area, a study of current modeling strategies has been carried out, paying particular attention to the limitations of the extracted representations and highlighting the kinds of errors they can produce. In addition, alternatives based on neural networks have been proposed, making use of the knowledge acquired. The second subtask is clustering, which is responsible for developing strategies that seek the optimal labeling of the speakers. The research carried out during this thesis has proposed new strategies for estimating the best speaker partition based on subspace techniques, especially PLDA. Finally, the model adaptation subtask seeks to transfer the knowledge obtained from a training set to alternative domains where no data is available to extract it.
To this end, efforts have focused on the unsupervised extraction of speaker information from the very audio to be diarized, which is then used to adapt the models involved.