7 research outputs found

    Analysis of the utility of classical and novel speech quality measures for speaker verification

    Proceedings of the Third International Conference, ICB 2009, Alghero, Italy. The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-01793-3_45. In this work, we analyze several quality measures for speaker verification from the point of view of their utility, i.e., their ability to predict performance in an authentication task. We select several quality measures derived from classic indicators of speech degradation, namely the ITU P.563 estimator of subjective quality, signal-to-noise ratio, and the kurtosis of linear predictive coefficients. Moreover, we propose a novel quality measure derived from what we have called the Universal Background Model Likelihood (UBML), which indicates the degradation of a speech utterance in terms of its divergence with respect to a given universal model. The utility of the quality measures is evaluated following the protocols and databases of the NIST Speaker Recognition Evaluations (SRE) 2006 and 2008 (telephone-only subset), and ultimately by means of error-vs.-rejection plots as recommended by NIST. The results presented in this study show significant utility for all the quality measures analyzed, as well as a moderate decorrelation among them. This work has been supported by the Spanish Ministry of Education under project TEC2006-13170-C02-01.
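The UBML idea above can be illustrated with a minimal sketch: score an utterance's feature frames under a Gaussian mixture model standing in for a universal background model, so that degraded speech, which diverges from the model, receives a lower average log-likelihood. This is only an assumed toy construction (synthetic features, a small GMM instead of a real UBM), not the authors' implementation.

```python
# Toy UBM-likelihood (UBML) quality measure: the average per-frame
# log-likelihood under a background model drops as speech degrades.
# The random matrices stand in for real MFCC frames (assumption).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# "Clean" training features for the stand-in UBM (frames x dims).
clean = rng.normal(0.0, 1.0, size=(2000, 12))
ubm = GaussianMixture(n_components=4, random_state=0).fit(clean)

def ubml_quality(features: np.ndarray) -> float:
    """Average per-frame log-likelihood under the UBM; higher = less degraded."""
    return float(ubm.score(features))

test_clean = rng.normal(0.0, 1.0, size=(300, 12))
test_noisy = test_clean + rng.normal(0.0, 3.0, size=test_clean.shape)

q_clean = ubml_quality(test_clean)
q_noisy = ubml_quality(test_noisy)
print(q_clean > q_noisy)  # degraded speech scores lower
```

The same scalar could then be thresholded to produce the error-vs.-rejection curves the abstract mentions: reject the lowest-quality trials first and re-measure error on the rest.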

    Phone-based cepstral polynomial SVM system for speaker recognition

    We have been using a phone-based cepstral system with polynomial features in NIST evaluations for the past two years. This system uses three broad phone classes, three states per class, and third-order polynomial features obtained from MFCC features. In this paper, we present a complete analysis of the system. We start from a simpler system that does not use phones or states and show that the addition of phones gives a significant improvement. We show that adding state information does not provide an improvement on its own but does provide a significant improvement when used with phone classes. We complete the system by applying nuisance attribute projection (NAP) and score normalization. We show that splitting features after a joint NAP over all phone classes results in a significant improvement. Overall, we obtain about a 25% performance improvement with polynomial features based on phones and states, and obtain a system with performance comparable to that of a state-of-the-art SVM system.
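A common way to build the third-order polynomial features mentioned above is to expand each cepstral frame into monomials up to degree three and average the expansions over the utterance, yielding one fixed-length vector per phone-class/state bin. The sketch below assumes this construction (the paper's exact recipe may differ) and uses random numbers in place of real MFCCs.

```python
# Hypothetical sketch: per-frame third-order monomial expansion of
# MFCC-like features, averaged into one utterance-level vector.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
mfcc = rng.normal(size=(50, 4))       # 50 frames x 4 cepstral dims (toy size)

poly = PolynomialFeatures(degree=3, include_bias=True)
expanded = poly.fit_transform(mfcc)   # per-frame monomials up to degree 3
utt_vec = expanded.mean(axis=0)       # average over frames -> utterance vector

print(utt_vec.shape)                  # (35,): C(4+3, 3) monomials
```

One such vector per broad phone class (and optionally per state) would then be concatenated and fed to a linear SVM, which is where NAP and score normalization would apply.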

    Automatic speaker recognition with Spanish siblings: twin brothers (monozygotic and dizygotic) and non-twins

    The performance of the automatic speaker recognition (ASR) system Batvox™ (version 4.1) has been tested with a male population of 24 monozygotic (MZ) twins, 10 dizygotic (DZ) twins, 8 non-twin siblings and 12 unrelated speakers (aged 18–52, with Standard Peninsular Spanish as their mother tongue). Since the cepstral features on which this ASR system is based depend largely on anatomical–physiological foundations, we hypothesized that such features ought to be gene-dependent. Therefore, higher similarity values should be found in MZ twins (100% shared genes) than in DZ twins, in brothers (B) or in a reference population of unrelated speakers (US). The results corroborated the expected decreasing scale MZ > DZ > B > US, since the similarity coefficients yielded by the automatic system decreased in exactly the same direction as the kinship degree of the four speaker groups diminishes. This suggests that the system's features are to a great extent genetically conditioned and are hence useful and robust for comparing speech samples of known and unknown origin, as found in legal cases. Furthermore, the 9.9% EER (equal error rate) obtained when testing MZ pairs lies around the same value (11% EER) found in Künzel (2010) with German twins.
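The EER figures quoted for the twin comparisons are the operating point at which false acceptances and false rejections balance. A minimal sketch of that computation, using synthetic same-speaker and different-speaker scores in place of the system's real similarity coefficients:

```python
# Equal error rate (EER) from two score sets, via a threshold sweep.
# Scores here are simulated stand-ins, not the study's actual output.
import numpy as np

def eer(target_scores, nontarget_scores):
    """Approximate EER: the smallest max(FAR, FRR) over observed thresholds."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best = 1.0
    for t in thresholds:
        far = np.mean(nontarget_scores >= t)  # false acceptances
        frr = np.mean(target_scores < t)      # false rejections
        best = min(best, max(far, frr))
    return best

rng = np.random.default_rng(2)
tgt = rng.normal(2.0, 1.0, 1000)   # same-speaker (e.g. MZ-pair) scores
non = rng.normal(0.0, 1.0, 1000)   # different-speaker scores

print(round(eer(tgt, non), 2))
```

With MZ twins as the non-target population, their unusually high similarity scores push the two distributions together, which is why the twin EER (9.9%) is so much worse than typical unrelated-speaker figures.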

    A Speaker Verification Backend with Robust Performance across Conditions

    In this paper, we address the problem of speaker verification in conditions unseen or unknown during development. A standard method for speaker verification consists of extracting speaker embeddings with a deep neural network and processing them through a backend composed of probabilistic linear discriminant analysis (PLDA) and global logistic regression score calibration. This method is known to result in systems that work poorly on conditions different from those used to train the calibration model. We propose to modify the standard backend, introducing an adaptive calibrator that uses duration and other automatically extracted side-information to adapt to the conditions of the inputs. The backend is trained discriminatively to optimize binary cross-entropy. When trained on a number of diverse datasets that are labeled only with respect to speaker, the proposed backend consistently and, in some cases, dramatically improves calibration, compared to the standard PLDA approach, on a number of held-out datasets, some of which are markedly different from the training data. Discrimination performance is also consistently improved. We show that joint training of the PLDA and the adaptive calibrator is essential: the same benefits cannot be achieved when freezing PLDA and fine-tuning the calibrator. To our knowledge, the results in this paper are the first evidence in the literature that it is possible to develop a speaker verification system with robust out-of-the-box performance on a large variety of conditions.
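The core calibration idea can be illustrated in miniature. A global calibrator maps a raw score s to an LLR with one affine transform, a*s + b; a side-information-aware calibrator lets the scale and offset depend on, say, log duration. One simple (assumed, not the paper's) way to get that is to add interaction features to a logistic regression, which is exactly discriminative binary cross-entropy training:

```python
# Duration-aware score calibration sketch with synthetic trials.
# Separation between same- and different-speaker scores grows with
# duration, as it does in real systems (modelling assumption).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(3)
n = 4000
labels = rng.integers(0, 2, n)                  # 1 = same-speaker trial
log_dur = rng.uniform(0.0, 3.0, n)              # side information
scores = rng.normal(labels * (0.5 + log_dur), 1.0)

# Adaptive calibrator: score, score x duration interaction, duration.
X = np.column_stack([scores, scores * log_dur, log_dur])
cal = LogisticRegression(max_iter=1000).fit(X, labels)

# Baseline: global calibrator on the raw score alone.
global_cal = LogisticRegression(max_iter=1000).fit(scores.reshape(-1, 1), labels)

adaptive_ce = log_loss(labels, cal.predict_proba(X))
global_ce = log_loss(labels, global_cal.predict_proba(scores.reshape(-1, 1)))
print(adaptive_ce < global_ce)  # side-info model achieves lower cross-entropy
```

The paper's backend goes further (joint training with PLDA, multiple side-information sources), but the cross-entropy comparison above is the same objective it optimizes.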

    Big Data analytics to assess personality based on voice analysis

    Bachelor's thesis in Telecommunication Technologies and Services Engineering. When humans speak, the series of acoustic signs they produce encodes not only the linguistic message they wish to communicate but also several other types of information about themselves and their states, which offer glimpses of their personalities and can be apprehended by judgers. As there is nowadays a trend to film job candidates' interviews, the aim of this thesis is to explore possible correlations between speech features extracted from interviews and personality characteristics established by experts, and to try to predict a candidate's Big Five personality traits: conscientiousness, agreeableness, neuroticism, openness to experience and extraversion. The features were extracted from an original database of 44 video recordings of women acquired in 2020, plus 78 acquired in 2019 and earlier for a previous study. Even though many significant correlations were found within each year's dataset, many of them proved inconsistent across the two studies. Only extraversion, and openness to a more limited extent, showed a good number of clear correlations. Essentially, extraversion was found to be related to the variation in the slope of the pitch (usually at the end of sentences), which suggests that a more "singing" voice could be associated with a higher score. In addition, spectral entropy and roll-off measurements also indicate that larger changes in the spectrum (which may likewise be related to more "singing" voices) could be associated with greater extraversion.
    Regarding the predictive modelling algorithms, aimed at estimating personality traits from the speech features obtained for the study, results were very limited in terms of accuracy and RMSE, as well as in the scatter plots for the regression models and the confusion matrices for the classification evaluation. Nevertheless, several results suggest some predictive capability, and extraversion and openness again turned out to be the most predictable traits. Better outcomes were achieved when predictions were based on one specific feature rather than on all of them or on a reduced group, as was the case for openness when estimated through linear and logistic regression based on the time spent over 90% of the variation range of the deltas of the spectral-magnitude entropy, and for extraversion, which correlates well with features describing variation in the decreasing slope of F0 and variations in the spectrum. For the predictions, several machine learning algorithms were used, such as linear regression, logistic regression and random forests.
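The analysis pipeline described, correlating one acoustic feature with an expert-rated trait and then fitting a one-feature regression, can be sketched as follows. All numbers below are simulated (a made-up "pitch-slope variation" feature and synthetic trait scores); only the procedure reflects the thesis.

```python
# Feature-trait correlation plus single-feature linear regression,
# on simulated data standing in for the 44-speaker dataset.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 44                                    # one recording per speaker (toy)
pitch_slope_var = rng.normal(size=n)      # hypothetical acoustic feature
extraversion = 0.6 * pitch_slope_var + rng.normal(scale=0.8, size=n)

r, p = stats.pearsonr(pitch_slope_var, extraversion)
slope, intercept, *_ = stats.linregress(pitch_slope_var, extraversion)

print(p < 0.05)   # an effect this size is typically significant at n = 44
```

The thesis's consistency problem then amounts to checking whether r keeps its sign and magnitude when the same test is repeated on the earlier 78-recording dataset.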

    Formant trajectories in forensic speaker recognition

    The present work investigates the performance of an approach to forensic speaker recognition that is based on parametric representations of formant trajectories. Quadratic and cubic polynomial functions are fitted to the formant contours of diphthongs. The resulting coefficients, as well as the first three to four components derived from the discrete cosine transform (DCT), are used to capture the dynamic properties of the underlying speech acoustics, and thus of the speaker characteristics. This results in a representation based on only a small number of decorrelated parameters, which are in turn used for forensic speaker recognition. The evaluation conducted in the study incorporates the calculation of likelihood ratios for use in the Bayesian approach to evidence evaluation. The advantages of this framework and its current limitations are discussed. For the calculation of the likelihood ratios, a multivariate kernel density formula developed by Aitken & Lucy (2004) is used, which takes both between-speaker and within-speaker variability into account. Automatic calibration and fusion techniques, as used in automatic speaker identification systems, are applied to the resulting scores. To further investigate the importance of duration aspects of diphthongs for forensic speaker recognition, an experiment is undertaken that evaluates the effect of time normalisation as well as of modelling segment durations with an explicit parameter.
    The performance of the parametric representation approach compared with other methods, as well as the effects of calibration and fusion, is evaluated using standard evaluation tools such as detection error trade-off (DET) plots, applied probability of error (APE) plots and Tippett plots, as well as numerical indices such as the EER and the Cllr metric.
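The trajectory parameterisation described above can be sketched directly: fit a cubic polynomial to a sampled formant contour and, alternatively, keep the first few DCT components, so the dynamic shape is compressed into a handful of coefficients. The contour below is a synthetic stand-in for a measured F2 track of a diphthong.

```python
# Parametric representation of a formant trajectory: cubic polynomial
# coefficients and leading DCT-II components of a (toy) F2 contour.
import numpy as np
from scipy.fft import dct

t = np.linspace(0.0, 1.0, 20)                          # normalised time
f2 = 1200 + 600 * t - 250 * t**2 + 30 * np.sin(6 * t)  # synthetic F2 (Hz)

poly_coeffs = np.polyfit(t, f2, deg=3)                 # cubic fit: 4 coefficients
dct_coeffs = dct(f2, norm="ortho")[:4]                 # first 4 DCT-II components

print(len(poly_coeffs), len(dct_coeffs))               # 4 4
```

Either four-number vector would then feed the multivariate kernel density likelihood-ratio formula of Aitken & Lucy (2004); time normalisation corresponds to the rescaling of t to [0, 1] above, which discards the duration that the study's explicit-parameter experiment adds back in.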