
    Evaluating automatic speaker recognition systems: an overview of the NIST speaker recognition evaluations (1996-2014)

    Automatic speaker recognition systems show interesting properties, such as speed of processing and repeatability of results, in contrast to speaker recognition by humans. But they are usable only if they are reliable. Testability, the ability to extensively evaluate the goodness of the speaker detector's decisions, therefore becomes critical. Over the last 20 years, the US National Institute of Standards and Technology (NIST) has organized a series of text-independent Speaker Recognition Evaluations (SRE), providing the necessary speech data and evaluation protocols. Those evaluations have become not just a periodic benchmark test, but also a meeting point for a collaborative community of scientists deeply involved in the cycle of evaluations, enabling tremendous progress in an especially complex task where the speaker information is spread across different information levels (acoustic, prosodic, linguistic…) and is strongly affected by speaker-intrinsic and speaker-extrinsic variability factors. In this paper, we outline how the evaluations progressively challenged the technology with new speaking conditions and sources of variability, and how the scientific community responded to those demands. Finally, NIST SREs will be shown to be not free of shortcomings, and future challenges for speaker recognition assessment will also be discussed.

    Cross-entropy analysis of the information in forensic speaker recognition

    Proceedings of Odyssey 2008: The Speaker and Language Recognition Workshop, Stellenbosch, South Africa. In this work we analyze the average information supplied by a forensic speaker recognition system in an information-theoretic way. The objective is the transparent reporting of the performance of the system in terms of information, according to the needs of transparency and testability in forensic science. This analysis allows the derivation of a proper measure of goodness for forensic speaker recognition, the empirical cross-entropy (ECE), following previous work in the literature. We also propose an intuitive representation, namely the ECE plot, which allows forensic scientists to explain the average information given by the evidence analysis process in a clear and intuitive way. Such a representation allows the forensic scientist to assess the evidence evaluation process independently of the prior information, which is the province of the court. Fact finders may then check the average information given by the evidence analysis once the prior information is incorporated. An experimental example following the NIST SRE 2006 protocol is presented in order to highlight the adequacy of the proposed framework for the forensic inferential process. An example of the presentation in court of the average information supplied by the forensic analysis of the speech evidence is also provided, simulating a real case. This work has been supported by the Spanish Ministry of Education under project TEC2006-13170-C02-01.
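
    As an illustration of the metric, here is a minimal sketch of how empirical cross-entropy could be computed from a set of likelihood ratios, assuming same-source and different-source LR arrays as inputs; variable names and example values are illustrative, not from the paper.

        import numpy as np

        def empirical_cross_entropy(lr_ss, lr_ds, prior):
            """Empirical cross-entropy (bits) of a set of LRs, at a given
            prior probability for the same-source hypothesis."""
            lr_ss = np.asarray(lr_ss, dtype=float)  # same-source LRs
            lr_ds = np.asarray(lr_ds, dtype=float)  # different-source LRs
            odds = prior / (1.0 - prior)            # prior odds
            # Each term is -log2 of the posterior of the true hypothesis.
            pen_ss = np.mean(np.log2(1.0 + 1.0 / (lr_ss * odds)))
            pen_ds = np.mean(np.log2(1.0 + lr_ds * odds))
            return prior * pen_ss + (1.0 - prior) * pen_ds

        # Synthetic LRs for illustration; an ECE plot sweeps the prior
        # log-odds axis, e.g. from -2.5 to +2.5.
        lr_ss = [12.0, 4.0, 30.0, 0.7]
        lr_ds = [0.05, 0.4, 1.5, 0.02]
        priors = 1.0 / (1.0 + 10.0 ** -np.linspace(-2.5, 2.5, 51))
        curve = [empirical_cross_entropy(lr_ss, lr_ds, p) for p in priors]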

    Linguistically-constrained formant-based i-vectors for automatic speaker recognition

    Published in Speech Communication, vol. 76 (2016), DOI 10.1016/j.specom.2015.11.002. This paper presents a large-scale study of the discriminative abilities of formant frequencies for automatic speaker recognition. Exploiting both the static and dynamic information in formant frequencies, we present linguistically-constrained formant-based i-vector systems providing well-calibrated likelihood ratios per comparison of the occurrences of the same isolated linguistic units in two given utterances. As a first result, the reported analysis of the discriminative and calibration properties of the different linguistic units provides useful insights, for instance, to forensic phonetic practitioners. Furthermore, it is shown that the set of most discriminative units varies from speaker to speaker. Secondly, linguistically-constrained systems are combined at score level through average and logistic regression speaker-independent fusion rules, exploiting the different speaker-distinguishing information spread among the different linguistic units. Testing on the English-only trials of the core condition of the NIST 2006 SRE (24,000 voice comparisons of 5-minute telephone conversations from 517 speakers, 219 male and 298 female), we report equal error rates of 9.57% and 12.89% for male and female speakers respectively, using only formant frequencies as speaker-discriminative information. Additionally, when the formant-based system is fused with a cepstral i-vector system, we obtain relative improvements of ∼6% in EER (from 6.54% to 6.13%) and ∼15% in minDCF (from 0.0327 to 0.0279), compared to the cepstral system alone. This work has been supported by the Spanish Ministry of Economy and Competitiveness (project CMC-V2: Caracterizacion, Modelado y Compensacion de Variabilidad en la Señal de Voz, TEC2012-37585-C02-01). The authors would also like to thank SRI for providing the Decipher phonetic transcriptions of the NIST 2004, 2005 and 2006 SREs, which made this work possible.
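
    To make the fusion step concrete, here is a hedged sketch of speaker-independent score-level fusion, using scikit-learn's logistic regression in place of the paper's actual fusion training; the scores are synthetic stand-ins for the formant-based and cepstral system outputs.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        # Synthetic per-trial scores from two systems (stand-ins only).
        rng = np.random.default_rng(0)
        n = 1000
        labels = rng.integers(0, 2, n)                     # 1 = target trial
        s_formant = 1.0 * labels + rng.normal(0, 1.0, n)   # weaker system
        s_cepstral = 2.0 * labels + rng.normal(0, 1.0, n)  # stronger system
        scores = np.column_stack([s_formant, s_cepstral])

        # Average fusion: equal weight on both score streams.
        fused_avg = scores.mean(axis=1)

        # Logistic-regression fusion: learn per-system weights and an offset
        # on training trials; the linear part is a calibrated fused score.
        model = LogisticRegression().fit(scores, labels)
        fused_llr = scores @ model.coef_.ravel() + model.intercept_[0]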

    Forensic automatic speaker recognition: fiction or science?

    Proceedings of Interspeech 2008, Brisbane, Australia. Includes the PowerPoint presentation given at the conference. Hollywood films and CSI-like shows depict a technology landscape far from reality, both in forensic speaker recognition and in other identification-of-the-source forensic areas. Lay people are used to good-looking scientist-investigators performing voice identifications ("we got a match!") or smart, fancy devices producing voice transformations that let one actor instantaneously talk with the voice of another. Simultaneously, forensic identification science is facing a global challenge, impelled firstly by progressively higher requirements for admissibility of expert testimony in court, and secondly by the transparent and testable nature of DNA typing, now seen as the new gold-standard model of a scientifically defensible approach to be emulated by all other identification-of-the-source areas. In this presentation we show how forensic speaker recognition can comply with the requirements of transparency and testability in forensic science. This will lead to fulfilling the court requirements about role separation between scientists and judges or juries, and bring about integration in a forensically adequate framework in which the scientist provides the appropriate information necessary to the court's decision processes.

    Information-theoretical comparison of evidence evaluation methods for score-based biometric systems

    Paper presented at the Seventh International Conference on Forensic Inference and Statistics, The University of Lausanne, Switzerland, August 2008. Biometric systems are a powerful tool in many forensic disciplines for helping scientists evaluate the weight of the evidence. However, rising requirements of admissibility in forensic science demand scientific methods for testing the accuracy of the forensic evidence evaluation process. In this work we analyze and compare several evidence evaluation methods for score-based biometric systems. For all of them, the score given by the system is transformed into a likelihood ratio (LR) which expresses the weight of the evidence. The accuracy of each LR computation method is assessed by classical Tippett plots. We also propose measuring accuracy in terms of the average information given by the evidence evaluation process, by means of Empirical Cross-Entropy (ECE) plots. Preliminary results are presented using a voice biometric system and the NIST SRE 2006 experimental protocol.
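
    As one concrete instance of a score-to-LR transformation (the work compares several; this Gaussian density-ratio method is only an assumed illustration):

        import numpy as np
        from scipy.stats import norm

        def score_to_lr(score, ss_scores, ds_scores):
            """Turn a system score into an LR by modelling the same-source
            and different-source score distributions as Gaussians."""
            p_ss = norm.pdf(score, np.mean(ss_scores), np.std(ss_scores))
            p_ds = norm.pdf(score, np.mean(ds_scores), np.std(ds_scores))
            return p_ss / p_ds

        # Synthetic training scores for illustration:
        rng = np.random.default_rng(0)
        ss = rng.normal(2.0, 1.0, 500)   # same-source scores
        ds = rng.normal(-1.0, 1.0, 500)  # different-source scores
        print(score_to_lr(1.5, ss, ds))  # LR > 1 favours same-source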

    Gaussian Mixture Models of Between-Source Variation for Likelihood Ratio Computation from Multivariate Data

    Franco-Pedroso J, Ramos D, Gonzalez-Rodriguez J (2016) Gaussian Mixture Models of Between-Source Variation for Likelihood Ratio Computation from Multivariate Data. PLoS ONE 11(2): e0149958. doi:10.1371/journal.pone.0149958. In forensic science, trace evidence found at a crime scene and on a suspect has to be evaluated from the measurements performed on it, usually in the form of multivariate data (for example, several chemical compounds or physical characteristics). In order to assess the strength of that evidence, the likelihood ratio framework is being increasingly adopted. Several methods have been derived to obtain likelihood ratios directly from univariate or multivariate data by modelling both the variation appearing between observations (or features) coming from the same source (within-source variation) and that appearing between observations coming from different sources (between-source variation). In the widely used multivariate kernel likelihood ratio, the within-source distribution is assumed to be normally distributed and constant across sources, and the between-source variation is modelled through a kernel density function (KDF). In order to better fit the observed distribution of the between-source variation, this paper presents a different approach in which a Gaussian mixture model (GMM) is used instead of a KDF. As will be shown, this approach provides better-calibrated likelihood ratios, as measured by the log-likelihood-ratio cost (Cllr), in experiments performed on freely available forensic datasets involving different types of trace evidence: inks, glass fragments and car paints. JFP received funding from the Ministerio de Economia y Competitividad (http://www.mineco.gob.es/) through the project "CMC-V2: Caracterizacion, Modelado y Compensacion de Variabilidad en la Senal de Voz", grant number TEC2012-37585-C02-01. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
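
    A simplified sketch of such a two-level likelihood ratio follows, with a GMM between-source model and a shared Gaussian within-source covariance; it uses the closed-form Gaussian marginals that arise under those assumptions, and the data, dimensions and within-source covariance W are illustrative assumptions rather than the paper's experimental setup.

        import numpy as np
        from scipy.stats import multivariate_normal as mvn
        from sklearn.mixture import GaussianMixture

        def gmm_lr(x, y, gmm, W):
            """LR that x and y share a common (unknown) source mean.
            Between-source: GMM (weights w, means m, covariances C).
            Within-source: Gaussian with constant covariance W."""
            num, px, py = 0.0, 0.0, 0.0
            for w, m, C in zip(gmm.weights_, gmm.means_, gmm.covariances_):
                # Marginals: integrating out the source mean gives cov C + W.
                px += w * mvn.pdf(x, m, C + W)
                py += w * mvn.pdf(y, m, C + W)
                # Same-source joint of (x, y) per component: jointly Gaussian
                # with covariance [[C+W, C], [C, C+W]].
                num += w * mvn.pdf(np.concatenate([x, y]),
                                   np.concatenate([m, m]),
                                   np.block([[C + W, C], [C, C + W]]))
            return num / (px * py)

        # Synthetic 2-D example: fit the between-source GMM on source means.
        rng = np.random.default_rng(0)
        gmm = GaussianMixture(n_components=3, random_state=0).fit(
            rng.normal(0, 3, (200, 2)))
        W = 0.5 * np.eye(2)  # assumed within-source covariance
        print(gmm_lr(rng.normal(0, 1, 2), rng.normal(0, 1, 2), gmm, W))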

    Von Mises-Fisher models in the total variability subspace for language recognition

    I. Lopez-Moreno, D. Ramos, J. Gonzalez-Dominguez, and J. Gonzalez-Rodriguez, "Von Mises-Fisher models in the total variability subspace for language recognition", IEEE Signal Processing Letters, vol. 18, no. 12, pp. 705-708, October 2011. This letter proposes a new modeling approach for the total variability subspace within a language recognition task. Motivated by previous work in directional statistics, von Mises-Fisher distributions are used to assign language-conditioned probabilities to language data, assumed to be spherically distributed in this subspace. The two proposed methods use kernel density functions or finite mixture models of such distributions. Experiments conducted on NIST LRE 2009 show that the proposed techniques significantly outperform the baseline cosine distance approach in most of the considered experimental conditions, including different speech conditions, durations and the presence of unseen languages. This work was supported by the Ministerio de Ciencia e Innovación under FPI Grant TEC2009-14719-C02-01 and the Cátedra UAM-Telefónica.
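
    A minimal sketch of the von Mises-Fisher log-density on length-normalized total-variability vectors, the building block of both proposed methods, follows; the dimensionality and concentration values are illustrative assumptions.

        import numpy as np
        from scipy.special import ive  # exponentially scaled Bessel I

        def vmf_logpdf(x, mu, kappa):
            """Log-density of a von Mises-Fisher distribution on the unit
            sphere; x and mu are unit-norm, kappa is the concentration."""
            d = len(x)
            v = d / 2.0 - 1.0
            # Stable log I_v(kappa): I_v(k) = ive(v, k) * exp(k).
            log_bessel = np.log(ive(v, kappa)) + kappa
            log_norm = (v * np.log(kappa)
                        - (d / 2.0) * np.log(2 * np.pi) - log_bessel)
            return log_norm + kappa * float(mu @ x)

        # Synthetic length-normalized total-variability vectors:
        rng = np.random.default_rng(0)
        x = rng.normal(size=400); x /= np.linalg.norm(x)
        mu = rng.normal(size=400); mu /= np.linalg.norm(mu)
        print(vmf_logpdf(x, mu, kappa=50.0))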

    Likelihood ratio calibration in a transparent and testable forensic speaker recognition framework

    D. Ramos, J. González-Rodríguez, and J. Ortega-García, "Likelihood Ratio Calibration in a Transparent and Testable Forensic Speaker Recognition Framework", in The Speaker and Language Recognition Workshop (Odyssey), San Juan, Puerto Rico, 2006, pp. 1-8. A recently reopened debate about the infallibility of some classical forensic disciplines is leading to new requirements in forensic science. Standardization of procedures, proficiency testing, transparency in the scientific evaluation of the evidence and testability of the system and protocols are emphasized in order to guarantee the scientific objectivity of the procedures. These ideas are exploited in this paper in order to move towards an appropriate framework for the use of forensic speaker recognition in courts. Evidence is interpreted using the Bayesian approach, as a scientific and logical methodology, in a two-stage approach based on the similarity-typicality pair, which facilitates transparency in the process. The concept of calibration as a way of reporting reliable and accurate opinions is also addressed in depth, with experimental results illustrating its effects. The testability of the system is then accomplished by the use of the NIST SRE 2005 evaluation protocol. Recently proposed application-independent evaluation techniques (Cllr and APE curves) are finally addressed as a proper way of presenting results of proficiency testing in courts, as these evaluation metrics clearly show the influence of calibration errors on the accuracy of the inferential decision process. This work has been supported by the Spanish Ministry for Science and Technology under project TIC2003-09068-C02-01.
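
    For reference, the application-independent Cllr metric cited above can be computed from target and non-target likelihood ratios as sketched below (array values are illustrative):

        import numpy as np

        def cllr(lr_target, lr_nontarget):
            """Log-likelihood-ratio cost in bits. Well-calibrated, informative
            LRs give Cllr well below 1; LR = 1 everywhere gives exactly 1."""
            c_tar = np.mean(np.log2(1.0 + 1.0 / np.asarray(lr_target, float)))
            c_non = np.mean(np.log2(1.0 + np.asarray(lr_nontarget, float)))
            return 0.5 * (c_tar + c_non)

        print(cllr([10.0, 5.0, 0.8], [0.1, 0.3, 2.0]))
        print(cllr([1.0, 1.0], [1.0, 1.0]))  # uninformative system -> 1.0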

    ATVS-UAM NIST LRE 2009 System Description

    ATVS-UAM submits a fast, light and efficient single system. The use of a task-adapted, non-speech-recognition-based VAD (apart from NIST conversation labels) and gender-dependent total variability compensation technology allows the submitted system to obtain excellent development results on SRE08 data with exceptional computational efficiency. In order to test the influence of VAD on the evaluation results, a contrastive, otherwise equivalent system has been submitted in which the ATVS VAD labels are replaced with the publicly contributed BUT ones. In all contributed systems, two gender-independent calibrations have been trained, with telephone-only and mic (either mic-tel, tel-mic or mic-mic) data respectively. The submitted systems have been designed for English speech in an application-independent way, all results being interpretable as calibrated likelihood ratios to be properly evaluated with Cllr. Sample development results with English SRE08 data are 0.53% (male) and 1.11% (female) EER on tel-tel data (optimistic, as all English speakers in SRE08 are included in the total variability matrices), rising to 3.5% (tel-tel) and 5.1% (tel-mic) EER in pessimistic cross-validation experiments (25% of test speakers totally excluded from development data in each cross-validation set). The submitted system is extremely light in computational resources, running 77 times faster than real time. Moreover, once VAD and feature extraction are performed (the heaviest components of the system), training and testing run at 5300 and 2950 times faster than real time respectively.

    dOTM: a mechanism for distributing centralized multi-party video conferencing in the cloud

    One of the key factors for an application to take advantage of cloud computing is the ability to scale in an efficient, fast and reliable way. In centralized multi-party video conferencing, dynamically scaling a running conversation is a complex problem. In this paper we propose a methodology to divide the Multipoint Control Unit (the video conferencing server) into simpler units, broadcasters. Each broadcaster receives the media from one participant, processes it and forwards it to the rest. These broadcasters can be distributed among a group of CPUs. Using this methodology, video conferencing systems can scale in a more granular way, improving deployment efficiency.
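
    A minimal sketch of the broadcaster decomposition described above: one broadcaster per participant, placed on a CPU and fanning the processed stream out to the other participants. The names and the round-robin placement policy are illustrative assumptions, not details from the paper.

        from dataclasses import dataclass, field

        @dataclass
        class Broadcaster:
            """Handles one participant's media: receive, process, fan out."""
            participant: str
            host: str                      # CPU/host the broadcaster runs on
            subscribers: list = field(default_factory=list)

            def forward(self, packet):
                processed = packet  # placeholder for per-stream processing
                return {sub: processed for sub in self.subscribers}

        def build_conference(participants, hosts):
            """One broadcaster per participant, round-robin across hosts."""
            conference = {}
            for i, p in enumerate(participants):
                b = Broadcaster(p, hosts[i % len(hosts)])
                b.subscribers = [q for q in participants if q != p]
                conference[p] = b
            return conference

        conf = build_conference(["alice", "bob", "carol"], ["cpu0", "cpu1"])
        print(conf["alice"].forward(b"frame"))  # fan-out to bob and carol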