5 research outputs found

    Toward Fail-Safe Speaker Recognition: Trial-Based Calibration with a Reject Option

    Get PDF
    The output scores of most of the speaker recognition systems are not directly interpretable as stand-alone values. For this reason, a calibration step is usually performed on the scores to convert them into proper likelihood ratios, which have a clear probabilistic interpretation. The standard calibration approach transforms the system scores using a linear function trained using data selected to closely match the evaluation conditions. This selection, though, is not feasible when the evaluation conditions are unknown. In previous work, we proposed a calibration approach for this scenario called trial-based calibration (TBC). TBC trains a separate calibration model for each test trial using data that is dynamically selected from a candidate training set to match the conditions of the trial. In this work, we extend the TBC method, proposing: 1) a new similarity metric for selecting training data that result in significant gains over the one proposed in the original work; 2) a new option that enables the system to reject a trial when not enough matched data are available for training the calibration model; and 3) the use of regularization to improve the robustness of the calibration models trained for each trial. We test the proposed algorithms on a development set composed of several conditions and on the Federal Bureau of Investigation multi-condition speaker recognition dataset, and we demonstrate that the proposed approach reduces calibration loss to values close to 0 for most of the conditions when matched calibration data are available for selection, and that it can reject most of the trials for which relevant calibration data are unavailable.Fil: Ferrer, Luciana. Consejo Nacional de Investigaciones CientĂ­ficas y TĂ©cnicas. Oficina de CoordinaciĂłn Administrativa Ciudad Universitaria. Instituto de InvestigaciĂłn en Ciencias de la ComputaciĂłn. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de InvestigaciĂłn en Ciencias de la ComputaciĂłn; ArgentinaFil: Nandwana, Mahesh Kumar. No especifĂ­ca;Fil: McLaren, Mitchell. No especifĂ­ca;Fil: Castan, Diego. No especifĂ­ca;Fil: Lawson, Aaron. No especifĂ­ca

    A Speaker Verification Backend with Robust Performance across Conditions

    Full text link
    In this paper, we address the problem of speaker verification in conditions unseen or unknown during development. A standard method for speaker verification consists of extracting speaker embeddings with a deep neural network and processing them through a backend composed of probabilistic linear discriminant analysis (PLDA) and global logistic regression score calibration. This method is known to result in systems that work poorly on conditions different from those used to train the calibration model. We propose to modify the standard backend, introducing an adaptive calibrator that uses duration and other automatically extracted side-information to adapt to the conditions of the inputs. The backend is trained discriminatively to optimize binary cross-entropy. When trained on a number of diverse datasets that are labeled only with respect to speaker, the proposed backend consistently and, in some cases, dramatically improves calibration, compared to the standard PLDA approach, on a number of held-out datasets, some of which are markedly different from the training data. Discrimination performance is also consistently improved. We show that joint training of the PLDA and the adaptive calibrator is essential -- the same benefits cannot be achieved when freezing PLDA and fine-tuning the calibrator. To our knowledge, the results in this paper are the first evidence in the literature that it is possible to develop a speaker verification system with robust out-of-the-box performance on a large variety of conditions

    L’effet de la familiarité sur l’identification des locuteurs : pour un perfectionnement de la parade vocale

    Full text link
    La présente étude porte sur les effets de la familiarité dans l’identification d’individus en situation de parade vocale. La parade vocale est une technique inspirée d’une procédure paralégale d’identification visuelle d’individus. Elle consiste en la présentation de plusieurs voix avec des aspects acoustiques similaires définis selon des critères reconnus dans la littérature. L’objectif principal de la présente étude était de déterminer si la familiarité d’une voix dans une parade vocale peut donner un haut taux d’identification correcte (> 99 %) de locuteurs. Cette étude est la première à quantifier le critère de familiarité entre l’identificateur et une personne associée à « une voix-cible » selon quatre paramètres liés aux contacts (communications) entre les individus, soit la récence du contact (à quand remonte la dernière rencontre avec l’individu), la durée et la fréquence moyenne du contact et la période pendant laquelle avaient lieu les contacts. Trois différentes parades vocales ont été élaborées, chacune contenant 10 voix d’hommes incluant une voix-cible pouvant être très familière; ce degré de familiarité a été établi selon un questionnaire. Les participants (identificateurs, n = 44) ont été sélectionnés selon leur niveau de familiarité avec la voix-cible. Toutes les voix étaient celles de locuteurs natifs du franco-québécois et toutes avaient des fréquences fondamentales moyennes similaires à la voix-cible (à un semi-ton près). Aussi, chaque parade vocale contenait des énoncés variant en longueur selon un nombre donné de syllabes (1, 4, 10, 18 syll.). Les résultats démontrent qu’en contrôlant le degré de familiarité et avec un énoncé de 4 syllabes ou plus, on obtient un taux d’identification avec une probabilité exacte d’erreur de p 99 %). Our study is the first to quantify the familiarity criterion linking an identifier and a « target voice ». The quantification was based on four parameters bearing on the degree of contact between individuals: recency, frenquency, duration, and the period during which the contact occurred. Three different voice line-ups were elaborated, each containing 10 voices, including one target voice which was well known by the identifier according to a questionnaire that served to quantify familiarity. Participants (identifiers, n = 44) were selected on the basis of their familiarity with the target voice. The speakers used in the voice line-ups were native speakers of Quebec French and all presented voices had similar fundamental frequencies (to within one semitone). In each line-up we used utterances of 4 different lengths (1, 4, 10, and 18 syll.). The results show that by controlling the familiarity criterion, a correct identification rate of a 100 % is obtained with an exact error probability of p < 1 x 10-12.These rates are superior to current automatic systems of voice identification
    corecore