Effects of Waveform PMF on Anti-Spoofing Detection
In the context of detecting identity impersonation in speaker recognition, we observed that the waveform probability mass function (PMF) of genuine speech differs significantly from the PMF of identity-theft extracts. This holds for synthesized and converted speech as well as for replayed speech. In this work, we first ask whether this observation has a significant impact on spoofing detection performance. In a second step, we aim to reduce the gap between the waveform distributions of genuine and spoofed speech. We propose a genuinization of the spoofed speech (by analogy with Gaussianization), i.e., transforming spoofed speech so that its PMF is close to the PMF of genuine speech. Our genuinization is evaluated on the ASVspoof 2019 challenge datasets, using the baseline system provided by the challenge organizers. With constant Q cepstral coefficient (CQCC) features, genuinization degrades the baseline system performance by a factor of 10, which shows the potentially large impact of the waveform distribution on spoofing detection performance. However, by experimenting with all configurations, we also observed different behaviors, including performance improvements in specific cases. This leads us to conclude that the waveform distribution plays an important role and must be taken into account by anti-spoofing systems.
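The abstract does not detail the genuinization procedure; as a rough illustration of the general idea of mapping the sample distribution of spoofed speech onto the empirical PMF of genuine speech, classic histogram matching over quantized amplitudes can be sketched as below. Function and parameter names (`genuinize`, `n_bins`) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def genuinize(spoof, genuine, n_bins=65536):
    """Map the sample distribution (PMF) of a spoofed waveform onto the
    empirical PMF of genuine speech via histogram matching.
    Illustrative sketch only; not the authors' exact procedure."""
    # Quantize both signals onto a common grid of amplitude bins in [-1, 1].
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    spoof_idx = np.clip(np.digitize(spoof, edges) - 1, 0, n_bins - 1)

    # Empirical CDFs of the quantized amplitudes.
    spoof_cdf = np.cumsum(np.bincount(spoof_idx, minlength=n_bins)) / len(spoof)
    gen_hist, _ = np.histogram(genuine, bins=edges)
    gen_cdf = np.cumsum(gen_hist) / gen_hist.sum()

    # For each spoofed sample, pick the genuine amplitude with the same CDF value.
    centers = 0.5 * (edges[:-1] + edges[1:])
    target_bin = np.searchsorted(gen_cdf, spoof_cdf[spoof_idx], side="left")
    return centers[np.clip(target_bin, 0, n_bins - 1)]
```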
Constrained discriminative speaker verification specific to normalized i-vectors
This paper focuses on discriminative training (DT) applied to i-vectors after Gaussian probabilistic linear discriminant analysis (PLDA). While DT has been used successfully with non-normalized vectors, it struggles to improve speaker detection when the i-vectors have first been normalized, even though normalization has proven to yield the best performance in speaker verification. We propose an additional normalization procedure which limits the number of coefficients to train discriminatively, with a minimal loss of accuracy. Adaptations of logistic-regression-based DT to this new configuration are proposed, and we then introduce a discriminative classifier for speaker verification, which is a novelty in the field.
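For readers unfamiliar with this family of methods, a minimal sketch of logistic-regression-based DT on length-normalized i-vector pairs is given below. The quadratic pair expansion and the helper names (`pair_features`, `train_dt`) are common illustrative choices, not the paper's exact parametrization or its coefficient-limiting normalization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def length_normalize(x):
    """Project i-vectors onto the unit hypersphere (length normalization)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def pair_features(e, t):
    """Quadratic expansion of an (enrolment, test) i-vector pair, as commonly
    used for PLDA-like discriminative training (illustrative only)."""
    return np.concatenate([e * t, e**2 + t**2, e + t, [1.0]])

def train_dt(X_enrol, X_test, y, C=1.0):
    """Train a logistic-regression scorer on target (y=1) / non-target (y=0) trials.
    X_enrol, X_test: (n_trials, dim) length-normalized i-vectors."""
    F = np.stack([pair_features(e, t) for e, t in zip(X_enrol, X_test)])
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(F, y)
    return clf
```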
Typicality extraction in a Speaker Binary Keys model
In the field of speaker recognition, the recently proposed notion of "Speaker Binary Key" provides a representation of each acoustic frame in a discriminant binary space. This approach relies on a unique acoustic model composed of a large set of speaker-specific local likelihood peaks (called specificities). The model provides a spatial coverage in which each frame is characterized in terms of its neighborhood. The most frequent specificities, picked up to represent the whole utterance, generate a binary key vector. The flexibility of this modeling makes it possible to capture non-parametric behaviors. In this paper, we introduce a concept of "typicality" between binary keys, with a discriminant goal. We describe an algorithm able to extract such typicalities, which involves a singular value decomposition in a binary space. The theoretical aspects of this decomposition as well as its potential in terms of future developments are presented. All the propositions are also experimentally validated on the NIST SRE 2008 framework.
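A minimal sketch of the binary-key construction described above (each frame votes for its closest specificities, and the most frequently selected ones define the utterance key) could look as follows. The parameter names and default values are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def binary_key(frame_scores, top_k_per_frame=5, n_active=1000):
    """Build an utterance-level binary key from per-frame specificity scores.

    frame_scores: (n_frames, n_specificities) likelihood (or similarity) of each
    frame against each specificity of the shared acoustic model."""
    n_frames, n_spec = frame_scores.shape
    counts = np.zeros(n_spec, dtype=int)
    # Each frame "votes" for its top-k closest specificities.
    for scores in frame_scores:
        counts[np.argsort(scores)[-top_k_per_frame:]] += 1
    # The most frequently selected specificities define the binary key.
    key = np.zeros(n_spec, dtype=np.uint8)
    key[np.argsort(counts)[-n_active:]] = 1
    return key
```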
Exploring some limits of Gaussian PLDA modeling for i-vector distributions
Gaussian PLDA (G-PLDA) modeling for i-vector based speaker verification has proven competitive with heavy-tailed PLDA (HT-PLDA) based on Student's t-distribution, while the latter is much more computationally expensive. However, these results are obtained with length normalization, which projects i-vectors onto the non-linear and finite surface of a hypersphere. This paper investigates the limits of linear, Gaussian G-PLDA modeling when the data distribution is spherical. In particular, the homoscedasticity assumption is questionable: the model assumes that within-speaker variability can be estimated by a single, linear parameter. A non-probabilistic approach, competitive with the state of the art, is proposed, which reveals some limits of Gaussian modeling in terms of goodness of fit. We carry out a residual analysis which reveals a relation between the dispersion of a speaker class and its location and thus shows that the homoscedasticity assumption is not fulfilled.
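A simple way to probe the relation between class dispersion and class location on length-normalized i-vectors is sketched below. The choice of statistics (trace of the within-class covariance, norm of the class mean) and the correlation-based summary are assumptions for illustration, not the paper's exact residual analysis.

```python
import numpy as np

def dispersion_vs_location(ivectors, speaker_ids):
    """Relate each speaker class's within-class dispersion to its location
    after length normalization. Under homoscedasticity, dispersion should
    not depend on location."""
    x = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
    locations, dispersions = [], []
    for spk in np.unique(speaker_ids):
        xs = x[speaker_ids == spk]
        if len(xs) < 2:
            continue  # need at least two utterances to estimate dispersion
        mu = xs.mean(axis=0)
        locations.append(np.linalg.norm(mu))        # distance of the class mean from the origin
        dispersions.append(np.trace(np.cov(xs.T)))  # within-class scatter
    return np.corrcoef(locations, dispersions)[0, 1]
```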
Introduction to the Special Issue “Speaker and Language Characterization and Recognition: Voice Modeling, Conversion, Synthesis and Ethical Aspects”
Welcome to this special issue on Speaker and Language Characterization, which features, among other contributions, some of the most remarkable ideas presented and discussed at Odyssey 2018: the Speaker and Language Recognition Workshop, held in Les Sables d'Olonne, France, in June 2018. This issue continues the series proposed by the ISCA Speaker and Language Characterization Special Interest Group in coordination with the ISCA Speaker Odyssey workshops [1, 2, 3]. Voice is one of the most casual modalities for natural and intuitive interaction between humans as well as between humans and machines. Voice is also a central part of our identity. Voice-based solutions are currently deployed in a growing variety of applications, including person authentication through automatic speaker verification (ASV). A related technology concerns the digital cloning of personal voice characteristics for text-to-speech (TTS) and voice conversion (VC). In recent years, the impressive advances in the VC/TTS field have opened the way for numerous new consumer applications. In particular, VC offers new solutions for privacy protection. However, VC/TTS also brings the possibility of misusing the technology to spoof ASV systems (for example, presentation attacks implemented using voice conversion). As a direct consequence, spoofing countermeasures have raised growing interest over the past years. Moreover, voice is a central part of our identity and is also bringing othe
Characterization of the Pathological Voices (Dysphonia) in the frequency space
This paper deals with dysphonic voice assessment. It aims at studying the characteristics of dysphonia in the frequency domain. In this context, a GMM-based automatic classification system is coupled with a frequency subband architecture in order to investigate which frequency bands are relevant for dysphonia characterization. Across various experiments, the low frequencies ([0-3000] Hz) tend to be more informative for dysphonia discrimination than the higher frequencies.
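A minimal sketch of such a subband GMM pipeline (band-limit the signal, extract cepstral features, score against per-grade GMMs) is given below. The filter order, number of MFCCs and classification-by-average-log-likelihood scheme are generic assumptions, not the paper's exact configuration.

```python
import numpy as np
from scipy.signal import butter, sosfilt
import librosa
from sklearn.mixture import GaussianMixture

def subband_mfcc(wav, sr, f_hi=3000, n_mfcc=20):
    """Restrict the signal to the [0, f_hi] Hz subband, then extract MFCCs."""
    sos = butter(8, f_hi, btype="low", fs=sr, output="sos")
    band = sosfilt(sos, wav)
    return librosa.feature.mfcc(y=band, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)

def classify(feats, gmms):
    """One GaussianMixture per dysphonia grade; pick the grade whose GMM gives
    the highest average frame log-likelihood."""
    scores = {grade: g.score(feats) for grade, g in gmms.items()}
    return max(scores, key=scores.get)
```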
Analyse Phonétique dans le Domaine Fréquentiel pour la Classification des Voix Dysphoniques
Concerned with pathological voice assessment, this paper aims at characterizing dysphonia in the frequency domain for a better understanding of the related phenomena, whereas most studies have focused only on improving classification systems for diagnosis-support purposes. Building on a first study showing that the low frequencies ([0-3000] Hz) are more relevant for dysphonia discrimination than the higher frequencies, the authors propose to analyze the impact of the restricted frequency subband ([0-3000] Hz) on dysphonic voice discrimination from a phonetic point of view. To this end, the performance of the GMM-based automatic dysphonic voice classification system is measured according to different phoneme classes and frequency bands ([0-3000] and [0-8000] Hz).
Privacy attacks for automatic speech recognition acoustic models in a federated learning framework
This paper investigates methods to effectively retrieve speaker information from personalized, speaker-adapted neural network acoustic models (AMs) in automatic speech recognition (ASR). This problem is especially important in the context of federated learning of ASR acoustic models, where a global model is learnt on the server from the updates received from multiple clients. We propose an approach to analyze the information in neural network AMs based on a neural network footprint on a so-called Indicator dataset. Using this method, we develop two attack models that aim to infer speaker identity from the updated personalized models without access to the actual users' speech data. Experiments on the TED-LIUM 3 corpus demonstrate that the proposed approaches are very effective and can achieve an equal error rate (EER) of 1-2%.
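The abstract does not describe the attack models themselves; as a strongly simplified illustration of the underlying idea (comparing a personalized update against the global model to obtain a speaker similarity score), one could compare weight deltas of two personalized models with cosine similarity, as sketched below. This is an assumption-laden stand-in, not the paper's Indicator-dataset footprint method.

```python
import torch

def weight_delta(personalized, global_model):
    """Flatten the difference between a personalized AM and the global AM into
    a single vector, used here as a crude 'speaker fingerprint'."""
    return torch.cat([
        (p - g).flatten()
        for p, g in zip(personalized.parameters(), global_model.parameters())
    ])

def same_speaker_score(model_a, model_b, global_model):
    """Cosine similarity between two personalized updates: high similarity is
    taken as evidence that both updates come from the same speaker."""
    da = weight_delta(model_a, global_model)
    db = weight_delta(model_b, global_model)
    return torch.nn.functional.cosine_similarity(da, db, dim=0).item()
```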
Parole de locuteur (performance et confiance en identification biométrique vocale)
This thesis explores the biometric use of speech, which has many applications (security, smart environments, forensics, territorial surveillance, authentication of electronic transactions). Speech is subject to many constraints related to the speaker's origins (geographical, social and cultural) but also to his or her performative goals. The speaker can be regarded as one factor of variation in speech, among others. In this work, we provide some answers to two questions: Are all speech excerpts from a given speaker equally useful for recognizing him or her? How are the different sources of variation that directly or indirectly convey speaker specificity structured? We first build a protocol to assess the human ability to discriminate a speaker from a speech excerpt, using data from the NIST-HASR 2010 campaign. The task is difficult for our listeners, whether naive or more experienced. In this setting, we show that neither the (quasi-)unanimity of the listeners nor the self-assessment of their judgments guarantees confidence in the correctness of the submitted answer. We then quantify the influence of the choice of speech excerpt on the performance of automatic systems, using two databases, NIST and BREF, and two speaker recognition systems, ALIZE/SpkDet (LIA, UBM-GMM based) and Idento (SRI, i-vector based). Both systems show large performance differences, measured with a relative-variation rate around the mean EER, Vr (for NIST, VrIdento = 1.41 and VrALIZE/SpkDet = 1.47; for BREF, Vr = 3.11), depending on the training file chosen for each speaker. These very large performance variations show the sensitivity of automatic systems to the choice of speech excerpts, a sensitivity that must be measured and reduced to make speaker recognition systems more reliable. To explain the importance of the choice of speech excerpts, we look for the cues that best distinguish the speakers in our corpora by measuring the effect of the Speaker factor on the variance of various acoustic features (eta squared). F0 is strongly dependent on the Speaker factor, independently of the vowel. Some phonemes are more speaker-discriminant: nasal consonants, fricatives, nasal vowels, and mid-close to open oral vowels. This work is a first step towards a more precise study of what the speaker is, both for human perception and for automatic systems. While we have shown that there is indeed a cepstral difference leading to more or less effective models, it remains to understand how to link the speaker to speech production. Finally, following this work, we wish to explore in more detail the influence of language on speaker recognition. Even if our results indicate that, in American English and in French, the same categories of phonemes carry the most speaker information, this point remains to be confirmed and evaluated for other languages.
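The abstract quantifies the effect of the Speaker factor on feature variance with eta squared. A minimal sketch of that statistic, under the usual one-way ANOVA definition (the thesis' exact estimator may differ), is:

```python
import numpy as np

def eta_squared(values, speaker_ids):
    """One-way eta squared: share of the variance of an acoustic feature
    (e.g. F0) explained by the Speaker factor."""
    values = np.asarray(values, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()
    ss_between = sum(
        len(v) * (v.mean() - grand_mean) ** 2
        for v in (values[speaker_ids == s] for s in np.unique(speaker_ids))
    )
    return ss_between / ss_total
```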
Modélisation statistique et informations pertinentes pour la caractérisation des voix pathologiques (dysphonies)
This article addresses the importance of the type of information that is appropriate for an automatic classification task on voices produced by patients suffering from vocal dysfunction. Using a GMM classification system (derived from automatic speaker recognition), the focus was put on three main classes of information: information related to energy, information related to the voiced parts, and information depending on the phonetic segments. The experiments, carried out on a corpus of dysphonic voices, showed that this phonetic information is particularly interesting in this context, since it makes it possible to analyze the result as a function of the phoneme or phoneme class.
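As a rough sketch of how results can be broken down per phoneme class, frame-level GMM log-likelihood ratios can be aggregated per class using a phonetic alignment. The class labels and the two-model (normal vs. dysphonic) scoring below are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from collections import defaultdict

def scores_by_phoneme_class(feats, frame_labels, gmm_normal, gmm_dysphonic):
    """Aggregate frame-level log-likelihood ratios (dysphonic vs. normal GMM)
    per phoneme class, given a frame-level phonetic alignment.
    `frame_labels` maps each frame to a class such as 'nasal', 'fricative', ..."""
    llr = gmm_dysphonic.score_samples(feats) - gmm_normal.score_samples(feats)
    per_class = defaultdict(list)
    for score, label in zip(llr, frame_labels):
        per_class[label].append(score)
    return {c: float(np.mean(s)) for c, s in per_class.items()}
```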