7 research outputs found

    Intra- and inter-speaker variability of sibilant fricative /s/ in Argentine Spanish

    This paper analyses the discriminative power of the sibilant fricative /s/, with a view to incorporating this knowledge into future automatic speaker recognition systems. This fricative was selected because it is the most frequent consonant in the corpus. The acoustic parameters of /s/ were ranked by how well they discriminate speakers, favouring minimal intra-speaker variability and maximal inter-speaker variability. The evaluation material was drawn from the SpeechDat database, which contains fixed-line telephone speech in Argentine Spanish. The best-ranked parameters were intensity, the third formant (F3), the first formant (F1) and the first spectral moment, or Center of Gravity (CG). Used in isolation, /s/ yields a speaker recognition equal error rate (EER) 35% lower than the average over the 30 phonemes involved. This confirms its importance for speaker discrimination: it is the sixth most important phoneme, after the vowels /e/, /a/, /o/ and /i/ and the nasal /n/.
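
The ranking idea described in this abstract (low intra-speaker variability, high inter-speaker variability) can be illustrated with a simple variance ratio. The sketch below is an assumption, not the paper's actual scoring method, which the abstract does not specify; the function names and the toy measurements are hypothetical.

```python
from statistics import mean, pvariance


def f_ratio(values_by_speaker):
    """Variance of the speaker means divided by the mean within-speaker variance.

    A higher value means the parameter separates speakers better.
    """
    speaker_means = [mean(values) for values in values_by_speaker.values()]
    inter = pvariance(speaker_means)
    intra = mean([pvariance(values) for values in values_by_speaker.values()])
    return inter / intra


def rank_parameters(measurements):
    """Sort parameter names by descending F-ratio.

    measurements: {parameter_name: {speaker_id: [values]}}
    """
    scores = {name: f_ratio(by_speaker) for name, by_speaker in measurements.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: two speakers, three tokens of /s/ each.
toy = {
    "intensity": {"spk1": [60.0, 61.0, 59.0], "spk2": [70.0, 71.0, 69.0]},
    "duration": {"spk1": [0.10, 0.14, 0.08], "spk2": [0.11, 0.09, 0.13]},
}
ranking = rank_parameters(toy)
print(ranking)
```

In this toy data the two speakers differ clearly in intensity but overlap in duration, so intensity ranks first.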

    Localization and Selection of Speaker Specific Information with Statistical Modeling

    Statistical modeling of the speech signal has been widely used in speaker recognition. The performance obtained with this type of modeling is excellent in laboratory conditions but decreases dramatically for telephone or noisy speech. Moreover, it is difficult to know which information the system takes into account. To solve this problem and improve current systems, a better understanding of the nature of the information used by statistical methods is needed. This knowledge should make it possible to select only the relevant information or to add new sources of information. The first part of this paper presents experiments that aim at localizing the most useful acoustic events for speaker recognition. The relation between discriminant ability and the nature of the speech events is studied; in particular, the phonetic content, the signal stability and the frequency domain are explored. Finally, the potential of the dynamic information contained in the relation between a frame and its p neighbours is investigated. In the second part, the authors suggest a new selection procedure designed to select the pertinent features. Conventional feature selection techniques (ascendant selection, knockout) provide only global, a posteriori knowledge about the relevance of an information source. However, some speech clusters may be very effective for recognizing a particular speaker while being uninformative for another. Moreover, some information classes may be corrupted or even missing under particular recording conditions. This necessity fo

    Anàlisi de la variació intra i inter parlant en la fricativa alveolar [s] del castellà amb finalitat forense = Analysis of intra- and inter-speaker variation in the Castilian alveolar fricative [s] for forensic purposes

    This paper presents the results of an analysis of intra-speaker and inter-speaker variation in the Castilian alveolar fricative [s] for forensic purposes. The aim is a highly detailed characterization of this sound, given its importance in forensic phonetic tasks of speaker identification or verification. The analysis used two methods of comparison based on different acoustic parameters. On the one hand, we analysed various parameters traditionally considered in the study of fricatives: duration, zero crossings, frequency of maximum intensity, centre of gravity, skewness, kurtosis, and standard deviation. On the other hand, we focused on the LTAS (Long Term Average Spectrum) of the segment.
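
Several of the fricative parameters listed in this abstract (centre of gravity, standard deviation, skewness, kurtosis) are moments of the spectrum treated as a distribution over frequency. The sketch below shows one common way to compute them from a power spectrum; it is an illustrative assumption rather than the study's procedure, and the function name and the three-bin toy spectrum are hypothetical.

```python
import math


def spectral_moments(freqs, power):
    """Return (centre of gravity, standard deviation, skewness, excess kurtosis).

    freqs: bin frequencies in Hz; power: non-negative power per bin.
    The power values are normalised so they act as probability weights.
    """
    total = sum(power)
    weights = [p / total for p in power]
    cog = sum(f * w for f, w in zip(freqs, weights))
    variance = sum((f - cog) ** 2 * w for f, w in zip(freqs, weights))
    sd = math.sqrt(variance)
    skewness = sum((f - cog) ** 3 * w for f, w in zip(freqs, weights)) / sd ** 3
    # Excess kurtosis: 0 for a Gaussian-shaped spectrum.
    kurtosis = sum((f - cog) ** 4 * w for f, w in zip(freqs, weights)) / sd ** 4 - 3.0
    return cog, sd, skewness, kurtosis


# Hypothetical symmetric spectrum centred on 2000 Hz.
cog, sd, skewness, kurtosis = spectral_moments([1000.0, 2000.0, 3000.0], [1.0, 2.0, 1.0])
print(cog, sd, skewness, kurtosis)
```

Because the toy spectrum is symmetric, its centre of gravity sits at the middle bin and its skewness is zero. Real fricative spectra would come from a windowed FFT of the segment.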

    Effect of Utterance Duration and Phonetic Content on Speaker Identification Using Second-Order Statistical Methods

    Second-order statistical methods show very good results for automatic speaker identification in controlled recording conditions [2]. These approaches are generally used on the entire speech material available. In this paper, we study the influence of the content of the test speech material on the performance of such methods, i.e. under a more analytical approach [3]. The goal is to investigate what kind of information these methods use and where it is located in the speech signal. Liquids and glides together, vowels, and more particularly nasal vowels and nasal consonants, are found to be particularly speaker specific: test utterances of 1 second composed mostly of acoustic material from one of these classes provide better speaker identification results than phonetically balanced test utterances, even though the training is done, in both cases, with 15 seconds of phonetically balanced speech. Nevertheless, results with other phoneme classes are never dramatically poor. These results tend to show that the speaker-dependent information captured by long-term second-order statistics is consistently common to all phonetic classes, and that the homogeneity of the test material may improve the quality of the estimates.

    Parole de locuteur (performance et confiance en identification biométrique vocale) = Speaker speech (performance and confidence in voice biometric identification)

    This thesis explores the biometric use of speech, which has many applications (security, smart environments, forensics, territorial surveillance, and authentication of electronic transactions). Speech is subject to many constraints that depend on the speaker's origins (geographical, social and cultural) and on the speaker's performative goals. The speaker can be regarded as one factor of variation in speech among others. We present elements of an answer to two questions: Are all speech excerpts from the same speaker equivalent for recognizing that speaker? How are the different sources of variation that directly or indirectly convey speaker specificity structured?
    First, we build a protocol to assess the human ability to discriminate a speaker from a speech excerpt, using data from the NIST-HASR 2010 campaign. The task proves difficult for our listeners, whether naive or more experienced. In this setting, we show that neither the (near-)unanimity of the listeners nor their self-assessment of their judgments guarantees that the submitted answer is correct.
    Second, we quantify the influence of the choice of speech excerpt on the performance of automatic systems. We used two databases, NIST and BREF, and two automatic speaker recognition (RAL) systems: ALIZE/SpkDet (LIA, a UBM-GMM system) and Idento (SRI, an i-vector system). Both systems show large differences in performance depending on the training file chosen for each speaker, measured as a rate of variation around the mean EER, Vr (for NIST, Vr = 1.41 with Idento and Vr = 1.47 with ALIZE/SpkDet; for BREF, Vr = 3.11). These very large variations show how sensitive automatic systems are to the choice of speech excerpt, a sensitivity that must be measured and reduced to make the systems more reliable.
    To explain the importance of the choice of excerpt, we look for the cues most relevant for distinguishing the speakers in our corpora by measuring the effect of the Speaker factor on the variance of the cues (η²). F0 depends strongly on the Speaker factor, independently of the vowel. Some phonemes are more speaker-discriminative: nasal consonants, fricatives, nasal vowels, and mid-close to open oral vowels.
    This work is a first step towards a more precise study of what the speaker is, both for human perception and for automatic systems. Although we have shown that a cepstral difference leads to more or less effective models, it remains to be understood how to link the speaker to speech production. Finally, following this work, we wish to explore in more detail the influence of language on speaker recognition. Our results indicate that in American English and in French the same categories of phonemes carry the most information about the speaker, but this remains to be confirmed and evaluated for other languages.
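
The effect of a speaker factor on the variance of an acoustic cue, as described in this abstract, is commonly summarised by a one-way eta-squared: the share of total variance explained by speaker identity. The sketch below assumes that this is the kind of measure meant; the thesis's exact computation is not given here, and the function name and F0 values are hypothetical.

```python
def eta_squared(values_by_speaker):
    """One-way eta-squared: between-speaker sum of squares over total sum of squares."""
    all_values = [x for values in values_by_speaker.values() for x in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    ss_between = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in values_by_speaker.values()
    )
    return ss_between / ss_total


# Hypothetical F0 measurements in Hz for two speakers.
f0 = {"spk1": [100.0, 110.0], "spk2": [200.0, 210.0]}
effect = eta_squared(f0)
print(round(effect, 4))
```

A value near 1 means almost all of the variation in the cue is between speakers rather than within them, which is the pattern the abstract reports for F0.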