
    The use of long-term features for GMM- and i-vector-based speaker diarization systems

    Several factors contribute to the performance of speaker diarization systems. For instance, the appropriate selection of speech features is one of the key aspects that affect speaker diarization systems. Other factors include the techniques employed to perform both segmentation and clustering. While the static mel frequency cepstral coefficients are the most widely used features in speech-related tasks, including speaker diarization, several studies have shown the benefits of augmenting these static features with complementary ones. In this work, we propose and assess the use of voice-quality features (i.e., jitter, shimmer, and Glottal-to-Noise Excitation ratio) within the framework of speaker diarization. These acoustic attributes are employed together with state-of-the-art short-term cepstral and long-term prosodic features. Additionally, the use of delta dynamic features is explored separately for both the segmentation and bottom-up clustering sub-tasks. The combination of the different feature sets is carried out at several levels. At the feature level, the long-term speech features are stacked in the same feature vector. At the score level, the short- and long-term speech features are modeled independently and fused at the score likelihood level. Various feature combinations have been applied to both Gaussian mixture modeling and i-vector-based speaker diarization systems. The experiments have been carried out on the Augmented Multi-party Interaction (AMI) meeting corpus. The best result, in terms of diarization error rate, is obtained using i-vector-based cosine-distance clustering together with a signal parameterization that combines static cepstral coefficients, delta, voice-quality, and prosodic features. This best result represents a relative diarization error rate improvement of about 24% over the baseline system, which is based on Gaussian mixture modeling and short-term static cepstral coefficients.
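    The score-level fusion and cosine-distance clustering mentioned above can be illustrated with a minimal, hypothetical sketch. It is not the paper's implementation; it only shows, for pre-extracted i-vectors and per-stream log-likelihoods (assumed inputs), how cosine scoring and a weighted log-likelihood fusion of independently modeled short- and long-term feature streams are typically computed. The 400-dimensional i-vectors and the fusion weight are assumptions.

```python
import numpy as np

def cosine_score(w1, w2):
    """Cosine similarity between two i-vectors; higher scores suggest the same speaker."""
    return float(np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2)))

def fused_log_likelihood(ll_short, ll_long, alpha=0.9):
    """Score-level fusion: a weighted sum of the log-likelihoods produced by the
    independently modeled short-term (cepstral) and long-term (voice-quality/
    prosodic) feature streams. alpha is a hypothetical fusion weight."""
    return alpha * ll_short + (1.0 - alpha) * ll_long

# Toy usage with random 400-dimensional i-vectors (the dimension is an assumption).
rng = np.random.default_rng(0)
w_a, w_b = rng.standard_normal(400), rng.standard_normal(400)
print(cosine_score(w_a, w_b))
print(fused_log_likelihood(-1234.5, -987.6))
```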

    The correlation of gender and voice

    Aim of the study. The fundamental frequency of speech (f0), its minimum and maximum values (f0min, f0max), and the vowel formant frequencies (F1-F3) are acoustic properties that differ between male and female voices. Making a reliable judgment of a person's gender from their voice is not easy, but with the acoustic measures mentioned above it can be judged quite reliably. f0 is considered the best single acoustic cue for gender judgments, but F1-F3 and f0min-f0max have also been shown to be important. No Finnish research on the topic exists. The aim of this study was to gather data on the acoustic properties of voice (f0, f0min-f0max, F1-F3) and to investigate the correlation between these properties and perceived gender and voice femininity or masculinity. The study also examined whether the speakers' subjective evaluations and the listeners' evaluations differed from each other. Methods. Ten women, nine men, three male-to-female transgender people, and two female-to-male transgender people participated as speakers. They evaluated subjectively whether their own voice sounded male or female and how feminine or masculine it sounded on a visual analogue scale (VAS). They also provided prolonged vowel, read, and spontaneous speech samples that were analysed acoustically. The vowels /a/, /i/ and /u/ and two read sentences were compiled into a listening test taken by 25 listeners, who indicated on separate forms whether each voice sounded male or female and rated on a VAS scale how feminine or masculine it sounded. Results. The men in this study had the lowest and the women the highest values for the acoustic properties of voice (f0, f0min-f0max, F1-F3). All the acoustic properties (except F2 of /i/) showed statistically significant or highly significant, strong or very strong correlations with the gender and femininity-masculinity evaluations. When the fundamental frequency was 153–244 Hz, its minimum 68–137 Hz and its maximum 233–359 Hz, the voice was evaluated more often (over 50% of the evaluations) as female and feminine. Likewise, the higher the vowel formant frequencies, the more often the voice was evaluated as female and feminine. The listeners evaluated the read speech very differently from how the speakers evaluated their own speech. For the women's and men's voices there was almost 100% agreement on gender, but this was not the case for the transgender speakers. The listeners' evaluations of voice femininity or masculinity differed statistically significantly from the speakers' own evaluations most of the time (75%).
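    As a rough illustration of how the acoustic measures discussed here (f0, its range, and the lowest formant frequencies) can be obtained from a recording, the sketch below uses librosa's pYIN pitch tracker and a simple LPC root-finding estimate of F1-F3. It is a generic example, not the procedure used in the thesis; the file name, frame position, window length, and LPC order are assumptions.

```python
import numpy as np
import librosa

# Hypothetical sustained-vowel recording.
y, sr = librosa.load("vowel_a.wav", sr=16000)

# Fundamental frequency (f0) track with pYIN, plus its minimum and maximum over voiced frames.
f0, voiced_flag, _ = librosa.pyin(y, fmin=60, fmax=500, sr=sr)
f0_voiced = f0[voiced_flag]
print(f"f0 mean {np.nanmean(f0_voiced):.1f} Hz, "
      f"min {np.nanmin(f0_voiced):.1f} Hz, max {np.nanmax(f0_voiced):.1f} Hz")

# Crude F1-F3 estimate: LPC on a pre-emphasized, windowed frame from the middle of
# the vowel; the angles of the upper-half-plane poles serve as candidate formants.
frame = y[len(y) // 2 : len(y) // 2 + 1024] * np.hamming(1024)
frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])   # pre-emphasis
a = librosa.lpc(frame, order=2 + sr // 1000)                 # rule-of-thumb LPC order
roots = [r for r in np.roots(a) if np.imag(r) > 0]
freqs = sorted(np.angle(roots) * sr / (2 * np.pi))
print("Estimated F1-F3 (Hz):", [round(f) for f in freqs if f > 90][:3])
```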

    Influencia de los segmentos del discurso en la discriminación del locutor (Influence of speech segments on speaker discrimination)

    Authenticating people's identity is nowadays a crucial task, since a wide variety of systems require a reliable method either to determine or to confirm the identity of individuals. Among authentication methods, biometric recognition has received considerable attention in recent years, mainly for two reasons: the strong growth in demand for security applications, both commercial and military, and the rapid development of the technology that supports them. Its purpose is to determine a person's identity based on one or more physical or behavioural traits, elements that, unlike those used by other techniques, always accompany the individual. In this area, the use of the human voice as a trait has a set of characteristics that make it especially practical and turn it into the best choice, if not the only available one, for a wide range of applications. The general recognition process comprises two major stages: extracting the relevant information from the captured voice samples, and matching that information against previously stored information of the same kind; the latter usually relies on classification techniques inherited from the field of artificial intelligence. Given the current state of classification algorithms, it seems unlikely that biometric recognition systems can substantially improve their rates by improving the classifiers alone; it is therefore necessary to improve the quality of the information supplied to them. In this work, the author presents a new approach that improves speaker recognition rates through the selection of that information, and proposes a simple algorithm that performs this filtering. The results are applicable not only to the design of new systems, but are also useful for improving the performance of those already in operation.
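    The abstract does not describe the thesis's selection algorithm, so the sketch below is only a generic, hypothetical illustration of the underlying idea: filtering a recording down to its more informative segments before feature extraction, here by keeping the non-silent, higher-energy regions and computing MFCCs on them. The function name, threshold, and file name are assumptions, not the author's method.

```python
import numpy as np
import librosa

def select_informative_frames(path, top_db=25, n_mfcc=20):
    """Generic segment filtering before speaker modeling: keep only the
    non-silent (higher-energy) regions of the recording and extract MFCCs
    from them. This stands in for the thesis's (unspecified) selection step."""
    y, sr = librosa.load(path, sr=16000)
    # librosa.effects.split returns [start, end] sample intervals above the energy threshold.
    intervals = librosa.effects.split(y, top_db=top_db)
    kept = np.concatenate([y[s:e] for s, e in intervals]) if len(intervals) else y
    return librosa.feature.mfcc(y=kept, sr=sr, n_mfcc=n_mfcc)

# feats = select_informative_frames("speaker_sample.wav")  # hypothetical file
```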

    Effects of forensically-relevant facial concealment on acoustic and perceptual properties of consonants

    This thesis offers a thorough investigation into the effects of forensically-relevant facial concealment on speech acoustics and perception. Specifically, it explores the extent to which selected acoustic-phonetic and auditory-perceptual properties of consonants are affected when the talker is wearing ‘facewear’ while speaking. In this context, the term ‘facewear’ refers to the various types of face-concealing garments and headgear that are worn by people in common daily communication situations; for work and leisure, or as an expression of religious, social and cultural affiliation (e.g. surgical masks, motorcycle helmets, ski and cycling masks, or full-face veils such as the niqāb). It also denotes the face or head coverings that are typically used as deliberate (visual) disguises during the commission of crimes and in situations of public disorder (e.g. balaclavas, hooded sweatshirts, or scarves). The present research centres on the question: does facewear influence the way that consonants are produced, transmitted, and perceived? To examine the effects of facewear on the acoustic speech signal, various intensity, spectral, and temporal properties of spoken English consonants were measured. It was found that facewear can considerably alter the acoustic-phonetic characteristics of consonants. This was likely to be the result of both deliberate and involuntary changes to the talker’s speech productions, and of sound energy absorption by the facewear material. The perceptual consequences of the acoustic modifications to speech were assessed by way of a consonant identification study and a talker discrimination study. The results of these studies showed that auditory-only and auditory-visual consonant intelligibility, as well as the discrimination of unfamiliar talkers, may be greatly compromised when the observer’s judgements are based on ‘facewear speech’. The findings reported in this thesis contribute to our understanding of how auditory and visual information interact during natural speech processing. Furthermore, the results have important practical implications for legal cases in which speech produced through facewear is of pivotal importance. Forensic speech scientists are therefore advised to take the possible effects of facewear on speech into account when interpreting the outcome of their acoustic and auditory analyses of evidential speech recordings, and when evaluating the reliability of earwitness testimony
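    As a hedged illustration of the kinds of acoustic-phonetic measures named above (intensity, spectral, and temporal properties of a consonant), the sketch below computes segment duration, RMS level, and spectral centroid for a hand-labelled consonant interval. It is a generic example; the file name and segment boundaries are assumptions, and it does not reproduce the thesis's measurement protocol.

```python
import numpy as np
import librosa

# Hypothetical recording and hand-labelled consonant interval (seconds).
y, sr = librosa.load("facewear_token.wav", sr=16000)
start, end = 0.42, 0.55                                      # assumed boundaries of an /s/
seg = y[int(start * sr):int(end * sr)]

duration_ms = 1000 * (end - start)                           # temporal property
rms_db = 20 * np.log10(np.sqrt(np.mean(seg ** 2)) + 1e-12)   # intensity property (dBFS)
centroid_hz = float(np.mean(librosa.feature.spectral_centroid(y=seg, sr=sr)))  # spectral property

print(f"duration {duration_ms:.0f} ms, level {rms_db:.1f} dBFS, centroid {centroid_hz:.0f} Hz")
```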

    A speaker classification framework for non-intrusive user modeling: speech-based personalization of in-car services

    Speaker Classification, i.e. the automatic detection of certain characteristics of a person based on his or her voice, has a variety of applications in modern computer technology and artificial intelligence: as a non-intrusive source for user modeling, it can be employed for the personalization of human-machine interfaces in numerous domains. This dissertation presents a principled approach to the design of a novel Speaker Classification system for automatic age and gender recognition that meets these demands. Based on literature studies, methods and concepts dealing with the underlying pattern recognition task are developed. The final system consists of an incremental GMM-SVM supervector architecture with several optimizations. An extensive data-driven series of experiments explores the parameter space and serves as the evaluation of the component. Further experiments investigate the language-independence of the approach. As an essential part of this thesis, a framework is developed that implements all tasks associated with the design and evaluation of Speaker Classification in an integrated development environment able to generate efficient runtime modules for multiple platforms. Applications from the automotive field and other domains demonstrate the practical benefit of the technology for personalization, e.g. by increasing the lead time of local danger warnings for elderly drivers.
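    The GMM-SVM supervector architecture named above can be sketched roughly as follows: fit a universal background GMM on pooled acoustic features, represent each utterance by its stacked, MAP-adapted component means, and feed those supervectors to an SVM that predicts the age/gender class. The snippet below is a simplified sketch using scikit-learn, not the dissertation's configuration; the feature dimension, number of components, relevance factor, class labels, and toy data are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def gmm_supervector(ubm, frames, relevance=16.0):
    """Represent one utterance by its MAP-adapted GMM mean supervector.
    `frames` is an (n_frames, n_features) array of acoustic features."""
    resp = ubm.predict_proba(frames)                      # (n_frames, n_components)
    n_k = resp.sum(axis=0) + 1e-10                        # soft counts per component
    e_k = (resp.T @ frames) / n_k[:, None]                # posterior-weighted means
    alpha = (n_k / (n_k + relevance))[:, None]            # relevance-MAP weight
    adapted_means = alpha * e_k + (1.0 - alpha) * ubm.means_
    return adapted_means.ravel()                          # stacked mean supervector

# Toy data: random 20-dim "MFCC" frames for the background model and labelled utterances.
rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(rng.standard_normal((2000, 20)))                  # universal background model

utterances = [rng.standard_normal((300, 20)) for _ in range(10)]
labels = ["female_adult"] * 5 + ["male_senior"] * 5       # hypothetical age/gender classes

X = np.stack([gmm_supervector(ubm, u) for u in utterances])
clf = SVC(kernel="linear").fit(X, labels)                 # supervector SVM classifier
print(clf.predict(X[:2]))
```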