
    Deep learning for speaker characterization

    Speaker characterization is one of the most relevant tasks in many voice-related artificial intelligence applications. As these technologies mature and the amount of available data grows, it also becomes important to adapt them to different languages and contexts. In this project, a neural network is proposed to classify the gender, age and accent of a Catalan speaker from their voice. Different variations of the main model blocks are explored, including innovative techniques such as the Double Multi-Head Attention pooling mechanism. In addition, data analysis is carried out in pursuit of the best possible results, including the application of state-of-the-art voice data augmentation techniques. Some of the results are promising, showing a considerable improvement over classical methods not based on machine learning.
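
    Such a multi-task setup can be pictured as a shared utterance embedding feeding one linear head per attribute. The PyTorch sketch below is only an illustration of that idea; the embedding size and the class counts for age and accent are assumptions, not the project's actual configuration.

        import torch.nn as nn

        class SpeakerCharacterizer(nn.Module):
            # hypothetical multi-task head: gender, age group and accent
            # are all predicted from one shared speaker embedding
            def __init__(self, emb_dim=256, n_ages=5, n_accents=8):
                super().__init__()
                self.gender = nn.Linear(emb_dim, 2)
                self.age = nn.Linear(emb_dim, n_ages)
                self.accent = nn.Linear(emb_dim, n_accents)

            def forward(self, emb):
                # emb: (batch, emb_dim) pooled utterance-level embedding
                return self.gender(emb), self.age(emb), self.accent(emb)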

    Glottal Source Cepstrum Coefficients Applied to NIST SRE 2010

    This paper presents a novel feature set for speaker recognition based on glottal estimate information. An iterative algorithm is used to derive vocal tract and glottal source estimates from the speech signal. To test the importance of glottal source information in speaker characterization, the novel feature set has been evaluated in the 2010 NIST Speaker Recognition Evaluation (NIST SRE10). The proposed system uses glottal estimate parameter templates together with classical cepstral information to build a model for each speaker involved in the recognition process. The ALIZE [1] open-source software has been used to create the GMM models for both background and target speakers. Compared to using mel-frequency cepstrum coefficients (MFCC) alone, the misclassification rate on NIST SRE 2010 dropped from 29.43% to 27.15% when glottal source features were used.
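
    The recognition pipeline described here follows the classical GMM-UBM paradigm. A minimal sketch of that scoring scheme is shown below; the authors used the ALIZE toolkit, so scikit-learn is a stand-in assumption, as are the function names and mixture size.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        def train_gmm(features, n_components=64):
            # features: (n_frames, n_dims) vectors, e.g. MFCCs concatenated
            # with glottal-source parameters as the paper proposes
            gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
            gmm.fit(features)
            return gmm

        def llr_score(test_features, target_gmm, ubm):
            # average per-frame log-likelihood ratio: target model vs. background
            return float(target_gmm.score(test_features) - ubm.score(test_features))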

    Speaker characterization by means of attention pooling

    State-of-the-art deep learning systems for speaker verification are commonly based on speaker embedding extractors. These architectures are usually composed of a feature extractor front-end together with a pooling layer that encodes variable-length utterances into fixed-length speaker vectors. The authors have recently proposed the use of Double Multi-Head Self Attention pooling for speaker recognition, placed between a CNN-based front-end and a set of fully connected layers. This has proven to be an excellent approach for efficiently selecting the most relevant features the front-end captures from the speech signal. In this paper we show excellent experimental results obtained by adapting this architecture to other speaker characterization tasks, such as emotion recognition, sex classification and COVID-19 detection.
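
    The pooling mechanism can be sketched as two stacked attentions: a first one pools frames within each head, and a second one weights the resulting head contexts. The PyTorch code below is a rough approximation under assumed dimensions, not the authors' released implementation.

        import torch
        import torch.nn as nn

        class DoubleMHAPooling(nn.Module):
            def __init__(self, feat_dim=256, num_heads=8):
                super().__init__()
                assert feat_dim % num_heads == 0
                self.num_heads = num_heads
                self.head_dim = feat_dim // num_heads
                self.frame_score = nn.Linear(self.head_dim, 1, bias=False)  # attention over frames
                self.head_score = nn.Linear(self.head_dim, 1, bias=False)   # attention over heads

            def forward(self, x):
                # x: (batch, time, feat_dim) features from the CNN front-end
                b, t, _ = x.shape
                z = x.view(b, t, self.num_heads, self.head_dim)
                w = torch.softmax(self.frame_score(z), dim=1)   # weight each frame, per head
                ctx = (w * z).sum(dim=1)                        # (batch, heads, head_dim)
                u = torch.softmax(self.head_score(ctx), dim=1)  # weight each head context
                return (u * ctx).sum(dim=1)                     # fixed-length speaker vector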

    An analysis of the short utterance problem for speaker characterization

    Speaker characterization has always been conditioned by the length of the evaluated utterances. Despite performing well with large amounts of audio, systems suffer significant performance degradation when short utterances are considered. In this work we present an analysis of the short utterance problem from an alternative point of view. From our perspective, performance on short utterances is highly influenced by the phonetic similarity between enrollment and test utterances: both should contain similar phonemes to discriminate properly, and performance degrades otherwise. In this study we also interpret short utterances as incomplete long utterances in which some acoustic units are either unbalanced or missing altogether. These missing units make the speaker representations unreliable: they are biased with respect to the reference representations obtained from long utterances. These undesired shifts increase the intra-speaker variability, causing a significant loss of performance. According to our experiments, short utterances (3-60 s) can perform as accurately as long utterances simply by ensuring similar phonetic distributions. This analysis is determined by the current embedding extraction approach, based on the accumulation of local short-time information, and is thus applicable to most state-of-the-art embeddings, including traditional i-vectors and Deep Neural Network (DNN) x-vectors.
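
    The bias described above can be illustrated with a toy computation: when the embedding is an accumulation of frame statistics, dropping some acoustic units shifts the estimate away from the long-utterance reference. The snippet below uses purely synthetic data and is not the paper's experimental setup.

        import numpy as np

        rng = np.random.default_rng(0)
        # pretend each of 10 acoustic units has its own 32-dim frame distribution
        unit_means = rng.normal(size=(10, 32))
        frames = np.concatenate([m + 0.1 * rng.normal(size=(100, 32)) for m in unit_means])

        full_emb = frames.mean(axis=0)          # long, phonetically balanced utterance
        short_emb = frames[:300].mean(axis=0)   # short utterance: only 3 of 10 units seen

        print("embedding shift from missing units:",
              round(float(np.linalg.norm(full_emb - short_emb)), 2))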

    Glottal Biometric Features: Are Pathological Voice Studies Applicable to Voice Biometry?

    This paper introduces a methodology already used successfully in voice pathology detection, with a view to its adaptation to biometric speaker characterization. To that end, the behavior of the same GMM classifiers used in pathology detection is exploited. The work shows specific cases derived from running speech of the kind typically used in NIST contests, evaluated against a Universal Background Model built from a population of normophonic subjects under both specific and general evaluation paradigms. Results are contrasted against a set of impostors drawn from the same population of normophonic subjects. The relevance of the parameters used in the study is also discussed.
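
    Experiments that contrast genuine speakers against impostor sets, as here, are typically summarized with an equal error rate (EER). A minimal sketch of such a metric follows; it is generic illustration code, independent of the paper's actual scoring.

        import numpy as np

        def eer(target_scores, impostor_scores):
            # sweep thresholds; return the operating point where false
            # rejection of targets and false acceptance of impostors meet
            thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
            best = (np.inf, 1.0)
            for th in thresholds:
                frr = np.mean(target_scores < th)
                far = np.mean(impostor_scores >= th)
                best = min(best, (abs(frr - far), (frr + far) / 2.0))
            return best[1]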

    Review of Research on Speech Technology: Main Contributions From Spanish Research Groups

    In the last two decades there has been an important increase in research on speech technology in Spain, mainly due to a higher level of funding from European, Spanish and local institutions, and also due to a growing interest in these technologies for developing new services and applications. This paper provides a review of the main areas of speech technology addressed by research groups in Spain, their main contributions in recent years and their current focus of interest. The description is organized into five main areas: audio processing including speech, speaker characterization, speech and language processing, text-to-speech conversion, and spoken language applications. The paper also introduces the Spanish Network of Speech Technologies (RTTH, Red Temática en Tecnologías del Habla), the research network that includes almost all researchers working in this area, presenting some figures, its objectives and the main activities it has developed in recent years.

    A Hybrid Parameterization Technique for Speaker Identification

    Classical parameterization techniques for speaker identification encode the power spectral density of raw speech, without separating the articulatory features produced by vocal-tract dynamics (acoustic-phonetics) from glottal-source biometry. In this paper a study is conducted to separate voiced fragments of speech into vocal and glottal components, dominated respectively by the vocal tract transfer function, estimated adaptively to track the acoustic-phonetic sequence of the message, and by the glottal characteristics of the speaker and the phonation gesture. The separation methodology is based on Joint Process Estimation under the hypothesis that vocal and glottal spectral distributions are uncorrelated. Its application to voiced speech is presented in the time and frequency domains, and the parameterization methodology is also described. Speaker identification experiments conducted on 245 speakers are reported, comparing different parameterization strategies. The results confirm that decoupled parameterization performs better than approaches based on plain speech parameterization.
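
    The decoupling idea can be approximated, far more crudely than the paper's Joint Process Estimation, by plain LPC inverse filtering: an all-pole model captures the vocal-tract envelope, and the residual is dominated by the glottal source. The code below is only a conceptual stand-in for the authors' method.

        import numpy as np
        from scipy.signal import lfilter

        def lpc(frame, order=16):
            # autocorrelation method solved with the Levinson-Durbin recursion
            r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
            a, err = [1.0], r[0]
            for i in range(1, order + 1):
                acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
                k = -acc / err
                a = [1.0] + [a[j] + k * a[i - j] for j in range(1, i)] + [k]
                err *= 1.0 - k * k
            return np.array(a)

        def decouple(frame, order=16):
            a = lpc(frame, order)
            # the inverse filter A(z) flattens vocal-tract resonances; what
            # remains is dominated by the glottal source and phonation gesture
            residual = lfilter(a, [1.0], frame)
            return a, residual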

    Memory Time Span in LSTMs for Multi-Speaker Source Separation

    With deep learning approaches becoming state-of-the-art in many speech-related (as well as non-speech) machine learning tasks, efforts are being made to look inside neural networks, which are often treated as black boxes. In this paper we analyze how recurrent neural networks (RNNs) cope with temporal dependencies by determining the relevant memory time span in a long short-term memory (LSTM) cell. This is done by leaking the state variable with a controlled lifetime and evaluating the task performance. The technique can be used in any task to estimate the time span the LSTM exploits in that specific scenario. The focus of this paper is on the task of separating speakers from overlapping speech. We discern two effects: a long-term effect, probably due to speaker characterization, and a short-term effect, probably exploiting phone-sized formant tracks.
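
    The probing technique can be sketched by wrapping an LSTM cell and decaying its cell state at every step, so information older than a chosen lifetime is effectively erased before the task metric is re-evaluated. The exact leakage rule below is an assumption for illustration, not the paper's implementation.

        import torch
        import torch.nn as nn

        class LeakyLSTM(nn.Module):
            def __init__(self, in_dim, hid_dim, lifetime):
                super().__init__()
                self.cell = nn.LSTMCell(in_dim, hid_dim)
                # after `lifetime` steps the retained fraction falls to 1/e
                self.decay = float(torch.exp(torch.tensor(-1.0 / lifetime)))

            def forward(self, x):
                # x: (batch, time, in_dim) input feature sequence
                b, t, _ = x.shape
                h = x.new_zeros(b, self.cell.hidden_size)
                c = x.new_zeros(b, self.cell.hidden_size)
                outputs = []
                for step in range(t):
                    h, c = self.cell(x[:, step], (h, c))
                    c = c * self.decay  # controlled leakage bounds the memory span
                    outputs.append(h)
                return torch.stack(outputs, dim=1)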