
    A Framework for Enhancing Speaker Age and Gender Classification by Using a New Feature Set and Deep Neural Network Architectures

    Speaker age and gender classification is one of the most challenging problems in speech processing. With recent technological developments, identifying a speaker's age and gender has become a necessity for speaker verification and identification systems, with uses such as identifying suspects in criminal cases, improving human-machine interaction, and adapting music for people waiting in a queue. Although many studies have focused on feature extraction and classifier design, classification accuracies are still not satisfactory. The key issue in identifying a speaker's age and gender is to generate robust features and to design a deep classifier. Age and gender information is concealed in a speaker's speech, which is affected by many factors such as background noise, speech content, and phonetic divergence. In this work, different methods are proposed to enhance speaker age and gender classification based on deep neural networks (DNNs) used as both feature extractors and classifiers. First, a model for generating new features from a DNN is proposed. The proposed method uses the Hidden Markov Model Toolkit (HTK) to find tied-state triphones for all utterances, which are used as labels for the output layer of the DNN. The DNN, which contains a bottleneck layer, is first trained in an unsupervised manner to initialize the weights between layers, then trained and tuned in a supervised manner to generate transformed mel-frequency cepstral coefficients (T-MFCCs). Second, a shared-class-labels method is introduced among commonly misclassified classes to regularize the DNN weights. Third, DNN-based speaker models using the shifted delta cepstral (SDC) feature set are proposed; a speaker-aware model can capture the characteristics of speaker age and gender more effectively than a model that represents a group of speakers. In addition, the AGender-Tune system is proposed to classify speaker age and gender by jointly fine-tuning two DNN models: the first is pre-trained to classify the speaker's age, the second to classify the speaker's gender. Moreover, the new T-MFCC feature set is used as the input to a fusion of two systems, a DNN-based class model and a DNN-based speaker model; using the T-MFCCs as input and fusing the final score with the score of the DNN-based class model improved the classification accuracies. Finally, the DNN-based speaker models are embedded into the AGender-Tune system to exploit the advantages of each method for better speaker age and gender classification. Experimental results on a public, challenging database show the effectiveness of the proposed methods and achieve state-of-the-art accuracy on this database.
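As a rough illustration of the bottleneck-feature idea described in this abstract, the sketch below trains a DNN on tied-state triphone labels and exposes a narrow hidden layer whose activations would play the role of the T-MFCCs. It covers only the supervised fine-tuning step; the layer sizes, label count, and all names are illustrative assumptions, not details taken from the thesis.

```python
# Hypothetical sketch: a DNN predicts tied-state triphone labels from MFCC
# frames, and the activations of a narrow hidden ("bottleneck") layer are
# reused as transformed features. All dimensions are assumed, not quoted.
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    def __init__(self, n_mfcc=39, n_triphone_states=2000, bottleneck_dim=40):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_mfcc, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, bottleneck_dim),      # narrow bottleneck layer
        )
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(bottleneck_dim, 512), nn.ReLU(),
            nn.Linear(512, n_triphone_states),   # tied-state triphone targets
        )

    def forward(self, x):
        z = self.encoder(x)           # bottleneck activations = transformed features
        return self.classifier(z), z

model = BottleneckDNN()
frames = torch.randn(32, 39)                 # a batch of MFCC frames (toy data)
labels = torch.randint(0, 2000, (32,))       # triphone-state labels from an HTK alignment
logits, t_mfcc = model(frames)               # t_mfcc would be saved as the new feature set
loss = nn.CrossEntropyLoss()(logits, labels) # one supervised fine-tuning step
loss.backward()
```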

    Automatic identification of Brazilian regional accents based on statistical modeling and machine learning techniques

    Advisors: Lee Luan Ling, Tiago Fernandes Tavares. Master's dissertation, Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação.
    The speech signal has linguistic characteristics strongly determined by geographical (region of origin), social, and ethnic aspects, such as dialects and accents. These characteristics are directly tied to a language because they comprise intrinsic phonetic and phonological structures that differentiate it from others. Several studies in the speech-signal-processing literature aim to model regional speech variations for recognition systems, under the hypothesis that classifying linguistic variations can improve recognition accuracy and yield linguistic models better suited to real applications such as forensics and speech-to-text conversion. The performance of recognition systems is usually measured in a closed-set evaluation scenario, in which the training and testing data come from a common database; experiments reported in the literature consider this, the easiest case to evaluate. A more realistic assessment of a recognition system uses a cross-dataset scenario, in which the training and testing data come from two different, independent databases with no control over capture and recording conditions.
    In this work, we study speech pattern recognition techniques to identify the regional variations of Brazilian Portuguese speech. The goal is to automatically identify Brazilian regional accents using GMM-UBM, i-vector, and GMM-SVM models. We evaluate the accent recognition systems under both closed-set and cross-dataset scenarios. To perform the experiments we used three different Brazilian Portuguese databases; in fact, one of the major contributions of this work is the compilation of a new speech database (Braccent), which explicitly exposes part of the linguistic diversity of Brazilian Portuguese.
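As a hedged sketch of the GMM-UBM recipe named in this abstract: a universal background model is fit on pooled frames, each accent's model is derived by MAP-adapting the UBM means, and a test utterance is assigned to the highest-likelihood accent. The relevance factor, feature dimensionality, component count, and the toy data are all assumptions; a real system would use acoustic features extracted from the databases.

```python
# Minimal GMM-UBM sketch with MAP mean adaptation, assuming per-frame
# features are already extracted. Everything numeric here is illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, frames, relevance=16.0):
    """Return a copy of the UBM whose means are MAP-adapted to `frames`."""
    resp = ubm.predict_proba(frames)               # (T, C) component posteriors
    n_c = resp.sum(axis=0)                         # soft counts per component
    f_c = resp.T @ frames                          # first-order statistics
    alpha = (n_c / (n_c + relevance))[:, None]     # adaptation coefficients
    adapted = GaussianMixture(n_components=ubm.n_components, covariance_type="diag")
    adapted.weights_, adapted.covariances_ = ubm.weights_, ubm.covariances_
    adapted.precisions_cholesky_ = ubm.precisions_cholesky_
    adapted.means_ = alpha * (f_c / np.maximum(n_c[:, None], 1e-8)) \
                     + (1.0 - alpha) * ubm.means_  # only means are adapted
    return adapted

rng = np.random.default_rng(0)
pooled = rng.normal(size=(5000, 13))               # pooled training frames (toy MFCCs)
ubm = GaussianMixture(n_components=64, covariance_type="diag").fit(pooled)

accent_models = {a: map_adapt_means(ubm, rng.normal(loc=i, size=(800, 13)))
                 for i, a in enumerate(["north", "south", "east"])}  # toy accents
test = rng.normal(loc=1, size=(300, 13))
best = max(accent_models, key=lambda a: accent_models[a].score(test))
print("predicted accent:", best)
```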

    Speech Synthesis Based on Hidden Markov Models and Deep Learning Algorithms

    This thesis addresses the problem of improving the results of statistical parametric speech synthesis based on hidden Markov models (HMMs) using deep learning algorithms. The subject has become more important in recent times due to the increasing presence of artificial voices in devices and applications, in which there is a need to refine the results so that the sound of a synthetic voice approaches the naturalness and expressiveness of human speech. HMM-based speech synthesis became a hot topic after the second half of the 2000s thanks to its proven ability to generate speech from small amounts of data and its greater flexibility compared with other techniques; for this reason, the interest of the world's leading research groups in this area turned to refining its results. In this work, three proposals are made to improve those results: the first uses post-filters based on long short-term memory (LSTM) deep neural networks, the second combines them with Wiener filters, and the third follows a new discriminative approach. Unlike the preliminary proposals found in the literature, these are based on collections of various architectures, such as autoencoders and auto-associative memories, which are trained and applied according to subsets of the speech parameters. In this way, the results achieved surpass previous attempts that consider a single model focused mainly on the spectral components of the voices. In addition, two applications are presented in which HMM-based speech synthesis together with post-filter systems based on deep learning algorithms shows good results. The first is accent change in voices, a little-explored area for variants of Castilian Spanish. The second is noise reduction in signals degraded by both natural and artificial noise. Both the post-filter systems for speech synthesis and the additional applications combine deep learning algorithms with classical speech-signal enhancement techniques. The work presented here opens new lines of research in speech synthesis and in the enhancement of speech signals in the presence of noise.
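To make the post-filter idea concrete, the sketch below trains an LSTM network to map HMM-synthesized parameter trajectories toward aligned natural-speech targets as a residual regression. The parameter dimensionality, network sizes, and training data are illustrative assumptions, not the thesis's actual configuration.

```python
# Hedged sketch of an LSTM post-filter for synthesized speech parameters:
# the network learns a frame-wise correction that pushes synthetic
# trajectories toward natural ones. Dimensions are assumed, not quoted.
import torch
import torch.nn as nn

class LSTMPostFilter(nn.Module):
    def __init__(self, n_params=40, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_params, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_params)

    def forward(self, x):                  # x: (batch, frames, n_params)
        h, _ = self.lstm(x)
        return x + self.out(h)             # residual correction of each frame

postfilter = LSTMPostFilter()
synthetic = torch.randn(8, 200, 40)        # HMM-synthesized trajectories (toy data)
natural = torch.randn(8, 200, 40)          # aligned natural-speech targets (toy data)
loss = nn.MSELoss()(postfilter(synthetic), natural)
loss.backward()                            # one training step of the post-filter
```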

    Swiss French Regional Accent Identification

    In this paper, an attempt is made to automatically recognize the speaker's accent among regional Swiss French accents from four different regions of Switzerland, i.e. Geneva (GE), Martigny (MA), Neuchâtel (NE) and Nyon (NY). To achieve this goal, we rely on a generative probabilistic framework for classification based on Gaussian mixture modelling (GMM). Two different GMM-based algorithms are investigated: (1) the baseline technique of universal background modelling (UBM) followed by maximum-a-posteriori (MAP) adaptation, and (2) total variability (i-vector) modelling. Both systems perform well, with the i-vector-based system outperforming the baseline and achieving a relative improvement of 17.1% in overall regional accent identification accuracy.
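As a minimal sketch of the i-vector classification stage, assuming per-utterance i-vectors have already been extracted by an external toolkit: each region is represented by the mean of its length-normalized training i-vectors, and a test utterance is scored by cosine similarity against those centroids. Only the region labels (GE, MA, NE, NY) come from the paper; the vector dimension and data are toy assumptions.

```python
# Cosine scoring of precomputed i-vectors against per-region centroids.
# The 400-dimensional vectors and random data are illustrative only.
import numpy as np

def length_norm(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(1)
train = {a: length_norm(rng.normal(loc=i, size=(50, 400)))
         for i, a in enumerate(["GE", "MA", "NE", "NY"])}   # toy training i-vectors
centroids = {a: length_norm(vs.mean(axis=0)) for a, vs in train.items()}

test_ivec = length_norm(rng.normal(loc=2, size=400))        # one test utterance
scores = {a: float(test_ivec @ c) for a, c in centroids.items()}
print("predicted region:", max(scores, key=scores.get))
```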