
    On the Use of Deep Feedforward Neural Networks for Automatic Language Identification

    In this work, we present a comprehensive study on the use of deep neural networks (DNNs) for automatic language identification (LID). Motivated by the recent success of DNNs in acoustic modeling for speech recognition, we adapt DNNs to the problem of identifying the language of a given utterance from its short-term acoustic features. We propose two DNN-based approaches. In the first, the DNN acts as an end-to-end LID classifier, receiving the speech features as input and providing the estimated probabilities of the target languages as output. In the second, the DNN is used to extract bottleneck features that then serve as inputs to a state-of-the-art i-vector system. Experiments are conducted in two scenarios: the complete NIST Language Recognition Evaluation 2009 dataset (LRE’09) and a subset of the Voice of America (VOA) data from LRE’09 in which all languages have the same amount of training data. Results for both datasets demonstrate that the DNN-based systems significantly outperform a state-of-the-art i-vector system when dealing with short-duration utterances. Furthermore, combining the DNN-based and classical i-vector systems yields additional gains (up to 45% relative improvement in both EER and Cavg on the 3 s and 10 s conditions, respectively).
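The first, end-to-end approach scores every short-term feature frame and pools the per-frame language posteriors over the whole utterance. A minimal sketch of that pooling step, with toy logits and a generic softmax (the paper's actual network topology is not reproduced here):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def utterance_language_posterior(frame_logits):
    """Average per-frame language posteriors over the utterance.

    frame_logits: (num_frames, num_languages) raw DNN outputs.
    Returns the utterance-level posterior vector.
    """
    return softmax(frame_logits).mean(axis=0)

# Toy example: 4 frames, 3 candidate languages; language index 1 dominates.
logits = np.array([[0.2, 2.0, 0.1],
                   [0.0, 1.5, 0.3],
                   [0.1, 1.8, 0.0],
                   [0.3, 2.2, 0.2]])
posterior = utterance_language_posterior(logits)
predicted = int(np.argmax(posterior))
```

Averaging posteriors rather than picking a per-frame winner is what makes short utterances workable: every frame contributes evidence, and the final decision is a single argmax over languages.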

    Towards age-independent acoustic modeling

    In automatic speech recognition applications, adults and children are usually treated as two population groups, with separate acoustic models trained for each, owing to significant differences in voice characteristics. In this paper, age-independent acoustic modeling is investigated in the context of large-vocabulary speech recognition. Exploiting a small amount (9 hours) of children's speech and a more substantial amount (57 hours) of adult speech, age-independent acoustic models are trained using several methods for speaker-adaptive acoustic modeling. Recognition results achieved with these models are compared with those achieved using age-dependent acoustic models for children and adults, respectively. Recognition experiments are performed on four Italian speech corpora, two consisting of children's speech and two of adult speech, using 64k-word and 11k-word trigram language models. Methods for speaker-adaptive acoustic modeling prove effective for training age-independent acoustic models, ensuring recognition results at least as good as those achieved with age-dependent acoustic models for adults and children.

    Automatic Conversion of Emotions in Speech within a Speaker Independent Framework

    Emotions in speech are a fundamental part of a natural dialog. In everyday life, vocal interaction with people often carries emotions as an intrinsic part of the conversation to a greater or lesser extent. The inclusion of emotions in human-machine dialog systems is therefore crucial to achieving an acceptable degree of naturalness in communication. This thesis focuses on automatic emotion conversion of speech, a technique whose aim is to transform an utterance produced in a neutral style into a given emotional state, in a speaker-independent context. Conversion of emotions is challenging in that emotions significantly affect all parts of the human vocal production system, and all of these factors must be taken into account carefully in the conversion process. The techniques in the literature are based on voice conversion approaches, with minor modifications to create the sensation of emotion. This thesis likewise builds on voice conversion systems, but the usual regression process is divided into a two-step procedure that provides additional speaker normalization, using vocal tract length normalization as a pre-processing technique to remove the intrinsic speaker dependency of such systems. In addition, a new method is proposed to convert the duration trend of the utterance and the intonation contour, taking contextual information into account.
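Vocal tract length normalization, used here as a pre-processing step, is commonly implemented as a frequency warping of the spectrum. A minimal sketch of a standard piecewise-linear warp; the break-frequency ratio and Nyquist value below are illustrative defaults, not the thesis's exact configuration:

```python
def piecewise_linear_warp(f, alpha, f_nyquist=8000.0, f_break_ratio=0.875):
    """Warp frequency f (Hz) by factor alpha below a break frequency,
    then connect linearly so that f_nyquist maps onto itself."""
    f0 = f_break_ratio * f_nyquist
    if f <= f0:
        return alpha * f
    # linear segment from (f0, alpha * f0) to (f_nyquist, f_nyquist)
    slope = (f_nyquist - alpha * f0) / (f_nyquist - f0)
    return alpha * f0 + slope * (f - f0)
```

With alpha > 1 the spectrum is stretched (shorter vocal tract), with alpha < 1 it is compressed; the final linear segment keeps the warped axis within the analysis bandwidth.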

    Steuerung sprechernormalisierender Abbildungen durch künstliche neuronale Netzwerke (Control of Speaker-Normalizing Mappings by Artificial Neural Networks)

    In the sense of this work, speaker normalization means pre-processing or filtering the prepared input signals of an automatic speech recognition system with the goal of reducing the variation among the signals of analogous utterances from different speakers. This reduces ambiguity and thereby improves the recognition performance of the subsequent classifier. In this work, normalizations are investigated using a method based on a principal component analysis of the Bark spectrograms, and using mappings of the spectrograms by means of single- and multi-layer perceptrons. Particular attention is paid to the interpolability of neighborhood relations between different speakers. It is specifically addressed how this interpolation can likewise be achieved automatically using further perceptrons. The information required for this is in turn provided by Bark spectrograms as well as by articulatory parameters, likewise derived from the speech signal.
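The PCA-based normalization described in the abstract can be sketched roughly as projecting spectrogram frames onto a shared principal-component basis and re-centering them on a reference speaker. This is a crude stand-in for the actual procedure, with hypothetical function names and random toy data:

```python
import numpy as np

def pca_basis(X, k):
    """Top-k principal directions of mean-centered frames X, shape (n, d)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]                                       # (k, d) row basis

def normalize_speaker(frames, basis, ref_mean):
    """Project a speaker's frames into the shared subspace and
    re-center them on a reference speaker's mean spectrum."""
    coords = (frames - frames.mean(axis=0)) @ basis.T   # (n, k)
    return coords @ basis + ref_mean                    # back to (n, d)

rng = np.random.default_rng(1)
train = rng.standard_normal((200, 24))    # pooled Bark-like frames, 24 bands
basis = pca_basis(train, k=6)
speaker = rng.standard_normal((50, 24)) + 3.0   # speaker with shifted mean
normed = normalize_speaker(speaker, basis, ref_mean=train.mean(axis=0))
```

Because each speaker's own mean is subtracted before projection, the normalized frames end up centered on the reference mean, removing one gross source of inter-speaker variation.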

    Influence of Morphological Features on Language Modeling With Neural Networks in Speech Recognition Systems

    Automatic speech recognition is a technology that allows computers to convert spoken words into text. It can be applied in various areas that involve communication between humans and machines. This thesis primarily deals with one of the two main components of speech recognition systems: the language model, which specifies the vocabulary of the system as well as the rules by which individual words can be linked into sentences. The Serbian language belongs to a group of highly inflective and morphologically rich languages, which means that it uses a number of different word endings to express the desired grammatical, syntactic, or semantic function of a given word. Such behavior often leads to a significant number of errors in speech recognition systems: due to good acoustic matching, the recognizer correctly guesses the basic form of the word but errs on the word ending. This ending may indicate a different morphological category, for example case, grammatical gender, or grammatical number. The thesis presents a new language modeling tool which, along with the word identity, can also model additional lexical and morphological features of the word, thus testing the hypothesis that this additional information can help overcome a significant number of recognition errors that result from the high inflectivity of Serbian.
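The kind of morphological agreement such a tool can exploit is easy to illustrate with a toy maximum-likelihood bigram model over morphological tags. The tag set and training pairs below are hypothetical, and the actual tool described in the thesis is far more elaborate:

```python
from collections import Counter

def train_tag_bigram(tagged_sentences):
    """Maximum-likelihood bigram model over morphological tags.

    tagged_sentences: lists of (word, tag) pairs, where tags encode
    e.g. case/gender (a hypothetical tag set for illustration).
    """
    bigram, unigram = Counter(), Counter()
    for sent in tagged_sentences:
        tags = ["<s>"] + [t for _, t in sent]
        for prev, cur in zip(tags, tags[1:]):
            bigram[(prev, cur)] += 1
            unigram[prev] += 1
    def prob(prev, cur):
        return bigram[(prev, cur)] / unigram[prev] if unigram[prev] else 0.0
    return prob

# Tiny corpus: adjective and noun agree in case (nominative vs. accusative).
corpus = [
    [("nova", "Adj.Nom.Fem"), ("knjiga", "Noun.Nom.Fem")],
    [("novu", "Adj.Acc.Fem"), ("knjigu", "Noun.Acc.Fem")],
]
prob = train_tag_bigram(corpus)
p_agree = prob("Adj.Nom.Fem", "Noun.Nom.Fem")  # agreeing endings
p_clash = prob("Adj.Nom.Fem", "Noun.Acc.Fem")  # case disagreement
```

A recognizer that scores hypotheses with such tag statistics can prefer the word ending that agrees morphologically with its context, even when the acoustics of the endings are nearly identical.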

    Deep Neural Network Architectures for Large-scale, Robust and Small-Footprint Speaker and Language Recognition

    Unpublished doctoral thesis, defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones. Defense date: 27-04-2017. Artificial neural networks are powerful learners of the information embedded in speech signals. They can provide compact, multi-level, nonlinear representations of temporal sequences and holistic optimization algorithms capable of surpassing former leading paradigms. Artificial neural networks are, therefore, a promising technology that can be used to enhance our ability to recognize speakers and languages, an ability increasingly in demand in the context of the new, voice-enabled interfaces used today by millions of users. The aim of this thesis is to advance the state-of-the-art of language and speaker recognition through the formulation, implementation, and empirical analysis of novel approaches for large-scale and portable speech interfaces. Its major contributions are: (1) novel, compact network architectures for language and speaker recognition, including a variety of network topologies based on fully connected, recurrent, convolutional, and locally connected layers; (2) a bottleneck combination strategy for classical and neural network approaches for long speech sequences; (3) the architectural design of the first public, multilingual, large-vocabulary continuous speech recognition system; and (4) a novel, end-to-end optimization algorithm for text-dependent speaker recognition that is applicable to a range of verification tasks. Experimental results have demonstrated that artificial neural networks can substantially reduce the number of model parameters and surpass the performance of previous approaches to language and speaker recognition, particularly in the cases of long short-term memory recurrent networks (used to model the input speech signal), end-to-end optimization algorithms (used to predict languages or speakers), short testing utterances, and large training data collections.
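Bottleneck features of the kind referred to in contribution (2) are simply the activations of a deliberately narrow hidden layer, reused as a compact input representation for a downstream system. A toy sketch with random weights; the layer sizes and ReLU nonlinearity are illustrative assumptions, not the thesis's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Hypothetical sizes: 40-dim input features, one wide hidden layer,
# then a narrow 8-dim "bottleneck" whose activations become features.
# The layers after the bottleneck (and the output layer) are omitted,
# since only the bottleneck activations are extracted.
W1 = rng.standard_normal((40, 256)) * 0.1
W2 = rng.standard_normal((256, 8)) * 0.1    # bottleneck layer

def bottleneck_features(frames):
    """Forward frames up to the bottleneck layer, return its activations."""
    h1 = relu(frames @ W1)
    return relu(h1 @ W2)

frames = rng.standard_normal((100, 40))     # 100 frames of 40-dim features
feats = bottleneck_features(frames)         # (100, 8) compact features
```

In the combination strategy, such compact per-frame features would replace or augment the raw acoustic features fed to a classical back-end such as an i-vector system.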

    Improved methods for vocal tract normalization

    This paper presents improved methods for vocal tract normalization (VTN), along with experimental tests on three databases. We propose a new method for VTN in training: by using acoustic models with a single Gaussian density per state to select the normalization scales, the models are prevented from learning the normalization scales of the training speakers. We show that using single Gaussian densities to select the normalization scales in training results in lower error rates than using mixture densities. For VTN in recognition, we propose an improvement of the well-known multiple-pass strategy: by using an unnormalized acoustic model for the first recognition pass instead of a normalized model, lower error rates are obtained. In recognition tests, this method is compared with a fast variant of VTN. The multiple-pass strategy is efficient but suboptimal, because the normalization scale and the word sequence are determined sequentially. We found that for telephone digit string recognition this suboptimality reduces the VTN gain in recognition performance by 30% relative. On the German spontaneous scheduling task Verbmobil, the WSJ task, and the German telephone digit string corpus SieTill, the proposed methods for VTN reduce the error rates significantly.
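The scale selection at the heart of VTN can be pictured as a maximum-likelihood search over candidate warp factors, each scored under the acoustic model. A toy sketch in which a single global diagonal Gaussian stands in for the per-state densities used in the paper:

```python
import numpy as np

def gaussian_loglik(x, mean, var):
    """Per-frame log-likelihood of frames x under a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var,
                         axis=-1)

def select_warp_scale(warped_versions, mean, var):
    """Pick the warp factor whose warped features score best.

    warped_versions: dict mapping warp factor -> (frames, dim) features,
    i.e. the same utterance re-extracted with each candidate warp.
    """
    scores = {a: gaussian_loglik(x, mean, var).sum()
              for a, x in warped_versions.items()}
    return max(scores, key=scores.get)

# Toy data: under a zero-mean unit-variance model, the features warped
# with factor 1.0 lie much closer to the model and should be selected.
mean, var = np.zeros(3), np.ones(3)
versions = {0.9: np.full((5, 3), 2.0),
            1.0: np.full((5, 3), 0.1)}
best = select_warp_scale(versions, mean, var)
```

The paper's point about single Gaussians fits this picture: with one density per state, the model cannot absorb speaker-specific warps into separate mixture components, so the grid search has to do the normalizing.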