71 research outputs found

    Windowed Attention Mechanisms for Speech Recognition


    Exploring Automatic Speech Recognition with TensorFlow

    Speech recognition is the task of identifying words in spoken language and converting them into text. This bachelor's thesis focuses on using deep learning techniques to build an end-to-end speech recognition system. As a preliminary step, we survey the most relevant methods of the last several years. Then, we study one of the latest proposals for this end-to-end approach, which uses a sequence-to-sequence model with attention. Next, we successfully reproduce the model and evaluate it on the TIMIT database. We analyze the similarities and differences between the implementation and the original theoretical work. Finally, we experiment with different hyperparameters (e.g. number of units per layer, learning rate, and batch size) and reduce the Phoneme Error Rate by almost 12% relative.

    END-TO-END SPEECH RECOGNITION SYSTEMS FOR AGGLUTINATIVE LANGUAGES

    With the improvement of intelligent systems, speech recognition technologies are being widely integrated into many aspects of human life. Speech recognition is applied in smart assistants, smart home infrastructure, bank call center applications, information system components for people with impairments, etc. However, these facilities are available only for widely spoken languages such as English, Chinese, or Russian. For low-resource languages, these opportunities have not yet been implemented. Most modern speech recognition approaches have not been tested on agglutinative languages, especially languages of the Turkic group such as Kazakh, Tatar, and Turkish. The HMM-GMM (Hidden Markov Model - Gaussian Mixture Model) approach was the most popular in the field of Automatic Speech Recognition (ASR) for a long time. Currently, neural networks are widely used in different fields of NLP, especially in automatic speech recognition. In a large number of works, applying neural networks at different stages of automatic speech recognition considerably improves the quality of these systems. The article investigates end-to-end speech recognition systems based on neural networks. The paper proves that the Connectionist Temporal Classification (CTC) model works accurately for agglutinative languages. The author conducted an experiment with an LSTM neural network using an attention-based encoder-decoder model. The experiment yielded a Character Error Rate (CER) of 8.01% and a Word Error Rate (WER) of 17.91%. This result proves that a good ASR model can be obtained without the use of a Language Model (LM).

    Supervised Sequence Labelling with Recurrent Neural Networks


    Phonetic Error Analysis Beyond Phone Error Rate

    In this article, we analyse the performance of TIMIT-based phone recognition systems beyond the overall phone error rate (PER) metric. We consider three broad phonetic classes (BPCs): {affricate, diphthong, fricative, nasal, plosive, semi-vowel, vowel, silence}, {consonant, vowel, silence} and {voiced, unvoiced, silence}, and calculate the contribution of each phonetic class in terms of substitution, deletion, insertion and PER. Furthermore, for each BPC we investigate the following: evolution of PER during training, effect of noise (NTIMIT), importance of different spectral subbands (1, 2, 4, and 8 kHz), usefulness of bidirectional vs unidirectional sequential modelling, transfer learning from WSJ and regularisation via monophones. In addition, we construct a confusion matrix for each BPC and analyse the confusions via dimensionality reduction to 2D at the input (acoustic features) and output (logits) levels of the acoustic model. We also compare the performance and confusion matrices of the BLSTM-based hybrid baseline system with those of the GMM-HMM based hybrid, Conformer and wav2vec 2.0 based end-to-end phone recognisers. Finally, the relationship of the unweighted and weighted PERs with the broad phonetic class priors is studied for both the hybrid and end-to-end systems.
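
    The substitution, deletion and insertion counts mentioned in this abstract are conventionally obtained from a minimum-edit (Levenshtein) alignment between the reference and the recognised phone sequences, with PER = (S + D + I) / N for N reference phones. The listing does not include the authors' scoring code, so the following is only a minimal sketch of that standard definition; the function names `error_counts` and `phone_error_rate` and the example phone labels are hypothetical.

```python
from typing import Sequence, Tuple


def error_counts(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) for one minimum-edit alignment."""
    n, m = len(ref), len(hyp)
    # cell[i][j] = (total_edits, subs, dels, ins) for ref[:i] versus hyp[:j]
    cell = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cell[i][0] = (i, 0, i, 0)  # every reference phone deleted
    for j in range(1, m + 1):
        cell[0][j] = (j, 0, 0, j)  # every hypothesis phone inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, k = cell[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, k)          # match, no cost
            else:
                diag = (t + 1, s + 1, d, k)  # substitution
            t, s, d, k = cell[i - 1][j]
            dele = (t + 1, s, d + 1, k)      # deletion of a reference phone
            t, s, d, k = cell[i][j - 1]
            ins = (t + 1, s, d, k + 1)       # insertion of a hypothesis phone
            cell[i][j] = min(diag, dele, ins, key=lambda c: c[0])
    _, subs, dels, inss = cell[n][m]
    return subs, dels, inss


def phone_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """PER = (S + D + I) / N, where N is the number of reference phones."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return sum(error_counts(ref, hyp)) / len(ref)


# Illustrative labels only, not taken from any of the listed papers.
reference = ["sil", "k", "ae", "t", "sil"]
hypothesis = ["sil", "k", "ah", "sil"]
print(error_counts(reference, hypothesis))
print(phone_error_rate(reference, hypothesis))
```

    A per-class breakdown like the one described in the abstract would additionally need a mapping from each phone to its broad phonetic class, which is not given here.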