60 research outputs found

    Automatic Speech Recognition for Low-Resource and Morphologically Complex Languages

    Get PDF
    The application of deep neural networks to the task of acoustic modeling for automatic speech recognition (ASR) has resulted in dramatic decreases of word error rates, allowing for the use of this technology in smart phones and personal home assistants in high-resource languages. Developing ASR models of this caliber, however, requires hundreds or thousands of hours of transcribed speech recordings, which presents challenges for most of the world’s languages. In this work, we investigate the applicability of three distinct architectures that have previously been used for ASR in languages with limited training resources. We tested these architectures using publicly available ASR datasets for several typologically and orthographically diverse languages, whose data was produced under a variety of conditions using different speech collection strategies, practices, and equipment. Additionally, we performed data augmentation on this audio, such that the amount of data could increase nearly tenfold, synthetically creating higher resource training. The architectures and their individual components were modified, and parameters explored such that we might find a best-fit combination of features and modeling schemas to fit a specific language morphology. Our results point to the importance of considering language-specific and corpus-specific factors and experimenting with multiple approaches when developing ASR systems for resource-constrained languages

    Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab

    Get PDF
    Articulatory copy synthesis (ACS), a subarea of speech inversion, refers to the reproduction of natural utterances and involves both the physiological articulatory processes and their corresponding acoustic results. This thesis proposes two novel methods for the ACS of human speech using the articulatory speech synthesizer VocalTractLab (VTL) to address or mitigate the existing problems of speech inversion, such as non-unique mapping, acoustic variation among different speakers, and the time-consuming nature of the process. The first method involved finding appropriate VTL gestural scores for given natural utterances using a genetic algorithm. It consisted of two steps: gestural score initialization and optimization. In the first step, gestural scores were initialized using the given acoustic signals with speech recognition, grapheme-to-phoneme (G2P), and a VTL rule-based method for converting phoneme sequences to gestural scores. In the second step, the initial gestural scores were optimized by a genetic algorithm via an analysis-by-synthesis (ABS) procedure that sought to minimize the cosine distance between the acoustic features of the synthetic and natural utterances. The articulatory parameters were also regularized during the optimization process to restrict them to reasonable values. The second method was based on long short-term memory (LSTM) and convolutional neural networks, which were responsible for capturing the temporal dependence and the spatial structure of the acoustic features, respectively. The neural network regression models were trained, which used acoustic features as inputs and produced articulatory trajectories as outputs. In addition, to cover as much of the articulatory and acoustic space as possible, the training samples were augmented by manipulating the phonation type, speaking effort, and the vocal tract length of the synthetic utterances. Furthermore, two regularization methods were proposed: one based on the smoothness loss of articulatory trajectories and another based on the acoustic loss between original and predicted acoustic features. The best-performing genetic algorithms and convolutional LSTM systems (evaluated in terms of the difference between the estimated and reference VTL articulatory parameters) obtained average correlation coefficients of 0.985 and 0.983 for speaker-dependent utterances, respectively, and their reproduced speech achieved recognition accuracies of 86.25% and 64.69% for speaker-independent utterances of German words, respectively. When applied to German sentence utterances, as well as English and Mandarin Chinese word utterances, the neural network based ACS systems achieved recognition accuracies of 73.88%, 52.92%, and 52.41%, respectively. The results showed that both of these methods not only reproduced the articulatory processes but also reproduced the acoustic signals of reference utterances. Moreover, the regularization methods led to more physiologically plausible articulatory processes and made the estimated articulatory trajectories be more articulatorily preferred by VTL, thus reproducing more natural and intelligible speech. This study also found that the convolutional layers, when used in conjunction with batch normalization layers, automatically learned more distinctive features from log power spectrograms. Furthermore, the neural network based ACS systems trained using German data could be generalized to the utterances of other languages

    Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information

    Get PDF
    This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages, which lack resources for speech and language processing. We focus on finding approaches which allow using data from multiple languages to improve the performance for those languages on different levels, such as feature extraction, acoustic modeling and language modeling. Under application aspects, this thesis also includes research work on non-native and Code-Switching speech

    Modularity and Neural Integration in Large-Vocabulary Continuous Speech Recognition

    Get PDF
    This Thesis tackles the problems of modularity in Large-Vocabulary Continuous Speech Recognition with use of Neural Network

    Automatic Speech Recognition for Documenting Endangered First Nations Languages

    Get PDF
    Automatic speech recognition (ASR) for low-resource languages is an active field of research. Over the past years with the advent of deep learning, impressive achievements have been reported using minimal resources. As many of the world’s languages are getting extinct every year, with every dying language we lose intellect, culture, values, and tradition which generally pass down for long generations. Linguists throughout the world have already initiated many projects on language documentation to preserve such endangered languages. Automatic speech recognition is a solution to accelerate the documentation process reducing the annotation time for field linguists as well as the overall cost of the project. A traditional speech recognizer is trained on thousands of hours of acoustic data and a phonetic dictionary that includes all words from the language. End-to-End ASR systems have shown dramatic improvement for major languages. Especially, recent advancement in self-supervised representation learning which takes advantage of large corpora of untranscribed speech data has become the state-of-the-art for speech recognition technology. However, for resource-constrained languages, the technology is not tested in depth. In this thesis, we explore both traditional methods of ASR and state-of-the-art end-to-end systems for modeling a critically endangered Athabascan language known as Upper Tanana. In our first approach, we investigate traditional models with a comparative study on feature selection and a performance comparison with deep hybrid models. With limited resources at our disposal, we build a working ASR system based on a grapheme-to-phoneme (G2P) phonetic dictionary. The acoustic model can also be used as a separate forced alignment tool for the automatic alignment of training data. The results show that the GMM-HMM methods outperform deep hybrid models in low-resource acoustic modeling. In our second approach, we propose using Domain-adapted Cross-lingual Speech Recognition (DA-XLSR) for an ASR system, developed over the wav2vec 2.0 framework that utilizes pretrained transformer models leveraging cross lingual data for building an acoustic representation. The proposed system uses a multistage transfer learning process in order to fine tune the final model. To supplement the limited data, we compile a data augmentation strategy combining six augmentation techniques. The speech model uses Connectionist Temporal Classification (CTC) for an alignment free training and does not require any pronunciation dictionary or language model. Experiments from the second approach demonstrate that it can outperform the best traditional or end-to-end models in terms of word error rate (WER) and produce a powerful utterance level transcription. On top of that, the augmentation strategy is tested on several end-to-end models, and it provides a consistent improvement in performance. While the best proposed model can currently reduce the WER significantly, it may still require further research to completely replace the need for human transcribers

    A Review on Human-Computer Interaction and Intelligent Robots

    Get PDF
    In the field of artificial intelligence, human–computer interaction (HCI) technology and its related intelligent robot technologies are essential and interesting contents of research. From the perspective of software algorithm and hardware system, these above-mentioned technologies study and try to build a natural HCI environment. The purpose of this research is to provide an overview of HCI and intelligent robots. This research highlights the existing technologies of listening, speaking, reading, writing, and other senses, which are widely used in human interaction. Based on these same technologies, this research introduces some intelligent robot systems and platforms. This paper also forecasts some vital challenges of researching HCI and intelligent robots. The authors hope that this work will help researchers in the field to acquire the necessary information and technologies to further conduct more advanced research

    Fundamental frequency modelling: an articulatory perspective with target approximation and deep learning

    Get PDF
    Current statistical parametric speech synthesis (SPSS) approaches typically aim at state/frame-level acoustic modelling, which leads to a problem of frame-by-frame independence. Besides that, whichever learning technique is used, hidden Markov model (HMM), deep neural network (DNN) or recurrent neural network (RNN), the fundamental idea is to set up a direct mapping from linguistic to acoustic features. Although progress is frequently reported, this idea is questionable in terms of biological plausibility. This thesis aims at addressing the above issues by integrating dynamic mechanisms of human speech production as a core component of F0 generation and thus developing a more human-like F0 modelling paradigm. By introducing an articulatory F0 generation model – target approximation (TA) – between text and speech that controls syllable-synchronised F0 generation, contextual F0 variations are processed in two separate yet integrated stages: linguistic to motor, and motor to acoustic. With the goal of demonstrating that human speech movement can be considered as a dynamic process of target approximation and that the TA model is a valid F0 generation model to be used at the motor-to-acoustic stage, a TA-based pitch control experiment is conducted first to simulate the subtle human behaviour of online compensation for pitch-shifted auditory feedback. Then, the TA parameters are collectively controlled by linguistic features via a deep or recurrent neural network (DNN/RNN) at the linguistic-to-motor stage. We trained the systems on a Mandarin Chinese dataset consisting of both statements and questions. The TA-based systems generally outperformed the baseline systems in both objective and subjective evaluations. Furthermore, the amount of required linguistic features were reduced first to syllable level only (with DNN) and then with all positional information removed (with RNN). Fewer linguistic features as input with limited number of TA parameters as output led to less training data and lower model complexity, which in turn led to more efficient training and faster synthesis
    corecore