230 research outputs found

    Evaluation of preprocessors for neural network speaker verification


    Current trends in multilingual speech processing

    In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS) as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies to help break down domestic and international language barriers, these also being barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies at the heart of which lies multilingual speech processing.

    CONNECTIONIST SPEECH RECOGNITION - A Hybrid Approach


    Visual Speech Synthesis using Dynamic Visemes and Deep Learning Architectures

    The aim of this work is to improve the naturalness of visual speech synthesis produced automatically from a linguistic input over existing methods. Firstly, the most important contribution is the investigation of the most suitable speech units for visual speech synthesis. We propose the use of dynamic visemes instead of phonemes or static visemes and find that dynamic visemes generate better visual speech than either phone or static viseme units. Moreover, the best performance is obtained by a combined phoneme-dynamic viseme system. Secondly, we compare the hidden Markov model (HMM) with different deep learning models that include feedforward and recurrent structures consisting of one-to-one, many-to-one and many-to-many architectures. Results suggest that frame-by-frame synthesis from the deep learning approach outperforms state-based synthesis from HMM approaches, and that an encoder-decoder many-to-many architecture is better than the one-to-one and many-to-one architectures. Thirdly, we explore the importance of contextual features that include information at varying linguistic levels, from the frame level up to the utterance level. We find that frame-level information is the most valuable feature, as it avoids discontinuities in the visual feature sequence and produces a smooth and realistic animation output. Fourthly, we find that the two most common objective measures, correlation and root mean square error, do not indicate the realism and naturalness perceived by human viewers. We introduce an alternative objective measure and show that the global variance is a better indicator of human perception of quality. Finally, we propose a novel method to convert a given text input and phoneme transcription into a dynamic viseme transcription when a reference dynamic viseme sequence is not available.
    Subjective preference tests confirmed that our proposed method produces animations that are statistically indistinguishable from animation produced using reference data.
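
The three objective measures named in this abstract can be illustrated with a minimal sketch. It assumes a one-dimensional feature trajectory and takes "global variance" to mean the variance of the trajectory over the whole utterance; the abstract does not give the exact formulation, so the function names and the toy data below are hypothetical. The example shows why correlation alone can be misleading: an over-smoothed prediction correlates perfectly with the reference yet has a much lower global variance.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def rmse(ref, pred):
    # Root mean square error between two equal-length trajectories.
    return math.sqrt(mean([(r - p) ** 2 for r, p in zip(ref, pred)]))


def correlation(ref, pred):
    # Pearson correlation coefficient.
    mr, mp = mean(ref), mean(pred)
    cov = sum((r - mr) * (p - mp) for r, p in zip(ref, pred))
    sr = math.sqrt(sum((r - mr) ** 2 for r in ref))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (sr * sp)


def global_variance(traj):
    # Variance of the trajectory over the whole utterance (assumed definition).
    m = mean(traj)
    return mean([(x - m) ** 2 for x in traj])


# Toy data (hypothetical): the prediction is the reference scaled by 0.5,
# i.e. the right shape but with damped movement.
reference = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
smoothed = [0.5 * x for x in reference]

print("correlation:", correlation(reference, smoothed))
print("rmse:", rmse(reference, smoothed))
print("GV reference:", global_variance(reference))
print("GV prediction:", global_variance(smoothed))
```

Comparing the global variance of the prediction with that of the reference exposes the damping that the correlation score hides.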

    On recognition of gestures arising in flight deck officer (FDO) training

    This thesis presents an on-line recognition machine (RM) for the continuous and isolated recognition of dynamic and static gestures that arise in Flight Deck Officer (FDO) training. This thesis considers 18 distinct and commonly used dynamic and static FDO gestures. Tracker-based and computer-vision-based systems are used to acquire the gestures. The recognition machine is based on the generic pattern recognition framework. The gestures are represented as templates using summary statistics. The proposed recognition algorithm exploits temporal and spatial characteristics of the gestures via dynamic programming and a Markovian process. The algorithm predicts the corresponding index of incremental input data in the templates in an on-line mode. Accumulated consistency in the sequence of predictions provides a similarity measurement (Score) between input data and the templates. Once Score has been estimated, some heuristics are employed to control the declaration in the final stages. The recognition machine addresses general gesture recognition issues: real-time recognition of dynamic gestures, the absence of known start/end points, and inter- and intra-personal temporal and spatial variance. The first two issues and temporal variance are addressed by the proposed algorithm. Spatial variance is addressed by introducing independent units to construct gesture models. An important aspect of the algorithm is that it provides an intuitive mechanism for automatic detection of start/end frames of continuous gestures. The algorithm has the additional advantage of providing timely feedback for training purposes. In this thesis, we consider isolated and continuous gestures. The performance of RM is evaluated using six datasets: artificial (W_TTest), hand motion (Yang, Perrotta), Gesture Panel and FDO (tracker, vision). The Hidden Markov Model (HMM) and Dynamic Time Warping (DTW) are used as baselines against which RM's results are compared.
    Various data analysis techniques are deployed to reveal the complexity and inter-similarity of the datasets before experiments are conducted. In the isolated recognition experiments, the recognition machine obtains results comparable with HMM and outperforms DTW. In the continuous experiments, RM surpasses HMM in terms of sentence and word recognition. In addition to these experiments, a multilayer perceptron neural network (MLPNN) is introduced for the prediction process of RM to validate the modularity of RM. The overall conclusion of the thesis is that RM achieves comparable results which are in agreement with HMM and DTW. Furthermore, the recognition machine provides more reliable and accurate recognition in the case of missing and noisy data. The recognition machine addresses some common limitations of these algorithms and of general temporal pattern recognition in the context of FDO training. The recognition algorithm is thus suited for on-line recognition.
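
For readers unfamiliar with the DTW baseline mentioned in this abstract, the following is a minimal sketch of the textbook dynamic time warping distance between two one-dimensional sequences. It is not the thesis's RM algorithm; the function name and the toy sequences are hypothetical and chosen only to show how dynamic programming absorbs temporal variance.

```python
def dtw_distance(a, b):
    """Textbook DTW distance with absolute difference as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Toy data (hypothetical): the same gesture performed at half speed,
# and a different gesture of the same length as the template.
template = [0, 1, 2, 1, 0]
slow = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]
other = [2, 1, 0, 1, 2]

print(dtw_distance(template, slow))
print(dtw_distance(template, other))
```

The slowed-down copy aligns with the template at zero cost, while the different gesture does not, which is the sense in which DTW tolerates variation in execution speed.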

    Processing hidden Markov models using recurrent neural networks for biological applications

    Philosophiae Doctor - PhD. In this thesis, we present a novel hybrid architecture that combines two of the most popular sequence recognition models, Recurrent Neural Networks (RNNs) and Hidden Markov Models (HMMs). Though sequence recognition problems could potentially be modelled through well-trained HMMs, HMMs cannot provide a reasonable solution to complicated recognition problems. In contrast, the ability of RNNs to handle complex sequence recognition problems is known to be exceptionally good. It should be noted that methods for applying HMMs to RNNs have been developed by other researchers in the past. However, to the best of our knowledge, no algorithm for processing HMMs through learning has been given. Taking advantage of the structural similarities between the dynamics of RNNs and HMMs, in this work we analyze the combination of these two systems into a hybrid architecture. The main objective of this study is to improve sequence recognition/classification performance by applying a hybrid neural/symbolic approach. In particular, trained HMMs are used as the initial symbolic domain theory and directly encoded into an appropriate RNN architecture, meaning that the prior knowledge is processed through the training of RNNs. The proposed algorithm is then implemented on sample test beds and other real-time biological applications.
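
The structural similarity between HMMs and RNNs mentioned in this abstract can be seen in the standard HMM forward algorithm, where a state vector is updated once per observation, much like a recurrent hidden state. The sketch below shows only that textbook recursion; it is not the encoding proposed in the thesis, and the two-state model parameters are made up for illustration.

```python
import itertools


def hmm_forward(pi, A, B, observations):
    """Likelihood of an observation sequence under a discrete HMM.

    pi[i]   : initial probability of state i
    A[i][j] : transition probability from state i to state j
    B[i][o] : probability of emitting symbol o in state i
    """
    n = len(pi)
    # Initial "hidden state": joint probability of state i and first symbol.
    alpha = [pi[i] * B[i][observations[0]] for i in range(n)]
    for o in observations[1:]:
        # Recurrent update: linear map by A, then elementwise gating by B.
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
            for j in range(n)
        ]
    return sum(alpha)


# Hypothetical two-state, two-symbol model.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]

# Sanity check: likelihoods of all length-3 sequences should sum to one.
total = sum(
    hmm_forward(pi, A, B, list(seq))
    for seq in itertools.product([0, 1], repeat=3)
)
print(total)
```

Each step multiplies the previous state vector by the transition matrix and then scales it by the emission probabilities of the current symbol, which is the shape of update that an RNN layer can, in principle, be initialised to mimic.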

    Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information

    This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages that lack resources for speech and language processing. We focus on finding approaches that use data from multiple languages to improve performance for those languages at different levels, such as feature extraction, acoustic modeling and language modeling. On the application side, this thesis also includes research work on non-native and code-switching speech.