
    Current trends in multilingual speech processing

    In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years, and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS), as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies to help break down domestic and international language barriers, which are also barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies, at the heart of which lies multilingual speech processing.

    Embedded Knowledge-based Speech Detectors for Real-Time Recognition Tasks

    Speech recognition has become common in many application domains, from dictation systems for professional practices to vocal user interfaces for people with disabilities or hands-free system control. However, so far the performance of automatic speech recognition (ASR) systems is comparable to human speech recognition (HSR) only under very strict working conditions, and is in general much lower. Incorporating acoustic-phonetic knowledge into ASR design has proven a viable approach to raising ASR accuracy. Manner-of-articulation attributes such as vowel, stop, fricative, approximant, nasal, and silence are examples of such knowledge. Neural networks have already been used successfully as detectors for manner-of-articulation attributes, starting from representations of speech signal frames. In this paper, the full system implementation is described. The system has a first stage for MFCC extraction, followed by a second stage implementing a sinusoidal-based multi-layer perceptron for speech event classification. Implementation details for a Celoxica RC203 board are given.
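
    A minimal software sketch of the two-stage pipeline this abstract describes (MFCC extraction, then a per-frame MLP attribute classifier). It is an illustrative analogue only, assuming librosa and scikit-learn: the paper's MLP uses sinusoidal units on an FPGA, which a standard MLPClassifier does not reproduce, and the training data here is a placeholder.

    ```python
    # Stage 1: MFCCs per frame; Stage 2: MLP labels each frame with a
    # manner-of-articulation attribute. Software analogue of the pipeline
    # described above, not the paper's Celoxica RC203 implementation.
    import numpy as np
    import librosa
    from sklearn.neural_network import MLPClassifier

    ATTRIBUTES = ["vowel", "stop", "fricative", "approximant", "nasal", "silence"]

    def frame_mfccs(wav_path, sr=16000, n_mfcc=13):
        """Per-frame MFCC feature vectors, frames as rows."""
        y, sr = librosa.load(wav_path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        return mfcc.T  # shape: (n_frames, n_mfcc)

    # Stand-in frame classifier (standard activations, not sinusoidal).
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)

    # X_train: stacked MFCC frames; y_train: attribute index per frame,
    # both from a phonetically labelled corpus (hypothetical, not shown).
    # clf.fit(X_train, y_train)
    # labels = [ATTRIBUTES[i] for i in clf.predict(frame_mfccs("utt.wav"))]
    ```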

    Language independent and unsupervised acoustic models for speech recognition and keyword spotting

    Developing high-performance speech processing systems for low-resource languages is very challenging. One approach to addressing the lack of resources is to make use of data from multiple languages. A popular direction in recent years is to train a multi-language bottleneck DNN. Language-dependent and/or multi-language (all training languages) Tandem acoustic models (AMs) are then trained. This work considers a particular scenario where the target language is unseen in multi-language training and has limited language model training data, a limited lexicon, and acoustic training data without transcriptions. A zero acoustic resources case is first described, where a multi-language AM is directly applied, as a language-independent AM (LIAM), to an unseen language. Secondly, in an unsupervised approach, a LIAM is used to obtain hypothesis transcriptions for the target language acoustic data, which are then used in training a language-dependent AM. Three languages from the IARPA Babel project are used for assessment: Vietnamese, Haitian Creole and Bengali. Performance of the zero acoustic resources system is found to be poor, with keyword spotting at best 60% of language-dependent performance. Unsupervised language-dependent training yields performance gains. For one language (Haitian Creole) the Babel target is achieved on the in-vocabulary data.
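
    A schematic of the two-step unsupervised recipe described above: decode the untranscribed target-language audio with the language-independent AM, then use the resulting hypotheses as training transcripts for a language-dependent Tandem AM. Every object and function here (the `liam` object, `decode`, `train_tandem_am`) is a hypothetical stand-in for a real toolkit pipeline, shown only to make the structure explicit.

    ```python
    # Hypothetical sketch of the unsupervised training recipe; the AM
    # objects and helper functions stand in for a real ASR toolkit.

    def unsupervised_language_dependent_am(liam, target_audio, lexicon, lm):
        """liam: multi-language AM applied as a language-independent AM."""
        # Step 1 (zero acoustic resources): decode the untranscribed
        # target audio directly with the LIAM to get hypothesis
        # transcriptions, using the limited lexicon and LM data.
        hypotheses = [liam.decode(utt, lexicon=lexicon, lm=lm)
                      for utt in target_audio]

        # Step 2 (unsupervised training): treat the hypotheses as labels
        # and train a language-dependent Tandem AM on the target audio.
        lang_dep_am = train_tandem_am(audio=target_audio,
                                      transcripts=hypotheses,
                                      bottleneck_features=liam.bottleneck)
        return lang_dep_am
    ```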

    An on-line VAD based on Multi-Normalisation Scoring (MNS) of observation likelihoods

    Preprint of the article published online on 31 May 2018. Voice activity detection (VAD) is an essential task in expert systems that rely on oral interfaces. The VAD module detects the presence of human speech and separates speech segments from silences and non-speech noises. The most popular current on-line VAD systems are based on adaptive parameters which seek to cope with varying channel and noise conditions. The main disadvantages of this approach are the need for some initialisation time to properly adjust the parameters to the incoming signal, and uncertain performance when the initial parameters are poorly estimated. In this paper we propose a novel on-line VAD based only on previous training, which does not introduce any delay. The technique is based on a strategy that we have called Multi-Normalisation Scoring (MNS). It consists of obtaining a vector of multiple observation likelihood scores from normalised mel-cepstral coefficients previously computed from different databases. A classifier is then used to label the incoming observation likelihood vector. Encouraging results have been obtained with a Multi-Layer Perceptron (MLP). This technique can generalise to unseen noise levels and types. A validation experiment with two current standard ITU-T VAD algorithms demonstrates the good performance of the method. Indeed, lower classification error rates are obtained for non-speech frames, while results for speech frames are similar. This work was partially supported by the EU (ERDF) under grant TEC2015-67163-C2-1-R (RESTORE) (MINECO/ERDF, EU) and by the Basque Government under grant KK-2017/00043 (BerbaOla).
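
    An illustrative sketch of the Multi-Normalisation Scoring idea as described above: the same mel-cepstral frame is normalised with statistics precomputed from several databases, scored for observation likelihood under a model per normalisation, and the stacked score vector is labelled by an MLP. The choice of GMMs as the likelihood models, and all sizes, are assumptions, not the paper's exact configuration.

    ```python
    # MNS sketch: one likelihood score per (normalisation, model) pair,
    # then an MLP labels the score vector as speech / non-speech.
    # GMM likelihood models are an assumption of this sketch.
    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.neural_network import MLPClassifier

    def mns_vector(frame, norm_stats, gmms):
        """frame: one MFCC vector; returns the stacked score vector."""
        scores = []
        for (mean, std), gmm in zip(norm_stats, gmms):
            normalised = (frame - mean) / std              # per-database stats
            scores.append(gmm.score(normalised[None, :]))  # log-likelihood
        return np.array(scores)

    # norm_stats: (mean, std) pairs precomputed offline from different
    # databases; gmms: fitted likelihood models matching each database.
    # Everything is precomputed, so scoring needs no adaptation time --
    # the on-line, zero-delay property claimed above.
    # vad = MLPClassifier(hidden_layer_sizes=(32,))
    # vad.fit(np.stack([mns_vector(f, norm_stats, gmms) for f in frames]),
    #         speech_labels)  # 1 = speech frame, 0 = non-speech frame
    ```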

    Data Mining


    A Novel Approach for Speech to Text Recognition System Using Hidden Markov Model

    Speech recognition is the application of sophisticated algorithms that transform the human voice into text. Speech identification is essential, as it is utilized by several biometric identification systems and voice-controlled automation systems. Variations in recording equipment, speakers, situations, and environments make speech recognition a tough undertaking. Speech recognition comprises three major phases: speech pre-processing, feature extraction, and speech categorization. This work presents a comprehensive study with the objectives of comprehending, analyzing, and enhancing the models and approaches, such as Hidden Markov Models and Artificial Neural Networks, employed in voice recognition systems for feature extraction and classification.
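
    A minimal sketch of HMM-based isolated-word recognition in the spirit of the pipeline above (pre-processing, feature extraction, classification): one HMM is trained per word on its MFCC sequences, and recognition picks the word whose model scores the observation sequence highest. The use of hmmlearn and the 5-state setting are assumptions of this sketch, not details from the paper.

    ```python
    # Per-word Gaussian HMMs over MFCC sequences; classification is a
    # maximum-likelihood decision across the word models.
    import numpy as np
    from hmmlearn import hmm

    def train_word_models(training_data, n_states=5):
        """training_data: {word: [MFCC array of shape (T, n_mfcc), ...]}"""
        models = {}
        for word, sequences in training_data.items():
            X = np.vstack(sequences)              # frames stacked row-wise
            lengths = [len(s) for s in sequences]  # per-utterance lengths
            m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
            m.fit(X, lengths)
            models[word] = m
        return models

    def recognise(models, mfcc_sequence):
        """Return the word whose HMM gives the highest log-likelihood."""
        return max(models, key=lambda w: models[w].score(mfcc_sequence))
    ```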

    Comprehensive machine learning and deep learning methods for the classification of Parkinson's disease and the assessment of its severity

    In this study, we aimed to adopt a comprehensive approach to categorizing and assessing the severity of Parkinson's disease by leveraging techniques from both machine learning and deep learning. We thoroughly evaluated the effectiveness of various models, including XGBoost, Random Forest, Multi-Layer Perceptron (MLP), and Recurrent Neural Network (RNN), using classification metrics, and generated detailed reports to facilitate a comprehensive comparative analysis of these models. Notably, XGBoost demonstrated the highest precision at 97.4%. Additionally, we developed a Gated Recurrent Unit (GRU) model to combine the predictions of the other models, and assessed its ability to predict the severity of the disease. To quantify the precision of the models in disease classification, we calculated severity percentages. Furthermore, we created a Receiver Operating Characteristic (ROC) curve for the GRU model, simplifying the evaluation of its capability to distinguish among the various severity levels. This comprehensive approach contributes to a more accurate and detailed understanding of Parkinson's disease severity assessment.
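
    A hypothetical sketch of the combination step described above: predicted class probabilities from the base models (e.g. XGBoost, Random Forest, MLP) are stacked into a short "sequence" of predictions and a GRU learns to combine them into a final severity decision. Shapes, layer sizes, and the stacking scheme are assumptions; the study's exact architecture is not given here.

    ```python
    # GRU meta-model that combines base classifiers' probability outputs.
    # Each base model's probability vector is one step of the sequence.
    import torch
    import torch.nn as nn

    class GRUCombiner(nn.Module):
        def __init__(self, n_classes, hidden=32):
            super().__init__()
            self.gru = nn.GRU(input_size=n_classes, hidden_size=hidden,
                              batch_first=True)
            self.out = nn.Linear(hidden, n_classes)

        def forward(self, base_probs):
            # base_probs: (batch, n_base_models, n_classes)
            _, h = self.gru(base_probs)
            return self.out(h[-1])  # logits over severity classes

    # Hypothetical usage with fitted scikit-learn-style base models:
    # probs_per_model = [m.predict_proba(X) for m in (xgb, rf, mlp)]
    # seq = torch.stack([torch.tensor(p, dtype=torch.float32)
    #                    for p in probs_per_model], dim=1)
    # logits = GRUCombiner(n_classes=seq.shape[-1])(seq)
    ```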