13,738 research outputs found
Syntactic error modeling and scoring normalization in speech recognition: Error modeling and scoring normalization in the speech recognition task for adult literacy training
The purpose was to develop a speech recognition system to be able to detect speech which is pronounced incorrectly, given that the text of the spoken speech is known to the recognizer. Better mechanisms are provided for using speech recognition in a literacy tutor application. Using a combination of scoring normalization techniques and cheater-mode decoding, a reasonable acceptance/rejection threshold was provided. In continuous speech, the system was tested to be able to provide above 80 pct. correct acceptance of words, while correctly rejecting over 80 pct. of incorrectly pronounced words
Recommended from our members
Privacy-Preserving iVector-Based Speaker Verification
This paper introduces an efficient algorithm to develop a privacy-preserving voice verification based on iVector and linear discriminant analysis techniques. This research considers a scenario in which users enrol their voice biometric to access different services (i.e., banking). Once enrolment is completed, users can verify themselves using their voice print instead of alphanumeric passwords. Since a voice print is unique for everyone, storing it with a third-party server raises several privacy concerns. To address this challenge, this paper proposes a novel technique based on randomization to carry out voice authentication, which allows the user to enrol and verify their voice in the randomized domain. To achieve this, the iVector-based voice verification technique has been redesigned to work on the randomized domain. The proposed algorithm is validated using a well-known speech dataset. The proposed algorithm neither compromises the authentication accuracy nor adds additional complexity due to the randomization operations
Recommended from our members
An end-to-end framework for real-time automatic sleep stage classification.
Sleep staging is a fundamental but time consuming process in any sleep laboratory. To greatly speed up sleep staging without compromising accuracy, we developed a novel framework for performing real-time automatic sleep stage classification. The client-server architecture adopted here provides an end-to-end solution for anonymizing and efficiently transporting polysomnography data from the client to the server and for receiving sleep stages in an interoperable fashion. The framework intelligently partitions the sleep staging task between the client and server in a way that multiple low-end clients can work with one server, and can be deployed both locally as well as over the cloud. The framework was tested on four datasets comprising ≈1700 polysomnography records (≈12000 hr of recordings) collected from adolescents, young, and old adults, involving healthy persons as well as those with medical conditions. We used two independent validation datasets: one comprising patients from a sleep disorders clinic and the other incorporating patients with Parkinson's disease. Using this system, an entire night's sleep was staged with an accuracy on par with expert human scorers but much faster (≈5 s compared with 30-60 min). To illustrate the utility of such real-time sleep staging, we used it to facilitate the automatic delivery of acoustic stimuli at targeted phase of slow-sleep oscillations to enhance slow-wave sleep
MISPRONUNCIATION DETECTION AND DIAGNOSIS IN MANDARIN ACCENTED ENGLISH SPEECH
This work presents the development, implementation, and evaluation of a Mispronunciation Detection and Diagnosis (MDD) system, with application to pronunciation evaluation of Mandarin-accented English speech. A comprehensive detection and diagnosis of errors in the Electromagnetic Articulography corpus of Mandarin-Accented English (EMA-MAE) was performed by using the expert phonetic transcripts and an Automatic Speech Recognition (ASR) system. Articulatory features derived from the parallel kinematic data available in the EMA-MAE corpus were used to identify the most significant articulatory error patterns seen in L2 speakers during common mispronunciations. Using both acoustic and articulatory information, an ASR based Mispronunciation Detection and Diagnosis (MDD) system was built and evaluated across different feature combinations and Deep Neural Network (DNN) architectures. The MDD system captured mispronunciation errors with a detection accuracy of 82.4%, a diagnostic accuracy of 75.8% and a false rejection rate of 17.2%. The results demonstrate the advantage of using articulatory features in revealing the significant contributors of mispronunciation as well as improving the performance of MDD systems
Very Fast Keyword Spotting System with Real Time Factor below 0.01
In the paper we present an architecture of a keyword spotting (KWS) system
that is based on modern neural networks, yields good performance on various
types of speech data and can run very fast. We focus mainly on the last aspect
and propose optimizations for all the steps required in a KWS design: signal
processing and likelihood computation, Viterbi decoding, spot candidate
detection and confidence calculation. We present time and memory efficient
modelling by bidirectional feedforward sequential memory networks (an
alternative to recurrent nets) either by standard triphones or so called
quasi-monophones, and an entirely forward decoding of speech frames (with
minimal need for look back). Several variants of the proposed scheme are
evaluated on 3 large Czech datasets (broadcast, internet and telephone, 17
hours in total) and their performance is compared by Detection Error Tradeoff
(DET) diagrams and real-time (RT) factors. We demonstrate that the complete
system can run in a single pass with a RT factor close to 0.001 if all
optimizations (including a GPU for likelihood computation) are applied.Comment: 11 pages, 3 figure
Enhancing posterior based speech recognition systems
The use of local phoneme posterior probabilities has been increasingly explored for improving speech recognition systems. Hybrid hidden Markov model / artificial neural network (HMM/ANN) and Tandem are the most successful examples of such systems. In this thesis, we present a principled framework for enhancing the estimation of local posteriors, by integrating phonetic and lexical knowledge, as well as long contextual information. This framework allows for hierarchical estimation, integration and use of local posteriors from the phoneme up to the word level. We propose two approaches for enhancing the posteriors. In the first approach, phoneme posteriors estimated with an ANN (particularly multi-layer Perceptron – MLP) are used as emission probabilities in HMM forward-backward recursions. This yields new enhanced posterior estimates integrating HMM topological constraints (encoding specific phonetic and lexical knowledge), and long context. In the second approach, a temporal context of the regular MLP posteriors is post-processed by a secondary MLP, in order to learn inter and intra dependencies among the phoneme posteriors. The learned knowledge is integrated in the posterior estimation during the inference (forward pass) of the second MLP, resulting in enhanced posteriors. The use of resulting local enhanced posteriors is investigated in a wide range of posterior based speech recognition systems (e.g. Tandem and hybrid HMM/ANN), as a replacement or in combination with the regular MLP posteriors. The enhanced posteriors consistently outperform the regular posteriors in different applications over small and large vocabulary databases
Support Vector Machine classification of strong gravitational lenses
The imminent advent of very large-scale optical sky surveys, such as Euclid
and LSST, makes it important to find efficient ways of discovering rare objects
such as strong gravitational lens systems, where a background object is
multiply gravitationally imaged by a foreground mass. As well as finding the
lens systems, it is important to reject false positives due to intrinsic
structure in galaxies, and much work is in progress with machine learning
algorithms such as neural networks in order to achieve both these aims. We
present and discuss a Support Vector Machine (SVM) algorithm which makes use of
a Gabor filterbank in order to provide learning criteria for separation of
lenses and non-lenses, and demonstrate using blind challenges that under
certain circumstances it is a particularly efficient algorithm for rejecting
false positives. We compare the SVM engine with a large-scale human examination
of 100000 simulated lenses in a challenge dataset, and also apply the SVM
method to survey images from the Kilo-Degree Survey.Comment: Accepted by MNRA
Survey on Leveraging Uncertainty Estimation Towards Trustworthy Deep Neural Networks: The Case of Reject Option and Post-training Processing
Although neural networks (especially deep neural networks) have achieved
\textit{better-than-human} performance in many fields, their real-world
deployment is still questionable due to the lack of awareness about the
limitation in their knowledge. To incorporate such awareness in the machine
learning model, prediction with reject option (also known as selective
classification or classification with abstention) has been proposed in
literature. In this paper, we present a systematic review of the prediction
with the reject option in the context of various neural networks. To the best
of our knowledge, this is the first study focusing on this aspect of neural
networks. Moreover, we discuss different novel loss functions related to the
reject option and post-training processing (if any) of network output for
generating suitable measurements for knowledge awareness of the model. Finally,
we address the application of the rejection option in reducing the prediction
time for the real-time problems and present a comprehensive summary of the
techniques related to the reject option in the context of extensive variety of
neural networks. Our code is available on GitHub:
\url{https://github.com/MehediHasanTutul/Reject_option
- …