9,116 research outputs found
Speech Enhancement Modeling Towards Robust Speech Recognition System
For about four decades, human beings have dreamed of an intelligent
machine that can master natural speech. In its simplest form, this machine
should consist of two subsystems, namely automatic speech recognition (ASR) and
speech understanding (SU). The goal of ASR is to transcribe natural speech
while SU is to understand the meaning of the transcription. Recognizing and
understanding a spoken sentence is obviously a knowledge-intensive process,
which must take into account all variable information about the speech
communication process, from acoustics to semantics and pragmatics. While
developing an automatic speech recognition system, it is observed that
adverse acoustic conditions degrade its performance. In this contribution,
a speech enhancement system is introduced for enhancing speech
signals corrupted by additive noise and improving the performance of Automatic
Speech Recognizers in noisy conditions. Automatic speech recognition
experiments show that replacing noisy speech signals by the corresponding
enhanced speech signals leads to an improvement in the recognition accuracies.
The amount of improvement varies with the type of the corrupting noise. Comment: 4 pages;
Proceedings of the International Conference on Advance Computing (ICAC-2008), India
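The abstract above does not name the enhancement algorithm; a classic baseline for removing additive noise is spectral subtraction. A minimal sketch, assuming STFT-domain processing (function and parameter names are illustrative, not from the paper):

```python
import numpy as np

def spectral_subtract(noisy_spectrum, noise_mag_est, floor=0.01):
    """Subtract an estimated noise magnitude from each spectral bin,
    flooring the result so magnitudes never go negative, and reuse
    the noisy phase (phase is hard to enhance)."""
    mag = np.abs(noisy_spectrum)
    phase = np.angle(noisy_spectrum)
    clean_mag = np.maximum(mag - noise_mag_est, floor * mag)
    return clean_mag * np.exp(1j * phase)

# Toy usage on a single frame: add noise, enhance, return to time domain.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * np.arange(64) / 8)
noisy = np.fft.rfft(clean + 0.05 * rng.standard_normal(64))
enhanced = np.fft.irfft(spectral_subtract(noisy, noise_mag_est=0.5))
```

In an ASR front end, the enhanced frames would simply replace the noisy ones before feature extraction, which is the substitution the experiments above describe.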
Writer-Aware CNN for Parsimonious HMM-Based Offline Handwritten Chinese Text Recognition
Recently, the hybrid convolutional neural network hidden Markov model
(CNN-HMM) has been introduced for offline handwritten Chinese text recognition
(HCTR) and has achieved state-of-the-art performance. However, modeling each of
the large vocabulary of Chinese characters with a uniform and fixed number of
hidden states requires high memory and computational costs and makes the tens
of thousands of HMM state classes confusing. Another key issue of CNN-HMM for
HCTR is the diversified writing style, which leads to model strain and a
significant performance decline for specific writers. To address these issues,
we propose a writer-aware CNN based on parsimonious HMM (WCNN-PHMM). First,
PHMM is designed using a data-driven state-tying algorithm to greatly reduce
the total number of HMM states, which not only yields a compact CNN by state
sharing of the same or similar radicals among different Chinese characters but
also improves the recognition accuracy due to the more accurate modeling of
tied states and the lower confusion among them. Second, WCNN integrates each
convolutional layer with one adaptive layer fed by a writer-dependent vector,
namely, the writer code, to factor out writer-specific variability that is
irrelevant to recognition, thereby improving performance. The parameters of
writer-adaptive layers are jointly optimized with other network parameters in
the training stage, while a multiple-pass decoding strategy is adopted to learn
the writer code and generate recognition results. Validated on the ICDAR 2013
competition set of the CASIA-HWDB database, the more compact WCNN-PHMM with a 7360-class
vocabulary can achieve a relative character error rate (CER) reduction of 16.6%
over the conventional CNN-HMM without considering language modeling. By
adopting a powerful hybrid language model (N-gram language model and recurrent
neural network language model), the CER of WCNN-PHMM is reduced to 3.17%.
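The writer-adaptive layer can be pictured as a small transform of the writer code whose output is combined with each convolutional layer's feature maps. A hedged numpy sketch; the additive form, shapes, and names here are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def writer_adaptive_layer(features, writer_code, W, b):
    """Hypothetical adaptive layer: map the writer-dependent vector
    (writer code) to one offset per feature channel and add it to the
    convolutional feature maps (channels-first layout)."""
    offsets = W @ writer_code + b             # shape: (channels,)
    return features + offsets[:, None, None]  # broadcast over H and W

channels = 4
rng = np.random.default_rng(1)
feats = rng.standard_normal((channels, 3, 3))  # toy conv output
code = rng.standard_normal(8)                  # writer code
W = np.zeros((channels, 8))                    # adaptation parameters,
b = np.zeros(channels)                         # zero = no adaptation
out = writer_adaptive_layer(feats, code, W, b)
```

As in the abstract, W and b would be trained jointly with the network, while the writer code itself is estimated per writer during multi-pass decoding.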
Personalized Acoustic Modeling by Weakly Supervised Multi-Task Deep Learning using Acoustic Tokens Discovered from Unlabeled Data
It is well known that recognizers personalized to each user are much more
effective than user-independent recognizers. With the popularity of smartphones
today, it is not difficult to collect a large set of audio data for each user,
but it is difficult to transcribe it. However, it is now possible to
automatically discover acoustic tokens from unlabeled personal data in an
unsupervised way. We therefore propose a multi-task deep learning framework
called a phoneme-token deep neural network (PTDNN), jointly trained from
unsupervised acoustic tokens discovered from unlabeled data and very limited
transcribed data for personalized acoustic modeling. We term this scenario
"weakly supervised". The underlying intuition is that the high degree of
similarity between the HMM states of acoustic token models and phoneme models
may help them learn from each other in this multi-task learning framework.
Initial experiments performed over a personalized audio data set recorded from
Facebook posts demonstrated substantial improvements in both frame accuracy
and word accuracy over popular baselines such as fDLR, speaker codes, and
lightly supervised adaptation. This approach complements
existing speaker adaptation approaches and can be used jointly with such
techniques to yield improved results. Comment: 5 pages, 5 figures, published in IEEE ICASSP 201
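The multi-task idea above amounts to a shared representation feeding two task-specific output heads: one over phonemes (from the limited transcribed data) and one over discovered acoustic tokens. A minimal forward-pass sketch; the layer sizes and class counts are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ptdnn_forward(x, W_shared, W_phone, W_token):
    """Shared hidden layer feeding two softmax heads; both tasks
    backpropagate into the shared weights during training."""
    h = np.tanh(x @ W_shared)                     # shared representation
    return softmax(h @ W_phone), softmax(h @ W_token)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 10))                  # 2 frames, 10-dim features
p_phone, p_token = ptdnn_forward(
    x,
    rng.standard_normal((10, 16)),                # shared layer
    rng.standard_normal((16, 40)),                # e.g. 40 phoneme classes
    rng.standard_normal((16, 64)))                # e.g. 64 discovered tokens
```

The intuition from the abstract is that, because token HMM states resemble phoneme HMM states, gradients from the abundant unsupervised token targets regularize the shared layer used by the supervised phoneme head.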
Unsupervised Discovery of Linguistic Structure Including Two-level Acoustic Patterns Using Three Cascaded Stages of Iterative Optimization
Techniques for unsupervised discovery of acoustic patterns are getting
increasingly attractive, because huge quantities of speech data are becoming
available but manual annotations remain hard to acquire. In this paper, we
propose an approach for unsupervised discovery of linguistic structure for the
target spoken language given raw speech data. This linguistic structure
includes two-level (subword-like and word-like) acoustic patterns, the lexicon
of word-like patterns in terms of subword-like patterns and the N-gram language
model based on word-like patterns. All patterns, models, and parameters can be
automatically learned from the unlabelled speech corpus. This is achieved by an
initialization step followed by three cascaded stages for acoustic, linguistic,
and lexical iterative optimization. The lexicon of word-like patterns defines
the allowed consecutive sequences of subword-like pattern HMMs. In each
iteration, model training and decoding produce updated labels from which the
lexicon and HMMs can be further updated. In this way, model parameters and
decoded labels are respectively optimized in each iteration, and the knowledge
about the linguistic structure is learned gradually layer after layer. The
proposed approach was tested in preliminary experiments on a corpus of Mandarin
broadcast news, including a task of spoken term detection with performance
compared to a parallel test using models trained in a supervised way. Results
show that the proposed system not only yields reasonable performance on its
own, but is also complementary to existing large-vocabulary ASR systems. Comment: Accepted by ICASSP 201
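The alternating scheme described above, retrain models from the current labels and then re-decode labels from the models, can be illustrated with a deliberately tiny stand-in where the "models" are cluster means and "decoding" is nearest-mean labeling. This is a toy analogue of the iteration structure, not the paper's HMM training:

```python
import numpy as np

def discover_patterns(frames, n_patterns=2, n_iter=10):
    """Toy analogue of the iterative optimization: (1) re-estimate
    pattern models (here, cluster means) from the current labels,
    then (2) re-decode labels from the models, until labels stop
    changing, i.e. the alternation has converged."""
    labels = np.arange(len(frames)) % n_patterns  # crude initialization
    for _ in range(n_iter):
        means = np.array([frames[labels == k].mean()
                          for k in range(n_patterns)])
        new = np.abs(frames[:, None] - means[None, :]).argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels

# Two well-separated "acoustic patterns" in 1-D feature space.
frames = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
labels = discover_patterns(frames)
```

In the paper, the same alternation operates at three cascaded levels (acoustic, linguistic, lexical), with HMM training and Viterbi decoding in place of the mean/nearest-mean steps.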
Probabilistic Lexical Modeling and Unsupervised Training for Zero-Resourced ASR
Standard automatic speech recognition (ASR) systems rely on transcribed speech, language models, and pronunciation dictionaries to achieve state-of-the-art performance. The unavailability of these resources prevents ASR technology from being available for many languages. In this paper, we propose a novel zero-resourced ASR approach to train acoustic models that uses only a list of probable words from the language of interest. The proposed approach is based on the Kullback-Leibler divergence based hidden Markov model (KL-HMM), grapheme subword units, knowledge of grapheme-to-phoneme mapping, and graphemic constraints derived from the word list. The approach also exploits existing acoustic and lexical resources available in other resource-rich languages. Furthermore, we propose unsupervised adaptation of the KL-HMM acoustic model parameters if untranscribed speech data in the target language are available. We demonstrate the potential of the proposed approach through a simulated study on the Greek language.
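In a KL-HMM, each state is parameterized by a categorical distribution over subword classes, and the local emission score of a frame is a KL divergence between that state distribution and the posterior vector produced by a classifier trained on resource-rich languages. A minimal sketch of one common variant of the local score (the direction of the divergence and the three-class example are illustrative):

```python
import numpy as np

def kl_local_score(state_dist, frame_posterior, eps=1e-10):
    """KL(state || frame posterior): local emission score of a
    KL-HMM state for one frame of class posteriors; lower means
    a better match between state and observation."""
    y = np.clip(state_dist, eps, 1.0)
    z = np.clip(frame_posterior, eps, 1.0)
    return float(np.sum(y * np.log(y / z)))

state = np.array([0.7, 0.2, 0.1])        # trained state distribution
matching = np.array([0.7, 0.2, 0.1])     # frame that fits the state
mismatching = np.array([0.1, 0.2, 0.7])  # frame that does not
```

Decoding then runs standard Viterbi over these local scores, which is what lets the approach reuse ordinary HMM machinery with borrowed multilingual posteriors.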
Unsupervised Discovery of Structured Acoustic Tokens with Applications to Spoken Term Detection
In this paper, we compare two paradigms for unsupervised discovery of
structured acoustic tokens directly from speech corpora without any human
annotation. The Multigranular Paradigm seeks to capture all available
information in the corpora with multiple sets of tokens for different model
granularities. The Hierarchical Paradigm attempts to jointly learn several
levels of signal representations in a hierarchical structure. The two paradigms
are unified within a theoretical framework in this paper. Query-by-Example
Spoken Term Detection (QbE-STD) experiments on the QUESST dataset of MediaEval
2015 verify the competitiveness of the acoustic tokens. The Enhanced
Relevance Score (ERS) proposed in this work improves both paradigms for the
task of QbE-STD. We also report results on the ABX evaluation task of the Zero
Resource Challenge 2015 for comparison of the two paradigms.
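QbE-STD systems like these typically match a spoken query against search utterances with dynamic time warping (DTW) over token or feature sequences; a relevance score such as the ERS then refines the raw match. A generic DTW distance sketch, using 1-D features for brevity (this is standard DTW, not the paper's ERS):

```python
import numpy as np

def dtw_distance(query, utterance):
    """Minimal dynamic time warping: cheapest monotonic alignment
    cost between two sequences, tolerating tempo differences."""
    n, m = len(query), len(utterance)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - utterance[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # skip a query step
                                 D[i, j - 1],      # skip an utterance step
                                 D[i - 1, j - 1])  # match both
    return D[n, m]
```

In a token-based system the scalar features would be replaced by token posteriors or model distances, but the alignment recursion is unchanged.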
Duration modeling with expanded HMM applied to speech recognition
The occupancy of the HMM states is modeled by means of a Markov chain. A linear estimator is introduced to compute the probabilities of the Markov chain. The resulting distribution function (DF) represents the observed duration data accurately, and representing the DF as a Markov chain allows the use of standard HMM recognizers. The increase in complexity is negligible in training and strongly limited during recognition. Experiments on acoustic-phonetic decoding show that the phone recognition rate increases from 60.6% to 61.1%. Furthermore, on a database-inquiry task, where phones are used as subword units, the correct word rate increases from 88.2% to 88.4%. Peer reviewed. Postprint (published version).
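The motivation for expanding states is that a single self-looping HMM state implies a geometric duration distribution, always peaked at duration 1, whereas a chain of substates yields a peaked negative-binomial distribution that fits phone durations better. A sketch comparing the two (the paper's linear estimator is not reproduced here; equal self-loop probabilities across substates are an assumption):

```python
import numpy as np
from math import comb

def duration_pmf_single(p_stay, max_d):
    """Duration pmf of one HMM state with self-loop probability
    p_stay: geometric, monotonically decreasing from duration 1."""
    d = np.arange(1, max_d + 1)
    return (p_stay ** (d - 1)) * (1 - p_stay)

def duration_pmf_chain(p_stay, n_sub, max_d):
    """Duration pmf after expanding the state into a left-to-right
    chain of n_sub substates with the same self-loop probability:
    negative binomial, peaked at a duration greater than 1."""
    return np.array([comb(d - 1, n_sub - 1)
                     * (1 - p_stay) ** n_sub * p_stay ** (d - n_sub)
                     if d >= n_sub else 0.0
                     for d in range(1, max_d + 1)])

single = duration_pmf_single(0.5, 50)
chain = duration_pmf_chain(0.5, 3, 50)
```

Because the expansion only adds ordinary states and transitions, a standard HMM recognizer can use it unchanged, which is the point the abstract makes about complexity.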
Multilingual Training and Cross-lingual Adaptation on CTC-based Acoustic Model
Multilingual models for Automatic Speech Recognition (ASR) are attractive as
they have been shown to benefit from more training data, and better lend
themselves to adaptation to under-resourced languages. However, initialisation
from monolingual context-dependent models leads to an explosion of
context-dependent states. Connectionist Temporal Classification (CTC) is a
potential solution to this as it performs well with monophone labels.
We investigate multilingual CTC in the context of adaptation and
regularisation techniques that have been shown to be beneficial in more
conventional contexts. The multilingual model is trained to model a universal
International Phonetic Alphabet (IPA)-based phone set using the CTC loss
function. Learning Hidden Unit Contribution (LHUC) is investigated to perform
language adaptive training. In addition, dropout during cross-lingual
adaptation is also studied and tested in order to mitigate the overfitting
problem.
Experiments show that the performance of the universal phoneme-based CTC
system can be improved by applying LHUC and it is extensible to new phonemes
during cross-lingual adaptation. Updating all the parameters shows consistent
improvement on limited data. Applying dropout during adaptation can further
improve the system and achieve competitive performance with Deep Neural Network
/ Hidden Markov Model (DNN/HMM) systems on limited data.
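LHUC adapts a trained network by learning one amplitude per hidden unit per language (or speaker), leaving the original weights fixed. A minimal sketch of the re-parameterized activation; the 2*sigmoid squashing is the standard LHUC form, while the toy values are illustrative:

```python
import numpy as np

def lhuc(hidden, r):
    """Learning Hidden Unit Contributions: rescale each hidden unit
    by a learned amplitude a(r) = 2*sigmoid(r) in (0, 2); r = 0
    gives a = 1, i.e. the unadapted network is recovered."""
    return (2.0 / (1.0 + np.exp(-r))) * hidden

h = np.array([1.0, -2.0, 0.5])                 # toy hidden activations
adapted = lhuc(h, r=np.array([0.0, 10.0, -10.0]))
```

Since only the small vector r is updated per language, LHUC needs very little adaptation data, which is why it suits the cross-lingual setting described above.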
RWTH ASR Systems for LibriSpeech: Hybrid vs Attention -- w/o Data Augmentation
We present state-of-the-art automatic speech recognition (ASR) systems
employing a standard hybrid DNN/HMM architecture compared to an attention-based
encoder-decoder design for the LibriSpeech task. Detailed descriptions of the
system development, including model design, pretraining schemes, training
schedules, and optimization approaches are provided for both system
architectures. Both hybrid DNN/HMM and attention-based systems employ
bi-directional LSTMs for acoustic modeling/encoding. For language modeling, we
employ both LSTM and Transformer based architectures. All our systems are built
using RWTH's open-source toolkits RASR and RETURNN. To the best knowledge of the
authors, the results obtained when training on the full LibriSpeech training
set are currently the best published, both for the hybrid DNN/HMM and the
attention-based systems. Our single hybrid system even outperforms previous
results obtained from combining eight single systems. Our comparison shows that
on the LibriSpeech 960h task, the hybrid DNN/HMM system outperforms the
attention-based system by 15% relative on the clean and 40% relative on the
other test sets in terms of word error rate. Moreover, experiments on a reduced
100h-subset of the LibriSpeech training corpus even show a more pronounced
margin between the hybrid DNN/HMM and attention-based architectures. Comment: Proceedings of INTERSPEECH 201
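The relative comparisons quoted above rest on word error rate: the word-level edit distance between hypothesis and reference divided by the reference length. A minimal sketch of the metric and of the relative-reduction arithmetic (the 5.0/4.25 figures below are made-up illustrations, not the paper's results):

```python
def wer(ref, hyp):
    """Word error rate: word-level Levenshtein distance divided by
    the number of reference words."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                       # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return d[len(r)][len(h)] / len(r)

def relative_reduction(wer_base, wer_new):
    """E.g. a drop from 5.0% to 4.25% WER is a 15% relative reduction."""
    return (wer_base - wer_new) / wer_base
```

"15% relative on the clean test sets" thus means the hybrid system's WER is 15% of the attention-based system's WER lower, not 15 absolute points.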