6,912 research outputs found
Anti-spoofing Methods for Automatic SpeakerVerification System
Growing interest in automatic speaker verification (ASV)systems has lead to
significant quality improvement of spoofing attackson them. Many research works
confirm that despite the low equal er-ror rate (EER) ASV systems are still
vulnerable to spoofing attacks. Inthis work we overview different acoustic
feature spaces and classifiersto determine reliable and robust countermeasures
against spoofing at-tacks. We compared several spoofing detection systems,
presented so far,on the development and evaluation datasets of the Automatic
SpeakerVerification Spoofing and Countermeasures (ASVspoof) Challenge
2015.Experimental results presented in this paper demonstrate that the useof
magnitude and phase information combination provides a substantialinput into
the efficiency of the spoofing detection systems. Also wavelet-based features
show impressive results in terms of equal error rate. Inour overview we compare
spoofing performance for systems based on dif-ferent classifiers. Comparison
results demonstrate that the linear SVMclassifier outperforms the conventional
GMM approach. However, manyresearchers inspired by the great success of deep
neural networks (DNN)approaches in the automatic speech recognition, applied
DNN in thespoofing detection task and obtained quite low EER for known and
un-known type of spoofing attacks.Comment: 12 pages, 0 figures, published in Springer Communications in Computer
and Information Science (CCIS) vol. 66
Children at risk : their phonemic awareness development in holistic instruction
Includes bibliographical references (p. 17-19
RWTH ASR Systems for LibriSpeech: Hybrid vs Attention -- w/o Data Augmentation
We present state-of-the-art automatic speech recognition (ASR) systems
employing a standard hybrid DNN/HMM architecture compared to an attention-based
encoder-decoder design for the LibriSpeech task. Detailed descriptions of the
system development, including model design, pretraining schemes, training
schedules, and optimization approaches are provided for both system
architectures. Both hybrid DNN/HMM and attention-based systems employ
bi-directional LSTMs for acoustic modeling/encoding. For language modeling, we
employ both LSTM and Transformer based architectures. All our systems are built
using RWTHs open-source toolkits RASR and RETURNN. To the best knowledge of the
authors, the results obtained when training on the full LibriSpeech training
set, are the best published currently, both for the hybrid DNN/HMM and the
attention-based systems. Our single hybrid system even outperforms previous
results obtained from combining eight single systems. Our comparison shows that
on the LibriSpeech 960h task, the hybrid DNN/HMM system outperforms the
attention-based system by 15% relative on the clean and 40% relative on the
other test sets in terms of word error rate. Moreover, experiments on a reduced
100h-subset of the LibriSpeech training corpus even show a more pronounced
margin between the hybrid DNN/HMM and attention-based architectures.Comment: Proceedings of INTERSPEECH 201
Rhythm-Flexible Voice Conversion without Parallel Data Using Cycle-GAN over Phoneme Posteriorgram Sequences
Speaking rate refers to the average number of phonemes within some unit time,
while the rhythmic patterns refer to duration distributions for realizations of
different phonemes within different phonetic structures. Both are key
components of prosody in speech, which is different for different speakers.
Models like cycle-consistent adversarial network (Cycle-GAN) and variational
auto-encoder (VAE) have been successfully applied to voice conversion tasks
without parallel data. However, due to the neural network architectures and
feature vectors chosen for these approaches, the length of the predicted
utterance has to be fixed to that of the input utterance, which limits the
flexibility in mimicking the speaking rates and rhythmic patterns for the
target speaker. On the other hand, sequence-to-sequence learning model was used
to remove the above length constraint, but parallel training data are needed.
In this paper, we propose an approach utilizing sequence-to-sequence model
trained with unsupervised Cycle-GAN to perform the transformation between the
phoneme posteriorgram sequences for different speakers. In this way, the length
constraint mentioned above is removed to offer rhythm-flexible voice conversion
without requiring parallel data. Preliminary evaluation on two datasets showed
very encouraging results.Comment: 8 pages, 6 figures, Submitted to SLT 201
Alcohol Language Corpus
The Alcohol Language Corpus (ALC) is the first publicly available speech corpus comprising intoxicated and sober speech of 162 female and male German speakers.
Recordings are done in the automotive environment to allow for the development of automatic alcohol detection and to ensure a consistent acoustic environment for the alcoholized and the sober recording. The recorded speech covers a variety of contents and speech styles. Breath and blood alcohol concentration measurements are provided for all speakers. A transcription according to SpeechDat/Verbmobil standards and disfluency tagging as well as an automatic phonetic segmentation are part of the corpus. An Emu version of ALC allows easy access to basic speech parameters as well as the us of R for statistical analysis of selected parts of ALC. ALC is available without restriction for scientific or commercial use at the Bavarian Archive for Speech Signals
PM-MMUT: Boosted Phone-Mask Data Augmentation using Multi-Modeling Unit Training for Phonetic-Reduction-Robust E2E Speech Recognition
Consonant and vowel reduction are often encountered in speech, which might
cause performance degradation in automatic speech recognition (ASR). Our
recently proposed learning strategy based on masking, Phone Masking Training
(PMT), alleviates the impact of such phenomenon in Uyghur ASR. Although PMT
achieves remarkably improvements, there still exists room for further gains due
to the granularity mismatch between the masking unit of PMT (phoneme) and the
modeling unit (word-piece). To boost the performance of PMT, we propose
multi-modeling unit training (MMUT) architecture fusion with PMT (PM-MMUT). The
idea of MMUT framework is to split the Encoder into two parts including
acoustic feature sequences to phoneme-level representation (AF-to-PLR) and
phoneme-level representation to word-piece-level representation (PLR-to-WPLR).
It allows AF-to-PLR to be optimized by an intermediate phoneme-based CTC loss
to learn the rich phoneme-level context information brought by PMT.
Experimental results on Uyghur ASR show that the proposed approaches outperform
obviously the pure PMT. We also conduct experiments on the 960-hour Librispeech
benchmark using ESPnet1, which achieves about 10% relative WER reduction on all
the test set without LM fusion comparing with the latest official ESPnet1
pre-trained model.Comment: Accepted to INTERSPEECH 202
- …