Deep neural networks in acoustic model
The student contacted me requesting an offer in order to enrol, and this offer responds to that request. After confirming with the Academic Secretary's Office that the student has been accepted at the host institution, we leave the title, description, objectives, and external tutor to be determined upon arrival. Implementation of the training of a deep neural network acoustic model for speech recognition.
Automatic speech recognition: from study to practice
Today, automatic speech recognition (ASR) is widely used for purposes as varied as robotics, multimedia, and medical and industrial applications. Although much research has been carried out in this field over the past decades, there is still considerable room for improvement. To start working in this area, thorough knowledge of ASR systems, including their weak points and open problems, is essential. Moreover, practical experience reliably deepens theoretical understanding. With this in mind, in this master's thesis we first review the standard structure of HMM-based ASR systems from a technical point of view, covering feature extraction, acoustic modeling, language modeling, and decoding. We then discuss the most significant challenges in ASR systems; these concern both internal component characteristics and external factors that affect system performance. Furthermore, we implement a Spanish-language recognizer using the HTK toolkit. Finally, based on a study of different sources in the field, two open research lines are suggested for future work.
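The front end of the HMM-based pipeline reviewed above (framing, pre-emphasis, windowing, and spectral analysis ahead of acoustic modeling) can be sketched in a few lines. The 25 ms frame / 10 ms hop and 0.97 pre-emphasis coefficient below are common defaults, not values taken from the thesis:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a waveform into overlapping, Hamming-windowed frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples at 16 kHz
    # Pre-emphasis boosts high frequencies before spectral analysis.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    window = np.hamming(frame_len)
    return np.stack([
        emphasized[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])

def power_spectrum(frames, n_fft=512):
    """Per-frame power spectrum, the input to a mel filterbank."""
    mag = np.abs(np.fft.rfft(frames, n=n_fft))
    return (mag ** 2) / n_fft

# One second of a 440 Hz tone as a stand-in for speech.
t = np.arange(16000) / 16000.0
frames = frame_signal(np.sin(2 * np.pi * 440 * t))
spec = power_spectrum(frames)
print(frames.shape, spec.shape)   # (98, 400) (98, 257)
```

In a full recognizer these power spectra would pass through a mel filterbank, a log, and a DCT to yield the cepstral features the HMMs are trained on.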
Prosodic-Enhanced Siamese Convolutional Neural Networks for Cross-Device Text-Independent Speaker Verification
In this paper, a novel cross-device text-independent speaker verification architecture is proposed. The majority of state-of-the-art deep architectures used for speaker verification rely on Mel-frequency cepstral coefficients. In contrast, our proposed Siamese convolutional neural network architecture uses Mel-frequency spectrogram coefficients to benefit from the dependency between adjacent spectro-temporal features. Moreover, although spectro-temporal features have proved highly reliable in speaker verification models, they represent only some aspects of the short-term acoustic-level traits of the speaker's voice. The human voice, however, carries several linguistic levels, such as acoustics, lexicon, prosody, and phonetics, that can be exploited in speaker verification models. To compensate for these inherent shortcomings of spectro-temporal features, we propose to enhance the Siamese convolutional neural network architecture by deploying a multilayer perceptron network that incorporates prosodic, jitter, and shimmer features. The proposed end-to-end verification architecture performs feature extraction and verification simultaneously, and shows significant improvement over classical signal processing approaches and deep algorithms for forensic cross-device speaker verification.

Comment: Accepted at the 9th IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS 2018).
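The verification step of a Siamese architecture like the one described can be illustrated with a minimal sketch: both utterances pass through the same shared-weight embedding function and are compared with a cosine score against a decision threshold. The random projection, mean pooling, and threshold below are placeholders for illustration, not the paper's trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared projection: both inputs pass through the SAME weights,
# which is what makes the network "Siamese".
W = rng.standard_normal((40, 8))  # 40-dim mel features -> 8-dim embedding

def embed(mel_frames):
    """Mean-pool an utterance's mel frames, project, and L2-normalise."""
    pooled = mel_frames.mean(axis=0)
    e = np.tanh(pooled @ W)
    return e / np.linalg.norm(e)

def verify(mel_a, mel_b, threshold=0.5):
    """Accept the pair as same-speaker if cosine similarity clears the threshold."""
    score = float(embed(mel_a) @ embed(mel_b))
    return score, score >= threshold

utt = rng.standard_normal((200, 40))               # one synthetic "utterance"
same = utt + 0.01 * rng.standard_normal(utt.shape)  # a slightly perturbed copy
score, accepted = verify(utt, same)
print(round(score, 3), accepted)
```

In the paper's architecture the embedding is a trained CNN over mel-spectrogram patches, augmented by an MLP branch over prosodic, jitter, and shimmer features; only the pairing-and-scoring structure is shown here.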
Automatic transcription of multi-genre media archives
This paper describes recent results of our collaborative work on developing a speech recognition system for the automatic transcription of media archives from the British Broadcasting Corporation (BBC). The material includes a wide diversity of shows with their associated metadata, which vary greatly in completeness, reliability, and accuracy. First, we investigate how to improve lightly supervised acoustic training when timestamp information is inaccurate and speech deviates significantly from the transcription, and how to perform evaluations when no reference transcripts are available. An automatic timestamp correction method, together with word- and segment-level combination approaches between the lightly supervised transcripts and the original programme scripts, is presented and yields improved metadata. Experimental results show that systems trained with the improved metadata consistently outperform those trained with only the original lightly supervised decoding hypotheses. Secondly, we show that the recognition task can benefit from systems trained on a combination of in-domain and out-of-domain data. Working with tandem HMMs, we describe Multi-level Adaptive Networks, a novel technique for incorporating information from out-of-domain posterior features using a deep neural network. We show that it provides a substantial reduction in WER over other systems, including a PLP-based baseline, in-domain tandem features, and the best out-of-domain tandem features.

This research was supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology). This paper was presented at the First Workshop on Speech, Language and Audio in Multimedia, August 22-23, 2013, Marseille, and published in the CEUR Workshop Proceedings at http://ceur-ws.org/Vol-1012/
Discriminatively trained features using fMPE for multi-stream audio-visual speech recognition
fMPE is a recently introduced discriminative training technique that uses the Minimum Phone Error (MPE) criterion to train a feature-level transformation. In this paper, we investigate fMPE-trained audio-visual features for multi-stream HMM-based audio-visual speech recognition. A flexible, layer-based implementation of fMPE allows us to combine the visual information with the audio stream within the discriminative training process and dispense with the multiple-stream approach. Experiments are reported on the IBM infrared headset audio-visual database. Averaged over a 20-speaker, one-hour, speaker-independent test set, the fMPE-trained acoustic features achieve a 33% relative gain. Adding video layers on top of the audio layers gives an additional 10% gain over fMPE-trained features from the audio stream alone. The fMPE-trained visual features achieve a 14% relative gain, while decision fusion of the audio and visual streams with fMPE-trained features achieves a 29% relative gain. However, fMPE-trained models do not improve over the original models on the mismatched noisy test data.
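The core of the fMPE transform is additive at the feature level: each frame becomes y_t = x_t + M h_t, where h_t is a typically sparse, high-dimensional vector of Gaussian posteriors and M is a projection trained with the MPE criterion, initialised to zero so training starts from the baseline features. A toy sketch, with the dimensions chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

n_gauss, feat_dim = 100, 39         # small stand-ins for a real system
M = np.zeros((feat_dim, n_gauss))   # fMPE projection; zero init = identity transform

def fmpe_transform(x, posteriors, M):
    """fMPE adds a trained projection of Gaussian posteriors to a feature vector."""
    return x + M @ posteriors

x = rng.standard_normal(feat_dim)               # one baseline feature frame
h = np.zeros(n_gauss)
h[[3, 17]] = [0.7, 0.3]                         # sparse posterior vector, sums to 1
y = fmpe_transform(x, h, M)
print(np.allclose(y, x))   # True: with M = 0 the transform is the identity
```

The layer-based variant the paper describes stacks such transforms, which is what lets the visual information enter as additional layers on top of the audio ones.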
Utterance verification in large vocabulary spoken language understanding system
Thesis (M.Eng.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998. Includes bibliographical references (leaves 87-89). By Huan Yao, M.Eng.