RETURNN as a Generic Flexible Neural Toolkit with Application to Translation and Speech Recognition
We present the fast training and decoding speed of RETURNN for attention
models in translation, enabled by fast CUDA LSTM kernels and a fast pure
TensorFlow beam search decoder. We show that a layer-wise pretraining scheme
for recurrent attention models gives an absolute improvement of over 1% BLEU
and allows training deeper recurrent encoder networks. Promising preliminary
results on maximum expected BLEU training are presented. We are able to train
state-of-the-art models for translation and end-to-end models for speech
recognition, and show results on WMT 2017 and Switchboard. The flexibility of
RETURNN allows a fast research feedback loop for experimenting with alternative
architectures, and its generality allows it to be used in a wide range of
applications.
Comment: accepted as demo paper at ACL 201
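The beam search decoder mentioned above keeps only the top-k partial hypotheses at each decoding step. A minimal pure-Python sketch of the idea, with a hand-made toy score function standing in for the attention model's softmax output (the probabilities and vocabulary are illustrative, not from RETURNN):

```python
import math

def step_logprobs(prefix):
    # Toy next-token log-probabilities over a tiny vocabulary; in a real
    # decoder these come from the model. End-of-sentence ("</s>") becomes
    # likely once the hypothesis has two tokens. Values are made up.
    p_end = 0.1 if len(prefix) < 2 else 0.8
    rest = 1.0 - p_end
    return {"a": math.log(rest * 0.7),
            "b": math.log(rest * 0.3),
            "</s>": math.log(p_end)}

def beam_search(beam_size=2, max_len=4):
    beams = [([], 0.0)]           # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in step_logprobs(seq).items():
                if tok == "</s>":
                    finished.append((seq, score + lp))   # hypothesis ends
                else:
                    candidates.append((seq + [tok], score + lp))
        # Keep only the beam_size best partial hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(beams)        # simplification: count leftovers as finished
    return max(finished, key=lambda c: c[1])

best_seq, best_score = beam_search()
```

With these toy probabilities, the search ends two "a" tokens before the likely end-of-sentence, since stopping earlier or later scores worse.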
Effects of Layer Freezing when Transferring DeepSpeech to New Languages
In this paper, we train Mozilla's DeepSpeech architecture on German and Swiss
German speech datasets and compare the results of different training methods.
We first train the models from scratch on both languages and then improve upon
the results by using an English pretrained version of DeepSpeech for weight
initialization and experiment with the effects of freezing different layers
during training. We see that even freezing only one layer already improves the
results dramatically.
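Layer freezing as studied here amounts to excluding a layer's parameters from the optimizer update, so the frozen layers keep their English-pretrained values. A toy sketch of the mechanism, with scalar weights and a hand-written SGD step (not DeepSpeech's actual training code):

```python
# Toy network as a list of layers, each with a single scalar "weight"
# initialized from a hypothetical pretrained checkpoint.
layers = [{"name": f"layer{i}", "weight": 1.0, "frozen": False}
          for i in range(5)]
layers[0]["frozen"] = True    # e.g. freeze the first (lowest) layer

def sgd_step(layers, grads, lr=0.1):
    for layer, g in zip(layers, grads):
        if layer["frozen"]:
            continue          # frozen weights keep their pretrained values
        layer["weight"] -= lr * g

sgd_step(layers, grads=[1.0] * 5)
```

In a framework like TensorFlow or PyTorch, the same effect is achieved by marking a layer's variables as non-trainable before building the optimizer.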
Scribosermo: fast speech-to-text models for German and other languages
Recent Speech-to-Text models often require large amounts of hardware resources and are mostly trained in English. This paper presents Speech-to-Text models for German, as well as for Spanish and French, with special features: (a) They are small and run in real-time on small devices like a Raspberry Pi. (b) Using a pretrained English model, they can be trained on consumer-grade hardware with a relatively small dataset. (c) The models are competitive with other solutions and outperform them in German. In this respect, the models combine the advantages of other approaches, which each include only a subset of the presented features. Furthermore, the paper provides a new library for handling datasets, which is focused on easy extension with additional datasets, and shows an optimized way of transfer learning to new languages using a pretrained model from another language with a similar alphabet.
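Transfer learning between languages with similar alphabets can be sketched as copying the pretrained output-layer weights for shared characters and freshly initializing only the new ones. The alphabets and weight shapes below are illustrative, not those of the actual models:

```python
import random

random.seed(0)

# Hypothetical pretrained English output layer: one small weight vector
# per output character (toy dimension 3).
en_alphabet = list("abc'")
en_weights = {ch: [random.random() for _ in range(3)] for ch in en_alphabet}

# Toy German alphabet: shares a, b, c but adds 'ä' and drops the apostrophe.
de_alphabet = list("abcä")

de_weights = {}
for ch in de_alphabet:
    if ch in en_weights:
        de_weights[ch] = list(en_weights[ch])   # transfer pretrained vector
    else:
        de_weights[ch] = [random.random() for _ in range(3)]  # fresh init
```

The rest of the network is reused unchanged, which is why only a relatively small target-language dataset is needed.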
Neural Voice Puppetry: Audio-driven Facial Reenactment
We present Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis. Given an audio sequence of a source person or digital assistant, we generate a photo-realistic output video of a target person that is in sync with the audio of the source input. This audio-driven facial reenactment is driven by a deep neural network that employs a latent 3D face model space. Through the underlying 3D representation, the model inherently learns temporal stability, while we leverage neural rendering to generate photo-realistic output frames. Our approach generalizes across different people, allowing us to synthesize videos of a target actor with the voice of any unknown source actor, or even with synthetic voices generated using standard text-to-speech approaches. Neural Voice Puppetry has a variety of use cases, including audio-driven video avatars, video dubbing, and text-driven video synthesis of a talking head. We demonstrate the capabilities of our method in a series of audio- and text-based puppetry examples. Our method is not only more general than existing works, since it is generic with respect to the input person, but it also shows superior visual and lip-sync quality compared to photo-realistic audio- and video-driven reenactment techniques.
A Comparison of Hybrid and End-to-End Models for Syllable Recognition
This paper presents a comparison of a traditional hybrid speech recognition
system (Kaldi, using WFSTs and a TDNN with lattice-free MMI) and a lexicon-free
end-to-end model (a TensorFlow implementation of a multi-layer LSTM with CTC
training) for German syllable recognition on the Verbmobil corpus. The results
show that explicitly modeling prior knowledge is still valuable in building
recognition systems. With a strong language model (LM) based on syllables, the
structured approach significantly outperforms the end-to-end model. The best
word error rate (WER) with respect to syllables was achieved using Kaldi with a
4-gram LM modeling all syllables observed in the training set: 10.0% WER,
compared to a best WER of 27.53% for the end-to-end approach. The work
presented here has implications for building future recognition systems that
operate independently of a large vocabulary, as typically needed in tasks such
as recognition of syllabic or agglutinative languages, out-of-vocabulary
techniques, keyword search indexing, and medical speech processing.
Comment: 22nd International Conference on Text, Speech and Dialogue, TSD201
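The CTC training used in the end-to-end model maps frame-level output paths to label sequences by merging repeated symbols and then removing blanks. A minimal sketch of that collapse step (the blank symbol and example string are illustrative):

```python
def ctc_collapse(path, blank="_"):
    """Collapse a frame-level CTC path into a label sequence:
    merge consecutive repeats, then drop blank symbols."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# A blank between two identical labels keeps them distinct:
ctc_collapse("__hh_ee_lll_lo__")   # -> "hello"
```

This many-to-one mapping is what lets CTC train without frame-level alignments: all paths that collapse to the reference transcript are summed over during training.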
Building a neural speech recognizer for Quranic recitations
This work is an effort towards building neural speech recognizer (NSR) systems for Quranic recitations that can be used effectively by anyone, regardless of gender and age. Despite the many recitations available online, most are recorded by professional male adult reciters, which means that an ASR system trained on such datasets would not work for female/child reciters. We address this gap by adopting a benchmark dataset of audio recordings of Quranic recitations that consists of recitations by both genders and from different ages. Using this dataset, we build several speaker-independent NSR systems based on the DeepSpeech model and use word error rate (WER) to evaluate them. The goal is to show how an NSR system trained and tuned on a dataset of a certain gender would perform on a test set from the other gender. Unfortunately, the number of female recitations in our dataset is rather small, while the number of male recitations is much larger. In the first set of experiments, we avoid the imbalance between the two genders by down-sampling the male part to match the female part. For this small subset of our dataset, the results are interesting: 0.968 WER when the system is trained on male recitations and tested on female recitations, and 0.406 WER when tested on male recitations. On the other hand, training the system on female recitations and testing it on male recitations gives 0.966 WER, while testing it on female recitations gives 0.608 WER.
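The WER figures above are the word-level Levenshtein distance between hypothesis and reference, divided by the reference length (so a WER near 1.0, as in the cross-gender results, means nearly every word is wrong). A minimal self-contained sketch of the computation:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                       # delete all remaining ref words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                       # insert all remaining hyp words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1] / len(ref)
```

For example, one substitution and one deletion against a four-word reference gives a WER of 0.5.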