XNMT: The eXtensible Neural Machine Translation Toolkit
This paper describes XNMT, the eXtensible Neural Machine Translation toolkit.
XNMT distinguishes itself from other open-source NMT toolkits by its focus on
modular code design, with the purpose of enabling fast iteration in research
and replicable, reliable results. In this paper we describe the design of XNMT
and its experiment configuration system, and demonstrate its utility on the
tasks of machine translation, speech recognition, and multi-tasked machine
translation/parsing. XNMT is available open-source at
https://github.com/neulab/xnmt
Comment: To be presented at AMTA 2018 Open Source Software Showcase
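As a rough illustration of what configuration-driven modular design buys, consider the hypothetical Python sketch below; the class and option names are our own and are not XNMT's actual API.

```python
# A hypothetical sketch of config-driven modular composition, in the spirit
# of XNMT's design goals; these classes are illustrative, NOT XNMT's API.
from dataclasses import dataclass

@dataclass
class Encoder:
    layers: int = 2
    hidden_dim: int = 512

@dataclass
class Attender:
    kind: str = "mlp"

@dataclass
class Experiment:
    encoder: Encoder
    attender: Attender
    task: str = "machine_translation"

def build_experiment(config: dict) -> Experiment:
    # Each top-level key maps onto a swappable component, so a new encoder
    # can be tried by editing the config rather than the training code.
    return Experiment(encoder=Encoder(**config.get("encoder", {})),
                      attender=Attender(**config.get("attender", {})),
                      task=config.get("task", "machine_translation"))

exp = build_experiment({"encoder": {"layers": 4}, "task": "speech_recognition"})
print(exp)
```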
Porting concepts from DNNs back to GMMs
Deep neural networks (DNNs) have been shown to outperform Gaussian Mixture Models (GMMs) on a variety of speech recognition benchmarks. In this paper we analyze the differences between the DNN and GMM modeling techniques and port the best ideas from DNN-based modeling to a GMM-based system. By going both deep (multiple layers) and wide (multiple parallel sub-models) and by sharing model parameters, we are able to close the gap between the two modeling techniques on the TIMIT database. Since the 'deep' GMMs retain the maximum-likelihood trained Gaussians as the first layer, advanced techniques such as speaker adaptation and model-based noise robustness can be readily incorporated. Despite their similarities, the DNNs and the deep GMMs still show a sufficient amount of complementarity to allow effective system combination.
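To make the "deep and wide" idea concrete, here is a toy sketch using scikit-learn mixtures on synthetic features; the exact stacking and parameter-sharing scheme is our assumption, not the paper's construction.

```python
# A toy sketch of going "deep" and "wide" with GMMs; scikit-learn mixtures on
# synthetic features stand in for the paper's acoustic models.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 13))            # stand-in for MFCC-like features

# Layer 1: maximum-likelihood trained Gaussians, kept intact so standard
# techniques (speaker adaptation, noise compensation) still apply to them.
layer1 = GaussianMixture(n_components=16, covariance_type="diag",
                         random_state=0).fit(X)

# "Deep": component posteriors from layer 1 become the input representation
# for a second mixture layer.
Z = layer1.predict_proba(X)                # shape (1000, 16)
layer2 = GaussianMixture(n_components=8, covariance_type="diag",
                         random_state=0).fit(Z)

# "Wide": several parallel sub-models whose log-likelihood scores are combined.
wide = [GaussianMixture(n_components=16, covariance_type="diag",
                        random_state=s).fit(X) for s in range(4)]
combined_score = np.mean([m.score_samples(X) for m in wide], axis=0)
print(layer2.score(Z), combined_score.mean())
```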
Understanding of Object Manipulation Actions Using Human Multi-Modal Sensory Data
Object manipulation actions represent an important share of the Activities of
Daily Living (ADLs). In this work, we study how to enable service robots to use
human multi-modal data to understand object manipulation actions, and how they
can recognize such actions when humans perform them during human-robot
collaboration tasks. The multi-modal data in this study consists of videos,
hand motion data, applied forces as represented by the pressure patterns on the
hand, and measurements of the bending of the fingers, collected as human
subjects performed manipulation actions. We investigate two different
approaches. In the first, we show that the multi-modal signal (motion, finger
bending, and hand pressure) generated by an action can be decomposed into a set
of primitives that can be seen as its building blocks. These primitives are
used to define 24 multi-modal primitive features. The primitive features can in
turn be used as an abstract representation of the multi-modal signal and
employed for action recognition. In the second approach, the visual features
are extracted from the data using a pre-trained image classification deep
convolutional neural network. The visual features are subsequently used to
train the classifier. We also investigate whether adding data from other
modalities produces a statistically significant improvement in the classifier
performance. We show that the two approaches produce comparable performance.
This implies that image-based methods can successfully recognize human actions
during human-robot collaboration. On the other hand, when the goal is to
provide training data from which the robot can learn to perform object
manipulation actions, multi-modal data provides the better alternative.
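The second approach lends itself to a short sketch: a frozen pre-trained CNN extracts visual features that are then used to train a conventional classifier. The ResNet-18 backbone and SVM below are our assumptions, as the abstract does not name the specific network or classifier.

```python
# A rough sketch of the second approach: a frozen pre-trained CNN extracts
# visual features which then train a conventional classifier. The ResNet-18
# backbone and SVM are our assumptions; the abstract does not name them.
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import SVC

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()          # drop the ImageNet classifier head
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def visual_features(frames):               # frames: list of PIL images
    batch = torch.stack([preprocess(f) for f in frames])
    return backbone(batch).numpy()         # one 512-d vector per frame

# With per-clip features and action labels gathered from the recordings:
# clf = SVC().fit(train_features, train_labels)
# predictions = clf.predict(test_features)
```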
Predicting Parameters in Deep Learning
We demonstrate that there is significant redundancy in the parameterization
of several deep learning models. Given only a few weight values for each
feature it is possible to accurately predict the remaining values. Moreover, we
show that not only can the parameter values be predicted, but many of them need
not be learned at all. We train several different architectures by learning
only a small number of weights and predicting the rest. In the best case we are
able to predict more than 95% of the weights of a network without any drop in
accuracy.
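The effect is easy to reproduce on toy data: if weights vary smoothly across a feature dimension, a handful of observed values per column suffices to predict the rest. The sketch below uses kernel ridge regression with an RBF kernel as the predictor; the toy weight matrix and all settings are our own illustration.

```python
# A toy reconstruction in the spirit of the paper: observe a few values per
# column of a smoothly varying weight matrix and predict the rest via kernel
# ridge regression (a smoothness prior). Data and settings are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 256
t = np.linspace(0.0, 1.0, n)

# Toy "weights": columns vary smoothly along the feature dimension, mimicking
# the spatial smoothness of learned filters that makes them predictable.
W = np.sin(2 * np.pi * np.outer(t, np.arange(1, 5))) @ rng.normal(size=(4, 64))

obs = np.sort(rng.choice(n, size=32, replace=False))   # ~12% of each column

def rbf(a, b, ls=0.1):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ls ** 2))

# Fit one set of kernel coefficients per column from the observed entries,
# then evaluate the predictor at every row index.
K = rbf(t[obs], t[obs]) + 1e-6 * np.eye(len(obs))
alpha = np.linalg.solve(K, W[obs])
W_hat = rbf(t, t[obs]) @ alpha

err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"observed {len(obs) / n:.1%} of entries; relative error {err:.3f}")
```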
Language Modeling with Deep Transformers
We explore deep autoregressive Transformer models in language modeling for
speech recognition. We focus on two aspects. First, we revisit Transformer
model configurations specifically for language modeling. We show that
well-configured Transformer models outperform our baseline models based on a
shallow stack of LSTM recurrent neural network layers. We carry out experiments
on the open-source LibriSpeech 960hr task, for both 200K vocabulary word-level
and 10K byte-pair encoding subword-level language modeling. We apply our
word-level models to conventional hybrid speech recognition by lattice
rescoring, and the subword-level models to attention based encoder-decoder
models by shallow fusion. Second, we show that deep Transformer language models
do not require positional encoding. Positional encoding is an essential
augmentation for the self-attention mechanism, which is otherwise invariant to
sequence ordering. However, in an autoregressive setup, as is the case for
language modeling, the amount of information increases along the position
dimension, which is itself a positional signal. The analysis of attention weights
shows that deep autoregressive self-attention models can automatically make use
of such positional information. We find that removing the positional encoding
even slightly improves the performance of these models.
Comment: To appear in the proceedings of INTERSPEECH 2019
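A minimal sketch of the second finding follows: an autoregressive Transformer language model with no positional encoding, where the causal attention mask alone carries the positional information. The model sizes are illustrative and far smaller than the deep models trained in the paper.

```python
# A minimal sketch of an autoregressive Transformer LM with NO positional
# encoding; the causal mask alone supplies the positional signal. Sizes are
# illustrative, far shallower than the paper's models.
import torch
import torch.nn as nn

class TransformerLM(nn.Module):
    def __init__(self, vocab=10000, d=512, heads=8, layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)          # note: no positions added
        block = nn.TransformerEncoderLayer(d_model=d, nhead=heads,
                                           dim_feedforward=2048,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)
        self.out = nn.Linear(d, vocab)

    def forward(self, tokens):                       # tokens: (batch, time)
        T = tokens.size(1)
        # Causal mask: position i attends only to positions <= i, so the
        # amount of visible context grows with position.
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.encoder(self.embed(tokens), mask=mask)
        return self.out(h)                           # next-token logits

lm = TransformerLM()
logits = lm(torch.randint(0, 10000, (2, 16)))        # -> (2, 16, 10000)
```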
On-Line Bayesian Speaker Adaptation By Using Tree-Structured Transformation and Robust Priors
This paper presents new results obtained with our previously proposed on-line Bayesian learning approach for affine transformation parameter estimation in speaker adaptation. The on-line Bayesian learning technique allows parameter estimates to be updated after each utterance, and it can accommodate flexible forms of transformation functions as well as prior probability density functions. We show through experimental results that heavy-tailed priors are robust to mismatch in prior density estimation. We also show that by properly choosing the transformation matrices and the depths of the hierarchical trees, recognition performance improves significantly.
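The on-line update is easiest to see in a stripped-down case: a pure bias-shift transformation with a conjugate Gaussian prior, updated after each utterance, as in the toy sketch below. The paper's actual method estimates full affine transforms with tree-structured sharing and heavy-tailed priors, none of which this sketch implements.

```python
# A toy sketch of on-line Bayesian adaptation of a bias ("shift") transform:
# after each utterance the posterior over the shift is updated and its MAP
# estimate applied. The conjugate Gaussian prior is a simplification; the
# paper advocates heavy-tailed priors for robustness to prior mismatch.
import numpy as np

rng = np.random.default_rng(0)
true_shift = np.array([0.5, -0.3, 0.2])    # the speaker's actual bias

prior_mean = np.zeros(3)                   # prior belief about the shift
prior_prec = 1.0                           # prior precision (1 / variance)
noise_prec = 4.0                           # assumed within-utterance precision

for utt in range(10):
    frames = true_shift + rng.normal(scale=0.5, size=(50, 3))
    n = len(frames)
    # Conjugate Gaussian update after this utterance; the posterior then
    # becomes the prior for the next utterance.
    post_prec = prior_prec + n * noise_prec
    prior_mean = (prior_prec * prior_mean
                  + noise_prec * frames.sum(0)) / post_prec
    prior_prec = post_prec

print("MAP shift estimate:", prior_mean.round(3))
```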