Recurrent DNNs and its Ensembles on the TIMIT Phone Recognition Task
In this paper, we investigate recurrent deep neural networks (DNNs) in combination with regularization techniques such as dropout, zoneout, and a regularization post-layer. As a benchmark, we chose the TIMIT phone recognition task due to its popularity and broad availability in the community. It also simulates a low-resource scenario, which is helpful for minor languages. We also prefer the phone recognition task because it is much more sensitive to acoustic model quality than a large vocabulary continuous speech recognition task. In recent years, recurrent DNNs have pushed down the error rates in automatic speech recognition, but there has been no clear winner among the proposed architectures. Dropout was used as the regularization technique in most cases, but its combination with other regularization techniques and with model ensembles was omitted. In our experiments, an ensemble of recurrent DNNs performed best and achieved an average phone error rate over 10 experiments of 14.84 % (minimum 14.69 %) on the core test set, which is slightly lower than the best published PER to date, to our knowledge. Finally, in contrast to most papers, we publish open-source scripts to make the results easy to replicate and to help continue the development.

Comment: Submitted to SPECOM 2018, 20th International Conference on Speech and Computer
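The abstract names zoneout and model ensembling without spelling out their mechanics. As a rough illustration only (not the authors' released scripts), zoneout can be sketched as randomly carrying hidden units over from the previous time step, and the ensemble as an average of frame-level phone posteriors. The function names, the zoneout_prob parameter, and the averaging scheme below are assumptions; Python with NumPy is used for the sketch.

    import numpy as np

    def zoneout(h_prev, h_new, zoneout_prob, training, rng=None):
        # With probability zoneout_prob, a hidden unit keeps its previous value
        # instead of taking the freshly computed one (illustrative sketch).
        rng = np.random.default_rng() if rng is None else rng
        if training:
            keep_mask = rng.random(h_prev.shape) < zoneout_prob
            return np.where(keep_mask, h_prev, h_new)
        # At test time, use the expectation of the stochastic update.
        return zoneout_prob * h_prev + (1.0 - zoneout_prob) * h_new

    def ensemble_phone_posteriors(posteriors):
        # Average frame-level phone posteriors from several acoustic models
        # (one possible way to combine an ensemble; the paper may differ).
        return np.mean(np.stack(posteriors, axis=0), axis=0)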
Twin Networks: Matching the Future for Sequence Generation
We propose a simple technique for encouraging generative RNNs to plan ahead.
We train a "backward" recurrent network to generate a given sequence in reverse
order, and we encourage states of the forward model to predict cotemporal
states of the backward model. The backward network is used only during
training, and plays no role during sampling or inference. We hypothesize that
our approach eases modeling of long-term dependencies by implicitly forcing the
forward states to hold information about the longer-term future (as contained
in the backward states). We show empirically that our approach achieves a 9% relative improvement on a speech recognition task and a significant improvement on a COCO caption generation task.

Comment: 12 pages, 3 figures, published at ICLR 2018
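As a hypothetical sketch of the penalty the abstract describes (not the authors' implementation): the forward states are mapped through a small learned function and pushed toward the cotemporal backward states, whose gradient is cut so the backward network serves only as a training-time target. The function g, the twin_weight coefficient, and the tensor layout are assumptions for illustration; PyTorch is used here.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def twin_loss(forward_states, backward_states, g, twin_weight=1.0):
        # forward_states, backward_states: (time, batch, hidden), with the
        # backward states assumed re-aligned so index t refers to the same token.
        # Detaching the backward states means this term only shapes the forward
        # network; the backward network is trained with its own generation loss.
        return twin_weight * F.mse_loss(g(forward_states), backward_states.detach())

    # Possible wiring (hypothetical names and sizes):
    # g = nn.Linear(hidden_size, hidden_size)
    # loss = forward_generation_loss + backward_generation_loss \
    #        + twin_loss(h_forward, h_backward, g)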