Accelerating recurrent neural network training using sequence bucketing and multi-GPU data parallelization
An efficient algorithm for recurrent neural network training is presented.
The approach increases training speed for tasks where the length of the input
sequence may vary significantly. The proposed approach is based on optimal
batch bucketing by input sequence length and on data parallelization across
multiple graphics processing units. The baseline training performance without
sequence bucketing is compared with the proposed solution for different numbers
of buckets. An example is given for the online handwriting recognition task using
an LSTM recurrent neural network. The evaluation is performed in terms of
wall clock time, number of epochs, and validation loss.
Comment: 4 pages, 5 figures, 2016 IEEE First International Conference on Data Stream Mining & Processing (DSMP), Lviv, 2016
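The core idea of sequence bucketing can be sketched in a few lines: sort sequences by length, split them into buckets, and pad each minibatch only to the longest sequence in its bucket rather than the longest in the dataset. A minimal sketch (the function name and the zero-padding value are illustrative, not from the paper):

```python
def bucket_batches(sequences, num_buckets, batch_size):
    """Group variable-length sequences into length-sorted buckets,
    then pad each batch only to the longest sequence in its bucket."""
    ordered = sorted(sequences, key=len)
    bucket_len = (len(ordered) + num_buckets - 1) // num_buckets
    buckets = [ordered[i:i + bucket_len]
               for i in range(0, len(ordered), bucket_len)]
    batches = []
    for bucket in buckets:
        for i in range(0, len(bucket), batch_size):
            batch = bucket[i:i + batch_size]
            max_len = max(len(s) for s in batch)
            # pad with zeros to the bucket-local maximum, not the global one
            batches.append([s + [0] * (max_len - len(s)) for s in batch])
    return batches
```

Because padding is bucket-local, far fewer wasted timesteps are processed per epoch, which is where the reported wall-clock speedup comes from.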
Deep neural networks for video classification in ecology
Analyzing large volumes of video data is a challenging and time-consuming task. Automating this process would be very valuable, especially in ecological research, where massive amounts of video can unlock new avenues of research into the behaviour of animals in their environments. Deep neural networks, particularly deep convolutional neural networks, are a powerful class of models for computer vision. When combined with recurrent neural networks, deep convolutional models can be applied to video for frame-level video classification. This research studies two datasets: penguins and seals. The purpose of the research is to compare the performance of image-only CNNs, which treat each frame of a video independently, against a combined CNN-RNN approach, and to assess whether incorporating the motion information in the temporal aspect of video improves classification accuracy on these two datasets. Video and image-only models offer similar out-of-sample performance on the simpler seals dataset, but the video model led to moderate performance improvements on the more complex penguin action recognition dataset.
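The contrast between the two model families can be illustrated with a toy NumPy sketch (not the authors' architecture): an image-only model averages per-frame scores, so it is blind to frame order, while a CNN-RNN style model folds the frame scores through a recurrence and therefore depends on temporal ordering.

```python
import numpy as np

def frame_scores(frames, w):
    """Stand-in for per-frame CNN features projected to class scores."""
    return frames @ w  # shape (T, n_classes)

def image_only_predict(frames, w):
    # average per-frame scores; motion/order information is discarded
    return frame_scores(frames, w).mean(axis=0)

def recurrent_predict(frames, w, w_h):
    """CNN-RNN style: fold frame scores through a simple recurrence,
    so the prediction depends on temporal order."""
    scores = frame_scores(frames, w)
    h = np.zeros(scores.shape[1])
    for s in scores:  # order matters here
        h = np.tanh(s + w_h @ h)
    return h
```

Reversing the frame order leaves the image-only prediction unchanged but, in general, changes the recurrent prediction, which is exactly the extra signal the penguin action dataset appears to reward.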
CUED-RNNLM - An open-source toolkit for efficient training and evaluation of recurrent neural network language models
An Empirical Study on Bidirectional Recurrent Neural Networks for Human Motion Recognition
Deep recurrent neural networks (RNNs) and their associated gated units, such as Long Short-Term Memory (LSTM), have demonstrated continued and growing success in various sequential data processing applications, especially speech recognition and language modeling. Despite this, there are few studies of deep RNN architectures and their effects when applied to other application domains. In this paper, we evaluate the different strategies available for constructing bidirectional recurrent neural networks (BRNNs) with Gated Recurrent Units (GRUs), and also investigate reservoir computing RNNs, i.e., echo state networks (ESNs), and a few other conventional machine learning techniques for skeleton-based human motion recognition. The evaluation focuses on the generalization of the different approaches, employing arbitrary untrained viewpoints combined with previously unseen subjects. Moreover, we extend the test by lowering the subsampling frame rates to examine the robustness of the employed algorithms against varying movement speeds.
Effective attention-based sequence-to-sequence modelling for automatic speech recognition
With sufficient training data, attentional encoder-decoder models have given outstanding ASR results. In such models, the encoder encodes the input sequence into a sequence of hidden representations. The attention mechanism generates a soft alignment
between the encoder hidden states and the decoder hidden states. The decoder produces the current output by considering the alignment and the previous outputs.
However, attentional encoder-decoder models were originally designed for machine
translation tasks, where the input and output sequences are relatively short and the
alignments between them are flexible. For ASR tasks, the input sequences are notably
long. Further, acoustic frames (or their hidden representations) typically can be aligned
with output units in a left-to-right order, and compared to the length of the entire utterance, the duration of each output unit is usually small. Conventional encoder-decoder
models have difficulties in modelling long sequences, and the attention mechanism
does not guarantee the monotonic left-to-right alignments.
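The unconstrained alignment described above is easy to make concrete: soft attention scores every encoder state against the current decoder state and normalises with a softmax, so nothing in the mechanism itself enforces a monotonic left-to-right alignment. A minimal dot-product sketch:

```python
import numpy as np

def soft_alignment(enc_states, dec_state):
    """Dot-product attention: one weight per encoder position,
    normalised with a softmax; nothing forces monotonicity."""
    scores = enc_states @ dec_state          # (T,)
    weights = np.exp(scores - scores.max())  # stable softmax
    weights /= weights.sum()
    context = weights @ enc_states           # weighted sum of encoder states
    return weights, context
```

Every position receives non-zero weight, which is the flexibility that suits translation but, as argued above, is a poor match for the long, monotonically aligned sequences of ASR.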
In this thesis, we study attention-based sequence-to-sequence ASR models and
address the aforementioned issues. We investigate recurrent neural network (RNN)
encoder-decoder models and self-attention encoder-decoder models. For RNN encoder-decoder models, we develop a dynamic subsampling RNN (dsRNN) encoder to shorten
the lengths of the input sequences. The dsRNN learns to skip redundant frames. Furthermore, the skip ratio may vary at different stages of training, allowing the
encoder to learn the most relevant information for each epoch. In this way, the dsRNN alleviates the difficulties of encoding long sequences. We also propose a fully trainable
windowed attention mechanism, in which both the window shift and window length
are learned by the model. Our windowed method forces the attention mechanism to
attend inputs within small sliding windows in a strict left-to-right order. The proposed
dsRNN and windowed attention give significant performance gains over traditional
encoder-decoder ASR models.
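The windowed attention idea can be sketched by masking the attention scores outside a sliding window; in the thesis both the window shift and length are learned, whereas in this illustrative NumPy sketch they are simply given as arguments:

```python
import numpy as np

def windowed_attention(enc_states, dec_state, start, length):
    """Attend only to encoder states in [start, start + length);
    positions outside the window get exactly zero weight."""
    T = len(enc_states)
    end = min(start + length, T)
    scores = np.full(T, -np.inf)
    scores[start:end] = enc_states[start:end] @ dec_state
    weights = np.exp(scores - scores[start:end].max())  # exp(-inf) -> 0
    weights /= weights.sum()
    return weights
```

Advancing `start` monotonically from one decoding step to the next is what enforces the strict left-to-right order described above.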
We next study self-attention encoder-decoder models. For RNN encoder-decoder
models, we have shown that restricting the attention within small windows is beneficial. However, self-attention encodes input sequences by comparing each element
of the sequence with all other elements of the sequence. Therefore, we investigate if
the global view of self-attention is necessary for ASR. We note that the range of the
learned context increases from the lower to the upper self-attention layers, and suggest
that the upper encoder layers may have seen sufficient contextual information without
the need for self-attention. This would imply that the upper self-attention layers can
be replaced with feed-forward layers (we can view the feed-forward layers as strict
local left-to-right self-attention). In practice, we observe that replacing upper encoder self-attention layers with feed-forward layers does not impact performance. We also
observe that there are individual attention heads that only attend local information, and
thus the self-attention mechanism is redundant for these attention heads. Based on
these observations, we propose randomly removing attention heads during training but
keeping all heads at test time. The proposed method achieves state-of-the-art ASR results
on benchmark datasets of different ASR scenarios.
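The head-removal scheme resembles dropout applied to whole attention heads. A minimal sketch under that interpretation (the rescaling convention and the guarantee of keeping at least one head are my assumptions, not details stated in the abstract):

```python
import numpy as np

def drop_heads(head_outputs, p_drop, rng, training):
    """Randomly zero entire attention heads during training and rescale
    the survivors; at test time all heads are kept unchanged."""
    if not training:
        return head_outputs
    H = head_outputs.shape[0]
    keep = rng.random(H) >= p_drop
    if not keep.any():                        # always keep at least one head
        keep[rng.integers(H)] = True
    mask = keep.astype(float) / keep.mean()   # inverted-dropout rescaling
    return head_outputs * mask[:, None]
```

As with ordinary dropout, training with randomly missing heads discourages any single head from becoming indispensable, which matches the observation that some heads attend only locally and are redundant.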
Finally, we investigate top-down level-wise training of sequence-to-sequence ASR
models. We find that when training sequence-to-sequence ASR models on noisy data,
the use of upper layers trained on clean data forces the lower layers to learn noise-invariant features, since the features which fit the clean-trained upper layers are more
general. We further show that within the same dataset, conventional joint training
makes the upper layers quickly overfit. Therefore, we propose to freeze the upper
layers and retrain the lower layers. The proposed method is a general training strategy;
we use it not only to train ASR models but also to train other neural networks in other
domains. The proposed training method yields consistent performance gains across
different tasks (e.g., language modelling, image classification).
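The freeze-and-retrain step can be illustrated on a toy two-layer model: gradients are computed for the lower layer only, so it must learn features that fit the fixed upper layer. This NumPy sketch is an illustration of the training strategy, not the thesis's ASR model (single example, squared error, plain gradient descent):

```python
import numpy as np

def retrain_lower(x, y, W_low, W_up, lr=0.02, steps=200):
    """Freeze the upper layer and update only the lower one, so the
    lower layer must learn features that fit the fixed upper layer."""
    for _ in range(steps):
        h = np.tanh(W_low @ x)
        err = W_up @ h - y
        # gradient w.r.t. W_low only; W_up stays frozen
        g_h = (W_up.T @ err) * (1 - h ** 2)
        W_low -= lr * np.outer(g_h, x)
    return W_low
```

Run on noisy inputs with a clean-trained `W_up`, this is the mechanism by which the lower layers are pushed toward noise-invariant features.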
In summary, we propose methods which enable attention-based sequence-to-sequence
ASR systems to better model sequential data, and demonstrate the benefits of training
neural networks in a top-down cascade manner.
Skip-gram Language Modeling Using Sparse Non-negative Matrix Probability Estimation
We present a novel family of language model (LM) estimation techniques named
Sparse Non-negative Matrix (SNM) estimation. A first set of experiments
empirically evaluating it on the One Billion Word Benchmark shows that SNM
n-gram LMs perform almost as well as the well-established Kneser-Ney (KN)
models. When using skip-gram features the models are able to match the
state-of-the-art recurrent neural network (RNN) LMs; combining the two modeling
techniques yields the best known result on the benchmark. The computational
advantages of SNM over both maximum entropy and RNN LM estimation are probably
its main strength, promising an approach that has the same flexibility in
combining arbitrary features effectively and yet should scale to very large
amounts of data as gracefully as n-gram LMs do.
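The skip-gram features behind the model are straightforward to enumerate: word pairs within a window, annotated with the number of skipped positions. A sketch of that feature extraction (the exact feature templates in the paper are richer; this function name and tuple layout are illustrative):

```python
def skip_gram_features(tokens, max_span=3):
    """Enumerate (left word, #skips, right word) features for every
    pair within max_span positions of each other."""
    feats = []
    for i, w in enumerate(tokens):
        for gap in range(1, max_span + 1):
            j = i + gap
            if j < len(tokens):
                feats.append((w, gap - 1, tokens[j]))
    return feats
```

Because features like these stay sparse and countable, SNM estimation can combine them with plain n-gram features while scaling in the same data-friendly way.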
Image Captioning with Recurrent Neural Networks
In this work I deal with automatic generation of image captions using multiple types of neural networks. The thesis is based on papers from the MS COCO Captioning Challenge 2015 and on character-level language models, popularized by A. Karpathy. The proposed model is a combination of a convolutional and a recurrent neural network in an encoder-decoder architecture. The vector representing the encoded image is passed to the language model as the memory values of the LSTM layers in the network. This work investigates how well a model with such a simple architecture is able to generate captions and how it compares to other contemporary solutions. One of the results is that the proposed architecture is not sufficient for the general image captioning task.
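The decoding scheme described above, with the encoded image injected as the LSTM's initial memory state, can be sketched as follows. This is a minimal NumPy illustration of that wiring, assuming a single bias-free LSTM layer and greedy character decoding; the names and dimensions are mine, not the thesis's:

```python
import numpy as np

def lstm_step(x, h, c, W):
    """One LSTM step; W maps [x; h] to the four gate pre-activations."""
    sig = lambda a: 1 / (1 + np.exp(-a))
    i, f, g, o = np.split(W @ np.concatenate([x, h]), 4)
    c = sig(f) * c + sig(i) * np.tanh(g)
    return np.tanh(c) * sig(o), c

def greedy_caption(img_vec, embed, W, W_out, max_len=10, bos=0):
    """Decode greedily with the encoded image seeding the LSTM memory."""
    d = img_vec.shape[0]
    h, c = np.zeros(d), img_vec.copy()  # image vector -> initial memory
    tok, out = bos, []
    for _ in range(max_len):
        h, c = lstm_step(embed[tok], h, c, W)
        tok = int(np.argmax(W_out @ h))  # greedy choice of next character
        out.append(tok)
    return out
```

In a trained model, decoding would stop at an end-of-sequence token rather than after a fixed `max_len`.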