DDLSTM: Dual-Domain LSTM for Cross-Dataset Action Recognition
Domain alignment in convolutional networks aims to learn the degree of
layer-specific feature alignment beneficial to the joint learning of source and
target datasets. While increasingly popular in convolutional networks, domain
alignment has not previously been attempted in recurrent networks. Similar to
spatial features, both source and target domains are
likely to exhibit temporal dependencies that can be jointly learnt and aligned.
In this paper we introduce Dual-Domain LSTM (DDLSTM), an architecture that is
able to learn temporal dependencies from two domains concurrently. It performs
cross-contaminated batch normalisation on both input-to-hidden and
hidden-to-hidden weights, and learns the parameters for cross-contamination,
for both single-layer and multi-layer LSTM architectures. We evaluate DDLSTM on
frame-level action recognition using three datasets, taking a pair at a time,
and report an average increase in accuracy of 3.5%. The proposed DDLSTM
architecture outperforms standard, fine-tuned, and batch-normalised LSTMs.
Comment: To appear in CVPR 2019
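The abstract names the mechanism but not its exact form, so the following is a
minimal PyTorch sketch under one plausible reading: per-domain batch
normalisation of the input-to-hidden and hidden-to-hidden pre-activations,
cross-mixed by a learnt weight. The class names (DualDomainBN, DDLSTMCell),
the sigmoid-bounded mixing parameter alpha, and all layer sizes are
illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DualDomainBN(nn.Module):
    """Per-domain BatchNorm whose outputs are cross-mixed by a learnt weight."""
    def __init__(self, num_features):
        super().__init__()
        self.bn_src = nn.BatchNorm1d(num_features)  # source-domain statistics
        self.bn_tgt = nn.BatchNorm1d(num_features)  # target-domain statistics
        # One learnable cross-contamination weight per domain (assumption).
        self.alpha = nn.Parameter(torch.tensor([2.0, 2.0]))

    def forward(self, x, domain):  # domain: 0 = source, 1 = target
        own, other = (self.bn_src, self.bn_tgt) if domain == 0 else (self.bn_tgt, self.bn_src)
        a = torch.sigmoid(self.alpha[domain])  # keep the mix in (0, 1)
        return a * own(x) + (1.0 - a) * other(x)

class DDLSTMCell(nn.Module):
    """LSTM step with dual-domain BN on input-to-hidden and hidden-to-hidden terms."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_ih = nn.Linear(input_size, 4 * hidden_size, bias=False)
        self.W_hh = nn.Linear(hidden_size, 4 * hidden_size, bias=False)
        self.bn_ih = DualDomainBN(4 * hidden_size)
        self.bn_hh = DualDomainBN(4 * hidden_size)

    def forward(self, x, state, domain):
        h, c = state
        gates = self.bn_ih(self.W_ih(x), domain) + self.bn_hh(self.W_hh(h), domain)
        i, f, g, o = gates.chunk(4, dim=1)  # input, forget, cell, output gates
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

# Usage: one recurrent step per domain on dummy data.
cell = DDLSTMCell(input_size=64, hidden_size=128)
x = torch.randn(16, 64)
h = c = torch.zeros(16, 128)
h, (h, c) = cell(x, (h, c), domain=0)  # source-domain step
h, (h, c) = cell(x, (h, c), domain=1)  # target-domain step
```

Bounding the mixing weight with a sigmoid keeps each domain's normalisation a
convex combination of the two domains' statistics, so a value near 1 recovers
an ordinary per-domain BN while smaller values share statistics across domains.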
A Study of All-Convolutional Encoders for Connectionist Temporal Classification
Connectionist temporal classification (CTC) is a popular sequence prediction
approach for automatic speech recognition that is typically used with models
based on recurrent neural networks (RNNs). We explore whether deep
convolutional neural networks (CNNs) can be used effectively instead of RNNs as
the "encoder" in CTC. CNNs lack an explicit representation of the entire
sequence, but have the advantage that they are much faster to train. We present
an exploration of CNNs as encoders for CTC models, in the context of
character-based (lexicon-free) automatic speech recognition. In particular, we
explore a range of one-dimensional convolutional layers, which are particularly
efficient. We compare the performance of our CNN-based models against typical
RNN-based models in terms of training time, decoding time, model size, and word
error rate (WER) on the Switchboard Eval2000 corpus. We find that our CNN-based
models are close in performance to LSTMs, while not matching them, and are much
faster to train and decode.
Comment: Accepted to ICASSP-2018
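To make the encoder comparison concrete, here is a minimal sketch of an
all-convolutional CTC model of the kind studied: a stack of one-dimensional
convolutions over acoustic frames, a per-frame character classifier, and
PyTorch's nn.CTCLoss. The layer count, kernel width, channel size, and
40-dimensional input features are illustrative assumptions, not the paper's
configuration.

```python
import torch
import torch.nn as nn

class ConvCTCEncoder(nn.Module):
    """1-D convolutional CTC encoder: frames in, per-frame character log-probs out."""
    def __init__(self, n_feats=40, n_chars=29, channels=256, layers=5):
        super().__init__()
        blocks, in_ch = [], n_feats
        for _ in range(layers):
            blocks += [nn.Conv1d(in_ch, channels, kernel_size=5, padding=2),
                       nn.BatchNorm1d(channels),
                       nn.ReLU()]
            in_ch = channels
        self.encoder = nn.Sequential(*blocks)
        self.classifier = nn.Conv1d(channels, n_chars, kernel_size=1)  # incl. CTC blank

    def forward(self, feats):                    # feats: (batch, n_feats, time)
        logits = self.classifier(self.encoder(feats))
        return logits.permute(2, 0, 1).log_softmax(-1)  # (time, batch, chars) for nn.CTCLoss

# Usage: one CTC training step on dummy data.
model = ConvCTCEncoder()
x = torch.randn(8, 40, 200)                     # 8 utterances, 40-dim features, 200 frames
targets = torch.randint(1, 29, (8, 30))         # character indices (0 reserved for blank)
log_probs = model(x)
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((8,), 200, dtype=torch.long),
                           target_lengths=torch.full((8,), 30, dtype=torch.long))
loss.backward()
```

Because every layer is a fixed-width convolution, all time steps are computed
in parallel rather than sequentially as in an RNN, which is the source of the
training- and decoding-speed advantage the abstract reports.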