19,167 research outputs found
Temporal Attention-Gated Model for Robust Sequence Classification
Typical techniques for sequence classification are designed for
well-segmented sequences which have been edited to remove noisy or irrelevant
parts. Therefore, such methods cannot be easily applied on noisy sequences
expected in real-world applications. In this paper, we present the Temporal
Attention-Gated Model (TAGM) which integrates ideas from attention models and
gated recurrent networks to better deal with noisy or unsegmented sequences.
Specifically, we extend the concept of attention model to measure the relevance
of each observation (time step) of a sequence. We then use a novel gated
recurrent network to learn the hidden representation for the final prediction.
An important advantage of our approach is interpretability since the temporal
attention weights provide a meaningful value for the salience of each time step
in the sequence. We demonstrate the merits of our TAGM approach, both for
prediction accuracy and interpretability, on three different tasks: spoken
digit recognition, text-based sentiment analysis and visual event recognition.Comment: Accepted by CVPR 201
Self-Attention Networks for Connectionist Temporal Classification in Speech Recognition
The success of self-attention in NLP has led to recent applications in
end-to-end encoder-decoder architectures for speech recognition. Separately,
connectionist temporal classification (CTC) has matured as an alignment-free,
non-autoregressive approach to sequence transduction, either by itself or in
various multitask and decoding frameworks. We propose SAN-CTC, a deep, fully
self-attentional network for CTC, and show it is tractable and competitive for
end-to-end speech recognition. SAN-CTC trains quickly and outperforms existing
CTC models and most encoder-decoder models, with character error rates (CERs)
of 4.7% in 1 day on WSJ eval92 and 2.8% in 1 week on LibriSpeech test-clean,
with a fixed architecture and one GPU. Similar improvements hold for WERs after
LM decoding. We motivate the architecture for speech, evaluate position and
downsampling approaches, and explore how label alphabets (character, phoneme,
subword) affect attention heads and performance.Comment: Accepted to ICASSP 201
Attention and Localization based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging
Audio tagging aims to perform multi-label classification on audio chunks and
it is a newly proposed task in the Detection and Classification of Acoustic
Scenes and Events 2016 (DCASE 2016) challenge. This task encourages research
efforts to better analyze and understand the content of the huge amounts of
audio data on the web. The difficulty in audio tagging is that it only has a
chunk-level label without a frame-level label. This paper presents a weakly
supervised method to not only predict the tags but also indicate the temporal
locations of the occurred acoustic events. The attention scheme is found to be
effective in identifying the important frames while ignoring the unrelated
frames. The proposed framework is a deep convolutional recurrent model with two
auxiliary modules: an attention module and a localization module. The proposed
algorithm was evaluated on the Task 4 of DCASE 2016 challenge. State-of-the-art
performance was achieved on the evaluation set with equal error rate (EER)
reduced from 0.13 to 0.11, compared with the convolutional recurrent baseline
system.Comment: 5 pages, submitted to interspeech201
Interacting Attention-gated Recurrent Networks for Recommendation
Capturing the temporal dynamics of user preferences over items is important
for recommendation. Existing methods mainly assume that all time steps in
user-item interaction history are equally relevant to recommendation, which
however does not apply in real-world scenarios where user-item interactions can
often happen accidentally. More importantly, they learn user and item dynamics
separately, thus failing to capture their joint effects on user-item
interactions. To better model user and item dynamics, we present the
Interacting Attention-gated Recurrent Network (IARN) which adopts the attention
model to measure the relevance of each time step. In particular, we propose a
novel attention scheme to learn the attention scores of user and item history
in an interacting way, thus to account for the dependencies between user and
item dynamics in shaping user-item interactions. By doing so, IARN can
selectively memorize different time steps of a user's history when predicting
her preferences over different items. Our model can therefore provide
meaningful interpretations for recommendation results, which could be further
enhanced by auxiliary features. Extensive validation on real-world datasets
shows that IARN consistently outperforms state-of-the-art methods.Comment: Accepted by ACM International Conference on Information and Knowledge
Management (CIKM), 201
Light Gated Recurrent Units for Speech Recognition
A field that has directly benefited from the recent advances in deep learning
is Automatic Speech Recognition (ASR). Despite the great achievements of the
past decades, however, a natural and robust human-machine speech interaction
still appears to be out of reach, especially in challenging environments
characterized by significant noise and reverberation. To improve robustness,
modern speech recognizers often employ acoustic models based on Recurrent
Neural Networks (RNNs), that are naturally able to exploit large time contexts
and long-term speech modulations. It is thus of great interest to continue the
study of proper techniques for improving the effectiveness of RNNs in
processing speech signals.
In this paper, we revise one of the most popular RNN models, namely Gated
Recurrent Units (GRUs), and propose a simplified architecture that turned out
to be very effective for ASR. The contribution of this work is two-fold: First,
we analyze the role played by the reset gate, showing that a significant
redundancy with the update gate occurs. As a result, we propose to remove the
former from the GRU design, leading to a more efficient and compact single-gate
model. Second, we propose to replace hyperbolic tangent with ReLU activations.
This variation couples well with batch normalization and could help the model
learn long-term dependencies without numerical issues.
Results show that the proposed architecture, called Light GRU (Li-GRU), not
only reduces the per-epoch training time by more than 30% over a standard GRU,
but also consistently improves the recognition accuracy across different tasks,
input features, noisy conditions, as well as across different ASR paradigms,
ranging from standard DNN-HMM speech recognizers to end-to-end CTC models.Comment: Copyright 2018 IEE
- …