Tensorized Self-Attention: Efficiently Modeling Pairwise and Global Dependencies Together
Neural networks equipped with self-attention have parallelizable computation,
light-weight structure, and the ability to capture both long-range and local
dependencies. Further, their expressive power and performance can be boosted by
using a vector to measure pairwise dependency, but this requires expanding the
alignment matrix to a tensor, which results in memory and computation
bottlenecks. In this paper, we propose a novel attention mechanism called
"Multi-mask Tensorized Self-Attention" (MTSA), which is as fast and as
memory-efficient as a CNN, but significantly outperforms previous
CNN-/RNN-/attention-based models. MTSA 1) captures both pairwise (token2token)
and global (source2token) dependencies by a novel compatibility function
composed of dot-product and additive attentions, 2) uses a tensor to represent
the feature-wise alignment scores for better expressive power but only requires
parallelizable matrix multiplications, and 3) combines multi-head with
multi-dimensional attentions, and applies a distinct positional mask to each
head (subspace), so the memory and computation can be distributed to multiple
heads, each with sequential information encoded independently. The experiments
show that a CNN/RNN-free model based on MTSA achieves state-of-the-art or
competitive performance on nine NLP benchmarks with compelling memory- and
time-efficiency.
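The memory point in the abstract above can be illustrated with a minimal sketch. It assumes, as a simplification, that the feature-wise score for a token pair is the sum of a scalar dot-product (token2token) term and a per-feature (source2token) term; under that assumption the softmax factorizes, so the n x n x d tensor never has to be stored. The function names, the input `g`, and the absence of masks and multiple heads are illustrative choices, not the paper's exact formulation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_tensor_attention(q, k, v, g):
    """Materialize the full n x n x d score tensor, then softmax over j."""
    n, d = len(v), len(v[0])
    scale = math.sqrt(len(q[0]))
    # scores[i][j][c] = scalar pairwise term + feature-wise per-token term
    # (assumed form of the compatibility function, for illustration only)
    scores = [[[dot(q[i], k[j]) / scale + g[j][c] for c in range(d)]
               for j in range(n)] for i in range(n)]
    out = []
    for i in range(n):
        row = []
        for c in range(d):
            weights = [math.exp(scores[i][j][c]) for j in range(n)]
            total = sum(weights)
            row.append(sum(w * v[j][c] for j, w in enumerate(weights)) / total)
        out.append(row)
    return out

def factorized_attention(q, k, v, g):
    """Same result using only an n x n and an n x d array.

    exp(s_ij + g_jc) = exp(s_ij) * exp(g_jc), so the numerator and the
    denominator are both ordinary matrix products.
    """
    n, d = len(v), len(v[0])
    scale = math.sqrt(len(q[0]))
    e = [[math.exp(dot(q[i], k[j]) / scale) for j in range(n)] for i in range(n)]
    w = [[math.exp(g[j][c]) for c in range(d)] for j in range(n)]
    out = []
    for i in range(n):
        row = []
        for c in range(d):
            num = sum(e[i][j] * w[j][c] * v[j][c] for j in range(n))
            den = sum(e[i][j] * w[j][c] for j in range(n))
            row.append(num / den)
        out.append(row)
    return out

# Toy inputs: 3 tokens, 2 features (values are arbitrary).
q = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.4]]
k = [[0.3, 0.1], [-0.2, 0.6], [0.4, 0.0]]
v = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.0]]
g = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]]

naive = naive_tensor_attention(q, k, v, g)
fast = factorized_attention(q, k, v, g)
```

Both functions return the same n x d output; the second one keeps only two-dimensional intermediates. A production version would subtract the row maximum before exponentiating for numerical stability.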
Compressed Video Action Recognition
Training robust deep video representations has proven to be much more
challenging than learning deep image representations. This is in part due to
the enormous size of raw video streams and the high temporal redundancy; the
true and interesting signal is often drowned in too much irrelevant data.
Motivated by the fact that superfluous information can be reduced by up to two
orders of magnitude by video compression (using H.264, HEVC, etc.), we propose
to train a deep network directly on the compressed video.
This representation has a higher information density, and we found the
training to be easier. In addition, the signals in a compressed video provide
free, albeit noisy, motion information. We propose novel techniques to use them
effectively. Our approach is about 4.6 times faster than Res3D and 2.7 times
faster than ResNet-152. On the task of action recognition, our approach
outperforms all the other methods on the UCF-101, HMDB-51, and Charades
datasets.
Comment: CVPR 2018 (Selected for spotlight presentation)
Multimodal Polynomial Fusion for Detecting Driver Distraction
Distracted driving is deadly, claiming 3,477 lives in the U.S. in 2015 alone.
Although there has been a considerable amount of research on modeling the
distracted behavior of drivers under various conditions, accurate automatic
detection using multiple modalities and especially the contribution of using
the speech modality to improve accuracy has received little attention. This
paper introduces a new multimodal dataset for distracted driving behavior and
discusses automatic distraction detection using features from three modalities:
facial expression, speech and car signals. Detailed multimodal feature analysis
shows that adding more modalities monotonically increases the predictive
accuracy of the model. Finally, a simple and effective multimodal fusion
technique using a polynomial fusion layer shows superior distraction detection
results compared to the baseline SVM and neural network models.
Comment: INTERSPEECH 201
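As a rough illustration of what polynomial fusion across modalities can look like, the sketch below concatenates the per-modality features and appends cross-modal product terms up to a chosen degree. This is a generic construction under stated assumptions, not the layer described in the paper: the function name `polynomial_fusion`, the `degree` parameter, and the toy feature values are all hypothetical.

```python
from itertools import combinations

def polynomial_fusion(modalities, degree=2):
    """Return first-order features plus cross-modal products up to `degree`.

    `modalities` maps a modality name to its feature vector (list of floats).
    Hypothetical helper for illustration; not the paper's fusion layer.
    """
    names = sorted(modalities)
    fused = []
    # Degree 1: plain concatenation of all modality features.
    for name in names:
        fused.extend(modalities[name])
    # Degree r: products taking one feature from each of r distinct modalities.
    for r in range(2, degree + 1):
        for group in combinations(names, r):
            terms = [1.0]
            for name in group:
                terms = [t * x for t in terms for x in modalities[name]]
            fused.extend(terms)
    return fused

# Toy features for the three modalities named in the abstract (values made up).
features = {
    "face": [0.2, 0.5],
    "speech": [1.0],
    "car": [0.3, 0.1, 0.4],
}
fused = polynomial_fusion(features, degree=2)
```

With these toy inputs the fused vector holds the 6 original features followed by 11 pairwise cross-modal products; a linear classifier on top of it can then weight interactions between modalities as well as the individual features.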