Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7
Scene-aware dialog systems will be able to have conversations with users
about the objects and events around them. Progress on such systems can be made
by integrating state-of-the-art technologies from multiple research areas
including end-to-end dialog systems, visual dialog, and video description. We
introduce the Audio Visual Scene Aware Dialog (AVSD) challenge and dataset. In
this challenge, which is one track of the 7th Dialog System Technology
Challenges (DSTC7) workshop, the task is to build a system that generates
responses in a dialog about an input video.
Stream attention-based multi-array end-to-end speech recognition
Automatic Speech Recognition (ASR) using multiple microphone arrays has
achieved great success in improving far-field robustness. Taking advantage of all the
information that each array shares and contributes is crucial in this task.
Motivated by the advances of joint Connectionist Temporal Classification
(CTC)/attention mechanism in the End-to-End (E2E) ASR, a stream attention-based
multi-array framework is proposed in this work. Microphone arrays, acting as
information streams, are activated by separate encoders and decoded under the
instruction of both CTC and attention networks. In terms of attention, a
hierarchical structure is adopted. On top of the regular attention networks,
stream attention is introduced to steer the decoder toward the most informative
encoders. Experiments have been conducted on AMI and DIRHA multi-array corpora
using the encoder-decoder architecture. Compared with the best single-array
results, the proposed framework achieves relative Word Error Rate (WER)
reductions of 3.7% and 9.7% on the two datasets, respectively, outperforming
conventional strategies as well.
Comment: Submitted to ICASSP 2019
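To make the stream-level attention concrete, the following is a minimal sketch of a second-level attention that weights per-array context vectors before the decoder consumes them; the module structure and names (e.g. StreamAttention) are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch of hierarchical (stream-level) attention over microphone arrays.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StreamAttention(nn.Module):
    """Fuse per-array context vectors with a second-level attention."""
    def __init__(self, enc_dim, dec_dim, att_dim):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, att_dim)   # project each stream's context
        self.w_dec = nn.Linear(dec_dim, att_dim)   # project the decoder state
        self.score = nn.Linear(att_dim, 1)

    def forward(self, stream_contexts, dec_state):
        # stream_contexts: (batch, n_streams, enc_dim), one context per array
        # dec_state:       (batch, dec_dim)
        e = self.score(torch.tanh(
            self.w_enc(stream_contexts) + self.w_dec(dec_state).unsqueeze(1)))
        beta = F.softmax(e, dim=1)                  # stream-level attention weights
        return (beta * stream_contexts).sum(dim=1)  # fused context for the decoder
```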
Multi-encoder multi-resolution framework for end-to-end speech recognition
Attention-based methods and Connectionist Temporal Classification (CTC)
network have been promising research directions for end-to-end Automatic Speech
Recognition (ASR). The joint CTC/Attention model has achieved great success by
utilizing both architectures during multi-task training and joint decoding. In
this work, we present a novel Multi-Encoder Multi-Resolution (MEMR) framework
based on the joint CTC/Attention model. Two heterogeneous encoders with
different architectures, temporal resolutions and separate CTC networks work in
parallel to extract complementary acoustic information. A hierarchical
attention mechanism is then used to combine the encoder-level information. To
demonstrate the effectiveness of the proposed model, experiments are conducted
on Wall Street Journal (WSJ) and CHiME-4, resulting in relative Word Error Rate
(WER) reductions of 18.0-32.1%. Moreover, the proposed MEMR model achieves 3.6%
WER on the WSJ eval92 test set, which is the best WER reported for an
end-to-end system on this benchmark.
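The joint CTC/attention training objective referenced above interpolates the two losses; below is a hedged sketch assuming each encoder keeps its own CTC network and the per-encoder CTC losses are simply averaged, with `ctc_weight` as an illustrative hyperparameter rather than the paper's exact recipe.

```python
# Minimal sketch of a multi-task joint CTC/attention objective with per-encoder CTC.
import torch
import torch.nn.functional as F

def joint_ctc_attention_loss(log_probs_per_encoder, ctc_targets, input_lens,
                             target_lens, att_logits, att_targets, ctc_weight=0.3):
    # One CTC loss per encoder; each entry of log_probs_per_encoder is (T, N, C)
    # and input_lens holds the matching per-encoder frame lengths.
    ctc_losses = [
        F.ctc_loss(lp, ctc_targets, il, target_lens, blank=0)
        for lp, il in zip(log_probs_per_encoder, input_lens)
    ]
    ctc_loss = torch.stack(ctc_losses).mean()
    # The attention decoder is trained with cross-entropy on the shared output.
    # att_logits: (N, T, C); att_targets: (N, T) with -1 marking padding.
    att_loss = F.cross_entropy(att_logits.transpose(1, 2), att_targets,
                               ignore_index=-1)
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss
```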
End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features
Dialog systems need to understand dynamic visual scenes in order to have
conversations with users about the objects and events around them. Scene-aware
dialog systems for real-world applications could be developed by integrating
state-of-the-art technologies from multiple research areas, including:
end-to-end dialog technologies, which generate system responses using models
trained from dialog data; visual question answering (VQA) technologies, which
answer questions about images using learned image features; and video
description technologies, in which descriptions/captions are generated from
videos using multimodal information. We introduce a new dataset of dialogs
about videos of human behaviors. Each dialog is a typed conversation that
consists of a sequence of 10 question-and-answer (QA) pairs between two Amazon
Mechanical Turk (AMT) workers. In total, we collected dialogs on roughly 9,000
videos. Using this new dataset for Audio Visual Scene-aware dialog (AVSD), we
trained an end-to-end conversation model that generates responses in a dialog
about a video. Our experiments demonstrate that using multimodal features that
were developed for multimodal attention-based video description enhances the
quality of generated dialog about dynamic scenes (videos). Our dataset, model
code and pretrained models will be publicly available for a new Video
Scene-Aware Dialog challenge.
Comment: A prototype system for the Audio Visual Scene-aware Dialog (AVSD) at DSTC7
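As a rough illustration of multimodal attention over video features, the sketch below attends separately to visual and audio feature sequences and concatenates the two context vectors for the response decoder; the module names and the concatenation-based fusion are assumptions made for this example, not the released model code.

```python
# Hedged sketch of multimodal attention over audio and visual feature streams.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalAttentionFusion(nn.Module):
    def __init__(self, vis_dim, aud_dim, dec_dim, att_dim):
        super().__init__()
        self.vis_att = nn.Linear(vis_dim + dec_dim, att_dim)
        self.aud_att = nn.Linear(aud_dim + dec_dim, att_dim)
        self.vis_score = nn.Linear(att_dim, 1)
        self.aud_score = nn.Linear(att_dim, 1)

    def _attend(self, feats, dec_state, att, score):
        # feats: (batch, time, dim); dec_state: (batch, dec_dim)
        d = dec_state.unsqueeze(1).expand(-1, feats.size(1), -1)
        w = F.softmax(score(torch.tanh(att(torch.cat([feats, d], dim=-1)))), dim=1)
        return (w * feats).sum(dim=1)   # one context vector per modality

    def forward(self, visual_feats, audio_feats, dec_state):
        c_v = self._attend(visual_feats, dec_state, self.vis_att, self.vis_score)
        c_a = self._attend(audio_feats, dec_state, self.aud_att, self.aud_score)
        return torch.cat([c_v, c_a], dim=-1)  # fed to the response decoder
```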
Sensor Transformation Attention Networks
Recent work on encoder-decoder models for sequence-to-sequence mapping has
shown that integrating both temporal and spatial attention mechanisms into
neural networks increases the performance of the system substantially. In this
work, we report on the application of an attentional signal not on temporal and
spatial regions of the input, but instead as a method of switching among inputs
themselves. We evaluate the particular role of attentional switching in the
presence of dynamic noise in the sensors, and demonstrate how the attentional
signal responds dynamically to changing noise levels in the environment to
achieve increased performance on both audio and visual tasks in three
commonly-used datasets: TIDIGITS, Wall Street Journal, and GRID. Moreover, the
proposed sensor transformation network architecture naturally introduces a
number of advantages that merit exploration, including ease of adding new
sensors to existing architectures, attentional interpretability, and increased
robustness in a variety of noisy environments not seen during training.
Finally, we demonstrate that the sensor selection attention mechanism of a
model trained only on the small TIDIGITS dataset can be transferred directly to
a pre-existing larger network trained on the Wall Street Journal dataset,
maintaining functionality of switching between sensors to yield a dramatic
reduction of error in the presence of noise.
Comment: 8 pages, 5 figures, 3 tables
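A minimal sketch of attention applied across sensors rather than across time or space might look as follows; the time-averaged scoring network is a simplifying assumption, not the exact sensor transformation architecture.

```python
# Hedged sketch of attention that switches among input sensors themselves.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SensorAttention(nn.Module):
    def __init__(self, feat_dim, hidden=32):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh(),
                                    nn.Linear(hidden, 1))

    def forward(self, sensor_feats):
        # sensor_feats: (batch, n_sensors, time, feat_dim)
        # Score each sensor from its time-averaged features, then softmax across
        # sensors so that noisy inputs can be down-weighted dynamically.
        scores = self.scorer(sensor_feats.mean(dim=2))        # (batch, n_sensors, 1)
        alpha = F.softmax(scores, dim=1).unsqueeze(-1)        # (batch, n_sensors, 1, 1)
        return (alpha * sensor_feats).sum(dim=1)              # (batch, time, feat_dim)
```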
Multimodal Semantic Attention Network for Video Captioning
Inspired by the fact that different modalities in videos carry complementary
information, we propose a Multimodal Semantic Attention Network (MSAN), which is
a new encoder-decoder framework incorporating multimodal semantic attributes
for video captioning. In the encoding phase, we detect and generate multimodal
semantic attributes by formulating attribute detection as a multi-label
classification problem. Moreover, we add an auxiliary classification loss so
that our model can obtain
more effective visual features and high-level multimodal semantic attribute
distributions for sufficient video encoding. In the decoding phase, we extend
each weight matrix of the conventional LSTM to an ensemble of
attribute-dependent weight matrices, and employ attention mechanism to pay
attention to different attributes at each time of the captioning process. We
evaluate our algorithm on two popular public benchmarks: MSVD and MSR-VTT,
achieving results competitive with the current state of the art across six
evaluation metrics.
Comment: 6 pages, 4 figures, accepted by the IEEE International Conference on
Multimedia and Expo (ICME) 2019
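The "ensemble of attribute-dependent weight matrices" can be illustrated with a single projection whose effective weight is an attention-weighted mixture of per-attribute matrices; extending this to every LSTM weight matrix follows the same pattern. Shapes, names, and the single-layer simplification below are assumptions.

```python
# Hedged sketch: effective weight = attention-weighted mixture of per-attribute matrices.
import torch
import torch.nn as nn

class AttributeMixedLinear(nn.Module):
    def __init__(self, n_attributes, in_dim, out_dim):
        super().__init__()
        # One weight matrix per semantic attribute.
        self.weights = nn.Parameter(torch.randn(n_attributes, out_dim, in_dim) * 0.01)

    def forward(self, x, attribute_attention):
        # x: (batch, in_dim); attribute_attention: (batch, n_attributes), sums to 1
        w_eff = torch.einsum('bk,koi->boi', attribute_attention, self.weights)
        return torch.einsum('boi,bi->bo', w_eff, x)  # attribute-conditioned projection
```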
Exploring Context, Attention and Audio Features for Audio Visual Scene-Aware Dialog
We are witnessing a confluence of vision, speech and dialog system
technologies that are enabling intelligent virtual assistants (IVAs) to learn
audio-visual groundings of utterances and have conversations with users about
the objects, activities and events surrounding them. Recent progress in visual
grounding techniques and audio understanding is enabling machines to understand
shared semantic concepts and listen to the various sensory events in the
environment. With audio and visual grounding methods, end-to-end multimodal
spoken dialog systems (SDSs) are trained to
meaningfully communicate with us in natural language about the real dynamic
audio-visual sensory world around us. In this work, we explore the role of
`topics' as the context of the conversation along with multimodal attention
into such an end-to-end audio-visual scene-aware dialog system architecture. We
also incorporate an end-to-end audio classification ConvNet, AclNet, into our
models. We develop and test our approaches on the Audio Visual Scene-Aware
Dialog (AVSD) dataset released as a part of the DSTC7. We present the analysis
of our experiments and show that some of our model variations outperform the
baseline system released for AVSD.
Comment: Presented at the Visual Question Answering and Dialog Workshop, CVPR
2019, Long Beach, USA. arXiv admin note: substantial text overlap with
arXiv:1912.1013
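One plausible way to fold a pretrained audio-classification ConvNet into such a dialog model is to use its embeddings as an additional feature stream, as sketched below; the `AudioConvNet` interface and chunk-level mean pooling are placeholders, not the actual AclNet integration.

```python
# Hedged sketch of using a pretrained audio classifier as an extra feature stream.
import torch
import torch.nn as nn

class AudioFeatureStream(nn.Module):
    def __init__(self, audio_convnet: nn.Module, embed_dim, out_dim):
        super().__init__()
        self.convnet = audio_convnet            # pretrained audio classifier (assumed
        self.proj = nn.Linear(embed_dim, out_dim)  # to map waveforms to embeddings)

    def forward(self, waveform_chunks):
        # waveform_chunks: (batch, n_chunks, samples); embed each chunk, then
        # mean-pool over chunks to get one audio vector per video.
        b, n, s = waveform_chunks.shape
        emb = self.convnet(waveform_chunks.reshape(b * n, s))   # (b*n, embed_dim)
        emb = emb.reshape(b, n, -1).mean(dim=1)                 # (batch, embed_dim)
        return self.proj(emb)                                   # extra dialog feature
```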
Progressive Attention Memory Network for Movie Story Question Answering
This paper proposes the progressive attention memory network (PAMN) for movie
story question answering (QA). Movie story QA is challenging compared to VQA in
two aspects: (1) pinpointing the temporal parts relevant to answering the
question is difficult as the movies are typically longer than an hour, and (2)
it involves both video and subtitles, where different questions require
different modalities to infer the answer. To overcome these challenges, PAMN
involves three main
features: (1) progressive attention mechanism that utilizes cues from both
question and answer to progressively prune out irrelevant temporal parts in
memory, (2) dynamic modality fusion that adaptively determines the contribution
of each modality for answering the current question, and (3) belief correction
answering scheme that successively corrects the prediction score on each
candidate answer. Experiments on publicly available benchmark datasets, MovieQA
and TVQA, demonstrate that each feature contributes to our movie story QA
architecture, PAMN, and improves performance to achieve the state-of-the-art
result. Qualitative analysis by visualizing the inference mechanism of PAMN is
also provided.
Comment: Accepted to CVPR 2019
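Of the three components, dynamic modality fusion is the easiest to sketch: a question-conditioned gate decides how much the video memory and the subtitle memory each contribute to the answer representation. The gating network below is an illustrative assumption rather than the paper's exact module.

```python
# Hedged sketch of dynamic modality fusion between video and subtitle memories.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicModalityFusion(nn.Module):
    def __init__(self, q_dim):
        super().__init__()
        self.gate = nn.Linear(q_dim, 2)  # one logit per modality (video, subtitle)

    def forward(self, question_vec, video_mem, subtitle_mem):
        # question_vec: (batch, q_dim); video_mem, subtitle_mem: (batch, mem_dim)
        w = F.softmax(self.gate(question_vec), dim=-1)           # (batch, 2)
        return w[:, :1] * video_mem + w[:, 1:] * subtitle_mem    # fused memory
```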
End-to-End Video Captioning with Multitask Reinforcement Learning
Although end-to-end (E2E) learning has led to impressive progress on a
variety of visual understanding tasks, it is often impeded by hardware
constraints (e.g., GPU memory) and is prone to overfitting. When it comes to
video captioning, one of the most challenging benchmark tasks in computer
vision, those limitations of E2E learning are especially amplified by the fact
that both the input videos and output captions are lengthy sequences. Indeed,
state-of-the-art methods for video captioning process video frames by
convolutional neural networks and generate captions by unrolling recurrent
neural networks. If we connect them in an E2E manner, the resulting model is
both memory-consuming and data-hungry, making it extremely hard to train. In
this paper, we propose a multitask reinforcement learning approach to training
an E2E video captioning model. The main idea is to mine and construct as many
effective tasks (e.g., attributes, rewards, and the captions) as possible from
the human captioned videos such that they can jointly regulate the search space
of the E2E neural network, from which an E2E video captioning model can be
found and generalized to the testing phase. To the best of our knowledge, this
is the first video captioning model that is trained end-to-end from the raw
video input to the caption output. Experimental results show that such a model
outperforms existing ones by a large margin on two benchmark video captioning
datasets.
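A simplified view of the multitask objective combines a policy-gradient caption loss (here with a self-critical greedy baseline, a common choice that may differ from the paper's exact reward design) with an auxiliary attribute-prediction loss mined from the human-captioned videos; the weighting and names are assumptions.

```python
# Hedged sketch of a multitask reinforcement-learning loss for E2E captioning.
import torch
import torch.nn.functional as F

def multitask_caption_loss(sample_logps, sample_reward, greedy_reward,
                           attr_logits, attr_targets, attr_weight=0.5):
    # Policy gradient with a greedy-decoding baseline (self-critical training).
    # sample_logps: (batch, T) log-probs of sampled caption tokens;
    # sample_reward / greedy_reward: (batch,) caption-metric rewards.
    advantage = sample_reward - greedy_reward
    rl_loss = -(advantage.detach() * sample_logps.sum(dim=1)).mean()
    # Auxiliary multi-label attribute task mined from the human captions.
    attr_loss = F.binary_cross_entropy_with_logits(attr_logits, attr_targets)
    return rl_loss + attr_weight * attr_loss
```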
Equilibrated Recurrent Neural Network: Neuronal Time-Delayed Self-Feedback Improves Accuracy and Stability
We propose a novel Equilibrated Recurrent Neural Network (ERNN) to
combat the issues of inaccuracy and instability in conventional RNNs. Drawing
upon the concept of autapse in neuroscience, we propose augmenting an RNN with
a time-delayed self-feedback loop. Our sole purpose is to modify the dynamics
of each internal RNN state and, at any time, enforce it to evolve close to the
equilibrium point associated with the input signal at that time. We show that
such self-feedback helps stabilize the hidden state transitions leading to fast
convergence during training while efficiently learning discriminative latent
features that yield state-of-the-art results on several benchmark datasets
at test-time. We propose a novel inexact Newton method to solve fixed-point
conditions given model parameters for generating the latent features at each
hidden state. We prove that our inexact Newton method converges locally with
linear rate (under mild conditions). We leverage this result for efficient
training of ERNNs based on backpropagation.
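The equilibrium idea can be illustrated with a single recurrent step that drives the hidden state toward a fixed point of a cell with time-delayed self-feedback; for simplicity this uses plain fixed-point (Picard) iteration instead of the paper's inexact Newton solver, and all weights and names are assumptions.

```python
# Hedged sketch of one ERNN-style step: solve h = tanh(W x_t + U h_prev + V h).
import numpy as np

def ernn_step(x_t, h_prev, W, U, V, n_iters=15):
    """Approximate the equilibrium hidden state for the current input."""
    drive = W @ x_t + U @ h_prev      # part of the update that stays fixed
    h = np.tanh(drive)                # initial guess without self-feedback
    for _ in range(n_iters):
        h = np.tanh(drive + V @ h)    # time-delayed self-feedback term
    return h
```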