Predicting Gaze in Egocentric Video by Learning Task-dependent Attention Transition
We present a new computational model for gaze prediction in egocentric videos
by exploring patterns in the temporal shift of gaze fixations (attention
transition) that depend on the egocentric manipulation task. Our assumption
is that the high-level context of how a task is performed strongly influences
attention transition and should be modeled for gaze prediction in natural
dynamic scenes. Specifically, we propose a hybrid model
based on deep neural networks which integrates task-dependent attention
transition with bottom-up saliency prediction. In particular, the
task-dependent attention transition is learned with a recurrent neural network
to exploit the temporal context of gaze fixations, e.g. looking at a cup after
moving gaze away from a grasped bottle. Experiments on public egocentric
activity datasets show that our model significantly outperforms
state-of-the-art gaze prediction methods and is able to learn meaningful
transitions of human attention. Comment: Accepted for oral presentation at ECCV 2018.
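To make the idea concrete, here is a minimal PyTorch sketch (not the authors' model) of the hybrid pattern the abstract describes: a bottom-up saliency branch plus a recurrent branch over past attention maps, fused into one gaze map. The module names, sizes, and the additive fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HybridGazePredictor(nn.Module):
    def __init__(self, feat_ch=64, hidden=128, map_size=32):
        super().__init__()
        # Bottom-up branch: frame pixels -> single-channel saliency map.
        self.saliency = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, 1, 1))
        # Task-dependent branch: a GRU over flattened past attention maps.
        self.rnn = nn.GRU(map_size * map_size, hidden, batch_first=True)
        self.to_map = nn.Linear(hidden, map_size * map_size)
        self.map_size = map_size

    def forward(self, frame, past_attn):
        # frame: (B, 3, H, W); past_attn: (B, T, map_size*map_size)
        bottom_up = self.saliency(frame)             # (B, 1, H, W)
        _, h = self.rnn(past_attn)                   # final state (1, B, hidden)
        trans = self.to_map(h[-1]).view(-1, 1, self.map_size, self.map_size)
        trans = nn.functional.interpolate(trans, size=bottom_up.shape[-2:])
        # Additive fusion stands in for the paper's learned combination.
        return torch.sigmoid(bottom_up + trans)

model = HybridGazePredictor()
gaze_map = model(torch.randn(2, 3, 64, 64), torch.randn(2, 5, 32 * 32))
print(gaze_map.shape)  # torch.Size([2, 1, 64, 64])
```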
Deep Learning applied to NLP
Convolutional Neural Networks (CNNs) are typically associated with Computer
Vision. CNNs are responsible for major breakthroughs in Image Classification
and are at the core of most Computer Vision systems today. More recently,
CNNs have been applied to problems in Natural Language Processing and have
produced some interesting results. In this paper, we explain the basics of
CNNs, their main variations, and how they have been applied to NLP.
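As a concrete illustration of the kind of CNN applied to NLP that the paper surveys, here is a minimal PyTorch sketch of a text CNN with parallel filter widths and max-over-time pooling; the vocabulary size, filter widths, and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab=10000, emb=128, widths=(3, 4, 5), ch=100, classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        # One 1-D convolution per filter width, applied over the time axis.
        self.convs = nn.ModuleList(nn.Conv1d(emb, ch, w) for w in widths)
        self.fc = nn.Linear(ch * len(widths), classes)

    def forward(self, tokens):                    # tokens: (B, T) integer ids
        x = self.emb(tokens).transpose(1, 2)      # (B, emb, T)
        # Max-over-time pooling keeps the strongest n-gram response per filter.
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))  # (B, classes)

model = TextCNN()
logits = model(torch.randint(0, 10000, (4, 50)))
print(logits.shape)  # torch.Size([4, 2])
```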
A deep-learning-based surrogate model for data assimilation in dynamic subsurface flow problems
A deep-learning-based surrogate model is developed and applied for predicting
dynamic subsurface flow in channelized geological models. The surrogate model
is based on deep convolutional and recurrent neural network architectures,
specifically a residual U-Net and a convolutional long short-term memory
recurrent network. Training samples entail global pressure and saturation maps,
at a series of time steps, generated by simulating oil-water flow in many (1500
in our case) realizations of a 2D channelized system. After training, the
`recurrent R-U-Net' surrogate model is shown to be capable of accurately
predicting dynamic pressure and saturation maps and well rates (e.g.,
time-varying oil and water rates at production wells) for new geological
realizations. Assessments demonstrating high surrogate-model accuracy are
presented for an individual geological realization and for an ensemble of 500
test geomodels. The surrogate model is then used for the challenging problem of
data assimilation (history matching) in a channelized system. For this study,
posterior reservoir models are generated using the randomized maximum
likelihood method, with the permeability field represented using the recently
developed CNN-PCA parameterization. The flow responses required during the data
assimilation procedure are provided by the recurrent R-U-Net. The overall
approach is shown to lead to substantial reduction in prediction uncertainty.
High-fidelity numerical simulation results for the posterior geomodels
(generated by the surrogate-based data assimilation procedure) are shown to be
in essential agreement with the recurrent R-U-Net predictions. The accuracy and
dramatic speedup provided by the surrogate model suggest that it may eventually
enable the application of more formal posterior sampling methods in realistic
problems.
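As a rough illustration of pairing a convolutional encoder-decoder with a convolutional LSTM to emit a sequence of state maps, here is a minimal PyTorch sketch; the residual U-Net is simplified to a single-scale encoder-decoder and all channel counts are assumptions, so this conveys the spirit of the recurrent R-U-Net rather than its actual architecture.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # One convolution produces all four gates at once.
        self.gates = nn.Conv2d(2 * ch, 4 * ch, 3, padding=1)

    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class RecurrentSurrogate(nn.Module):
    def __init__(self, ch=32, steps=10):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, ch, 3, stride=2, padding=1), nn.ReLU())
        self.cell = ConvLSTMCell(ch)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 2, 1))  # 2 output channels: pressure, saturation
        self.steps = steps

    def forward(self, perm):                    # perm: (B, 1, H, W) permeability
        z = self.enc(perm)
        h = torch.zeros_like(z)
        c = torch.zeros_like(z)
        maps = []
        for _ in range(self.steps):             # unroll over time steps
            h, c = self.cell(z, h, c)
            maps.append(self.dec(h))
        return torch.stack(maps, dim=1)         # (B, steps, 2, H, W)

net = RecurrentSurrogate()
out = net(torch.randn(2, 1, 64, 64))
print(out.shape)  # torch.Size([2, 10, 2, 64, 64])
```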
Information Aggregation via Dynamic Routing for Sequence Encoding
While much progress has been made in how to encode a text sequence into a
sequence of vectors, less attention has been paid to how to aggregate these
preceding vectors (the outputs of an RNN/CNN) into a fixed-size encoding
vector. Usually, a simple max or average pooling is used, which is a
bottom-up, passive form of aggregation that lacks guidance from task
information. In this paper, we propose an aggregation mechanism to obtain a
fixed-size encoding with a dynamic routing policy. The dynamic routing policy
decides what and how much information needs to be transferred from each word
to the final encoding of the text sequence. Following the work on Capsule
Networks, we
design two dynamic routing policies to aggregate the outputs of RNN/CNN
encoding layer into a final encoding vector. Compared to the other aggregation
methods, dynamic routing can refine the messages according to the state of
final encoding vector. Experimental results on five text classification tasks
show that our method outperforms other aggregation models by a significant
margin. The related source code is released on our GitHub page.
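A minimal sketch of the aggregation idea, assuming the standard Capsule Network squash nonlinearity and a fixed number of routing iterations: routing logits are iteratively refined so that words agreeing with the current encoding contribute more to the final vector.

```python
import torch

def squash(v, dim=-1, eps=1e-8):
    # Capsule-style nonlinearity: shrinks short vectors, preserves direction.
    n2 = (v * v).sum(dim=dim, keepdim=True)
    return (n2 / (1.0 + n2)) * v / torch.sqrt(n2 + eps)

def dynamic_routing(h, iters=3):
    """h: (B, T, D) word-level vectors -> (B, D) sequence encoding."""
    b = torch.zeros(h.shape[:2], device=h.device)   # routing logits (B, T)
    for _ in range(iters):
        c = torch.softmax(b, dim=1)                 # coupling coefficients
        s = (c.unsqueeze(-1) * h).sum(dim=1)        # weighted sum (B, D)
        v = squash(s)
        # Raise the logits of words that agree with the current encoding.
        b = b + (h * v.unsqueeze(1)).sum(dim=-1)
    return v

enc = dynamic_routing(torch.randn(4, 20, 256))
print(enc.shape)  # torch.Size([4, 256])
```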
Deep Co-attention based Comparators For Relative Representation Learning in Person Re-identification
Person re-identification (re-ID) requires rapid, flexible yet discriminant
representations to quickly generalize to unseen observations on-the-fly and
recognize the same identity across disjoint camera views. Recent effective
methods are developed within pair-wise similarity learning systems that
detect a fixed set of features from distinct regions, which are mapped to
vector embeddings for distance measurement. However, the most relevant and
crucial parts of each image are detected independently, without modeling
their dependency on one another. Also, these region-based methods rely on
spatial manipulation to align the local features for comparable similarity
measurement. To combat these limitations, in this paper we introduce
the Deep Co-attention based Comparators (DCCs) that fuse the co-dependent
representations of the paired images so as to focus on the relevant parts of
both images and produce their \textit{relative representations}. Given a pair
of pedestrian images to be compared, the proposed model mimics the foveation of
human eyes to detect distinct regions concurrently in both images, namely
co-dependent features, and alternately attends to relevant regions to fuse
them into the similarity learning. Our comparator is capable of producing
dynamic representations relative to a particular sample every time, and thus
well-suited to the case of re-identifying pedestrians on-the-fly. We perform
extensive experiments to provide the insights and demonstrate the effectiveness
of the proposed DCCs in person re-ID. Moreover, our approach has achieved the
state-of-the-art performance on three benchmark data sets: DukeMTMC-reID
\cite{DukeMTMC}, CUHK03 \cite{FPNN}, and Market-1501 \cite{Market1501}.
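The co-attention idea can be sketched as follows; this illustrates co-dependent attention between a pair of images, not the DCC architecture itself, and the mean-pooled fusion and sigmoid scorer are simplifying assumptions.

```python
import torch
import torch.nn as nn

class CoAttentionComparator(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def co_attend(self, qa, qb):
        # qa, qb: (B, N, D) region features of the two pedestrian images.
        affinity = torch.bmm(qa, qb.transpose(1, 2))               # (B, Na, Nb)
        # Each region of A is summarized by the regions of B it matches, and
        # vice versa: the representations are relative to the paired sample.
        a_given_b = torch.bmm(torch.softmax(affinity, dim=2), qb)  # (B, Na, D)
        b_given_a = torch.bmm(torch.softmax(affinity, dim=1).transpose(1, 2), qa)
        return a_given_b.mean(dim=1), b_given_a.mean(dim=1)

    def forward(self, qa, qb):
        ra, rb = self.co_attend(qa, qb)
        # Probability that both images show the same identity.
        return torch.sigmoid(self.score(torch.cat([ra, rb], dim=1)))

cmp = CoAttentionComparator()
p = cmp(torch.randn(8, 49, 256), torch.randn(8, 49, 256))
print(p.shape)  # torch.Size([8, 1])
```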
OrthoSeg: A Deep Multimodal Convolutional Neural Network for Semantic Segmentation of Orthoimagery
This paper addresses the task of semantic segmentation of orthoimagery using
multimodal data, e.g., optical RGB, infrared, and a digital surface model. We
propose a deep convolutional neural network architecture termed OrthoSeg for
semantic segmentation using multimodal, orthorectified and coregistered data.
We also propose a training procedure for supervised training of OrthoSeg. The
training procedure complements the inherent architectural characteristics of
OrthoSeg to prevent complex co-adaptations of learned features, which may
arise from the likely high dimensionality and spatial correlation in multimodal
and/or multispectral coregistered data. OrthoSeg consists of parallel encoding
networks for independent encoding of multimodal feature maps and a decoder
designed for efficiently fusing independently encoded multimodal feature maps.
A softmax layer at the end of the network uses the features generated by the
decoder for pixel-wise classification. The decoder fuses feature maps from the
parallel encoders locally as well as contextually at multiple scales to
generate per-pixel feature maps for final pixel-wise classification resulting
in segmented output. We experimentally show the merits of OrthoSeg by
demonstrating state-of-the-art accuracy on the ISPRS Potsdam 2D Semantic
Segmentation dataset. Adaptability is one of the key motivations behind
OrthoSeg so that it serves as a useful architectural option for a wide range of
problems involving the task of semantic segmentation of coregistered multimodal
and/or multispectral imagery. Hence, OrthoSeg is designed to enable independent
scaling of parallel encoder networks and decoder network to better match
application requirements, such as the number of input channels, the effective
field-of-view, and model capacity. Comment: 8 pages, 9 figures, 3 tables.
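A minimal PyTorch sketch of the parallel-encoder / fusing-decoder pattern the abstract describes, with one encoder per modality and concatenation-based fusion; the default channel assignments (RGB, infrared, DSM) and depths are assumptions, not the OrthoSeg configuration.

```python
import torch
import torch.nn as nn

def encoder(in_ch, ch):
    # Independent encoding of one modality at half resolution.
    return nn.Sequential(
        nn.Conv2d(in_ch, ch, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())

class MultimodalSegNet(nn.Module):
    def __init__(self, modal_channels=(3, 1, 1), ch=32, classes=6):
        # Defaults assume RGB (3), infrared (1), and DSM (1) inputs.
        super().__init__()
        self.encoders = nn.ModuleList(encoder(c, ch) for c in modal_channels)
        self.decoder = nn.Sequential(
            nn.Conv2d(ch * len(modal_channels), ch, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, classes, 1))              # per-pixel class scores

    def forward(self, modalities):
        # modalities: list of (B, C_i, H, W) coregistered inputs.
        feats = [enc(x) for enc, x in zip(self.encoders, modalities)]
        fused = torch.cat(feats, dim=1)             # fuse independent encodings
        return self.decoder(fused).softmax(dim=1)   # (B, classes, H, W)

net = MultimodalSegNet()
rgb = torch.randn(2, 3, 64, 64)
ir, dsm = torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64)
print(net([rgb, ir, dsm]).shape)  # torch.Size([2, 6, 64, 64])
```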
Visual Reference Resolution using Attention Memory for Visual Dialog
Visual dialog is the task of answering a series of inter-dependent questions
given an input image, and it often requires resolving visual references among the
questions. This problem is different from visual question answering (VQA),
which relies on spatial attention (a.k.a. visual grounding) estimated from an
image and question pair. We propose a novel attention mechanism that exploits
visual attentions in the past to resolve the current reference in the visual
dialog scenario. The proposed model is equipped with an associative attention
memory storing a sequence of previous (attention, key) pairs. From this memory,
the model retrieves the previous attention that is most relevant to the
current question, taking recency into account, in order to resolve potentially
ambiguous references. The model then merges the retrieved attention with a
tentative one to obtain the final attention for the current question;
specifically, we use dynamic parameter prediction to combine the two attentions
conditioned on the question. Through extensive experiments on a new synthetic
visual dialog dataset, we show that our model significantly outperforms the
state-of-the-art (by ~16 percentage points) in situations where visual reference
resolution plays an important role. Moreover, the proposed model achieves
superior performance (an improvement of ~2 percentage points) on the Visual
Dialog dataset, despite having significantly fewer parameters than the baselines.
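A minimal sketch of an associative attention memory, assuming a learned scalar gate in place of the paper's dynamic parameter prediction: past (attention, key) pairs are stored, the attention most relevant to the current question key is retrieved, and it is blended with a tentative attention.

```python
import torch
import torch.nn as nn

class AttentionMemory(nn.Module):
    def __init__(self, key_dim=128):
        super().__init__()
        self.gate = nn.Linear(key_dim, 1)
        self.keys, self.attns = [], []              # memory of (key, attention)

    def write(self, key, attn):
        self.keys.append(key)
        self.attns.append(attn)

    def forward(self, q_key, tentative_attn):
        # q_key: (B, key_dim); tentative_attn: (B, N) over image regions.
        keys = torch.stack(self.keys, dim=1)        # (B, M, key_dim)
        attns = torch.stack(self.attns, dim=1)      # (B, M, N)
        # Address the memory by key similarity to the current question.
        w = torch.softmax((keys * q_key.unsqueeze(1)).sum(-1), dim=1)
        retrieved = (w.unsqueeze(-1) * attns).sum(dim=1)       # (B, N)
        g = torch.sigmoid(self.gate(q_key))         # question-conditioned mix
        return g * retrieved + (1 - g) * tentative_attn

mem = AttentionMemory()
mem.write(torch.randn(2, 128), torch.softmax(torch.randn(2, 49), -1))
mem.write(torch.randn(2, 128), torch.softmax(torch.randn(2, 49), -1))
out = mem(torch.randn(2, 128), torch.softmax(torch.randn(2, 49), -1))
print(out.shape)  # torch.Size([2, 49])
```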
Discriminatively Learned Hierarchical Rank Pooling Networks
In this work, we present novel temporal encoding methods for action and
activity classification by extending the unsupervised rank pooling temporal
encoding method in two ways. First, we present "discriminative rank pooling" in
which the shared weights of our video representation and the parameters of the
action classifiers are estimated jointly for a given training dataset of
labelled vector sequences using a bilevel optimization formulation of the
learning problem. When the frame-level feature vectors are obtained from a
convolutional neural network (CNN), we rank pool the network activations and
jointly estimate all parameters of the model, including CNN filters and
fully-connected weights, in an end-to-end manner, which we term the "end-to-end
trainable rank pooled CNN". Importantly, this model can make use of any
existing convolutional neural network architecture (e.g., AlexNet or VGG)
without modification or introduction of additional parameters. Then, we extend
rank pooling to a high capacity video representation, called "hierarchical rank
pooling". Hierarchical rank pooling consists of a network of rank pooling
functions, which encode temporal semantics over arbitrarily long video clips
based on rich frame level features. By stacking non-linear feature functions
and temporal sub-sequence encoders one on top of the other, we build a high
capacity encoding network of the dynamic behaviour of the video. The resulting
video representation is a fixed-length feature vector describing the entire
video clip that can be used as input to standard machine learning classifiers.
We demonstrate our approach on the task of action and activity recognition.
The obtained results are comparable to state-of-the-art methods on three
important activity recognition benchmarks, with classification performance of
76.7% mAP on Hollywood2, 69.4% on HMDB51, and 93.6% on UCF101. Comment:
International Journal of Computer Vision.
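A minimal NumPy sketch of the underlying rank pooling operator: fit a linear function whose scores increase with frame order and use its parameter vector as the video descriptor. Here the pairwise ranking objective is approximated by ridge regression onto frame indices, a common simplification; hierarchical rank pooling would stack this operator over sub-sequences.

```python
import numpy as np

def rank_pool(frames, lam=1.0):
    """frames: (T, D) time-ordered features -> (D,) video descriptor."""
    # Time-varying means smooth the per-frame features first.
    X = np.cumsum(frames, axis=0) / np.arange(1, len(frames) + 1)[:, None]
    t = np.arange(1, len(frames) + 1, dtype=np.float64)  # ranking targets
    # Ridge regression: w = (X^T X + lam*I)^{-1} X^T t
    D = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ t)
    return w

video = np.random.randn(120, 64)   # 120 frames of 64-D features
print(rank_pool(video).shape)      # (64,)
```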
Deep Learning for Sentiment Analysis : A Survey
Deep learning has emerged as a powerful machine learning technique that
learns multiple layers of representations or features of the data and produces
state-of-the-art prediction results. Following its success in many other
application domains, deep learning has also been widely applied to sentiment
analysis in recent years. This paper first gives an overview of deep
learning and then provides a comprehensive survey of its current applications
in sentiment analysis. Comment: 34 pages, 9 figures, 2 tables.
Efficient Inferencing of Compressed Deep Neural Networks
The large number of weights in deep neural networks makes the models
difficult to deploy in low-memory environments such as mobile phones, IoT
edge devices, and "inferencing as a service" environments in the cloud. Prior
work has considered reducing the size of the models through compression
techniques such as pruning, quantization, and Huffman encoding. However,
efficient inferencing using the compressed models has received little
attention, especially with Huffman encoding in place. In this paper, we propose
efficient parallel algorithms for inferencing of single image and batches,
under various memory constraints. Our experimental results show that our
approach of using a variable batch size for inferencing achieves a 15-25\%
improvement in inference throughput for AlexNet, while maintaining memory and
latency constraints.
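A loose sketch of the two ingredients involved, assuming a toy code tree and a simple memory model that are not the paper's algorithm: decoding a Huffman-coded weight stream, and choosing the largest batch size whose activations fit a memory budget alongside the decoded weights.

```python
import numpy as np

def huffman_decode(bits, code_tree):
    """bits: string of '0'/'1'; code_tree: nested dict with leaf values."""
    out, node = [], code_tree
    for b in bits:
        node = node[b]
        if not isinstance(node, dict):   # reached a leaf: emit a weight
            out.append(node)
            node = code_tree             # restart at the root
    return np.array(out, dtype=np.float32)

def pick_batch_size(mem_budget, act_bytes_per_sample, weight_bytes):
    # Activations grow with batch size; decoded weights are a fixed cost.
    return max(1, (mem_budget - weight_bytes) // act_bytes_per_sample)

# Toy 3-symbol code: '0' -> 0.0, '10' -> 0.5, '11' -> -0.5
tree = {'0': 0.0, '1': {'0': 0.5, '1': -0.5}}
print(huffman_decode('010110', tree))                   # [ 0.   0.5 -0.5  0. ]
print(pick_batch_size(10_000_000, 200_000, 3_000_000))  # 35
```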