Explicitly Modeled Attention Maps for Image Classification
Self-attention networks have shown remarkable progress in computer vision
tasks such as image classification. The main benefit of the self-attention
mechanism is the ability to capture long-range feature interactions in
attention-maps. However, the computation of attention-maps requires a learnable
key, query, and positional encoding, whose usage is often unintuitive and
computationally expensive. To mitigate this problem, we propose a novel
self-attention module with explicitly modeled attention-maps using only a
single learnable parameter for low computational overhead. The design of
explicitly modeled attention-maps using a geometric prior is based on the
observation that the spatial context for a given pixel within an image is
mostly dominated by its neighbors, while more distant pixels have a minor
contribution. Concretely, the attention-maps are parametrized via simple
functions (e.g., Gaussian kernel) with a learnable radius, which is modeled
independently of the input content. Our evaluation shows that our method
achieves an accuracy improvement of up to 2.2% over the ResNet baselines on
ImageNet ILSVRC and outperforms other self-attention methods such as
AA-ResNet152 in accuracy by 0.9% with 6.4% fewer parameters and 6.7% fewer
GFLOPs. This result empirically indicates the value of incorporating a geometric
prior into the self-attention mechanism when applied to image classification.
Comment: Accepted by AAAI202
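As a rough illustration of the idea in this abstract, the sketch below builds a content-independent attention map from a Gaussian kernel over pixel distances. It is a minimal sketch under stated assumptions, not the authors' module: the function names `gaussian_attention_map` and `apply_attention` are hypothetical, `sigma` stands in for the learnable radius but is a fixed float here, and the row normalization is an assumption.

```python
import math


def gaussian_attention_map(height, width, sigma):
    """Build an (H*W) x (H*W) attention map from pixel distances only.

    Entry [i][j] is the weight of pixel j when aggregating context for
    pixel i. No keys, queries, or input content are involved; `sigma`
    plays the role of the single radius parameter (fixed here, assumed
    to be learned during training in a real model).
    """
    coords = [(r, c) for r in range(height) for c in range(width)]
    attention = []
    for ri, ci in coords:
        weights = [
            math.exp(-((ri - rj) ** 2 + (ci - cj) ** 2) / (2.0 * sigma ** 2))
            for rj, cj in coords
        ]
        total = sum(weights)
        # Row normalization (an assumption) so each row sums to one.
        attention.append([w / total for w in weights])
    return attention


def apply_attention(attention, features):
    """Aggregate one scalar feature per pixel with the attention map."""
    return [sum(w * f for w, f in zip(row, features)) for row in attention]
```

With a small `sigma` each pixel attends almost only to itself and its nearest neighbors; a larger `sigma` spreads the weight toward more distant pixels.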
Latent Embeddings for Collective Activity Recognition
Rather than simply recognizing the action of each person individually,
collective activity recognition aims to determine what a group of people is
doing in a collective scene. Previous state-of-the-art methods use
hand-crafted potentials in conventional graphical models, which can only define a
limited range of relations. Thus, the complex structural dependencies among
individuals involved in a collective scenario cannot be fully modeled. In
this paper, we overcome these limitations by embedding latent variables into
feature space and learning the feature mapping functions in a deep learning
framework. The embeddings of latent variables build a global relation
containing person-group interactions and richer contextual information by
jointly modeling a broader range of individuals. In addition, we incorporate an
attention mechanism during embedding to achieve more compact representations. We
evaluate our method on three collective activity datasets, where we
contribute a much larger dataset in this work. The proposed model achieves
clearly better performance than the state-of-the-art methods in our
experiments.
Comment: 6 pages, accepted by IEEE-AVSS201
Multiscale Discriminant Saliency for Visual Attention
Bottom-up saliency, an early stage of human visual attention, can be
considered as a binary classification problem between center and surround
classes. The discriminant power of features for the classification is measured
as the mutual information between the features and the distribution of the two
classes. The estimated discrepancy between the two feature classes depends
strongly on the scale levels considered; therefore, multi-scale structure and
discriminant power are integrated by employing discrete wavelet features and a
hidden Markov tree (HMT). With wavelet coefficients and hidden Markov tree
parameters, quad-tree-like label structures are constructed and used in maximum
a posteriori (MAP) estimation of the hidden class variables at the
corresponding dyadic sub-squares. Then, the saliency value for each dyadic
square at each scale level is computed with the discriminant power principle
and the MAP. Finally, the final saliency map is integrated across multiple
scales by an information maximization rule. Both standard quantitative tools
such as NSS, LCC, and AUC and qualitative assessments are used to evaluate
the proposed multiscale discriminant saliency method (MDIS) against the
well-known information-based saliency method AIM on its Bruce Database with
eye-tracking data. Simulation results are presented and analyzed to verify the
validity of MDIS as well as to point out its disadvantages for further research
directions.
Comment: 16 pages, ICCSA 2013 - BIOCA sessio
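The discriminant-power measure mentioned in this abstract can be illustrated with a plug-in estimate of mutual information between a discretized feature and a binary center/surround label. This is only a sketch under stated assumptions: the abstract's method works with wavelet features and a hidden Markov tree, whereas the hypothetical `mutual_information` helper below simply counts co-occurrences in a small sample.

```python
import math
from collections import Counter


def mutual_information(features, labels):
    """Plug-in estimate of I(feature; label) in bits.

    `features` are discretized feature values and `labels` are class
    labels (for example "center" or "surround"), one per sample.
    """
    if len(features) != len(labels) or not features:
        raise ValueError("features and labels must be non-empty and aligned")
    n = len(features)
    joint = Counter(zip(features, labels))
    feature_counts = Counter(features)
    label_counts = Counter(labels)
    mi = 0.0
    for (f, l), count in joint.items():
        p_joint = count / n
        p_independent = (feature_counts[f] / n) * (label_counts[l] / n)
        mi += p_joint * math.log2(p_joint / p_independent)
    return mi
```

A feature that separates the two classes perfectly yields one bit for balanced classes, while a feature that is independent of the label yields zero.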
Predicting Human Interaction via Relative Attention Model
Predicting human interaction is challenging as the on-going activity has to
be inferred based on a partially observed video. Essentially, a good algorithm
should effectively model the mutual influence between the two interacting
subjects. Also, only a small region in the scene is discriminative for
identifying the on-going interaction. In this work, we propose a relative
attention model to explicitly address these difficulties. Built on a
tri-coupled deep recurrent structure representing both interacting subjects and
global interaction status, the proposed network collects spatio-temporal
information from each subject, rectified with global interaction information,
yielding an effective interaction representation. Moreover, the proposed network
also incorporates an attention module to assign higher importance to the regions
that are relevant to the on-going action. Extensive experiments have been
conducted on two public datasets, and the results demonstrate that the proposed
relative attention network successfully predicts informative regions between
interacting subjects, which in turn yields superior human interaction
prediction accuracy.
Comment: To appear in IJCAI 201
Enhancing hyperspectral image unmixing with spatial correlations
This paper describes a new algorithm for hyperspectral image unmixing. Most
of the unmixing algorithms proposed in the literature do not take into account
the possible spatial correlations between the pixels. In this work, a Bayesian
model is introduced to exploit these correlations. The image to be unmixed is
assumed to be partitioned into regions (or classes) where the statistical
properties of the abundance coefficients are homogeneous. A Markov random field
is then proposed to model the spatial dependency of the pixels within any
class. Conditionally upon a given class, each pixel is modeled by using the
classical linear mixing model with additive white Gaussian noise. This strategy
is investigated for the well-known linear mixing model. For this model, the
posterior distributions of the unknown parameters and hyperparameters allow
one to infer the parameters of interest. These parameters include the
abundances for each pixel, the means and variances of the abundances for each
class, as well as a classification map indicating the classes of all pixels in
the image. To overcome the complexity of the posterior distribution of
interest, we consider Markov chain Monte Carlo methods that generate samples
distributed according to the posterior of interest. The generated samples are
then used for parameter and hyperparameter estimation. The accuracy of the
proposed algorithms is illustrated on synthetic and real data.
Comment: Manuscript accepted for publication in IEEE Trans. Geoscience and
Remote Sensin
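The linear mixing model with additive white Gaussian noise named in this abstract can be sketched as a forward simulation of one pixel. This is a minimal sketch, not the paper's Bayesian sampler: the helper name `mix_pixel` is hypothetical, and the non-negativity and sum-to-one constraints on the abundances are the usual convention for this model, assumed here rather than taken from the abstract.

```python
import random


def mix_pixel(endmembers, abundances, noise_std, rng):
    """Simulate one pixel spectrum under the linear mixing model.

    endmembers: list of R spectra, each a list of L band values.
    abundances: list of R mixing coefficients (assumed non-negative
                and summing to one).
    noise_std:  standard deviation of the additive white Gaussian noise.
    rng:        a random.Random instance, passed in for reproducibility.
    """
    if len(endmembers) != len(abundances):
        raise ValueError("need one abundance per endmember")
    if any(a < 0 for a in abundances) or abs(sum(abundances) - 1.0) > 1e-9:
        raise ValueError("abundances must be non-negative and sum to one")
    n_bands = len(endmembers[0])
    pixel = []
    for band in range(n_bands):
        # Noise-free mixture: abundance-weighted sum of endmember values.
        clean = sum(a * spectrum[band]
                    for a, spectrum in zip(abundances, endmembers))
        pixel.append(clean + rng.gauss(0.0, noise_std))
    return pixel
```

Unmixing is the inverse problem: given observed pixels, estimate the abundances, which the abstract addresses with Markov chain Monte Carlo sampling from the posterior.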