184,511 research outputs found
Single Image Action Recognition by Predicting Space-Time Saliency
We propose a novel approach based on deep Convolutional Neural Networks (CNN)
to recognize human actions in still images by predicting the future motion, and
detecting the shape and location of the salient parts of the image. We make the
following major contributions to this important area of research: (i) We use
the predicted future motion in the static image (Walker et al., 2015) as a
means of compensating for the missing temporal information, while using the
saliency map to represent the the spatial information in the form of location
and shape of what is predicted as significant. (ii) We cast action
classification in static images as a domain adaptation problem by transfer
learning. We first map the input static image to a new domain that we refer to
as the Predicted Optical Flow-Saliency Map domain (POF-SM), and then fine-tune
the layers of a deep CNN model trained on classifying the ImageNet dataset to
perform action classification in the POF-SM domain. (iii) We tested our method
on the popular Willow dataset. But unlike existing methods, we also tested on a
more realistic and challenging dataset of over 2M still images that we
collected and labeled by taking random frames from the UCF-101 video dataset.
We call our dataset the UCF Still Image dataset or UCFSI-101 in short. Our
results outperform the state of the art
Volumetric Super-Resolution of Multispectral Data
Most multispectral remote sensors (e.g. QuickBird, IKONOS, and Landsat 7
ETM+) provide low-spatial high-spectral resolution multispectral (MS) or
high-spatial low-spectral resolution panchromatic (PAN) images, separately. In
order to reconstruct a high-spatial/high-spectral resolution multispectral
image volume, either the information in MS and PAN images are fused (i.e.
pansharpening) or super-resolution reconstruction (SRR) is used with only MS
images captured on different dates. Existing methods do not utilize temporal
information of MS and high spatial resolution of PAN images together to improve
the resolution. In this paper, we propose a multiframe SRR algorithm using
pansharpened MS images, taking advantage of both temporal and spatial
information available in multispectral imagery, in order to exceed spatial
resolution of given PAN images. We first apply pansharpening to a set of
multispectral images and their corresponding PAN images captured on different
dates. Then, we use the pansharpened multispectral images as input to the
proposed wavelet-based multiframe SRR method to yield full volumetric SRR. The
proposed SRR method is obtained by deriving the subband relations between
multitemporal MS volumes. We demonstrate the results on Landsat 7 ETM+ images
comparing our method to conventional techniques.Comment: arXiv admin note: text overlap with arXiv:1705.0125
Learning Semantics for Image Annotation
Image search and retrieval engines rely heavily on textual annotation in
order to match word queries to a set of candidate images. A system that can
automatically annotate images with meaningful text can be highly beneficial for
such engines. Currently, the approaches to develop such systems try to
establish relationships between keywords and visual features of images. In this
paper, We make three main contributions to this area: (i) We transform this
problem from the low-level keyword space to the high-level semantics space that
we refer to as the "{\em image theme}", (ii) Instead of treating each possible
keyword independently, we use latent Dirichlet allocation to learn image themes
from the associated texts in a training phase. Images are then annotated with
image themes rather than keywords, using a modified continuous relevance model,
which takes into account the spatial coherence and the visual continuity among
images of common theme. (iii) To achieve more coherent annotations among images
of common theme, we have integrated ConceptNet in learning the semantics of
images, and hence augment image descriptions beyond annotations provided by
humans. Images are thus further annotated by a few most significant words of
the prominent image theme. Our extensive experiments show that a coherent
theme-based image annotation using high-level semantics results in improved
precision and recall as compared with equivalent classical keyword annotation
systems
Non-Linear Phase-Shifting of Haar Wavelets for Run-Time All-Frequency Lighting
This paper focuses on real-time all-frequency image-based rendering using an
innovative solution for run-time computation of light transport. The approach
is based on new results derived for non-linear phase shifting in the Haar
wavelet domain. Although image-based methods for real-time rendering of dynamic
glossy objects have been proposed, they do not truly scale to all possible
frequencies and high sampling rates without trading storage, glossiness, or
computational time, while varying both lighting and viewpoint. This is due to
the fact that current approaches are limited to precomputed radiance transfer
(PRT), which is prohibitively expensive in terms of memory requirements and
real-time rendering when both varying light and viewpoint changes are required
together with high sampling rates for high frequency lighting of glossy
material. On the other hand, current methods cannot handle object rotation,
which is one of the paramount issues for all PRT methods using wavelets. This
latter problem arises because the precomputed data are defined in a global
coordinate system and encoded in the wavelet domain, while the object is
rotated in a local coordinate system. At the root of all the above problems is
the lack of efficient run-time solution to the nontrivial problem of rotating
wavelets (a non-linear phase-shift), which we solve in this paper
Image Annotation using Multi-Layer Sparse Coding
Automatic annotation of images with descriptive words is a challenging
problem with vast applications in the areas of image search and retrieval. This
problem can be viewed as a label-assignment problem by a classifier dealing
with a very large set of labels, i.e., the vocabulary set. We propose a novel
annotation method that employs two layers of sparse coding and performs
coarse-to-fine labeling. Themes extracted from the training data are treated as
coarse labels. Each theme is a set of training images that share a common
subject in their visual and textual contents. Our system extracts coarse labels
for training and test images without requiring any prior knowledge. Vocabulary
words are the fine labels to be associated with images. Most of the annotation
methods achieve low recall due to the large number of available fine labels,
i.e., vocabulary words. These systems also tend to achieve high precision for
highly frequent words only while relatively rare words are more important for
search and retrieval purposes. Our system not only outperforms various
previously proposed annotation systems, but also achieves symmetric response in
terms of precision and recall. Our system scores and maintains high precision
for words with a wide range of frequencies. Such behavior is achieved by
intelligently reducing the number of available fine labels or words for each
image based on coarse labels assigned to it
An Invariant Model of the Significance of Different Body Parts in Recognizing Different Actions
In this paper, we show that different body parts do not play equally
important roles in recognizing a human action in video data. We investigate to
what extent a body part plays a role in recognition of different actions and
hence propose a generic method of assigning weights to different body points.
The approach is inspired by the strong evidence in the applied perception
community that humans perform recognition in a foveated manner, that is they
recognize events or objects by only focusing on visually significant aspects.
An important contribution of our method is that the computation of the weights
assigned to body parts is invariant to viewing directions and camera parameters
in the input data. We have performed extensive experiments to validate the
proposed approach and demonstrate its significance. In particular, results show
that considerable improvement in performance is gained by taking into account
the relative importance of different body parts as defined by our approach.Comment: arXiv admin note: substantial text overlap with arXiv:1705.04641,
arXiv:1705.05741, arXiv:1705.0443
View-Invariant Recognition of Action Style Self-Dissimilarity
Self-similarity was recently introduced as a measure of inter-class
congruence for classification of actions. Herein, we investigate the dual
problem of intra-class dissimilarity for classification of action styles. We
introduce self-dissimilarity matrices that discriminate between same actions
performed by different subjects regardless of viewing direction and camera
parameters. We investigate two frameworks using these invariant style
dissimilarity measures based on Principal Component Analysis (PCA) and Fisher
Discriminant Analysis (FDA). Extensive experiments performed on IXMAS dataset
indicate remarkably good discriminant characteristics for the proposed
invariant measures for gender recognition from video data
Visual Affordance and Function Understanding: A Survey
Nowadays, robots are dominating the manufacturing, entertainment and
healthcare industries. Robot vision aims to equip robots with the ability to
discover information, understand it and interact with the environment. These
capabilities require an agent to effectively understand object affordances and
functionalities in complex visual domains. In this literature survey, we first
focus on Visual affordances and summarize the state of the art as well as open
problems and research gaps. Specifically, we discuss sub-problems such as
affordance detection, categorization, segmentation and high-level reasoning.
Furthermore, we cover functional scene understanding and the prevalent
functional descriptors used in the literature. The survey also provides
necessary background to the problem, sheds light on its significance and
highlights the existing challenges for affordance and functionality learning.Comment: 26 pages, 22 image
Bottom-up Attention, Models of
In this review, we examine the recent progress in saliency prediction and
proposed several avenues for future research. In spite of tremendous efforts
and huge progress, there is still room for improvement in terms finer-grained
analysis of deep saliency models, evaluation measures, datasets, annotation
methods, cognitive studies, and new applications. This chapter will appear in
Encyclopedia of Computational Neuroscience.Comment: arXiv admin note: substantial text overlap with arXiv:1810.0371
Multigrid Predictive Filter Flow for Unsupervised Learning on Videos
We introduce multigrid Predictive Filter Flow (mgPFF), a framework for
unsupervised learning on videos. The mgPFF takes as input a pair of frames and
outputs per-pixel filters to warp one frame to the other. Compared to optical
flow used for warping frames, mgPFF is more powerful in modeling sub-pixel
movement and dealing with corruption (e.g., motion blur). We develop a
multigrid coarse-to-fine modeling strategy that avoids the requirement of
learning large filters to capture large displacement. This allows us to train
an extremely compact model (4.6MB) which operates in a progressive way over
multiple resolutions with shared weights. We train mgPFF on unsupervised,
free-form videos and show that mgPFF is able to not only estimate long-range
flow for frame reconstruction and detect video shot transitions, but also
readily amendable for video object segmentation and pose tracking, where it
substantially outperforms the published state-of-the-art without bells and
whistles. Moreover, owing to mgPFF's nature of per-pixel filter prediction, we
have the unique opportunity to visualize how each pixel is evolving during
solving these tasks, thus gaining better interpretability.Comment: webpage (https://www.ics.uci.edu/~skong2/mgpff.html
- …