175,570 research outputs found
Single Image Action Recognition by Predicting Space-Time Saliency
We propose a novel approach based on deep Convolutional Neural Networks (CNN)
to recognize human actions in still images by predicting the future motion, and
detecting the shape and location of the salient parts of the image. We make the
following major contributions to this important area of research: (i) We use
the predicted future motion in the static image (Walker et al., 2015) as a
means of compensating for the missing temporal information, while using the
saliency map to represent the the spatial information in the form of location
and shape of what is predicted as significant. (ii) We cast action
classification in static images as a domain adaptation problem by transfer
learning. We first map the input static image to a new domain that we refer to
as the Predicted Optical Flow-Saliency Map domain (POF-SM), and then fine-tune
the layers of a deep CNN model trained on classifying the ImageNet dataset to
perform action classification in the POF-SM domain. (iii) We tested our method
on the popular Willow dataset. But unlike existing methods, we also tested on a
more realistic and challenging dataset of over 2M still images that we
collected and labeled by taking random frames from the UCF-101 video dataset.
We call our dataset the UCF Still Image dataset or UCFSI-101 in short. Our
results outperform the state of the art
View-Invariant Recognition of Action Style Self-Dissimilarity
Self-similarity was recently introduced as a measure of inter-class
congruence for classification of actions. Herein, we investigate the dual
problem of intra-class dissimilarity for classification of action styles. We
introduce self-dissimilarity matrices that discriminate between same actions
performed by different subjects regardless of viewing direction and camera
parameters. We investigate two frameworks using these invariant style
dissimilarity measures based on Principal Component Analysis (PCA) and Fisher
Discriminant Analysis (FDA). Extensive experiments performed on IXMAS dataset
indicate remarkably good discriminant characteristics for the proposed
invariant measures for gender recognition from video data
Volumetric Super-Resolution of Multispectral Data
Most multispectral remote sensors (e.g. QuickBird, IKONOS, and Landsat 7
ETM+) provide low-spatial high-spectral resolution multispectral (MS) or
high-spatial low-spectral resolution panchromatic (PAN) images, separately. In
order to reconstruct a high-spatial/high-spectral resolution multispectral
image volume, either the information in MS and PAN images are fused (i.e.
pansharpening) or super-resolution reconstruction (SRR) is used with only MS
images captured on different dates. Existing methods do not utilize temporal
information of MS and high spatial resolution of PAN images together to improve
the resolution. In this paper, we propose a multiframe SRR algorithm using
pansharpened MS images, taking advantage of both temporal and spatial
information available in multispectral imagery, in order to exceed spatial
resolution of given PAN images. We first apply pansharpening to a set of
multispectral images and their corresponding PAN images captured on different
dates. Then, we use the pansharpened multispectral images as input to the
proposed wavelet-based multiframe SRR method to yield full volumetric SRR. The
proposed SRR method is obtained by deriving the subband relations between
multitemporal MS volumes. We demonstrate the results on Landsat 7 ETM+ images
comparing our method to conventional techniques.Comment: arXiv admin note: text overlap with arXiv:1705.0125
An Invariant Model of the Significance of Different Body Parts in Recognizing Different Actions
In this paper, we show that different body parts do not play equally
important roles in recognizing a human action in video data. We investigate to
what extent a body part plays a role in recognition of different actions and
hence propose a generic method of assigning weights to different body points.
The approach is inspired by the strong evidence in the applied perception
community that humans perform recognition in a foveated manner, that is they
recognize events or objects by only focusing on visually significant aspects.
An important contribution of our method is that the computation of the weights
assigned to body parts is invariant to viewing directions and camera parameters
in the input data. We have performed extensive experiments to validate the
proposed approach and demonstrate its significance. In particular, results show
that considerable improvement in performance is gained by taking into account
the relative importance of different body parts as defined by our approach.Comment: arXiv admin note: substantial text overlap with arXiv:1705.04641,
arXiv:1705.05741, arXiv:1705.0443
Learning Semantics for Image Annotation
Image search and retrieval engines rely heavily on textual annotation in
order to match word queries to a set of candidate images. A system that can
automatically annotate images with meaningful text can be highly beneficial for
such engines. Currently, the approaches to develop such systems try to
establish relationships between keywords and visual features of images. In this
paper, We make three main contributions to this area: (i) We transform this
problem from the low-level keyword space to the high-level semantics space that
we refer to as the "{\em image theme}", (ii) Instead of treating each possible
keyword independently, we use latent Dirichlet allocation to learn image themes
from the associated texts in a training phase. Images are then annotated with
image themes rather than keywords, using a modified continuous relevance model,
which takes into account the spatial coherence and the visual continuity among
images of common theme. (iii) To achieve more coherent annotations among images
of common theme, we have integrated ConceptNet in learning the semantics of
images, and hence augment image descriptions beyond annotations provided by
humans. Images are thus further annotated by a few most significant words of
the prominent image theme. Our extensive experiments show that a coherent
theme-based image annotation using high-level semantics results in improved
precision and recall as compared with equivalent classical keyword annotation
systems
Non-Linear Phase-Shifting of Haar Wavelets for Run-Time All-Frequency Lighting
This paper focuses on real-time all-frequency image-based rendering using an
innovative solution for run-time computation of light transport. The approach
is based on new results derived for non-linear phase shifting in the Haar
wavelet domain. Although image-based methods for real-time rendering of dynamic
glossy objects have been proposed, they do not truly scale to all possible
frequencies and high sampling rates without trading storage, glossiness, or
computational time, while varying both lighting and viewpoint. This is due to
the fact that current approaches are limited to precomputed radiance transfer
(PRT), which is prohibitively expensive in terms of memory requirements and
real-time rendering when both varying light and viewpoint changes are required
together with high sampling rates for high frequency lighting of glossy
material. On the other hand, current methods cannot handle object rotation,
which is one of the paramount issues for all PRT methods using wavelets. This
latter problem arises because the precomputed data are defined in a global
coordinate system and encoded in the wavelet domain, while the object is
rotated in a local coordinate system. At the root of all the above problems is
the lack of efficient run-time solution to the nontrivial problem of rotating
wavelets (a non-linear phase-shift), which we solve in this paper
A Neural Compositional Paradigm for Image Captioning
Mainstream captioning models often follow a sequential structure to generate
captions, leading to issues such as introduction of irrelevant semantics, lack
of diversity in the generated captions, and inadequate generalization
performance. In this paper, we present an alternative paradigm for image
captioning, which factorizes the captioning procedure into two stages: (1)
extracting an explicit semantic representation from the given image; and (2)
constructing the caption based on a recursive compositional procedure in a
bottom-up manner. Compared to conventional ones, our paradigm better preserves
the semantic content through an explicit factorization of semantics and syntax.
By using the compositional generation procedure, caption construction follows a
recursive structure, which naturally fits the properties of human language.
Moreover, the proposed compositional procedure requires less data to train,
generalizes better, and yields more diverse captions.Comment: 32nd Conference on Neural Information Processing Systems (NIPS 2018),
Montr\'eal, Canad
Space-Time Representation of People Based on 3D Skeletal Data: A Review
Spatiotemporal human representation based on 3D visual perception data is a
rapidly growing research area. Based on the information sources, these
representations can be broadly categorized into two groups based on RGB-D
information or 3D skeleton data. Recently, skeleton-based human representations
have been intensively studied and kept attracting an increasing attention, due
to their robustness to variations of viewpoint, human body scale and motion
speed as well as the realtime, online performance. This paper presents a
comprehensive survey of existing space-time representations of people based on
3D skeletal data, and provides an informative categorization and analysis of
these methods from the perspectives, including information modality,
representation encoding, structure and transition, and feature engineering. We
also provide a brief overview of skeleton acquisition devices and construction
methods, enlist a number of public benchmark datasets with skeleton data, and
discuss potential future research directions.Comment: Our paper has been accepted by the journal Computer Vision and Image
Understanding, see
http://www.sciencedirect.com/science/article/pii/S1077314217300279, Computer
Vision and Image Understanding, 201
Classifying Traffic Scenes Using The GIST Image Descriptor
This paper investigates classification of traffic scenes in a very low
bandwidth scenario, where an image should be coded by a small number of
features. We introduce a novel dataset, called the FM1 dataset, consisting of
5615 images of eight different traffic scenes: open highway, open road,
settlement, tunnel, tunnel exit, toll booth, heavy traffic and the overpass. We
evaluate the suitability of the GIST descriptor as a representation of these
images, first by exploring the descriptor space using PCA and k-means
clustering, and then by using an SVM classifier and recording its 10-fold
cross-validation performance on the introduced FM1 dataset. The obtained
recognition rates are very encouraging, indicating that the use of the GIST
descriptor alone could be sufficiently descriptive even when very high
performance is required.Comment: Part of the Proceedings of the Croatian Computer Vision Workshop,
CCVW 2013, Year
Super-Resolution via Deep Learning
The recent phenomenal interest in convolutional neural networks (CNNs) must
have made it inevitable for the super-resolution (SR) community to explore its
potential. The response has been immense and in the last three years, since the
advent of the pioneering work, there appeared too many works not to warrant a
comprehensive survey. This paper surveys the SR literature in the context of
deep learning. We focus on the three important aspects of multimedia - namely
image, video and multi-dimensions, especially depth maps. In each case, first
relevant benchmarks are introduced in the form of datasets and state of the art
SR methods, excluding deep learning. Next is a detailed analysis of the
individual works, each including a short description of the method and a
critique of the results with special reference to the benchmarking done. This
is followed by minimum overall benchmarking in the form of comparison on some
common dataset, while relying on the results reported in various works
- …