Crowd Counting via Weighted VLAD on Dense Attribute Feature Maps
Crowd counting is an important task in computer vision, which has many
applications in video surveillance. Although the regression-based framework has
achieved great improvements for crowd counting, how to improve the
discriminative power of image representation is still an open problem.
Conventional holistic features used in crowd counting often fail to capture
semantic attributes and spatial cues of the image. In this paper, we propose
integrating semantic information into learning locality-aware feature sets for
accurate crowd counting. First, with the help of convolutional neural network
(CNN), the original pixel space is mapped onto a dense attribute feature map,
where each dimension of the pixel-wise feature indicates the probabilistic
strength of a certain semantic class. Then, locality-aware features (LAF) built
on the idea of spatial pyramids on neighboring patches are proposed to explore
more spatial context and local information. Finally, the traditional VLAD
encoding method is extended to a more generalized form in which diverse
coefficient weights are taken into consideration. Experimental results validate
the effectiveness of our proposed method. Comment: 10 pages
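As a rough illustration of the encoding step, here is a minimal weighted-VLAD sketch in NumPy. Hard nearest-center assignment and a plain per-descriptor weight vector are simplifying assumptions; the paper's attribute feature maps and its exact generalized coefficient weighting are not reproduced here.

```python
import numpy as np

def weighted_vlad(descriptors, centers, weights):
    """Encode local descriptors into a weighted VLAD vector.

    descriptors: (N, D) local features (e.g. pixel-wise attribute vectors)
    centers:     (K, D) codebook (e.g. from k-means)
    weights:     (N,) per-descriptor coefficient weights
    Returns a flattened, L2-normalized (K*D,) encoding.
    """
    N, D = descriptors.shape
    K = centers.shape[0]
    # Hard-assign each descriptor to its nearest codebook center.
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    enc = np.zeros((K, D))
    for i in range(N):
        k = assign[i]
        # Standard VLAD accumulates residuals; here each residual
        # is additionally scaled by its coefficient weight.
        enc[k] += weights[i] * (descriptors[i] - centers[k])
    enc = enc.ravel()
    norm = np.linalg.norm(enc)
    return enc / norm if norm > 0 else enc
```

With unit weights this reduces to ordinary VLAD, which is a quick way to sanity-check the weighting term.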
Hybrid CNN and Dictionary-Based Models for Scene Recognition and Domain Adaptation
Convolutional neural network (CNN) has achieved state-of-the-art performance
in many different visual tasks. Learned from a large-scale training dataset,
CNN features are much more discriminative and accurate than the hand-crafted
features. Moreover, CNN features are also transferable among different domains.
On the other hand, traditional dictionary-based features (such as BoW and SPM)
contain much more local discriminative and structural information, which is
implicitly embedded in the images. To further improve the performance, in this
paper, we propose to combine CNN with dictionary-based models for scene
recognition and visual domain adaptation. Specifically, based on the well-tuned
CNN models (e.g., AlexNet and VGG Net), two dictionary-based representations
are further constructed, namely mid-level local representation (MLR) and
convolutional Fisher vector representation (CFV). In MLR, an efficient
two-stage clustering method, i.e., weighted spatial and feature space spectral
clustering on the parts of a single image followed by clustering all
representative parts of all images, is used to generate a class-mixture or a
class-specific part dictionary. After that, the part dictionary is applied to
multi-scale image inputs to generate the mid-level
representation. In CFV, a multi-scale and scale-proportional GMM training
strategy is utilized to generate Fisher vectors based on the last convolutional
layer of CNN. By integrating the complementary information of MLR, CFV and the
CNN features of the fully connected layer, the state-of-the-art performance can
be achieved on scene recognition and domain adaptation problems. An interesting
finding is that our proposed hybrid representation (from the VGG net trained on
ImageNet) is also highly complementary to GoogLeNet and/or VGG-11 (trained on
Place205). Comment: Accepted by TCSVT on Sep.201
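The CFV side of the abstract can be made concrete with a minimal diagonal-GMM Fisher vector sketch. This is a single-scale simplification: the multi-scale, scale-proportional GMM training is omitted, and the GMM parameters are assumed given (in practice they would be fit on last-convolutional-layer descriptors).

```python
import numpy as np

def fisher_vector(x, means, variances, priors):
    """First/second-order Fisher vector encoding of descriptors x
    under a diagonal-covariance GMM (a simplified sketch of CFV).

    x: (N, D) conv-layer descriptors; means/variances: (K, D); priors: (K,)
    Returns a (2*K*D,) improved-FV-style encoding.
    """
    N, D = x.shape
    K = means.shape[0]
    # Posterior responsibilities gamma[n, k] (soft assignments).
    log_p = -0.5 * (((x[:, None, :] - means[None]) ** 2) / variances[None]
                    + np.log(2 * np.pi * variances[None])).sum(-1)
    log_p += np.log(priors)[None]
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    sigma = np.sqrt(variances)
    fv = []
    for k in range(K):
        g = gamma[:, k:k + 1]
        u = (x - means[k]) / sigma[k]
        # Gradients w.r.t. component means and (diagonal) variances.
        fv.append((g * u).sum(0) / (N * np.sqrt(priors[k])))
        fv.append((g * (u ** 2 - 1)).sum(0) / (N * np.sqrt(2 * priors[k])))
    fv = np.concatenate(fv)
    # Power- and L2-normalization, as in the improved Fisher vector.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / max(np.linalg.norm(fv), 1e-12)
```

The resulting 2*K*D vector is what would be concatenated with MLR and fully connected CNN features in the hybrid representation.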
Face Recognition in Low Quality Images: A Survey
Low-resolution face recognition (LRFR) has received increasing attention over
the past few years. Its applications lie widely in the real-world environment
when high-resolution or high-quality images are hard to capture. One of the
biggest demands for LRFR technologies is video surveillance. As the number
of surveillance cameras in cities increases, the captured videos will need
to be processed automatically. However, those videos or images are usually
captured with large standoffs, arbitrary illumination conditions, and diverse
viewing angles. Faces in these images are generally small in size. Several
studies have addressed this problem with techniques such as super-resolution,
deblurring, or learning a relationship between different resolution domains. In
this paper, we provide a comprehensive review of approaches to low-resolution
face recognition in the past five years. First, a general problem definition is
given. Then, a systematic analysis of the works on this topic is presented
by category. In addition to describing the methods, we also focus on datasets
and experiment settings. We further address the related works on unconstrained
low-resolution face recognition and compare them with results that use
synthetic low-resolution data. Finally, we summarize the general limitations
and speculate on priorities for future effort. Comment: There are some mistakes in this paper which may be
misleading to the reader, and we won't have a new version in the short term. We
will resubmit once it is corrected.
Shared Predictive Cross-Modal Deep Quantization
With explosive growth of data volume and ever-increasing diversity of data
modalities, cross-modal similarity search, which conducts nearest neighbor
search across different modalities, has been attracting increasing interest.
This paper presents a deep compact code learning solution for efficient
cross-modal similarity search. Many recent studies have proven that
quantization-based approaches perform generally better than hashing-based
approaches on single-modal similarity search. In this paper, we propose a deep
quantization approach, which is among the early attempts at leveraging deep
neural networks for quantization-based cross-modal similarity search. Our
approach, dubbed shared predictive deep quantization (SPDQ), explicitly
formulates a shared subspace across different modalities and two private
subspaces for individual modalities, and representations in the shared subspace
and the private subspaces are learned simultaneously by embedding them to a
reproducing kernel Hilbert space, where the mean embedding of different
modality distributions can be explicitly compared. In addition, in the shared
subspace, a quantizer is learned to produce semantics-preserving compact
codes with the help of label alignment. Thanks to this novel network
architecture in cooperation with supervised quantization training, SPDQ can
preserve intramodal and intermodal similarities as much as possible and greatly
reduce quantization error. Experiments on two popular benchmarks corroborate
that our approach outperforms state-of-the-art methods.
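The mean-embedding comparison at the heart of the shared-subspace alignment can be sketched as a standard (biased) maximum mean discrepancy estimate with an RBF kernel. The kernel choice and estimator here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def rbf_mmd2(X, Y, gamma=1.0):
    """Squared maximum mean discrepancy between two samples, computed in
    a reproducing kernel Hilbert space via an RBF kernel. This is the
    kind of mean-embedding comparison used to align modality
    distributions in a shared subspace.

    X: (n, d) embeddings of modality 1; Y: (m, d) embeddings of modality 2.
    """
    def k(A, B):
        # Pairwise squared distances, then the RBF kernel matrix.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    # Biased V-statistic estimate: MMD^2 = E[k(x,x')] + E[k(y,y')] - 2E[k(x,y)].
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```

Identical samples give an MMD of exactly zero, while samples drawn from well-separated distributions give a strictly positive value, which makes the quantity usable as an alignment loss.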
Recent Advance in Content-based Image Retrieval: A Literature Survey
The explosive increase and ubiquitous accessibility of visual data on the Web
have led to the prosperity of research activity in image search or retrieval.
When visual content is ignored as a ranking clue, methods that apply text
search techniques to visual retrieval may suffer from inconsistency between the
text words and visual content. Content-based image retrieval (CBIR), which
makes use of the representation of visual content to identify relevant images,
has attracted sustained attention over the last two decades. Such a problem is
challenging due to the intention gap and the semantic gap problems. Numerous
techniques have been developed for content-based image retrieval in the last
decade. The purpose of this paper is to categorize and evaluate those
algorithms proposed during the period of 2003 to 2016. We conclude with several
promising directions for future research. Comment: 22 pages
Set-to-Set Hashing with Applications in Visual Recognition
Visual data, such as an image or a sequence of video frames, is often
naturally represented as a point set. In this paper, we consider the
fundamental problem of finding a nearest set from a collection of sets, to a
query set. This problem has obvious applications in large-scale visual
retrieval and recognition, and also in applied fields beyond computer vision.
One challenge stands out in solving the problem---set representation and
measure of similarity. Particularly, the query set and the sets in dataset
collection can have varying cardinalities. The training collection is large
enough such that linear scan is impractical. We propose a simple representation
scheme that encodes both statistical and structural information of the sets.
The derived representations are integrated in a kernel framework for flexible
similarity measurement. To process a query set, we adopt a learning-to-hash
pipeline that turns the kernel representations into hash bits based on simple
learners, using multiple kernel learning. Experiments on two visual retrieval
datasets show unambiguously that our set-to-set hashing framework outperforms
prior methods that do not account for the set-to-set search setting. Comment: 9 pages
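A toy sketch of the two ingredients follows, with simple moment statistics standing in for the paper's kernel set representation and random sign-of-projection hashing standing in for the learned hash functions; both substitutions are simplifying assumptions.

```python
import numpy as np

def set_signature(points):
    """Fixed-length signature for a variable-cardinality point set:
    per-dimension mean and standard deviation. A simple stand-in for a
    representation that mixes statistical and structural information."""
    points = np.asarray(points, dtype=float)
    return np.concatenate([points.mean(0), points.std(0)])

def hash_bits(sig, hyperplanes):
    """Sign-of-projection (LSH-style) hashing of a set signature.
    The paper instead learns hash functions over kernel representations
    with multiple kernel learning; random hyperplanes are a placeholder."""
    return (hyperplanes @ sig > 0).astype(int)
```

The key property is that sets of different cardinalities map to signatures of the same length, so nearest-set search reduces to Hamming search over fixed-length codes.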
Dual-level Semantic Transfer Deep Hashing for Efficient Social Image Retrieval
Social networks store and disseminate a tremendous amount of user-shared
images. Deep hashing is an efficient indexing technique to support large-scale
social image retrieval, due to its deep representation capability, fast
retrieval speed and low storage cost. In particular, unsupervised deep hashing
scales well, as it does not require any manually labelled data for
training. However, owing to the lack of label guidance, existing methods
suffer from a severe semantic shortage when optimizing a large number of deep
neural network parameters. In contrast, in this paper, we propose a Dual-level
Semantic Transfer Deep Hashing (DSTDH) method to alleviate this problem with a
unified deep hash learning framework. Our model aims to learn
semantically enhanced deep hash codes by specifically exploiting the
user-generated tags associated with the social images. Specifically, we design
a complementary dual-level semantic transfer mechanism to efficiently discover
the potential semantics of tags and seamlessly transfer them into binary hash
codes. On the one hand, instance-level semantics are directly preserved in the
hash codes from the associated tags, with adverse noise removed. On the other
hand, an image-concept hypergraph is constructed to indirectly transfer the latent
high-order semantic correlations of images and tags into hash codes. Moreover,
the hash codes are obtained simultaneously with the deep representation
learning by the discrete hash optimization strategy. Extensive experiments on
two public social image retrieval datasets validate the superior performance of
our method compared with state-of-the-art hashing methods. The source code of
our method can be obtained at https://github.com/research2020-1/DSTDH Comment: Accepted by IEEE TCSVT
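The image-concept hypergraph underlying the second transfer level can be sketched with the standard normalized hypergraph Laplacian, built from an incidence matrix of image-tag associations. Unit hyperedge weights are an assumption here; the paper's exact weighting and how the Laplacian enters the hash objective are not reproduced.

```python
import numpy as np

def hypergraph_laplacian(H):
    """Normalized hypergraph Laplacian L = I - Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}
    from an image-concept incidence matrix H (images x tags),
    where H[i, t] = 1 if image i carries tag t.
    """
    H = np.asarray(H, dtype=float)
    w = np.ones(H.shape[1])                  # unit hyperedge (tag) weights
    dv = H @ w                               # vertex (image) degrees
    de = H.sum(axis=0)                       # hyperedge (tag) degrees
    Dv = np.diag(1.0 / np.sqrt(np.maximum(dv, 1e-12)))
    De = np.diag(1.0 / np.maximum(de, 1e-12))
    W = np.diag(w)
    theta = Dv @ H @ W @ De @ H.T @ Dv
    return np.eye(H.shape[0]) - theta
```

Such a Laplacian is symmetric and positive semi-definite, so a trace-style regularizer over it encourages images sharing tags to receive similar hash codes.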
From BoW to CNN: Two Decades of Texture Representation for Texture Classification
Texture is a fundamental characteristic of many types of images, and texture
representation is one of the essential and challenging problems in computer
vision and pattern recognition which has attracted extensive research
attention. Since 2000, texture representations based on Bag of Words (BoW) and
on Convolutional Neural Networks (CNNs) have been extensively studied with
impressive performance. Given this period of remarkable evolution, this paper
aims to present a comprehensive survey of advances in texture representation
over the last two decades. More than 200 major publications are cited in this
survey covering different aspects of the research, which includes (i) problem
description; (ii) recent advances in the broad categories of BoW-based,
CNN-based and attribute-based methods; and (iii) evaluation issues,
specifically benchmark datasets and state-of-the-art results. Looking back on
what has been achieved so far, the survey discusses open challenges and
directions for future research. Comment: Accepted by IJC
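To make the BoW side of the story concrete, here is a minimal hard-assignment BoW encoder. The codebook would normally come from k-means over training descriptors, and the local descriptors from a filter bank or dense patches; both are assumed given here.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Classic Bag-of-Words texture encoding: hard-assign each local
    descriptor (e.g. a filter-bank response or small patch) to its
    nearest codeword and return the normalized codeword histogram."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    hist = np.bincount(assign, minlength=codebook.shape[0]).astype(float)
    return hist / hist.sum()
```

CNN-based pipelines replace the hand-crafted descriptors with learned convolutional features, but the orderless-pooling idea surveyed here is the same.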
cvpaper.challenge in 2015 - A review of CVPR2015 and DeepSurvey
The "cvpaper.challenge" is a group composed of members from AIST, Tokyo Denki
Univ. (TDU), and Univ. of Tsukuba that aims to systematically summarize papers
on computer vision, pattern recognition, and related fields. For this
particular review, we focused on reading all 602 conference papers
presented at the CVPR2015, the premier annual computer vision event held in
June 2015, in order to grasp the trends in the field. Further, we propose
"DeepSurvey" as a mechanism embodying the entire process, from reading
all the papers, through the generation of ideas, to the writing of papers. Comment: Survey Paper
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
The paper presents futuristic challenges discussed in the cvpaper.challenge. In
2015 and 2016, we thoroughly studied 1,600+ papers in several
conferences/journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV
- …