Query-Adaptive Hash Code Ranking for Large-Scale Multi-View Visual Search
Hash-based nearest neighbor search has become attractive in many
applications. However, the quantization step in hashing usually degrades
discriminative power when ranking by Hamming distance. Moreover, for
large-scale visual search, existing hashing methods cannot directly support
efficient search over data with multiple sources, even though the literature
has shown that adaptively incorporating complementary information from diverse
sources or views can significantly boost search performance. To address these
problems, this paper proposes a novel and generic approach to building multiple
hash tables with multiple views and generating fine-grained ranking results at
bitwise and tablewise levels. For each hash table, a query-adaptive bitwise
weighting is introduced to alleviate the quantization loss by simultaneously
exploiting the quality of hash functions and their complement for nearest
neighbor search. From the tablewise aspect, multiple hash tables are built for
different data views as a joint index, over which a query-specific rank fusion
is proposed to rerank all results from the bitwise ranking by diffusing in a
graph. Comprehensive experiments on image search over three well-known
benchmarks show that the proposed method achieves up to 17.11% and 20.28%
performance gains on single- and multiple-table search, respectively, over
state-of-the-art methods.
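As a rough illustration of the bitwise-weighting idea (not the paper's actual method: the query-adaptive weight computation from hash-function quality is omitted, and the weights below are arbitrary inputs), a weighted Hamming ranking can be sketched in a few lines of NumPy:

```python
import numpy as np

def weighted_hamming_rank(query_code, db_codes, bit_weights):
    """Rank database items by a bitwise-weighted Hamming distance.

    query_code : (B,) array of {0, 1} bits for the query
    db_codes   : (N, B) array of {0, 1} bits for N database items
    bit_weights: (B,) per-bit weights; query-adaptive in the paper,
                 here simply passed in as given
    """
    # XOR marks disagreeing bits; weighting the disagreements turns the
    # coarse integer Hamming distance into a fine-grained real-valued one.
    disagree = np.bitwise_xor(db_codes, query_code)  # (N, B)
    dists = disagree @ bit_weights                   # (N,)
    return np.argsort(dists)                         # ascending distance

rng = np.random.default_rng(0)
db = rng.integers(0, 2, size=(6, 8))   # 6 items with 8-bit codes
q = rng.integers(0, 2, size=8)
w = rng.uniform(0.5, 1.5, size=8)      # hypothetical per-bit weights
print(weighted_hamming_rank(q, db, w))
```

With uniform weights this reduces to plain Hamming ranking; non-uniform weights are what break the many ties an integer distance produces.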
Deep Multimodal Feature Analysis for Action Recognition in RGB+D Videos
Single modality action recognition on RGB or depth sequences has been
extensively explored recently. It is generally accepted that each of these two
modalities has different strengths and limitations for the task of action
recognition. Therefore, analysis of the RGB+D videos can help us to better
study the complementary properties of these two types of modalities and achieve
higher levels of performance. In this paper, we propose a new deep
autoencoder-based shared-specific feature factorization network to separate
input multimodal signals into a hierarchy of components. Further, based on the
structure of the features, a structured sparsity learning machine is proposed
which utilizes mixed norms to apply regularization within components and group
selection between them for better classification performance. Our experimental
results show the effectiveness of our cross-modality feature analysis framework
by achieving state-of-the-art accuracy for action classification on five
challenging benchmark datasets.
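The abstract does not spell out which mixed norms are used, so the following is only a plausible instantiation: an ℓ2,1 norm within each feature component (row-wise sparsity) plus a group-lasso term across components (component selection), written in NumPy:

```python
import numpy as np

def mixed_norm_penalty(W, groups, lam_within=1.0, lam_between=1.0):
    """Illustrative mixed-norm regularizer on a weight matrix W (d x c).

    groups: list of row-index arrays, one per feature component.
    The first term (l2,1 within a component) zeroes individual rows;
    the second (group lasso over components) can drop a whole component.
    """
    within = sum(np.linalg.norm(W[g], axis=1).sum() for g in groups)
    between = sum(np.linalg.norm(W[g]) for g in groups)  # Frobenius per group
    return lam_within * within + lam_between * between

W = np.random.randn(6, 3)
groups = [np.arange(0, 3), np.arange(3, 6)]  # two hypothetical components
print(mixed_norm_penalty(W, groups))
```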
Unsupervised Multi-modal Hashing for Cross-modal Retrieval
With the advantage of low storage cost and high efficiency, hashing learning
has received much attention in the domain of Big Data. In this paper, we
propose a novel unsupervised hashing method that directly preserves the
manifold structure of the data. To this end, both the semantic correlation in
the textual space and the local geometric structure in the visual space are
explored simultaneously in our framework. Besides, an ℓ2,1-norm constraint is
imposed on the projection matrices to learn a discriminative hash function for
each modality. Extensive
experiments are performed to evaluate the proposed method on the three publicly
available datasets and the experimental results show that our method can
achieve superior performance over the state-of-the-art methods. Comment: 4 pages, 4 figures
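For intuition only, here is a minimal NumPy sketch of the two ingredients named above: binary codes from a sign-thresholded linear projection, and the ℓ2,1 norm of a projection matrix (the learning procedure itself is not reproduced):

```python
import numpy as np

def l21_norm(W):
    # Sum of l2 norms of the rows; driving whole rows to zero amounts
    # to discarding the corresponding input features.
    return np.linalg.norm(W, axis=1).sum()

def hash_codes(X, W):
    # Binary codes via a linear projection followed by sign thresholding.
    return (X @ W > 0).astype(np.uint8)

X = np.random.randn(5, 10)   # 5 samples with 10-dim features
W = np.random.randn(10, 16)  # projection to 16-bit codes
print(hash_codes(X, W).shape, l21_norm(W))
```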
Multi-View Task-Driven Recognition in Visual Sensor Networks
Nowadays, distributed smart cameras are deployed for a wide set of tasks in
several application scenarios, ranging from object recognition and image
retrieval to forensic applications. Due to the limited bandwidth of
distributed systems, efficient coding of local visual features has been an active
topic of research. In this paper, we propose a novel approach to obtain a
compact representation of high-dimensional visual data using sensor fusion
techniques. We cast the problem of visual analysis in resource-limited
scenarios as multi-view representation learning, and we show that the key to
finding a properly compressed representation is to exploit the positions of
the cameras with respect to each other as a norm-based regularization in a
sparse-coding signal representation. Learning the representation
of each camera is viewed as an individual task and a multi-task learning with
joint sparsity for all nodes is employed. The proposed representation learning
scheme is referred to as the multi-view task-driven learning for visual sensor
network (MT-VSN). We demonstrate that MT-VSN outperforms the state-of-the-art
in various surveillance recognition tasks. Comment: 5 pages, accepted at the
International Conference on Image Processing, 201
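The joint-sparsity coupling across camera tasks can be sketched as a proximal step for an ℓ2,1 penalty on the coefficient matrix; this is a generic group soft-thresholding operator, not MT-VSN's full optimization:

```python
import numpy as np

def joint_sparsity_prox(A, lam):
    """Proximal operator of lam * (sum of row l2 norms), an l2,1 penalty.

    A: (K, T) coefficients, K dictionary atoms x T camera tasks.
    Each row is shrunk jointly, so an atom is either kept for all
    views or dropped for all of them -- the joint-sparsity effect.
    """
    row_norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(row_norms, 1e-12))
    return A * scale

A = np.random.randn(8, 4)  # 8 atoms shared across 4 hypothetical cameras
print(joint_sparsity_prox(A, 0.5))
```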
Learning to Measure Change: Fully Convolutional Siamese Metric Networks for Scene Change Detection
A critical challenge in scene change detection is that noisy changes
generated by varying illumination, shadows, and camera viewpoint make the
variance of a scene difficult to define and measure, since the noisy changes
and the semantic ones are entangled. Following the intuitive idea of detecting
changes by directly comparing the dissimilarity between a pair of features, we
propose a novel fully Convolutional Siamese metric Network (CosimNet) to measure changes
by customizing implicit metrics. To learn more discriminative metrics, we
utilize contrastive loss to reduce the distance between the unchanged feature
pairs and to enlarge the distance between the changed feature pairs.
Specifically, to address the issue of large viewpoint differences, we propose
a Thresholded Contrastive Loss (TCL) with a more tolerant strategy for
penalizing noisy changes. We demonstrate the effectiveness of the proposed
approach with experiments on three challenging datasets: CDnet, PCD2015, and
VL-CMU-CD. Our approach is robust to many challenging conditions, such as
illumination changes and large viewpoint differences caused by camera motion and zooming. In
addition, we incorporate the distance metric into the segmentation framework
and validate the effectiveness through visualization of change maps and feature
distribution. The source code is available at
https://github.com/gmayday1997/ChangeDet. Comment: 10 pages, 12 figures
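The standard contrastive loss is well defined; the thresholded variant below is only one plausible reading of TCL from the abstract (unchanged pairs are tolerated up to a distance `tol` before being penalized):

```python
import numpy as np

def contrastive_loss(d, same, margin=2.0):
    # Pull matched pairs (same=1) together; push mismatched pairs
    # (same=0) apart until they clear the margin.
    return np.where(same, d**2, np.maximum(0.0, margin - d)**2).mean()

def thresholded_contrastive_loss(d, same, margin=2.0, tol=0.5):
    # Hypothetical TCL form: matched (unchanged) pairs are penalized
    # only once their distance exceeds tol, so small noisy differences
    # from illumination or viewpoint go unpunished.
    pos = np.maximum(0.0, d - tol)**2
    neg = np.maximum(0.0, margin - d)**2
    return np.where(same, pos, neg).mean()

d = np.array([0.3, 1.1, 2.5, 0.2])   # pairwise feature distances
same = np.array([1, 1, 0, 0])        # 1 = unchanged pair, 0 = changed
print(contrastive_loss(d, same), thresholded_contrastive_loss(d, same))
```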
Two-stream Collaborative Learning with Spatial-Temporal Attention for Video Classification
Video classification is highly important with wide applications, such as
video search and intelligent surveillance. Video naturally consists of static
and motion information, which can be represented by frame and optical flow.
Recently, researchers have generally adopted deep networks to capture the
static and motion information separately, which has two main limitations:
(1) ignoring the coexistence relationship between spatial and temporal
attention, which should be jointly modelled as the spatial and temporal
evolutions of video so that discriminative video features can be extracted;
(2) ignoring the strong complementarity between the static and motion
information that coexist in video, which should be collaboratively learned to
boost each other. To address these two limitations, this paper proposes
the approach of two-stream collaborative learning with spatial-temporal
attention (TCLSTA), which consists of two models: (1) Spatial-temporal
attention model: The spatial-level attention emphasizes the salient regions in
frame, and the temporal-level attention exploits the discriminative frames in
video. They are jointly learned and mutually boosted to learn the
discriminative static and motion features for better classification
performance. (2) Static-motion collaborative model: It not only achieves mutual
guidance on static and motion information to boost the feature learning, but
also adaptively learns the fusion weights of static and motion streams, so as
to exploit the strong complementarity between static and motion information to
promote video classification. Experiments on 4 widely-used datasets show that
our TCLSTA approach achieves the best performance compared with more than 10
state-of-the-art methods.Comment: 14 pages, accepted by IEEE Transactions on Circuits and Systems for
Video Technolog
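As a toy sketch of the two ideas (frame-level attention within each stream, then adaptively weighted stream fusion), with all scores supplied as inputs rather than learned as in TCLSTA:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_and_fuse(static_feats, motion_feats, attn_scores, fusion_logits):
    """Temporal attention over frames, then weighted two-stream fusion.

    static_feats, motion_feats: (T, D) per-frame features of each stream
    attn_scores:   (2, T) unnormalized frame scores, one row per stream
    fusion_logits: (2,) unnormalized stream weights
    """
    a = softmax(attn_scores, axis=1)       # per-stream frame attention
    static_vec = a[0] @ static_feats       # attended static feature (D,)
    motion_vec = a[1] @ motion_feats       # attended motion feature (D,)
    w = softmax(fusion_logits)             # adaptive fusion weights
    return w[0] * static_vec + w[1] * motion_vec

T, D = 5, 8
fused = attend_and_fuse(np.random.randn(T, D), np.random.randn(T, D),
                        np.random.randn(2, T), np.array([0.2, -0.1]))
print(fused.shape)  # (8,)
```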
Cooperative Training of Deep Aggregation Networks for RGB-D Action Recognition
A novel deep neural network training paradigm that exploits the conjoint
information in multiple heterogeneous sources is proposed. Specifically, in a
RGB-D based action recognition task, it cooperatively trains a single
convolutional neural network (named c-ConvNet) on both RGB visual features and
depth features, and deeply aggregates the two kinds of features for action
recognition. Unlike the conventional ConvNet, which learns deep separable
features for homogeneous modality-based classification with only one
softmax loss function, the c-ConvNet enhances the discriminative power of the
deeply learned features and weakens the undesired modality discrepancy by
jointly optimizing a ranking loss and a softmax loss for both homogeneous and
heterogeneous modalities. The ranking loss consists of intra-modality and
cross-modality triplet losses, and it reduces both the intra-modality and
cross-modality feature variations. Furthermore, the correlations between RGB
and depth data are embedded in the c-ConvNet; they can be retrieved from
either modality and contribute to recognition even when only one of the
modalities is available. The proposed method was extensively evaluated on
two large RGB-D action recognition datasets, ChaLearn LAP IsoGD and NTU RGB+D,
and one small dataset, SYSU 3D HOI, and achieved state-of-the-art results.
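The intra- and cross-modality triplet terms can be illustrated with a plain Euclidean triplet hinge; the features, margin, and pairing below are placeholders, not the c-ConvNet's actual embeddings:

```python
import numpy as np

def triplet_loss(anchor, pos, neg, margin=1.0):
    # Hinge on the distance gap: the anchor must sit closer to the
    # positive than to the negative by at least the margin.
    d_pos = np.linalg.norm(anchor - pos)
    d_neg = np.linalg.norm(anchor - neg)
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(1)
# Intra-modality triplet: anchor, positive, negative all from RGB.
rgb_a, rgb_p, rgb_n = rng.standard_normal((3, 16))
# Cross-modality triplet: RGB anchor with depth positive/negative.
dep_p, dep_n = rng.standard_normal((2, 16))

loss = triplet_loss(rgb_a, rgb_p, rgb_n) + triplet_loss(rgb_a, dep_p, dep_n)
print(loss)
```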
Hierarchical Spatial-aware Siamese Network for Thermal Infrared Object Tracking
Most thermal infrared (TIR) tracking methods are discriminative, treating the
tracking problem as a classification task. However, the objective of the
classifier (label prediction) is not coupled to the objective of the tracker
(location estimation). The classification task focuses on the between-class
difference of the arbitrary objects, while the tracking task mainly deals with
the within-class difference of the same objects. In this paper, we cast the TIR
tracking problem as a similarity verification task, which is coupled well to
the objective of the tracking task. We propose a TIR tracker via a Hierarchical
Spatial-aware Siamese Convolutional Neural Network (CNN), named HSSNet. To
obtain both spatial and semantic features of the TIR object, we design a
Siamese CNN that coalesces multiple hierarchical convolutional layers.
Then, we propose a spatial-aware network to enhance the discriminative ability
of the coalesced hierarchical feature. Subsequently, we train this network end
to end on a large visible video detection dataset to learn the similarity
between paired objects before we transfer the network into the TIR domain.
Next, this pre-trained Siamese network is used to evaluate the similarity
between the target template and target candidates. Finally, we locate the
candidate that is most similar to the tracked target. Extensive experimental
results on the benchmarks VOT-TIR 2015 and VOT-TIR 2016 show that our proposed
method achieves favourable performance compared to the state-of-the-art
methods. Comment: 20 pages, 7 figures
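The final matching step (scoring candidates against the target template) can be sketched with cosine similarity standing in for whatever similarity head the trained Siamese network actually computes:

```python
import numpy as np

def best_candidate(template_feat, candidate_feats):
    """Pick the candidate most similar to the target template.

    template_feat:   (D,) embedding of the target template
    candidate_feats: (N, D) embeddings of N candidate windows
    """
    t = template_feat / np.linalg.norm(template_feat)
    c = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    sims = c @ t                      # cosine similarity per candidate
    return int(np.argmax(sims)), sims

idx, sims = best_candidate(np.random.randn(32), np.random.randn(10, 32))
print(idx)
```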
A review of heterogeneous data mining for brain disorders
With rapid advances in neuroimaging techniques, the research on brain
disorder identification has become an emerging area in the data mining
community. Brain disorder data poses many unique challenges for data mining
research. For example, the raw data generated by neuroimaging experiments is in
tensor representations, with typical characteristics of high dimensionality,
structural complexity and nonlinear separability. Furthermore, brain
connectivity networks can be constructed from the tensor data, embedding subtle
interactions between brain regions. Other clinical measures are usually
available reflecting the disease status from different perspectives. It is
expected that integrating complementary information in the tensor data and the
brain network data, and incorporating other clinical parameters will be
potentially transformative for investigating disease mechanisms and for
informing therapeutic interventions. Many research efforts have been devoted to
this area. They have achieved great success in various applications, such as
tensor-based modeling, subgraph pattern mining, and multi-view feature analysis. In
this paper, we review some recent data mining methods that are used for
analyzing brain disorders.
A Survey on Multi-View Clustering
With advances in information acquisition technologies, multi-view data have
become ubiquitous. Multi-view learning has thus become more and more popular
in the machine learning and data mining fields. Multi-view unsupervised or
semi-supervised learning, such as co-training and co-regularization, has
gained considerable attention. Although multi-view clustering (MVC) methods
have recently been developed rapidly, no survey has summarized and analyzed
the current progress. Therefore, this paper reviews the common
strategies for combining multiple views of data and based on this summary we
propose a novel taxonomy of the MVC approaches. We further discuss the
relationships between MVC and multi-view representation, ensemble clustering,
multi-task clustering, multi-view supervised and semi-supervised learning.
Several representative real-world applications are elaborated. To promote
future development of MVC, we envision several open problems that may require
further investigation and thorough examination. Comment: 17 pages, 4 figures