Learning to Measure Change: Fully Convolutional Siamese Metric Networks for Scene Change Detection
A critical challenge in scene change detection is that noisy changes
generated by varying illumination, shadows, and camera viewpoint make the
variances of a scene difficult to define and measure, since the noisy changes
and the semantic ones are entangled. Following the intuitive idea of detecting changes by
directly comparing dissimilarities between a pair of features, we propose a
novel fully Convolutional Siamese metric Network (CosimNet) to measure changes
by customizing implicit metrics. To learn more discriminative metrics, we
utilize contrastive loss to reduce the distance between the unchanged feature
pairs and to enlarge the distance between the changed feature pairs.
Specifically, to address the issue of large viewpoint differences, we propose
Thresholded Contrastive Loss (TCL) with a more tolerant strategy to punish
noisy changes. We demonstrate the effectiveness of the proposed approach with
experiments on three challenging datasets: CDnet, PCD2015, and VL-CMU-CD. Our
approach is robust to many challenging conditions, such as illumination
changes and large viewpoint differences caused by camera motion and zooming. In
addition, we incorporate the distance metric into the segmentation framework
and validate the effectiveness through visualization of change maps and feature
distribution. The source code is available at
https://github.com/gmayday1997/ChangeDet. Comment: 10 pages, 12 figures
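As a sketch of the contrastive objective described in this abstract, the following PyTorch snippet implements a standard contrastive loss and a thresholded variant in the spirit of TCL; the margin, threshold value, label convention, and exact TCL form are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(feat_a, feat_b, label, margin=2.0):
    """Standard contrastive loss over feature pairs.

    label == 0 for unchanged pairs (pull features together),
    label == 1 for changed pairs (push them apart up to `margin`).
    """
    d = F.pairwise_distance(feat_a, feat_b)               # Euclidean distance
    pull = (1 - label) * d.pow(2)                         # unchanged: shrink d
    push = label * torch.clamp(margin - d, min=0).pow(2)  # changed: grow d
    return (pull + push).mean()

def thresholded_contrastive_loss(feat_a, feat_b, label, margin=2.0, tau=0.3):
    """TCL-style variant: tolerate distances below `tau` on unchanged pairs,
    so viewpoint-induced noisy changes are punished less harshly.
    (tau and this exact formulation are illustrative assumptions.)"""
    d = F.pairwise_distance(feat_a, feat_b)
    pull = (1 - label) * torch.clamp(d - tau, min=0).pow(2)
    push = label * torch.clamp(margin - d, min=0).pow(2)
    return (pull + push).mean()
```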
A Survey on Multi-View Clustering
With advances in information acquisition technologies, multi-view data have
become ubiquitous, and multi-view learning has grown increasingly popular in
the machine learning and data mining fields. Multi-view unsupervised and
semi-supervised learning methods, such as co-training and co-regularization,
have gained considerable attention. Although multi-view clustering (MVC)
methods have recently developed rapidly, no survey has yet summarized and
analyzed the current progress. Therefore, this paper reviews the common
strategies for combining multiple views of data and, based on this summary, we
propose a novel taxonomy of the MVC approaches. We further discuss the
relationships between MVC and multi-view representation, ensemble clustering,
multi-task clustering, multi-view supervised and semi-supervised learning.
Several representative real-world applications are elaborated. To promote
future development of MVC, we envision several open problems that may require
further investigation and thorough examination. Comment: 17 pages, 4 figures
Unsupervised Multi-modal Hashing for Cross-modal Retrieval
With the advantages of low storage cost and high efficiency, hashing has
received much attention in the domain of Big Data. In this paper, we
propose a novel unsupervised hashing method that directly preserves the
manifold structure of the data. To this end, both the semantic correlation in
the textual space and the local geometric structure in the visual space are
explored simultaneously in our framework. Besides, an ℓ2,1-norm constraint is
imposed on the projection
matrices to learn the discriminative hash function for each modality. Extensive
experiments are performed to evaluate the proposed method on three publicly
available datasets, and the results show that it achieves superior performance
over state-of-the-art methods. Comment: 4 pages, 4 figures
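For readers unfamiliar with the ℓ2,1-norm used here: it is the sum of the ℓ2 norms of a matrix's rows, and as a penalty it drives entire rows of a projection matrix to zero. A minimal PyTorch illustration follows; the matrix shapes and the weight lam are hypothetical.

```python
import torch

def l21_norm(W: torch.Tensor) -> torch.Tensor:
    """l2,1-norm: sum of the l2 norms of the rows of W.
    As a regularizer it zeroes out entire rows, which encourages the
    learned hash function to rely only on discriminative features."""
    return W.norm(dim=1).sum()

# Hypothetical projection matrices for the two modalities.
W_img = torch.randn(512, 64, requires_grad=True)  # visual features -> hash bits
W_txt = torch.randn(300, 64, requires_grad=True)  # textual features -> hash bits
lam = 0.01                                        # regularization weight (assumed)
reg = lam * (l21_norm(W_img) + l21_norm(W_txt))   # added to the hashing objective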
PM-GANs: Discriminative Representation Learning for Action Recognition Using Partial-modalities
Data of different modalities generally convey complementary but heterogeneous
information, and a more discriminative representation is often preferred by
combining multiple data modalities, such as RGB and infrared features. However,
in reality, obtaining both data channels is challenging due to many
limitations. For example, RGB surveillance cameras are often restricted
from private spaces, which conflicts with the need for abnormal activity
detection for personal security. As a result, using partial data channels to
build a full multi-modal representation is clearly desired. In this
paper, we propose a novel framework, Partial-modal Generative Adversarial
Networks (PM-GANs), that learns a full-modal representation using data from only partial
modalities. The full representation is achieved by a generated representation
in place of the missing data channel. Extensive experiments are conducted to
verify the performance of our proposed method on action recognition, compared
with four state-of-the-art methods. Meanwhile, a new Infrared-Visible Dataset
for action recognition is introduced; it will be the first publicly available
action dataset that contains paired infrared and visible spectrum data.
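A schematic reading of the partial-modality idea, sketched in PyTorch: a generator maps features of the available channel (here infrared) to a surrogate for the missing one (visible/RGB), and the two are concatenated into a full-modal representation. The layer sizes, architecture, and naming are assumptions for illustration; the paper's actual design differs.

```python
import torch
import torch.nn as nn

D_IR, D_RGB = 512, 512                # hypothetical feature dimensions

generator = nn.Sequential(            # infrared features -> surrogate RGB features
    nn.Linear(D_IR, 1024), nn.ReLU(),
    nn.Linear(1024, D_RGB),
)
discriminator = nn.Sequential(        # scores real vs. generated RGB features
    nn.Linear(D_RGB, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

ir_feat = torch.randn(8, D_IR)                     # available modality (batch of 8)
fake_rgb = generator(ir_feat)                      # stand-in for the missing channel
realism = discriminator(fake_rgb)                  # adversarial signal for training
full_repr = torch.cat([ir_feat, fake_rgb], dim=1)  # full-modal representation
```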
Deep Facial Expression Recognition: A Survey
With the transition of facial expression recognition (FER) from
laboratory-controlled to challenging in-the-wild conditions and the recent
success of deep learning techniques in various fields, deep neural networks
have increasingly been leveraged to learn discriminative representations for
automatic FER. Recent deep FER systems generally focus on two important issues:
overfitting caused by a lack of sufficient training data and
expression-unrelated variations, such as illumination, head pose and identity
bias. In this paper, we provide a comprehensive survey on deep FER, including
datasets and algorithms that provide insights into these intrinsic problems.
First, we describe the standard pipeline of a deep FER system with the related
background knowledge and suggestions of applicable implementations for each
stage. We then introduce the available datasets that are widely used in the
literature and provide accepted data selection and evaluation principles for
these datasets. For the state of the art in deep FER, we review existing novel
deep neural networks and related training strategies that are designed for FER
based on both static images and dynamic image sequences, and discuss their
advantages and limitations. Competitive performances on widely used benchmarks
are also summarized in this section. We then extend our survey to additional
related issues and application scenarios. Finally, we review the remaining
challenges and corresponding opportunities in this field as well as future
directions for the design of robust deep FER systems.
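To make the pipeline concrete, here is a minimal torchvision skeleton of the stages the survey describes, namely data augmentation followed by a deep backbone that predicts expression classes; the specific transforms, the ResNet-18 choice, and the seven-class output are illustrative assumptions.

```python
from torchvision import models, transforms

# Stage 1: preprocessing and augmentation. Face detection and alignment
# are assumed to have produced cropped face images already.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),     # mild pose jitter against head-pose bias
    transforms.ToTensor(),
])

# Stage 2: deep feature learning and classification, here a ResNet-18
# backbone with a 7-way head (the basic expression categories).
backbone = models.resnet18(num_classes=7)
```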
Audio Surveillance: a Systematic Review
Although surveillance systems are becoming increasingly ubiquitous in our
living environment, automated surveillance, currently based on the video
sensory modality and machine intelligence, often lacks the robustness and
reliability required in many real applications. To tackle this issue, audio
sensory devices have been taken into account, either alone or in combination with
video, giving birth, in the last decade, to a considerable amount of research.
In this paper, audio-based automated surveillance methods are organized into a
comprehensive survey: a general taxonomy, inspired by the more widespread video
surveillance field, is proposed in order to systematically describe the methods
covering background subtraction, event classification, object tracking and
situation analysis. For each of these tasks, all the significant works are
reviewed, detailing their pros and cons and the context for which they have
been proposed. Moreover, a specific section is devoted to audio features,
discussing their expressiveness and their employment in the above described
tasks. Differently from other surveys on audio processing and analysis, the
present one is specifically targeted at automated surveillance, highlighting
the target applications of each described method and providing the reader with
tables and schemes useful for retrieving the most suitable algorithms for a
specific requirement.
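As a small illustration of the audio-feature stage the survey discusses, the snippet below extracts MFCCs and their deltas with librosa, features commonly fed to the event-classification methods reviewed; the file name and parameter values are placeholders.

```python
import librosa

# Load a surveillance audio clip ("siren.wav" is a placeholder path).
y, sr = librosa.load("siren.wav", sr=16000)

# MFCCs are among the most widely used features in the surveyed work;
# delta features add temporal dynamics, useful for event classification.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
delta = librosa.feature.delta(mfcc)                   # first-order derivatives
```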
Learning Environmental Sounds with Multi-scale Convolutional Neural Network
Deep learning has dramatically improved the performance of sound
recognition. However, learning acoustic models directly from the raw waveform
is still challenging. Current waveform-based models generally use time-domain
convolutional layers to extract features. The features extracted by
single-size filters are insufficient for building a discriminative
representation of audio. In this paper, we propose a multi-scale convolution
operation, which obtains a better audio representation by improving the
frequency resolution and learning filters across all frequency areas. To
leverage both waveform-based features
and spectrogram-based features in a single model, we introduce a two-phase method
to fuse the different features. Finally, we propose a novel end-to-end network
called WaveMsNet based on the multi-scale convolution operation and two-phase
method. On the environmental sounds classification datasets ESC-10 and ESC-50,
the classification accuracies of our WaveMsNet achieve 93.75% and 79.10%
respectively, significantly improving on previous methods. Comment: accepted by IJCNN 2018
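A minimal PyTorch sketch of the multi-scale idea: parallel time-domain convolutions with different filter sizes over the raw waveform, concatenated channel-wise. The kernel sizes, stride, and channel counts are illustrative assumptions, not the WaveMsNet configuration.

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Parallel 1-D convolutions over the raw waveform; small kernels keep
    temporal detail while large kernels give finer frequency resolution."""
    def __init__(self, out_channels=32, kernel_sizes=(11, 51, 101)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(1, out_channels, k, stride=4, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, waveform):          # waveform: (batch, 1, samples)
        return torch.cat([b(waveform) for b in self.branches], dim=1)

feats = MultiScaleConv()(torch.randn(2, 1, 16000))   # -> (2, 96, 4001)
```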
Supervised Mixed Norm Autoencoder for Kinship Verification in Unconstrained Videos
Identifying kinship relations has garnered interest due to several
applications such as organizing and tagging the enormous amount of videos being
uploaded on the Internet. Existing research in kinship verification primarily
focuses on kinship prediction with image pairs. In this research, we propose a
new deep learning framework for kinship verification in unconstrained videos
using a novel Supervised Mixed Norm regularization Autoencoder (SMNAE). This
new autoencoder formulation introduces class-specific sparsity in the weight
matrix. The proposed three-stage SMNAE based kinship verification framework
utilizes the learned spatio-temporal representation in the video frames for
verifying kinship in a pair of videos. A new kinship video (KIVI) database of
more than 500 individuals with variations due to illumination, pose, occlusion,
ethnicity, and expression is collected for this research. It comprises a total
of 355 true kin video pairs with over 250,000 still frames. The effectiveness
of the proposed framework is demonstrated on the KIVI database and six existing
kinship databases. On the KIVI database, SMNAE yields video-based kinship
verification accuracy of 83.18% which is at least 3.2% better than existing
algorithms. The algorithm is also evaluated on six publicly available kinship
databases and compared with the best-reported results. It is observed that the
proposed SMNAE consistently yields the best results on all the databases. Comment: Accepted for publication in Transactions on Image Processing
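As a rough sketch of the mixed-norm idea, the toy PyTorch autoencoder below augments the reconstruction loss with a group-sparse penalty over column groups of the encoder weights; the layer sizes, the grouping, and the penalty weight are assumptions, and SMNAE's class-specific formulation is more involved.

```python
import torch
import torch.nn as nn

enc = nn.Linear(1024, 256)     # encoder (illustrative sizes)
dec = nn.Linear(256, 1024)     # decoder

def mixed_norm_penalty(W, groups):
    """l2 norm within each column group, summed (l1) across groups --
    a rough stand-in for the class-specific sparsity in SMNAE."""
    return sum(W[:, s:e].norm() for s, e in groups)

x = torch.randn(16, 1024)                    # a batch of frame features
groups = [(0, 512), (512, 1024)]             # hypothetical class-wise groups
recon = dec(torch.relu(enc(x)))
loss = nn.functional.mse_loss(recon, x) + 1e-3 * mixed_norm_penalty(enc.weight, groups)
```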
Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events
Audio-visual representation learning is an important task from the
perspective of designing machines with the ability to understand complex
events. To this end, we propose a novel multimodal framework that instantiates
multiple instance learning. We show that the learnt representations are useful
for classifying events and localizing their characteristic audio-visual
elements. The system is trained using only video-level event labels without any
timing information. An important feature of our method is its capacity to learn
from unsynchronized audio-visual events. We achieve state-of-the-art results on
a large-scale dataset of weakly-labeled audio event videos. Visualizations of
localized visual regions and audio segments substantiate our system's efficacy,
especially when dealing with noisy situations where modality-specific cues
appear asynchronously.
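The multiple-instance-learning aggregation at the heart of such weakly supervised training can be sketched in a few lines of PyTorch: per-segment scores are max-pooled into a video-level prediction, so a video counts as positive if any one segment is. This is a simplification of the paper's framework, with shapes chosen for illustration.

```python
import torch

def mil_video_score(instance_scores: torch.Tensor) -> torch.Tensor:
    """Max-pool per-segment event scores into a video-level score.
    Input: (batch, segments, classes); output: (batch, classes).
    Only video-level labels are needed to supervise the result."""
    return instance_scores.max(dim=1).values

scores = torch.rand(4, 10, 20)        # 4 videos, 10 segments, 20 event classes
video_pred = mil_video_score(scores)  # trainable against video-level labels
```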
Predicting Human Intentions from Motion Only: A 2D+3D Fusion Approach
In this paper, we address the new problem of predicting human intents.
There is neuro-psychological evidence that actions performed by humans are
anticipated by peculiar motor acts which are discriminant of the type of action
going to be performed afterwards. In other words, an actual intent can be
forecast by looking at the kinematics of the immediately preceding movement. To
prove this in a computational and quantitative manner, we devise a new
experimental setup where, without using contextual information, we predict
human intents all originating from the same motor act. We posit the problem as
a classification task and we introduce a new multi-modal dataset consisting of
a set of motion capture marker 3D data and 2D video sequences, where, by only
analysing very similar movements in both training and test phases, we are able
to predict the underlying intent, i.e., the future, never observed action. We
also present an extensive experimental evaluation as a baseline, customizing
state-of-the-art techniques for both 3D and 2D data analysis. Realizing that
video processing methods lead to inferior performance but provide complementary
information with respect to 3D data sequences, we developed a 2D+3D fusion
analysis in which we achieve better classification accuracies, attesting to the
superiority of the multimodal approach for the context-free prediction of human
intents. Comment: accepted as a poster at the 25th ACM Multimedia (ACM MM) 2017, Mountain
View, California, US
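A late-fusion scheme consistent with the abstract's 2D+3D analysis can be sketched as a weighted combination of per-modality class probabilities; the weight and class count below are illustrative assumptions, not the paper's configuration.

```python
import torch

# Hypothetical class probabilities for one sample from each classifier.
p_2d = torch.softmax(torch.randn(5), dim=0)   # video-based (2D) classifier
p_3d = torch.softmax(torch.randn(5), dim=0)   # motion-capture (3D) classifier

# Weighted score-level fusion; alpha > 0.5 reflects the stronger 3D modality.
alpha = 0.7
predicted_intent = torch.argmax(alpha * p_3d + (1 - alpha) * p_2d)
```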