Understanding the efficacy, reliability and resiliency of computer vision techniques for malware detection and future research directions
My research lies in the intersection of security and machine learning. This
overview summarizes one component of my research: combining computer vision
with malware exploit detection for enhanced security solutions. I will present
the perspectives of efficacy, reliability and resiliency to formulate threat
detection as computer vision problems and develop state-of-the-art image-based
malware classification. Representing malware binary as images provides a direct
visualization of data samples, reduces the efforts for feature extraction, and
consumes the whole binary for holistic structural analysis. Applying transfer
learning of deep neural networks that are effective for large-scale image
classification to malware classification demonstrates superior classification
efficacy compared with classical machine learning algorithms. To enhance the
reliability of these vision-based malware detectors, interpretation frameworks
can be constructed on the malware visual representations and used to extract
faithful explanations, so that security practitioners have confidence in the
model before deployment. In cyber-security applications, we should always
assume that a malware writer constantly modifies code to bypass detection.
Addressing the resiliency of malware detectors is as important as their
efficacy and reliability. By understanding the attack surfaces of machine
learning models used for malware detection, we can greatly improve the
robustness of the algorithms to combat malware adversaries in the wild.
Finally, I will discuss future research directions worth pursuing in this
research community.
Comment: Repor
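The binary-to-image representation described above can be sketched directly: each byte of the executable becomes one grayscale pixel. The function name and the fixed row width below are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def bytes_to_image(raw: bytes, width: int = 64) -> np.ndarray:
    """Interpret each byte of a binary as one grayscale pixel (0-255)."""
    data = np.frombuffer(raw, dtype=np.uint8)
    height = len(data) // width          # drop any trailing partial row
    return data[: height * width].reshape(height, width)

img = bytes_to_image(bytes(range(256)) * 8, width=64)
print(img.shape)  # (32, 64)
```

The resulting 2-D array can be fed to any pretrained image classifier, which is what makes transfer learning from large-scale image classification applicable.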
Deep Ordinal Hashing with Spatial Attention
Hashing has attracted increasing research attention in recent years due to
its computational and storage efficiency in image retrieval. Recent works
have demonstrated the superiority of simultaneous feature representations and
hash functions learning with deep neural networks. However, most existing deep
hashing methods directly learn the hash functions by encoding the global
semantic information, while ignoring the local spatial information of images.
The loss of local spatial structure creates a performance bottleneck for hash
functions, limiting their application to accurate similarity retrieval. In
this work, we propose a novel Deep Ordinal Hashing (DOH) method,
which learns ordinal representations by leveraging the ranking structure of
feature space from both local and global views. In particular, to effectively
build the ranking structure, we propose to learn the rank correlation space by
exploiting the local spatial information from Fully Convolutional Network (FCN)
and the global semantic information from the Convolutional Neural Network (CNN)
simultaneously. More specifically, an effective spatial attention model is
designed to capture the local spatial information by selectively attending to
well-specified locations closely related to target objects. In this hashing
framework, the local spatial and global semantic nature of images is captured
in an end-to-end ranking-to-hashing manner. Experimental results conducted on
three widely-used datasets demonstrate that the proposed DOH method
significantly outperforms state-of-the-art hashing methods.
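A generic softmax spatial-attention pooling of the kind described above can be sketched as follows; the mean-based saliency score is an assumption for illustration, not DOH's exact attention model.

```python
import numpy as np

def spatial_attention_pool(fmap: np.ndarray) -> np.ndarray:
    """Attend over an (H, W, C) FCN feature map and pool to a (C,) vector."""
    scores = fmap.mean(axis=-1)                    # per-location saliency
    w = np.exp(scores - scores.max())
    w /= w.sum()                                   # softmax over all locations
    return (fmap * w[..., None]).sum(axis=(0, 1))  # attention-weighted pooling

pooled = spatial_attention_pool(np.random.rand(7, 7, 16))
print(pooled.shape)  # (16,)
```

Because the weights form a convex combination over locations, salient regions dominate the pooled descriptor while background locations are suppressed.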
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up cross-modal retrieval, a number of binary representation learning
methods have been proposed to map different modalities of data into a common
Hamming
space. Then, we introduce several multimodal datasets in the community, and
show the experimental results on two commonly used multimodal datasets. The
comparison reveals the characteristic of different kinds of cross-modal
retrieval methods, which is expected to benefit both practical applications and
future research. Finally, we discuss open problems and future research
directions.
Comment: 20 pages, 11 figures, 9 tables
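Once both modalities are mapped into a common Hamming space, retrieval reduces to ranking database codes by Hamming distance to the query code. A minimal sketch (using 0/1 bit arrays rather than packed codes, for clarity):

```python
import numpy as np

def hamming_rank(query: np.ndarray, db: np.ndarray) -> np.ndarray:
    """Rank database codes (N, B) by Hamming distance to a query code (B,)."""
    dists = (db != query).sum(axis=1)       # number of differing bits per item
    return np.argsort(dists, kind="stable")

db = np.array([[0, 1, 1, 0], [1, 1, 1, 1], [0, 1, 0, 0]])
order = hamming_rank(np.array([0, 1, 1, 0]), db)
print(order)  # [0 2 1]
```

In production systems the bits are packed into machine words and compared with XOR plus popcount, which is what gives hashing its speed advantage over real-valued similarity search.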
From BoW to CNN: Two Decades of Texture Representation for Texture Classification
Texture is a fundamental characteristic of many types of images, and texture
representation is one of the essential and challenging problems in computer
vision and pattern recognition which has attracted extensive research
attention. Since 2000, texture representations based on Bag of Words (BoW) and
on Convolutional Neural Networks (CNNs) have been extensively studied with
impressive performance. Given this period of remarkable evolution, this paper
aims to present a comprehensive survey of advances in texture representation
over the last two decades. More than 200 major publications are cited in this
survey covering different aspects of the research, which includes (i) problem
description; (ii) recent advances in the broad categories of BoW-based,
CNN-based and attribute-based methods; and (iii) evaluation issues,
specifically benchmark datasets and state-of-the-art results. Looking back at
what has been achieved so far, the survey discusses open challenges and
directions for future research.
Comment: Accepted by IJC
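The BoW pipeline covered by the survey quantizes local descriptors against a learned codebook and histograms the assignments; a minimal sketch with nearest-codeword assignment (the codebook would normally come from k-means):

```python
import numpy as np

def bow_histogram(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Quantize (N, D) local descriptors against a (K, D) codebook."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)                       # nearest codeword per patch
    hist = np.bincount(assign, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                         # normalized BoW histogram

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
h = bow_histogram(np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 0.9]]), codebook)
print(h)  # [0.33333333 0.66666667]
```

CNN-based texture representations replace the hand-crafted local descriptors with learned convolutional features, but many still pool them in an orderless, BoW-like fashion.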
Nested Invariance Pooling and RBM Hashing for Image Instance Retrieval
The goal of this work is the computation of very compact binary hashes for
image instance retrieval. Our approach has two novel contributions. The first
one is Nested Invariance Pooling (NIP), a method inspired from i-theory, a
mathematical theory for computing group invariant transformations with
feed-forward neural networks. NIP is able to produce compact and
well-performing descriptors with visual representations extracted from
convolutional neural networks. We specifically incorporate scale, translation
and rotation invariances, but the scheme can be extended to arbitrary sets
of transformations. We also show that using moments of increasing order
throughout nesting is important. The NIP descriptors are then hashed to the
target code size (32-256 bits) with a Restricted Boltzmann Machine with a novel
batch-level regularization scheme specifically designed for the purpose of
hashing (RBMH). A thorough empirical evaluation against the state of the art
shows that the results obtained with both the NIP descriptors and the NIP+RBMH
hashes are consistently outstanding across a wide range of datasets.
Comment: Image Instance Retrieval, CNN, Invariant Representation, Hashing,
Unsupervised Learning, Regularization. arXiv admin note: text overlap with
arXiv:1601.0209
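One way to read the pooling with moments of increasing order is as generalized-mean pooling across a transformation group; this sketch is an interpretation of that idea, not the paper's exact operator.

```python
import numpy as np

def moment_pool(features: np.ndarray, order: int) -> np.ndarray:
    """Pool (T, D) descriptors of T transformed copies with an order-p moment.

    order=1 gives mean pooling; higher orders move toward max pooling. The
    result is invariant to permuting the transformed copies, hence to the
    transformation group sampled along axis 0.
    """
    return (np.abs(features) ** order).mean(axis=0) ** (1.0 / order)

feats = np.random.rand(8, 32)            # e.g. descriptors of 8 rotated copies
desc = moment_pool(feats, order=2)       # rotation-invariant (32,) descriptor
print(desc.shape)  # (32,)
```

Nesting means pooling over one group (say scale) first, then pooling the results over the next group (say rotation) with a higher-order moment, which is where the observed importance of increasing order comes in.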
Correlation Hashing Network for Efficient Cross-Modal Retrieval
Hashing is widely applied to approximate nearest neighbor search for
large-scale multimodal retrieval with storage and computation efficiency.
Cross-modal hashing improves the quality of hash coding by exploiting semantic
correlations across different modalities. Existing cross-modal hashing methods
first transform data into low-dimensional feature vectors, and then generate
binary codes by another separate quantization step. However, suboptimal hash
codes may be generated since the quantization error is not explicitly minimized
and the feature representation is not jointly optimized with the binary codes.
This paper presents a Correlation Hashing Network (CHN) approach to cross-modal
hashing, which jointly learns good data representation tailored to hash coding
and formally controls the quantization error. The proposed CHN is a hybrid deep
architecture that constitutes a convolutional neural network for learning good
image representations, a multilayer perceptron for learning good text
representations, two hashing layers for generating compact binary codes, and a
structured max-margin loss that ties all of these together to enable learning
of similarity-preserving, high-quality hash codes. Extensive empirical
study shows that CHN yields state-of-the-art cross-modal retrieval performance
on standard benchmarks.
Comment: 7 pages
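A common way to "formally control the quantization error" is to keep the codes continuous during training and penalize their distance from the binary vertices {-1, +1}. The generic penalty below is a sketch of that idea, not necessarily CHN's exact loss term.

```python
import numpy as np

def quantization_penalty(u: np.ndarray) -> float:
    """Mean squared distance of continuous codes (N, B) from {-1, +1}."""
    return float(((np.abs(u) - 1.0) ** 2).mean())

print(quantization_penalty(np.array([[0.5, -1.0]])))  # 0.125
```

Minimizing this penalty jointly with the similarity loss drives the network outputs toward binary values, so the final sign() thresholding loses little information.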
Neuronal Synchrony in Complex-Valued Deep Networks
Deep learning has recently led to great successes in tasks such as image
recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still
outmatched by the power and versatility of the brain, perhaps in part due to
the richer neuronal computations available to cortical circuits. The challenge
is to identify which neuronal mechanisms are relevant, and to find suitable
abstractions to model them. Here, we show how aspects of spike timing, long
hypothesized to play a crucial role in cortical information processing, could
be incorporated into deep networks to build richer, versatile representations.
We introduce a neural network formulation based on complex-valued neuronal
units that is not only biologically meaningful but also amenable to a variety
of deep learning frameworks. Here, units are attributed both a firing rate and
a phase, the latter indicating properties of spike timing. We show how this
formulation qualitatively captures several aspects thought to be related to
neuronal synchrony, including gating of information processing and dynamic
binding of distributed object representations. Focusing on the latter, we
demonstrate the potential of the approach in several simple experiments. Thus,
neuronal synchrony could be a flexible mechanism that fulfills multiple
functional roles in deep networks.
Comment: ICLR 2014, accepted to conference track. This version: added
proceedings note, minor addition
Face Attribute Prediction Using Off-the-Shelf CNN Features
Predicting attributes from face images in the wild is a challenging computer
vision problem. To automatically describe face attributes from face-containing
images, one traditionally needs to cascade three technical blocks, namely face
localization, facial descriptor construction, and attribute classification, in
a pipeline. As a typical classification problem, face attribute prediction
has been addressed using deep learning. Current state-of-the-art performance
was achieved by using two cascaded Convolutional Neural Networks (CNNs), which
were specifically trained to learn face localization and attribute description.
In this paper, we experiment with an alternative way of employing the power of
deep representations from CNNs. In combination with conventional face
localization techniques, we use off-the-shelf architectures trained for face
recognition to
build facial descriptors. Recognizing that the describable face attributes are
diverse, our face descriptors are constructed from different levels of the CNNs
for different attributes to best facilitate face attribute prediction.
Experiments on two large datasets, LFWA and CelebA, show that our approach is
entirely comparable to the state-of-the-art. Our findings not only demonstrate
an efficient face attribute prediction approach, but also raise an important
question: how to leverage the power of off-the-shelf CNN representations for
novel tasks.
Comment: In proceedings of the 2016 International Conference on Biometrics (ICB
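The per-attribute layer selection can be sketched as follows; the layer names and the attribute-to-layer mapping are invented for illustration, not the paper's actual choices.

```python
import numpy as np

# Hypothetical mapping: texture-like attributes from an earlier layer,
# expression-like attributes from a later, more semantic layer.
LAYER_FOR_ATTRIBUTE = {"blond_hair": "conv3", "smiling": "conv5"}

def attribute_descriptor(activations: dict, attribute: str) -> np.ndarray:
    """Average-pool the chosen layer's (H, W, C) map into a (C,) descriptor."""
    fmap = activations[LAYER_FOR_ATTRIBUTE[attribute]]
    return fmap.mean(axis=(0, 1))

acts = {"conv3": np.random.rand(28, 28, 256),
        "conv5": np.random.rand(7, 7, 512)}
print(attribute_descriptor(acts, "smiling").shape)  # (512,)
```

A simple per-attribute linear classifier on top of these pooled descriptors is then enough to compete with end-to-end trained attribute networks, which is the paper's main observation.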
Reward Learning from Narrated Demonstrations
Humans effortlessly "program" one another by communicating goals and desires
in natural language. In contrast, humans program robotic behaviours by
indicating desired object locations and poses to be achieved, by providing RGB
images of goal configurations, or supplying a demonstration to be imitated.
None of these methods generalize across environment variations, and they convey
the goal in awkward technical terms. This work proposes joint learning of
natural language grounding and instructable behavioural policies reinforced by
perceptual detectors of natural language expressions, grounded to the sensory
inputs of the robotic agent. Our supervision is narrated visual
demonstrations (NVD), which are visual demonstrations paired with verbal
narration (as opposed to being silent). We introduce a dataset of NVD where
teachers perform activities while describing them in detail. We map the
teachers' descriptions to perceptual reward detectors, and use them to train
corresponding behavioural policies in simulation. We empirically show that our
instructable agents (i) learn visual reward detectors using a small number of
examples by exploiting hard negative mined configurations from demonstration
dynamics, (ii) develop pick-and-place policies using learned visual reward
detectors, (iii) benefit from object-factorized state representations that
mimic the syntactic structure of natural language goal expressions, and (iv)
can execute behaviours that involve novel objects in novel locations at test
time, instructed by natural language.
Comment: The work has been accepted to the Conference on Computer Vision and
Pattern Recognition (CVPR) 201
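Point (i) relies on hard negative mining; a minimal version selects the negative configurations the current detector scores highest, so training focuses on the most confusing examples. This is a generic sketch, not the paper's exact procedure.

```python
import numpy as np

def hard_negatives(scores: np.ndarray, labels: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k negatives (label 0) with the highest detector scores."""
    neg = np.flatnonzero(labels == 0)
    return neg[np.argsort(-scores[neg], kind="stable")[:k]]

scores = np.array([0.9, 0.8, 0.2, 0.7])
labels = np.array([1, 0, 0, 0])
print(hard_negatives(scores, labels, 2))  # [1 3]
```

Mining such negatives from the demonstration dynamics themselves (nearby but incorrect configurations) is what lets the reward detectors learn from only a handful of positive examples.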
Deep Hashing with Category Mask for Fast Video Retrieval
This paper proposes an end-to-end deep hashing framework with a category mask
for fast video retrieval. We train our network in a supervised way by fully
exploiting inter-class diversity and intra-class identity. Classification loss
is optimized to maximize inter-class diversity, while an intra-pair loss is
introduced to learn representative intra-class identity. We investigate the
distribution of binary bits across categories and find that the effectiveness
of binary bits is highly correlated with data categories, and some bits may
degrade the classification performance for some categories. We then design a
hash-code generation scheme with a category mask to filter out bits with
negative contributions. Experimental results demonstrate that the proposed
method outperforms several state-of-the-art methods under various evaluation
metrics on public datasets.
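Applying the mask can be sketched as indexing a per-category binary keep-mask; identifying which bits are harmful per category is the paper's contribution and the masks are simply assumed given here.

```python
import numpy as np

def apply_category_mask(codes: np.ndarray, labels: np.ndarray,
                        masks: np.ndarray) -> np.ndarray:
    """Zero out bits flagged as harmful for each sample's category.

    codes: (N, B) binary codes, labels: (N,) category ids,
    masks: (C, B) 0/1 keep-masks, one row per category.
    """
    return codes * masks[labels]

codes = np.array([[1, 1, 0, 1], [1, 0, 1, 1]])
masks = np.array([[1, 1, 1, 0], [0, 1, 1, 1]])
print(apply_category_mask(codes, np.array([0, 1]), masks))
```

Masked-out bits simply stop contributing to the Hamming distance for that category, which is how filtering them improves retrieval without retraining the hash network.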