Cross-modal Subspace Learning for Fine-grained Sketch-based Image Retrieval
Sketch-based image retrieval (SBIR) is challenging due to the inherent
domain gap between sketches and photos. Compared with the pixel-perfect
depictions in photos, sketches are highly abstract, iconic renderings of the
real world. Matching sketches and photos directly using low-level visual cues
is therefore insufficient, since a common low-level subspace that traverses
semantically across the two modalities is non-trivial to establish. Most existing SBIR
studies do not directly tackle this cross-modal problem. This naturally
motivates us to explore the effectiveness of cross-modal retrieval methods in
SBIR, which have been applied successfully to image-text matching. In this
paper, we introduce and compare a series of state-of-the-art cross-modal
subspace learning methods and benchmark them on two recently released
fine-grained SBIR datasets. Through thorough examination of the experimental
results, we demonstrate that subspace learning can effectively model the
sketch-photo domain gap. In addition, we draw a few key insights to drive
future research.
Comment: Accepted by Neurocomputing
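Cross-modal subspace learning of this kind is commonly built on canonical correlation analysis (CCA), a standard baseline in such comparisons. Below is a minimal NumPy sketch of regularized CCA; the variable names, regularization constant, and toy data are illustrative assumptions, not details from the paper:

```python
import numpy as np

def cca(X, Y, d, reg=1e-3):
    """Regularized CCA: find projections Wx, Wy that maximize the
    correlation between X @ Wx and Y @ Wy (e.g. sketch vs. photo features)."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Whiten each view with the inverse Cholesky factor, then SVD the
    # whitened cross-covariance; its singular values are the correlations.
    Lx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Ly = np.linalg.inv(np.linalg.cholesky(Cyy))
    U, s, Vt = np.linalg.svd(Lx @ Cxy @ Ly.T)
    Wx = Lx.T @ U[:, :d]
    Wy = Ly.T @ Vt[:d].T
    return Wx, Wy, s[:d]
```

After projection, sketches and photos live in one subspace where nearest-neighbor search implements retrieval.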
Learning for Multi-Model and Multi-Type Fitting
Multi-model fitting has been extensively studied from the random sampling and
clustering perspectives. Most methods assume that only a single type/class of
model is present, and their generalizations to fitting multiple types of
models/structures simultaneously are non-trivial. The inherent challenges
include choice of types and numbers of models, sampling imbalance and parameter
tuning, all of which render conventional approaches ineffective. In this work,
we formulate the multi-model multi-type fitting problem as one of learning deep
feature embedding that is clustering-friendly. In other words, points of the
same clusters are embedded closer together through the network. For inference,
we apply K-means to cluster the data in the embedded feature space and model
selection is enabled by analyzing the K-means residuals. Experiments are
carried out on both synthetic and real world multi-type fitting datasets,
producing state-of-the-art results. Comparisons are also made on single-type
multi-model fitting tasks, with promising results.
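The inference step described above — K-means in the embedded feature space plus model selection from the K-means residuals — can be sketched in plain NumPy. The deterministic initialization, the elbow tolerance, and the toy data below are illustrative assumptions, not details from the paper:

```python
import numpy as np

def kmeans(X, k, iters=50):
    # Deterministic init: spread the initial centers across the data order
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    residual = ((X - centers[labels]) ** 2).sum()
    return labels, residual

def select_k(X, k_max=6, tol=0.2):
    # Model selection via K-means residuals: stop at the first k whose
    # extra cluster improves the residual by less than `tol` (relative).
    res = [kmeans(X, k)[1] for k in range(1, k_max + 1)]
    for k in range(1, k_max):
        if res[k] > (1 - tol) * res[k - 1]:
            return k
    return k_max
```

On a well-clustered embedding the residual drops sharply until the true number of models is reached, then flattens, which is what the elbow test detects.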
Interpretable Convolutional Neural Networks via Feedforward Design
The model parameters of convolutional neural networks (CNNs) are determined
by backpropagation (BP). In this work, we propose an interpretable feedforward
(FF) design without any BP as a reference. The FF design adopts a data-centric
approach. It derives network parameters of the current layer based on data
statistics from the output of the previous layer in a one-pass manner. To
construct convolutional layers, we develop a new signal transform, called the
Saab (Subspace Approximation with Adjusted Bias) transform. It is a variant of
the principal component analysis (PCA) with an added bias vector to annihilate
the activation nonlinearity. Multiple Saab transforms in cascade yield multiple
convolutional layers. As to fully-connected (FC) layers, we construct them
using a cascade of multi-stage linear least-squares regressors (LSRs). The
classification performance and robustness (against adversarial attacks) of BP-
and FF-designed CNNs applied to the MNIST and CIFAR-10 datasets are compared.
Finally, we comment on the relationship between BP and FF designs.
Comment: 32 pages
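The core idea of the Saab transform — PCA kernels plus a bias large enough that a subsequent ReLU never clips — can be illustrated in a few lines of NumPy. This is a simplified one-stage sketch under assumed conventions (a unit-norm DC kernel and a single shared bias), not the authors' full implementation:

```python
import numpy as np

def saab_fit(patches, k):
    """Fit k Saab kernels on flattened patches of shape (n_samples, dim)."""
    d = patches.shape[1]
    dc = np.ones(d) / np.sqrt(d)                  # DC kernel: constant direction
    ac = patches - (patches @ dc)[:, None] * dc   # remove the DC component
    ac = ac - ac.mean(0)                          # center before PCA
    _, _, Vt = np.linalg.svd(ac, full_matrices=False)
    kernels = np.vstack([dc, Vt[:k - 1]])         # DC + top AC (PCA) kernels
    resp = patches @ kernels.T
    # Bias chosen so every response is non-negative: ReLU(resp + bias)
    # equals resp + bias, i.e. the nonlinearity is annihilated.
    bias = max(0.0, -resp.min())
    return kernels, bias

def saab_transform(patches, kernels, bias):
    return patches @ kernels.T + bias
```

Because the kernels come from data statistics in one pass, no backpropagation is needed to determine them.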
Representation Learning with Deep Extreme Learning Machines for Efficient Image Set Classification
Efficient and accurate joint representation of a collection of images that
belong to the same class is a major research challenge for practical image set
classification. Existing methods either make prior assumptions about the data
structure, or perform heavy computations to learn structure from the data
itself. In this paper, we propose an efficient image set representation that
does not make any prior assumptions about the structure of the underlying data.
We learn the non-linear structure of image sets with Deep Extreme Learning
Machines (DELM) that are very efficient and generalize well even on a limited
number of training samples. Extensive experiments on a broad range of public
datasets for image set classification (Honda/UCSD, CMU Mobo, YouTube
Celebrities, Celebrity-1000, ETH-80) show that the proposed algorithm
consistently outperforms state-of-the-art image set classification methods both
in terms of speed and accuracy.
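The efficiency claim rests on the extreme learning machine recipe: hidden weights are random and fixed, so training reduces to a single closed-form least-squares solve. A single-layer NumPy sketch (a DELM stacks such layers; the ridge term and layer size here are illustrative assumptions):

```python
import numpy as np

def elm_train(X, Y, n_hidden=50, reg=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # random, never trained
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)                           # random nonlinear features
    # Output weights via ridge-regularized least squares (closed form)
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ Y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta
```

The single linear solve replaces iterative gradient descent, which is why such models train quickly even with few samples.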
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up cross-modal retrieval, a number of binary representation learning
methods have been proposed to map different modalities of data into a common
space. Then, we introduce several multimodal datasets in the community, and
show the experimental results on two commonly used multimodal datasets. The
comparison reveals the characteristics of different kinds of cross-modal
retrieval methods, which is expected to benefit both practical applications and
future research. Finally, we discuss open problems and future research
directions.
Comment: 20 pages, 11 figures, 9 tables
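The speed advantage of binary representation learning comes from retrieval in Hamming space: once both modalities are hashed to short binary codes, ranking reduces to bit comparisons. A toy NumPy sketch using a random projection as the hash function (real methods learn this projection; the sizes here are assumptions):

```python
import numpy as np

def binarize(X, W):
    # Hash: sign of a linear projection -> codes in {0, 1}
    return (X @ W > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes):
    # Rank database items by Hamming distance to the query code
    dist = (query_code[None, :] != db_codes).sum(1)
    return np.argsort(dist, kind="stable"), dist
```

Because the codes are short and the distance is a bit count, the database scan is far cheaper than real-valued nearest-neighbor search.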
Adversarial Cross-Modal Retrieval via Learning and Transferring Single-Modal Similarities
Cross-modal retrieval aims to retrieve relevant data across different
modalities (e.g., texts vs. images). The common strategy is to apply
element-wise constraints between manually labeled pair-wise items to guide the
generators to learn the semantic relationships between the modalities, so that
the similar items can be projected close to each other in the common
representation subspace. However, such constraints often fail to preserve the
semantic structure between unpaired but semantically similar items (e.g.,
unpaired items with the same class label are more similar than items with
different labels). To address the above problem, we propose a novel cross-modal
similarity transferring (CMST) method to learn and preserve the semantic
relationships between unpaired items in an unsupervised way. The key idea is to
learn the quantitative similarities in the single-modal representation subspaces,
and then transfer them to the common representation subspace to establish the
semantic relationships between unpaired items across modalities. Experiments
show that our method outperforms state-of-the-art approaches in both
class-based and pair-based retrieval tasks.
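The key idea — measure similarities within each modality, then use them as targets for the cross-modal similarities in the common space — can be written as a loss in a few lines. This is an illustrative NumPy sketch of one plausible such objective under assumed conventions (cosine similarity, an averaged target), not the authors' exact formulation:

```python
import numpy as np

def cosine_sim(X):
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def similarity_transfer_loss(common_img, common_txt, single_img, single_txt):
    # Target: similarities measured inside each single-modal subspace
    s_target = 0.5 * (cosine_sim(single_img) + cosine_sim(single_txt))
    # Cross-modal similarities in the common subspace
    Ci = common_img / np.linalg.norm(common_img, axis=1, keepdims=True)
    Ct = common_txt / np.linalg.norm(common_txt, axis=1, keepdims=True)
    s_cross = Ci @ Ct.T
    # Transfer: push the cross-modal similarities toward the target
    return ((s_cross - s_target) ** 2).mean()
```

Since the target covers all item pairs, unpaired but similar items receive a supervision signal that pairwise labels alone would miss.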
Deep Clustering With Intra-class Distance Constraint for Hyperspectral Images
The high dimensionality of hyperspectral images often results in the
degradation of clustering performance. Due to the powerful ability of deep
feature extraction and non-linear feature representation, the clustering
algorithm based on deep learning has become a hot research topic in the field
of hyperspectral remote sensing. However, most deep clustering algorithms for
hyperspectral images utilize deep neural networks as feature extractors without
considering prior knowledge constraints that are suitable for clustering. To
solve this problem, we propose an intra-class distance constrained deep
clustering algorithm for high-dimensional hyperspectral images. The proposed
algorithm constrains the feature mapping procedure of the auto-encoder network
by intra-class distance so that raw images are transformed from the original
high-dimensional space to the low-dimensional feature space that is more
conducive to clustering. Furthermore, the related learning process is treated
as a joint optimization problem of deep feature extraction and clustering.
Experimental results demonstrate the strong competitiveness of the proposed
algorithm in comparison with state-of-the-art clustering methods for
hyperspectral images.
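The constraint itself is simple to state: penalize the distance from each embedded point to its cluster centroid, and add that penalty to the auto-encoder reconstruction loss. A NumPy sketch of such a joint objective (the weight `lam` and the exact form are illustrative assumptions):

```python
import numpy as np

def intra_class_distance(Z, labels):
    # Mean squared distance of each embedded point to its cluster centroid
    total = 0.0
    for c in np.unique(labels):
        Zc = Z[labels == c]
        total += ((Zc - Zc.mean(0)) ** 2).sum()
    return total / len(Z)

def joint_loss(X, X_rec, Z, labels, lam=0.1):
    recon = ((X - X_rec) ** 2).mean()   # auto-encoder reconstruction term
    return recon + lam * intra_class_distance(Z, labels)
```

Minimizing the second term pulls same-cluster embeddings together, making the low-dimensional feature space more conducive to clustering.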
Attention-based Multi-instance Neural Network for Medical Diagnosis from Incomplete and Low Quality Data
One way to extract patterns from clinical records is to consider each patient
record as a bag with a varying number of instances in the form of symptoms.
Medical diagnosis then amounts to discovering the informative instances first
and mapping them to one or more diseases. In many cases, patients are
represented as vectors in some feature space, and a classifier is then applied
to generate diagnosis
results. However, in many real-world cases, data is often of low-quality due to
a variety of reasons, such as data consistency, integrity, completeness,
accuracy, etc. In this paper, we propose a novel approach, attention based
multi-instance neural network (AMI-Net), to perform single-disease
classification based only on the existing and valid information in real-world
outpatient records. For each patient, it takes a bag of instances as input and
outputs the bag label directly in an end-to-end way. An embedding layer is
adopted at the beginning, mapping instances into an embedding space that
represents the individual patient's condition. The
correlations among instances and their importance for the final classification
are captured by a multi-head attention transformer, instance-level
multi-instance pooling, and bag-level multi-instance pooling. The proposed
approach was tested on
two non-standardized and highly imbalanced datasets, one in the Traditional
Chinese Medicine (TCM) domain and the other in the Western Medicine (WM)
domain. Our preliminary results show that the proposed approach outperforms
all baselines by a significant margin.
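The attention-based multi-instance pooling at the core of such a model weights each instance (symptom) by a learned score before summing into a bag representation, so informative instances dominate the bag label. A minimal NumPy sketch of that pooling step, with randomly initialized stand-in attention parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_mil_pool(instances, V, w):
    """instances: (n_inst, d); V: (d, h); w: (h,).
    Returns the bag embedding (d,) and the attention weights (n_inst,)."""
    scores = np.tanh(instances @ V) @ w   # one score per instance
    a = softmax(scores)                   # normalized attention weights
    return a @ instances, a
```

Bags of any size yield a fixed-size embedding, which is what lets the model consume records with a varying number of valid entries.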
Speech Recognition by Machine, A Review
This paper presents a brief survey on Automatic Speech Recognition and
discusses the major themes and advances made in the past 60 years of research,
so as to provide a technological perspective and an appreciation of the
fundamental progress that has been accomplished in this important area of
speech communication. After years of research and development, the accuracy of
automatic speech recognition remains one of the important research challenges
(e.g., under variations of context, speakers, and environment). The design of a
speech recognition system requires careful attention to the following issues:
definition of various types of speech classes, speech representation, feature
extraction techniques, speech classifiers, databases, and performance
evaluation.
The problems existing in ASR and the various techniques constructed by
researchers to solve them are presented in chronological order. The authors
hope that this work will be a contribution to the area of speech recognition.
The objective of this review paper is to
summarize and compare some of the well known methods used in various stages of
speech recognition system and identify research topic and applications which
are at the forefront of this exciting and challenging field.
Comment: 25 pages, IEEE format, International Journal of Computer Science and
Information Security, IJCSIS December 2009, ISSN 1947 5500,
http://sites.google.com/site/ijcsis
Orthogonal Deep Features Decomposition for Age-Invariant Face Recognition
As facial appearance is subject to significant intra-class variations caused
by the aging process over time, age-invariant face recognition (AIFR) remains a
major challenge in the face recognition community. To reduce the intra-class
discrepancy caused by aging, in this paper we propose a novel approach
(namely, Orthogonal Embedding CNNs, or OE-CNNs) to learn the age-invariant deep
face features. Specifically, we decompose deep face features into two
orthogonal components to represent age-related and identity-related features.
As a result, identity-related features that are robust to aging are then used
for AIFR. In addition, to complement the existing cross-age datasets and
advancing the research in this field, we construct a brand-new large-scale
Cross-Age Face dataset (CAF). Extensive experiments conducted on the three
public domain face aging datasets (MORPH Album 2, CACD-VS and FG-NET) have
shown the effectiveness of the proposed approach and the value of the
constructed CAF dataset for AIFR. Benchmarking our algorithm on LFW, one of
the most popular general face recognition (GFR) datasets, additionally
demonstrates comparable generalization performance on GFR.
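One concrete way to realize such an orthogonal split is to separate a feature vector's norm (radial component) from its direction (angular component) and match identities on direction alone. The NumPy sketch below illustrates that idea only; it is a simplified stand-in, not the paper's exact decomposition:

```python
import numpy as np

def decompose(x):
    # Split a feature into a radial part (norm) and an angular part (direction);
    # the two parts are orthogonal in the sense that scaling changes only one.
    r = np.linalg.norm(x)
    return r, x / r

def identity_similarity(x1, x2):
    # Compare direction components only, so magnitude changes
    # (standing in here for age-related variation) are ignored
    _, u1 = decompose(x1)
    _, u2 = decompose(x2)
    return float(u1 @ u2)
```

Under this split, any variation absorbed by the radial component leaves the identity comparison untouched.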