1,252 research outputs found
Feature Selection: A Data Perspective
Feature selection, as a data preprocessing strategy, has been proven to be
effective and efficient in preparing data (especially high-dimensional data)
for various data mining and machine learning problems. The objectives of
feature selection include: building simpler and more comprehensible models,
improving data mining performance, and preparing clean, understandable data.
The recent proliferation of big data has presented some substantial challenges
and opportunities to feature selection. In this survey, we provide a
comprehensive and structured overview of recent advances in feature selection
research. Motivated by current challenges and opportunities in the era of big
data, we revisit feature selection research from a data perspective and review
representative feature selection algorithms for conventional data, structured
data, heterogeneous data and streaming data. Methodologically, to emphasize the
differences and similarities of most existing feature selection algorithms for
conventional data, we categorize them into four main groups: similarity based,
information theoretical based, sparse learning based and statistical based
methods. To facilitate and promote the research in this community, we also
present an open-source feature selection repository that consists of most of
the popular feature selection algorithms
(\url{http://featureselection.asu.edu/}). Also, we use it as an example to show
how to evaluate feature selection algorithms. At the end of the survey, we
present a discussion about some open problems and challenges that require more
attention in future research
Semi-supervised Ranking Pursuit
We propose a novel sparse preference learning/ranking algorithm. Our
algorithm approximates the true utility function by a weighted sum of basis
functions using the squared loss on pairs of data points, and is a
generalization of the kernel matching pursuit method. It can operate both in a
supervised and a semi-supervised setting and allows efficient search for
multiple, near-optimal solutions. Furthermore, we describe the extension of the
algorithm suitable for combined ranking and regression tasks. In our
experiments we demonstrate that the proposed algorithm outperforms several
state-of-the-art learning methods when taking into account unlabeled data and
performs comparably in a supervised learning scenario, while providing sparser
solutions
Robust Sparse Coding via Self-Paced Learning
Sparse coding (SC) is attracting more and more attention due to its
comprehensive theoretical studies and its excellent performance in many signal
processing applications. However, most existing sparse coding algorithms are
nonconvex and are thus prone to becoming stuck in bad local minima,
especially when there are outliers and noisy data. To enhance the learning
robustness, in this paper, we propose a unified framework named Self-Paced
Sparse Coding (SPSC), which gradually includes matrix elements into SC learning
from easy to complex. We also generalize the self-paced learning schema into
different levels of dynamic selection on samples, features and elements
respectively. Experimental results on real-world data demonstrate the efficacy
of the proposed algorithms.
Comment: submitted to AAAI201
Hypergraph p-Laplacian Regularization for Remote Sensing Image Recognition
It is of great importance to preserve locality and similarity information in
semi-supervised learning (SSL) based applications. Graph based SSL and manifold
regularization based SSL including Laplacian regularization (LapR) and
Hypergraph Laplacian regularization (HLapR) are representative SSL methods and
have achieved prominent performance by exploiting the relationship of sample
distribution. However, it is still a great challenge to exactly explore and
exploit the local structure of the data distribution. In this paper, we present
an efficient and effective approximation algorithm of Hypergraph p-Laplacian and
then propose Hypergraph p-Laplacian regularization (HpLapR) to preserve the
geometry of the probability distribution. In particular, p-Laplacian is a
nonlinear generalization of the standard graph Laplacian and Hypergraph is a
generalization of a standard graph. Therefore, the proposed HpLapR provides
more potential for exploiting and preserving the local structure. We apply HpLapR to
logistic regression and implement it for remote sensing image
recognition. We compare the proposed HpLapR to several popular manifold
regularization based SSL methods including LapR, HLapR and HpLapR on the UC-Merced
dataset. The experimental results demonstrate the superiority of the proposed
HpLapR.
Comment: 9 pages, 6 figures
Unsupervised Feature Selection via Multi-step Markov Transition Probability
Feature selection is a widely used dimension reduction technique to select
feature subsets because of its interpretability. Many methods have been
proposed and achieved good results, in which the relationships between adjacent
data points are mainly concerned. But the possible associations between data
pairs that may not be adjacent are always neglected. Different from previous
methods, we propose a novel and very simple approach for unsupervised feature
selection, named MMFS (Multi-step Markov transition probability for Feature
Selection). The idea is to use multi-step Markov transition probability to
describe the relation between any data pair. Two ways from the positive and
negative viewpoints are employed respectively to keep the data structure after
feature selection. From the positive viewpoint, the maximum transition
probability that can be reached in a certain number of steps is used to
describe the relation between two points. Then, the features which can keep the
compact data structure are selected. From the negative viewpoint, the
minimum transition probability that can be reached in a certain number of steps
is used to describe the relation between two points. In this case, the
features that least maintain the loose data structure are selected. The two
ways can also be combined. Thus three algorithms are proposed. Our main
contributions are a novel feature selection approach which uses multi-step
transition probability to characterize the data structure, and three algorithms
proposed from the positive and negative aspects for keeping data structure. The
performance of our approach is compared with the state-of-the-art methods on
eight real-world data sets, and the experimental results show that the proposed
MMFS is effective in unsupervised feature selection
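
The multi-step relation described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the Gaussian similarity, the `sigma` parameter, and the function names are assumptions made for illustration, and "maximum/minimum transition probability reachable in a certain number of steps" is read here as the elementwise maximum/minimum over the first few powers of a row-normalized transition matrix.

```python
import math

def transition_matrix(points, sigma=1.0):
    """One-step transition matrix: row-normalized Gaussian similarities.

    The Gaussian kernel and sigma are illustrative assumptions.
    Self-transitions are excluded by setting the diagonal to zero.
    """
    n = len(points)
    P = []
    for i in range(n):
        row = [
            0.0 if i == j
            else math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(row)
        P.append([w / total for w in row])
    return P

def matmul(A, B):
    """Plain square-matrix product."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def multi_step_relations(P, steps):
    """Elementwise max and min of P^1, ..., P^steps.

    The max plays the role of the 'positive' relation and the min the
    'negative' relation between every pair of points.
    """
    n = len(P)
    power = [row[:] for row in P]
    best = [row[:] for row in P]
    worst = [row[:] for row in P]
    for _ in range(steps - 1):
        power = matmul(power, P)
        for i in range(n):
            for j in range(n):
                best[i][j] = max(best[i][j], power[i][j])
                worst[i][j] = min(worst[i][j], power[i][j])
    return best, worst
```

A feature-scoring step would then compare these relations before and after restricting the data to a candidate feature subset; that step is omitted because the abstract does not specify it.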
Structure fusion based on graph convolutional networks for semi-supervised classification
Faced with the diversity and complexity of multi-view data in
semi-supervised classification, most existing graph convolutional networks
focus on network architecture construction or salient graph structure
preservation, and ignore the contribution of the complete graph structure to
semi-supervised classification. To mine a more complete distribution structure
from multi-view data with the consideration of the specificity and the
commonality, we propose structure fusion based on graph convolutional networks
(SF-GCN) for improving the performance of semi-supervised classification.
SF-GCN can not only retain the special characteristic of each view data by
spectral embedding, but also capture the common style of multi-view data by
distance metric between multi-graph structures. Assuming a linear relationship
between multi-graph structures, we can construct the optimization function of
structure fusion model by balancing the specificity loss and the commonality
loss. By solving this function, we can simultaneously obtain the fusion
spectral embedding from the multi-view data and the fusion structure as
an adjacency matrix to input into graph convolutional networks for semi-supervised
classification. Experiments demonstrate that SF-GCN outperforms
state-of-the-art methods on three challenging citation network datasets:
Cora, Citeseer and Pubmed.
Enhancing Person Re-identification in a Self-trained Subspace
Despite the promising progress made in recent years, person re-identification
(re-ID) remains a challenging task due to the complex variations in human
appearances from different camera views. For this challenging problem, a large
variety of algorithms have been developed in the fully-supervised setting,
requiring access to a large amount of labeled training data. However, the main
bottleneck for fully-supervised re-ID is the limited availability of labeled
training samples. To address this problem, in this paper, we propose a
self-trained subspace learning paradigm for person re-ID which effectively
utilizes both labeled and unlabeled data to learn a discriminative subspace
where person images across disjoint camera views can be easily matched. The
proposed approach first constructs pseudo pairwise relationships among
unlabeled persons using the k-nearest neighbors algorithm. Then, with the
pseudo pairwise relationships, the unlabeled samples can be easily combined
with the labeled samples to learn a discriminative projection by solving an
eigenvalue problem. In addition, we refine the pseudo pairwise relationships
iteratively, which further improves the learning performance. A multi-kernel
embedding strategy is also incorporated into the proposed approach to cope with
the non-linearity in a person's appearance and explore the complementarity of
multiple kernels. In this way, the performance of person re-ID can be greatly
enhanced when training data are insufficient. Experimental results on six
widely-used datasets demonstrate the effectiveness of our approach and its
performance can be comparable to the reported results of most state-of-the-art
fully-supervised methods while using far fewer labeled data.
Comment: Accepted by ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)
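
The first step of this abstract, building pseudo pairwise relationships among unlabeled samples with k-nearest neighbors, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: Euclidean distance, the function name `knn_pseudo_pairs`, and treating each sample's k nearest neighbors as pseudo-matching pairs are all assumptions, and the later eigenvalue-based projection and iterative refinement are not shown.

```python
import math

def knn_pseudo_pairs(features, k):
    """Return unordered index pairs (i, j) where j is among the k nearest
    neighbors of i under Euclidean distance (an assumed metric)."""
    pairs = set()
    for i, x in enumerate(features):
        # Rank all other samples by distance to sample i.
        others = sorted(
            (j for j in range(len(features)) if j != i),
            key=lambda j: math.dist(x, features[j]),
        )
        for j in others[:k]:
            pairs.add((min(i, j), max(i, j)))
    return pairs
```

Storing pairs as sorted tuples in a set avoids counting the same relationship twice when two samples are each other's neighbors.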
Visual Understanding via Multi-Feature Shared Learning with Global Consistency
Image/video data is usually represented with multiple visual features. Fusion
of multi-source information for establishing the attributes has been widely
recognized. Multi-feature visual recognition has recently received much
attention in multimedia applications. This paper studies visual understanding
via a newly proposed l_2-norm based multi-feature shared learning framework,
which can simultaneously learn a global label matrix and multiple
sub-classifiers with the labeled multi-feature data. Additionally, a group
graph manifold regularizer composed of the Laplacian and Hessian graph is
proposed for better preserving the manifold structure of each feature, such
that the label prediction power is much improved through the semi-supervised
learning with global label consistency. For convenience, we call the proposed
approach Global-Label-Consistent Classifier (GLCC). The merits of the proposed
method include: 1) the manifold structure information of each feature is
exploited in learning, resulting in a more faithful classification owing to the
global label consistency; 2) a group graph manifold regularizer based on the
Laplacian and Hessian regularization is constructed; 3) an efficient
alternative optimization method is introduced as a fast solver owing to the
convex sub-problems. Experiments on several benchmark visual datasets for
multimedia understanding, such as the 17-category Oxford Flower dataset, the
challenging 101-category Caltech dataset, the YouTube & Consumer Videos dataset
and the large-scale NUS-WIDE dataset, demonstrate that the proposed approach
compares favorably with the state-of-the-art algorithms. An extensive
experiment on the deep convolutional activation features also shows the
effectiveness of the proposed approach. The code is available at
http://www.escience.cn/people/lei/index.html
Comment: 13 pages, 6 figures; this paper is accepted for publication in IEEE Transactions on Multimedia
Effective Discriminative Feature Selection with Non-trivial Solutions
Feature selection and feature transformation, the two main ways to reduce
dimensionality, are often presented separately. In this paper, a feature
selection method is proposed by combining the popular transformation based
dimensionality reduction method Linear Discriminant Analysis (LDA) and sparsity
regularization. We impose row sparsity on the transformation matrix of LDA
through $\ell_{2,1}$-norm regularization to achieve feature selection, and
the resultant formulation optimizes for selecting the most discriminative
features and removing the redundant ones simultaneously. The formulation is
extended to the $\ell_{2,p}$-norm regularized case, which is more likely to
offer better sparsity when $0<p<1$. Thus the formulation is a better
approximation to the feature selection problem. An efficient algorithm is
developed to solve the $\ell_{2,p}$-norm based optimization problem and it is
proved that the algorithm converges when $0<p\le 2$. Systematic experiments
are conducted to understand the workings of the proposed method. Promising
experimental results on various types of real-world data sets demonstrate the
effectiveness of our algorithm
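
Row-sparsity regularizers of the kind this abstract describes are commonly written as a mixed norm over the rows of the transformation matrix. The sketch below assumes the standard definition, $(\sum_i \|w_i\|_2^p)^{1/p}$ for rows $w_i$, and ranks features by row magnitude; the function names are hypothetical and the optimization algorithm from the paper is not reproduced.

```python
import math

def row_norms(W):
    """Euclidean norm of each row of W (one row per feature)."""
    return [math.sqrt(sum(x * x for x in row)) for row in W]

def l2p_norm(W, p=1.0):
    """Mixed norm (sum_i ||w_i||_2^p)^(1/p); p = 1 gives the l2,1-norm.

    For 0 < p < 1 this is a quasi-norm rather than a true norm.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    return sum(r ** p for r in row_norms(W)) ** (1.0 / p)

def rank_features(W):
    """Feature indices sorted by decreasing row norm; rows shrunk to zero
    by the regularizer correspond to discarded features."""
    norms = row_norms(W)
    return sorted(range(len(W)), key=lambda i: norms[i], reverse=True)
```

Because the penalty acts on whole rows, a feature is either kept across all projection directions or dropped across all of them, which is what makes the transformation matrix usable for feature selection.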
Learning with Low-Quality Data: Multi-View Semi-Supervised Learning with Missing Views
The focus of this thesis is on learning approaches for what we call "low-quality data" and in particular data in which only a small amount of labeled target data is available. The first part provides background discussion on low-quality data issues, followed by a preliminary study in this area. The remainder of the thesis focuses on a particular scenario: multi-view semi-supervised learning. Multi-view learning generally refers to the case of learning with data that has multiple natural views, or sets of features, associated with it. Multi-view semi-supervised learning methods try to exploit the combination of multiple views along with large amounts of unlabeled data in order to learn better predictive functions when limited labeled data is available. However, lack of complete view data limits the applicability of multi-view semi-supervised learning to real world data. Commonly, one data view is readily and cheaply available, but additional views may be costly or only available in some cases. This thesis work aims to make multi-view semi-supervised learning approaches more applicable to real world data specifically by addressing the issue of missing views through both feature generation and active learning, and addressing the issue of model selection for semi-supervised learning with limited labeled data. This thesis introduces a unified approach for handling missing view data in multi-view semi-supervised learning tasks, which applies to both data with completely missing additional views and data only missing views in some instances. The idea is to learn a feature generation function mapping one view to another with the mapping biased to encourage the features generated to be useful for multi-view semi-supervised learning algorithms. The mapping is then used to fill in views as pre-processing.
Unlike previously proposed single-view multi-view learning approaches, the proposed approach is able to take advantage of additional view data when available, and for the case of partial view presence is the first feature-generation approach specifically designed to take into account the multi-view semi-supervised learning aspect. The next component of this thesis is the analysis of an active view completion scenario. In some tasks, it is possible to obtain missing view data for a particular instance, but with some associated cost. Recent work has shown an active selection strategy can be more effective than a random one. In this thesis, a better understanding of active approaches is sought, and it is demonstrated that the effectiveness of an active selection strategy over a random one can depend on the relationship between the views. Finally, an important component of making multi-view semi-supervised learning applicable to real world data is the task of model selection, an open problem which is often avoided entirely in previous work. For cases of very limited labeled training data the commonly used cross-validation approach can become ineffective. This thesis introduces a re-training alternative to the method-dependent approaches similar in motivation to cross-validation, that involves generating new training and test data by sampling from the large amount of unlabeled data and estimated conditional probabilities for the labels. The proposed approaches are evaluated on a variety of multi-view semi-supervised learning data sets, and the experimental results demonstrate their efficacy