Unsupervised feature selection by means of external validity indices
Feature selection for unsupervised data is a difficult task because a reference partition is not available to evaluate the relevance of the features. Recently, different proposals of methods for consensus clustering have used external validity indices to assess the agreement among partitions obtained by clustering algorithms with different parameter values. These indices are independent of the characteristics of the attributes describing the data, the way the partitions are represented, or the shape of the clusters. This independence makes it possible to use these measures to assess the similarity of partitions obtained with different subsets of attributes. As in supervised feature selection, the goal of unsupervised feature selection is to maintain the same patterns of the original data with less information. The hypothesis of this paper is that the clustering of the dataset with all the attributes, even when its quality is not perfect, can be used as the basis of a heuristic exploration of the space of subsets of features. The proposal is to use external validation indices as the specific measure to assess how well this information is preserved by a subset of the original attributes. Different external validation indices have been proposed in the literature. This paper presents experiments using the adjusted Rand, Jaccard and Fowlkes-Mallows indices. Artificially generated datasets are used to test the methodology under different experimental conditions, such as the number of clusters, cluster spatial separation and the ratio of irrelevant features. The methodology is also applied to real datasets chosen from the UCI machine learning repository.
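As a hedged illustration of the kind of index this abstract relies on, the adjusted Rand index between two partitions can be computed from their contingency table. The sketch below is a minimal pure-Python version (the function names and structure are our own, not the paper's); in practice one would compare the partition obtained on all attributes against the partition obtained on a candidate attribute subset.

```python
from collections import Counter

def comb2(n):
    """Number of unordered pairs that can be drawn from n items."""
    return n * (n - 1) // 2

def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items.

    labels_a, labels_b: equal-length sequences of cluster labels.
    Returns 1.0 for identical partitions, about 0.0 for random agreement.
    """
    n = len(labels_a)
    # Contingency table: count of items in cluster i of A and cluster j of B.
    contingency = Counter(zip(labels_a, labels_b))
    row_sums = Counter(labels_a)
    col_sums = Counter(labels_b)
    index = sum(comb2(c) for c in contingency.values())
    sum_rows = sum(comb2(c) for c in row_sums.values())
    sum_cols = sum(comb2(c) for c in col_sums.values())
    expected = sum_rows * sum_cols / comb2(n)
    max_index = (sum_rows + sum_cols) / 2
    return (index - expected) / (max_index - expected)
```

Note that the index depends only on the label vectors, not on the attributes or cluster shapes, which is exactly the independence property the abstract exploits.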
Stable Feature Selection for Biomarker Discovery
Feature selection techniques have been used as the workhorse in biomarker
discovery applications for a long time. Surprisingly, the stability of feature
selection with respect to sampling variations has long been under-considered.
It is only until recently that this issue has received more and more attention.
In this article, we review existing stable feature selection methods for
biomarker discovery using a generic hierarchical framework. We have two
objectives: (1) providing an overview on this new yet fast growing topic for a
convenient reference; (2) categorizing existing methods under an expandable
framework for future research and development.
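A common way to quantify the selection stability this review is concerned with is the average pairwise Jaccard similarity between the feature subsets selected on different resamples of the data. The helper below is our own minimal sketch of that measure, not a method from the review:

```python
from itertools import combinations

def jaccard(s, t):
    """Jaccard similarity between two feature subsets."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0
    return len(s & t) / len(s | t)

def selection_stability(subsets):
    """Average pairwise Jaccard similarity over the feature subsets
    selected on different (e.g. bootstrap) resamples of the data.
    1.0 means the selector always picks the same features."""
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(s, t) for s, t in pairs) / len(pairs)
```

For example, a selector that returns `{1, 2}` on one resample and `{1, 3}` on another scores 1/3, reflecting that only one of three distinct features is shared.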
Online Unsupervised Multi-view Feature Selection
In the era of big data, it is becoming common to have data with multiple
modalities or coming from multiple sources, known as "multi-view data".
Multi-view data are usually unlabeled and come from high-dimensional spaces
(such as language vocabularies), so unsupervised multi-view feature selection is
crucial to many applications. However, it is nontrivial due to the following
challenges. First, there are too many instances or the feature dimensionality
is too large. Thus, the data may not fit in memory. How to select useful
features with limited memory space? Second, how to select features from
streaming data and handle concept drift? Third, how to leverage the
consistent and complementary information from different views to improve the
feature selection in the situation when the data are too big or come in as
streams? To the best of our knowledge, none of the previous works can solve all
the challenges simultaneously. In this paper, we propose an Online unsupervised
Multi-View Feature Selection, OMVFS, which deals with large-scale/streaming
multi-view data in an online fashion. OMVFS embeds unsupervised feature
selection into a clustering algorithm via NMF with sparse learning. It further
incorporates the graph regularization to preserve the local structure
information and help select discriminative features. Instead of storing all the
historical data, OMVFS processes the multi-view data chunk by chunk and
aggregates all the necessary information into several small matrices. By using
the buffering technique, the proposed OMVFS can reduce the computational and
storage cost while taking advantage of the structure information. Furthermore,
OMVFS can capture the concept drifts in the data streams. Extensive experiments
on four real-world datasets show the effectiveness and efficiency of the
proposed OMVFS method. More importantly, OMVFS is about 100 times faster than
the off-line methods.
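The chunk-by-chunk aggregation idea can be illustrated independently of the NMF and graph-regularization details: rather than storing the stream, one accumulates small sufficient statistics (here a d x d feature Gram matrix) and scores features from them. This is our own minimal sketch of that pattern, not the OMVFS algorithm itself:

```python
def process_stream(chunks, d):
    """Accumulate a d x d feature Gram matrix chunk by chunk,
    holding at most one chunk of the stream in memory at a time."""
    gram = [[0.0] * d for _ in range(d)]
    for chunk in chunks:          # chunk: list of length-d feature rows
        for row in chunk:
            for i in range(d):
                for j in range(d):
                    gram[i][j] += row[i] * row[j]
    return gram

def feature_scores(gram):
    """Score each feature by its accumulated energy (Gram diagonal);
    higher-scoring features would be retained."""
    return [gram[i][i] for i in range(len(gram))]
```

The Gram matrix is d x d regardless of how many instances have streamed past, which is the same storage argument the abstract makes for its aggregated matrices.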
Superpixel-based Two-view Deterministic Fitting for Multiple-structure Data
This paper proposes a two-view deterministic geometric model fitting method,
termed Superpixel-based Deterministic Fitting (SDF), for multiple-structure
data. SDF starts from superpixel segmentation, which effectively captures prior
information of feature appearances. The feature appearances are beneficial to
reduce the computational complexity for deterministic fitting methods. SDF also
includes two original elements, i.e., a deterministic sampling algorithm and a
novel model selection algorithm. The two algorithms are tightly coupled to
boost the performance of SDF in both speed and accuracy. Specifically, the
proposed sampling algorithm leverages the grouping cues of superpixels to
generate reliable and consistent hypotheses. The proposed model selection
algorithm further makes use of desirable properties of the generated
hypotheses, to improve the conventional fit-and-remove framework for more
efficient and effective performance. The key characteristic of SDF is that it
can efficiently and deterministically estimate the parameters of model
instances in multi-structure data. Experimental results demonstrate that the
proposed SDF shows superiority over several state-of-the-art fitting methods
for real images with single-structure and multiple-structure data. Comment: Accepted by the European Conference on Computer Vision (ECCV).
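The conventional fit-and-remove framework that SDF improves on can be sketched with ordinary 2-D line fitting: repeatedly pick the hypothesis with the most inliers, remove those inliers, and continue until no strong structure remains. The toy version below is our own deterministic sketch, not the SDF algorithm; exhaustive point-pair hypothesis generation stands in for SDF's superpixel-guided sampling.

```python
import math
from itertools import combinations

def line_through(p, q):
    """Normalized line (a, b, c) with a*x + b*y + c = 0 through p and q."""
    a, b = q[1] - p[1], p[0] - q[0]
    norm = math.hypot(a, b)
    c = -(a * p[0] + b * p[1])
    return a / norm, b / norm, c / norm

def fit_and_remove(points, threshold=0.5, min_inliers=3):
    """Greedy multi-structure fitting: extract the line with the most
    inliers, remove its inliers, repeat while structures remain."""
    remaining = list(points)
    structures = []
    while len(remaining) >= min_inliers:
        best = None
        for p, q in combinations(remaining, 2):
            if p == q:            # skip degenerate duplicate-point pairs
                continue
            a, b, c = line_through(p, q)
            inliers = [r for r in remaining
                       if abs(a * r[0] + b * r[1] + c) <= threshold]
            if best is None or len(inliers) > len(best):
                best = inliers
        if best is None or len(best) < min_inliers:
            break
        structures.append(best)
        remaining = [r for r in remaining if r not in best]
    return structures
```

One weakness the abstract alludes to is visible even here: once inliers are removed, later structures are estimated from a depleted point set, which is what SDF's joint use of hypothesis properties aims to mitigate.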