
    Unsupervised feature selection by means of external validity indices

    Feature selection for unsupervised data is a difficult task because a reference partition is not available to evaluate the relevance of the features. Recently, different proposals of methods for consensus clustering have used external validity indices to assess the agreement among partitions obtained by clustering algorithms with different parameter values. These indices are independent of the characteristics of the attributes describing the data, the way the partitions are represented, or the shape of the clusters. This independence allows these measures to be used to assess the similarity of partitions obtained with different subsets of attributes. As in supervised feature selection, the goal of unsupervised feature selection is to maintain the same patterns of the original data with less information. The hypothesis of this paper is that the clustering of the dataset with all the attributes, even when its quality is not perfect, can be used as the basis of a heuristic exploration of the space of subsets of features. The proposal is to use external validation indices as the specific measure to assess how well this information is preserved by a subset of the original attributes. Different external validation indices have been proposed in the literature. This paper will present experiments using the adjusted Rand, Jaccard and Fowlkes-Mallows indices. Artificially generated datasets will be used to test the methodology under different experimental conditions, such as the number of clusters, the spatial separation of the clusters and the ratio of irrelevant features. The methodology will also be applied to real datasets chosen from the UCI machine learning repository.
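
    A minimal sketch of the idea described above: cluster with all attributes to obtain a reference partition, then score candidate feature subsets by an external validity index (here the adjusted Rand index) against that reference. The synthetic data, the choice of k-means, the number of clusters and the greedy forward search are illustrative assumptions, not the paper's exact procedure.

        from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs
        from sklearn.metrics import adjusted_rand_score

        X, _ = make_blobs(n_samples=300, n_features=10, centers=4, random_state=0)

        # Reference partition: cluster the data using all attributes.
        reference = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

        def subset_score(feature_idx):
            """Adjusted Rand agreement between the reference partition and the
            partition obtained using only the candidate feature subset."""
            labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X[:, feature_idx])
            return adjusted_rand_score(reference, labels)

        # Greedy forward search: keep adding the feature that best preserves
        # the reference partition, stop when the score no longer improves.
        selected, remaining = [], list(range(X.shape[1]))
        while remaining:
            best = max(remaining, key=lambda f: subset_score(selected + [f]))
            if selected and subset_score(selected + [best]) <= subset_score(selected):
                break
            selected.append(best)
            remaining.remove(best)
        print("selected features:", selected)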

    Stable Feature Selection for Biomarker Discovery

    Feature selection techniques have long served as the workhorse in biomarker discovery applications. Surprisingly, the stability of feature selection with respect to sampling variations has long been under-considered; it is only recently that this issue has received more and more attention. In this article, we review existing stable feature selection methods for biomarker discovery using a generic hierarchical framework. We have two objectives: (1) to provide an overview of this new yet fast-growing topic as a convenient reference; (2) to categorize existing methods under an expandable framework for future research and development.
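
    To make "stability with respect to sampling variations" concrete, the following sketch (not taken from the article) runs a simple univariate selector on bootstrap resamples and reports the average pairwise Jaccard overlap of the selected feature sets. The data, the selector and k are illustrative assumptions.

        from itertools import combinations
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.feature_selection import SelectKBest, f_classif

        X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

        rng = np.random.default_rng(0)
        selections = []
        for _ in range(20):
            idx = rng.integers(0, len(X), len(X))          # bootstrap resample
            mask = SelectKBest(f_classif, k=10).fit(X[idx], y[idx]).get_support()
            selections.append(frozenset(np.flatnonzero(mask)))

        # Average pairwise Jaccard similarity: 1.0 means the same features are
        # always chosen; values near 0 indicate unstable selection.
        jaccard = [len(a & b) / len(a | b) for a, b in combinations(selections, 2)]
        print("stability:", sum(jaccard) / len(jaccard))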

    Online Unsupervised Multi-view Feature Selection

    In the era of big data, it is becoming common to have data with multiple modalities or coming from multiple sources, known as "multi-view data". Since multi-view data are usually unlabeled and come from high-dimensional spaces (such as language vocabularies), unsupervised multi-view feature selection is crucial to many applications. However, it is nontrivial due to the following challenges. First, there may be too many instances or the feature dimensionality may be too large, so the data may not fit in memory: how to select useful features with limited memory space? Second, how to select features from streaming data and handle concept drift? Third, how to leverage the consistent and complementary information from different views to improve feature selection when the data are too big or come in as streams? To the best of our knowledge, none of the previous works solves all of these challenges simultaneously. In this paper, we propose an Online unsupervised Multi-View Feature Selection method, OMVFS, which deals with large-scale/streaming multi-view data in an online fashion. OMVFS embeds unsupervised feature selection into a clustering algorithm via NMF with sparse learning. It further incorporates graph regularization to preserve the local structure information and help select discriminative features. Instead of storing all the historical data, OMVFS processes the multi-view data chunk by chunk and aggregates all the necessary information into several small matrices. By using a buffering technique, OMVFS can reduce the computational and storage cost while taking advantage of the structure information. Furthermore, OMVFS can capture concept drifts in the data streams. Extensive experiments on four real-world datasets show the effectiveness and efficiency of the proposed OMVFS method. More importantly, OMVFS is about 100 times faster than the offline methods.
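
    A greatly simplified sketch of the chunk-wise idea (not OMVFS itself, which additionally uses sparse learning, graph regularization and buffering): factorize each incoming chunk of every view with NMF, score features by the weight they receive in the learned components, and accumulate scores online so the full stream never has to be kept in memory. The views, chunk sizes and scoring rule are illustrative assumptions.

        import numpy as np
        from sklearn.decomposition import NMF

        rng = np.random.default_rng(0)
        views = {"text": 100, "image": 60}               # feature dimensionality per view
        scores = {v: np.zeros(d) for v, d in views.items()}

        for chunk_id in range(5):                        # stand-in for a data stream of 5 chunks
            for view, dim in views.items():
                chunk = rng.random((200, dim))           # stand-in for one chunk of real data
                model = NMF(n_components=4, init="nndsvda", max_iter=300, random_state=0)
                model.fit(chunk)
                # A feature that contributes strongly to the learned components is
                # treated as more discriminative for clustering this view.
                scores[view] += np.linalg.norm(model.components_, axis=0)

        # Keep the 10 highest-scoring features per view.
        top = {v: np.argsort(s)[-10:][::-1].tolist() for v, s in scores.items()}
        print(top)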

    Superpixel-based Two-view Deterministic Fitting for Multiple-structure Data

    This paper proposes a two-view deterministic geometric model fitting method, termed Superpixel-based Deterministic Fitting (SDF), for multiple-structure data. SDF starts from superpixel segmentation, which effectively captures prior information about feature appearances. The feature appearances help reduce the computational complexity of deterministic fitting methods. SDF also includes two original elements, i.e., a deterministic sampling algorithm and a novel model selection algorithm. The two algorithms are tightly coupled to boost the performance of SDF in both speed and accuracy. Specifically, the proposed sampling algorithm leverages the grouping cues of superpixels to generate reliable and consistent hypotheses. The proposed model selection algorithm further makes use of desirable properties of the generated hypotheses to improve the conventional fit-and-remove framework for more efficient and effective performance. The key characteristic of SDF is that it can efficiently and deterministically estimate the parameters of model instances in multiple-structure data. Experimental results demonstrate that the proposed SDF shows superiority over several state-of-the-art fitting methods for real images with single-structure and multiple-structure data. Comment: Accepted by the European Conference on Computer Vision (ECCV).
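
    For context, the conventional fit-and-remove framework that SDF improves upon can be sketched as below: fit one model robustly, remove its inliers, and repeat until no well-supported structure remains. This is a baseline illustration only; SDF's superpixel-guided sampling and model selection are not reproduced here, and the 2D line data, RANSAC estimator and thresholds are illustrative assumptions.

        import numpy as np
        from sklearn.linear_model import RANSACRegressor

        rng = np.random.default_rng(0)
        x = rng.uniform(-5, 5, 300)
        # Two line structures (2x + 1 and -x + 4) plus small Gaussian noise.
        y = np.where(np.arange(300) < 150, 2 * x + 1, -x + 4) + rng.normal(0, 0.1, 300)

        points_x, points_y = x.copy(), y.copy()
        models = []
        while len(points_x) > 30:                          # stop when too few points remain
            ransac = RANSACRegressor(residual_threshold=0.5, random_state=0)
            ransac.fit(points_x.reshape(-1, 1), points_y)
            inliers = ransac.inlier_mask_
            if inliers.sum() < 30:                         # no well-supported structure left
                break
            models.append((ransac.estimator_.coef_[0], ransac.estimator_.intercept_))
            # Remove the inliers of the recovered structure and fit the next one.
            points_x, points_y = points_x[~inliers], points_y[~inliers]

        print("recovered lines (slope, intercept):", models)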