70,508 research outputs found

    Unsupervised feature selection for noisy data

    Feature selection techniques are widely applied in a variety of data analysis tasks in order to reduce dimensionality. Depending on the type of learning, feature selection algorithms are categorized as supervised or unsupervised. In unsupervised learning scenarios, selecting features is a much harder problem, owing to the lack of class labels that would facilitate the search for relevant features. The difficulty is amplified when the data is corrupted by noise. Almost all traditional unsupervised feature selection methods are not robust against noise in the samples: they have no explicit mechanism for detaching and isolating the noise, so they cannot produce an optimal feature subset. In this article, we propose an unsupervised approach for feature selection on noisy data, called Robust Independent Feature Selection (RIFS). Specifically, we choose the feature subset that contains most of the underlying information, using the same criteria as Independent Component Analysis (ICA), while the noise is simultaneously separated out as an independent component. The isolation of representative noise samples is achieved using factor oblique rotation, whereas noise identification is performed using factor pattern loadings. Extensive experimental results over diverse real-life data sets show the efficiency and advantage of the proposed algorithm. We thankfully acknowledge the support of the Comisión Interministerial de Ciencia y Tecnología (CICYT) under contract No. TIN2015-65316-P, which has partially funded this work.
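    The RIFS procedure itself (oblique rotation of the factor solution and inspection of pattern loadings) is specific to the paper; what follows is only a minimal sketch of the ICA side of the idea, assuming scikit-learn's FastICA and a simple, hypothetical mixing-weight ranking rather than the authors' method:

```python
import numpy as np
from sklearn.decomposition import FastICA

def ica_feature_ranking(X, n_components=5):
    # Estimate independent components, then score every original
    # feature by the total magnitude of its mixing weights; features
    # that load heavily on the components carry most of the
    # underlying information (a rough proxy, not the paper's
    # oblique-rotation / pattern-loading analysis).
    ica = FastICA(n_components=n_components, random_state=0)
    ica.fit(X)                       # X: (n_samples, n_features)
    A = ica.mixing_                  # (n_features, n_components)
    scores = np.abs(A).sum(axis=1)
    return np.argsort(scores)[::-1]  # most informative features first

# Usage: keep the top-k ranked features.
X = np.random.randn(200, 30)
top10 = ica_feature_ranking(X)[:10]
```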

    Unsupervised morphological segmentation for images

    This paper deals with a morphological approach to unsupervised image segmentation. The proposed technique relies on a multiscale top-down approach allowing hierarchical processing of the data, ranging from the most global scale to the most detailed one. At each scale, the algorithm consists of four steps: image simplification, feature extraction, contour localization, and quality estimation. The main emphasis of this paper is the selection of a simplification filter for segmentation. Morphological filters based on reconstruction prove to be very efficient for this purpose. The resulting unsupervised algorithm is very robust and can deal with very different types of images.
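    Opening by reconstruction is a typical example of the reconstruction-based simplification filters the abstract refers to; a minimal sketch with scikit-image (the filter choice and footprint size are illustrative, not the paper's exact pipeline):

```python
import numpy as np
from skimage import data
from skimage.morphology import erosion, reconstruction, disk

# Simplification by opening-by-reconstruction: erode to remove small
# bright details, then reconstruct under the original image so the
# surviving objects regain their exact contours. Unlike a plain
# opening, edges are preserved, which is why reconstruction-based
# filters suit segmentation.
image = data.camera().astype(float)
seed = erosion(image, disk(5))               # marker: eroded image
simplified = reconstruction(seed, image, method='dilation')
```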

    Unsupervised feature analysis for high dimensional big data

    Get PDF
    In practice we often encounter the scenario that label information is unavailable, due to either the high cost of manual labeling or the unwillingness of users to label. When label information is not available, traditional supervised learning cannot be applied directly, so we need to study unsupervised methods that work well even without supervision. Feature analysis has been proven effective and important for many applications. It is a broad research field whose topics include, but are not limited to, feature selection, feature extraction, feature construction, and feature composition; in topic discovery, for example, the learned topics can be viewed as compound features. In many real systems it is often necessary and important to do feature analysis to determine which individual or compound features should be used for posterior learning tasks. The effectiveness of traditional feature analysis often relies on labels of the training data examples. However, in the era of big data, label information is often unavailable, and in the unsupervised scenario feature analysis becomes more challenging. Two important research topics in unsupervised feature analysis are unsupervised feature selection and unsupervised feature composition, e.g., discovering topics as compound features. This naturally creates two lines of work for unsupervised feature analysis; combined with the single-view or multi-view nature of the data, we obtain a table with four cells. Except for single-view feature composition (topic discovery), where much work has already been done, e.g., PLSA, LDA, and NMF, the other three cells correspond to new research topics on which little work has been done yet.

    For single-view unsupervised feature analysis, we propose two unsupervised feature selection methods. Specifically, we propose a new method called Robust Unsupervised Feature Selection (RUFS), where pseudo cluster labels are learned via local-learning-regularized robust NMF and feature selection is performed simultaneously by robust joint $l_{2,1}$-norm minimization. Outliers are handled effectively, and redundant or noisy features are effectively reduced. We also design a (projected) limited-memory BFGS based linear-time iterative algorithm to solve the optimization problem efficiently. We further study how the choice of norms for the data-fitting and feature-selection terms affects the ultimate unsupervised feature selection performance: we propose to use a joint adaptive loss and $l_2/l_0$ minimization for data fitting and feature selection, and we mathematically explain the desirable properties of this formulation over recent unsupervised feature selection models. We solve the optimization problem with an efficient iterative algorithm whose computational complexity and memory cost are linear in both sample size and feature size.

    For multi-view unsupervised feature analysis, we focus on text-image web news data. For multi-view unsupervised feature selection, we propose a more effective approach for such high-dimensional data: raw text features are used in label learning to avoid information loss, image local-learning-regularized orthogonal nonnegative matrix factorization learns pseudo labels, and robust joint $l_{2,1}$-norm minimization is simultaneously performed to select discriminative features, so that cross-view consensus on pseudo labels is obtained as much as possible. For multi-view topic discovery, we study how to systematically mine topics from high-dimensional text-image web news data. This application problem is important because almost all news articles have an associated picture. Unlike traditional topic modeling, which considers text alone, the new task aims to discover heterogeneous topics from web news of multiple data types. We propose to tackle the problem with a regularized nonnegative-constrained $l_{2,1}$-norm minimization framework and present a new iterative algorithm to solve the optimization problem. The proposed single-view feature selection methods can be applied to almost all single-view data. The proposed multi-view methods are designed to process text-image web news data, but the idea naturally generalizes to any multi-view data. Practitioners can run the proposed methods to select features for posterior learning tasks, or run our multi-view topic model to analyze and visualize topics in text-image web news corpora to help interpret the data.
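    The $l_{2,1}$-norm that recurs in these formulations induces row sparsity in the feature-weight matrix, which is what turns the optimization into a feature selector. A minimal numpy sketch of the norm and the resulting ranking step, using a random stand-in for the learned weight matrix (not the paper's actual solver):

```python
import numpy as np

def l21_norm(W):
    # l_{2,1} norm: sum over rows of each row's l_2 norm. Minimizing
    # it drives whole rows of W to zero, so each feature (one row of
    # weights across pseudo labels) is kept or dropped as a unit.
    return np.linalg.norm(W, axis=1).sum()

# Given a learned weight matrix W (n_features x n_pseudo_labels),
# the selection step reduces to ranking features by row norm.
W = np.random.randn(100, 8)                  # stand-in for a learned W
row_norms = np.linalg.norm(W, axis=1)
selected = np.argsort(row_norms)[::-1][:20]  # indices of top-20 features
```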

    A new unsupervised feature selection method for text clustering based on genetic algorithms

    Get PDF
    Nowadays a vast amount of textual information is collected and stored in various databases around the world, including the Internet as the largest database of all. This rapidly increasing growth of published text means that even the most avid reader cannot hope to keep up with all the reading in a field, and consequently nuggets of insight or new knowledge are at risk of languishing undiscovered in the literature. Text mining offers a solution to this problem by replacing or supplementing the human reader with automatic systems undeterred by the text explosion. It involves analyzing a large collection of documents to discover previously unknown information. Text clustering is one of the most important areas in text mining; it includes text preprocessing, dimension reduction by selecting some terms (features), and finally clustering using the selected terms. Feature selection appears to be the most important step in the process. Conventional unsupervised feature selection methods define a measure of the discriminating power of terms in order to select proper terms from a corpus; however, the evaluation of terms in groups has not been investigated in previously reported work. In this paper a new and robust unsupervised feature selection approach is proposed that evaluates terms in groups. In addition, a new Modified Term Variance measure is proposed for evaluating groups of terms, and a genetic algorithm is designed and implemented for finding the most valuable groups of terms under the new measure. These terms are then used to generate the final feature vector for the clustering process. To evaluate and justify our approach, the proposed method and a conventional term variance method were implemented and tested on the Reuters-21578 corpus collection. For a more accurate comparison, the methods were tested on three corpora, and for each corpus the clustering task was run ten times and the results averaged. The comparison results are very promising and show that our method produces better average accuracy and F1-measure than the conventional term variance method.
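    As a rough illustration of the scheme (not the paper's algorithm), the sketch below evolves fixed-size groups of term indices over a document-term matrix, using plain summed term variance as a stand-in for the proposed Modified Term Variance; all operators and parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def term_variance(dtm, group):
    # Fitness: summed variance of the chosen terms' frequencies across
    # documents (a plain stand-in for Modified Term Variance).
    return dtm[:, group].var(axis=0).sum()

def ga_select_terms(dtm, group_size=50, pop=30, gens=40, mut=0.1):
    n_terms = dtm.shape[1]
    # Each individual is one group of distinct term indices.
    popn = [rng.choice(n_terms, group_size, replace=False) for _ in range(pop)]
    for _ in range(gens):
        fit = [term_variance(dtm, g) for g in popn]
        order = np.argsort(fit)[::-1]
        popn = [popn[i] for i in order[:pop // 2]]          # elitist selection
        children = []
        while len(popn) + len(children) < pop:
            a, b = rng.choice(len(popn), 2, replace=False)  # parents
            cut = int(rng.integers(1, group_size))          # one-point crossover
            child = np.unique(np.concatenate([popn[a][:cut], popn[b][cut:]]))
            while len(child) < group_size:                  # crossover may shrink
                child = np.unique(np.append(child, rng.integers(n_terms)))
            if rng.random() < mut:                          # point mutation
                new_t = int(rng.integers(n_terms))
                if new_t not in child:
                    child[int(rng.integers(group_size))] = new_t
            children.append(child)
        popn.extend(children)
    return max(popn, key=lambda g: term_variance(dtm, g))

# Usage on a random document-term matrix (documents x terms).
dtm = rng.random((200, 1000))
best_group = ga_select_terms(dtm)   # term indices for the feature vector
```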

    Labeling the Features Not the Samples: Efficient Video Classification with Minimal Supervision

    Feature selection is essential for effective visual recognition. We propose an efficient joint classifier learning and feature selection method that discovers sparse, compact representations of input features from a vast sea of candidates, with an almost unsupervised formulation. Our method requires only the following knowledge, which we call the feature sign: whether or not a particular feature has, on average, stronger values over positive samples than over negative ones. We show how this can be estimated using as few as a single labeled training sample per class. Then, using these feature signs, we extend an initial supervised learning problem into an (almost) unsupervised clustering formulation that can incorporate new data without requiring ground-truth labels. Our method works both as a feature selection mechanism and as a fully competitive classifier. It has important properties: low computational cost and excellent accuracy, especially in difficult cases of very limited training data. We experiment on large-scale recognition in video and show superior speed and performance relative to established feature selection approaches such as AdaBoost, Lasso, and greedy forward-backward selection, and to powerful classifiers such as SVM.
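    The feature sign the abstract defines can be estimated with almost no supervision; a minimal sketch of that estimation step under the single-labeled-sample-per-class assumption (names are hypothetical, not the authors' code):

```python
import numpy as np

def feature_signs(x_pos, x_neg):
    # +1 where the feature is on average stronger on the positive
    # sample(s), -1 where it is weaker, 0 on ties.
    return np.sign(x_pos - x_neg)

# With one labeled sample per class (the minimal case the abstract
# mentions), the signs come from a single difference; with more
# labels, x_pos / x_neg would be per-class feature means.
x_pos = np.array([0.9, 0.1, 0.5])
x_neg = np.array([0.2, 0.4, 0.5])
signs = feature_signs(x_pos, x_neg)          # -> [ 1., -1.,  0.]
```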