269 research outputs found
Dimensionality Reduction and Classification feature using Mutual Information applied to Hyperspectral Images: A Filter strategy based algorithm
Hyperspectral image (HSI) classification is an advanced remote sensing
technique. The goal is to produce a thematic map that is compared with a
reference ground truth (GT) map, constructed by inspecting the region. An HSI
contains more than a hundred two-dimensional measurements of the same region,
called bands (or simply images), taken at adjacent frequencies. Unfortunately,
some bands contain redundant information, others are affected by noise, and
the high dimensionality of the features lowers classification accuracy. The
problem is how to find the right bands for classifying the pixels of the
region. Some methods use Mutual Information (MI) and a threshold to select
relevant bands, without treating redundancy. Others control and eliminate
redundancy by selecting the band that ranks highest on MI; if its neighbors
have approximately the same MI with the GT, they are considered redundant and
discarded. This is the main drawback of such methods, because it forfeits the
advantage of hyperspectral images: some precious information can be discarded.
In this paper we accept useful redundancy: a band contains useful redundancy
if it contributes to producing an estimated reference map that has higher MI
with the GT. To control redundancy, we introduce a complementary threshold
added to the last MI value. This process is a filter strategy; it achieves
good classification accuracy at low cost, although it is less performant than
a wrapper strategy.
Comment: 11 pages, 5 figures, journal paper
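A minimal sketch of the filter idea described in this abstract, not the authors' exact algorithm: bands are ranked by MI with the GT, and a candidate band is kept only if it raises the MI between an estimated reference map and the GT by more than a complementary threshold. The estimated map used here (the mean of the selected bands, discretized) and the threshold value are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def discretize(x, bins=64):
    """Quantize a continuous image into integer levels so MI can be computed."""
    edges = np.histogram_bin_edges(x, bins=bins)
    return np.digitize(x, edges[1:-1])

def select_bands(cube, gt, threshold=1e-3):
    """cube: (n_pixels, n_bands) array; gt: (n_pixels,) integer labels."""
    n_bands = cube.shape[1]
    # Filter step: rank bands by MI with the ground truth.
    scores = [mutual_info_score(gt, discretize(cube[:, b])) for b in range(n_bands)]
    order = np.argsort(scores)[::-1]
    selected, best_mi = [], -np.inf
    for b in order:
        trial = selected + [b]
        # Illustrative estimated reference map: mean of selected bands, discretized.
        estimate = discretize(cube[:, trial].mean(axis=1))
        mi = mutual_info_score(gt, estimate)
        # Keep the band only if it adds "useful redundancy": the MI of the
        # estimated map with the GT must improve by more than the threshold.
        if mi > best_mi + threshold:
            selected, best_mi = trial, mi
    return selected
```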
A Novel Memetic Feature Selection Algorithm
Feature selection is the problem of finding an efficient subset of features that improves accuracy while reducing complexity. Search strategies are a key aspect of feature selection algorithms, and since feature selection is NP-hard, heuristic algorithms have been studied to solve it.
In this paper, we propose a method based on a memetic algorithm to find an efficient feature subset for a classification problem. It incorporates a filter method into the genetic algorithm to improve classification performance and accelerate the search for core feature subsets. In particular, the method adds or deletes a feature from a candidate feature subset based on multivariate feature information. An empirical study on commonly used data sets from the University of California, Irvine (UCI) repository shows that the proposed method outperforms existing methods.
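A hedged sketch of a memetic feature selector of this general kind, not the paper's method: per-feature mutual information stands in for the multivariate feature information the authors use, and 3-NN cross-validated accuracy serves as the fitness; population size, generations, and operators are all illustrative.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Wrapper fitness: cross-validated accuracy of 3-NN on the selected features."""
    if not mask.any():
        return 0.0
    return cross_val_score(KNeighborsClassifier(n_neighbors=3), X[:, mask], y, cv=3).mean()

def local_search(mask, filter_scores):
    """Memetic step: add the best-ranked excluded feature, drop the worst included one."""
    mask = mask.copy()
    out, inn = np.flatnonzero(~mask), np.flatnonzero(mask)
    if out.size:
        mask[out[np.argmax(filter_scores[out])]] = True
    if inn.size > 1:
        mask[inn[np.argmin(filter_scores[inn])]] = False
    return mask

def memetic_select(X, y, pop_size=20, generations=30):
    n = X.shape[1]
    filter_scores = mutual_info_classif(X, y, random_state=0)  # filter component
    pop = rng.random((pop_size, n)) < 0.5                      # random bitmask population
    for _ in range(generations):
        fit = np.array([fitness(m, X, y) for m in pop])
        parents = pop[np.argsort(fit)[-pop_size // 2:]]        # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)
            child = np.concatenate([a[:cut], b[cut:]])          # one-point crossover
            child ^= rng.random(n) < 1.0 / n                    # bit-flip mutation
            children.append(local_search(child, filter_scores)) # memetic refinement
        pop = np.vstack([parents, children])
    fit = np.array([fitness(m, X, y) for m in pop])
    return pop[np.argmax(fit)]
```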
A Rank Minrelation - Majrelation Coefficient
Improving the detection of relevant variables using a new bivariate measure
could significantly impact variable selection and large network inference
methods. In this paper, we propose a new statistical coefficient that we call
the rank minrelation coefficient. We define a minrelation of X to Y (or
equivalently a majrelation of Y to X) as a measure that estimates p(Y > X)
when X and Y are continuous random variables. The approach is similar to Lin's
concordance coefficient, which instead focuses on estimating p(X = Y). In
other words, if a variable X exhibits a minrelation to Y, then as X increases,
Y is likely to increase too. However, in contrast to concordance or
correlation, the minrelation is not symmetric. More explicitly, if X
decreases, little can be said about the values of Y (except that the
uncertainty on Y actually increases). In this paper, we formally define this
new kind of bivariate dependency and propose a new statistical coefficient to
detect such dependencies. We show through several key examples that this new
coefficient has many interesting properties for selecting relevant variables,
in particular when compared to correlation.
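As a hedged illustration of the quantity this coefficient targets (not the authors' estimator), the empirical frequency of Y > X over paired samples already shows the asymmetry the abstract describes, in contrast to correlation:

```python
import numpy as np

def prob_y_exceeds_x(x, y):
    """Naive empirical estimate of p(Y > X) from paired observations."""
    x, y = np.asarray(x), np.asarray(y)
    return float(np.mean(y > x))

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
y = x + np.abs(rng.normal(size=10_000))  # Y tends to exceed X by construction

print(prob_y_exceeds_x(x, y))   # ~1.0: Y almost always exceeds X
print(prob_y_exceeds_x(y, x))   # ~0.0: unlike correlation, the measure is asymmetric
print(np.corrcoef(x, y)[0, 1])  # correlation gives the same value either way round
```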
An Effective Feature Selection Method Based on Pair-Wise Feature Proximity for High Dimensional Low Sample Size Data
Feature selection has been studied widely in the literature. However, the
efficacy of selection criteria for low-sample-size applications is neglected
in most cases. Most existing feature selection criteria are based on sample
similarity, but distance measures become insignificant for high dimensional
low sample size (HDLSS) data. Moreover, the variance of a feature computed
from a few samples is pointless unless it represents the data distribution
efficiently. Instead of looking at the samples in groups, we evaluate their
efficiency in a pairwise fashion. In our investigation, we noticed that
considering a pair of samples at a time and selecting the features that bring
them closer together or push them farther apart is a better choice for
feature selection. Experimental results on benchmark data sets demonstrate
the effectiveness of the proposed method with low sample sizes, outperforming
many other state-of-the-art feature selection methods.
Comment: European Signal Processing Conference 201
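A minimal sketch of one plausible pairwise criterion in this spirit, not the paper's exact scoring: each pair of samples rewards features that separate between-class pairs and penalizes features that separate within-class pairs. With HDLSS data the number of pairs is small, so the quadratic loop stays cheap.

```python
import numpy as np
from itertools import combinations

def pairwise_proximity_scores(X, y):
    """X: (n_samples, n_features); y: (n_samples,) class labels."""
    n, d = X.shape
    scores = np.zeros(d)
    for i, j in combinations(range(n), 2):  # every pair of samples
        diff = np.abs(X[i] - X[j])          # per-feature distance for this pair
        if y[i] == y[j]:
            scores -= diff                  # within-class pair: reward small gaps
        else:
            scores += diff                  # between-class pair: reward large gaps
    return scores

def select_top_k(X, y, k):
    """Indices of the k features with the best aggregate pairwise score."""
    return np.argsort(pairwise_proximity_scores(X, y))[::-1][:k]
```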
Weighted Heuristic Ensemble of Filters
Feature selection has become increasingly important in data mining in recent years due to the rapid increase in the dimensionality of big data. However, the reliability and consistency of feature selection methods (filters) vary considerably across data sets, and no single filter performs consistently well under various conditions. Feature selection ensembles have therefore been investigated recently to provide more reliable and effective results than any individual filter, but all existing feature selection ensembles treat the constituent methods equally regardless of their performance. In this paper, we present a novel framework that applies a weighted feature selection ensemble by proposing a systematic way of assigning different weights to the filters, and we investigate how to determine the appropriate weight for each filter in an ensemble. Experiments on ten benchmark datasets show that, although theoretically and intuitively adding more weight to "good" filters should lead to better results, in reality the outcome is very uncertain. The assumption held for some examples in our experiments; in other situations, however, filters that had been assumed to perform well performed badly, leading to even worse results. Adding weight to filters may therefore not achieve much in accuracy terms, while increasing complexity and time consumption and clearly decreasing stability.
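A minimal sketch of a weighted filter ensemble, assuming two stock scikit-learn filters (ANOVA F and mutual information) and hand-picked weights purely for illustration; the abstract's point is precisely that choosing such weights well is difficult.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

def rank_scores(scores):
    """Convert raw filter scores to ranks (higher score -> higher rank value)."""
    order = np.argsort(scores)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(scores))
    return ranks

def weighted_ensemble_select(X, y, weights=(0.6, 0.4), k=10):
    """Combine per-filter rankings with the given weights and keep the top k."""
    f_scores, _ = f_classif(X, y)
    mi_scores = mutual_info_classif(X, y, random_state=0)
    combined = weights[0] * rank_scores(f_scores) + weights[1] * rank_scores(mi_scores)
    return np.argsort(combined)[::-1][:k]
```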
Selection of informative geometric features of cell nuclei in fluorescence images of cancer cells
Methods for selecting informative geometric features of nuclei in fluorescence images of cancer cells are considered. A review of existing geometric features was carried out, covering both shape features invariant to image rotation and translation and features describing location in space. Six methods were used for feature selection: the median method, correlation using the Pearson correlation coefficient, correlation using the Spearman correlation coefficient, a logistic regression model, a random forest with CART trees and the Gini criterion, and a random forest with CART trees and an error-minimization criterion. As a result of the investigation, 11 of the 59 features were selected as the most informative; classification quality was analyzed with a random forest, and time costs were calculated as a function of the number of features used to describe the objects. For the random forest, using 11 features is sufficient in terms of classification accuracy and reduces time costs by a factor of 2.3.
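A hedged sketch of two of the six selection methods listed above, applied to a generic feature matrix: a Spearman correlation filter against the class label and Gini-based random forest importances. The top_k of 11 mirrors the abstract; everything else is illustrative.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestClassifier

def spearman_filter(X, y, top_k=11):
    """Rank features by absolute Spearman correlation with the class label."""
    corr = np.array([abs(spearmanr(X[:, j], y).correlation) for j in range(X.shape[1])])
    return np.argsort(corr)[::-1][:top_k]

def gini_importance_filter(X, y, top_k=11):
    """Rank features by Gini-based importance from a random forest of CART trees."""
    forest = RandomForestClassifier(criterion="gini", n_estimators=200, random_state=0)
    forest.fit(X, y)
    return np.argsort(forest.feature_importances_)[::-1][:top_k]
```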
An Effective Algorithm for Correlation Attribute Subset Selection by Using Genetic Algorithm Based On Naive Bayes Classifier
In recent years, the application of feature selection methods to various datasets has greatly increased. Feature selection is an important topic in data mining, especially for high-dimensional datasets. Feature selection (also known as subset selection) is a process commonly used in machine learning, wherein a subset of the features available in the data is selected for application of a learning algorithm. The main idea is to choose a subset of input variables by eliminating features with little or no predictive information. The challenging task is to obtain an optimal subset of relevant and non-redundant features that gives an optimal solution without increasing the complexity of the modeling task. By selecting the most salient features and removing irrelevant, redundant, and noisy ones, feature selection addresses the high-dimensionality problem and focuses learning algorithms on the most useful aspects of the data, thereby making the learning task faster and more accurate. A data warehouse is designed to consolidate and maintain all features that are relevant for the analysis processes.
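A hedged sketch of the subset evaluation such a method could use, assuming a CFS-style correlation merit (high feature-class, low feature-feature correlation) blended with Naive Bayes cross-validated accuracy as the genetic algorithm's fitness; the abstract does not specify the exact fitness, so this blend, the alpha weight, and treating labels as numeric for corrcoef are all assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def correlation_merit(X, y, mask):
    """CFS-style merit: k * r_cf / sqrt(k + k*(k-1)*r_ff)."""
    idx = np.flatnonzero(mask)
    k = idx.size
    if k == 0:
        return 0.0
    # Mean absolute feature-class correlation (labels treated as numeric here).
    rcf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in idx])
    if k == 1:
        return rcf
    # Mean absolute pairwise feature-feature correlation.
    rff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                   for i, a in enumerate(idx) for b in idx[i + 1:]])
    return k * rcf / np.sqrt(k + k * (k - 1) * rff)

def ga_fitness(X, y, mask, alpha=0.5):
    """Blend the filter merit with wrapper accuracy from Naive Bayes."""
    if not mask.any():
        return 0.0
    acc = cross_val_score(GaussianNB(), X[:, mask], y, cv=3).mean()
    return alpha * correlation_merit(X, y, mask) + (1 - alpha) * acc
```

A genetic search over bitmask subsets (selection, crossover, mutation) would then maximize ga_fitness.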
- β¦