    Practical feature subset selection for machine learning

    Machine learning algorithms automatically extract knowledge from machine-readable information. Unfortunately, their success is usually dependent on the quality of the data that they operate on. If the data is inadequate, or contains extraneous and irrelevant information, machine learning algorithms may produce less accurate and less understandable results, or may fail to discover anything of use at all. Feature subset selection can result in enhanced performance, a reduced hypothesis search space, and, in some cases, reduced storage requirements. This paper describes a new feature selection algorithm that uses a correlation-based heuristic to determine the "goodness" of feature subsets, and evaluates its effectiveness with three common machine learning algorithms. Experiments using a number of standard machine learning data sets are presented. Feature subset selection gave significant improvement for all three algorithms.
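
    A minimal sketch of the correlation-based merit heuristic described in this abstract, in the spirit of CFS. The Pearson correlation, the function names, and the greedy forward search are assumptions for illustration; the paper's exact statistic and search strategy may differ.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """Heuristic 'goodness' of a feature subset: reward high
    feature-class correlation, penalize feature-feature correlation."""
    k = len(subset)
    # mean absolute feature-class correlation
    r_cf = np.mean([abs(np.corrcoef(X[:, f], y)[0, 1]) for f in subset])
    # mean absolute pairwise feature-feature correlation
    if k > 1:
        pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
        r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) for a, b in pairs])
    else:
        r_ff = 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def forward_select(X, y):
    """Greedy forward search: add the feature that most improves the merit."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        best = max(remaining, key=lambda f: cfs_merit(X, y, selected + [f]))
        if selected and cfs_merit(X, y, selected + [best]) <= cfs_merit(X, y, selected):
            break  # no remaining feature improves the merit
        selected.append(best)
        remaining.remove(best)
    return selected
```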

    Feature selection using genetic algorithms and probabilistic neural networks

    Selection of input variables is a key stage in building predictive models, and an important form of data mining. As exhaustive evaluation of potential input sets using full non-linear models is impractical, it is necessary to use simple, fast-evaluating models and heuristic selection strategies. This paper discusses a fast, efficient, and powerful non-linear input selection procedure that combines Probabilistic Neural Networks with repeated bitwise gradient descent. The algorithm is compared with forward selection, backward elimination, and genetic algorithms on a selection of real-world data sets. It achieves comparable performance with greatly reduced execution time relative to these alternative approaches. It is demonstrated empirically that reliable results cannot be obtained by any of these approaches without the use of resampling.
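
    A sketch of the repeated bitwise descent over feature-inclusion masks that the abstract describes. The Probabilistic Neural Network itself is abstracted behind `error_fn`; the function names, restart count, and random initialization are illustrative assumptions.

```python
import numpy as np

def bitwise_descent(error_fn, n_features, n_restarts=5, seed=0):
    """Repeated bitwise descent over feature-inclusion masks.
    error_fn(mask) should return the (resampled) validation error of a
    fast model, e.g. a Probabilistic Neural Network, trained on the
    features where mask is True."""
    rng = np.random.default_rng(seed)
    best_mask, best_err = None, np.inf
    for _ in range(n_restarts):
        mask = rng.random(n_features) < 0.5      # random starting subset
        err = error_fn(mask)
        improved = True
        while improved:                          # local search: flip one bit at a time
            improved = False
            for f in range(n_features):
                trial = mask.copy()
                trial[f] = not trial[f]
                trial_err = error_fn(trial)
                if trial_err < err:
                    mask, err, improved = trial, trial_err, True
        if err < best_err:
            best_mask, best_err = mask, err
    return best_mask, best_err
```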

    Efficient Feature Subset Selection Algorithm for High Dimensional Data

    Feature selection addresses the dimensionality problem by removing irrelevant and redundant features. Existing feature selection algorithms take considerable time to obtain a feature subset for high-dimensional data. This paper proposes a feature selection algorithm based on information gain measures for high-dimensional data, termed IFSA (Information gain based Feature Selection Algorithm), to produce an optimal feature subset efficiently and improve the computational performance of learning algorithms. The IFSA algorithm works in two stages: first, a filter is applied to the dataset; second, a small feature subset is produced using the information gain measure. Extensive experiments compare the proposed algorithm with other methods using two different classifiers (Naive Bayes and IBk) on microarray and text data sets. The results demonstrate that IFSA not only produces a compact feature subset efficiently but also improves classifier performance.
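
    The abstract does not spell out IFSA's two stages in detail, but the information gain measure at its core is standard. A minimal sketch, with the discrete-feature assumption and the top-k cutoff as illustrative choices:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """IG(class; feature) = H(class) - H(class | feature), discrete feature."""
    h_cond = 0.0
    for v in np.unique(feature):
        subset = labels[feature == v]
        h_cond += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - h_cond

def select_top_k(X, y, k):
    """Rank every feature by information gain and keep the k best."""
    gains = [information_gain(X[:, j], y) for j in range(X.shape[1])]
    return list(np.argsort(gains)[::-1][:k])
```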

    Relevance-Redundancy Dominance: a threshold-free approach to filter-based feature selection

    Feature selection is used to select a subset of relevant features in machine learning, and is vital for simplifying models, improving efficiency, and reducing overfitting. In filter-based feature selection, a statistic such as correlation or entropy is computed between each feature and the target variable to evaluate feature relevance. A relevance threshold is typically used to limit the set of selected features, and features can also be removed on grounds of redundancy (similarity to other features). Some methods are designed for use with a specific statistic or certain types of data. We present a new filter-based method called Relevance-Redundancy Dominance that applies to mixed data types, can use a wide variety of statistics, and does not require a threshold. We provide preliminary results from extensive numerical experiments on public credit datasets.
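
    The abstract does not give the precise dominance rule, so the sketch below encodes one common reading of relevance-redundancy dominance (similar in spirit to FCBF): a feature is dropped when some more relevant feature is more similar to it than it is relevant to the target. The function name and inputs are assumptions.

```python
import numpy as np

def rrd_select(relevance, redundancy):
    """Threshold-free dominance filter.
    relevance:  (n,) relevance of each feature to the target
    redundancy: (n, n) symmetric feature-feature similarity matrix
    Returns indices of features that are not dominated."""
    n = len(relevance)
    keep = []
    for i in range(n):
        dominated = any(
            relevance[j] > relevance[i] and redundancy[i, j] > relevance[i]
            for j in range(n) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep
```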

    Analysis of the contour structural irregularity of skin lesions using wavelet decomposition

    The boundary irregularity of skin lesions is of clinical significance for the early detection of malignant melanomas and for distinguishing them from other lesions such as benign moles. The structural components of the contour are of particular importance. To extract the structure from the contour, wavelet decomposition was used, as these components tend to be located in the lower-frequency sub-bands. Lesion contours were modeled as signatures with scale normalization to give position and frequency-resolution invariance. Energy distributions among different wavelet sub-bands were then analyzed to extract those with significant levels and differences so as to enable maximum discrimination. Based on the coefficients in the significant sub-bands, structural components of the original contours were modeled, and a set of statistical and geometric irregularity descriptors was developed and applied at each of the significant sub-bands. The effectiveness of the descriptors was measured using the Hausdorff distance between sets of data from melanoma and mole contours. The best descriptor outputs were input to a back-propagation neural network to construct a combined classifier system. Experimental results showed that thirteen features from four sub-bands produced the best discrimination between sets of melanomas and moles, and that a small training set of nine melanomas and nine moles was optimal.
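
    A sketch of the sub-band energy analysis described above, using PyWavelets. The wavelet family, decomposition depth, resampling length, and the centroid-distance signature are assumptions, since the abstract does not fix them.

```python
import numpy as np
import pywt

def subband_energies(contour, wavelet="db4", level=5, n_samples=256):
    """Fraction of signal energy in each wavelet sub-band of a lesion-contour
    signature. contour: (N, 2) array of boundary points."""
    centroid = contour.mean(axis=0)
    dist = np.linalg.norm(contour - centroid, axis=1)    # 1-D contour signature
    # resample to a fixed length (frequency-resolution invariance)
    # and normalize scale (position/scale invariance)
    idx = np.linspace(0, len(dist) - 1, n_samples)
    sig = np.interp(idx, np.arange(len(dist)), dist)
    sig = sig / sig.mean()
    coeffs = pywt.wavedec(sig, wavelet, level=level)     # [cA_n, cD_n, ..., cD_1]
    energies = np.array([np.sum(c ** 2) for c in coeffs])
    return energies / energies.sum()                     # energy per sub-band
```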