Search Strategies for Binary Feature Selection for a Naive Bayes Classifier
We compare in this paper several feature selection methods for the Naive Bayes Classifier (NBC) when the data under study are described by a large number of redundant binary indicators. Wrapper approaches guided by the NBC estimation of the classification error probability outperform filter approaches while retaining a reasonable computational cost.
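As a flavour of such a wrapper search (a minimal sketch, not the paper's exact procedure), a greedy forward selection guided by the cross-validated error of scikit-learn's BernoulliNB could look as follows; the stopping rule and parameter names are illustrative assumptions:

import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score

def forward_select(X, y, max_features=10, cv=5):
    # Greedy wrapper: repeatedly add the binary feature that most
    # reduces the cross-validated error of the Naive Bayes classifier.
    selected, remaining = [], list(range(X.shape[1]))
    best_err = 1.0
    while remaining and len(selected) < max_features:
        errs = {j: 1.0 - cross_val_score(BernoulliNB(), X[:, selected + [j]], y, cv=cv).mean()
                for j in remaining}
        j_best = min(errs, key=errs.get)
        if errs[j_best] >= best_err:  # halt when no candidate improves the error estimate
            break
        best_err = errs[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best_err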
k-Nearest Neighbour Classifiers: 2nd Edition (with Python examples)
Perhaps the most straightforward classifier in the arsenal of machine learning techniques is the Nearest Neighbour Classifier: classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query. This approach to classification is of particular importance because issues of poor run-time performance are not such a problem these days given the computational power that is available. This paper presents an overview of techniques for Nearest Neighbour classification, focusing on: mechanisms for assessing similarity (distance), computational issues in identifying nearest neighbours, and mechanisms for reducing the dimension of the data.
This paper is the second edition of a paper previously published as a technical report. Sections on similarity measures for time-series, retrieval speed-up and intrinsic dimensionality have been added. An Appendix is included providing access to Python code for the key methods.
Comment: 22 pages, 15 figures. An updated edition of an older tutorial on kNN.
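The core idea can be stated in a few lines of Python (a minimal sketch under Euclidean distance, not the code from the paper's Appendix):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Classify a query by majority vote among its k nearest
    # training examples under Euclidean distance.
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]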
Feature Selection via Coalitional Game Theory
We present and study the contribution-selection algorithm (CSA), a novel algorithm for feature selection. The algorithm is based on the Multi-perturbation Shapley Analysis (MSA), a framework that relies on game theory to estimate usefulness. The algorithm iteratively estimates the usefulness of features and selects them accordingly, using either forward selection or backward elimination. It can optimize various performance measures over unseen data, such as accuracy, balanced error rate, and area under the receiver operating characteristic curve. Empirical comparison with several other existing feature selection methods shows that the backward elimination variant of CSA leads to the most accurate classification results on an array of data sets.
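The game-theoretic ingredient can be illustrated with a Monte Carlo estimate of a feature's Shapley contribution (a simplified stand-in for MSA; the score function, mapping a feature subset to an estimated performance measure, is an assumed placeholder):

import random

def shapley_contribution(feature, features, score, n_perm=50):
    # Average marginal gain in `score` from adding `feature`, taken
    # over random orderings of the other features; this Monte Carlo
    # average approximates the feature's Shapley value.
    others = [f for f in features if f != feature]
    total = 0.0
    for _ in range(n_perm):
        random.shuffle(others)
        cut = random.randint(0, len(others))  # predecessors in a random permutation
        subset = others[:cut]
        total += score(subset + [feature]) - score(subset)
    return total / n_perm

CSA would then rank candidate features by such contribution estimates and add (forward selection) or drop (backward elimination) features accordingly.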
Combining similarity in time and space for training set formation under concept drift
Concept drift is a challenge in supervised learning for sequential data. It describes a phenomenon in which the data distributions change over time. In such a case, the accuracy of a classifier benefits from selective sampling of the training data. We develop a method for training set selection, particularly relevant when the expected drift is gradual. Training set selection at each time step is based on the distance to the target instance. The distance function combines similarity in space and in time. The method determines an optimal training set size online at every time step using cross-validation. It is a wrapper approach, so different base classifiers can be plugged in. The proposed method shows the best accuracy in its peer group on real and artificial drifting data. The method's complexity is reasonable for field applications.
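A minimal sketch of such a combined distance and the resulting training set selection (the weighting parameter alpha and the function names are illustrative assumptions, not the paper's formulation):

import numpy as np

def spacetime_distance(x_i, t_i, x_q, t_q, alpha=0.5):
    # Weighted combination of distance in feature space and in time;
    # alpha is an assumed trade-off parameter.
    return alpha * np.linalg.norm(x_i - x_q) + (1 - alpha) * abs(t_q - t_i)

def select_training_set(X, t, x_q, t_q, size, alpha=0.5):
    # Keep the `size` historical instances closest to the target
    # (x_q, t_q); in the paper, the set size is tuned online by
    # cross-validation at every time step.
    d = [spacetime_distance(X[i], t[i], x_q, t_q, alpha) for i in range(len(X))]
    return np.argsort(d)[:size]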
Resampling methods for parameter-free and robust feature selection with mutual information
Combining the mutual information criterion with a forward feature selection strategy offers a good trade-off between optimality of the selected feature subset and computation time. However, it requires setting the parameter(s) of the mutual information estimator and determining when to halt the forward procedure. These two choices are difficult to make because, as the dimensionality of the subset increases, the estimation of the mutual information becomes less and less reliable. This paper proposes to use resampling methods, K-fold cross-validation and the permutation test, to address both issues. The resampling methods provide information about the variance of the estimator, which can then be used to automatically set the parameter and to calculate a threshold at which to stop the forward procedure. The procedure is illustrated on a synthetic dataset as well as on real-world examples.
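The permutation-test idea can be sketched as follows (a simplified univariate version; in the paper the test applies to the estimated mutual information between the growing feature subset and the output):

import numpy as np
from sklearn.metrics import mutual_info_score

def permutation_threshold(x, y, n_perm=200, quantile=0.95):
    # Mutual information between x and randomly permuted copies of y
    # gives a 'no dependence' null distribution; its upper quantile
    # serves as a threshold for stopping the forward search.
    rng = np.random.default_rng(0)
    null = [mutual_info_score(x, rng.permutation(y)) for _ in range(n_perm)]
    return np.quantile(null, quantile)

The forward procedure halts when the best candidate's estimated mutual information gain no longer exceeds this threshold.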
Weka: A machine learning workbench for data mining
The Weka workbench is an organized collection of state-of-the-art machine learning algorithms and data preprocessing tools. The basic way of interacting with these methods is by invoking them from the command line. However, convenient interactive graphical user interfaces are provided for data exploration, for setting up large-scale experiments on distributed computing platforms, and for designing configurations for streamed data processing. These interfaces constitute an advanced environment for experimental data mining. The system is written in Java and distributed under the terms of the GNU General Public License.
Robust Feature Selection by Mutual Information Distributions
Mutual information is widely used in artificial intelligence, in a descriptive way, to measure the stochastic dependence of discrete random variables. In order to address questions such as the reliability of the empirical value, one must consider sample-to-population inferential approaches. This paper deals with the distribution of mutual information, as obtained in a Bayesian framework by a second-order Dirichlet prior distribution. The exact analytical expression for the mean and an analytical approximation of the variance are reported. Asymptotic approximations of the distribution are proposed. The results are applied to the problem of selecting features for incremental learning and classification with the naive Bayes classifier. A fast, newly defined method is shown to outperform the traditional approach based on empirical mutual information on a number of real data sets. Finally, a theoretical development is reported that allows one to efficiently extend the above methods to incomplete samples in an easy and effective way.
Comment: 8 two-column pages.
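Numerically, the distribution in question can be approximated by sampling: draw joint cell probabilities from the Dirichlet posterior over a contingency table and compute the mutual information of each draw (a Monte Carlo illustration, not the paper's analytical expressions; the uniform prior weight is an assumption):

import numpy as np
from scipy.stats import dirichlet

def mi_posterior_moments(counts, n_samples=2000, prior=1.0):
    # Sample the mutual-information distribution induced by a Dirichlet
    # posterior over the joint cell probabilities of the contingency
    # table `counts` (rows: X, columns: Y); return its mean and variance.
    alpha = counts.flatten() + prior
    mis = []
    for p in dirichlet.rvs(alpha, size=n_samples):
        P = p.reshape(counts.shape)
        px, py = P.sum(1, keepdims=True), P.sum(0, keepdims=True)
        with np.errstate(divide="ignore", invalid="ignore"):
            mis.append(np.nansum(P * np.log(P / (px * py))))
    return np.mean(mis), np.var(mis)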
Feature selection algorithms: a survey and experimental evaluation
In view of the substantial number of existing feature selection algorithms, the need arises for criteria that enable one to adequately decide which algorithm to use in a given situation. This work reviews several fundamental algorithms found in the literature and assesses their performance in a controlled scenario. A scoring measure ranks the algorithms by taking into account the amount of relevance, irrelevance and redundancy in sample data sets. This measure computes the degree of matching between the output given by the algorithm and the known optimal solution. Sample size effects are also studied.
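One simple form such a matching score could take (a Jaccard-style sketch; the survey's actual measure weights relevance, irrelevance and redundancy separately) is:

def matching_score(selected, optimal):
    # Degree of overlap between the feature subset returned by an
    # algorithm and the known optimal subset (Jaccard index).
    s, o = set(selected), set(optimal)
    return len(s & o) / len(s | o) if s | o else 1.0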