
290 research outputs found

Understanding a Version of Multivariate Symmetric Uncertainty to assist in Feature Selection

Author: Divina Federico
García-Torres Miguel
Gómez Santiago
Schaerer Christian
Sosa-Cabrera Gustavo
Publication date: 25/09/2017

In this paper, we analyze the behavior of the multivariate symmetric uncertainty (MSU) measure through the use of statistical simulation techniques under various mixes of informative and non-informative randomly generated features. Experiments show how the number of attributes, their cardinalities, and the sample size affect the MSU. We discovered a condition that preserves good quality in the MSU under different combinations of these three factors, providing a new useful criterion to help drive the process of dimension reduction.

arXiv.org e-Print Archive
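
The abstract above does not state the MSU formula, so the following is only a minimal sketch of the standard pairwise symmetric uncertainty, SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), which the multivariate measure generalizes. The function names `entropy` and `symmetric_uncertainty` are illustrative assumptions and do not come from the paper.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def symmetric_uncertainty(x, y):
    """Pairwise SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1].

    Assumption: x and y are equal-length sequences of discrete values.
    """
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    if hx + hy == 0:
        return 0.0  # both variables are constant; nothing to share
    mutual_info = hx + hy - hxy
    return 2.0 * mutual_info / (hx + hy)


x = [0, 0, 1, 1]
print(symmetric_uncertainty(x, x))             # identical variables
print(symmetric_uncertainty(x, [0, 1, 0, 1]))  # independent in this sample
```

Dividing by H(X) + H(Y) keeps the score within [0, 1], which makes it easier to compare attributes of different cardinalities than raw mutual information.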

Thesaurus-based index term extraction for agricultural documents

Author: Medelyan Olena
Witten Ian H.
Publication venue: EFITA/WICCA
Publication date: 01/01/2005

This paper describes a new algorithm for automatically extracting index terms from documents relating to the domain of agriculture. The domain-specific Agrovoc thesaurus developed by the FAO is used both as a controlled vocabulary and as a knowledge base for semantic matching. The automatically assigned terms are evaluated against a manually indexed 200-item sample of the FAO’s document repository, and the performance of the new algorithm is compared with a state-of-the-art system for keyphrase extraction.

CiteSeerX

Research Commons@Waikato

Coherent Keyphrase Extraction via Web Mining

Author: Turney Peter D.
Publication date: 01/01/2003

Keyphrases are useful for a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyphrase extraction is to select keyphrases from within the text of a given document. Automatic keyphrase extraction makes it feasible to generate keyphrases for the huge number of documents that do not have manually assigned keyphrases. A limitation of previous keyphrase extraction algorithms is that the selected keyphrases are occasionally incoherent. That is, the majority of the output keyphrases may fit together well, but there may be a minority that appear to be outliers, with no clear semantic relation to the majority or to each other. This paper presents enhancements to the Kea keyphrase extraction algorithm that are designed to increase the coherence of the extracted keyphrases. The approach is to use the degree of statistical association among candidate keyphrases as evidence that they may be semantically related. The statistical association is measured using web mining. Experiments demonstrate that the enhancements improve the quality of the extracted keyphrases. Furthermore, the enhancements are not domain-specific: the algorithm generalizes well when it is trained on one domain (computer science documents) and tested on another (physics documents).

Comment: 6 pages, related work available at http://purl.org/peter.turney

arXiv.org e-Print Archive

CiteSeerX

NRC Publications Archive

CogPrints Cognitive Sciences Eprint Archive

Using a unified measure function for heuristics, discretization, and rule quality evaluation in Ant-Miner

Author: Otero Fernando E.B.
Salama Khalid M.
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/06/2013

Ant-Miner is a classification rule discovery algorithm that is based on the Ant Colony Optimization (ACO) meta-heuristic. cAnt-Miner is the extended version of the algorithm that handles continuous attributes on-the-fly during the rule construction process, while μAnt-Miner is an extension of the algorithm that selects the rule class prior to its construction, and utilizes multiple pheromone types, one for each permitted rule class. In this paper, we combine these two algorithms to derive a new approach for learning classification rules using ACO. The proposed approach is based on using the measure function for 1) computing the heuristics for rule term selection, 2) a criterion for discretizing continuous attributes, and 3) evaluating the quality of the constructed rule for pheromone update as well. We explore the effect of using different measure functions on the output model in terms of predictive accuracy and model size. Empirical evaluations found that the hypothesis that different functions produce different results is acceptable according to Friedman’s statistical test.

Crossref

Kent Academic Repository

Interactive interaction analysis (Interaktivna interakcijska analiza)

Author: Jakulin Aleks
Leban Gregor
Publication date: 01/01/2003

Interactions can be understood as correlations that involve more than just two attributes. A group of attributes is in interaction if their mutual relationships cannot be fully understood without observing all of them at once. Interactions are regularities among groups of several attributes. In this paper we measure the importance of an interaction with procedures based on Shannon entropy, a notion of uncertainty that is more general than the concept of statistical variance. The goal of interaction analysis is to present interactions to the analyst graphically, using several types of diagrams. To this end we have built tools that enable interactive exploration of data and help in finding interesting views of the data. Interactions also offer a new perspective on some problems of machine learning methods.

ePrints.FRI
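
The abstract above says only that interaction importance is measured with entropy-based procedures. As a hedged illustration, the sketch below computes one common entropy-based quantity, the three-way interaction information I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C), under the sign convention where positive values indicate synergy. The choice of this particular quantity and the names `entropy` and `interaction_information` are assumptions for illustration, not the paper's stated method.

```python
import math
from collections import Counter


def entropy(*columns):
    """Joint Shannon entropy (in bits) of one or more discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    counts = Counter(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def interaction_information(a, b, c):
    """I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C), expanded into entropies.

    Positive: the attributes are informative only when viewed together.
    Negative: the attributes carry redundant information.
    """
    return (entropy(a, b) + entropy(a, c) + entropy(b, c)
            - entropy(a) - entropy(b) - entropy(c)
            - entropy(a, b, c))


# XOR: neither a nor b alone says anything about c, but together they fix it.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
c = [x ^ y for x, y in zip(a, b)]
print(interaction_information(a, b, c))  # positive (synergy)
print(interaction_information(a, a, a))  # negative (redundancy)
```

The XOR case shows why pairwise correlations are not enough: every pair of attributes looks independent, yet the triple is fully determined.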

Decision diagrams in machine learning: an empirical study on real-life credit-risk data.

Author: Baesens Bart
Files CM
Mues Christophe
Vanthienen Jan

Decision trees are a widely used knowledge representation in machine learning. However, one of their main drawbacks is the inherent replication of isomorphic subtrees, as a result of which the produced classifiers might become too large to be comprehensible by the human experts that have to validate them. Alternatively, decision diagrams, a generalization of decision trees taking on the form of a rooted, acyclic digraph instead of a tree, have occasionally been suggested as a potentially more compact representation. Their application in machine learning has nonetheless been criticized, because the theoretical size advantages of subgraph sharing did not always directly materialize in the relatively scarce reported experiments on real-world data. Therefore, in this paper, starting from a series of rule sets extracted from three real-life credit-scoring data sets, we will empirically assess to what extent decision diagrams are able to provide a compact visual description. Furthermore, we will investigate the practical impact of finding a good attribute ordering on the achieved size savings.

Keywords: Advantages; Classifiers; Credit scoring; Data; Decision; Decision diagrams; Decision trees; Empirical study; Knowledge; Learning; Real life; Representation; Size; Studies

Research Papers in Economics

Efficient algorithms for decision tree cross-validation

Author: Blockeel Hendrik
Struyf Jan
Publication date: 01/01/2001

Cross-validation is a useful and generally applicable technique often employed in machine learning, including decision tree induction. An important disadvantage of straightforward implementation of the technique is its computational overhead. In this paper we show that, for decision trees, the computational overhead of cross-validation can be reduced significantly by integrating the cross-validation with the normal decision tree induction process. We discuss how existing decision tree algorithms can be adapted to this aim, and provide an analysis of the speedups these adaptations may yield. The analysis is supported by experimental results.

Comment: 9 pages, 6 figures. http://www.cs.kuleuven.ac.be/cgi-bin-dtai/publ_info.pl?id=3478

arXiv.org e-Print Archive

Lirias

CiteSeerX

A Decision tree-based attribute weighting filter for naive Bayes

Author: Hall Mark A.
Publication date: 01/05/2006

The naive Bayes classifier continues to be a popular learning algorithm for data mining applications due to its simplicity and linear run-time. Many enhancements to the basic algorithm have been proposed to help mitigate its primary weakness: the assumption that attributes are independent given the class. All of them improve the performance of naive Bayes at the expense (to a greater or lesser degree) of execution time and/or simplicity of the final model. In this paper we present a simple filter method for setting attribute weights for use with naive Bayes. Experimental results show that naive Bayes with attribute weights rarely degrades the quality of the model compared to standard naive Bayes and, in many cases, improves it dramatically. The main advantages of this method compared to other approaches for improving naive Bayes are its run-time complexity and the fact that it maintains the simplicity of the final model.

Research Commons@Waikato