290 research outputs found
Understanding a Version of Multivariate Symmetric Uncertainty to assist in Feature Selection
In this paper, we analyze the behavior of the multivariate symmetric
uncertainty (MSU) measure through the use of statistical simulation techniques
under various mixes of informative and non-informative randomly generated
features. Experiments show how the number of attributes, their cardinalities,
and the sample size affect the MSU. We discovered a condition that preserves
good quality in the MSU under different combinations of these three factors,
providing a new useful criterion to help drive the process of dimension
reduction
Thesaurus-based index term extraction for agricultural documents
This paper describes a new algorithm for automatically extracting index terms from documents relating to the domain of agriculture. The domain-specific Agrovoc thesaurus developed by the FAO is used both as a controlled vocabulary and as a knowledge base for semantic matching. The automatically assigned terms are evaluated against a manually indexed 200-item sample of the FAOās document repository, and the performance of the new algorithm is compared with a state-of-the-art system for keyphrase extraction
Coherent Keyphrase Extraction via Web Mining
Keyphrases are useful for a variety of purposes, including summarizing,
indexing, labeling, categorizing, clustering, highlighting, browsing, and
searching. The task of automatic keyphrase extraction is to select keyphrases
from within the text of a given document. Automatic keyphrase extraction makes
it feasible to generate keyphrases for the huge number of documents that do not
have manually assigned keyphrases. A limitation of previous keyphrase
extraction algorithms is that the selected keyphrases are occasionally
incoherent. That is, the majority of the output keyphrases may fit together
well, but there may be a minority that appear to be outliers, with no clear
semantic relation to the majority or to each other. This paper presents
enhancements to the Kea keyphrase extraction algorithm that are designed to
increase the coherence of the extracted keyphrases. The approach is to use the
degree of statistical association among candidate keyphrases as evidence that
they may be semantically related. The statistical association is measured using
web mining. Experiments demonstrate that the enhancements improve the quality
of the extracted keyphrases. Furthermore, the enhancements are not
domain-specific: the algorithm generalizes well when it is trained on one
domain (computer science documents) and tested on another (physics documents).Comment: 6 pages, related work available at http://purl.org/peter.turney
Using a unified measure function for heuristics, discretization, and rule quality evaluation in Ant-Miner
Ant-Miner is a classification rule discovery algorithm that is based on Ant Colony Optimization (ACO) meta-heuristic. cAnt-Miner is the extended version of the algorithm that handles continuous attributes on-the-fly during the rule construction process, while ?Ant-Miner is an extension of the algorithm that selects the rule class prior to its construction, and utilizes multiple pheromone types, one for each permitted rule class. In this paper, we combine these two algorithms to derive a new approach for learning classification rules using ACO. The proposed approach is based on using the measure function for 1) computing the heuristics for rule term selection, 2) a criteria for discretizing continuous attributes, and 3) evaluating the quality of the constructed rule for pheromone update as well. We explore the effect of using different measure functions for on the output model in terms of predictive accuracy and model size. Empirical evaluations found that hypothesis of different functions produce different results are acceptable according to Friedmanās statistical test
Interaktivna interakcijska analiza
Interakcije lahko razumemo kot korelacije, ki obsegajo veÄ kot le dva atributa. Neka skupina atributov je med seboj v interakciji, Äe njihovih medsebojnih povezanosti ne moremo popolnoma razumeti, ne da bi jih vse opazovali hkrati. Interakcije so zakonitosti skupin veÄ atributov. V tem Älanku merimo pomembnost interakcije s postopki, ki temeljijo na Shannonovi entropiji kot pojmu negotovosti, ki je bolj sploÅ”en od koncepta statistiÄne variance. Cilj interakcijske analize je analitiku predstaviti interakcije grafiÄno z veÄ tipi diagramov. S tem namenom smo izdelali orodja, ki omogoÄajo interaktivno preuÄevanje podatkov in nudijo pomoÄ pri iskanju zanimivih pogledov na podatke. Interakcije prinaÅ”ajo tudi nov pogled na nekatere težave postopkov strojnega uÄenja
Decision diagrams in machine learning: an empirical study on real-life credit-risk data.
Decision trees are a widely used knowledge representation in machine learning. However, one of their main drawbacks is the inherent replication of isomorphic subtrees, as a result of which the produced classifiers might become too large to be comprehensible by the human experts that have to validate them. Alternatively, decision diagrams, a generalization of decision trees taking on the form of a rooted, acyclic digraph instead of a tree, have occasionally been suggested as a potentially more compact representation. Their application in machine learning has nonetheless been criticized, because the theoretical size advantages of subgraph sharing did not always directly materialize in the relatively scarce reported experiments on real-world data. Therefore, in this paper, starting from a series of rule sets extracted from three real-life credit-scoring data sets, we will empirically assess to what extent decision diagrams are able to provide a compact visual description. Furthermore, we will investigate the practical impact of finding a good attribute ordering on the achieved size savings.Advantages; Classifiers; Credit scoring; Data; Decision; Decision diagrams; Decision trees; Empirical study; Knowledge; Learning; Real life; Representation; Size; Studies;
Efficient algorithms for decision tree cross-validation
Cross-validation is a useful and generally applicable technique often
employed in machine learning, including decision tree induction. An important
disadvantage of straightforward implementation of the technique is its
computational overhead. In this paper we show that, for decision trees, the
computational overhead of cross-validation can be reduced significantly by
integrating the cross-validation with the normal decision tree induction
process. We discuss how existing decision tree algorithms can be adapted to
this aim, and provide an analysis of the speedups these adaptations may yield.
The analysis is supported by experimental results.Comment: 9 pages, 6 figures.
http://www.cs.kuleuven.ac.be/cgi-bin-dtai/publ_info.pl?id=3478
A Decision tree-based attribute weighting filter for naive Bayes
The naive Bayes classifier continues to be a popular learning algorithm for data mining applications due to its simplicity and linear run-time. Many enhancements to the basic algorithm have been proposed to help mitigate its primary weakness--the assumption that attributes are independent given the class. All of them improve the performance of naĆÆve Bayes at the expense (to a greater or lesser degree) of execution time and/or simplicity of the final model. In this paper we present a simple filter method for setting attribute weights for use with naive Bayes. Experimental results show that naive Bayes with attribute weights rarely degrades the quality of the model compared to standard naive Bayes and, in many cases, improves it dramatically. The main advantages of this method compared to other approaches for improving naive Bayes is its
run-time complexity and the fact that it maintains the simplicity of the final model
- ā¦