2,304 research outputs found
A new approach of top-down induction of decision trees for knowledge discovery
Top-down induction of decision trees is the most popular technique for classification in the field of data mining and knowledge discovery. Quinlan developed the basic induction algorithm of decision trees, ID3 (1984), and extended to C4.5 (1993). There is a lot of research work for dealing with a single attribute decision-making node (so-called the first-order decision) of decision trees. Murphy and Pazzani (1991) addressed about multiple-attribute conditions at decision-making nodes. They show that higher order decision-making generates smaller decision trees and better accuracy. However, there always exist NP-complete combinations of multiple-attribute decision-makings.;We develop a new algorithm of second-order decision-tree inductions (SODI) for nominal attributes. The induction rules of first-order decision trees are combined by \u27AND\u27 logic only, but those of SODI consist of \u27AND\u27, \u27OR\u27, and \u27OTHERWISE\u27 logics. It generates more accurate results and smaller decision trees than any first-order decision tree inductions.;Quinlan used information gains via VC-dimension (Vapnik-Chevonenkis; Vapnik, 1995) for clustering the experimental values for each numerical attribute. However, many researchers have discovered the weakness of the use of VC-dim analysis. Bennett (1997) sophistically applies support vector machines (SVM) to decision tree induction. We suggest a heuristic algorithm (SVMM; SVM for Multi-category) that combines a TDIDT scheme with SVM. In this thesis it will be also addressed how to solve multiclass classification problems.;Our final goal for this thesis is IDSS (Induction of Decision Trees using SODI and SVMM). We will address how to combine SODI and SVMM for the construction of top-down induction of decision trees in order to minimize the generalized penalty cost
Recommended from our members
J-measure based hybrid pruning for complexity reduction in classification rules
Prism is a modular classification rule generation method based on the ‘separate and conquer’ approach that is alternative to the rule induction approach using decision trees also known as ‘divide and conquer’. Prism often achieves a similar level of classification accuracy compared with decision trees, but tends to produce a more compact noise tolerant set of classification rules. As with other classification rule generation methods, a principle problem arising with Prism is that of overfitting due to over-specialised rules. In addition, over-specialised rules increase the associated computational complexity. These problems can be solved by pruning methods. For the Prism method, two pruning algorithms have been introduced recently for reducing overfitting of classification rules - J-pruning and Jmax-pruning. Both algorithms are based on the J-measure, an information theoretic means for quantifying the theoretical information content of a rule. Jmax-pruning attempts to exploit the J-measure to its full potential because J-pruning does not actually achieve this and may even lead to underfitting. A series of experiments have proved that Jmax-pruning may outperform J-pruning in reducing overfitting. However, Jmax-pruning is computationally relatively expensive and may also lead to underfitting. This paper reviews the Prism method and the two existing pruning algorithms above. It also proposes a novel pruning algorithm called Jmid-pruning. The latter is based on the J-measure and it reduces overfitting to a similar level as the other two algorithms but is better in avoiding underfitting and unnecessary computational effort. The authors conduct an experimental study on the performance of the Jmid-pruning algorithm in terms of classification accuracy and computational efficiency. The algorithm is also evaluated comparatively with the J-pruning and Jmax-pruning algorithms
Efficient Closed Pattern Mining in the Presence of Tough Block Constraints
In recent years, various constrained frequent pattern mining problem formulations and associated algorithms have been developed that enable the user to specify various itemsetbased constraints that better capture the underlying application requirements and characteristics. In this paper we introduce a new class of block constraints that determine the significance of an itemset pattern by considering the dense block that is formed by the pattern's items and its associated set of transactions. Block constraints provide a natural framework by which a number of important problems can be specified and make it possible to solve numerous problems on binary and real-valued datasets. However, developing computationally e#cient algorithms to find these block constraints poses a number of challenges as unlike the di#erent itemset-based constraints studied earlier, these block constraints are tough as they are neither anti-monotone, monotone, nor convertible. To overcome this problem, we introduce a new class of pruning methods that can be used to significantly reduce the overall search space and make it possible to develop computationally e#cient block constraint mining algorithms. We present an algorithm called CBMiner that takes advantage of these pruning methods to develop an algorithm for finding the closed itemsets that satisfy the block constraints. Our extensive performance study shows that CBMiner generates more concise result set and can be order(s) of magnitude faster than the traditional frequent closed itemset mining algorithms
More than the sum of its parts – pattern mining, neural networks, and how they complement each other
In this thesis we explore pattern mining and deep learning. Often seen as orthogonal, we show that these fields complement each other and propose to combine them to gain from each other’s strengths. We, first, show how to efficiently discover succinct and non-redundant sets of patterns that provide insight into data beyond conjunctive statements. We leverage the interpretability of such patterns to unveil how and which information flows through neural networks, as well as what characterizes their decisions. Conversely, we show how to combine continuous optimization with pattern discovery, proposing a neural network that directly encodes discrete patterns, which allows us to apply pattern mining at a scale orders of magnitude larger than previously possible. Large neural networks are, however, exceedingly expensive to train for which ‘lottery tickets’ – small, well-trainable sub-networks in randomly initialized neural networks – offer a remedy. We identify theoretical limitations of strong tickets and overcome them by equipping these tickets with the property of universal approximation. To analyze whether limitations in ticket sparsity are algorithmic or fundamental, we propose a framework to plant and hide lottery tickets. With novel ticket benchmarks we then conclude that the limitation is likely algorithmic, encouraging further developments for which our framework offers means to measure progress.In dieser Arbeit befassen wir uns mit Mustersuche und Deep Learning. Oft als gegensätzlich betrachtet, verbinden wir diese Felder, um von den Stärken beider zu profitieren. Wir zeigen erst, wie man effizient prägnante Mengen von Mustern entdeckt, die Einsichten über konjunktive Aussagen hinaus geben. Wir nutzen dann die Interpretierbarkeit solcher Muster, um zu verstehen wie und welche Information durch neuronale Netze fließen und was ihre Entscheidungen charakterisiert. Umgekehrt verbinden wir kontinuierliche Optimierung mit Mustererkennung durch ein neuronales Netz welches diskrete Muster direkt abbildet, was Mustersuche in einigen Größenordnungen höher erlaubt als bisher möglich. Das Training großer neuronaler Netze ist jedoch extrem teuer, für das ’Lotterietickets’ – kleine, gut trainierbare Subnetzwerke in zufällig initialisierten neuronalen Netzen – eine Lösung bieten. Wir zeigen theoretische Einschränkungen von starken Tickets und wie man diese überwindet, indem man die Tickets mit der Eigenschaft der universalen Approximierung ausstattet. Um zu beantworten, ob Einschränkungen in Ticketgröße algorithmischer oder fundamentaler Natur sind, entwickeln wir ein Rahmenwerk zum Einbetten und Verstecken von Tickets, die als Modellfälle dienen. Basierend auf unseren Ergebnissen schließen wir, dass die Einschränkungen algorithmische Ursachen haben, was weitere Entwicklungen begünstigt, für die unser Rahmenwerk Fortschrittsevaluierungen ermöglicht
LC an effective classification based association rule mining algorithm
Classification using association rules is a research field in data mining that primarily uses association rule discovery techniques in classification benchmarks. It has been confirmed by many research studies in the literature that classification using association tends to generate more predictive classification systems than traditional classification data mining techniques like probabilistic, statistical and decision tree. In this thesis, we introduce a novel data mining algorithm based on classification using association called “Looking at the Class” (LC), which can be used in for mining a range of classification data sets. Unlike known algorithms in classification using the association approach such as Classification based on Association rule (CBA) system and Classification based on Predictive Association (CPAR) system, which merge disjoint items in the rule learning step without anticipating the class label similarity, the proposed algorithm merges only items with identical class labels. This saves too many unnecessary items combining during the rule learning step, and consequently results in large saving in computational time and memory.
Furthermore, the LC algorithm uses a novel prediction procedure that employs multiple rules to make the prediction decision instead of a single rule. The proposed algorithm has been evaluated thoroughly on real world security data sets collected using an automated tool developed at Huddersfield University. The security application which we have considered in this thesis is about categorizing websites based on their features to legitimate or fake which is a typical binary classification problem. Also, experimental results on a number of UCI data sets have been conducted and the measures used for evaluation is the classification accuracy, memory usage, and others. The results show that LC algorithm outperformed traditional classification algorithms such as C4.5, PART and Naïve Bayes as well as known classification based association algorithms like CBA with respect to classification accuracy, memory usage, and execution time on most data sets we consider
- …