6 research outputs found

    Fisher’s decision tree

    Get PDF
    Univariate decision trees are classifiers currently used in many data mining applications. This kind of classifier discovers partitions of the input space via hyperplanes that are orthogonal to the attribute axes, producing a model that human experts can understand. One disadvantage of univariate decision trees is that they produce complex and inaccurate models when decision boundaries are not orthogonal to the axes. In this paper we introduce the Fisher's Tree, a classifier that takes advantage of the dimensionality reduction of Fisher's linear discriminant and uses the decomposition strategy of decision trees to produce an oblique decision tree. Our proposal generates an artificial attribute that is used to split the data recursively. The Fisher's decision tree induces oblique trees whose accuracy, size, number of leaves, and training time are competitive with those of other decision trees reported in the literature. We use more than ten publicly available data sets to demonstrate the effectiveness of our method.
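    The split described in the abstract can be sketched in a few lines. This is a minimal illustration of the idea, not the authors' implementation, and it assumes a two-class node: Fisher's linear discriminant supplies a projection direction, the projection of each example onto it serves as the artificial attribute, and a threshold on that attribute defines an oblique split.

```python
import numpy as np

def fisher_direction(X, y):
    """Fisher's linear discriminant direction for two classes (0 and 1)."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter, regularized so it is always invertible.
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    Sw += 1e-6 * np.eye(X.shape[1])
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

def fisher_split(X, y):
    """Project onto the Fisher direction (the 'artificial attribute') and
    split at the midpoint between the projected class means."""
    w = fisher_direction(X, y)
    z = X @ w                                   # artificial attribute
    t = (z[y == 0].mean() + z[y == 1].mean()) / 2.0
    return w, t, z <= t                         # hyperplane, threshold, left mask
```

    Recursing this split on each resulting subset yields an oblique tree: each internal node stores one hyperplane (w, t) rather than a single-attribute test.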

    Segmentation of Nonstationary Time Series with Geometric Clustering

    Get PDF

    Entropy-based machine learning algorithms applied to genomics and pattern recognition

    Get PDF
    Transcription factors (TFs) are proteins that interact with DNA to regulate the transcription of DNA to RNA and play key roles in both healthy and cancerous cells. Thus, gaining a deeper understanding of the biological factors underlying TF binding specificity is important for understanding the mechanism of oncogenesis. As large biological datasets become more readily available, machine learning (ML) algorithms have proven to be an important and useful set of tools for cancer researchers. However, there remain many potential improvements for these ML models, including a higher degree of model interpretability and overall accuracy. In this thesis, we present decision tree (DT) methods applied to DNA sequence analysis that yield highly interpretable and accurate predictions. We propose a boosted decision tree (BDT) model that uses binary counts of important DNA motifs to predict the binding specificity of TFs belonging to the same protein family, which bind similar DNA sequences. We then introduce a novel application of Convolutional Decision Trees (CDTs) and demonstrate that this approach has distinct advantages over the BDT model while still accurately predicting the binding specificity of TFs. The CDT models are trained using the Cross Entropy (CE) optimization method, a Monte Carlo optimization method based on concepts from information theory related to statistical mechanics. We then further study the CDT model as a general pattern recognition and transfer learning technique and demonstrate that it can learn translationally invariant patterns that lead to high classification accuracy while remaining more interpretable and learning higher-quality convolutional filters than convolutional neural networks (CNNs).
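    As a rough illustration of the BDT idea (not the thesis's code), the sketch below encodes sequences as binary motif-occurrence indicators and boosts one-level decision stumps over them; the motif list and sequences are hypothetical placeholders.

```python
import numpy as np

MOTIFS = ["GATA", "TGCA", "CACGTG"]   # hypothetical motifs, for illustration

def encode(seq, motifs=MOTIFS):
    """Binary indicator vector: does each motif occur in the sequence?"""
    return np.array([float(m in seq) for m in motifs])

def boost_stumps(X, y, rounds=10):
    """Minimal AdaBoost over axis-aligned stumps on binary features,
    a toy stand-in for a boosted decision tree model; y is in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(rounds):
        best = None
        for j in range(d):
            for s in (1, -1):                   # stump polarity
                pred = s * (2 * X[:, j] - 1)    # maps {0,1} -> {-1,+1}
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, s, pred)
        err, j, s, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)          # reweight hard examples
        w /= w.sum()
        ensemble.append((alpha, j, s))
    return ensemble

def predict(ensemble, X):
    score = sum(a * s * (2 * X[:, j] - 1) for a, j, s in ensemble)
    return np.sign(score)
```

    Interpretability here is direct: each stump names one motif whose presence or absence votes, with weight alpha, for or against binding.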

    Task Relationship Modeling in Lifelong Multitask Learning

    Get PDF
    Multitask learning is a learning framework that shares training information among multiple related tasks to improve the generalization error of each task. The benefits of multitask learning have been shown both empirically and theoretically, and a number of fields benefit from it, such as toxicology, image annotation, and compressive sensing. However, the majority of multitask learning algorithms make a key assumption: that all tasks are related to each other in a similar fashion. Users often do not know which tasks are related and train all tasks together, which results in sharing training information even among unrelated tasks. Training unrelated tasks together can cause negative transfer and deteriorate the performance of multitask learning. For example, consider predicting the in vivo toxicity of chemicals at various endpoints from chemical structure. Toxicity at different endpoints is not always related, and since biological networks are highly complex, it is not possible to predetermine which endpoints are related. Training all the endpoints together may therefore degrade overall performance, so it is important to model task relationships in multitask learning. Multitask learning with task relationship modeling may be explored in three different settings: static learning, online fixed-task learning, and, most recently, lifelong learning. Multitask learning algorithms in the static setting have existed for more than a decade, and there is a large literature in this field; however, the use of task relationships within the multitask learning framework has been studied in detail only in the past several years, and the literature combining feature selection with task relationship modeling is even more limited. For online and lifelong learning, task relationship modeling becomes a challenge.
    In online multitask learning, all tasks are known before training begins, and the samples arrive in an online fashion. In lifelong multitask learning, however, the tasks themselves also arrive in an online fashion, which makes modeling task relationships an even greater challenge than in the online setting. The main contribution of this thesis is a framework for modeling task relationships in lifelong multitask learning. The initial algorithms are preliminary studies that focus on the static setting and learn clusters of related tasks with feature selection; these algorithms enforce that all related tasks select a common set of features. The later part of the thesis shifts to the lifelong multitask learning setting, where we propose learning functions to represent the relationships between tasks. Learning functions is faster and computationally less expensive than the traditional approach of learning fixed-size matrices to represent task relationships.
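    One standard way to make related tasks select a common feature set in the static setting, in the spirit of the preliminary algorithms described above, is an L2,1 (group-lasso) penalty on the stacked weight matrix. The proximal-gradient sketch below is an illustrative stand-in, not the thesis's algorithm.

```python
import numpy as np

def multitask_group_lasso(Xs, ys, lam=0.1, lr=0.05, iters=400):
    """Multitask least squares with joint feature selection: tasks share a
    weight matrix W (features x tasks) penalized by the L2,1 norm, so a
    feature is either used by the related tasks together or zeroed out."""
    d, T = Xs[0].shape[1], len(Xs)
    W = np.zeros((d, T))
    for _ in range(iters):
        # Gradient of the summed per-task least-squares losses.
        G = np.column_stack([X.T @ (X @ W[:, t] - y) / len(y)
                             for t, (X, y) in enumerate(zip(Xs, ys))])
        W -= lr * G
        # Row-wise soft-thresholding: the proximal operator of the L2,1 norm.
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        W *= np.maximum(0.0, 1.0 - lr * lam / np.maximum(norms, 1e-12))
    return W
```

    Because the penalty acts on whole rows of W, a feature dropped by the regularizer is dropped for every task at once, which is exactly the "common set of features" constraint.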

    Induction of classification rules and decision trees using genetic algorithms.

    Get PDF
    Ng Sai-Cheong. Thesis submitted in December 2004. Thesis (M.Phil.)--Chinese University of Hong Kong, 2005. Includes bibliographical references (leaves 172-178). Abstracts in English and Chinese.

    Contents:
    Abstract --- p.i
    Acknowledgement --- p.iii
    Chapter 1 --- Introduction --- p.1
        1.1 --- Data Mining --- p.1
        1.2 --- Problem Specifications and Motivations --- p.3
        1.3 --- Contributions of the Thesis --- p.5
        1.4 --- Thesis Roadmap --- p.6
    Chapter 2 --- Related Work --- p.9
        2.1 --- Supervised Classification Techniques --- p.9
            2.1.1 --- Classification Rules --- p.9
            2.1.2 --- Decision Trees --- p.11
        2.2 --- Evolutionary Algorithms --- p.19
            2.2.1 --- Genetic Algorithms --- p.19
            2.2.2 --- Genetic Programming --- p.24
            2.2.3 --- Evolution Strategies --- p.26
            2.2.4 --- Evolutionary Programming --- p.32
        2.3 --- Applications of Evolutionary Algorithms to Induction of Classification Rules --- p.33
            2.3.1 --- SCION --- p.33
            2.3.2 --- GABIL --- p.34
            2.3.3 --- LOGENPRO --- p.35
        2.4 --- Applications of Evolutionary Algorithms to Construction of Decision Trees --- p.35
            2.4.1 --- Binary Tree Genetic Algorithm --- p.35
            2.4.2 --- OC1-GA --- p.36
            2.4.3 --- OC1-ES --- p.38
            2.4.4 --- GATree --- p.38
            2.4.5 --- Induction of Linear Decision Trees using Strong Typing GP --- p.39
        2.5 --- Spatial Data Structures and its Applications --- p.40
            2.5.1 --- Spatial Data Structures --- p.40
            2.5.2 --- Applications of Spatial Data Structures --- p.42
    Chapter 3 --- Induction of Classification Rules using Genetic Algorithms --- p.45
        3.1 --- Introduction --- p.45
        3.2 --- Rule Learning using Genetic Algorithms --- p.46
            3.2.1 --- Population Initialization --- p.47
            3.2.2 --- Fitness Evaluation of Chromosomes --- p.49
            3.2.3 --- Token Competition --- p.50
            3.2.4 --- Chromosome Elimination --- p.51
            3.2.5 --- Rule Migration --- p.52
            3.2.6 --- Crossover --- p.53
            3.2.7 --- Mutation --- p.55
            3.2.8 --- Calculating the Number of Correctly Classified Training Samples in a Rule Set --- p.56
        3.3 --- Performance Evaluation --- p.56
            3.3.1 --- Performance Comparison of the GA-based CPRLS and Various Supervised Classification Algorithms --- p.57
            3.3.2 --- Performance Comparison of the GA-based CPRLS and RS-based CPRLS --- p.68
            3.3.3 --- Effects of Token Competition --- p.69
            3.3.4 --- Effects of Rule Migration --- p.70
        3.4 --- Chapter Summary --- p.73
    Chapter 4 --- Genetic Algorithm-based Quadratic Decision Trees --- p.74
        4.1 --- Introduction --- p.74
        4.2 --- Construction of Quadratic Decision Trees --- p.76
        4.3 --- Evolving the Optimal Quadratic Hypersurface using Genetic Algorithms --- p.77
            4.3.1 --- Population Initialization --- p.80
            4.3.2 --- Fitness Evaluation --- p.81
            4.3.3 --- Selection --- p.81
            4.3.4 --- Crossover --- p.82
            4.3.5 --- Mutation --- p.83
        4.4 --- Performance Evaluation --- p.84
            4.4.1 --- Performance Comparison of the GA-based QDT and Various Supervised Classification Algorithms --- p.85
            4.4.2 --- Performance Comparison of the GA-based QDT and RS-based QDT --- p.92
            4.4.3 --- Effects of Changing Parameters of the GA-based QDT --- p.93
        4.5 --- Chapter Summary --- p.109
    Chapter 5 --- Induction of Linear and Quadratic Decision Trees using Spatial Data Structures --- p.111
        5.1 --- Introduction --- p.111
        5.2 --- Construction of k-D Trees --- p.113
        5.3 --- Construction of Generalized Quadtrees --- p.119
        5.4 --- Induction of Oblique Decision Trees using Spatial Data Structures --- p.124
        5.5 --- Induction of Quadratic Decision Trees using Spatial Data Structures --- p.130
        5.6 --- Performance Evaluation --- p.139
            5.6.1 --- Performance Comparison with Various Supervised Classification Algorithms --- p.142
            5.6.2 --- Effects of Changing the Minimum Number of Training Samples at Each Node of a k-D Tree --- p.155
            5.6.3 --- Effects of Changing the Minimum Number of Training Samples at Each Node of a Generalized Quadtree --- p.157
            5.6.4 --- Effects of Changing the Size of Datasets --- p.158
        5.7 --- Chapter Summary --- p.160
    Chapter 6 --- Conclusions --- p.164
        6.1 --- Contributions --- p.164
        6.2 --- Future Work --- p.167
    Appendix A --- Implementation of Data Mining Algorithms Specified in the Thesis --- p.170
    Bibliography --- p.17

    Oblique Linear Tree

    No full text
    In this paper we present the system Ltree for propositional supervised learning. Ltree is able to define decision surfaces both orthogonal and oblique to the axes defined by the attributes of the input space. This is done by combining a decision tree with a linear discriminant by means of constructive induction. At each decision node, Ltree defines a new instance space by inserting new attributes that are projections of the examples falling at that node onto the hyperplanes given by a linear discriminant function. This new instance space is propagated down through the tree, and tests based on those new attributes are oblique with respect to the original input space. Ltree is a probabilistic tree in the sense that it outputs a class probability distribution for each query example. The class probability distribution is computed at learning time, taking into account the different class distributions on the path from the root to the current node. We have carried out experiments on sixteen benchm..
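    A single constructive-induction step of the kind Ltree performs at a node might look like the sketch below. This is a simplified stand-in, not the Ltree implementation: it derives one discriminant direction per class from the pooled within-class scatter and appends each projection as a new attribute, so that a later univariate test on an appended column is oblique in the original space.

```python
import numpy as np

def augment_with_discriminant(X, y):
    """One constructive-induction step at a decision node (sketch):
    fit linear discriminant directions and append the projection of
    every example onto each direction as a new attribute."""
    classes = np.unique(y)
    mean = X.mean(axis=0)
    # Pooled within-class scatter, regularized for invertibility.
    Sw = sum(np.cov(X[y == c], rowvar=False) for c in classes)
    Sw += 1e-6 * np.eye(X.shape[1])
    new_attrs = []
    for c in classes:
        w = np.linalg.solve(Sw, X[y == c].mean(axis=0) - mean)
        new_attrs.append(X @ w)          # projection = new oblique attribute
    return np.column_stack([X] + new_attrs)
```

    In Ltree the augmented instance space is then propagated to the children, so attributes constructed near the root remain available for tests deeper in the tree.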