Multi-test Decision Tree and its Application to Microarray Data Classification
Objective:
A desirable property of tools used to investigate biological data is that they produce models and predictive decisions that are easy to understand. Decision trees are particularly promising in this regard because their comprehensible structure resembles the hierarchical process of human decision making. However, existing algorithms for learning decision trees tend to underfit gene expression data. The main aim of this work is to improve the performance and stability of decision trees with only a small increase in their complexity.
Methods:
We propose a multi-test decision tree (MTDT); our main contribution is the application of several univariate tests in each non-terminal node of the decision tree. We also search for alternative, lower-ranked features in order to obtain more stable and reliable predictions.
Results:
Experimental validation was performed on several real-life gene expression datasets. Comparisons with eight classifiers show that MTDT has statistically significantly higher accuracy than popular decision tree classifiers and is highly competitive with ensemble learning algorithms. The proposed solution also outperformed its baseline algorithm on the datasets on average. A study performed on one of the datasets showed that the discovered genes used in the MTDT classification model are supported by biological evidence in the literature.
Conclusion:
This paper introduces a new type of decision tree which is more suitable for solving biological problems.
MTDTs are relatively easy to analyze and much more powerful in modeling high-dimensional microarray data than their popular counterparts.
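As a rough illustration of the core idea (this is not the authors' implementation; the class, the threshold tests, and all values are hypothetical), a non-terminal node that holds several univariate tests can route a sample by majority vote:

```python
# Hypothetical sketch of a multi-test split: a non-terminal node holds
# several univariate threshold tests and routes a sample by majority vote.

class MultiTestNode:
    def __init__(self, tests):
        # tests: list of (feature_index, threshold) univariate tests;
        # in the paper these would be chosen by a split criterion,
        # including lower-ranked alternative features for stability.
        self.tests = tests

    def route(self, sample):
        # Each test votes "left" (True) if its feature is below its threshold.
        votes = sum(sample[f] <= t for f, t in self.tests)
        return "left" if votes > len(self.tests) / 2 else "right"

node = MultiTestNode([(0, 0.5), (1, 1.2), (2, -0.3)])
print(node.route([0.4, 2.0, -1.0]))  # two of three tests vote left
```

Because the routing decision is a vote over several correlated features rather than a single split, a perturbation of one gene's expression value is less likely to flip the path a sample takes.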
Adverse Drug Reaction Classification With Deep Neural Networks
We study the problem of detecting sentences describing adverse drug reactions (ADRs) and frame it as binary classification. We investigate different neural network (NN) architectures for ADR classification. In particular, we propose two new neural network models: a Convolutional Recurrent Neural Network (CRNN), built by concatenating convolutional neural networks with recurrent neural networks, and a Convolutional Neural Network with Attention (CNNA), built by adding attention weights to convolutional neural networks. We evaluate the NN architectures on a Twitter dataset containing informal language and on an Adverse Drug Effects (ADE) dataset constructed by sampling MEDLINE case reports. Experimental results show that, on both datasets, all the NN architectures considerably outperform traditional maximum entropy classifiers trained on n-grams with different weighting strategies. On the Twitter dataset, all the NN architectures perform similarly; on the ADE dataset, however, the plain CNN performs better than the more complex CNN variants. Nevertheless, CNNA allows the visualisation of the attention weights assigned to words when making classification decisions, and is hence more appropriate for extracting the word subsequences that describe ADRs.
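The attention mechanism behind CNNA's visualisable word weights can be illustrated in miniature. The sketch below is hypothetical: the scores and word vectors are made-up constants, whereas in the real model they would be produced and learned by the network. It shows only the softmax weighting and the weighted sum of word representations:

```python
import numpy as np

def attention_weights(scores):
    # Softmax turns raw per-word relevance scores into weights that sum to 1;
    # subtracting the max is the usual numerical-stability trick.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

# Illustrative per-word scores (e.g. relevance to an ADR mention) and
# 2-dimensional word vectors for a three-word sentence.
scores = np.array([0.1, 2.0, 0.3])
word_vectors = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [0.5, 0.5]])

w = attention_weights(scores)
context = w @ word_vectors  # attention-weighted sentence representation
print(w)
```

Plotting `w` over the sentence's words is exactly the kind of visualisation the abstract credits CNNA with: the highest-weighted words are the ones the classifier leaned on.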
Prediction in Financial Markets: The Case for Small Disjuncts
Predictive models in regression and classification problems typically
have a single model that covers most, if not all, cases in the data. At
the opposite end of the spectrum is a collection of models each of which
covers a very small subset of the decision space. These are referred to
as “small disjuncts.” The tradeoffs between the two types of
models have been well documented. Single models, especially linear ones,
are easy to interpret and explain. In contrast, small disjuncts do not
provide as clean or as simple an interpretation of the data, and have
been shown by several researchers to be responsible for a
disproportionately large number of errors when applied to out-of-sample
data. This research provides a counterpoint, demonstrating that
“simple” small disjuncts provide a credible model for
financial market prediction, a problem with a high degree of noise. A
related novel contribution of this paper is a simple method for
measuring the “yield” of a learning system, which is the
percentage of in sample performance that the learned model can be
expected to realize on out-of-sample data. Curiously, such a measure is
missing from the literature on regression learning algorithms.
NYU Stern School of Business
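The "yield" measure described in the abstract reduces to a one-line ratio. The sketch below is ours, not the paper's code; the function name and the scores are illustrative:

```python
# "Yield" of a learning system, as described above: the fraction of
# in-sample performance that the learned model retains out of sample.
def learning_yield(in_sample_score, out_of_sample_score):
    return out_of_sample_score / in_sample_score

# A model scoring 0.80 in sample and 0.60 out of sample has a 75% yield.
print(round(learning_yield(0.80, 0.60), 2))  # 0.75
```

A yield near 1.0 means in-sample performance transfers almost fully; the noisier the domain (financial markets being the abstract's example), the lower the yield one should expect.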
Evaluating multi-label classifiers and recommender systems in the financial service sector
Customer retention
A research report submitted to the Faculty of Engineering and the Built Environment, University of the Witwatersrand, Johannesburg, in partial fulfillment of the requirements for the degree of Master of Science in Engineering.
Johannesburg, May 2018

The aim of this study is to model the probability that a customer attrites/defects from a bank where, for example, the bank is not the customer's preferred/primary bank for salary deposits. The termination of deposit inflow serves as the outcome parameter, and the random forest modelling technique was used to predict the outcome, with new data sources (transactional data) explored to add predictive power. The conventional logistic regression modelling technique was used to benchmark the random forest's results.
It was found that the random forest model slightly overfits during the training process and loses predictive power on validation and out-of-training-period data. The random forest model nevertheless remains predictive and performs better than logistic regression at a cut-off probability of 20%.
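Applying the 20% cut-off mentioned above amounts to thresholding the model's predicted probabilities. The sketch below is illustrative only: the probabilities are made up, and in the study they would come from the fitted random forest:

```python
# Apply the 20% cut-off probability to predicted attrition probabilities.
CUTOFF = 0.20

def classify(prob):
    # Flag the customer as likely to attrite if the predicted
    # probability reaches the cut-off; otherwise predict retention.
    return "attrite" if prob >= CUTOFF else "retain"

probs = [0.05, 0.22, 0.18, 0.40]  # hypothetical model outputs
print([classify(p) for p in probs])  # ['retain', 'attrite', 'retain', 'attrite']
```

Choosing a cut-off well below 50% is common in retention modelling, where attriters are rare and the cost of missing one outweighs the cost of a false alarm.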
A Cluster Elastic Net for Multivariate Regression
We propose a method for estimating coefficients in multivariate regression
when there is a clustering structure to the response variables. The proposed
method includes a fusion penalty, to shrink the difference in fitted values
from responses in the same cluster, and an L1 penalty for simultaneous variable
selection and estimation. The method can be used when the grouping structure of
the response variables is known or unknown. When the clustering structure is
unknown the method will simultaneously estimate the clusters of the response
and the regression coefficients. Theoretical results are presented for the
penalized least squares case, including asymptotic results allowing for p >> n.
We extend our method to the setting where the responses are binomial variables.
We propose a coordinate descent algorithm for both the normal and binomial
likelihood, which can easily be extended to other generalized linear model
(GLM) settings. Simulations and data examples from business operations and
genomics are presented to show the merits of both the least squares and
binomial methods.
Comment: 37 pages, 11 figures
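From the description above, a plausible form of the penalized least squares objective is the following; the notation is our reconstruction from the abstract, not taken from the paper:

```latex
\hat{B} \;=\; \arg\min_{B}\;
\frac{1}{2}\,\lVert Y - XB \rVert_F^2
\;+\; \lambda_1 \sum_{j=1}^{p}\sum_{k=1}^{q} \lvert \beta_{jk} \rvert
\;+\; \frac{\lambda_2}{2} \sum_{g} \sum_{k,\,k' \in g}
\lVert X\beta_{k} - X\beta_{k'} \rVert_2^2
```

Here $Y \in \mathbb{R}^{n \times q}$ holds the responses, $X \in \mathbb{R}^{n \times p}$ the predictors, and $B = (\beta_1, \dots, \beta_q)$ the coefficient matrix. The second term is the L1 penalty for simultaneous variable selection and estimation; the third is the fusion penalty, shrinking differences between the fitted values $X\beta_k$ of responses $k, k'$ assigned to the same cluster $g$. When the clusters are unknown, the assignment of responses to groups $g$ would itself be estimated alongside $B$.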