98,274 research outputs found
Automating biomedical data science through tree-based pipeline optimization
Over the past decade, data science and machine learning has grown from a
mysterious art form to a staple tool across a variety of fields in academia,
business, and government. In this paper, we introduce the concept of tree-based
pipeline optimization for automating one of the most tedious parts of machine
learning---pipeline design. We implement a Tree-based Pipeline Optimization
Tool (TPOT) and demonstrate its effectiveness on a series of simulated and
real-world genetic data sets. In particular, we show that TPOT can build
machine learning pipelines that achieve competitive classification accuracy and
discover novel pipeline operators---such as synthetic feature
constructors---that significantly improve classification accuracy on these data
sets. We also highlight the current challenges to pipeline optimization, such
as the tendency to produce pipelines that overfit the data, and suggest future
research paths to overcome these challenges. As such, this work represents an
early step toward fully automating machine learning pipeline design.Comment: 16 pages, 5 figures, to appear in EvoBIO 2016 proceeding
Non-uniform Feature Sampling for Decision Tree Ensembles
We study the effectiveness of non-uniform randomized feature selection in
decision tree classification. We experimentally evaluate two feature selection
methodologies, based on information extracted from the provided dataset:
\emph{leverage scores-based} and \emph{norm-based} feature selection.
Experimental evaluation of the proposed feature selection techniques indicate
that such approaches might be more effective compared to naive uniform feature
selection and moreover having comparable performance to the random forest
algorithm [3]Comment: 7 pages, 7 figures, 1 tabl
Power System Parameters Forecasting Using Hilbert-Huang Transform and Machine Learning
A novel hybrid data-driven approach is developed for forecasting power system
parameters with the goal of increasing the efficiency of short-term forecasting
studies for non-stationary time-series. The proposed approach is based on mode
decomposition and a feature analysis of initial retrospective data using the
Hilbert-Huang transform and machine learning algorithms. The random forests and
gradient boosting trees learning techniques were examined. The decision tree
techniques were used to rank the importance of variables employed in the
forecasting models. The Mean Decrease Gini index is employed as an impurity
function. The resulting hybrid forecasting models employ the radial basis
function neural network and support vector regression. Apart from introduction
and references the paper is organized as follows. The section 2 presents the
background and the review of several approaches for short-term forecasting of
power system parameters. In the third section a hybrid machine learning-based
algorithm using Hilbert-Huang transform is developed for short-term forecasting
of power system parameters. Fourth section describes the decision tree learning
algorithms used for the issue of variables importance. Finally in section six
the experimental results in the following electric power problems are
presented: active power flow forecasting, electricity price forecasting and for
the wind speed and direction forecasting
Multi-test Decision Tree and its Application to Microarray Data Classification
Objective:
The desirable property of tools used to investigate biological data is
easy to understand models and predictive decisions.
Decision trees are particularly promising in this regard due to their comprehensible nature that resembles the hierarchical process of human decision making. However, existing algorithms for learning decision trees have tendency to underfit gene expression data. The main aim of this work is to improve the performance and stability of decision trees with only a small increase in their complexity.
Methods:
We propose a multi-test decision tree (MTDT); our main contribution is the application of several univariate tests in each non-terminal node of the decision tree. We also search for alternative, lower-ranked features in order to obtain more stable and reliable predictions.
Results:
Experimental validation was performed on several real-life gene expression datasets. Comparison results with eight classifiers show that MTDT has a statistically significantly higher accuracy than popular decision tree classifiers, and it was highly competitive with ensemble learning algorithms. The proposed solution managed to outperform its baseline algorithm on datasets by an average percent. A study performed on one of the datasets showed that the discovered genes used in the MTDT classification model
are supported by biological evidence in the literature.
Conclusion:
This paper introduces a new type of decision tree which is more suitable for solving biological problems.
MTDTs are relatively easy to analyze and much more powerful in modeling high dimensional microarray data than their popular counterparts
Radiomics strategies for risk assessment of tumour failure in head-and-neck cancer
Quantitative extraction of high-dimensional mineable data from medical images
is a process known as radiomics. Radiomics is foreseen as an essential
prognostic tool for cancer risk assessment and the quantification of
intratumoural heterogeneity. In this work, 1615 radiomic features (quantifying
tumour image intensity, shape, texture) extracted from pre-treatment FDG-PET
and CT images of 300 patients from four different cohorts were analyzed for the
risk assessment of locoregional recurrences (LR) and distant metastases (DM) in
head-and-neck cancer. Prediction models combining radiomic and clinical
variables were constructed via random forests and imbalance-adjustment
strategies using two of the four cohorts. Independent validation of the
prediction and prognostic performance of the models was carried out on the
other two cohorts (LR: AUC = 0.69 and CI = 0.67; DM: AUC = 0.86 and CI = 0.88).
Furthermore, the results obtained via Kaplan-Meier analysis demonstrated the
potential of radiomics for assessing the risk of specific tumour outcomes using
multiple stratification groups. This could have important clinical impact,
notably by allowing for a better personalization of chemo-radiation treatments
for head-and-neck cancer patients from different risk groups.Comment: (1) Paper: 33 pages, 4 figures, 1 table; (2) SUPP info: 41 pages, 7
figures, 8 table
- …