Prediction of progression in idiopathic pulmonary fibrosis using CT scans at baseline: A quantum particle swarm optimization - Random forest approach
Idiopathic pulmonary fibrosis (IPF) is a fatal lung disease characterized by an unpredictable progressive decline in lung function. The natural history of IPF is unknown and the prediction of disease progression at the time of diagnosis is notoriously difficult. High resolution computed tomography (HRCT) has been used for the diagnosis of IPF, but not generally for monitoring purposes. The objective of this work is to develop a novel predictive model for the radiological progression pattern at voxel-wise level using only baseline HRCT scans. Mainly, there are two challenges: (a) obtaining a data set of features for regions of interest (ROIs) on baseline HRCT scans and their follow-up status; and (b) simultaneously selecting important features from a high-dimensional space and optimizing the prediction performance. We resolved the first challenge by implementing a study design and having an expert radiologist contour ROIs at baseline scans, depending on their progression status in follow-up visits. For the second challenge, we integrated the feature selection with prediction by developing an algorithm using a wrapper method that combines quantum particle swarm optimization to select a small number of features with random forest to classify early patterns of progression. We applied our proposed algorithm to analyze anonymized HRCT images from 50 IPF subjects from a multi-center clinical trial. We showed that it yields a parsimonious model with 81.8% sensitivity, 82.2% specificity and an overall accuracy rate of 82.1% at the ROI level. These results are superior to other popular feature selection and classification methods, in that our method produces higher accuracy in prediction of progression and more balanced sensitivity and specificity with a smaller number of selected features. Our work is the first approach to show that it is possible to use only baseline HRCT scans to predict progressive ROIs at 6 months to 1 year follow-ups using artificial intelligence.
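As a reading aid, the three ROI-level rates reported in this abstract (sensitivity, specificity, overall accuracy) can be computed from a confusion matrix roughly as follows. This is a minimal sketch; the function name and the counts are hypothetical and chosen only for illustration, not taken from the study.

```python
def classification_rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)              # progressive ROIs correctly flagged
    specificity = tn / (tn + fp)              # stable ROIs correctly flagged
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts for illustration only (not the study's data).
sens, spec, acc = classification_rates(tp=40, fn=10, tn=90, fp=10)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```

A model is "balanced" in the abstract's sense when the first two rates are close to each other, rather than one being traded away for the other.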
Markov blanket: efficient strategy for feature subset selection method for high dimensionality microarray cancer datasets
Currently, feature subset selection methods are very important, especially in areas of application for which
datasets with tens or hundreds of thousands of variables (genes) are available. Feature subset selection
methods help us select a small number of variables out of thousands of genes in microarray datasets for a
more accurate and balanced classification.
Efficient gene selection can ease the computational burden of the subsequent classification
task and can yield a gene subset without loss of classification performance. In classifying
microarray data, the main objective of gene selection is to find the genes that retain the maximum
amount of relevant information about the class while minimizing classification errors. In this paper, we explain the
importance of feature subset selection methods in the machine learning and data mining fields. Consequently,
the analysis of microarray expression was used to check whether global biological differences underlie
common pathological features in different types of cancer datasets and to identify genes that might anticipate
the clinical behavior of this disease. Gene expression data contain large amounts of raw data that must
be analyzed, here with a feature subset selection model, to obtain useful information for specific biological and
medical applications. One way of finding relevant (and removing redundant) genes is by using the
Bayesian network based on the Markov blanket [1]. We present and compare the performance of
different approaches to feature (gene) subset selection based on wrapper and Markov blanket
models for five microarray cancer datasets. The first approach uses memetic algorithms (MAs)
for feature selection. The second approach uses MRMR (Minimum Redundancy Maximum
Relevance) for feature subset selection, hybridized with genetic search optimization techniques, and afterwards
compares the Markov blanket model’s performance with the most common classical classification
algorithms on the selected set of features.
For the memetic algorithm, we present a comparison between two embedded approaches for feature subset
selection which are the wrapper filter for feature selection algorithm (WFFSA) and Markov Blanket
Embedded Genetic Algorithm (MBEGA). The memetic algorithm depends on genetic operators (crossover,
mutation) and a dedicated local search procedure. For comparison, we rely on two evaluation
techniques for training and testing data: 10-fold cross-validation and 30-Bootstrapping. The
results for the memetic algorithm clearly show that MBEGA often outperforms WFFSA by yielding
more significant differentiation among the microarray cancer datasets.
In the second part of this paper, we focus mainly on MRMR feature subset selection and the
Bayesian network based on the Markov blanket (MB) model, which are useful for building a good predictor and
defying the curse of dimensionality to improve prediction performance. These methods cover a wide range
of concerns: providing a better definition of the objective function, feature construction, feature ranking,
efficient search methods, and feature validity assessment methods as well as defining the relationships
among attributes to make predictions.
We present performance measures for some common (or classical) classification algorithms (Naive
Bayes, support vector machine [LIBSVM], K-nearest neighbor, and AdaBoostM ensembling) before and
after using the MRMR method. We compare the performance of the Bayesian network classification algorithm based on the
Markov blanket model with the performance of these common classification
algorithms. The classification algorithm based on the Bayesian network
of the Markov blanket model achieves higher accuracy rates than the classical classification algorithms
on the cancer microarray datasets.
Bayesian networks clearly depend on relationships among attributes to make predictions. The
Markov blanket (MB) of a variable in a Bayesian network provides all the
information necessary for predicting its value.
In this paper, we recommend the Bayesian network based on the Markov blanket for learning and classification, which is highly effective and efficient on
feature subset selection measures.
Master of Science (MSc) in Computational Science
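To make the Markov blanket idea in this abstract concrete: in a Bayesian network, the blanket of a node consists of its parents, its children, and the other parents of its children. The sketch below extracts it from a directed acyclic graph. The function name, the graph representation, and the toy node names are hypothetical illustrations, not the thesis's implementation or data.

```python
def markov_blanket(parents, target):
    """Return the Markov blanket of `target` in a DAG.

    `parents` maps every node to the set of its parent nodes.
    """
    # Children are the nodes that list the target among their parents.
    children = {node for node, ps in parents.items() if target in ps}
    # Spouses are the other parents of those children.
    spouses = {p for child in children for p in parents[child]}
    blanket = set(parents.get(target, set())) | children | spouses
    blanket.discard(target)  # the target is not part of its own blanket
    return blanket


# Hypothetical toy network; node names are invented for illustration.
toy_parents = {
    "class": set(),
    "g1": {"class"},
    "g2": {"class", "g3"},
    "g3": set(),
    "g4": {"g1"},
}
blanket = markov_blanket(toy_parents, "class")
print(sorted(blanket))
```

In this toy graph, `g4` falls outside the blanket of `class`: given the blanket, it carries no further information about the class, which is why blanket membership can serve as a feature selection criterion.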
Automating biomedical data science through tree-based pipeline optimization
Over the past decade, data science and machine learning have grown from a
mysterious art form to a staple tool across a variety of fields in academia,
business, and government. In this paper, we introduce the concept of tree-based
pipeline optimization for automating one of the most tedious parts of machine
learning---pipeline design. We implement a Tree-based Pipeline Optimization
Tool (TPOT) and demonstrate its effectiveness on a series of simulated and
real-world genetic data sets. In particular, we show that TPOT can build
machine learning pipelines that achieve competitive classification accuracy and
discover novel pipeline operators---such as synthetic feature
constructors---that significantly improve classification accuracy on these data
sets. We also highlight the current challenges to pipeline optimization, such
as the tendency to produce pipelines that overfit the data, and suggest future
research paths to overcome these challenges. As such, this work represents an
early step toward fully automating machine learning pipeline design.
Comment: 16 pages, 5 figures, to appear in EvoBIO 2016 proceedings
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science
As the field of data science continues to grow, there will be an
ever-increasing demand for tools that make machine learning accessible to
non-experts. In this paper, we introduce the concept of tree-based pipeline
optimization for automating one of the most tedious parts of machine
learning---pipeline design. We implement an open source Tree-based Pipeline
Optimization Tool (TPOT) in Python and demonstrate its effectiveness on a
series of simulated and real-world benchmark data sets. In particular, we show
that TPOT can design machine learning pipelines that provide a significant
improvement over a basic machine learning analysis while requiring little to no
input or prior knowledge from the user. We also address the tendency for TPOT
to design overly complex pipelines by integrating Pareto optimization, which
produces compact pipelines without sacrificing classification accuracy. As
such, this work represents an important step toward fully automating machine
learning pipeline design.
Comment: 8 pages, 5 figures, preprint to appear in GECCO 2016, edits not yet
made from reviewer comments
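The Pareto optimization mentioned in this abstract can be pictured as keeping only those pipelines that no other pipeline beats on both accuracy and size. The sketch below is a generic non-dominated filter under that assumption. It does not use TPOT's API, and the function name, pipeline names, and scores are hypothetical.

```python
def pareto_front(candidates):
    """Keep candidates not dominated on (higher accuracy, fewer operators).

    `candidates` is a list of (name, accuracy, n_operators) tuples.
    """
    front = []
    for name, acc, size in candidates:
        # Another candidate dominates if it is at least as good on both
        # objectives and strictly better on at least one of them.
        dominated = any(
            a >= acc and s <= size and (a > acc or s < size)
            for _, a, s in candidates
        )
        if not dominated:
            front.append((name, acc, size))
    return front


# Hypothetical pipelines: (name, accuracy, number of operators).
pipelines = [("A", 0.90, 5), ("B", 0.90, 2), ("C", 0.85, 1), ("D", 0.80, 3)]
front = pareto_front(pipelines)
print([name for name, _, _ in front])
```

Here pipeline A is dropped because B reaches the same accuracy with fewer operators, which mirrors the abstract's claim of compact pipelines without sacrificing accuracy.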