9,467 research outputs found
Elephant Search with Deep Learning for Microarray Data Analysis
Even though there is a plethora of research in Microarray gene expression
data analysis, still, it poses challenges for researchers to effectively and
efficiently analyze the large yet complex expression of genes. The feature
(gene) selection method is of paramount importance for understanding the
differences in biological and non-biological variation between samples. In
order to address this problem, a novel elephant search (ES) based optimization
is proposed to select best gene expressions from the large volume of microarray
data. Further, a promising machine learning method is envisioned to leverage
such high dimensional and complex microarray dataset for extracting hidden
patterns inside to make a meaningful prediction and most accurate
classification. In particular, stochastic gradient descent based Deep learning
(DL) with softmax activation function is then used on the reduced features
(genes) for better classification of different samples according to their gene
expression levels. The experiments are carried out on nine most popular Cancer
microarray gene selection datasets, obtained from UCI machine learning
repository. The empirical results obtained by the proposed elephant search
based deep learning (ESDL) approach are compared with most recent published
article for its suitability in future Bioinformatics research.Comment: 12 pages, 5 Tabl
Overview of Random Forest Methodology and Practical Guidance with Emphasis on Computational Biology and Bioinformatics
The Random Forest (RF) algorithm by Leo Breiman has become a
standard data analysis tool in bioinformatics. It has shown excellent performance in settings where the number of variables is much larger than the number of observations, can cope with complex interaction structures as well as highly correlated variables and returns measures of variable importance. This paper synthesizes ten years of RF development with emphasis on applications to bioinformatics and computational biology. Special attention is given to practical aspects such as the selection of parameters, available RF implementations, and important pitfalls and biases of RF and its variable importance measures (VIMs). The paper surveys recent developments of the methodology relevant to bioinformatics as well as some representative examples of RF applications in this context and possible directions for future research
Multi-Target Prediction: A Unifying View on Problems and Methods
Multi-target prediction (MTP) is concerned with the simultaneous prediction
of multiple target variables of diverse type. Due to its enormous application
potential, it has developed into an active and rapidly expanding research field
that combines several subfields of machine learning, including multivariate
regression, multi-label classification, multi-task learning, dyadic prediction,
zero-shot learning, network inference, and matrix completion. In this paper, we
present a unifying view on MTP problems and methods. First, we formally discuss
commonalities and differences between existing MTP problems. To this end, we
introduce a general framework that covers the above subfields as special cases.
As a second contribution, we provide a structured overview of MTP methods. This
is accomplished by identifying a number of key properties, which distinguish
such methods and determine their suitability for different types of problems.
Finally, we also discuss a few challenges for future research
Towards Data-Driven Autonomics in Data Centers
Continued reliance on human operators for managing data centers is a major
impediment for them from ever reaching extreme dimensions. Large computer
systems in general, and data centers in particular, will ultimately be managed
using predictive computational and executable models obtained through
data-science tools, and at that point, the intervention of humans will be
limited to setting high-level goals and policies rather than performing
low-level operations. Data-driven autonomics, where management and control are
based on holistic predictive models that are built and updated using generated
data, opens one possible path towards limiting the role of operators in data
centers. In this paper, we present a data-science study of a public Google
dataset collected in a 12K-node cluster with the goal of building and
evaluating a predictive model for node failures. We use BigQuery, the big data
SQL platform from the Google Cloud suite, to process massive amounts of data
and generate a rich feature set characterizing machine state over time. We
describe how an ensemble classifier can be built out of many Random Forest
classifiers each trained on these features, to predict if machines will fail in
a future 24-hour window. Our evaluation reveals that if we limit false positive
rates to 5%, we can achieve true positive rates between 27% and 88% with
precision varying between 50% and 72%. We discuss the practicality of including
our predictive model as the central component of a data-driven autonomic
manager and operating it on-line with live data streams (rather than off-line
on data logs). All of the scripts used for BigQuery and classification analyses
are publicly available from the authors' website.Comment: 12 pages, 6 figure
Machine learning application for development of a data-driven predictive model able to investigate quality of life scores in a rare disease.
BACKGROUND:Alkaptonuria (AKU) is an ultra-rare autosomal recessive disease caused by a mutation in the homogentisate 1,2-dioxygenase (HGD) gene. One of the main obstacles in studying AKU, and other ultra-rare diseases, is the lack of a standardized methodology to assess disease severity or response to treatment. Quality of Life scores (QoL) are a reliable way to monitor patients' clinical condition and health status. QoL scores allow to monitor the evolution of diseases and assess the suitability of treatments by taking into account patients' symptoms, general health status and care satisfaction. However, more comprehensive tools to study a complex and multi-systemic disease like AKU are needed. In this study, a Machine Learning (ML) approach was implemented with the aim to perform a prediction of QoL scores based on clinical data deposited in the ApreciseKUre, an AKU- dedicated database. METHOD:Data derived from 129 AKU patients have been firstly examined through a preliminary statistical analysis (Pearson correlation coefficient) to measure the linear correlation between 11 QoL scores. The variable importance in QoL scores prediction of 110 ApreciseKUre biomarkers has been then calculated using XGBoost, with K-nearest neighbours algorithm (k-NN) approach. Due to the limited number of data available, this model has been validated using surrogate data analysis. RESULTS:We identified a direct correlation of 6 (age, Serum Amyloid A, Chitotriosidase, Advanced Oxidation Protein Products, S-thiolated proteins and Body Mass Index) out of 110 biomarkers with the QoL health status, in particular with the KOOS (Knee injury and Osteoarthritis Outcome Score) symptoms (Relative Absolute Error (RAE) 0.25). The error distribution of surrogate-model (RAE 0.38) was unequivocally higher than the true-model one (RAE of 0.25), confirming the consistency of our dataset. Our data showed that inflammation, oxidative stress, amyloidosis and lifestyle of patients correlates with the QoL scores for physical status, while no correlation between the biomarkers and patients' mental health was present (RAE 1.1). CONCLUSIONS:This proof of principle study for rare diseases confirms the importance of database, allowing data management and analysis, which can be used to predict more effective treatments
Towards Operator-less Data Centers Through Data-Driven, Predictive, Proactive Autonomics
Continued reliance on human operators for managing data centers is a major
impediment for them from ever reaching extreme dimensions. Large computer
systems in general, and data centers in particular, will ultimately be managed
using predictive computational and executable models obtained through
data-science tools, and at that point, the intervention of humans will be
limited to setting high-level goals and policies rather than performing
low-level operations. Data-driven autonomics, where management and control are
based on holistic predictive models that are built and updated using live data,
opens one possible path towards limiting the role of operators in data centers.
In this paper, we present a data-science study of a public Google dataset
collected in a 12K-node cluster with the goal of building and evaluating
predictive models for node failures. Our results support the practicality of a
data-driven approach by showing the effectiveness of predictive models based on
data found in typical data center logs. We use BigQuery, the big data SQL
platform from the Google Cloud suite, to process massive amounts of data and
generate a rich feature set characterizing node state over time. We describe
how an ensemble classifier can be built out of many Random Forest classifiers
each trained on these features, to predict if nodes will fail in a future
24-hour window. Our evaluation reveals that if we limit false positive rates to
5%, we can achieve true positive rates between 27% and 88% with precision
varying between 50% and 72%.This level of performance allows us to recover
large fraction of jobs' executions (by redirecting them to other nodes when a
failure of the present node is predicted) that would otherwise have been wasted
due to failures. [...
- …