7,812 research outputs found
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
Recommended from our members
The effect of missing values using genetic programming on evolvable diagnosis
Medical databases usually contain missing values due the policy of
reducing stress and harm to the patient. In practice missing values has been a
problem mainly due to the necessity to evaluate mathematical equations obtained
by genetic programming. The solution to this problem is to use fill in methods to
estimate the missing values. This paper analyses three fill in methods: (1) attribute
means, (2) conditional means, and (3) random number generation. The methods
are evaluated using sensitivity, specificity, and entropy to explain the exchange in
knowledge of the results. The results are illustrated based on the breast cancer
database. Conditional means produced the best fill in experimental results
Algorithms Implemented for Cancer Gene Searching and Classifications
Understanding the gene expression is an important factor to cancer diagnosis. One target of this understanding is implementing cancer gene search and classification methods. However, cancer gene search and classification is a challenge in that there is no an obvious exact algorithm that can be implemented individually for various cancer cells. In this paper a research is con-ducted through the most common top ranked algorithms implemented for cancer gene search and classification, and how they are implemented to reach a better performance. The paper will distinguish algorithms implemented for Bio image analysis for cancer cells and algorithms implemented based on DNA array data. The main purpose of this paper is to explore a road map towards presenting the most current algorithms implemented for cancer gene search and classification
Elephant Search with Deep Learning for Microarray Data Analysis
Even though there is a plethora of research in Microarray gene expression
data analysis, still, it poses challenges for researchers to effectively and
efficiently analyze the large yet complex expression of genes. The feature
(gene) selection method is of paramount importance for understanding the
differences in biological and non-biological variation between samples. In
order to address this problem, a novel elephant search (ES) based optimization
is proposed to select best gene expressions from the large volume of microarray
data. Further, a promising machine learning method is envisioned to leverage
such high dimensional and complex microarray dataset for extracting hidden
patterns inside to make a meaningful prediction and most accurate
classification. In particular, stochastic gradient descent based Deep learning
(DL) with softmax activation function is then used on the reduced features
(genes) for better classification of different samples according to their gene
expression levels. The experiments are carried out on nine most popular Cancer
microarray gene selection datasets, obtained from UCI machine learning
repository. The empirical results obtained by the proposed elephant search
based deep learning (ESDL) approach are compared with most recent published
article for its suitability in future Bioinformatics research.Comment: 12 pages, 5 Tabl
Gene Expression based Survival Prediction for Cancer Patients: A Topic Modeling Approach
Cancer is one of the leading cause of death, worldwide. Many believe that
genomic data will enable us to better predict the survival time of these
patients, which will lead to better, more personalized treatment options and
patient care. As standard survival prediction models have a hard time coping
with the high-dimensionality of such gene expression (GE) data, many projects
use some dimensionality reduction techniques to overcome this hurdle. We
introduce a novel methodology, inspired by topic modeling from the natural
language domain, to derive expressive features from the high-dimensional GE
data. There, a document is represented as a mixture over a relatively small
number of topics, where each topic corresponds to a distribution over the
words; here, to accommodate the heterogeneity of a patient's cancer, we
represent each patient (~document) as a mixture over cancer-topics, where each
cancer-topic is a mixture over GE values (~words). This required some
extensions to the standard LDA model eg: to accommodate the "real-valued"
expression values - leading to our novel "discretized" Latent Dirichlet
Allocation (dLDA) procedure. We initially focus on the METABRIC dataset, which
describes breast cancer patients using the r=49,576 GE values, from
microarrays. Our results show that our approach provides survival estimates
that are more accurate than standard models, in terms of the standard
Concordance measure. We then validate this approach by running it on the
Pan-kidney (KIPAN) dataset, over r=15,529 GE values - here using the mRNAseq
modality - and find that it again achieves excellent results. In both cases, we
also show that the resulting model is calibrated, using the recent
"D-calibrated" measure. These successes, in two different cancer types and
expression modalities, demonstrates the generality, and the effectiveness, of
this approach
Defining a robust biological prior from Pathway Analysis to drive Network Inference
Inferring genetic networks from gene expression data is one of the most
challenging work in the post-genomic era, partly due to the vast space of
possible networks and the relatively small amount of data available. In this
field, Gaussian Graphical Model (GGM) provides a convenient framework for the
discovery of biological networks. In this paper, we propose an original
approach for inferring gene regulation networks using a robust biological prior
on their structure in order to limit the set of candidate networks.
Pathways, that represent biological knowledge on the regulatory networks,
will be used as an informative prior knowledge to drive Network Inference. This
approach is based on the selection of a relevant set of genes, called the
"molecular signature", associated with a condition of interest (for instance,
the genes involved in disease development). In this context, differential
expression analysis is a well established strategy. However outcome signatures
are often not consistent and show little overlap between studies. Thus, we will
dedicate the first part of our work to the improvement of the standard process
of biomarker identification to guarantee the robustness and reproducibility of
the molecular signature.
Our approach enables to compare the networks inferred between two conditions
of interest (for instance case and control networks) and help along the
biological interpretation of results. Thus it allows to identify differential
regulations that occur in these conditions. We illustrate the proposed approach
by applying our method to a study of breast cancer's response to treatment
- …