985 research outputs found
A novel computational framework for fast, distributed computing and knowledge integration for microarray gene expression data analysis
The healthcare burden and suffering due to life-threatening diseases such as cancer would be significantly reduced by the design and refinement of computational interpretation of micro-molecular data collected by bioinformaticians. Rapid technological advancements in the field of microarray analysis, an important component in the design of in-silico molecular medicine methods, have generated enormous amounts of such data, a trend that has been increasing exponentially over the last few years. However, the analysis and handling of these data has become one of the major bottlenecks in the utilization of the technology. The rate of collection of these data has far surpassed our ability to analyze the data for novel, non-trivial, and important knowledge. The high-performance computing platform, and algorithms that utilize its embedded computing capacity, has emerged as a leading technology that can handle such data-intensive knowledge discovery applications.
In this dissertation, we present a novel framework to achieve fast, robust, and accurate (biologically-significant) multi-class classification of gene expression data using distributed knowledge discovery and integration computational routines, specifically for cancer genomics applications. The research presents a unique computational paradigm for the rapid, accurate, and efficient selection of relevant marker genes, while providing parametric controls to ensure flexibility of its application.
The proposed paradigm consists of the following key computational steps: (a) preprocess, normalize the gene expression data; (b) discretize the data for knowledge mining application; (c) partition the data using two proposed methods: partitioning with overlapped windows and adaptive selection; (d) perform knowledge discovery on the partitioned data-spaces for association rule discovery; (e) integrate association rules from partitioned data and knowledge spaces on distributed processor nodes using a novel knowledge integration algorithm; and (f) post-analysis and functional elucidation of the discovered gene rule sets. The framework is implemented on a shared-memory multiprocessor supercomputing environment, and several experimental results are demonstrated to evaluate the algorithms. We conclude with a functional interpretation of the computational discovery routines for enhanced biological physiological discovery from cancer genomics datasets, while suggesting some directions for future research
PMP-SVM: A Hybrid Approach for effective Cancer Diagnosis using Feature Selection and Optimization
Cancer disease is becoming a prominent factor in increasing the death ration over the world due to the late diagnosis. Machine Learning (ML) is playing a vital role in providing computer aided diagnosis models for early diagnosis of cancer. For the diagnosis process the microarray data has its own place. Microarray data contain the genetic information of a patient with a large number of dimensions such as genes with a small sample such as patient details. If the microarray is directly taken without reducing the dimension as the input to any ML model for classification, then Small Sample Size is the resulting issue. So, size of the microarray data needs to be reduces by using either of dimensionality reduction technique or the feature selection technique to increase the model’s performance. In this work, proposed a hybrid model using Principal Component Analysis (PCA), Maximum Relevance Minimum Redundancy (MRMR), Particle Swarm Optimization (PSO) and Support Vector Machine (SVM) for cancer diagnosis. PCA and MRMR is used for feature selection and PSO is applied to get the optimized feature set. Finally, SVM is applied as the classification model. The proposed model is evaluated against multiple cancer microarray datasets to measure the performance in terms of accuracy, precision, recall, and F1 score. Result shows that proposed model performs better than existing state of art model
Tellipsoid: Exploiting inter-gene correlation for improved detection of differential gene expression
Motivation: Algorithms for differential analysis of microarray data are vital
to modern biomedical research. Their accuracy strongly depends on effective
treatment of inter-gene correlation. Correlation is ordinarily accounted for in
terms of its effect on significance cut-offs. In this paper it is shown that
correlation can, in fact, be exploited {to share information across tests},
which, in turn, can increase statistical power.
Results: Vastly and demonstrably improved differential analysis approaches
are the result of combining identifiability (the fact that in most microarray
data sets, a large proportion of genes can be identified a priori as
non-differential) with optimization criteria that incorporate correlation. As a
special case, we develop a method which builds upon the widely used two-sample
t-statistic based approach and uses the Mahalanobis distance as an optimality
criterion. Results on the prostate cancer data of Singh et al. (2002) suggest
that the proposed method outperforms all published approaches in terms of
statistical power.
Availability: The proposed algorithm is implemented in MATLAB and in R. The
software, called Tellipsoid, and relevant data sets are available at
http://www.egr.msu.edu/~desaikeyComment: 19 pages, Submitted to Bioinformatic
High-Performance Cloud Computing: A View of Scientific Applications
Scientific computing often requires the availability of a massive number of
computers for performing large scale experiments. Traditionally, these needs
have been addressed by using high-performance computing solutions and installed
facilities such as clusters and super computers, which are difficult to setup,
maintain, and operate. Cloud computing provides scientists with a completely
new model of utilizing the computing infrastructure. Compute resources, storage
resources, as well as applications, can be dynamically provisioned (and
integrated within the existing infrastructure) on a pay per use basis. These
resources can be released when they are no more needed. Such services are often
offered within the context of a Service Level Agreement (SLA), which ensure the
desired Quality of Service (QoS). Aneka, an enterprise Cloud computing
solution, harnesses the power of compute resources by relying on private and
public Clouds and delivers to users the desired QoS. Its flexible and service
based infrastructure supports multiple programming paradigms that make Aneka
address a variety of different scenarios: from finance applications to
computational science. As examples of scientific computing in the Cloud, we
present a preliminary case study on using Aneka for the classification of gene
expression data and the execution of fMRI brain imaging workflow.Comment: 13 pages, 9 figures, conference pape
En-PaFlower: An Ensemble Approach using PSO and Flower Pollination Algorithm for Cancer Diagnosis
Machine learning now is used across many sectors and provides consistently precise predictions. The machine learning system is able to learn effectively because the training dataset contains examples of previously completed tasks. After learning how to process the necessary data, researchers have proven that machine learning algorithms can carry out the whole work autonomously. In recent years, cancer has become a major cause of the worldwide increase in mortality. Therefore, early detection of cancer improves the chance of a complete recovery, and Machine Learning (ML) plays a significant role in this perspective. Cancer diagnostic and prognosis microarray dataset is available with the biopsy dataset. Because of its importance in making diagnoses and classifying cancer diseases, the microarray data represents a massive amount. It may be challenging to do an analysis on a large number of datasets, though. As a result, feature selection is crucial, and machine learning provides classification techniques. These algorithms choose the relevant features that help build a more precise categorization model. Accurately classifying diseases is facilitated as a result, which aids in disease prevention. This work aims to synthesize existing knowledge on cancer diagnosis using machine learning techniques into a compact report. Current research work aims to propose an ensemble-based machine learning model En-PaFlower using Particle Swarm Optimization (PSO) as the feature selection algorithm, Flower Pollination algorithm (FPA) as the optimization algorithm with the majority voting algorithm. Finally, the performance of the proposed algorithm is evaluated over three different types of cancer disease datasets with accuracy, precision, recall, specificity, and F-1 Score etc as the evaluation parameters. The empirical analysis shows that the proposed methodology shows highest accuracy as 95.65%
Feature Selection and Molecular Classification of Cancer Using Genetic Programming
AbstractDespite important advances in microarray-based molecular classification of tumors, its application in clinical settings remains formidable. This is in part due to the limitation of current analysis programs in discovering robust biomarkers and developing classifiers with a practical set of genes. Genetic programming (GP) is a type of machine learning technique that uses evolutionary algorithm to simulate natural selection as well as population dynamics, hence leading to simple and comprehensible classifiers. Here we applied GP to cancer expression profiling data to select feature genes and build molecular classifiers by mathematical integration of these genes. Analysis of thousands of GP classifiers generated for a prostate cancer data set revealed repetitive use of a set of highly discriminative feature genes, many of which are known to be disease associated. GP classifiers often comprise five or less genes and successfully predict cancer types and subtypes. More importantly, GP classifiers generated in one study are able to predict samples from an independent study, which may have used different microarray platforms. In addition, GP yielded classification accuracy better than or similar to conventional classification methods. Furthermore, the mathematical expression of GP classifiers provides insights into relationships between classifier genes. Taken together, our results demonstrate that GP may be valuable for generating effective classifiers containing a practical set of genes for diagnostic/ prognostic cancer classification
Gene Expression based Survival Prediction for Cancer Patients: A Topic Modeling Approach
Cancer is one of the leading cause of death, worldwide. Many believe that
genomic data will enable us to better predict the survival time of these
patients, which will lead to better, more personalized treatment options and
patient care. As standard survival prediction models have a hard time coping
with the high-dimensionality of such gene expression (GE) data, many projects
use some dimensionality reduction techniques to overcome this hurdle. We
introduce a novel methodology, inspired by topic modeling from the natural
language domain, to derive expressive features from the high-dimensional GE
data. There, a document is represented as a mixture over a relatively small
number of topics, where each topic corresponds to a distribution over the
words; here, to accommodate the heterogeneity of a patient's cancer, we
represent each patient (~document) as a mixture over cancer-topics, where each
cancer-topic is a mixture over GE values (~words). This required some
extensions to the standard LDA model eg: to accommodate the "real-valued"
expression values - leading to our novel "discretized" Latent Dirichlet
Allocation (dLDA) procedure. We initially focus on the METABRIC dataset, which
describes breast cancer patients using the r=49,576 GE values, from
microarrays. Our results show that our approach provides survival estimates
that are more accurate than standard models, in terms of the standard
Concordance measure. We then validate this approach by running it on the
Pan-kidney (KIPAN) dataset, over r=15,529 GE values - here using the mRNAseq
modality - and find that it again achieves excellent results. In both cases, we
also show that the resulting model is calibrated, using the recent
"D-calibrated" measure. These successes, in two different cancer types and
expression modalities, demonstrates the generality, and the effectiveness, of
this approach
Evolutionary algorithms and weighting strategies for feature selection in predictive data mining
The improvements in Deoxyribonucleic Acid (DNA) microarray technology mean
that thousands of genes can be profiled simultaneously in a quick and efficient manner.
DNA microarrays are increasingly being used for prediction and early diagnosis
in cancer treatment. Feature selection and classification play a pivotal role in this
process. The correct identification of an informative subset of genes may directly
lead to putative drug targets. These genes can also be used as an early diagnosis or
predictive tool. However, the large number of features (many thousands) present in
a typical dataset present a formidable barrier to feature selection efforts.
Many approaches have been presented in literature for feature selection in such
datasets. Most of them use classical statistical approaches (e.g. correlation). Classical
statistical approaches, although fast, are incapable of detecting non-linear interactions
between features of interest. By default, Evolutionary Algorithms (EAs)
are capable of taking non-linear interactions into account. Therefore, EAs are very
promising for feature selection in such datasets.
It has been shown that dimensionality reduction increases the efficiency of feature
selection in large and noisy datasets such as DNA microarray data. The two-phase
Evolutionary Algorithm/k-Nearest Neighbours (EA/k-NN) algorithm is a promising
approach that carries out initial dimensionality reduction as well as feature selection
and classification.
This thesis further investigates the two-phase EA/k-NN algorithm and also introduces
an adaptive weights scheme for the k-Nearest Neighbours (k-NN) classifier.
It also introduces a novel weighted centroid classification technique and a correlation
guided mutation approach. Results show that the weighted centroid approach
is capable of out-performing the EA/k-NN algorithm across five large biomedical
datasets. It also identifies promising new areas of research that would complement
the techniques introduced and investigated
Clustering and graph mining techniques for classification of complex structural variations in cancer genomes
For many years, a major question in cancer genomics has been the identification of those variations that can have a functional role in cancer, and distinguish from the majority of genomic changes that have no functional consequences. This is particularly challenging when considering complex chromosomal rearrangements, often composed of multiple DNA breaks, resulting in difficulties in classifying and interpreting them functionally. Despite recent efforts towards classifying structural variants (SVs), more robust statistical frames are needed to better classify these variants and isolate those that derive from specific molecular mechanisms. We present a new statistical approach to analyze SVs patterns from 2392 tumor samples from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium and identify significant recurrence, which can inform relevant mechanisms involved in the biology of tumors. The method is based on recursive KDE clustering of 152,926 SVs, randomization methods, graph mining techniques and statistical measures. The proposed methodology was able not only to identify complex patterns across different cancer types but also to prove them as not random occurrences. Furthermore, a new class of pattern that was not previously described has been identified.Among others, this study has been supported by projects: SAF2017-89450-R (TransTumVar) and PID2020-119797RB-100 (BenchSV) from Science
and Innovation Spanish Minstry. It has also been supported by the Spanish Goverment (contract PID2019-107255GB), Generalitat de Catalunya (contract 2014-SGR-1051) and Universitat Politècnica de Catalunya (45-FPIUPC2018).Peer ReviewedPostprint (published version
- …