Gene expression-based prediction of malignancies
Molecular classification of malignancies can potentially stratify patients into distinct subclasses not detectable using traditional classification of tumors, opening new perspectives on the diagnosis and personalized therapy of polygenic diseases. In this paper we present a brief overview of our work on gene expression-based prediction of malignancies, starting from the dichotomous classification problem of normal versus tumoural tissues and moving to multiclass cancer diagnosis, functional class discovery, and gene selection problems. The last part of this work presents preliminary results on the application of ensembles of SVMs, based on the bias-variance decomposition of the error, to the analysis of gene expression data from malignant tissues.
Kernel methods in genomics and computational biology
Support vector machines and kernel methods are increasingly popular in
genomics and computational biology, due to their good performance in real-world
applications and strong modularity that makes them suitable to a wide range of
problems, from the classification of tumors to the automatic annotation of
proteins. Their ability to work in high dimension, to process non-vectorial
data, and the natural framework they provide to integrate heterogeneous data
are particularly relevant to various problems arising in computational biology.
In this chapter we survey some of the most prominent applications published so
far, highlighting the particular developments in kernel methods triggered by
problems in biology, and mention a few promising research directions likely to
expand in the future.
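As a minimal, self-contained illustration of the "non-vectorial data" point above (not taken from the chapter itself), a k-spectrum kernel compares sequences through their shared k-mer counts, letting kernel machines work on strings without an explicit vector representation; the sequences and k below are invented:

```python
from collections import Counter

def spectrum_kernel(s, t, k=2):
    """k-spectrum kernel: inner product of the k-mer count vectors of
    two strings. Works on variable-length sequences directly."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[m] * ct[m] for m in cs)

# similarity between two short toy sequences
print(spectrum_kernel("ACGTAC", "ACGT", k=2))  # 4
```

The kernel is symmetric and positive semi-definite, so it can be plugged into any kernel method, such as an SVM, unchanged.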
Assessment of SVM Reliability for Microarray Data Analysis
The goal of our research is to provide techniques that can assess and validate the results of SVM-based analysis of microarray data. We present preliminary results on the effect of mislabeled training samples. We conducted several systematic experiments on artificial and real medical data using SVMs, systematically flipping the labels of a fraction of the training data. We show that a relatively small number of mislabeled examples can dramatically decrease the performance, as visualized on the ROC graphs. This phenomenon persists even if the dimensionality of the input space is drastically decreased, for example by using feature selection. Moreover, we show that for SVM recursive feature elimination, even a small fraction of mislabeled samples can completely change the resulting set of genes. This work is an extended version of the previous paper [MBN04].
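A toy sketch of the mislabeling effect on gene selection described above, using a simple class-mean-difference filter as a hypothetical stand-in for SVM recursive feature elimination; the synthetic data, flip fraction, and gene counts are all invented for illustration:

```python
import random

def rank_genes(X, y, k):
    """Score each gene by the absolute difference of its class means
    (a simple filter stand-in for SVM-RFE) and keep the top k."""
    n_genes = len(X[0])
    scores = []
    for g in range(n_genes):
        m0 = [x[g] for x, yi in zip(X, y) if yi == 0]
        m1 = [x[g] for x, yi in zip(X, y) if yi == 1]
        scores.append(abs(sum(m0) / len(m0) - sum(m1) / len(m1)))
    return sorted(range(n_genes), key=lambda g: -scores[g])[:k]

random.seed(1)
# 40 samples x 200 genes, mostly noise with a weak class signal in genes 0-19
y = [0] * 20 + [1] * 20
X = [[random.gauss(0.3 * yi if g < 20 else 0.0, 1.0) for g in range(200)]
     for yi in y]

genes_clean = rank_genes(X, y, 10)

# flip 4 of the 40 training labels and re-select
y_flip = list(y)
for i in random.sample(range(40), 4):
    y_flip[i] = 1 - y_flip[i]
genes_noisy = rank_genes(X, y_flip, 10)

overlap = len(set(genes_clean) & set(genes_noisy))
print(genes_clean, genes_noisy, overlap)
```

With weak signals and few samples, the two selected gene sets typically differ, which is the instability phenomenon the paper reports for SVM-RFE.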
Random subspace ensembles for the bio-molecular diagnosis of tumors.
The bio-molecular diagnosis of malignancies, based on DNA
microarray biotechnologies, is a difficult learning task, because of the
high dimensionality and low cardinality of the data. Many supervised
learning techniques, among them support vector machines (SVMs), have
been applied to this task, often combined with feature selection methods to reduce the
dimensionality of the data. In this paper we investigate an alternative
approach based on random subspace ensemble methods. The high dimensionality
of the data is reduced by randomly sampling subsets of
features (gene expression levels), and accuracy is improved by aggregating
the resulting base classifiers. Our experiments, in the area of the
diagnosis of malignancies at bio-molecular level, show the effectiveness
of the proposed approach.
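The random subspace idea can be sketched as follows; a nearest-centroid base learner stands in for the SVMs used in the paper, and the cleanly separated toy "expression matrix" is invented for illustration:

```python
import random

def fit_centroids(X, y, feats):
    """Train one base learner: per-class centroids on a feature subset."""
    cents = {}
    for c in set(y):
        pts = [[x[g] for g in feats] for x, yi in zip(X, y) if yi == c]
        cents[c] = [sum(col) / len(pts) for col in zip(*pts)]
    return cents

def predict_one(cents, feats, x):
    xs = [x[g] for g in feats]
    return min(cents, key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(cents[c], xs)))

def random_subspace_ensemble(X, y, n_views=11, dim=5, seed=0):
    """Sample random feature subsets, train a base learner on each,
    and aggregate by majority vote."""
    rng = random.Random(seed)
    views = [rng.sample(range(len(X[0])), dim) for _ in range(n_views)]
    models = [(fit_centroids(X, y, f), f) for f in views]
    def predict(x):
        votes = [predict_one(c, f, x) for c, f in models]
        return max(set(votes), key=votes.count)
    return predict

# deterministic toy data: 40 samples x 50 genes, class 1 shifted by +2.0
y = [0] * 20 + [1] * 20
X = [[((i * g) % 7) / 10 + 2.0 * y[i] for g in range(50)] for i in range(40)]

predict = random_subspace_ensemble(X, y)
acc = sum(predict(x) == yi for x, yi in zip(X, y)) / len(y)
print(acc)  # 1.0 on this cleanly separated toy data
```

Each base learner sees only a low-dimensional random view of the genes, and aggregation recovers (and on real data typically improves) accuracy.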
Ensemble deep learning: A review
Ensemble learning combines several individual models to obtain better generalization performance. Currently, deep learning models with multilayer processing architectures are showing better performance than shallow or traditional classification models. Deep ensemble learning models combine the advantages of both deep learning and ensemble learning, so that the final model has better generalization performance. This paper reviews the state-of-the-art deep ensemble models and hence serves as an extensive summary for researchers. The ensemble models are broadly categorised into bagging, boosting, and stacking ensembles; negative-correlation-based deep ensemble models; explicit/implicit ensembles; homogeneous/heterogeneous ensembles; decision fusion strategies; and unsupervised, semi-supervised, reinforcement learning, online/incremental, and multilabel deep ensemble models. The application of deep ensemble models in different domains is also briefly discussed. Finally, we conclude this paper with some future recommendations and research directions.
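One of the decision fusion strategies such reviews cover, soft voting, can be sketched in a few lines; the per-model class probabilities below are invented:

```python
def soft_vote(prob_lists):
    """Decision fusion by averaging the class-probability vectors
    produced by the individual models (soft voting)."""
    n = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg

# three hypothetical models' class probabilities for one sample
label, avg = soft_vote([[0.6, 0.4], [0.4, 0.6], [0.9, 0.1]])
print(label, avg)  # 0 [0.6333..., 0.3666...]
```

Averaging probabilities (rather than hard labels) lets a confident model outvote two mildly disagreeing ones, as in this example.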
Computational Methods for the Analysis of Genomic Data and Biological Processes
In recent decades, new technologies have made remarkable progress in helping to understand biological systems. Rapid advances in genomic profiling techniques such as microarrays or high-throughput sequencing have brought new opportunities and challenges in the fields of computational biology and bioinformatics. Such genetic sequencing techniques allow large amounts of data to be produced, whose analysis and cross-integration could provide a complete view of organisms. As a result, it is necessary to develop new techniques and algorithms that carry out an analysis of these data with reliability and efficiency. This Special Issue collected the latest advances in the field of computational methods for the analysis of gene expression data and, in particular, the modeling of biological processes. Here we present eleven works selected for publication in this Special Issue due to their interest, quality, and originality.
Model order selection for bio-molecular data clustering
Background: Cluster analysis has been widely applied for investigating structure in bio-molecular data. A drawback of most clustering algorithms is that they cannot automatically detect the "natural" number of clusters underlying the data, and in many cases we lack sufficient "a priori" biological knowledge to evaluate both the number of clusters and their validity. Recently, several methods based on the concept of stability have been proposed to estimate the "optimal" number of clusters, but despite their successful application to the analysis of complex bio-molecular data, the assessment of the statistical significance of the discovered clustering solutions and the detection of multiple structures simultaneously present in high-dimensional bio-molecular data are still major problems. Results: We propose a stability method based on randomized maps that exploits the high dimensionality and relatively low cardinality that characterize bio-molecular data, by selecting subsets of randomized linear combinations of the input variables, and by using stability indices based on the overall distribution of similarity measures between multiple pairs of clusterings performed on the randomly projected data. A χ²-based statistical test is proposed to assess the significance of the clustering solutions and to detect significant and, where present, multi-level structures simultaneously present in the data (e.g. hierarchical structures).
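One ingredient of such stability methods, a relabeling-invariant similarity between two clusterings, can be sketched as the fraction of sample pairs on which the clusterings agree (a simplified index, not the paper's exact measure):

```python
from itertools import combinations

def comembership_similarity(a, b):
    """Fraction of sample pairs on which two clusterings agree
    (both put the pair together, or both keep it apart).
    Invariant to how cluster labels are named."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

# identical partitions under different label names -> perfectly stable
print(comembership_similarity([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# conflicting partitions -> low similarity
print(comembership_similarity([0, 0, 1, 1], [0, 1, 0, 1]))
```

A stability method then aggregates this similarity over many pairs of clusterings of randomly projected data: a value concentrated near 1.0 for some number of clusters k suggests k is a "natural" model order.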
Ensembles based on Random Projection for gene expression data analysis
In this work we focus on methods to solve classification problems characterized by high dimensionality and low cardinality of the data. These features are relevant in bio-molecular data analysis, and particularly in class prediction with microarray data.
Many methods have been proposed to approach this problem, characterized by the so-called curse of dimensionality (a term introduced by Richard Bellman (9)). Among them are gene selection methods, principal and independent component analysis, and kernel methods.
In this work we propose and experimentally analyze two ensemble methods
based on two randomized techniques for data compression: Random Subspaces
and Random Projections. While Random Subspaces, originally proposed by T.
K. Ho, is a technique related to feature subsampling, Random Projections is a feature
extraction technique motivated by the Johnson-Lindenstrauss theory about
distance preserving random projections.
The randomness underlying the proposed approach leads to diverse sets of extracted features corresponding to low-dimensional subspaces with low metric distortion and approximate preservation of the expected loss of the trained base classifiers.
In the first part of the work we justify our approach with two theoretical results. The first regards unsupervised learning: we prove that a clustering algorithm minimizing a quadratic objective function provides an ε-close solution when applied to data compressed according to the Johnson-Lindenstrauss theory.
The second is related to supervised learning: we prove that polynomial kernels are approximately preserved by Random Projections, up to a degradation proportional to the square of the degree of the polynomial.
In the second part of the work, we propose ensemble algorithms based on Random Subspaces and Random Projections, and we experimentally compare them with single SVMs and other state-of-the-art ensemble methods, using three gene expression data sets: Colon, Leukemia and DLBL-FL (Diffuse Large B-cell and Follicular Lymphoma). The obtained results confirm the effectiveness of the proposed approach.
Moreover, we observed a certain performance degradation of Random Projection methods when the base learners are SVMs with a polynomial kernel of high degree.
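A minimal sketch of the Johnson-Lindenstrauss-style compression underlying this kind of approach, assuming a plain Gaussian projection matrix; the dimensions and seeds below are arbitrary, and this is an illustration rather than the thesis's exact construction:

```python
import math
import random

def random_projection(x, k, seed=0):
    """Project x (length d) to k dimensions with a Gaussian matrix
    scaled by 1/sqrt(k). Using the same seed for two points applies
    the same matrix, so pairwise distances are approximately kept."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0, 1) * xi for xi in x) / math.sqrt(k)
            for _ in range(k)]

random.seed(42)
u = [random.gauss(0, 1) for _ in range(500)]
v = [random.gauss(0, 1) for _ in range(500)]

pu = random_projection(u, 100, seed=7)  # same seed = same matrix
pv = random_projection(v, 100, seed=7)

orig = math.dist(u, v)
proj = math.dist(pu, pv)
print(len(pu), proj / orig)  # distance ratio close to 1
```

An ensemble is obtained by drawing several such matrices (different seeds), training a base classifier on each compressed view, and aggregating the predictions.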
A robust and stable gene selection algorithm based on graph theory and machine learning
Background
Nowadays we are observing an explosion of gene expression data with phenotypes. This enables us to accurately identify genes responsible for certain medical conditions, as well as to classify them as drug targets. Like any other phenotype data in the medical domain, gene expression data with phenotypes also suffer from being a very underdetermined system. In domains with a very large set of features but a very small sample size (e.g. DNA microarray, RNA-seq data, GWAS data, etc.), it is often reported that several contrasting feature subsets may yield nearly equally optimal results. This phenomenon is known as instability. Considering these facts, we have developed a robust and stable supervised gene selection algorithm to select a set of robust and stable genes with better prediction ability from gene expression datasets with phenotypes. Stability and robustness are ensured by class-level and instance-level perturbations, respectively.
Results
We have performed rigorous experimental evaluations using 10 real gene expression microarray datasets with phenotypes. They reveal that our algorithm outperforms the state-of-the-art algorithms with respect to stability and classification accuracy. We have also performed biological enrichment analysis based on gene ontology-biological processes (GO-BP) terms, disease ontology (DO) terms, and biological pathways.
Conclusions
It is indisputable from the results of the performance evaluations that our proposed method is indeed an effective and efficient supervised gene selection algorithm.
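Selection stability of the kind evaluated above can be quantified, in a simplified form, as the average pairwise Jaccard overlap of the gene sets chosen under different data perturbations; the sets below are invented, and this is not necessarily the paper's exact stability measure:

```python
from itertools import combinations

def selection_stability(selections):
    """Average pairwise Jaccard overlap of gene subsets selected under
    different perturbations; 1.0 means perfectly stable selection."""
    sims = [len(a & b) / len(a | b)
            for a, b in combinations(selections, 2)]
    return sum(sims) / len(sims)

# gene sets selected on three hypothetical perturbed versions of the data
print(selection_stability([{1, 2, 3}, {2, 3, 4}, {1, 2, 3}]))
```

A stable algorithm keeps this score close to 1.0 even when instances are resampled or labels are perturbed.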
A jackknife-like method for classification and uncertainty assessment of multi-category tumor samples using gene expression information
Background: The use of gene expression profiling for the classification of human cancer tumors has been widely investigated. Previous studies were successful in distinguishing several tumor types in binary problems. As there are over a hundred types of cancers, and potentially even more subtypes, it is essential to develop multi-category methodologies for molecular classification for any meaningful practical application.
Results: A jackknife-based supervised learning method called the paired-samples test algorithm (PST), coupled with a binary classification model based on linear regression, was proposed and applied to two well-known and challenging datasets consisting of 14 (GCM dataset) and 9 (NCI60 dataset) tumor types. The results showed that the proposed method improved the prediction accuracy of the test samples for the GCM dataset, especially when the t-statistic was used in the primary feature selection. For the NCI60 dataset, the application of PST improved prediction accuracy when the numbers of genes used were relatively small (100 or 200). These improvements made the binary classification method more robust to the gene selection mechanism and to the number of genes used. The overall prediction accuracies were competitive in comparison to the most accurate results obtained by several previous studies on the same datasets with other methods. Furthermore, the relative confidence R(T) provided a unique insight into the sources of the uncertainty shown in the statistical classification and the potential variants within the same tumor type.
Conclusion: We proposed a novel bagging method for the classification and uncertainty assessment of multi-category tumor samples using gene expression information. Its strengths were demonstrated in the application to two benchmark datasets.
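The jackknife resampling idea behind such uncertainty assessment can be sketched on a generic statistic; this is the textbook leave-one-out jackknife standard error, not the authors' PST algorithm:

```python
def jackknife_se(data, stat):
    """Leave-one-out jackknife standard error of a statistic:
    recompute stat on each n-1 subset and aggregate the spread."""
    n = len(data)
    reps = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    mean = sum(reps) / n
    return ((n - 1) / n * sum((r - mean) ** 2 for r in reps)) ** 0.5

data = [2.0, 4.0, 6.0, 8.0]
se = jackknife_se(data, lambda xs: sum(xs) / len(xs))
print(se)  # ~1.291, equal to s/sqrt(n) for the mean
```

The same leave-one-out scheme, applied to a classifier instead of the mean, yields per-sample prediction variability, which is the kind of confidence information the paper's R(T) summarizes.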