A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics
The combination of multiple classifiers using ensemble methods is
increasingly important for making progress in a variety of difficult prediction
problems. We present a comparative analysis of several ensemble methods through
two case studies in genomics, namely the prediction of genetic interactions and
protein functions, to demonstrate their efficacy on real-world datasets and
draw useful conclusions about their behavior. These methods include simple
aggregation, meta-learning, cluster-based meta-learning, and ensemble selection
using heterogeneous classifiers trained on resampled data to improve the
diversity of their predictions. We present a detailed analysis of these methods
across 4 genomics datasets and find that the best of these methods offer
statistically significant improvements over the state of the art in their
respective domains. In addition, we establish a novel connection between
ensemble selection and meta-learning, demonstrating how both of these disparate
methods establish a balance between ensemble diversity and performance.
Comment: 10 pages, 3 figures, 8 tables, to appear in Proceedings of the 2013
International Conference on Data Minin
Multiscale, multimodal analysis of tumor heterogeneity in IDH1 mutant vs wild-type diffuse gliomas.
Glioma is recognized to be a highly heterogeneous CNS malignancy whose diverse cellular composition and cellular interactions have not been well characterized. To gain new clinical and biological insights into the genetically bifurcated IDH1 mutant (mt) vs wild-type (wt) forms of glioma, we integrated protein, genomic and MR imaging data from 20 treatment-naïve glioma cases and 16 recurrent GBM cases. Multiplexed immunofluorescence (MxIF) was used to generate single-cell data for 43 protein markers representing all cancer hallmarks. Genomic sequencing data (exome and RNA; normal and tumor) and quantitative magnetic resonance imaging (MRI) features (T1-post, FLAIR and ADC protocols) from whole tumor, peritumoral edema and enhancing core vs an equivalent normal region were also collected from patients. Based on MxIF analysis, 85,767 cells (glioma cases) and 56,304 cells (GBM cases) were used to generate cell-level data for 24 biomarkers. K-means clustering was used to generate 7 distinct groups of cells with divergent biomarker profiles, and deconvolution was used to assign RNA data into three classes. Spatial and molecular heterogeneity metrics were generated for the cell data. All features were compared between IDHmt and IDHwt patients and were finally combined to provide a holistic, integrated comparison. Protein expression by hallmark was generally lower in IDHmt than in IDHwt patients. Molecular and spatial heterogeneity scores for angiogenesis and cell invasion also differed between IDHmt and IDHwt gliomas irrespective of prior treatment and tumor grade; these differences also persisted in the MR imaging features of peritumoral edema and contrast enhancement volumes. A coherent picture of enhanced angiogenesis in IDHwt tumors was derived from multiple platforms (genomic, proteomic and imaging) and scales, from individual proteins to cell clusters and heterogeneity, as well as bulk tumor RNA and imaging features.
Longer overall survival for IDH1mt glioma patients may reflect mutation-driven alterations in cellular, molecular, and spatial heterogeneity which manifest as discernible radiological features.
Detection of regulator genes and eQTLs in gene networks
Genetic differences between individuals associated with quantitative phenotypic
traits, including disease states, are usually found in non-coding genomic
regions. These genetic variants are often also associated with differences in
expression levels of nearby genes (they are "expression quantitative trait
loci" or eQTLs for short) and presumably play a gene regulatory role, affecting
the status of molecular networks of interacting genes, proteins and
metabolites. Computational systems biology approaches to reconstruct causal
gene networks from large-scale omics data have therefore become essential to
understand the structure of networks controlled by eQTLs together with other
regulatory genes, and to generate detailed hypotheses about the molecular
mechanisms that lead from genotype to phenotype. Here we review the main
analytical methods and software to identify eQTLs and their associated genes,
to reconstruct co-expression networks and modules, to reconstruct causal
Bayesian gene and module networks, and to validate predicted networks in
silico.
Comment: minor revision with typos corrected; review article; 24 pages, 2
figure
Recovering complete and draft population genomes from metagenome datasets.
Assembly of metagenomic sequence data into microbial genomes is of fundamental value to improving our understanding of microbial ecology and metabolism by elucidating the functional potential of hard-to-culture microorganisms. Here, we provide a synthesis of available methods to bin metagenomic contigs into species-level groups and highlight how genetic diversity, sequencing depth, and coverage influence binning success. Despite the computational cost of applying them to deeply sequenced complex metagenomes (e.g., soil), covarying patterns of contig coverage across multiple datasets significantly improve the binning process. We also discuss and compare current genome validation methods and reveal how these methods tackle the problem of chimeric genome bins, i.e., bins that contain sequences from multiple species. Finally, we explore how population genome assembly can be used to uncover biogeographic trends and to characterize the effect of in situ functional constraints on the genome-wide evolution
Asterias: a parallelized web-based suite for the analysis of expression and aCGH data
Asterias (http://www.asterias.info) is an integrated collection of
freely-accessible web tools for the analysis of gene expression and aCGH data.
Most of the tools use parallel computing (via MPI). Most of our applications
allow the user to obtain additional information for user-selected genes by
using clickable links in tables and/or figures. Our tools include:
normalization of expression and aCGH data; converting between different types
of gene/clone and protein identifiers; filtering and imputation; finding
differentially expressed genes related to patient class and survival data;
searching for models of class prediction; using random forests to search for
minimal models for class prediction or for large subsets of genes with
predictive capacity; searching for molecular signatures and predictive genes
with survival data; detecting regions of genomic DNA gain or loss. The
capability to send results between different applications, access to additional
functional information, and parallelized computation make our suite unique and
exploit features available only to web-based applications.
Comment: web based application; 3 figure
clValid: An R Package for Cluster Validation
The R package clValid contains functions for validating the results of a clustering analysis. There are three main types of cluster validation measures available, "internal", "stability", and "biological". The user can choose from nine clustering algorithms in existing R packages, including hierarchical, K-means, self-organizing maps (SOM), and model-based clustering. In addition, we provide a function to perform the self-organizing tree algorithm (SOTA) method of clustering. Any combination of validation measures and clustering methods can be requested in a single function call. This allows the user to simultaneously evaluate several clustering algorithms while varying the number of clusters, to help determine the most appropriate method and number of clusters for the dataset of interest. Additionally, the package can automatically make use of the biological information contained in the Gene Ontology (GO) database to calculate the biological validation measures, via the annotation packages available in Bioconductor. The function returns an object of S4 class "clValid", which has summary, plot, print, and additional methods which allow the user to display the optimal validation scores and extract clustering results.
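The idea of scoring several cluster counts with an internal validation measure can be illustrated with a minimal Python sketch. This is not clValid's R interface; the function names, the toy data, and the deterministic farthest-point K-means initialisation are all assumptions made for illustration, and silhouette width is used here simply as one common internal measure.

```python
import math

def kmeans(points, k, iters=50):
    """Tiny K-means with deterministic farthest-point initialisation (illustrative only)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from all centres chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centre (ties go to the lower index).
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's centre
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def silhouette_width(points, labels):
    """Average silhouette width; singleton clusters score 0 by convention. Needs >= 2 clusters."""
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy data: three well-separated groups of four 2-D points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (0, 10), (0, 11), (1, 10), (1, 11)]

# Score each candidate number of clusters and keep the best one.
scores = {k: silhouette_width(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On this toy data the highest silhouette width is obtained with three clusters, matching how the points were constructed. A fuller comparison would loop over several clustering algorithms and several validation measures in the same way.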
Noise resistant generalized parametric validity index of clustering for gene expression data
This article has been made available through the Brunel Open Access Publishing Fund.
Validity indices have been investigated for decades. However, since there is no study of the noise-resistance performance of these indices in the literature, there is no guideline for determining the best clustering in noisy data sets, especially microarray data sets. In this paper, we propose a generalized parametric validity (GPV) index which employs two tunable parameters α and β to control the proportions of objects being considered to calculate the dissimilarities. The greatest advantage of the proposed GPV index is its noise-resistance ability, which results from the flexibility of tuning the parameters. Several rules are set to guide the selection of parameter values. To illustrate the noise-resistance performance of the proposed index, we evaluate the GPV index for assessing five clustering algorithms in two gene expression data simulation models with different noise levels and compare its ability to determine the number of clusters with that of eight existing indices. We also test the GPV index on three groups of real gene expression data sets. The experimental results suggest that the proposed GPV index has superior noise-resistance ability and provides fairly accurate judgements