4,561 research outputs found

    Partition Decoupling for Multi-gene Analysis of Gene Expression Profiling Data

    We present the extension and application of a new unsupervised statistical learning technique, the Partition Decoupling Method (PDM), to gene expression data. Because it can reveal non-linear and non-convex geometries present in the data, the PDM is an improvement over typical gene expression analysis algorithms, permitting a multi-gene analysis that can reveal phenotypic differences even when the individual genes do not exhibit differential expression. Here, we apply the PDM to publicly available gene expression data sets and demonstrate that we are able to identify cell types and treatments with higher accuracy than is obtained through other approaches. By applying it in a pathway-by-pathway fashion, we demonstrate how the PDM may be used to find sets of mechanistically related genes that discriminate phenotypes. Comment: Revise

    Robust Detection of Hierarchical Communities from Escherichia coli Gene Expression Data

    Determining the functional structure of biological networks is a central goal of systems biology. One approach is to analyze gene expression data to infer a network of gene interactions on the basis of their correlated responses to environmental and genetic perturbations. The inferred network can then be analyzed to identify functional communities. However, commonly used algorithms can yield unreliable results due to experimental noise, algorithmic stochasticity, and the influence of arbitrarily chosen parameter values. Furthermore, the results obtained typically provide only a simplistic view of the network partitioned into disjoint communities and provide no information about the relationships between communities. Here, we present methods to robustly detect coregulated and functionally enriched gene communities and demonstrate their application and validity for Escherichia coli gene expression data. Applying a recently developed community detection algorithm to the network of interactions identified with the context likelihood of relatedness (CLR) method, we show that a hierarchy of network communities can be identified. These communities are significantly enriched for gene ontology (GO) terms, consistent with their representing biologically meaningful groups. Further, analysis of the most significantly enriched communities identified several candidate new regulatory interactions. The robustness of our methods is demonstrated by showing that a core set of functional communities is reliably found when artificial noise, modeling experimental noise, is added to the data. We find that noise mainly acts conservatively, increasing the relatedness required for a network link to be reliably assigned and decreasing the size of the core communities, rather than causing association of genes into new communities. Comment: Due to appear in PLoS Computational Biology. Supplementary Figure S1 was not uploaded but is available by contacting the author. 27 pages, 5 figures, 15 supplementary files

    Mining gene expression data by interpreting principal components

    BACKGROUND: There are many methods for analyzing microarray data that group together genes having similar patterns of expression over all conditions tested. However, in many instances the biologically important goal is to identify relatively small sets of genes that share coherent expression across only some conditions, rather than all or most conditions as required in traditional clustering; e.g. genes that are highly up-regulated and/or down-regulated similarly across only a subset of conditions. Equally important is the need to learn which conditions are the decisive ones in forming such gene sets of interest, and how they relate to diverse conditional covariates, such as disease diagnosis or prognosis. RESULTS: We present a method for automatically identifying such candidate sets of biologically relevant genes using a combination of principal components analysis and information theoretic metrics. To enable easy use of our methods, we have developed a data analysis package that facilitates visualization and subsequent data mining of the independent sources of significant variation present in gene microarray expression datasets (or in any other similarly structured high-dimensional dataset). We applied these tools to two public datasets, and highlight sets of genes most affected by specific subsets of conditions (e.g. tissues, treatments, samples, etc.). Statistically significant associations for highlighted gene sets were shown via global analysis for Gene Ontology term enrichment. Together with covariate associations, the tool provides a basis for building testable hypotheses about the biological or experimental causes of observed variation. CONCLUSION: We provide an unsupervised data mining technique for diverse microarray expression datasets that is distinct from major methods now in routine use. In test uses, this method, based on publicly available gene annotations, appears to identify numerous sets of biologically relevant genes. 
It has proven especially valuable in instances where there are many diverse conditions (tens to hundreds of different tissues or cell types), a situation in which many clustering and ordering algorithms become problematic. This approach also shows promise in other topic domains such as multi-spectral imaging datasets.

    Clustering and Classification Methods for Gene Expression Data Analysis

    Efficient use of the large data sets generated by gene expression microarray experiments requires computerized data analysis approaches. In this chapter we briefly describe and illustrate two broad families of commonly used data analysis methods: class discovery and class prediction methods. A wide range of alternative approaches for clustering and classification of gene expression data is available. While differences in efficiency do exist, none of the well-established approaches is uniformly superior to the others. Choosing an approach requires consideration of the goals of the analysis, the background knowledge, and the specific experimental constraints. The quality of an algorithm is important, but is not in itself a guarantee of the quality of a specific data analysis. Uncertainty, sensitivity analysis and, in the case of classifiers, external validation or cross-validation should be used to support the legitimacy of results of microarray data analyses.

    A cDNA Microarray Gene Expression Data Classifier for Clinical Diagnostics Based on Graph Theory

    Despite great advances in discovering cancer molecular profiles, the proper application of microarray technology to routine clinical diagnostics is still a challenge. Current practices in the classification of microarray data show two main limitations: the reliability of the training data sets used to build the classifiers, and the classifiers' performance, especially when the sample to be classified does not belong to any of the available classes. In this case, state-of-the-art algorithms usually produce a high rate of false positives that, in real diagnostic applications, is unacceptable. To address this problem, this paper presents a new cDNA microarray data classification algorithm that is based on graph theory and is able to overcome most of the limitations of known classification methodologies. The classifier works by analyzing gene expression data organized in an innovative data structure based on graphs, where vertices correspond to genes and edges to gene expression relationships. To demonstrate the novelty of the proposed approach, the authors present an experimental performance comparison between the proposed classifier and several state-of-the-art classification algorithms.

    Transcriptomic and proteomic analyses of Desulfovibrio vulgaris biofilms: carbon and energy flow contribute to the distinct biofilm growth state.

    Background: Desulfovibrio vulgaris Hildenborough is a sulfate-reducing bacterium (SRB) that is intensively studied in the context of metal corrosion and heavy-metal bioremediation, and SRB populations are commonly observed in pipe and subsurface environments as surface-associated populations. In order to elucidate physiological changes associated with biofilm growth at both the transcript and protein level, transcriptomic and proteomic analyses were done on mature biofilm cells and compared to both batch and reactor planktonic populations. The biofilms were cultivated with lactate and sulfate in a continuously fed biofilm reactor, and compared to both batch and reactor planktonic populations. Results: The functional genomic analysis demonstrated that biofilm cells were different compared to planktonic cells, and the majority of altered abundances for genes and proteins were annotated as hypothetical (unknown function), energy conservation, amino acid metabolism, and signal transduction. Genes and proteins that showed similar trends in detected levels were particularly involved in energy conservation such as increases in an annotated ech hydrogenase, formate dehydrogenase, pyruvate:ferredoxin oxidoreductase, and rnf oxidoreductase, and the biofilm cells had elevated formate dehydrogenase activity. Several other hydrogenases and formate dehydrogenases also showed an increased protein level, while decreased transcript and protein levels were observed for putative coo hydrogenase as well as a lactate permease and hyp hydrogenases for biofilm cells. Genes annotated for amino acid synthesis and nitrogen utilization were also predominant changers within the biofilm state. Ribosomal transcripts and proteins were notably decreased within the biofilm cells compared to exponential-phase cells but were not as low as levels observed in planktonic, stationary-phase cells. 
Several putative, extracellular proteins (DVU1012, 1545) were also detected in the extracellular fraction from biofilm cells. Conclusions: Even though both the planktonic and biofilm cells were oxidizing lactate and reducing sulfate, the biofilm cells were physiologically distinct compared to planktonic growth states due to altered abundances of genes/proteins involved in carbon/energy flow and extracellular structures. In addition, average expression values for multiple rRNA transcripts and respiratory activity measurements indicated that biofilm cells were metabolically more similar to exponential-phase cells although biofilm cells are structured differently. The characterization of physiological advantages and constraints of the biofilm growth state for sulfate-reducing bacteria will provide insight into bioremediation applications as well as microbially-induced metal corrosion.

    Comprehensive evaluation of matrix factorization methods for the analysis of DNA microarray gene expression data

    Background: Clustering-based methods for gene expression analysis have been shown to be useful in biomedical applications such as cancer subtype discovery. Among them, matrix factorization (MF) is advantageous for clustering gene expression patterns from DNA microarray experiments, as it efficiently reduces the dimension of gene expression data. Although several MF methods have been proposed for clustering gene expression patterns, a systematic evaluation has not been reported yet. Results: Here we evaluated the clustering performance of orthogonal and non-orthogonal MFs by a total of nine measurements for performance in four gene expression datasets and one well-known dataset for clustering. Specifically, we employed a non-orthogonal MF algorithm, BSNMF (Bi-directional Sparse Non-negative Matrix Factorization), that applies bi-directional sparseness constraints superimposed on non-negative constraints, comprising a few dominantly co-expressed genes and samples together. Non-orthogonal MFs tended to show better clustering-quality and prediction-accuracy indices than orthogonal MFs as well as a traditional method, K-means. Moreover, BSNMF showed improved performance in these measurements. Non-orthogonal MFs including BSNMF also showed good performance in the functional enrichment test using Gene Ontology terms and biological pathways. Conclusions: In conclusion, the clustering performance of orthogonal and non-orthogonal MFs was appropriately evaluated for clustering microarray data by comprehensive measurements. This study showed that non-orthogonal MFs have better performance than orthogonal MFs and K-means for clustering microarray data.

    Principal component tests: applied to temporal gene expression data

    Clustering analysis is a common statistical tool for knowledge discovery. It is mainly conducted when a project is still in the exploratory phase without any a priori hypotheses. However, statistical significance testing between the clusters can help researchers assess whether the classification results from a clustering algorithm need to be improved, even after the cluster number has been determined by a well-established criterion. This is important when we want to identify highly specific patterns through classification. We proposed to use a principal component (PC) test, which is an implementation of an exact F statistic for the measures at multiple endpoints based on elliptical distribution theory, to assess the statistical significance between clusters. A challenge in the implementation is the choice of the number (q) of principal components to be considered, which can severely influence the statistical power of the method. We optimized the determination via validation according to a permutation test based on the clustering to be evaluated. The method was applied to a public dataset in classifying genes according to their temporal gene expression profiles. The results demonstrated that PC testing was useful for determining the optimal number of clusters. https://doi.org/10.1186/1471-2105-10-S1-S2