16 research outputs found

    Extracting biologically significant patterns from short time series gene expression data

    Background: Time series gene expression data analysis is widely used to study the dynamics of various cell processes. Most of the time series data available today consist of only a few time points, which makes standard clustering techniques difficult to apply. Results: We developed two new algorithms that are capable of extracting biological patterns from short time-series gene expression data. The two algorithms, ASTRO and MiMeSR, are inspired by the rank order preserving framework and the minimum mean squared residue approach, respectively. However, ASTRO and MiMeSR differ from previous approaches in that they take advantage of the relatively small number of time points in order to reduce the problem from NP-hard to linear. Tested on well-defined short time expression data, we found that our approaches are robust to noise, as well as to random patterns, and that they can correctly detect the temporal expression profile of relevant functional categories. Evaluation of our methods was performed using Gene Ontology (GO) annotations and chromatin immunoprecipitation (ChIP-chip) data. Conclusion: Our approaches generally outperform both standard clustering algorithms and algorithms designed specifically for clustering of short time series gene expression data. Both algorithms are available at http://www.benoslab.pitt.edu/astro/.
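The rank-order idea named in this abstract can be illustrated with a minimal sketch. This is not the ASTRO implementation: the function name `group_by_rank_order` and the toy data are assumptions made purely for illustration. Each gene is keyed by the permutation that sorts its time points, so genes with the same ordering land in one group after a single pass over the data.

```python
from collections import defaultdict


def group_by_rank_order(expression):
    """Group genes whose time points share the same rank ordering.

    `expression` maps a gene name to a list of expression values,
    one per time point. Hypothetical helper, not the ASTRO code.
    """
    groups = defaultdict(list)
    for gene, values in expression.items():
        # Indices of the time points, sorted by increasing expression.
        order = tuple(sorted(range(len(values)), key=lambda t: values[t]))
        groups[order].append(gene)
    return dict(groups)


# Toy data (invented): three genes measured at three time points.
toy = {
    "geneA": [0.1, 0.5, 0.9],
    "geneB": [1.0, 2.0, 7.0],
    "geneC": [3.0, 1.0, 2.0],
}
patterns = group_by_rank_order(toy)
```

With k time points there are at most k! distinct orderings, so for small k the number of possible groups is bounded and the grouping pass is linear in the number of genes.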

    Mining biological information from 3D short time-series gene expression data: the OPTricluster algorithm

    Background: Nowadays, it is possible to collect expression levels of a set of genes from a set of biological samples during a series of time points. Such data have three dimensions, gene-sample-time (GST), and are therefore called 3D microarray gene expression data. To take advantage of the 3D data collected, and to fully understand the biological knowledge hidden in the GST data, novel subspace clustering algorithms have to be developed to effectively address the biological problem in the corresponding space. Results: We developed a subspace clustering algorithm called Order Preserving Triclustering (OPTricluster) for 3D short time-series data mining. OPTricluster is able to identify 3D clusters with coherent evolution from a given 3D dataset using a combinatorial approach on the sample dimension and the order preserving (OP) concept on the time dimension. The fusion of the two methodologies allows one to study similarities and differences between samples in terms of their temporal expression profile. OPTricluster has been successfully applied to four case studies: immune response in mice infected by malaria (Plasmodium chabaudi), systemic acquired resistance in Arabidopsis thaliana, similarities and differences between inner and outer cotyledon in Brassica napus during seed development, and Brassica napus whole seed development. These studies showed that OPTricluster is robust to noise and is able to detect the similarities and differences between biological samples. Conclusions: Our analysis showed that OPTricluster generally outperforms other well-known clustering algorithms such as TRICLUSTER, gTRICLUSTER, and K-means; it is robust to noise and can effectively mine the biological knowledge hidden in 3D short time-series gene expression data.
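A rough sketch of how an order-preserving pattern on the time dimension might be combined with subsets of samples is given below. It is a deliberate simplification and not the OPTricluster algorithm: the function names, the `min_samples` parameter, and the toy data are all assumptions. For each gene, samples are grouped by the rank ordering of their time series, and genes that share the same (sample subset, ordering) pair are collected together.

```python
from collections import defaultdict


def rank_pattern(values):
    """Return the time-point indices sorted by increasing expression."""
    return tuple(sorted(range(len(values)), key=lambda t: values[t]))


def order_preserving_triclusters(data, min_samples=2):
    """Collect genes that share a time ordering across the same samples.

    `data[gene][sample]` is a list of values over time.
    Returns {(samples, pattern): [genes]}. Hypothetical helper.
    """
    clusters = defaultdict(list)
    for gene, by_sample in data.items():
        samples_by_pattern = defaultdict(list)
        for sample, series in by_sample.items():
            samples_by_pattern[rank_pattern(series)].append(sample)
        for pattern, samples in samples_by_pattern.items():
            # Keep only patterns conserved in enough samples.
            if len(samples) >= min_samples:
                clusters[(tuple(sorted(samples)), pattern)].append(gene)
    return dict(clusters)


# Toy gene-sample-time data (invented for illustration).
toy = {
    "g1": {"s1": [1, 2, 3], "s2": [2, 4, 9], "s3": [3, 2, 1]},
    "g2": {"s1": [0, 1, 5], "s2": [1, 3, 4], "s3": [9, 8, 7]},
    "g3": {"s1": [5, 1, 2], "s2": [1, 2, 3], "s3": [6, 0, 3]},
}
triclusters = order_preserving_triclusters(toy)
```

Here `g1` and `g2` rise monotonically in samples `s1` and `s2` but not in `s3`, so they end up in one cluster that also records which samples behave alike.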

    DNA Microarray Data Analysis: A Novel Biclustering Algorithm Approach

    Biclustering algorithms refer to a distinct class of clustering algorithms that perform simultaneous row-column clustering. Biclustering problems arise in DNA microarray data analysis, collaborative filtering, market research, information retrieval, text mining, electoral trends, exchange analysis, and so forth. When dealing with DNA microarray experimental data for example, the goal of biclustering algorithms is to find submatrices, that is, subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities for every condition. In this study, we develop novel biclustering algorithms using basic linear algebra and arithmetic tools. The proposed biclustering algorithms can be used to search for all biclusters with constant values, biclusters with constant values on rows, biclusters with constant values on columns, and biclusters with coherent values from a set of data in a timely manner and without solving any optimization problem. We also show how one of the proposed biclustering algorithms can be adapted to identify biclusters with coherent evolution. The algorithms developed in this study discover all valid biclusters of each type, while almost all previous biclustering approaches will miss some.
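The four bicluster types listed in this abstract can be made concrete with a small checker. The sketch below only classifies a submatrix that has already been selected; it does not search for biclusters and is not the algorithm from the paper. The function name `bicluster_type`, the tolerance `tol`, and the use of the additive model for "coherent values" are assumptions.

```python
def bicluster_type(sub, tol=1e-9):
    """Classify a submatrix (list of equal-length rows) by bicluster type.

    Returns "constant values", "constant rows", "constant columns",
    "coherent values" (additive model), or None. Hypothetical helper.
    """
    rows, cols = len(sub), len(sub[0])
    cells = [(i, j) for i in range(rows) for j in range(cols)]

    def close(a, b):
        return abs(a - b) <= tol

    # Every entry equals the first entry of its row / of its column.
    const_rows = all(close(sub[i][j], sub[i][0]) for i, j in cells)
    const_cols = all(close(sub[i][j], sub[0][j]) for i, j in cells)
    if const_rows and const_cols:
        return "constant values"
    if const_rows:
        return "constant rows"
    if const_cols:
        return "constant columns"
    # Additive model: a_ij = a_i0 + a_0j - a_00 for every cell.
    additive = all(
        close(sub[i][j] - sub[i][0] - sub[0][j] + sub[0][0], 0.0)
        for i, j in cells
    )
    return "coherent values" if additive else None
```

For example, `[[1, 2], [3, 4]]` is classified as having coherent values because each row is the previous row shifted by a constant, whereas `[[1, 2], [3, 9]]` matches none of the four types.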

    An enrichment model for wheat gene annotations using phylogeny, orthology and existing gene ontologies in 9 plant species

    Genome sequencing efforts for the Triticum aestivum genome produce massive amounts of contigs, preliminary assemblies and putative genes/proteins; nevertheless, their annotation is still in its infancy. Given the much larger percentage of annotated genes in other previously sequenced plant genomes such as Arabidopsis thaliana and Oryza sativa, and the known phylogenetic and orthology relationships among these plant species and their corresponding genes, we propose an enrichment model that will further expand the horizon of wheat gene annotations. Our sequence and annotation base includes data from Ensembl Plants for 9 plant species: Aegilops tauschii, Arabidopsis thaliana, Brachypodium distachyon, Brassica rapa, Hordeum vulgare, Oryza sativa subsp. japonica, Sorghum bicolor, Triticum urartu and Zea mays. Orthology relationships between wheat genes and each of the 9 plant species are predicted using an in-house software package. Next, ortholog cliques are identified such that each set of genes within a clique represents pairwise orthologs. Using the phylogenetic distances between wheat and each plant species to quantify the level of confidence for gene ontology assignments within each ortholog clique, new gene annotations are assigned to wheat genes such that either novel or more specific GO terms are associated with those genes. Overall, based on cliques of size 3 or larger, our model enriched the existing gene-GO term associations for 7,838 (8%) wheat genes, of which 2,139 had no previous annotation. For the particular case of ortholog cliques of size 10 (13 in total), where all 10 genes within a clique are tightly connected via pairwise orthology, 85 new and more specific GO terms were identified, which represents a 65% increase compared with the 130 previously known GO terms. These observations are further supported for 4 out of the 10 plant species considered in this work by experimental evidence using expressologs (Patel et al., Plant J. 2012).
    Peer reviewed: Yes. NRC publication: Yes
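The notion of an ortholog clique, a set of genes in which every pair is orthologous, can be checked with a few lines of code. The sketch below is an illustration only and not the in-house software mentioned in the abstract; the function name `is_ortholog_clique` and the gene identifiers are hypothetical.

```python
from itertools import combinations


def is_ortholog_clique(genes, ortholog_pairs):
    """Return True if every pair of `genes` is listed as orthologous.

    `ortholog_pairs` is an iterable of 2-tuples of gene identifiers;
    pair order does not matter. Hypothetical helper.
    """
    known = {frozenset(pair) for pair in ortholog_pairs}
    return all(frozenset((a, b)) in known for a, b in combinations(genes, 2))


# Invented identifiers: one gene each from wheat, rice, maize and barley.
toy_pairs = [
    ("wheat_g1", "rice_g1"),
    ("wheat_g1", "maize_g1"),
    ("rice_g1", "maize_g1"),
    ("wheat_g1", "barley_g1"),
]
full_clique = is_ortholog_clique(["wheat_g1", "rice_g1", "maize_g1"], toy_pairs)
broken_clique = is_ortholog_clique(["wheat_g1", "rice_g1", "barley_g1"], toy_pairs)
```

The second set fails the check because no orthology is recorded between the rice and barley genes, so the three genes are not pairwise orthologs.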

    GOAL : A software tool for assessing biological significance of genes groups

    Background: Modern high throughput experimental techniques such as DNA microarrays often result in large lists of genes. Computational biology tools such as clustering are then used to group together genes based on their similarity in expression profiles. Genes in each group are probably functionally related. The functional relevance among the genes in each group is usually characterized by utilizing available biological knowledge in public databases such as Gene Ontology (GO), KEGG pathways, association between a transcription factor (TF) and its target genes, and/or gene networks. Results: We developed GOAL: Gene Ontology AnaLyzer, a software tool specifically designed for the functional evaluation of gene groups. GOAL implements and supports efficient and statistically rigorous functional interpretations of gene groups through its integration with available GO, TF-gene association data, and association with KEGG pathways. In order to facilitate more specific functional characterization of a gene group, we implement three GO-tree search strategies rather than one as in most existing GO analysis tools. Furthermore, GOAL offers flexibility in deployment. It can be used as a standalone tool, a plug-in to other computational biology tools, or a web server application. Conclusion: We developed a functional evaluation software tool, GOAL, to perform functional characterization of a gene group. GOAL offers three GO-tree search strategies and combines its strength in function integration, portability and visualization, and its flexibility in deployment. Furthermore, GOAL can be used to evaluate and compare gene groups as the output from computational biology tools such as clustering algorithms.
    Peer reviewed: Yes. NRC publication: Yes
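The abstract does not say which statistical test GOAL applies. As an assumption, the sketch below uses the one-sided hypergeometric test, a common choice for asking whether a GO term is over-represented in a gene group. The function name and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, group, overlap):
    """P(X >= overlap) under the hypergeometric distribution.

    population: number of genes in the background set
    annotated:  background genes carrying the GO term
    group:      size of the gene group being tested
    overlap:    genes in the group carrying the GO term
    Assumed test; not confirmed as the one GOAL uses.
    """
    upper = min(annotated, group)
    total = comb(population, group)
    # Sum the probabilities of seeing `overlap` or more annotated genes.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, group - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers (invented): 10 background genes, 3 annotated, group of 3.
p_all_three = hypergeom_enrichment_p(10, 3, 3, 3)
p_at_least_zero = hypergeom_enrichment_p(10, 3, 3, 0)
```

In practice many GO terms are tested at once, so the resulting p-values would normally need a multiple-testing correction before being interpreted.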

    Adsorbate-Dependent Electronic Structure Descriptors for Machine Learning-Driven Binding Energy Predictions in Diverse Single Atom Alloys: A Reductionist Approach

    A long-standing challenge in the design of single atom alloys (SAAs) for catalytic applications is the determination of a feature space that maximally correlates to molecular binding energies per the Sabatier principle. The more representative a feature space is of the underlying binding properties, the greater the predictive capability of a given machine learning (ML) algorithm. Moreover, the greater the diversity and range of SAA impurities/sites examined, the greater the difficulty in arriving at such a predictive feature. In this work, we undertake to examine the degree to which adsorbate electronic structure properties might address this challenge, in a distinct departure from the traditional substrate electronic structure feature construction found in the catalysis literature. Specifically, as a model system, we explore the predictive capacity of the p-orbital projected density of states (PDOS) pertaining to the adsorption of CO molecules on a wide range of SAA substrates, impurity embeddings, and vicinal cuts. This analysis is executed in two parts. First, we explore the degree to which the entire PDOS distribution, in the form of an energy-dependent vector, can predict binding energies. Subsequently, guided by a rigorous intrinsic dimensionality analysis, uniform manifold approximation and projection visualization, and chemical intuition, we are able to reduce the predictive feature space to just three physical quantities based on semicore level properties and charge filling of the adsorbate, as embedded with the PDOS distribution. This near-intrinsic feature space and the PDOS distribution are both shown to provide significant improvements in predictive accuracy when coupled with regression-based ML methods, even when tackling highly diverse chemical datasets. The results of this analysis both further substantiate the transferability characteristics of SAAs and indicate that adsorbate-based electronic structure features (from either relaxed or unrelaxed chemical datasets) are powerful tools in the prediction of catalytic binding energies in such systems. They also underscore the predictive benefit of finding a feature space with a dimension equal to the intrinsic dimensionality of the data that can maximally correlate with the physical property under investigation when employing ML methods in catalysis studies.