29 research outputs found

    Clustering with shallow trees

    We propose a new method for hierarchical clustering based on the optimisation of a cost function over trees of limited depth, and we derive a message-passing method that solves it efficiently. The method and algorithm can be interpreted as a natural interpolation between two well-known approaches, namely single linkage and the recently presented Affinity Propagation. Using this general scheme, we analyze three structured biological/medical datasets (human population based on genetic information, proteins based on sequences, and verbal autopsies) and show that the interpolation technique provides new insight. Comment: 11 pages, 7 figures

    MetNet: Software to Build and Model the Biogenetic Lattice of Arabidopsis

    MetNet (http://www.botany.iastate.edu/∼mash/metnetex/metabolicnetex.html) is publicly available software, currently in development, for the analysis of genome-wide RNA, protein and metabolite profiling data. The software is designed to enable the biologist to visualize, statistically analyse and model a metabolic and regulatory network map of Arabidopsis, combined with gene expression profiling data. It contains a Java interface to an interactions database (MetNetDB) containing information on regulatory and metabolic interactions derived from a combination of web databases (TAIR, KEGG, BRENDA) and input from biologists in their area of expertise. FCModeler captures input from MetNetDB in graphical form. Sub-networks can be identified and interpreted using simple fuzzy cognitive maps. FCModeler is intended to develop and evaluate hypotheses, and to provide a modelling framework for assessing the large amounts of data captured by high-throughput gene expression experiments. FCModeler and MetNetDB are currently being extended to three-dimensional virtual reality display. The MetNet map, together with gene expression data, can be viewed using multivariate graphics tools in GGobi linked with the data analytic tools in R. Users can highlight different parts of the metabolic network and see the relevant expression data highlighted in other data plots. Multi-dimensional expression data can be rotated through different dimensions. Statistical analysis can be computed alongside the visualization. MetNet is designed to provide a framework for the formulation of testable hypotheses regarding the function of specific genes and, in the long term, to provide the basis for identification of metabolic and regulatory networks that control plant composition and development.

    CarGene: Characterisation of sets of genes based on metabolic pathways analysis

    The great amount of available biological information provides scientists with an incomparable framework for testing the results of new algorithms. Several tools have been developed for analysing gene enrichment, and most of them are Gene Ontology-based. We developed a Kyoto Encyclopedia of Genes and Genomes (KEGG)-based tool that provides a friendly graphical environment for analysing gene enrichment. The tool integrates two statistical corrections and simultaneously analyses information about many groups of genes in both visual and textual form. We tested the usefulness of our approach on a previous analysis (Huttenshower et al.). Furthermore, our tool is freely available (http://www.upo.es/eps/bigs/cargene.html). Funding: Ministerio de Ciencia y Tecnología TIN2007-68084-C02-00; Ministerio de Ciencia e Innovación PCI2006-A7-0575; Junta de Andalucía P07-TIC-02611; Junta de Andalucía TIC-20

    Genex: a conditional independence based hybrid model for the analysis of gene expression data

    Gene expression microarrays have resulted in a vast pool of data which is still not being utilized to its full potential. While current methods allow for considerable reliability in measuring the change in a gene's expression in response to a set of conditions, relationships between genes are usually avoided due to the high dimensionality associated with this data type. Broadly speaking, there are two major types of exploratory analyses conducted on such relationships. The first is the category of exploratory clustering algorithms. Pioneered by Michael Eisen in 1998, this includes the software Cluster, which performs a hierarchical clustering analysis on the basis of pair-wise correlations. While useful due to its ease of interpretation and user-friendly software, Cluster does not take higher order relationships into account and as a result can be misleading. The second category is that of network models. Commonly used models are Bayesian networks and several types of Gaussian models. Network models take higher order relationships into account and, in general, improve the signal-to-noise ratio. The potential drawback is the complexity of visual representation, which makes interpretation extremely difficult. Since the results are not forced into a dendrogram structure, but are represented as points in multivariate space, it can be extremely challenging to draw useful inferences in the absence of explicit a priori information. We build a hybrid model that attempts to combine the key features of both types of approaches. We construct a hierarchical dendrogram from a conditional independence network model, retaining the ease of interpretation inherent in clustering algorithms while preserving the benefits of a network model, namely the consideration of higher order relationships and the improvement of the signal-to-noise ratio. Presently limited to datasets of about 500 genes, the approach is probably most useful for smaller microarrays conducted after a key set of significantly expressed genes has been identified from a genome-wide microarray experiment.

    Nearest Neighbor Networks: clustering expression data based on gene neighborhoods

    Background: The availability of microarrays measuring thousands of genes simultaneously across hundreds of biological conditions represents an opportunity to understand both individual biological pathways and the integrated workings of the cell. However, translating this amount of data into biological insight remains a daunting task. An important initial step in the analysis of microarray data is clustering of genes with similar behavior. A number of classical techniques are commonly used to perform this task, particularly hierarchical and K-means clustering, and many novel approaches have been suggested recently. While these approaches are useful, they are not without drawbacks; these methods can find clusters in purely random data, and even clusters enriched for biological functions can be skewed towards a small number of processes (e.g. ribosomes). Results: We developed Nearest Neighbor Networks (NNN), a graph-based algorithm to generate clusters of genes with similar expression profiles. This method produces clusters based on overlapping cliques within an interaction network generated from mutual nearest neighborhoods. This focus on nearest neighbors rather than on absolute distance measures allows us to capture clusters with high connectivity even when they are spatially separated, and requiring mutual nearest neighbors allows genes with no sufficiently similar partners to remain unclustered. We compared the clusters generated by NNN with those generated by eight other clustering methods. NNN was particularly successful at generating functionally coherent clusters with high precision, and these clusters generally represented a much broader selection of biological processes than those recovered by other methods. Conclusion: The Nearest Neighbor Networks algorithm is a valuable clustering method that effectively groups genes that are likely to be functionally related. It is particularly attractive due to its simplicity, its success in the analysis of large datasets, and its ability to span a wide range of biological functions with high precision.
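The mutual-nearest-neighbor idea described in this abstract can be illustrated with a minimal sketch. It covers only the first stage (building the interaction network); the overlapping-clique step that NNN uses to form clusters is omitted. The function names, the choice of 1 minus Pearson correlation as the distance, and the neighborhood size `k` are assumptions made for illustration, not details taken from the paper.

```python
import math


def pearson_distance(x, y):
    """Return 1 - Pearson correlation of two equal-length profiles.

    The distance measure is an assumption of this sketch.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 1.0  # treat a flat profile as uncorrelated with everything
    return 1.0 - cov / math.sqrt(var_x * var_y)


def mutual_nearest_neighbor_edges(profiles, k):
    """Return edges (i, j) with i < j where each gene is among the
    other's k nearest neighbors. Genes with no mutual partner get no
    edge and therefore stay unclustered.
    """
    n = len(profiles)
    neighbors = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: pearson_distance(profiles[i], profiles[j]))
        neighbors.append(set(others[:k]))
    return {
        (i, j)
        for i in range(n)
        for j in neighbors[i]
        if i < j and i in neighbors[j]
    }
```

With `k = 1`, a gene whose closest partner prefers a different gene receives no edge, which mirrors the abstract's point that genes without sufficiently similar partners can remain unclustered.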

    Human promoter genomic composition demonstrates non-random groupings that reflect general cellular function

    BACKGROUND: The purpose of this study is to determine whether there exists a nonrandom grouping of cis-regulatory elements within gene promoters that can be perceived independently of gene expression data, and whether there is any correlation between this grouping and the biological function of the gene. RESULTS: Using ProSpector, a web-based promoter search and annotation tool, we have applied an unbiased approach to analyze the transcription factor binding site frequencies of 1400 base pair genomic segments positioned at 1200 base pairs upstream and 200 base pairs downstream of the transcriptional start site of 7298 commonly studied human genes. Partitional clustering of the transcription factor binding site composition within these promoter segments reveals a small number of gene groups that are selectively enriched for gene ontology terms consistent with distinct aspects of cellular function. Significance ranking of the class-determining transcription factor binding sites within these clusters shows substantial overlap between the gene ontology terms of the transcription factors associated with the binding sites and the gene ontology terms of the regulated genes within each group. CONCLUSION: Thus, gene sorting by promoter composition alone produces partitions in which the "regulated" and the "regulators" cosegregate into similar functional classes. These findings demonstrate that the transcription factor binding site composition is non-randomly distributed between gene promoters in a manner that reflects and partially defines general gene class function.

    Estimating genomic coexpression networks using first-order conditional independence

    We describe a computationally efficient statistical framework for estimating networks of coexpressed genes. This framework exploits first-order conditional independence relationships among gene-expression measurements to estimate patterns of association. We use this approach to estimate a coexpression network from microarray gene-expression measurements from Saccharomyces cerevisiae. We demonstrate the biological utility of this approach by showing that a large number of metabolic pathways are coherently represented in the estimated network. We describe a complementary unsupervised graph search algorithm for discovering locally distinct subgraphs of a large weighted graph. We apply this algorithm to our coexpression network model and show that subgraphs found using this approach correspond to particular biological processes or contain representatives of distinct gene families.
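As a rough illustration of first-order conditional independence, the sketch below keeps an edge between two genes only if their correlation, and every partial correlation conditioned on a single third gene, stays above a cutoff in absolute value. It uses the standard first-order partial correlation formula. The fixed `threshold` is an assumption standing in for a proper statistical test, and the function names are hypothetical; this is not the authors' implementation.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y given z.

    Undefined when |r_xz| or |r_yz| equals 1; this sketch does not
    guard against that case.
    """
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def first_order_edges(profiles, threshold):
    """Return edges (i, j), i < j, whose marginal correlation and all
    first-order partial correlations exceed `threshold` in magnitude.
    The hard threshold is an illustrative assumption.
    """
    n = len(profiles)
    r = [[pearson(profiles[i], profiles[j]) for j in range(n)] for i in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            strengths = [abs(r[i][j])]
            for z in range(n):
                if z != i and z != j:
                    strengths.append(
                        abs(partial_correlation(r[i][j], r[i][z], r[j][z]))
                    )
            if min(strengths) > threshold:
                edges.add((i, j))
    return edges
```

In this scheme, two genes that are correlated only because both track a third gene lose their direct edge, since their partial correlation given that third gene is close to zero.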