34 research outputs found

    Online Transitivity Clustering of Biological Data with Missing Values


    Large scale clustering of protein sequences with FORCE - A layout based heuristic for weighted cluster editing

    Wittkop T, Baumbach J, Lobo FP, Rahmann S. Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics. 2007;8(1):396. Background: Detecting groups of functionally related proteins from their amino acid sequence alone has been a long-standing challenge in computational genome research. Several clustering approaches, following different strategies, have been published to attack this problem. Today, new sequencing technologies provide huge amounts of sequence data that have to be clustered efficiently, with constant or increased accuracy and at increased speed. Results: We advocate that the model of weighted cluster editing, also known as transitive graph projection, is well suited to protein clustering. We present the FORCE heuristic, which is based on transitive graph projection and clusters arbitrary sets of objects, given pairwise similarity measures. In particular, we apply FORCE to the problem of protein clustering and show that it outperforms the most popular existing clustering tools (Spectral clustering, TribeMCL, GeneRAGE, Hierarchical clustering, and Affinity Propagation). Furthermore, we show that FORCE is able to handle huge datasets by calculating clusters for all 192,187 prokaryotic protein sequences (66 organisms) obtained from the COG database. Finally, FORCE is integrated into the corynebacterial reference database CoryneRegNet. Conclusion: FORCE is an applicable alternative to existing clustering algorithms. Its theoretical foundation, weighted cluster editing, can outperform other clustering paradigms on protein homology clustering. FORCE is open source and implemented in Java. The software, including the source code, the clustering results for COG and CoryneRegNet, and all evaluation datasets, is available at http://gi.cebitec.uni-bielefeld.de/comet/force/
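The weighted cluster editing objective named in this abstract can be illustrated with a small sketch. It is not the FORCE heuristic: it assumes the common formulation in which two objects are linked when their similarity exceeds a threshold, and each pair whose link status must be flipped to reach a clustering is charged the distance of its similarity from that threshold. The function names (`editing_cost`, `partitions`, `best_clustering`), the threshold value, and the choice to treat missing pairs as similarity 0 are all assumptions made for illustration. The exhaustive search is only feasible for a handful of objects, which is why heuristics are needed at scale.

```python
from itertools import combinations


def editing_cost(similarity, threshold, clusters):
    """Cost of editing the thresholded similarity graph into the given clustering.

    similarity: dict mapping unordered pairs (u, v) to a similarity score.
    Assumption: pairs missing from the dict are treated as similarity 0.0.
    """
    label = {node: i for i, cluster in enumerate(clusters) for node in cluster}
    cost = 0.0
    for u, v in combinations(sorted(label), 2):
        s = similarity.get((u, v), similarity.get((v, u), 0.0))
        has_edge = s > threshold          # edge in the similarity graph
        same_cluster = label[u] == label[v]
        if has_edge != same_cluster:      # edge must be deleted or inserted
            cost += abs(s - threshold)
    return cost


def partitions(items):
    """Yield every partition of a list (exponential; tiny inputs only)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


def best_clustering(nodes, similarity, threshold):
    """Brute-force the cheapest clustering; a stand-in for a real heuristic."""
    return min(partitions(list(nodes)),
               key=lambda p: editing_cost(similarity, threshold, p))


if __name__ == "__main__":
    # Hypothetical similarities: a-b and b-c are strong, a-c is weak.
    sim = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.3}
    best = best_clustering(["a", "b", "c"], sim, threshold=0.5)
    print(best, editing_cost(sim, 0.5, best))
```

In this toy input the missing a-c link is cheaper to insert than either strong link is to delete, so the cheapest transitive graph puts all three objects in one cluster.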

    Genetic Correction of Huntington's Disease Phenotypes in Induced Pluripotent Stem Cells

    Summary: Huntington's disease (HD) is caused by a CAG expansion in the huntingtin gene. Expansion of the polyglutamine tract in the huntingtin protein results in massive cell death in the striatum of HD patients. We report that human induced pluripotent stem cells (iPSCs) derived from HD patient fibroblasts can be corrected by the replacement of the expanded CAG repeat with a normal repeat using homologous recombination, and that the correction persists in iPSC differentiation into DARPP-32-positive neurons in vitro and in vivo. Further, correction of the HD-iPSCs normalized pathogenic HD signaling pathways (cadherin, TGF-β, BDNF, and caspase activation) and reversed disease phenotypes such as susceptibility to cell death and altered mitochondrial bioenergetics in neural stem cells. The ability to make patient-specific, genetically corrected iPSCs from HD patients will provide relevant disease models in identical genetic backgrounds and is a critical step for the eventual use of these cells in cell replacement therapy.

    clusterMaker: a multi-algorithm clustering plugin for Cytoscape

    Background: In the post-genomic era, the rapid increase in high-throughput data calls for computational tools capable of integrating data of diverse types and facilitating recognition of biologically meaningful patterns within them. For example, protein-protein interaction data sets have been clustered to identify stable complexes, but scientists lack easily accessible tools to facilitate combined analyses of multiple data sets from different types of experiments. Here we present clusterMaker, a Cytoscape plugin that implements several clustering algorithms and provides network, dendrogram, and heat map views of the results. The Cytoscape network is linked to all of the other views, so that a selection in one is immediately reflected in the others. clusterMaker is the first Cytoscape plugin to implement such a wide variety of clustering algorithms and visualizations, including the only implementations of hierarchical clustering, dendrogram plus heat map visualization (tree view), k-means, k-medoid, SCPS, AutoSOME, and native (Java) MCL. Results: Results are presented in the form of three scenarios of use: analysis of protein expression data using a recently published mouse interactome and a mouse microarray data set of nearly one hundred diverse cell/tissue types; the identification of protein complexes in the yeast Saccharomyces cerevisiae; and the cluster analysis of the vicinal oxygen chelate (VOC) enzyme superfamily. For scenario one, we explore functionally enriched mouse interactomes specific to particular cellular phenotypes and apply fuzzy clustering. For scenario two, we explore the prefoldin complex in detail using both physical and genetic interaction clusters. For scenario three, we explore the possible annotation of a protein as a methylmalonyl-CoA epimerase within the VOC superfamily. 
Cytoscape session files for all three scenarios are provided in the Additional Files section. Conclusions: The Cytoscape plugin clusterMaker provides a number of clustering algorithms and visualizations that can be used independently or in combination for analysis and visualization of biological data sets, and for confirming or generating hypotheses about biological function. Several of these visualizations and algorithms are only available to Cytoscape users through the clusterMaker plugin. clusterMaker is available via the Cytoscape plugin manager.

    Geographic and temporal trends in the molecular epidemiology and genetic mechanisms of transmitted HIV-1 drug resistance: an individual-patient- and sequence-level meta-analysis

    Regional and subtype-specific mutational patterns of HIV-1 transmitted drug resistance (TDR) are essential for informing first-line antiretroviral (ARV) therapy guidelines and designing diagnostic assays for use in regions where standard genotypic resistance testing is not affordable. We sought to understand the molecular epidemiology of TDR and to identify the HIV-1 drug-resistance mutations responsible for TDR in different regions and virus subtypes.

    Clustering biological data by unraveling hidden transitive substructures

    Wittkop T. Clustering biological data by unraveling hidden transitive substructures. Bielefeld (Germany): Bielefeld University; 2010. Clustering is a computational technique for the assignment of objects into groups of similar elements. It is widely used for business data interpretation, natural language analyses, and image processing, to name just a few. Typical bioinformatic applications are: (1) detection of homologous proteins, both single- and multi-domain, (2) prediction of protein complexes in protein-protein interaction networks, (3) identification of overrepresented DNA sequence patterns, and (4) gene co-expression studies. Traditionally, we distinguish between partitional, overlapping, and hierarchical approaches. Partitional and overlapping approaches follow two different strategies: (1) center-based approaches for the detection of appropriate cluster representatives, such as k-means, and (2) methods for the identification of homogeneous clusters, such as Markov Clustering. Hierarchical approaches allow for the construction of a tree structure; single linkage agglomerative clustering may serve as an example here. Solving the following problems is crucial for a successful cluster analysis: (1) Probably most challenging is the identification of a problem-specific similarity function. (2) Every clustering approach incorporates at least one parameter that influences the size and number of the clusters. Determining such a density parameter strongly depends on the problem and the chosen similarity function. Preferably, one can even prove certain attributes of a clustering result, given a similarity function and the density parameter. (3) Currently, high throughput experiments produce huge amounts of data. Hence, a clustering environment has to be capable of processing hundreds of thousands of data objects. (4) The integration of existing knowledge into a cluster analysis is highly valuable for improving the clustering output. 
The integration of known assignments may serve as an example here. (5) The method needs to be robust against noise and outliers. (6) From an end-user's point of view, integration with standard software, appropriate visualization capabilities, and easy-to-use evaluation methods are highly beneficial. This thesis introduces Transitivity Clustering (TC) and its accompanying software framework TransClust, a method which addresses all of the aforementioned problems. It is a homogeneous partitioning method based on Weighted Transitive Graph Projection (WTGP), which aims to unravel hidden transitive substructures in a given similarity graph deduced from a pairwise similarity measure. TC solves the aforementioned problems (2-5). The software implementation TransClust is an easy-to-use standalone and online application that solves the problems mentioned in (1,6). Furthermore, in TC, the density parameter can be chosen intuitively and the underlying weighted transitive graph projection model allows certain criteria of the clustering results to be proven. In addition, the model has been extended in order to allow for the following advanced features: (1) the integration of existing knowledge, for instance, by means of upper and lower bounds, (2) the computation of a hierarchical clustering, and (3) the calculation of overlapping clusterings. These extensions widen the applicability of TC and provide features that distinguish TC from other bioinformatics alternatives. The flexibility of TC makes it suitable for various real-world applications. In this work, we concentrate on protein sequence clustering and the detection of protein complexes in protein-protein interaction networks, showing that TC outperforms the most commonly used bioinformatics clustering techniques. 
The software implementation of TC, TransClust, is available online at http://transclust.cebitec.uni-bielefeld.de as a web application, as a standalone tool, and as a plugin for the standard network analysis tool Cytoscape. It provides results of similar or superior accuracy to those of alternative approaches. It is unique in that it features an easy-to-use clustering environment that contributes to all the important steps in a cluster analysis: (1) the choice and evaluation of a meaningful similarity function, (2) the detection of an appropriate density parameter, (3) the efficient computation of a clustering, and (4) the interpretation and evaluation of the clustering results.

    INTRODUCTION: ADVANCES IN COMPUTATIONAL SYSTEMS BIOINFORMATICS
