Search CORE

8,488 research outputs found

Recommended from our members

A computer system to perform structure comparison using TOPS representations of protein structure

Author: Gilbert D
Thornton J
Viksna J
Westhead V
Publication venue: 'Elsevier BV'
Publication date: 01/01/2001
Field of study

We describe the design and implementation of a fast topology–based method for protein structure comparison. The approach uses the TOPS topological representation of protein structure, aligning two structures using a common discovered pattern and generating measure of distance derived from an insert score. Heavy use is made of a constraint-based pattern matching algorithm for TOPS diagrams that we have designed and described elsewhere Gilbert et al. (1999). The comparison system is maintained at the European Bioinformatics Institute and is available over the Web via the at tops.ebi.ac.uk/tops. Users submit a structure description in Protein Data Bank (PDB) format and can compare it with structures in the entire PDB or a representative subset of protein domains, receiving the results by email

Brunel University Research Archive

Recommended from our members

Topology-based protein structure comparison using a pattern discovery technique

Author: Gilbert D
Thornton J
Viksna J
Westhead D
Publication venue: 'University of Birmingham Library Services'
Publication date: 01/01/2000
Field of study

Brunel University Research Archive

MotifCluster: an interactive online tool for clustering and visualizing sequences using shared motifs

Author: Hamady Micah
Jeremy Widmann
Micah Hamady
Rob Knight
Shelley D Copley
Publication venue: BioMed Central
Publication date: 01/01/2008
Field of study

MotifCluster finds related motifs in a set of sequences and clusters the sequences into families using the motifs they contain

Crossref

Springer

Springer - Publisher Connector

PubMed Central

eScholarship - University of California

Graph-based methods for large-scale protein classification and orthology inference

Author: Kuzniar A.
Publication venue: S.n.
Publication date: 01/01/2009
Field of study

The quest for understanding how proteins evolve and function has been a prominent and costly human endeavor. With advances in genomics and use of bioinformatics tools, the diversity of proteins in present day genomes can now be studied more efficiently than ever before. This thesis describes computational methods suitable for large-scale protein classification of many proteomes of diverse species. Specifically, we focus on methods that combine unsupervised learning (clustering) techniques with the knowledge of molecular phylogenetics, particularly that of orthology. In chapter 1 we introduce the biological context of protein structure, function and evolution, review the state-of-the-art sequence-based protein classification methods, and then describe methods used to validate the predictions. Finally, we present the outline and objectives of this thesis. Evolutionary (phylogenetic) concepts are instrumental in studying subjects as diverse as the diversity of genomes, cellular networks, protein structures and functions, and functional genome annotation. In particular, the detection of orthologous proteins (genes) across genomes provides reliable means to infer biological functions and processes from one organism to another. Chapter 2 evaluates the available computational tools, such as algorithms and databases, used to infer orthologous relationships between genes from fully sequenced genomes. We discuss the main caveats of large-scale orthology detection in general as well as the merits and pitfalls of each method in particular. We argue that establishing true orthologous relationships requires a phylogenetic approach which combines both trees and graphs (networks), reliable species phylogeny, genomic data for more than two species, and an insight into the processes of molecular evolution. Also proposed is a set of guidelines to aid researchers in selecting the correct tool. Moreover, this review motivates further research in developing reliable and scalable methods for functional and phylogenetic classification of large protein collections. Chapter 3 proposes a framework in which various protein knowledge-bases are combined into unique network of mappings (links), and hence allows comparisons to be made between expert curated and fully-automated protein classifications from a single entry point. We developed an integrated annotation resource for protein orthology, ProGMap (Protein Group Mappings, http://www.bioinformatics.nl/progmap), to help researchers and database annotators who often need to assess the coherence of proposed annotations and/or group assignments, as well as users of high throughput methodologies (e.g., microarrays or proteomics) who deal with partially annotated genomic data. ProGMap is based on a non-redundant dataset of over 6.6 million protein sequences which is mapped to 240,000 protein group descriptions collected from UniProt, RefSeq, Ensembl, COG, KOG, OrthoMCL-DB, HomoloGene, TRIBES and PIRSF using a fast and fully automated sequence-based mapping approach. The ProGMap database is equipped with a web interface that enables queries to be made using synonymous sequence identifiers, gene symbols, protein functions, and amino acid or nucleotide sequences. It incorporates also services, namely BLAST similarity search and QuickMatch identity search, for finding sequences similar (or identical) to a query sequence, and tools for presenting the results in graphic form. Graphs (networks) have gained an increasing attention in contemporary biology because they have enabled complex biological systems and processes to be modeled and better understood. For example, protein similarity networks constructed of all-versus-all sequence comparisons are frequently used to delineate similarity groups, such as protein families or orthologous groups in comparative genomics studies. Chapter 4.1 presents a benchmark study of freely available graph software used for this purpose. Specifically, the computational complexity of the programs is investigated using both simulated and biological networks. We show that most available software is not suitable for large networks, such as those encountered in large-scale proteome analyzes, because of the high demands on computational resources. To address this, we developed a fast and memory-efficient graph software, netclust (http://www.bioinformatics.nl/netclust/), which can scale to large protein networks, such as those constructed of millions of proteins and sequence similarities, on a standard computer. An extended version of this program called Multi-netclust is presented in chapter 4.2. This tool that can find connected clusters of data presented by different network data sets. It uses user-defined threshold values to combine the data sets in such a way that clusters connected in all or in either of the networks can be retrieved efficiently. Automated protein sequence clustering is an important task in genome annotation projects and phylogenomic studies. During the past years, several protein clustering programs have been developed for delineating protein families or orthologous groups from large sequence collections. However, most of these programs have not been benchmarked systematically, in particular with respect to the trade-off between computational complexity and biological soundness. In chapter 5 we evaluate three best known algorithms on different protein similarity networks and validation (or 'gold' standard) data sets to find out which one can scale to hundreds of proteomes and still delineate high quality similarity groups at the minimum computational cost. For this, a reliable partition-based approach was used to assess the biological soundness of predicted groups using known protein functions, manually curated protein/domain families and orthologous groups available in expert-curated databases. Our benchmark results support the view that a simple and computationally cheap method such as netclust can perform similar to and in cases even better than more sophisticated, yet much more costly methods. Moreover, we introduce an efficient graph-based method that can delineate protein orthologs of hundreds of proteomes into hierarchical similarity groups de novo. The validity of this method is demonstrated on data obtained from 347 prokaryotic proteomes. The resulting hierarchical protein classification is not only in agreement with manually curated classifications but also provides an enriched framework in which the functional and evolutionary relationships between proteins can be studied at various levels of specificity. Finally, in chapter 6 we summarize the main findings and discuss the merits and shortcomings of the methods developed herein. We also propose directions for future research. The ever increasing flood of new sequence data makes it clear that we need improved tools to be able to handle and extract relevant (orthological) information from these protein data. This thesis summarizes these needs and how they can be addressed by the available tools, or be improved by the new tools that were developed in the course of this research. <br/

Wageningen University & Research Publications

Finding the Core-Genes of Chloroplasts

Author: AlKindy Bassam
Bahi Jacques M.
Couchot Jean-François
Guyeux Christophe
Mouly Arnaud
Salomon Michel
Publication venue
Publication date: 01/01/2014
Field of study

Due to the recent evolution of sequencing techniques, the number of available genomes is rising steadily, leading to the possibility to make large scale genomic comparison between sets of close species. An interesting question to answer is: what is the common functionality genes of a collection of species, or conversely, to determine what is specific to a given species when compared to other ones belonging in the same genus, family, etc. Investigating such problem means to find both core and pan genomes of a collection of species, \textit{i.e.}, genes in common to all the species vs. the set of all genes in all species under consideration. However, obtaining trustworthy core and pan genomes is not an easy task, leading to a large amount of computation, and requiring a rigorous methodology. Surprisingly, as far as we know, this methodology in finding core and pan genomes has not really been deeply investigated. This research work tries to fill this gap by focusing only on chloroplastic genomes, whose reasonable sizes allow a deep study. To achieve this goal, a collection of 99 chloroplasts are considered in this article. Two methodologies have been investigated, respectively based on sequence similarities and genes names taken from annotation tools. The obtained results will finally be evaluated in terms of biological relevance

arXiv.org e-Print Archive

HAL - Université de Franche-Comté

Differential gene expression graphs: A data structure for classification in DNA microarrays

Author: Benso Alfredo
Di Carlo Stefano
Politano Gianfranco Michele Maria
Sterpone Luca
Publication venue: IEEE Press
Publication date: 01/01/2008
Field of study

This paper proposes an innovative data structure to be used as a backbone in designing microarray phenotype sample classifiers. The data structure is based on graphs and it is built from a differential analysis of the expression levels of healthy and diseased tissue samples in a microarray dataset. The proposed data structure is built in such a way that, by construction, it shows a number of properties that are perfectly suited to address several problems like feature extraction, clustering, and classificatio

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

Assembly of an interactive correlation network for the Arabidopsis genome using a novel heuristic clustering algorithm

Author: Ebenhoeh Oliver
Loraine Ann
Mutwil Marek
Persson Staffan
Schütte Moritz
Usadel Björn
Publication venue: 'American Society of Plant Biologists (ASPB)'
Publication date: 01/01/2010
Field of study

Peer reviewedPublisher PD

Aberdeen University Research

Crossref

PubMed Central

MPG.PuRe