1,255 research outputs found

    Automated Protein Structure Classification: A Survey

    Full text link
    Classification of proteins based on their structure provides a valuable resource for studying protein structure, function and evolutionary relationships. With the rapidly increasing number of known protein structures, manual and semi-automatic classification is becoming ever more difficult and prohibitively slow. Therefore, there is a growing need for automated, accurate and efficient classification methods to generate classification databases or increase the speed and accuracy of semi-automatic techniques. Recognizing this need, several automated classification methods have been developed. In this survey, we overview recent developments in this area. We classify different methods based on their characteristics and compare their methodology, accuracy and efficiency. We then present a few open problems and explain future directions.Comment: 14 pages, Technical Report CSRG-589, University of Toront

    Towards Reliable Automatic Protein Structure Alignment

    Full text link
    A variety of methods have been proposed for structure similarity calculation, which are called structure alignment or superposition. One major shortcoming in current structure alignment algorithms is in their inherent design, which is based on local structure similarity. In this work, we propose a method to incorporate global information in obtaining optimal alignments and superpositions. Our method, when applied to optimizing the TM-score and the GDT score, produces significantly better results than current state-of-the-art protein structure alignment tools. Specifically, if the highest TM-score found by TMalign is lower than (0.6) and the highest TM-score found by one of the tested methods is higher than (0.5), there is a probability of (42%) that TMalign failed to find TM-scores higher than (0.5), while the same probability is reduced to (2%) if our method is used. This could significantly improve the accuracy of fold detection if the cutoff TM-score of (0.5) is used. In addition, existing structure alignment algorithms focus on structure similarity alone and simply ignore other important similarities, such as sequence similarity. Our approach has the capacity to incorporate multiple similarities into the scoring function. Results show that sequence similarity aids in finding high quality protein structure alignments that are more consistent with eye-examined alignments in HOMSTRAD. Even when structure similarity itself fails to find alignments with any consistency with eye-examined alignments, our method remains capable of finding alignments highly similar to, or even identical to, eye-examined alignments.Comment: Peer-reviewed and presented as part of the 13th Workshop on Algorithms in Bioinformatics (WABI2013

    Family classification without domain chaining

    Get PDF
    Motivation: Classification of gene and protein sequences into homologous families, i.e. sets of sequences that share common ancestry, is an essential step in comparative genomic analyses. This is typically achieved by construction of a sequence homology network, followed by clustering to identify dense subgraphs corresponding to families. Accurate classification of single domain families is now within reach due to major algorithmic advances in remote homology detection and graph clustering. However, classification of multidomain families remains a significant challenge. The presence of the same domain in sequences that do not share common ancestry introduces false edges in the homology network that link unrelated families and stymy clustering algorithms

    Distributed Many-to-Many Protein Sequence Alignment using Sparse Matrices

    Full text link
    Identifying similar protein sequences is a core step in many computational biology pipelines such as detection of homologous protein sequences, generation of similarity protein graphs for downstream analysis, functional annotation and gene location. Performance and scalability of protein similarity searches have proven to be a bottleneck in many bioinformatics pipelines due to increases in cheap and abundant sequencing data. This work presents a new distributed-memory software, PASTIS. PASTIS relies on sparse matrix computations for efficient identification of possibly similar proteins. We use distributed sparse matrices for scalability and show that the sparse matrix infrastructure is a great fit for protein similarity searches when coupled with a fully-distributed dictionary of sequences that allows remote sequence requests to be fulfilled. Our algorithm incorporates the unique bias in amino acid sequence substitution in searches without altering the basic sparse matrix model, and in turn, achieves ideal scaling up to millions of protein sequences.Comment: To appear in International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'20

    Exploring the function and evolution of proteins using domain families

    Get PDF
    Proteins are frequently composed of multiple domains which fold independently. These are often evolutionarily distinct units which can be adapted and reused in other proteins. The classification of protein domains into evolutionary families facilitates the study of their evolution and function. In this thesis such classifications are used firstly to examine methods for identifying evolutionary relationships (homology) between protein domains. Secondly a specific approach for predicting their function is developed. Lastly they are used in studying the evolution of protein complexes. Tools for identifying evolutionary relationships between proteins are central to computational biology. They aid in classifying families of proteins, giving clues about the function of proteins and the study of molecular evolution. The first chapter of this thesis concerns the effectiveness of cutting edge methods in identifying evolutionary relationships between protein domains. The identification of evolutionary relationships between proteins can give clues as to their function. The second chapter of this thesis concerns the development of a method to identify proteins involved in the same biological process. This method is based on the concept of domain fusion whereby pairs of proteins from one organism with a concerted function are sometimes found fused into single proteins in a different organism. Using protein domain classifications it is possible to identify these relationships. Most proteins do not act in isolation but carry out their function by binding to other proteins in complexes; little is understood about the evolution of such complexes. In the third chapter of this thesis the evolution of complexes is examined in two representative model organisms using protein domain families. In this work, protein domain superfamilies allow distantly related parts of complexes to be identified in order to determine how homologous units are reused

    CATHEDRAL: A Fast and Effective Algorithm to Predict Folds and Domain Boundaries from Multidomain Protein Structures

    Get PDF
    We present CATHEDRAL, an iterative protocol for determining the location of previously observed protein folds in novel multidomain protein structures. CATHEDRAL builds on the features of a fast secondary-structure–based method (using graph theory) to locate known folds within a multidomain context and a residue-based, double-dynamic programming algorithm, which is used to align members of the target fold groups against the query protein structure to identify the closest relative and assign domain boundaries. To increase the fidelity of the assignments, a support vector machine is used to provide an optimal scoring scheme. Once a domain is verified, it is excised, and the search protocol is repeated in an iterative fashion until all recognisable domains have been identified. We have performed an initial benchmark of CATHEDRAL against other publicly available structure comparison methods using a consensus dataset of domains derived from the CATH and SCOP domain classifications. CATHEDRAL shows superior performance in fold recognition and alignment accuracy when compared with many equivalent methods. If a novel multidomain structure contains a known fold, CATHEDRAL will locate it in 90% of cases, with <1% false positives. For nearly 80% of assigned domains in a manually validated test set, the boundaries were correctly delineated within a tolerance of ten residues. For the remaining cases, previously classified domains were very remotely related to the query chain so that embellishments to the core of the fold caused significant differences in domain sizes and manual refinement of the boundaries was necessary. To put this performance in context, a well-established sequence method based on hidden Markov models was only able to detect 65% of domains, with 33% of the subsequent boundaries assigned within ten residues. Since, on average, 50% of newly determined protein structures contain more than one domain unit, and typically 90% or more of these domains are already classified in CATH, CATHEDRAL will considerably facilitate the automation of protein structure classification

    CATHe: Detection of remote homologues for CATH superfamilies using embeddings from protein language models

    Get PDF
    MOTIVATION: CATH is a protein domain classification resource that exploits an automated workflow of structure and sequence comparison alongside expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorithms for detecting remote homologues missed by state-of-the-art HMM-based approaches. The method developed (CATHe) combines a neural network with sequence representations obtained from protein Language Models. It was assessed using a dataset of remote homologues having less than 20% sequence identity to any domain in the training set. RESULTS: The CATHe models trained on 1773 largest and 50 largest CATH superfamilies had an accuracy of 85.6 ± 0.4%, and 98.2 ± 0.3% respectively. As a further test of the power of CATHe to detect more remote homologues missed by HMMs derived from CATH domains, we used a dataset consisting of protein domains that had annotations in Pfam, but not in CATH. By using highly reliable CATHe predictions (expected error rate <0.5%), we were able to provide CATH annotations for 4.62 million Pfam domains. For a subset of these domains from Homo sapiens, we structurally validated 90.86% of the predictions by comparing their corresponding AlphaFold 2 structures with structures from the CATH superfamilies to which they were assigned. AVAILABILITY AND IMPLEMENTATION: The code for the developed models can be found on https://github.com/vam-sin/CATHe. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online

    Graph theory-based sequence descriptors as remote homology predictors

    Get PDF
    Indexación: Scopus.Alignment-free (AF) methodologies have increased in popularity in the last decades as alternative tools to alignment-based (AB) algorithms for performing comparative sequence analyses. They have been especially useful to detect remote homologs within the twilight zone of highly diverse gene/protein families and superfamilies. The most popular alignment-free methodologies, as well as their applications to classification problems, have been described in previous reviews. Despite a new set of graph theory-derived sequence/structural descriptors that have been gaining relevance in the detection of remote homology, they have been omitted as AF predictors when the topic is addressed. Here, we first go over the most popular AF approaches used for detecting homology signals within the twilight zone and then bring out the state-of-the-art tools encoding graph theory-derived sequence/structure descriptors and their success for identifying remote homologs. We also highlight the tendency of integrating AF features/measures with the AB ones, either into the same prediction model or by assembling the predictions from different algorithms using voting/weighting strategies, for improving the detection of remote signals. Lastly, we briefly discuss the efforts made to scale up AB and AF features/measures for the comparison of multiple genomes and proteomes. Alongside the achieved experiences in remote homology detection by both the most popular AF tools and other less known ones, we provide our own using the graphical–numerical methodologies, MARCH-INSIDE, TI2BioP, and ProtDCal. We also present a new Python-based tool (SeqDivA) with a friendly graphical user interface (GUI) for delimiting the twilight zone by using several similar criteria.https://www.mdpi.com/2218-273X/10/1/2

    SECOM: A Novel Hash Seed and Community Detection Based-Approach for Genome-Scale Protein Domain Identification

    Get PDF
    With rapid advances in the development of DNA sequencing technologies, a plethora of high-throughput genome and proteome data from a diverse spectrum of organisms have been generated. The functional annotation and evolutionary history of proteins are usually inferred from domains predicted from the genome sequences. Traditional database-based domain prediction methods cannot identify novel domains, however, and alignment-based methods, which look for recurring segments in the proteome, are computationally demanding. Here, we propose a novel genome-wide domain prediction method, SECOM. Instead of conducting all-against-all sequence alignment, SECOM first indexes all the proteins in the genome by using a hash seed function. Local similarity can thus be detected and encoded into a graph structure, in which each node represents a protein sequence and each edge weight represents the shared hash seeds between the two nodes. SECOM then formulates the domain prediction problem as an overlapping community-finding problem in this graph. A backward graph percolation algorithm that efficiently identifies the domains is proposed. We tested SECOM on five recently sequenced genomes of aquatic animals. Our tests demonstrated that SECOM was able to identify most of the known domains identified by InterProScan. When compared with the alignment-based method, SECOM showed higher sensitivity in detecting putative novel domains, while it was also three orders of magnitude faster. For example, SECOM was able to predict a novel sponge-specific domain in nucleoside-triphosphatase (NTPases). Furthermore, SECOM discovered two novel domains, likely of bacterial origin, that are taxonomically restricted to sea anemone and hydra. SECOM is an open-source program and available at http://sfb.kaust.edu.sa/Pages/Software.aspx
    corecore