13 research outputs found

    Significant speedup of database searches with HMMs by search space reduction with PSSM family models

    Get PDF
    Motivation: Profile hidden Markov models (pHMMs) are currently the most popular modeling concept for protein families. They provide sensitive family descriptors, and sequence database searching with pHMMs has become a standard task in today's genome annotation pipelines. On the downside, searching with pHMMs is computationally expensive

    Automated Genome-Wide Protein Domain Exploration

    Get PDF
    Exploiting the exponentially growing genomics and proteomics data requires high quality, automated analysis. Protein domain modeling is a key area of molecular biology as it unravels the mysteries of evolution, protein structures, and protein functions. A plethora of sequences exist in protein databases with incomplete domain knowledge. Hence this research explores automated bioinformatics tools for faster protein domain analysis. Automated tool chains described in this dissertation generate new protein domain models thus enabling more effective genome-wide protein domain analysis. To validate the new tool chains, the Shewanella oneidensis and Escherichia coli genomes were processed, resulting in a new peptide domain database, detection of poor domain models, and identification of likely new domains. The automated tool chains will require months or years to model a small genome when executing on a single workstation. Therefore the dissertation investigates approaches with grid computing and parallel processing to significantly accelerate these bioinformatics tool chains

    Structator: fast index-based search for RNA sequence-structure patterns

    Get PDF
    Background The secondary structure of RNA molecules is intimately related to their function and often more conserved than the sequence. Hence, the important task of searching databases for RNAs requires to match sequence-structure patterns. Unfortunately, current tools for this task have, in the best case, a running time that is only linear in the size of sequence databases. Furthermore, established index data structures for fast sequence matching, like suffix trees or arrays, cannot benefit from the complementarity constraints introduced by the secondary structure of RNAs. Results We present a novel method and readily applicable software for time efficient matching of RNA sequence-structure patterns in sequence databases. Our approach is based on affix arrays, a recently introduced index data structure, preprocessed from the target database. Affix arrays support bidirectional pattern search, which is required for efficiently handling the structural constraints of the pattern. Structural patterns like stem-loops can be matched inside out, such that the loop region is matched first and then the pairing bases on the boundaries are matched consecutively. This allows to exploit base pairing information for search space reduction and leads to an expected running time that is sublinear in the size of the sequence database. The incorporation of a new chaining approach in the search of RNA sequence-structure patterns enables the description of molecules folding into complex secondary structures with multiple ordered patterns. The chaining approach removes spurious matches from the set of intermediate results, in particular of patterns with little specificity. In benchmark experiments on the Rfam database, our method runs up to two orders of magnitude faster than previous methods. Conclusions The presented method's sublinear expected running time makes it well suited for RNA sequence-structure pattern matching in large sequence databases. RNA molecules containing several stem-loop substructures can be described by multiple sequence-structure patterns and their matches are efficiently handled by a novel chaining method. Beyond our algorithmic contributions, we provide with Structator a complete and robust open-source software solution for index-based search of RNA sequence-structure patterns. The Structator software is available at http://www.zbh.uni-hamburg.de/Structator webcite.Deutsche Forschungsgemeinschaft (grant WI 3628/1-1

    Contrastive learning on protein embeddings enlightens midnight zone

    Get PDF
    Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce using single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by hierarchical classifications of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, has an improved ability to recognize distant homologous relationships than more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work is in the particular combination of tools and sampling techniques that ascertained good performance comparable or better to existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments it is also orders of magnitudes faster. The code is available at https://github.com/Rostlab/EAT

    Efficient homology search for genomic sequence databases

    Get PDF
    Genomic search tools can provide valuable insights into the chemical structure, evolutionary origin and biochemical function of genetic material. A homology search algorithm compares a protein or nucleotide query sequence to each entry in a large sequence database and reports alignments with highly similar sequences. The exponential growth of public data banks such as GenBank has necessitated the development of fast, heuristic approaches to homology search. The versatile and popular blast algorithm, developed by researchers at the US National Center for Biotechnology Information (NCBI), uses a four-stage heuristic approach to efficiently search large collections for analogous sequences while retaining a high degree of accuracy. Despite an abundance of alternative approaches to homology search, blast remains the only method to offer fast, sensitive search of large genomic collections on modern desktop hardware. As a result, the tool has found widespread use with millions of queries posed each day. A significant investment of computing resources is required to process this large volume of genomic searches and a cluster of over 200 workstations is employed by the NCBI to handle queries posed through the organisation's website. As the growth of sequence databases continues to outpace improvements in modern hardware, blast searches are becoming slower each year and novel, faster methods for sequence comparison are required. In this thesis we propose new techniques for fast yet accurate homology search that result in significantly faster blast searches. First, we describe improvements to the final, gapped alignment stages where the query and sequences from the collection are aligned to provide a fine-grain measure of similarity. We describe three new methods for aligning sequences that roughly halve the time required to perform this computationally expensive stage. Next, we investigate improvements to the first stage of search, where short regions of similarity between a pair of sequences are identified. We propose a novel deterministic finite automaton data structure that is significantly smaller than the codeword lookup table employed by ncbi-blast, resulting in improved cache performance and faster search times. We also discuss fast methods for nucleotide sequence comparison. We describe novel approaches for processing sequences that are compressed using the byte packed format already utilised by blast, where four nucleotide bases from a strand of DNA are stored in a single byte. Rather than decompress sequences to perform pairwise comparisons, our innovations permit sequences to be processed in their compressed form, four bases at a time. Our techniques roughly halve average query evaluation times for nucleotide searches with no effect on the sensitivity of blast. Finally, we present a new scheme for managing the high degree of redundancy that is prevalent in genomic collections. Near-duplicate entries in sequence data banks are highly detrimental to retrieval performance, however existing methods for managing redundancy are both slow, requiring almost ten hours to process the GenBank database, and crude, because they simply purge highly-similar sequences to reduce the level of internal redundancy. We describe a new approach for identifying near-duplicate entries that is roughly six times faster than the most successful existing approaches, and a novel approach to managing redundancy that reduces collection size and search times but still provides accurate and comprehensive search results. Our improvements to blast have been integrated into our own version of the tool. We find that our innovations more than halve average search times for nucleotide and protein searches, and have no signifcant effect on search accuracy. Given the enormous popularity of blast, this represents a very significant advance in computational methods to aid life science research

    Discriminative Learning for Probabilistic Sequence Analysis

    No full text

    Efficient algorithms in protein modelling

    Get PDF
    Proteins are key players in the complex world of living cells. No matter whether they are involved in enzymatic reactions, inter-cell communication or numerous other processes, knowledge of their structure is vital for a detailed understanding of their function. However, structure determination by experiment is often a laborious process that cannot keep up with the ever increasing pace of sequencing methodologies. As a consequence, the gap between proteins where we only know the sequence and the proteins where we additionally have detailed structural information is growing rapidly. Computational modelling methods that extrapolate structural information from homologous structures have established themselves as a valuable complement to experiment and help bridging this gap. This thesis addresses two key aspects in protein modelling. (1) It investigates and improves methodologies that assign reliability estimates to protein models, so called quality estimation (QE) methods. Even a human expert cannot immediately detect errors introduced in the modelling process, thus the importance of automated methods performing this task. (2) It assesses the available methods that perform the modelling itself, discusses solutions for current shortcomings and provides efficient implementations thereof. When detecting errors in protein models, many knowledge based methods are biased towards the physio-chemical properties observed in soluble protein structures. This limits their applicability for the important class of membrane protein models. In an effort to improve the situation, QMEANBrane has been developed. QMEANBrane is specifically designed to detect local errors in membrane protein models by membrane specific statistical potentials of mean force that nowadays approach statistical saturation given the increase of available experimental data. Considering the improvement of quality estimation for soluble proteins, instead of solely applying the widely used statistical potentials of mean force, QMEANDisCo incorporates the observed structural variety of experimentally determined protein structures homologous to the model being assessed. Valuable ensemble information can be gathered without the need of actually depending on a large ensemble of protein models, thus circumventing a main limitation of consensus QE methods. Apart from improving QE methods, in an effort of implementing and extending state-of-the-art modelling algorithms, the lack of a free and efficient modelling engine became obvious. No available modelling engine provided an open-source codebase as a basis for novel, innovative algorithms and, at the same time, had no restrictions for usage. This contradicts our efforts to make protein modelling available to all biochemists and molecular biologists worldwide. As a consequence we implemented a new free and open modelling engine from scratch - ProMod3. ProMod3 allows to combine extremely efficient, state-of-the-art modelling algorithms in a flexible manner to solve various modelling problems. To weaken the dogma of one template one model, basic algorithms have been explored to incorporate structural information from multiple templates into one protein model. The algorithms are built using ProMod3 and have extensively been tested in the context of the CAMEO continuous evaluation platform. The result is a highly competitive modelling pipeline that excels with extremely low runtimes and excellent performance

    Bioinformatics

    Get PDF
    This book is divided into different research areas relevant in Bioinformatics such as biological networks, next generation sequencing, high performance computing, molecular modeling, structural bioinformatics, molecular modeling and intelligent data analysis. Each book section introduces the basic concepts and then explains its application to problems of great relevance, so both novice and expert readers can benefit from the information and research works presented here
    corecore