
    RasBhari: optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison

    Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don't-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de
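As a rough illustration of the pattern-based word matching described in this abstract, the following minimal sketch extracts spaced words from two sequences and counts the matches for a set of binary patterns. It is not rasbhari's code: the function names and the encoding of patterns as strings of `1` (match) and `0` (don't-care) are assumptions made for this example, and it does not perform the pattern optimization itself.

```python
from collections import Counter


def spaced_words(seq, pattern):
    """Yield the spaced word found at each position of seq.

    pattern is a string of '1' (match) and '0' (don't-care) characters;
    only characters under a '1' are kept in the word.
    """
    match_positions = [i for i, c in enumerate(pattern) if c == "1"]
    for start in range(len(seq) - len(pattern) + 1):
        yield "".join(seq[start + i] for i in match_positions)


def count_spaced_word_matches(seq_a, seq_b, patterns):
    """Count position pairs (one in each sequence) whose spaced words agree,
    summed over all patterns in the set."""
    total = 0
    for pattern in patterns:
        words_a = Counter(spaced_words(seq_a, pattern))
        words_b = Counter(spaced_words(seq_b, pattern))
        total += sum(n * words_b[w] for w, n in words_a.items())
    return total


# Toy example: the mismatch at the third base is tolerated wherever it
# falls under the don't-care position of the pattern "101".
matches = count_spaced_word_matches("ACGTAC", "ACCTAC", ["101"])
```

The number of such matches is the quantity whose variance the abstract relates to the overlap complexity of the pattern set.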

    Entropy-scaling search of massive biological data

    Many datasets exhibit a well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here, we introduce a framework for similarity search based on characterizing a dataset's entropy and fractal dimension. We prove that searching scales in time with metric entropy (number of covering hyperspheres), if the fractal dimension of the dataset is low, and scales in space with the sum of metric entropy and information-theoretic entropy (randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains: high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search (esFragBag, 10x speedup of FragBag). Our framework can be used to achieve "compressive omics," and the general theory can be readily applied to data science problems outside of biology.
    Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
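The idea of covering hyperspheres can be sketched as a two-stage range search: cover the data with balls of a fixed radius, compare the query against ball centers only, and scan the members of balls that could contain a hit. The abstract does not spell out this procedure, so the sketch is an assumption about how such a search might look. The greedy covering, the Euclidean metric and all function names are illustrative choices, not the implementation behind Ammolite, MICA or esFragBag.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.dist(p, q)


def build_cover(points, cluster_radius, dist=euclidean):
    """Greedily cover points with balls of radius cluster_radius.

    Returns a list of (center, members) pairs; every member lies within
    cluster_radius of its center.
    """
    clusters = []
    for p in points:
        for center, members in clusters:
            if dist(p, center) <= cluster_radius:
                members.append(p)
                break
        else:
            clusters.append((p, [p]))
    return clusters


def range_search(query, clusters, radius, cluster_radius, dist=euclidean):
    """Return all points within radius of query.

    Coarse step: by the triangle inequality, a ball can only contain a hit
    if its center lies within radius + cluster_radius of the query.
    Fine step: scan the members of the surviving balls.
    """
    hits = []
    for center, members in clusters:
        if dist(query, center) <= radius + cluster_radius:
            hits.extend(m for m in members if dist(query, m) <= radius)
    return hits


points = [(0, 0), (0.5, 0), (10, 10), (10.4, 10), (5, 5)]
clusters = build_cover(points, cluster_radius=1.0)
hits = range_search((0.2, 0), clusters, radius=0.4, cluster_radius=1.0)
```

In this sketch the coarse step costs one distance computation per ball, so the work depends on the number of balls rather than the number of points, provided few balls survive the filter.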

    Multiple structural alignment for distantly related all β structures using TOPS pattern discovery and simulated annealing

    Topsalign is a method that will structurally align diverse protein structures, for example, structural alignment of protein superfolds. All proteins within a superfold share the same fold but often have very low sequence identity and different biological and biochemical functions. There is often significant structural diversity around the common scaffold of secondary structure elements of the fold. Topsalign uses topological descriptions of proteins. A pattern discovery algorithm identifies equivalent secondary structure elements between a set of proteins and these are used to produce an initial multiple structure alignment. Simulated annealing is used to optimize the alignment. The output of Topsalign is a multiple structure-based sequence alignment and a 3D superposition of the structures. This method has been tested on three superfolds: the β jelly roll, TIM (α/β) barrel and the OB fold. Topsalign outperforms established methods on very diverse structures. Despite the pattern discovery working only on β strand secondary structure elements, Topsalign is shown to align TIM (α/β) barrel superfamilies, which contain both α helices and β strands.
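For readers unfamiliar with simulated annealing, the optimization step mentioned in this abstract can be illustrated with a generic annealing loop on a toy problem. This is not Topsalign's scoring function or move set: the anchor positions, the `spread` cost and the `shift_one` move are hypothetical stand-ins for a real alignment score and alignment edits.

```python
import math
import random


def simulated_annealing(initial, cost, neighbor, t_start=1.0, t_end=1e-3,
                        cooling=0.95, steps_per_temp=50, rng=None):
    """Minimize cost(state) and return the best (state, cost) seen."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    t = t_start
    while t > t_end:
        for _ in range(steps_per_temp):
            candidate = neighbor(current, rng)
            delta = cost(candidate) - current_cost
            # Always accept improvements; accept a worse state with
            # probability exp(-delta / t), which shrinks as t cools.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, current_cost + delta
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= cooling
    return best, best_cost


# Toy problem (hypothetical): positions of one equivalent element in three
# structures; the state is one integer shift per structure.
ANCHORS = (3, 7, 12)


def spread(offsets):
    """Cost: how far apart the shifted anchor positions still are."""
    shifted = [a + o for a, o in zip(ANCHORS, offsets)]
    return max(shifted) - min(shifted)


def shift_one(offsets, rng):
    """Move: shift one randomly chosen structure by one position."""
    i = rng.randrange(len(offsets))
    step = rng.choice((-1, 1))
    return offsets[:i] + (offsets[i] + step,) + offsets[i + 1:]


best, best_cost = simulated_annealing((0, 0, 0), spread, shift_one)
```

Occasionally accepting worse states at high temperature is what lets the search leave a poor starting alignment instead of stalling in the nearest local optimum.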

    iLIR: a web resource for prediction of Atg8-family interacting proteins

    Macroautophagy was initially considered to be a nonselective process for bulk breakdown of cytosolic material. However, recent evidence points toward a selective mode of autophagy mediated by the so-called selective autophagy receptors (SARs). SARs act by recognizing and sorting diverse cargo substrates (e.g., proteins, organelles, pathogens) to the autophagic machinery. Known SARs are characterized by a short linear sequence motif (LIR-, LRS-, or AIM-motif) responsible for the interaction between SARs and proteins of the Atg8 family. Interestingly, many LIR-containing proteins (LIRCPs) are also involved in autophagosome formation and maturation and a few of them in regulating signaling pathways. Despite recent research efforts to experimentally identify LIRCPs, only a few dozen of this class of (often unrelated) proteins have been characterized so far using tedious cell biological, biochemical, and crystallographic approaches. The availability of an ever-increasing number of complete eukaryotic genomes provides a grand challenge for characterizing novel LIRCPs throughout the eukaryotes. Along these lines, we developed iLIR, a freely available web resource, which provides in silico tools for assisting the identification of novel LIRCPs. Given an amino acid sequence as input, iLIR searches for instances of short sequences compliant with a refined sensitive regular expression pattern of the extended LIR motif (xLIR-motif) and retrieves characterized protein domains from the SMART database for the query. Additionally, iLIR scores xLIRs against a custom position-specific scoring matrix (PSSM) and identifies potentially disordered subsequences with protein interaction potential overlapping with detected xLIR-motifs. Here we demonstrate that proteins satisfying these criteria make good LIRCP candidates for further experimental verification. Domain architecture is displayed in an informative graphic, and detailed results are also available in tabular form. We anticipate that iLIR will assist with elucidating the full complement of LIRCPs in eukaryotes.
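The two sequence-level steps described in this abstract, a regular-expression scan followed by PSSM scoring, can be sketched as follows. The pattern used here is the simplified core motif [W/F/Y]-x-x-[L/I/V], a stand-in for illustration only; it is not iLIR's refined xLIR expression, and the PSSM values are invented toy numbers, not the matrix used by iLIR.

```python
import re

# Simplified core motif: aromatic residue, two arbitrary residues, then
# L, I or V. The lookahead makes overlapping occurrences visible.
# Assumption: this is NOT the refined xLIR pattern used by iLIR.
CORE_LIR = re.compile(r"(?=([WFY]..[LIV]))")


def find_motifs(sequence, pattern=CORE_LIR):
    """Return (start, matched_text) for every hit; start is 0-based."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]


def pssm_score(segment, pssm, default=0.0):
    """Sum per-position scores for segment.

    pssm is a list with one dict per motif position, mapping a residue
    letter to its score; residues missing from a dict score `default`.
    """
    return sum(column.get(res, default) for column, res in zip(pssm, segment))


# Toy scoring matrix with made-up values for the four motif positions.
TOY_PSSM = [{"W": 2.0, "F": 1.0, "Y": 1.0}, {}, {}, {"L": 1.5, "I": 1.0, "V": 1.0}]

hits = find_motifs("DDWTHLSS")
scored = [(start, text, pssm_score(text, TOY_PSSM)) for start, text in hits]
```

A capturing group inside a lookahead is used because a plain pattern would consume each match and miss motifs that overlap it.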