
    MimoSA: a system for minimotif annotation

    Background: Minimotifs are short peptide sequences within one protein that are recognized by other proteins or molecules. While there are now several minimotif databases, they are incomplete. Many minimotifs reported in the primary literature have yet to be annotated, while entirely novel minimotifs continue to be published on a weekly basis. Our recently proposed function and sequence syntax for minimotifs enables us to build a general tool that will facilitate structured annotation and management of minimotif data from the biomedical literature.
    Results: We have built the MimoSA application for minimotif annotation. The application supports management of the Minimotif Miner (MnM) database, literature tracking, and annotation of new minimotifs. MimoSA supports visualization, organization, selection, and editing of minimotifs and their attributes in the MnM database. For the literature components, MimoSA provides paper status tracking and scoring of papers for annotation through a freely available machine learning approach, which is based on word correlation. The paper scoring algorithm is also available as a separate program, TextMine. Form-driven annotation of minimotif attributes enables entry of new minimotifs into the MnM database. Several supporting features increase the efficiency of annotation. The layered architecture of MimoSA allows for extensibility by separating the functions of paper scoring, minimotif visualization, and database management. MimoSA is readily adaptable to other annotation efforts that manually curate literature into a MySQL database.
    Conclusions: MimoSA is an extensible application that facilitates minimotif annotation and integrates with the Minimotif Miner database. We have built MimoSA as an application that integrates dynamic abstract scoring with a high performance relational model of minimotif syntax. MimoSA's TextMine, an efficient paper-scoring algorithm, can be used to dynamically rank papers with respect to context.

    Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs

    Background: Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories based on the data structures they employ: the first class uses an overlap/string graph and the second uses a de Bruijn graph. With the recent advances in short read sequencing technology, however, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient algorithms for building these massive de Bruijn graphs are essential in large sequencing projects based on short reads. Earlier work gave an O(n/p) time parallel algorithm for this problem, where n is the size of the input and p is the number of processors. That algorithm enumerates all possible bi-directed edges that can overlap with a node and ends up generating Θ(nΣ) messages (Σ being the size of the alphabet).
    Results: In this paper we present a Θ(n/p) time parallel algorithm with a communication complexity that is equal to that of parallel sorting and is not sensitive to Σ. The generality of our algorithm makes it easy to extend to the out-of-core model, where it has an optimal I/O complexity of Θ((n log(n/B)) / (B log(M/B))) (M being the main memory size and B being the size of the disk block). We demonstrate the scalability of our parallel algorithm on an SGI/Altix computer. A comparison with the previous approaches reveals that our algorithm is faster, both asymptotically and practically. We demonstrate the scalability of our sequential out-of-core algorithm by comparing it with the algorithm used by VELVET to build the bi-directed de Bruijn graph. Our experiments reveal that our algorithm can build the graph with a constant amount of memory, which clearly outperforms VELVET. We also provide efficient algorithms for the bi-directed chain compaction problem.
    Conclusions: The bi-directed de Bruijn graph is a fundamental data structure for any sequence assembly program based on the Eulerian approach. Our algorithms for constructing bi-directed de Bruijn graphs are efficient in parallel and out-of-core settings, and can be used to build large-scale bi-directed de Bruijn graphs. Furthermore, our algorithms do not employ any all-to-all communications in a parallel setting and perform better than the prior algorithms. Finally, our out-of-core algorithm is extremely memory efficient and can replace the existing graph construction algorithm in VELVET.
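
The idea of reducing graph construction to sorting can be illustrated with a small sequential sketch. This is not the authors' parallel or out-of-core algorithm; it only shows, under stated assumptions, how edge records generated from (k+1)-mers can be sorted so that duplicates become adjacent and are merged in one pass. The function names, the `+`/`-` orientation labels, and the choice of an odd `k` (which avoids k-mers that equal their own reverse complement) are assumptions made for illustration.

```python
from itertools import groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")
FLIP = {"+": "-", "-": "+"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Return (canonical k-mer, orientation): '+' if kmer is already canonical."""
    rc = reverse_complement(kmer)
    return (kmer, "+") if kmer <= rc else (rc, "-")


def edge_record(k1mer):
    """Turn one (k+1)-mer into a bi-directed edge between its prefix and suffix k-mers."""
    u, u_dir = canonical(k1mer[:-1])
    v, v_dir = canonical(k1mer[1:])
    record = (u, u_dir, v, v_dir)
    # Reading the same (k+1)-mer from the opposite strand yields the mirrored
    # record; keep the smaller of the two so both strands map to one edge.
    mirror = (v, FLIP[v_dir], u, FLIP[u_dir])
    return min(record, mirror)


def build_bidirected_edges(reads, k):
    """Generate one record per (k+1)-mer, sort, and merge duplicates with counts.

    Assumes k is odd, so no k-mer is its own reverse complement.
    """
    records = [
        edge_record(read[i:i + k + 1])
        for read in reads
        for i in range(len(read) - k)
    ]
    records.sort()  # sorting brings identical edges together
    return [(rec, sum(1 for _ in group)) for rec, group in groupby(records)]


if __name__ == "__main__":
    # A read and its reverse complement describe the same two edges.
    for edge, count in build_bidirected_edges(["ACGTT", "AACGT"], 3):
        print(edge, count)
```

In this sketch the sort is the only step that looks at all records together, which is why a parallel or external-memory sort could stand in for it without changing the rest of the logic.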

    Secondary Structure, a Missing Component of Sequence-Based Minimotif Definitions

    Minimotifs are short contiguous segments of proteins that have a known biological function. The hundreds of thousands of minimotifs discovered thus far are an important part of the theoretical understanding of the specificity of protein-protein interactions, posttranslational modifications, and signal transduction that occur in cells. However, a longstanding problem is that the different abstractions of the sequence definitions do not accurately capture the specificity, despite decades of effort by many labs. We present evidence that structure is an essential component of minimotif specificity, yet is not used in minimotif definitions. Our analysis of several known minimotifs as case studies, analysis of occurrences of minimotifs in structured and disordered regions of proteins, and review of the literature support a new model for minimotif definitions that includes sequence, structure, and function. © 2012 Sargeant et al.

    Efficient Combinatorial Algorithms for Problems in Sequence and Self Assembly

    In this thesis we present efficient algorithms for various combinatorial problems arising in sequence assembly and self-assembly. Sequence assembly is a major phase in uncovering the genomic sequence of an organism, and it has several underlying combinatorial problems on bi-directed de Bruijn graphs. Existing algorithms to build and operate on these graphs cannot scale with the ever-increasing volume of sequence assembly data. In this thesis we close this gap by providing efficient algorithms to build and operate on bi-directed de Bruijn graphs. We first show how a bi-directed de Bruijn graph can be constructed optimally in Θ(n) time, in contrast to the existing Θ(n log(G)) algorithm [ZB08]; here n is the input size and G is the size of the genome. This algorithm is also I/O optimal and requires Θ( logn/M logM/B ) I/Os to build the graph, where M is the main memory size and B is the block size. Second, we show that the Chinese Postman walk problem (CPP) can be solved on a bi-directed graph without reducing it to a bi-directed flow problem. The bi-directed flow based algorithm [MGMB07] for the CPP on a bi-directed graph G(V, E) takes O(|E|^2 log^2(V)) time. We improve this to Θ(p(|V| + |E|) log(|V|) + (d_max p)^3), where p = max{|{ν : d_in(ν) − d_out(ν) > 0}|, |{ν : d_in(ν) − d_out(ν) < 0}|} and d_max = max{|d_in(ν) − d_out(ν)|}. This algorithm performs asymptotically better than the bi-directed flow algorithm when the number of imbalanced nodes p is much smaller than the number of nodes in the bi-directed graph.
    On the other hand, self-assembly systems have numerous critical applications in medicine and circuit design. Theoretical modeling of self-assembly is very useful before performing self-assembly experiments. Algorithmic self-assembly studies the efficiency of self-assembly systems on an abstract two-dimensional (2D) tile assembly model (TAM). The theory behind TAM is based on Wang's tiling technique, and TAM has the power to simulate a Turing machine. Algorithms with an optimal tile complexity of Θ(log N / log log N) were proposed earlier to uniquely self-assemble an N × N square (with a temperature of α = 2) on TAM. However, efficient algorithms (tile set constructions) to assemble arbitrary shapes on TAM are not known, and the problem has remained open. In this thesis we try to bridge this gap by presenting algorithms that can self-assemble some regular polygons with a tile complexity of Θ(log(N)), where N is the area of the underlying polygon. In a deterministic self-assembly model such as TAM, it has been proven that the tile complexity lower bound to self-assemble any shape is Θ(log N / log log N) (inferred from the Kolmogorov complexity), where N is the area of the underlying shape. However, designing even Θ(log N / log log N) unique tiles specific to the shape to be self-assembled is still an intensive task. Creating a copy of a tile is much simpler than creating a unique tile. With this constraint in mind, probabilistic tile assembly models (PTAM) were introduced; these models are also referred to as concentration programming models or randomized self-assembly models. These systems have O(1) tile complexity, and the concentration of each tile can be varied to produce the desired shape. Existing algorithms on PTAM [KS08] [Dot09] suffer from a large underlying constant because they all adopt sub-tiles that perform binary arithmetic. In contrast, in this thesis we show that it is possible to self-assemble rectilinear shapes on PTAM without using any sub-tiles that perform binary arithmetic. We introduce a technique called staircase sampling, which can self-assemble squares, rectangles, and rectangles with constant aspect ratio with high probability (i.e., Ω(1 − 1/n^α), for any fixed α > 0), where n is the dimension of the shape to be self-assembled.
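
The two imbalance parameters in the Chinese Postman bound above, p and d_max, follow directly from the in-degree and out-degree of each node. The sketch below computes them for a small example. It makes a simplifying assumption that the thesis does not: edges are ordinary directed pairs `(u, v)` rather than bi-directed edges, so it illustrates the definitions only, not the thesis's algorithm. The function name `imbalance_parameters` is hypothetical.

```python
from collections import Counter


def imbalance_parameters(edges):
    """Return (p, d_max) for a directed multigraph given as (u, v) pairs.

    balance[v] = d_in(v) - d_out(v).
    p     = max(#nodes with positive balance, #nodes with negative balance)
    d_max = largest absolute balance over all nodes (0 for an empty graph)
    """
    balance = Counter()
    for u, v in edges:
        balance[u] -= 1  # edge leaves u
        balance[v] += 1  # edge enters v
    surplus = sum(1 for b in balance.values() if b > 0)
    deficit = sum(1 for b in balance.values() if b < 0)
    p = max(surplus, deficit)
    d_max = max((abs(b) for b in balance.values()), default=0)
    return p, d_max


if __name__ == "__main__":
    # Node a has balance -3, b has +2, c has +1.
    print(imbalance_parameters([("a", "b"), ("a", "b"), ("a", "c")]))
```

A graph in which every node is balanced gives p = 0 and d_max = 0, which is the case where a closed walk covering every edge exactly once already exists.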

    Algorithms for Local Structural Alignment And Structural Motif Identification

    A protein is characterized by both its amino-acid sequence and the three-dimensional (3-D) structure of its underlying atoms. Although biologists commonly use sequence similarity among proteins to identify regions conserved during evolution, it has been shown that the 3-D structures of proteins are conserved more fundamentally than their sequences. Even when two proteins do not exhibit much sequence homology, structural similarity between them might account for similar properties. This is the motivation for studying the structural alignment problem in a manner similar to the sequence alignment problem.