
    Tableau-based protein substructure search using quadratic programming

    Background: Searching for proteins that contain similar substructures is an important task in structural biology. The exact solution of most formulations of this problem, including a recently published method based on tableaux, is too slow for practical use in scanning a large database. Results: We developed an improved method for detecting substructural similarities in proteins using tableaux. Tableaux are compared efficiently by solving the quadratic program (QP) corresponding to the quadratic integer program (QIP) formulation of the extraction of maximally-similar tableaux. We compare the accuracy of the method in classifying protein folds with some existing techniques. Conclusion: We find that including constraints based on the separation of secondary structure elements increases the accuracy of protein structure search using maximally-similar subtableau extraction, to a level where it has comparable or superior accuracy to existing techniques. We demonstrate that our implementation is able to search a structural database in a matter of hours on a standard PC.

    Computational Methods for the Integration of Biological Activity and Chemical Space

    One general aim of medicinal chemistry is to understand the structure-activity relationships of ligands that bind to biological targets. Advances in combinatorial chemistry and biological screening technologies allow ligand-target relationships to be analyzed on a large scale. However, extracting useful information from biological activity data requires computational methods that link the activity of ligands to their chemical structure. This thesis investigates how fragment-type descriptors of molecular structure can be used to link activity and chemical ligand space. First, an activity class-dependent hierarchical fragmentation scheme is introduced that generates fragmentation pathways, which are aligned using established methodologies for multiple alignment of biological sequences. These alignments are then used to extract consensus fragment sequences that serve as a structural signature for individual biological activity classes. The thesis also investigates how defined, chemically intuitive molecular fragments can be organized based on their topological environment and co-occurrence in compounds active against closely related targets. For this purpose, the Topological Fragment Index is introduced, which quantifies the topological environment complexity of a fragment in a given molecule and thus goes beyond fragment frequency analysis. Fragment dependencies are established on the basis of common topological environments, which facilitates the identification of activity class-characteristic fragment dependency pathways that describe fragment relationships beyond structural resemblance. Because fragments often depend on each other in an activity class-specific manner, the importance of defined fragment combinations for similarity searching is further assessed. For this purpose, Feature Co-occurrence Networks are introduced that allow the identification of feature cliques characteristic of individual activity classes. Three differently designed molecular fingerprints are compared for their ability to provide such cliques, and a clique-based similarity searching strategy is established. For molecule- and activity class-centric fingerprint designs, feature combinations are shown to improve similarity search performance in comparison to standard methods. Moreover, it is demonstrated that individual features can form activity class-specific combinations. Extending the analysis of feature cliques characteristic of individual activity classes, the distribution of defined fragment combinations among several compound classes acting against closely related targets is assessed. Fragment Formal Concept Analysis is introduced for flexible mining of complex structure-activity relationships. It allows the interactive assembly of fragment queries that yield fragment combinations characteristic of defined activity and potency profiles. It is shown that pairs and triplets, rather than individual fragments, distinguish between different activity profiles. A classifier built on these fragment signatures distinguishes between ligands of closely related targets. Going beyond activity profiles, compound selectivity is also analyzed. For this purpose, Molecular Formal Concept Analysis is introduced for the systematic mining of compound selectivity profiles on a whole-molecule basis. Using this approach, structurally diverse compounds are identified that share a selectivity profile with selected template compounds. Structure-selectivity relationships of the obtained compound sets are further analyzed.

    Entropy-scaling search of massive biological data

    Many datasets exhibit a well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here, we introduce a framework for similarity search based on characterizing a dataset's entropy and fractal dimension. We prove that searching scales in time with metric entropy (number of covering hyperspheres), if the fractal dimension of the dataset is low, and scales in space with the sum of metric entropy and information-theoretic entropy (randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains: high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search (esFragBag, 10x speedup of FragBag). Our framework can be used to achieve "compressive omics," and the general theory can be readily applied to data science problems outside of biology. Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
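
The covering-hypersphere idea in the abstract above can be illustrated with a small two-stage range search. This is only a minimal sketch under stated assumptions, not the authors' implementation: the greedy clustering, the Euclidean metric, and the names `build_clusters` and `range_search` are all hypothetical choices made for illustration. A coarse pass compares the query against cluster centers only, and a fine pass scans the members of clusters that the triangle inequality cannot rule out.

```python
from math import dist  # Euclidean distance, Python 3.8+


def build_clusters(points, cluster_radius):
    """Greedily cover the points with balls of the given radius.

    Each point joins the first existing center within cluster_radius,
    or becomes a new center. Returns a list of (center, members) pairs.
    """
    clusters = []
    for p in points:
        for center, members in clusters:
            if dist(p, center) <= cluster_radius:
                members.append(p)
                break
        else:
            clusters.append((p, [p]))
    return clusters


def range_search(clusters, query, radius, cluster_radius):
    """Return every stored point within `radius` of `query`.

    Coarse step: a cluster can hold a hit only if its center lies within
    radius + cluster_radius of the query (triangle inequality).
    Fine step: scan the members of the surviving clusters.
    """
    hits = []
    for center, members in clusters:
        if dist(query, center) <= radius + cluster_radius:
            hits.extend(m for m in members if dist(query, m) <= radius)
    return hits


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
clusters = build_clusters(points, cluster_radius=1.0)
print(range_search(clusters, (0.0, 0.1), radius=0.5, cluster_radius=1.0))
```

With a true metric the coarse step never discards a cluster that contains a hit, so this sketch returns the same points as a brute-force scan while skipping distant clusters entirely. The number of centers compared in the coarse step plays the role of the covering-hypersphere count mentioned in the abstract.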

    Query-Dependent Banding (QDB) for Faster RNA Similarity Searches

    When searching sequence databases for RNAs, it is desirable to score both primary sequence and RNA secondary structure similarity. Covariance models (CMs) are probabilistic models well-suited for RNA similarity search applications. However, the computational complexity of CM dynamic programming alignment algorithms has limited their practical application. Here we describe an acceleration method called query-dependent banding (QDB), which uses the probabilistic query CM to precalculate regions of the dynamic programming lattice that have negligible probability, independently of the target database. We have implemented QDB in the freely available Infernal software package. QDB reduces the average case time complexity of CM alignment from LN^2.4 to LN^1.3 for a query RNA of N residues and a target database of L residues, resulting in a 4-fold speedup for typical RNA queries. Combined with other improvements to Infernal, including informative mixture Dirichlet priors on model parameters, benchmarks also show increased sensitivity and specificity resulting from improved parameterization.
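
The general idea of banding a dynamic programming lattice can be shown on a much simpler problem. The sketch below is not QDB and does not use Infernal or covariance models: it swaps in plain edit distance with a fixed diagonal band, whereas QDB derives its bands from the query model's probabilities. The function name `banded_edit_distance` and the `band` parameter are hypothetical, chosen only for illustration.

```python
def banded_edit_distance(a, b, band):
    """Edit distance computed only over cells with |i - j| <= band.

    Cells outside the band are treated as unreachable, so the result is an
    upper bound on the true edit distance and is exact whenever an optimal
    alignment stays inside the band. Returns None if the final cell itself
    lies outside the band.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    inf = float("inf")
    # Row 0: turning the empty prefix of a into b[:j] costs j insertions.
    prev = {j: j for j in range(min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i  # delete all of a[:i]
                continue
            substitute = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])
            delete = prev.get(j, inf) + 1
            insert = cur.get(j - 1, inf) + 1
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev[m]


print(banded_edit_distance("kitten", "sitting", band=1))
```

The band limits the work per row to at most 2 * band + 1 cells instead of the full row, which is the same kind of saving that banding aims for in CM alignment, there with bands precalculated from the query rather than fixed in advance.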

    Homomorphic Pattern Mining from a Single Large Data Tree


    Connectable Components for Protein Design

    Protein design requires reusable, trustworthy, and connectable parts in order to scale to complex challenges. The recent explosion of protein structures stored within the Protein Data Bank provides a wealth of small motifs we can harvest, but we still lack tools to combine them into larger proteins. Here I explore two approaches for connecting reusable protein components on two different length scales. On the atomic scale, I build an interactive search engine for connecting chemical fragments together. Protein fragments built using this search engine recapitulate native-like protein assemblies that can be integrated into existing protein scaffolds using backbone search engines such as MaDCaT. On the protein domain scale, I quantitatively dissect structural variations in two-component systems in order to extract general principles for engineering interfacial flexibility between modular four-helix bundles. These bundles exhibit large scissoring motions where helices move towards or away from the bundle axis and these motions propagate across domain boundaries. Together, these two approaches form the beginnings of a multiscale methodology for connecting reusable protein fragments where there is a constant interplay and feedback between design of atomic structure, secondary structure, and tertiary structure. Rapid iteration, visualization, and search glue these diverse length scales together into a cohesive whole.

    The influence of relatives on the efficiency and error rate of familial searching

    We investigate the consequences of adopting the criteria used by the state of California, as described by Myers et al. (2011), for conducting familial searches. We carried out a simulation study of randomly generated profiles of related and unrelated individuals with 13-locus CODIS genotypes and YFiler Y-chromosome haplotypes, to which the Myers protocol for relative identification was applied. For first-degree relatives who share a Y chromosome, the Myers protocol has a high probability (80-99%) of identifying their relationship. For unrelated individuals, there is a low probability that an unrelated person in the database will be identified as a first-degree relative. For more distant relatives who share a Y haplotype (half-siblings, first cousins, half-first cousins or second cousins), there is a substantial probability that the more distant relative will be incorrectly identified as a first-degree relative. For example, there is a 3-18% probability that a first cousin will be identified as a full sibling, with the probability depending on the population background. Although the California familial search policy is likely to identify a first-degree relative if his profile is in the database, and it poses little risk of falsely identifying an unrelated individual in a database as a first-degree relative, there is a substantial risk of falsely identifying a more distant relative in the database who shares a Y haplotype as a first-degree relative, with the consequence that their immediate family may become the target for further investigation. This risk falls disproportionately on those ethnic groups that are currently overrepresented in state and federal databases. Comment: main text: 19 pages, 4 tables, 2 figures supplemental text: 2 pages, 5 tables all together as single fil