44 research outputs found

    A practical index for approximate dictionary matching with few mismatches

    Approximate dictionary matching is a classic string matching problem (checking if a query string occurs in a collection of strings) with applications in, e.g., spellchecking, online catalogs, geolocation, and web searchers. We present a surprisingly simple solution called a split index, which is based on the Dirichlet (pigeonhole) principle, for matching a keyword with few mismatches, and experimentally show that it offers competitive space-time tradeoffs. Our implementation in the C++ language is focused mostly on data compaction, which is beneficial for the search speed (e.g., by being cache friendly). We compare our solution with other algorithms and we show that it performs better for the Hamming distance. Query times on the order of 1 microsecond were reported for one mismatch for a dictionary size of a few megabytes on a medium-end PC. We also demonstrate that a basic compression technique consisting of q-gram substitution can significantly reduce the index size (up to 50% of the input text size for DNA), while still keeping the query time relatively low.
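
The pigeonhole argument behind such an index can be sketched briefly: if two words of equal length differ in at most k positions, then cutting both into k + 1 pieces at the same offsets leaves at least one piece identical, so that piece can serve as an exact lookup key. The Python sketch below is only an illustration of that idea under these assumptions; the class name `SplitIndex`, its methods, and the bucket layout are hypothetical and do not reproduce the compact C++ implementation or the q-gram compression described in the abstract.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


class SplitIndex:
    """Toy pigeonhole index for Hamming distance <= k (hypothetical layout)."""

    def __init__(self, words, k=1):
        self.k = k
        self.parts = k + 1
        # Key: (word length, piece number, piece text) -> words with that piece.
        self.buckets = defaultdict(list)
        for word in set(words):
            for i, piece in enumerate(self._split(word)):
                self.buckets[(len(word), i, piece)].append(word)

    def _split(self, word):
        """Cut a word into k + 1 contiguous pieces at length-dependent offsets."""
        n, p = len(word), self.parts
        bounds = [n * i // p for i in range(p + 1)]
        return [word[bounds[i]:bounds[i + 1]] for i in range(p)]

    def query(self, q):
        """Return all indexed words within Hamming distance k of q."""
        found = set()
        for i, piece in enumerate(self._split(q)):
            # Any true match shares at least one piece exactly (pigeonhole),
            # so only words in the matching buckets need to be verified.
            for candidate in self.buckets.get((len(q), i, piece), ()):
                if hamming(candidate, q) <= self.k:
                    found.add(candidate)
        return found
```

For example, with the words "hello", "hallo", "help" and "world" indexed for one mismatch, querying "hxllo" reaches both "hello" and "hallo" through their shared second piece "llo", and each candidate is then verified with a direct Hamming comparison.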

    Similarity-based virtual screening using 2D fingerprints

    This paper summarises recent work at the University of Sheffield on virtual screening methods that use 2D fingerprint measures of structural similarity. A detailed comparison of a large number of similarity coefficients demonstrates that the well-known Tanimoto coefficient remains the method of choice for the computation of fingerprint-based similarity, despite possessing some inherent biases related to the sizes of the molecules that are being sought. Group fusion involves combining the results of similarity searches based on multiple reference structures and a single similarity measure. We demonstrate the effectiveness of this approach to screening, and also describe an approximate form of group fusion, turbo similarity searching, that can be used when just a single reference structure is available.
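
As a rough illustration of the two ideas named in this abstract, the sketch below computes the Tanimoto coefficient for binary fingerprints (represented here as sets of on-bit positions) and fuses the scores obtained against several reference structures. The function names and the choice of the maximum as the fusion rule are assumptions made for the example; the abstract does not state which fusion rule the authors used.

```python
def tanimoto(a, b):
    """Tanimoto coefficient |A & B| / |A | B| for sets of on-bit positions."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def group_fusion(references, database, fuse=max):
    """Rank database molecules by a fused score over several references.

    references: list of fingerprints (sets of on-bit positions)
    database:   dict mapping a molecule name to its fingerprint
    fuse:       rule combining per-reference scores (max is an assumption)
    """
    scores = {
        name: fuse(tanimoto(fp, ref) for ref in references)
        for name, fp in database.items()
    }
    # Highest fused similarity first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Passing `fuse=sum` or a mean instead of `max` changes only how the per-reference scores are combined; the single similarity measure stays the same.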

    IMPROVING MOLECULAR FINGERPRINT SIMILARITY VIA ENHANCED FOLDING

    Drug discovery depends on scientists finding similarity in molecular fingerprints to the drug target. A new way to improve the accuracy of molecular fingerprint folding is presented. The goal is to alleviate a growing challenge due to excessively long fingerprints. This improved method generates a new shorter fingerprint that is more accurate than the basic folded fingerprint. Information gathered during preprocessing is used to determine an optimal attribute order. The most commonly used blocks of bits can then be organized and used to generate a new improved fingerprint for more optimal folding. We then apply the widely used Tanimoto similarity search algorithm to benchmark our results. We show an improvement in the final results using this method to generate an improved fingerprint when compared against other traditional folding methods.
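
For context, the basic folding that serves as the baseline here can be sketched as follows: a long fingerprint is shortened to m bits by OR-ing together all bits whose positions agree modulo m. This is a minimal sketch of plain folding only, with hypothetical function names; the enhanced method of the abstract, which reorders attributes before folding, is not reproduced.

```python
def fold(on_bits, m):
    """Fold a fingerprint to m bits: position i maps to i % m (bitwise OR)."""
    return {i % m for i in on_bits}


def tanimoto(a, b):
    """Tanimoto coefficient for two sets of on-bit positions."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    common = len(a & b)
    return common / (len(a) + len(b) - common)


# Two fingerprints that share one of their two on-bits.
a = {1, 40}
b = {9, 40}
unfolded = tanimoto(a, b)                 # similarity on the long fingerprints
folded = tanimoto(fold(a, 8), fold(b, 8))  # similarity after folding to 8 bits
```

In this small case positions 1 and 9 collide after folding to 8 bits, so the two folded fingerprints become identical and the similarity rises from 1/3 to 1. Collisions of this kind are the loss of accuracy that improved folding schemes try to reduce.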

    The use of MoStBioDat for rapid screening of molecular diversity

    MoStBioDat is a uniform data storage and extraction system with an extensive array of tools for structural similarity measures and pattern matching which is essential to facilitate the drug discovery process. Structure-based database screening has recently become a common and efficient technique in the early stages of drug development, shifting the emphasis from rational drug design into the probability domain of more or less random discovery. Virtual ligand screening (VLS), an approach based on high-throughput flexible docking, samples a virtually infinite molecular diversity of chemical libraries, increasing the concentration of molecules with high binding affinity. The rapid process of subsequent examination of a large number of molecules in order to optimize the molecular diversity is an attractive alternative to the traditional methods of lead discovery. This paper presents the application of the MoStBioDat package not only as a data management platform but mainly in substructure searching. In particular, examples of the applications of MoStBioDat are discussed and analyzed.

    RASCAL: calculation of graph similarity using maximum common edge subgraphs

    A new graph similarity calculation procedure is introduced for comparing labeled graphs. Given a minimum similarity threshold, the procedure consists of an initial screening process to determine whether it is possible for the measure of similarity between the two graphs to exceed the minimum threshold, followed by a rigorous maximum common edge subgraph (MCES) detection algorithm to compute the exact degree and composition of similarity. The proposed MCES algorithm is based on a maximum clique formulation of the problem and is a significant improvement over other published algorithms. It presents new approaches to both lower and upper bounding as well as vertex selection.

    Chemical Diversity of Scarab Beetle Pheromones and its Implication in Chemical Evolution

    Pheromones are species-specific chemical signals used by insects to communicate, to find a mate, and to identify their territory. In this paper, we analyzed the structural similarity of scarab beetle pheromones using the Tanimoto coefficient in an attempt to draw insights regarding their ecology, evolution and chemotaxonomy. The results showed a very diverse scarab beetle pheromone structure which provides further support to an earlier hypothesis regarding beetle pheromone evolution. In addition, it was found that the scarab beetle pheromone structure cannot be used as a species marker in chemotaxonomy owing to the observed high structural diversity.

    Functional Group and Substructure Searching as a Tool in Metabolomics

    BACKGROUND: A direct link between the names and structures of compounds and the functional groups contained within them is important, not only because biochemists frequently rely on literature that uses a free-text format to describe functional groups, but also because metabolic models depend upon the connections between enzymes and substrates being known and appropriately stored in databases. METHODOLOGY: We have developed a database named "Biochemical Substructure Search Catalogue" (BiSSCat), which contains 489 functional groups, >200,000 compounds and >1,000,000 different computationally constructed substructures, to allow identification of chemical compounds of biological interest. CONCLUSIONS: This database and its associated web-based search program (http://bisscat.org/) can be used to find compounds containing selected combinations of substructures and functional groups. It can be used to determine possible additional substrates for known enzymes and for putative enzymes found in genome projects. Its applications to enzyme inhibitor design are also discussed.

    Identifying new topoisomerase II poison scaffolds by combining publicly available toxicity data and 2D/3D-based virtual screening

    Molecular descriptor (2D) and three-dimensional (3D) shape-based similarity methods are widely used in ligand-based virtual drug design. In the present study, pairwise structure comparisons among a set of 4858 DTP compounds tested in the NCI60 tumor cell line anticancer drug screen were computed using chemical hashed fingerprints and 3D molecule shapes to calculate 2D and 3D similarities, respectively. Additionally, pairwise biological activity similarities were calculated by correlating the 60-element vectors of pGI50 values corresponding to the cytotoxicity of the compounds across the NCI60 panel. Subsequently, we compared the power of 2D and 3D structural similarity metrics to predict the toxicity pattern of compounds. We found that while the positive predictive value and sensitivity of 3D and molecular descriptor-based approaches to predict biological activity are similar, a subset of molecule pairs yielded contradictory results. By simultaneously requiring similarity of biological activities and 3D shapes, and dissimilarity of molecular descriptor-based comparisons, we identify pairs of scaffold hopping candidates displaying characteristic core structural changes such as heteroatom/heterocycle change and ring closure. Attempts to discover scaffold hopping candidates of mitoxantrone recovered known Topoisomerase II (Top2) inhibitors, and also predicted new, previously unknown chemotypes possessing in vitro Top2 inhibitory activity.
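
The selection logic described in this abstract can be sketched under some loudly stated assumptions: the activity similarity is taken here to be the Pearson correlation of two pGI50 vectors (the abstract says only that the vectors were correlated), and the three thresholds are hypothetical placeholders, not values from the study. The function names are likewise invented for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen here for a constant vector
    return cov / sqrt(var_x * var_y)


def is_scaffold_hop_candidate(activity_sim, shape_sim_3d, fingerprint_sim_2d,
                              min_activity=0.7, min_3d=0.7, max_2d=0.3):
    """Similar activity and 3D shape, but dissimilar 2D fingerprints.

    All three thresholds are hypothetical placeholders.
    """
    return (activity_sim >= min_activity
            and shape_sim_3d >= min_3d
            and fingerprint_sim_2d <= max_2d)
```

In use, `pearson` would be applied to the two 60-element pGI50 vectors of a compound pair, and the result passed to `is_scaffold_hop_candidate` together with that pair's 3D shape similarity and 2D fingerprint similarity.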