
    Efficient identification of Tanimoto nearest neighbors; All Pairs Similarity Search Using the Extended Jaccard Coefficient

    Tanimoto, or extended Jaccard, is an important similarity measure which has seen prominent use in fields such as data mining and chemoinformatics. Many of the existing state-of-the-art methods for market basket analysis, plagiarism and anomaly detection, compound database search, and ligand-based virtual screening rely heavily on identifying Tanimoto nearest neighbors. Given the rapidly increasing size of data that must be analyzed, new algorithms are needed that can speed up nearest neighbor search, while at the same time providing reliable results. While many search algorithms address the complexity of the task by retrieving only some of the nearest neighbors, we propose a method that finds all of the exact nearest neighbors efficiently by leveraging recent advances in similarity search filtering. We provide tighter filtering bounds for the Tanimoto coefficient and show that our method, TAPNN, greatly outperforms existing baselines across a variety of real-world datasets and similarity thresholds.
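    For orientation, the extended Jaccard (Tanimoto) coefficient of two real-valued vectors is T(x, y) = <x, y> / (||x||^2 + ||y||^2 - <x, y>). The sketch below is a minimal, illustrative all-pairs threshold search in Python that uses a simple norm-ratio filter; it is not the TAPNN algorithm or its tighter bounds, and the function names are placeholders.

```python
import numpy as np

def tanimoto(x, y):
    """Extended Jaccard (Tanimoto) similarity of two real-valued vectors."""
    dot = float(np.dot(x, y))
    return dot / (float(np.dot(x, x)) + float(np.dot(y, y)) - dot)

def tanimoto_all_pairs(vectors, eps):
    """Return all pairs (i, j, sim) with Tanimoto similarity >= eps.

    Uses a simple length-based filter: T(x, y) <= 1 / (r + 1/r - 1) with
    r = ||x|| / ||y||, so pairs whose norm ratio pushes that bound below
    eps are skipped without computing a dot product. Assumes non-zero
    vectors and eps > 0.
    """
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1)
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            r = norms[i] / norms[j]
            if 1.0 / (r + 1.0 / r - 1.0) < eps:  # length filter: pair cannot reach eps
                continue
            sim = tanimoto(vectors[i], vectors[j])
            if sim >= eps:
                pairs.append((i, j, sim))
    return pairs
```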

    An efficient graph generative model for navigating ultra-large combinatorial synthesis libraries

    Virtual, make-on-demand chemical libraries have transformed early-stage drug discovery by unlocking vast, synthetically accessible regions of chemical space. Recent years have witnessed rapid growth in these libraries from millions to trillions of compounds, hiding undiscovered, potent hits for a variety of therapeutic targets. However, they are quickly approaching a size beyond that which permits explicit enumeration, presenting new challenges for virtual screening. To overcome these challenges, we propose the Combinatorial Synthesis Library Variational Auto-Encoder (CSLVAE). The proposed generative model represents such libraries as a differentiable, hierarchically-organized database. Given a compound from the library, the molecular encoder constructs a query for retrieval, which is utilized by the molecular decoder to reconstruct the compound by first decoding its chemical reaction and subsequently decoding its reactants. Our design minimizes autoregression in the decoder, facilitating the generation of large, valid molecular graphs. Our method performs fast and parallel batch inference for ultra-large synthesis libraries, enabling a number of important applications in early-stage drug discovery. Compounds proposed by our method are guaranteed to be in the library, and thus synthetically and cost-effectively accessible. Importantly, CSLVAE can encode out-of-library compounds and search for in-library analogues. In experiments, we demonstrate the capabilities of the proposed method in the navigation of massive combinatorial synthesis libraries. Comment: 36th Conference on Neural Information Processing Systems (NeurIPS 2022).
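    As background for the database-like representation described above, the toy sketch below shows how a combinatorial synthesis library can be stored implicitly as reactions plus reactant slots and decoded hierarchically (reaction first, then one reactant per slot). All names are made up; this illustrates the data structure only and is not the CSLVAE encoder/decoder.

```python
from dataclasses import dataclass
from math import prod

@dataclass
class Reaction:
    name: str
    reactant_slots: list  # one list of candidate building blocks per slot

# A toy library: products are defined combinatorially and never enumerated.
library = [
    Reaction("amide_coupling", [["acid_1", "acid_2"], ["amine_1", "amine_2", "amine_3"]]),
    Reaction("suzuki_coupling", [["halide_1"], ["boronate_1", "boronate_2"]]),
]

def decode(reaction_idx, reactant_idxs):
    """Hierarchical decode: pick a reaction first, then one reactant per slot.
    Any decoded tuple is, by construction, a member of the library."""
    rxn = library[reaction_idx]
    return rxn.name, tuple(slot[i] for slot, i in zip(rxn.reactant_slots, reactant_idxs))

print(decode(0, (1, 2)))  # ('amide_coupling', ('acid_2', 'amine_3'))
print(sum(prod(len(s) for s in r.reactant_slots) for r in library))  # 8 products, none stored
```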

    Software for supporting large scale data processing for High Throughput Screening

    High Throughput Screening is a valuable data generation technique for data-driven knowledge discovery. Because the rate of data generation is so great, it is a challenge to cope with the demands of post-experiment data analysis. This thesis presents three software solutions that I implemented in an attempt to alleviate this problem. The first is K-Screen, a Laboratory Information Management System designed to handle and visualize large High Throughput Screening datasets. K-Screen is being successfully used by the University of Kansas High Throughput Screening Laboratory to better organize and visualize their data. The next two algorithms are designed to accelerate the search times for chemical similarity searches using 1-dimensional fingerprints. The first algorithm balances information content in bit strings to attempt to find more optimal ordering and segmentation patterns for chemical fingerprints. The second algorithm eliminates redundant pruning calculations for large batch chemical similarity searches and shows a 250% improvement over the fastest current fingerprint search algorithm for large batch queries.
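    For context on the kind of pruning such fingerprint searches rely on, the sketch below shows a plain single-query Tanimoto threshold search over binary fingerprints using the standard bit-count bound T(A, B) <= min(|A|, |B|) / max(|A|, |B|). It is a generic illustration, not the batch algorithms developed in the thesis; fingerprints are assumed to be stored as Python integers.

```python
def tanimoto_bits(fp_a, fp_b):
    """Tanimoto similarity of two binary fingerprints stored as Python ints."""
    common = bin(fp_a & fp_b).count("1")
    return common / (bin(fp_a).count("1") + bin(fp_b).count("1") - common)

def threshold_search(query_fp, database, threshold):
    """Single-query threshold search with the standard popcount prune:
    the shared bits can never exceed the smaller popcount, so
    T(A, B) <= min(|A|, |B|) / max(|A|, |B|), and fingerprints with
    mismatched bit counts are rejected without computing the intersection."""
    q_bits = bin(query_fp).count("1")
    hits = []
    for idx, fp in enumerate(database):
        d_bits = bin(fp).count("1")
        if min(q_bits, d_bits) < threshold * max(q_bits, d_bits):
            continue  # popcount prune
        sim = tanimoto_bits(query_fp, fp)
        if sim >= threshold:
            hits.append((idx, sim))
    return hits
```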

    Algorithms for Constructing Exact Nearest Neighbor Graphs

    University of Minnesota Ph.D. dissertation. June 2016. Major: Computer Science. Advisor: George Karypis. 1 computer file (PDF); xi, 151 pages. Nearest neighbor graphs (NNGs) contain the set of closest neighbors, and their similarities, for each of the objects in a set of objects. They are widely used in many real-world applications, such as clustering, online advertising, recommender systems, data cleaning, and query refinement. A brute-force method for constructing the graph requires O(n^2) similarity comparisons for a set of n objects. One way to reduce the number of comparisons is to ignore object pairs with low similarity, which are unimportant in many domains. Current methods for construction of the graph tackle the problem by either pruning the similarity search space, avoiding comparisons of objects that can be determined to not meet the similarity bounding conditions, or they solve the problem approximately, which can miss some of the neighbors. This thesis addresses the problem of efficiently constructing the exact nearest neighbor graph for a large set of objects, i.e., the graph that would be found by comparing each object against all other objects in the set. In this context, we address two specific problems. The epsilon-nearest neighbor graph (epsilon-NNG) construction problem, also known as all-pairs similarity search (APSS), seeks to find, for each object, all other objects with a similarity of at least some threshold epsilon. On the other hand, the k-nearest neighbor graph (k-NNG) construction problem seeks to find the k closest other objects to each object in the set. For both problems, we propose filtering techniques that are more effective than previous ones, and efficient serial and parallel algorithms to construct the graph. Our methods are ideally suited for sparse, high-dimensional data.
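    For reference, the sketch below builds the exact k-NNG by brute force under cosine similarity: the O(n^2) baseline that the filtering techniques above aim to reproduce exactly while skipping most comparisons. It is a generic dense NumPy illustration rather than the sparse serial/parallel algorithms from the thesis.

```python
import numpy as np

def knn_graph(X, k):
    """Exact k-nearest-neighbor graph by brute force under cosine similarity.
    Returns, for each object, a list of (neighbor_id, similarity) pairs."""
    X = np.asarray(X, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows (assumes non-zero rows)
    sims = X @ X.T                                    # all O(n^2) pairwise similarities
    np.fill_diagonal(sims, -np.inf)                   # exclude self-similarity
    top = np.argsort(-sims, axis=1)[:, :k]            # k most similar objects per row
    return [[(int(j), float(sims[i, j])) for j in top[i]] for i in range(len(X))]
```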

    Advances in the Development of Shape Similarity Methods and Their Application in Drug Discovery

    Molecular similarity is a key concept in drug discovery. It is based on the assumption that structurally similar molecules frequently have similar properties. Assessment of similarity between small molecules has been highly effective in the discovery and development of various drugs. Especially, two-dimensional (2D) similarity approaches have been quite popular due to their simplicity, accuracy and efficiency. Recently, the focus has been shifted toward the development of methods involving the representation and comparison of three-dimensional (3D) conformations of small molecules. Among the 3D similarity methods, evaluation of shape similarity is now gaining attention for its application not only in virtual screening but also in molecular target prediction, drug repurposing and scaffold hopping. A wide range of methods have been developed to describe molecular shape and to determine the shape similarity between small molecules. The most widely used methods include atom distance-based methods, surface-based approaches such as spherical harmonics and 3D Zernike descriptors, and atom-centered Gaussian-overlay-based representations. Several of these methods demonstrated excellent virtual screening performance not only retrospectively but also prospectively. In addition to methods assessing the similarity between small molecules, shape similarity approaches have been developed to compare shapes of protein structures and binding pockets. Additionally, shape comparisons between atomic models and 3D density maps have allowed the fitting of atomic models into cryo-electron microscopy maps. This review aims to summarize the methodological advances in shape similarity assessment, highlighting advantages, disadvantages and their application in drug discovery.
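    To make the shape-comparison idea concrete, the sketch below computes a shape Tanimoto score, V_AB / (V_AA + V_BB - V_AB), from a first-order Gaussian-overlap volume between two pre-aligned sets of atom coordinates. The prefactor and width constants are placeholders, and real implementations derive the width from each atom's van der Waals radius, include higher-order intersection terms, and optimize over alignments; this is an illustration of the formula, not any particular published method.

```python
import numpy as np

def pairwise_gaussian_overlap(coords_a, coords_b, p=2.7, alpha=0.84):
    """First-order overlap volume between two sets of atom-centered Gaussians.
    p and alpha are illustrative constants; higher-order terms are ignored."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)   # squared atom-atom distances
    prefactor = p * p * (np.pi / (2.0 * alpha)) ** 1.5          # Gaussian product integral
    return float((prefactor * np.exp(-0.5 * alpha * d2)).sum())

def shape_tanimoto(coords_a, coords_b):
    """Shape Tanimoto from overlap volumes of pre-aligned conformers."""
    v_ab = pairwise_gaussian_overlap(coords_a, coords_b)
    v_aa = pairwise_gaussian_overlap(coords_a, coords_a)
    v_bb = pairwise_gaussian_overlap(coords_b, coords_b)
    return v_ab / (v_aa + v_bb - v_ab)
```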

    Mining complex data in highly streaming environments

    Data is growing at a rapid rate because of advanced hardware and software technologies and platforms such as e-health systems, sensor networks, and social media. One of the challenging problems is storing, processing and transferring this big data in an efficient and effective way. One solution to tackle these challenges is to construct synopses by means of data summarization techniques. Motivated by the fact that without summarization, processing, analyzing and communicating this vast amount of data is inefficient, this thesis introduces new summarization frameworks with the main goals of reducing communication costs and accelerating data mining processes in different application scenarios. Specifically, we study the following big data summarization techniques: (i) dimensionality reduction; (ii) clustering; and (iii) histograms, considering their importance and wide use in various areas and domains. In our work, we propose three different frameworks using these summarization techniques to cover three different aspects of big data, "Volume", "Velocity" and "Variety", in centralized and decentralized platforms. We use dimensionality reduction techniques for summarizing large 2D arrays, and clustering and histograms for processing multiple data streams. With respect to the importance and rapid growth of emerging e-health applications such as tele-radiology and tele-medicine that require fast, low-cost, and often lossless access to massive amounts of medical images and data over band-limited channels, our first framework attempts to summarize streams of large-volume medical images (e.g. X-rays) for the purpose of compression. Significant amounts of correlation and redundancy exist across different medical images. These can be extracted and used as a data summary to achieve better compression, and consequently less storage and fewer communication overheads on the network. We propose a novel memory-assisted compression framework as a learning-based universal coding, which can be used to complement any existing algorithm to further eliminate redundancies/similarities across images. This approach is motivated by the fact that, often in medical applications, massive amounts of correlated images from the same family are available as training data for learning the dependencies and deriving appropriate reference or synopsis models. The models can then be used for compression of any new image from the same family. In particular, dimensionality reduction techniques such as Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF) are applied on a set of images from training data to form the required reference models. The proposed memory-assisted compression allows each image to be processed independently of other images, and hence allows individual image access and transmission. In the second part of our work, we investigate the problem of summarizing distributed multidimensional data streams using clustering. We devise a distributed clustering framework, DistClusTree, that extends the centralized ClusTree approach. The main difficulty in distributed clustering is balancing communication costs and clustering quality. We tackle this in DistClusTree through combining spatial index summaries and online tracking for efficient local and global incremental clustering. We demonstrate through extensive experiments the efficacy of the framework in terms of communication costs and approximate clustering quality.
In the last part, we use a multidimensional index structure to merge distributed summaries in the form of a centralized histogram, another widely used summarization technique, with application to approximate range query answering. We propose the index-based Distributed Mergeable Summaries (iDMS) framework based on kd-trees, which addresses these challenges with data generative models: Gaussian mixture models (GMMs) and a Generative Adversarial Network (GAN). iDMS maintains a global approximate kd-tree at a central site via GMMs or GANs upon new arrivals of streaming data at local sites. Experimental results validate the effectiveness and efficiency of iDMS against baseline distributed settings in terms of approximation error and communication costs.
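    As an illustration of the memory-assisted idea described in this abstract (a shared reference model learned from a family of correlated images, with each new image coded independently against it), the sketch below uses PCA only; the names are made up and entropy coding of the residual is omitted, so this is not the thesis's codec.

```python
import numpy as np

def learn_reference_model(training_images, n_components):
    """Learn a shared PCA reference model ("memory") from a family of
    correlated images, each flattened to a vector."""
    X = np.stack([np.asarray(img, dtype=float).ravel() for img in training_images])
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]          # mean image + top principal directions

def encode(image, mean, components):
    """Code a new image independently of all others: a few coefficients plus a residual.
    Keeping the residual makes the scheme lossless; it compresses well when the
    reference model captures most of the cross-image redundancy."""
    x = np.asarray(image, dtype=float).ravel()
    coeffs = components @ (x - mean)
    residual = x - mean - components.T @ coeffs
    return coeffs, residual

def decode(coeffs, residual, mean, components):
    """Exact reconstruction from the shared model and the per-image payload."""
    return mean + components.T @ coeffs + residual
```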

    Benchmarking and Developing Novel Methods for G Protein-coupled Receptor Ligand Discovery

    G protein-coupled receptors (GPCR) are integral membrane proteins mediating responses from extracellular effectors that regulate a diverse set of physiological functions. Consequently, GPCR are the targets of ~34% of current FDA-approved drugs [3]. Although it is clear that GPCR are therapeutically significant, discovery of novel drugs for these receptors is often impeded by a lack of known ligands and/or experimentally determined structures for potential drug targets. However, computational techniques have provided paths to overcome these obstacles. As such, this work discusses the development and application of novel computational methods and workflows for GPCR ligand discovery. Chapter 1 provides an overview of current obstacles faced in GPCR ligand discovery and defines ligand- and structure-based computational methods of overcoming these obstacles. Furthermore, chapter 1 outlines methods of hit list generation and refinement and provides a GPCR ligand discovery workflow incorporating computational techniques. In chapter 2, a workflow for modeling GPCR structure incorporating template selection via local sequence similarity and refinement of the structurally variable extracellular loop 2 (ECL2) region is benchmarked. Overall, findings in chapter 2 support the use of local template homology modeling in combination with de novo ECL2 modeling in the presence of a ligand from the template crystal structure to generate GPCR models intended to study ligand binding interactions. Chapter 3 details a method of generating structure-based pharmacophore models via the random selection of functional group fragments placed with Multiple Copy Simultaneous Search (MCSS) that is benchmarked in the context of 8 GPCR targets. When pharmacophore model performance was assessed with enrichment factor (EF) and goodness-of-hit (GH) scoring metrics, pharmacophore models possessing the theoretical maximum EF value were produced in both resolved structures (8 of 8 cases) and homology models (7 of 8 cases). Lastly, chapter 4 details a method of structure-based pharmacophore model generation using MCSS that is applicable to targets with no known ligands. Additionally, a method of pharmacophore model selection via machine learning is discussed. Overall, the work in chapter 4 led to the development of pharmacophore models exhibiting high EF values that could be accurately selected with machine learning classifiers.
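    For clarity on the enrichment factor (EF) metric mentioned above, the sketch below computes EF at a chosen screened fraction: the active rate in the top-scoring fraction divided by the active rate in the whole library. It is a generic illustration that assumes higher scores mean better-ranked compounds; the goodness-of-hit (GH) score is not shown.

```python
def enrichment_factor(scores, is_active, fraction=0.01):
    """Enrichment factor at a screened fraction:
    EF = (actives in the top fraction / size of that fraction)
         / (total actives / total compounds).
    Assumes higher scores rank compounds as more likely to be active."""
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(round(fraction * len(ranked))))
    actives_top = sum(active for _, active in ranked[:n_top])
    total_actives = sum(is_active)
    return (actives_top / n_top) / (total_actives / len(ranked))
```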

    Seventh Biennial Report: June 2003 - March 2005
