1,634 research outputs found

    The study of probability model for compound similarity searching

    The main task of an Information Retrieval (IR) system is to retrieve documents relevant to a user's query. One of the most popular IR retrieval models is the Vector Space Model, which assumes relevance based on similarity, defined as the distance between query and document in the concept space. Currently existing chemical compound database systems have adopted the vector space model to calculate the similarity of a database entry to a query compound. However, this model assumes that the fragments represented by the fingerprint bits are independent of one another, which is not necessarily true. Hence, the possibility of applying another IR model, the Probabilistic Model, to chemical compound searching is explored. This model estimates the probability that a chemical structure has the same bioactivity as a target compound. It is envisioned that by ranking chemical structures in decreasing order of their probability of relevance to the query structure, the effectiveness of a molecular similarity searching system can be increased. Both the fragment-dependence and fragment-independence assumptions are taken into consideration in improving the compound similarity searching system. After conducting a series of simulated similarity searches, it is concluded that the probabilistic model approaches do perform better than the existing similarity searching, giving better results on all evaluation criteria. As to which probability model performs better, the BD model showed improvement over the binary independence retrieval (BIR) model.
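    As a minimal sketch of the fragment-independence case, the classic BIR (Robertson/Sparck Jones) log-odds weighting can be applied to binary compound fingerprints, assuming a small set of known actives serves as relevance data; the function names and the 0.5 smoothing constants here are illustrative, not the dissertation's actual implementation.

```python
import math

def bir_weights(actives, database):
    """BIR term weights over fingerprint bits.

    actives:  binary fingerprints (lists of 0/1) known to share the
              target bioactivity; database: all fingerprints.
    Returns a log-odds weight per bit, with add-0.5 smoothing
    (the classic Robertson/Sparck Jones estimate)."""
    n_bits = len(database[0])
    R, N = len(actives), len(database)
    weights = []
    for i in range(n_bits):
        r = sum(fp[i] for fp in actives)    # actives with bit i set
        n = sum(fp[i] for fp in database)   # all compounds with bit i set
        p = (r + 0.5) / (R + 1.0)           # P(bit set | relevant)
        q = (n - r + 0.5) / (N - R + 1.0)   # P(bit set | non-relevant)
        weights.append(math.log(p * (1 - q) / (q * (1 - p))))
    return weights

def rank(query, database, weights):
    """Score each compound by summing the weights of bits it shares with
    the query, then sort in decreasing probability of relevance."""
    def score(fp):
        return sum(w for qb, b, w in zip(query, fp, weights) if qb and b)
    return sorted(database, key=score, reverse=True)
```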

    Targeting the Poly (ADP-Ribose) Polymerase-1 Catalytic Pocket Using AutoGrow4, a Genetic Algorithm for De Novo Design

    AutoGrow4 is a free and open-source program for de novo drug design that uses a genetic algorithm (GA) to create novel predicted small-molecule ligands for a given protein target without the constraints of a finite, pre-defined virtual library. By leveraging recent computational and cheminformatic advancements, AutoGrow4 is faster, more stable, and more modular than previous versions. Features such as docking-software compatibility, chemical filters, multithreading options, and selection methods have been expanded to support a wide range of user needs. This dissertation will cover the development and validation of AutoGrow4, as well as its application to poly (ADP-ribose) polymerase-1 (PARP-1). PARP-1 is a well-characterized DNA-damage recognition protein, and PARP-1 inhibition is an effective treatment for ovarian and breast cancers that are homologous-recombination (HR) deficient [1–5]. As a well-studied protein, PARP-1 is also an excellent drug target with which to validate AutoGrow4. Multiple crystallographic structures of PARP-1 bound to various PARP-1 inhibitors (PARPi) serve as positive controls for assessing the quality of AutoGrow4-generated compounds in terms of predicted binding affinity, chemical structure, and predicted protein-ligand interactions. This dissertation describes how I (1) generated novel potential PARPi with predicted binding affinities that surpass those of known PARPi; (2) validated AutoGrow4 as a tool for de novo drug design, lead optimization, and hypothesis generation, using PARP-1 as a test target; (3) contributed support to the growing notion that there is a need for HR-deficient cancer chemotherapies that do not rely on the same set of protein-ligand interactions typical of current PARPi; (4) generated novel potential PARPi that are predicted to bind to PARP-1 independent of a post-translational modification that is known to cause PARPi resistance; and (5) generated novel potential PARPi that are predicted to bind a secondary PARP-1 pocket that is distant from the primary catalytic site.
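    For orientation, the sketch below shows the generic score/select/vary GA cycle that tools of this kind are built on. It is not AutoGrow4's actual interface: `dock_score`, `mutate`, and `crossover` are caller-supplied stand-ins (e.g. a docking program and reaction-based operators), and a seed population of at least two ligands is assumed.

```python
import random

def evolve(seed_population, dock_score, mutate, crossover,
           generations=10, pop_size=50, elite_frac=0.1):
    """Generic GA cycle: score the population, keep an elite, and refill
    with mutants and crossovers of the elite.  Lower dock_score is taken
    to mean better predicted binding affinity."""
    population = list(seed_population)            # SMILES strings, say
    for _ in range(generations):
        ranked = sorted(population, key=dock_score)
        elite = ranked[:max(2, int(elite_frac * pop_size))]
        children = []
        while len(elite) + len(children) < pop_size:
            if random.random() < 0.5:
                children.append(mutate(random.choice(elite)))
            else:
                children.append(crossover(*random.sample(elite, 2)))
        population = elite + children
    return sorted(population, key=dock_score)     # best candidates first
```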

    Sparse Learning over Infinite Subgraph Features

    We present a supervised learning algorithm for graph data (a set of graphs) that handles arbitrary twice-differentiable loss functions and sparse linear models over all possible subgraph features. It has previously been shown that several types of sparse learning, such as AdaBoost, LPBoost, LARS/LASSO, and sparse PLS regression, can be performed over all possible subgraph features. Particular emphasis is placed on simultaneous learning of relevant features from an infinite set of candidates. We first generalize the techniques used in these preceding studies to derive a unifying bounding technique for arbitrary separable functions. We then carefully apply this bound to make block coordinate gradient descent feasible over infinite subgraph features, resulting in a fast-converging algorithm that can solve a wider class of sparse learning problems over graph data. We also empirically study the differences from existing approaches in convergence behaviour, selected subgraph features, and search-space sizes. We further discuss several previously unnoticed issues in sparse learning over all possible subgraph features.
    Comment: 42 pages, 24 figures, 4 tables
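    The step that makes search over infinitely many subgraph features tractable is a pruning bound on the gradient: a supergraph can only occur in a subset of the graphs containing its subgraph, so a bound computed at one node of the subgraph-mining tree holds for the whole branch. The sketch below shows a gBoost-style version of this anti-monotone bound; the variable names and calling convention are illustrative assumptions, not the paper's code.

```python
def gradient_bound(gradients, support):
    """Upper bound on |sum_i g_i * x_i(t')| over every supergraph t' of a
    subgraph t, where x_i is the 0/1 occurrence indicator and `support`
    is the set of training-graph indices containing t.  Any t' is
    supported by a subset of `support`, so this bound is valid for the
    whole search-tree branch rooted at t."""
    pos = sum(gradients[i] for i in support if gradients[i] > 0)
    neg = -sum(gradients[i] for i in support if gradients[i] < 0)
    return max(pos, neg)

def should_prune(gradients, support, best_so_far):
    """Cut the branch: no extension of this subgraph can yield a feature
    whose gradient magnitude beats the best one found so far."""
    return gradient_bound(gradients, support) <= best_so_far
```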

    Scheduling and Tuning Kernels for High-performance on Heterogeneous Processor Systems

    Accelerated parallel computing techniques using devices such as GPUs and Xeon Phis (along with CPUs) offer promising solutions for extending the cutting edge of high-performance computer systems. A significant performance improvement can be achieved when suitable workloads are handled by the accelerator, while traditional CPUs handle the workloads not well suited to accelerators. The combination of multiple types of processors in a single computer system is referred to as a heterogeneous system. This dissertation addresses tuning and scheduling issues in heterogeneous systems. The first section presents work on tuning scientific workloads on three different types of processors: multi-core CPUs, the Xeon Phi massively parallel processor, and NVIDIA GPUs; common tuning methods and platform-specific tuning techniques are presented. Analysis then demonstrates the performance characteristics of the heterogeneous system on different input data. This section of the dissertation is part of the GeauxDock project, which prototyped several state-of-the-art bioinformatics algorithms and delivered a fast molecular docking program. The second section studies the performance model of the GeauxDock computing kernel. Specifically, the work extracts features from the input data set and the target systems, and then uses various regression models to predict the expected computation time. This helps explain why a certain processor is faster for certain sets of tasks, and it provides the essential information for scheduling on heterogeneous systems. In addition, this dissertation investigates a high-level task scheduling framework for heterogeneous processor systems in which the strengths and weaknesses of the different processors complement each other, so that higher performance can be achieved on heterogeneous computing systems. A new scheduling algorithm with four innovations is presented: Ranked Opportunistic Balancing (ROB), Multi-subject Ranking (MR), Multi-subject Relative Ranking (MRR), and Automatic Small Tasks Rearranging (ASTR). The new algorithm consistently outperforms previously proposed algorithms, with better scheduling results, lower computational complexity, and more consistent results over a range of performance prediction errors. Finally, this work extends the heterogeneous task scheduling algorithm to handle power capping. It demonstrates that a power-aware scheduler significantly improves power efficiency and reduces energy consumption, suggesting that, in addition to performance benefits, heterogeneous systems may have certain advantages in overall power efficiency.
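    To illustrate how a learned performance model feeds scheduling, here is a simplified greedy baseline (not the ROB/MR/MRR/ASTR algorithms themselves), assuming a caller-supplied `predict_time(task, device)` produced by regression models like those described above.

```python
def schedule(tasks, devices, predict_time):
    """Greedy heterogeneous scheduler: place each task on the device that
    finishes it earliest according to the predicted runtime, considering
    longest tasks first (a common list-scheduling heuristic)."""
    finish = {d: 0.0 for d in devices}        # busy-until time per device
    assignment = {}
    for task in sorted(tasks,
                       key=lambda t: -max(predict_time(t, d) for d in devices)):
        best = min(devices, key=lambda d: finish[d] + predict_time(task, d))
        finish[best] += predict_time(task, best)
        assignment[task] = best
    return assignment, max(finish.values())   # placement and makespan
```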

    Molecular Similarity and Xenobiotic Metabolism

    MetaPrint2D, a new software tool implementing a data-mining approach for predicting sites of xenobiotic metabolism, has been developed. The algorithm is based on a statistical analysis of the occurrences of atom-centred circular fingerprints in both substrates and metabolites. This approach has undergone extensive evaluation and been shown to be of comparable accuracy to current best-in-class tools, but is able to make much faster predictions, for the first time enabling chemists to explore the effects of structural modifications on a compound's metabolism in a highly responsive and interactive manner. MetaPrint2D is able to assign a confidence score to the predictions it generates, based on the availability of relevant data and the degree to which a compound is modelled by the algorithm. In the course of the evaluation of MetaPrint2D, a novel metric for assessing the performance of site-of-metabolism predictions has been introduced. This overcomes the bias, arising from molecule size and the number of sites of metabolism, that is inherent in the most commonly reported metrics for evaluating site-of-metabolism predictions. This data-mining approach has been augmented by a set of reaction type definitions to produce MetaPrint2D-React, enabling prediction of the types of transformations a compound is likely to undergo and the metabolites that are formed. This approach has been evaluated against both historical data and metabolic schemes reported in a number of recently published studies. Results suggest that the ability of this method to predict metabolic transformations is highly dependent on the relevance of the training set data to the query compounds. MetaPrint2D has been released as an open-source software library, and both MetaPrint2D and MetaPrint2D-React are available for chemists to use through the Unilever Centre for Molecular Science Informatics website. Funded by Boehringer Ingelheim.
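    A minimal sketch of the occurrence-ratio idea described above, assuming atom environments have already been encoded as hashable fingerprint keys; the function names and fallback handling are illustrative assumptions, not MetaPrint2D's implementation.

```python
from collections import Counter

def occurrence_ratios(substrate_envs, reaction_centre_envs):
    """For each atom-centred circular-fingerprint environment, the ratio
    of how often it occurs at a reaction centre to how often it occurs in
    substrates at all.  A ratio near 1 marks environments that are almost
    always metabolized in the training data."""
    sub = Counter(substrate_envs)        # environment -> substrate count
    rc = Counter(reaction_centre_envs)   # environment -> reaction-centre count
    return {env: rc[env] / count for env, count in sub.items() if count > 0}

def predict_sites(molecule_envs, ratios, default=0.0):
    """Score each atom of a query molecule by the mined ratio of its
    environment; unseen environments fall back to a low default,
    mirroring the data-availability confidence score described above."""
    return [ratios.get(env, default) for env in molecule_envs]
```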

    Efficient Algorithms for Graph Optimization Problems

    This doctoral dissertation presents efficient algorithms for solving hard combinatorial optimization problems defined on graphs. The most important results of the research are improvements developed for various solution methods, including new heuristics as well as special representations of graphs and trees. The analyses performed confirmed that the author's most efficient algorithms are, in the majority of cases, faster and give better results than other available implementations. The first half of the dissertation presents seven different algorithms and numerous useful improvements for the minimum-cost flow problem, one of the most studied and most widely applied graph optimization problems. In a comprehensive experimental analysis, our implementations were compared with eight other solvers, including the most widely used and most highly regarded implementations. Our network simplex algorithm proved substantially more efficient and more robust than other implementations of that method, and on most test instances it is also the fastest algorithm. The cost-scaling algorithm presented is likewise remarkably efficient; on large sparse graphs it outperforms the network simplex implementations. The other optimization problem discussed in the dissertation is the maximum common subgraph problem, which we examined from the perspective of chemical applications. We developed efficient heuristics that considerably improve the accuracy and speed of two solution methods and match the atoms and bonds of molecular graphs to each other in a more chemically relevant way. We compared our algorithms with two well-known solvers and achieved substantially better results. The implementations we developed have been incorporated into several software products of ChemAxon Kft., which are used by leading international pharmaceutical companies. In addition, the dissertation briefly presents LEMON, an open-source C++ graph optimization library that includes the algorithms given for the minimum-cost flow problem. These implementations have contributed greatly to the growing popularity of the library.
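    For a concrete picture of the problem these solvers address, here is a small minimum-cost flow instance solved with networkx in Python. LEMON itself is a C++ library; this sketch only illustrates the problem being solved, with made-up capacities and costs.

```python
import networkx as nx

# Minimum-cost flow: route 4 units from s to t at least total cost,
# respecting edge capacities.  Negative demand = supply.
G = nx.DiGraph()
G.add_node("s", demand=-4)
G.add_node("t", demand=4)
G.add_edge("s", "a", capacity=3, weight=1)
G.add_edge("s", "b", capacity=2, weight=4)
G.add_edge("a", "t", capacity=3, weight=2)
G.add_edge("b", "t", capacity=2, weight=1)

flow = nx.min_cost_flow(G)               # networkx's network simplex
print(flow, nx.cost_of_flow(G, flow))    # optimal routing and its cost
```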

    Design and implementation of a platform for predicting pharmacological properties of molecules

    Master's thesis in Bioinformatics and Computational Biology, Universidade de Lisboa, Faculdade de Ciências, 2019. The drug discovery and design process is expensive, time-consuming, and resource-intensive. Various in silico methods are used to make the process more efficient and productive. Methods such as virtual screening often take advantage of QSAR machine learning models to more easily pinpoint the most promising drug candidates from large pools of compounds. QSAR (Quantitative Structure Activity Relationship) is a ligand-based method in which structural information about known ligands of a specific target is used to predict the biological activity of another molecule against that target. QSAR models are also used to improve an existing molecule's pharmacological potential by elucidating which structural features confer the desirable properties. Several researchers create and develop QSAR machine learning models for a variety of therapeutic targets. However, their use is limited by lack of access to the models and, beyond access, by the difficulty of running published software, given the need to manage dependencies and replicate the development environment. To address this issue, the application documented here was designed and developed: a centralized platform where researchers can access several QSAR machine learning models and test their own datasets for interaction with various therapeutic targets.
    The platform accepts widespread molecule identifiers such as SMILES and InChI as input, handling the necessary conversion into the appropriate molecular descriptors to be used in the models. It can be accessed through a web application with a full graphical user interface developed with the R package Shiny, and through a REST API developed with the Flask-RESTful package for Python. The complete application is packaged in container technology, specifically Docker. The main goal of this platform is to grant widespread access to the QSAR models developed by the scientific community, by concentrating them in a single location and removing the user's need to install or set up unfamiliar software. This is intended to incite knowledge creation and facilitate the research process.
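    A minimal sketch of what such a Flask-RESTful prediction endpoint could look like; the route, payload shape, and `predict_activity` helper are assumptions for illustration, not the platform's documented API.

```python
from flask import Flask, request
from flask_restful import Api, Resource

app = Flask(__name__)
api = Api(app)

def predict_activity(smiles, target):
    """Placeholder for the real pipeline: convert the SMILES string into
    molecular descriptors and run the stored QSAR model for `target`."""
    raise NotImplementedError

class Prediction(Resource):
    def post(self, target):
        # Hypothetical request body: {"smiles": ["CCO", ...]}
        smiles = request.get_json()["smiles"]
        return {"target": target,
                "predictions": [predict_activity(s, target) for s in smiles]}

api.add_resource(Prediction, "/predict/<string:target>")

if __name__ == "__main__":
    app.run()   # the real platform runs this inside a Docker container
```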