35 research outputs found

    NOVEL ALGORITHMS AND TOOLS FOR LIGAND-BASED DRUG DESIGN

    Computer-aided drug design (CADD) has become an indispensable component of modern drug discovery projects. Predicting the physicochemical and pharmacological properties of candidate compounds increases the probability that drug candidates pass the later phases of clinical trials. Ligand-based virtual screening has advantages over structure-based drug design in terms of its wide applicability and high computational efficiency. The established chemical repositories and reported bioassays form a gigantic knowledge base from which to derive quantitative structure-activity relationships (QSAR) and quantitative structure-property relationships (QSPR). In addition, the rapid advance of machine learning techniques suggests new solutions for data mining huge compound databases. In this thesis, a novel ligand classification algorithm, Ligand Classifier of Adaptively Boosting Ensemble Decision Stumps (LiCABEDS), is reported for the prediction of diverse categorical pharmacological properties. LiCABEDS was successfully applied to model 5-HT1A ligand functionality, ligand selectivity of cannabinoid receptor subtypes, and blood-brain barrier (BBB) passage. LiCABEDS was implemented and integrated with a graphical user interface, data import/export, automated model training/prediction, and project management. In addition, a non-linear ligand classifier is proposed, using a novel Topomer kernel function in a support vector machine. With the emphasis on green high-performance computing, graphics processing units (GPUs) are alternative platforms for computationally expensive tasks. A novel GPU algorithm was designed and implemented to accelerate the calculation of chemical similarities with dense-format molecular fingerprints. Finally, a compound acquisition algorithm is reported that constructs structurally diverse screening libraries in order to enhance hit rates in high-throughput screening.
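
The idea named in the LiCABEDS acronym, adaptive boosting over decision stumps, can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the toy fingerprints, and the choice of one stump per fingerprint bit are assumptions made for illustration only.

```python
import math

def stump_output(x, bit, polarity):
    # A decision stump looks at a single fingerprint bit (assumed binary).
    return polarity if x[bit] else -polarity

def train_adaboost_stumps(X, y, n_rounds=10):
    """X: list of binary fingerprints; y: labels in {-1, +1} (hypothetical encoding)."""
    n, n_bits = len(X), len(X[0])
    weights = [1.0 / n] * n
    model = []  # list of (alpha, bit, polarity)
    for _ in range(n_rounds):
        # Pick the stump with the lowest weighted error.
        best = None
        for bit in range(n_bits):
            for polarity in (1, -1):
                err = sum(w for xi, yi, w in zip(X, y, weights)
                          if stump_output(xi, bit, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, bit, polarity)
        err, bit, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, bit, polarity))
        # Increase the weight of misclassified compounds, then renormalize.
        weights = [w * math.exp(-alpha * yi * stump_output(xi, bit, polarity))
                   for xi, yi, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, x):
    score = sum(alpha * stump_output(x, bit, polarity)
                for alpha, bit, polarity in model)
    return 1 if score >= 0 else -1

# Toy data: the class is fully determined by bit 0.
X = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 0]]
y = [1, 1, -1, -1]
model = train_adaboost_stumps(X, y, n_rounds=5)
print([predict(model, x) for x in X])
```

Each round adds one weak rule and shifts weight toward the compounds the current ensemble gets wrong, which is what lets many single-bit rules combine into a stronger classifier.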

    Chemical Informatics Functionality in R

    The flexibility and scope of the R programming environment have made it a popular choice for statistical modeling and scientific prototyping in a number of fields. In the field of chemistry, R provides several tools for a variety of problems related to statistical modeling of chemical information. However, one aspect common to these tools is that they do not have direct access to the information that is available from chemical structures, such as that contained in molecular descriptors. We describe the rcdk package, which provides the R user with access to the CDK, a Java framework for cheminformatics. As a result, it is possible to read in a variety of molecular formats, calculate molecular descriptors, and evaluate fingerprints. In addition, we describe the rpubchem package, which allows access to the data in PubChem, a public repository of molecular structures and associated assay data for approximately 8 million compounds. Currently, the package allows access to structural information as well as some simple molecular properties from PubChem. In addition, the package allows access to bioassay data from the PubChem FTP servers.

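    Study of the Quantitative Structure-Activity Relationship of Pesticides Using Classification Techniques (Estudio de la Relación Cuantitativa Estructura-Actividad de pesticidas mediante técnicas de clasificación)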

    The aim of this work was to compare the k-Nearest Neighbors (k-NN) and Counterpropagation Artificial Neural Network (CP-ANN) classification methods for modeling the toxicity of a set of 192 organochlorine, organophosphate, carbamate, and pyrethroid pesticides, measured as effective concentration (EC50). The EC50 values were divided into three classes: low, intermediate, and high toxicity. A total of 4885 molecular descriptors were calculated using the Dragon software and then analyzed simultaneously through k-NN classification coupled with the Genetic Algorithms - Variable Subset Selection (GA-VSS) technique. The models were validated with an external test set of compounds. The results clearly suggest that 3D descriptors did not offer relevant information for modeling the classes. In addition, k-NN showed better results than CP-ANN.
    Instituto de Investigaciones Fisicoquímicas Teóricas y Aplicada
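
As a rough illustration of k-NN classification into three toxicity classes, the sketch below assigns a query compound the majority class among its k nearest training compounds. The two-dimensional descriptor vectors, the Euclidean distance, and the value k=3 are assumptions for the example; the abstract does not state the distance metric or the selected descriptors.

```python
import math
from collections import Counter

def knn_classify(train_X, train_y, query, k=3):
    # Sort training compounds by Euclidean distance to the query.
    neighbours = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Majority vote among the k closest compounds.
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

# Hypothetical descriptor vectors (two descriptors per compound).
train_X = [(0.1, 0.2), (0.0, 0.3), (1.0, 1.1), (0.9, 1.0), (2.0, 2.1), (2.1, 1.9)]
train_y = ["low", "low", "intermediate", "intermediate", "high", "high"]

print(knn_classify(train_X, train_y, (0.05, 0.25)))
print(knn_classify(train_X, train_y, (2.0, 2.0)))
```

In a workflow like the one described, a variable selection step such as GA-VSS would decide which descriptor columns enter the distance calculation.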

    Development of Conformation Independent Computational Models for the Early Recognition of Breast Cancer Resistance Protein Substrates

    ABC efflux transporters are polyspecific members of the ABC superfamily that, acting as drug and metabolite carriers, provide a biochemical barrier against drug penetration and contribute to detoxification. Their overexpression is linked to multidrug resistance issues in a diversity of diseases. Breast cancer resistance protein (BCRP) is the most expressed ABC efflux transporter throughout the intestine and the blood-brain barrier, limiting oral absorption and brain bioavailability of its substrates. Early recognition of BCRP substrates is thus essential to optimize oral drug absorption, design novel therapeutics for central nervous system conditions, and overcome BCRP-mediated cross-resistance issues. We present the development of an ensemble of ligand-based machine learning algorithms for the early recognition of BCRP substrates, from a database of 262 substrates and nonsubstrates compiled from the literature. This dataset was rationally partitioned into training and test sets by application of a 2-step clustering procedure. The models were developed through application of linear discriminant analysis to random subsamples of Dragon molecular descriptors. Simple data fusion and statistical comparison of partial areas under the ROC curves were applied to obtain the best 2-model combination, which presented overall accuracies of 82% and 74.5% in the training and test sets, respectively.
    Facultad de Ciencias Exacta
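
The combination step can be pictured with a small sketch of score-level data fusion. The abstract does not say which fusion operator was used; averaging the two model scores, the 0.5 decision threshold, and the toy scores below are all assumptions for illustration.

```python
def fuse_scores(scores_a, scores_b):
    # Assumed fusion rule: arithmetic mean of the two model scores.
    return [(a + b) / 2.0 for a, b in zip(scores_a, scores_b)]

def overall_accuracy(scores, labels, threshold=0.5):
    # Label 1 = substrate, 0 = nonsubstrate (hypothetical encoding).
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical scores from two individual models on four compounds.
scores_a = [0.9, 0.4, 0.2, 0.7]
scores_b = [0.4, 0.8, 0.1, 0.2]
labels = [1, 1, 0, 0]

fused = fuse_scores(scores_a, scores_b)
print(overall_accuracy(scores_a, labels))
print(overall_accuracy(scores_b, labels))
print(overall_accuracy(fused, labels))
```

In this toy case each model misclassifies a different compound, so the fused score is more accurate than either model alone, which is the motivation for combining models.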

    Analysis of Biological Screening Data and Molecular Selectivity Profiles Using Fingerprints and Mapping Algorithms

    The identification of promising drug candidates is a major milestone in the early stages of drug discovery and design. Among the properties that have to be optimized before a drug candidate is admitted to clinical testing, potency and target selectivity are of great interest and can be addressed very early. Unfortunately, optimization-relevant knowledge is often limited, and the analysis of noisy and heterogeneous biological screening data with standard methods like QSAR is hardly feasible. Furthermore, the identification of compounds displaying different selectivity patterns against related targets is a prerequisite for chemical genetics and genomics applications, making it possible to interfere specifically with functions of individual members of protein families. In this thesis it is shown that computational methods based on molecular similarity are suitable tools for the analysis of compound potency and target selectivity. Originally developed to facilitate the efficient discovery of active compounds by means of virtual screening of compound libraries, these ligand-based approaches assume that similar molecules are likely to exhibit similar properties and biological activities, based on the similarity property principle. Given their holistic approach to molecular similarity analysis, ligand-based virtual screening methods can be applied when little or no structure-activity information is available and do not require knowledge of the target structure. The methods under investigation cover a wide methodological spectrum and rely only on properties derived from one- and two-dimensional molecular representations, which renders them particularly useful for handling large compound libraries. Using biological screening data, these virtual screening methods are shown to be able to extrapolate from experimental data and preferentially detect potent compounds. Subsequently, extensive benchmark calculations show that existing 2D molecular fingerprints and dynamic mapping algorithms are suitable tools for distinguishing between compounds with differential selectivity profiles. Finally, an advanced dynamic mapping algorithm is introduced that is able to generate target-selective chemical reference spaces by adaptively identifying the most discriminative molecular properties from a set of active compounds. These reference spaces are shown to be of great value for the generation of predictive target-selectivity models by screening a biologically annotated compound library.
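
Fingerprint-based similarity searching of the kind described here can be sketched as follows. The Tanimoto coefficient is a common choice for comparing 2D fingerprints, but the abstract does not name the similarity measure, so it is an assumption, as are the toy fingerprints and compound names.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient for fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

def rank_by_similarity(reference, library):
    # Most similar compounds first, following the similarity property principle.
    return sorted(library,
                  key=lambda name: tanimoto(reference, library[name]),
                  reverse=True)

# Hypothetical reference compound and screening library.
reference = {1, 2, 3, 4}
library = {
    "cmpd_a": {1, 2, 3, 5},
    "cmpd_b": {7, 8},
    "cmpd_c": {1, 2, 3, 4},
}

print(tanimoto(reference, library["cmpd_a"]))
print(rank_by_similarity(reference, library))
```

Compounds at the top of the ranking would be the first candidates for testing, since they share the most structural features with the known active.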

    Statistical Learning in Drug Discovery via Clustering and Mixtures

    In drug discovery, thousands of compounds are assayed to detect activity against a biological target. The goal of drug discovery is to identify compounds that are active against the target (e.g. inhibit a virus). Statistical learning in drug discovery seeks to build a model that uses descriptors characterizing molecular structure to predict biological activity. However, the characteristics of drug discovery data can make it difficult to model the relationship between molecular descriptors and biological activity. Among these characteristics are the rarity of active compounds, the large volume of compounds tested by high-throughput screening, and the complexity of molecular structure and its relationship to activity. This thesis focuses on the design of statistical learning algorithms/models and their applications to drug discovery. The two main parts of the thesis are an algorithm-based statistical method and a more formal model-based approach. Both approaches can facilitate and accelerate the process of developing new drugs. A unifying theme is the use of unsupervised methods as components of supervised learning algorithms/models. In the first part of the thesis, we explore a sequential screening approach, Cluster Structure-Activity Relationship Analysis (CSARA). Sequential screening integrates High Throughput Screening with mathematical modeling to sequentially select the best compounds. CSARA is a cluster-based and algorithm-driven method. To gain further insight into this method, we use three carefully designed experiments to compare its predictive accuracy with that of Recursive Partitioning, a popular structure-activity relationship analysis method. The experiments show that CSARA outperforms Recursive Partitioning. Comparisons include problems with many descriptor sets and situations in which many descriptors are not important for activity.
    In the second part of the thesis, we propose and develop constrained mixture discriminant analysis (CMDA), a model-based method. The main idea of CMDA is to model the distribution of the observations given the class label (e.g. active or inactive class) as a constrained mixture distribution, and then use Bayes' rule to predict the probability of being active for each observation in the testing set. Constraints are used to deal with the otherwise explosive growth of the number of parameters with increasing dimensionality. CMDA is designed to solve several challenges in modeling drug data sets, such as multiple mechanisms, the rare target problem (i.e. imbalanced classes), and the identification of relevant subspaces of descriptors (i.e. variable selection). We focus on the CMDA1 model, in which univariate densities form the building blocks of the mixture components. Due to the unboundedness of the CMDA1 log likelihood function, it is easy for the EM algorithm to converge to degenerate solutions. A special Multi-Step EM algorithm is therefore developed and explored via several experimental comparisons. Using the Multi-Step EM algorithm, the CMDA1 model is compared to model-based clustering discriminant analysis (MclustDA). The CMDA1 model is either superior to or competitive with the MclustDA model, depending on which model generates the data. The CMDA1 model performs better than the MclustDA model when the data are high-dimensional and unbalanced, an essential feature of the drug discovery problem. An alternative approach to the problem of degeneracy is penalized estimation. By introducing a group of simple penalty functions, we consider penalized maximum likelihood estimation of the CMDA1 and CMDA2 models. This strategy improves the convergence of the conventional EM algorithm and helps avoid degenerate solutions. Extending techniques from Chen et al. (2007), we prove that the PMLEs of the two-dimensional CMDA1 model can be asymptotically consistent.
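
The prediction step, class-conditional mixture densities combined through Bayes' rule, can be sketched as below. This is not the CMDA1 model itself: the specific constraints are not given in the abstract, so each mixture component is assumed here to be a product of independent univariate normal densities, and all parameter values are hypothetical rather than estimated by EM.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_density(x, components):
    """components: list of (weight, means, sds); each component is assumed
    to be a product of univariate normal densities."""
    total = 0.0
    for weight, means, sds in components:
        density = weight
        for xi, m, s in zip(x, means, sds):
            density *= normal_pdf(xi, m, s)
        total += density
    return total

def posterior_active(x, prior_active, active_components, inactive_components):
    # Bayes' rule: P(active | x) is proportional to P(active) * p(x | active).
    p_active = prior_active * mixture_density(x, active_components)
    p_inactive = (1.0 - prior_active) * mixture_density(x, inactive_components)
    return p_active / (p_active + p_inactive)

# Two hypothetical activity mechanisms and one inactive cluster.
active = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [5.0, 5.0], [1.0, 1.0])]
inactive = [(1.0, [2.5, 2.5], [1.0, 1.0])]
prior = 0.1  # active compounds are rare

print(posterior_active([0.0, 0.0], prior, active, inactive))
print(posterior_active([2.5, 2.5], prior, active, inactive))
```

Using more than one component for the active class is what allows a model of this kind to represent multiple mechanisms, while the small prior reflects the rare target problem.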