4 research outputs found

    Dissimilarity-based algorithms for selecting structurally diverse sets of compounds

    This paper commences with a brief introduction to modern techniques for the computational analysis of molecular diversity and the design of combinatorial libraries. It then reviews dissimilarity-based algorithms for the selection of structurally diverse sets of compounds in chemical databases. Procedures are described for selecting a diverse subset of an entire database, and for selecting diverse combinatorial libraries using both reagent-based and product-based selection.
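
    As an illustration of what a dissimilarity-based selection can look like, the sketch below implements a simple MaxMin-style pick over binary fingerprints. The abstract does not say which algorithms the paper covers, so the choice of MaxMin, the Tanimoto dissimilarity, and the names `tanimoto_dissimilarity`, `maxmin_select`, and `seed_index` are all assumptions made for this example, not the authors' method.

```python
from typing import FrozenSet, List, Sequence

# Assumption: each compound is represented by the set of "on" bit
# positions of a binary structural fingerprint.
Fingerprint = FrozenSet[int]


def tanimoto_dissimilarity(a: Fingerprint, b: Fingerprint) -> float:
    """Return 1 - Tanimoto similarity for two bit-position sets."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def maxmin_select(fps: Sequence[Fingerprint], k: int, seed_index: int = 0) -> List[int]:
    """Pick k indices so that each new pick is the compound whose
    nearest already-selected neighbour is as far away as possible."""
    if not 0 < k <= len(fps):
        raise ValueError("k must be between 1 and the number of compounds")
    if not 0 <= seed_index < len(fps):
        raise ValueError("seed_index is out of range")

    selected = [seed_index]
    # nearest[i] holds the dissimilarity from compound i to its closest
    # selected compound; it only shrinks as the subset grows.
    nearest = [tanimoto_dissimilarity(fp, fps[seed_index]) for fp in fps]

    while len(selected) < k:
        candidates = [i for i in range(len(fps)) if i not in selected]
        best = max(candidates, key=lambda i: nearest[i])
        selected.append(best)
        for i, fp in enumerate(fps):
            nearest[i] = min(nearest[i], tanimoto_dissimilarity(fp, fps[best]))
    return selected


if __name__ == "__main__":
    library = [
        frozenset({1, 2, 3}),
        frozenset({1, 2, 4}),
        frozenset({7, 8, 9}),
        frozenset({1, 2, 3, 4}),
    ]
    print(maxmin_select(library, k=3))
```

    The running `nearest` list avoids recomputing all pairwise dissimilarities at every step, which is the usual reason this style of selection scales to large databases. The result depends on the seed compound.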

    DPRESS: Localizing estimates of predictive uncertainty

    Background: The need to have a quantitative estimate of the uncertainty of prediction for QSAR models is steadily increasing, in part because such predictions are being widely distributed as tabulated values disconnected from the models used to generate them. Classical statistical theory assumes that the error in the population being modeled is independent and identically distributed (IID), but this is often not actually the case. Such inhomogeneous error (heteroskedasticity) can be addressed by providing an individualized estimate of predictive uncertainty for each particular new object u: the standard error of prediction s_u can be estimated as the non-cross-validated error s_t* for the closest object t* in the training set, adjusted for its separation d from u in the descriptor space relative to the size of the training set. [Display formula: 1758-2946-1-11-i1.gif] The predictive uncertainty factor γ_t* is obtained by distributing the internal predictive error sum of squares across objects in the training set based on the distances between them, hence the acronym: Distributed PRedictive Error Sum of Squares (DPRESS). Note that s_t* and γ_t* are characteristic of each training set compound contributing to the model of interest.
    Results: The method was applied to partial least-squares models built using 2D (molecular hologram) or 3D (molecular field) descriptors applied to mid-sized training sets (N = 75) drawn from a large (N = 304), well-characterized pool of cyclooxygenase inhibitors. The observed variation in predictive error for the external 229-compound test sets was compared with the uncertainty estimates from DPRESS. Good qualitative and quantitative agreement was seen between the distributions of predictive error observed and those predicted using DPRESS. Inclusion of the distance-dependent term was essential to getting good agreement between the estimated uncertainties and the observed distributions of predictive error. The uncertainty estimates derived by DPRESS were conservative even when the training set was biased, but not excessively so.
    Conclusion: DPRESS is a straightforward and powerful way to reliably estimate individual predictive uncertainties for compounds outside the training set based on their distance to the training set and the internal predictive uncertainty associated with its nearest neighbor in that set. It represents a sample-based, a posteriori approach to defining applicability domains in terms of localized uncertainty.

    New Fundamental Technologies in Data Mining

    The progress of data mining technology and its wide public popularity have created a need for a comprehensive text on the subject. The book series entitled "Data Mining" addresses this need by presenting in-depth descriptions of novel mining algorithms and many useful applications. In addition to helping readers understand each section in depth, the two books present useful hints and strategies for solving the problems in the following chapters. The contributing authors have highlighted many future research directions that will foster multidisciplinary collaborations and hence lead to significant developments in the field of data mining.

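    Développement de méthodes et d’outils chémoinformatiques pour l’analyse et la comparaison de chimiothèques [Development of chemoinformatics methods and tools for the analysis and comparison of chemical libraries]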

    New fields have emerged at the interface of biology, chemistry, and computer science to address the many problems associated with drug discovery. This thesis lies at the interface of several of these fields, grouped under the banner of chemoinformatics. Although recent on a human timescale, this field is already an integral part of pharmaceutical research. As with bioinformatics, its founding pillar remains the storage, representation, management, and computational exploitation of chemical data. Chemoinformatics is today used mainly in the upstream phases of drug discovery. By combining methods from different fields (chemistry, computer science, mathematics, machine learning, statistics, etc.), it enables the implementation of software tools suited to the specific problems and data of chemistry, such as storing chemical information in databases, substructure searching, data visualization, and the prediction of physicochemical and biological properties. Within this multidisciplinary framework, the work presented in this thesis addresses two important aspects of chemoinformatics: (1) the development of new methods to facilitate the visualization, analysis, and interpretation of data on sets of molecules, commonly called chemical libraries, and (2) the development of software tools to implement these methods.