
    A tree-based method for the rapid screening of chemical fingerprints

    Background: The fingerprint of a molecule is a bitstring based on its structure, constructed such that structurally similar molecules will have similar fingerprints. Molecular fingerprints can be used in an initial phase of drug development for identifying novel drug candidates by screening large databases for molecules with fingerprints similar to a query fingerprint. Results: In this paper, we present a method which efficiently finds all fingerprints in a database with Tanimoto coefficient to the query fingerprint above a user-defined threshold. The method is based on two novel data structures for rapid screening of large databases: the kD grid and the Multibit tree. The kD grid is based on splitting the fingerprints into k shorter bitstrings and utilising these to compute bounds on the similarity of the complete bitstrings. The Multibit tree uses hierarchical clustering and similarity within each cluster to compute similar bounds. We have implemented our method and tested it on a large real-world data set. Our experiments show that our method yields approximately a three-fold speed-up over previous methods. Conclusions: Using the novel kD grid and Multibit tree significantly reduces the time needed for searching databases of fingerprints. This will allow researchers to (1) perform more searches than previously possible and (2) easily search large databases.

    Chemical fingerprinting of wood sampled along a pith-to-bark gradient for individual comparison and provenance identification

    Background and Objectives: The origin of traded timber is one of the main questions in the enforcement of regulations to combat the illegal timber trade. Substantial efforts are still needed to develop techniques that can determine the exact geographical provenance of timber, and this is vital to counteract the destructive effects of illegal logging, ranging from economic loss to habitat destruction. The potential of chemical fingerprints from pith-to-bark growth rings for individual comparison and geographical provenance determination is explored. Materials and Methods: A wood sliver was sampled per growth ring from four stem disks from four individuals of Pericopsis elata (Democratic Republic of the Congo) and from 14 stem disks from 14 individuals of Terminalia superba (Cote d'Ivoire and Democratic Republic of the Congo). Chemical fingerprints were obtained by analyzing these wood slivers with Direct Analysis in Real Time Time-Of-Flight Mass Spectrometry (DART TOFMS). Results: Individual distinction for both species was achieved, but the accuracy was dependent on the dataset size and number of individuals included. As this is still experimental, we can only speak of individual comparison and not individual distinction at this point. The prediction accuracy for the country of origin increases with increasing sample number, and a random sample can be placed in the correct country. When a complete disk is removed from the training dataset, its rings (samples) are correctly attributed to the country with an accuracy ranging from 43% to 100%. Relative abundances of ions appear to contribute more to differentiation than frequency differences do. Conclusions: DART TOFMS shows potential for geographical provenancing but is still experimental for individual distinction; more research is needed to make this an established method. Sampling campaigns should focus on sampling tree cores from pith-to-bark, paving the way towards a chemical fingerprint database for species provenance

    Inductive queries for a drug designing robot scientist

    It is increasingly clear that machine learning algorithms need to be integrated in an iterative scientific discovery loop, in which data is queried repeatedly by means of inductive queries and where the computer provides guidance to the experiments that are being performed. In this chapter, we summarise several key challenges in achieving this integration of machine learning and data mining algorithms in methods for the discovery of Quantitative Structure Activity Relationships (QSARs). We introduce the concept of a robot scientist, in which all steps of the discovery process are automated; we discuss the representation of molecular data such that knowledge discovery tools can analyse it, and we discuss the adaptation of machine learning and data mining algorithms to guide QSAR experiments

    Software for supporting large scale data processing for High Throughput Screening

    High Throughput Screening is a valuable data generation technique for data-driven knowledge discovery. Because the rate of data generation is so great, it is a challenge to cope with the demands of post-experiment data analysis. This thesis presents three software solutions that I implemented in an attempt to alleviate this problem. The first is K-Screen, a Laboratory Information Management System designed to handle and visualize large High Throughput Screening datasets. K-Screen is being successfully used by the University of Kansas High Throughput Screening Laboratory to better organize and visualize their data. The next two algorithms are designed to accelerate chemical similarity searches using 1-dimensional fingerprints. The first algorithm balances information content in bit strings in an attempt to find better ordering and segmentation patterns for chemical fingerprints. The second algorithm eliminates redundant pruning calculations for large batch chemical similarity searches and shows a 250% improvement over the fastest current fingerprint search algorithm for large batch queries

    Application of Infrared and Raman Spectroscopy for the Identification of Disease Resistant Trees

    New approaches for identifying disease resistant trees are needed as the incidence of diseases caused by non-native and invasive pathogens increases. These approaches must be rapid, reliable, cost-effective, and should have the potential to be adapted for high-throughput screening or phenotyping. Within the context of trees and tree diseases, we summarize vibrational spectroscopic and chemometric methods that have been used to distinguish between groups of trees which vary in disease susceptibility or other important characteristics based on chemical fingerprint data. We also provide specific examples from the literature of where these approaches have been used successfully. Finally, we discuss future application of these approaches for wide-scale screening and phenotyping efforts aimed at identifying disease resistant trees and managing forest diseases

    LightGBM: An Effective and Scalable Algorithm for Prediction of Chemical Toxicity – Application to the Tox21 and Mutagenicity Datasets

    Machine learning algorithms have attained widespread use in assessing the potential toxicities of pharmaceuticals and industrial chemicals because of their higher speed and lower cost compared to experimental bioassays. Gradient boosting is an effective algorithm that often achieves high predictivity, but historically its relatively long computational time limited its application in predicting large compound libraries or developing in silico predictive models that require frequent retraining. LightGBM, a recent improvement of the gradient boosting algorithm, inherited its high predictivity but resolved its scalability and long computational time by adopting a leaf-wise tree growth strategy and introducing novel techniques. In this study, we compared the predictive performance and the computational time of LightGBM to deep neural networks, random forests, support vector machines, and XGBoost. All algorithms were rigorously evaluated on publicly available Tox21 and mutagenicity datasets using a Bayesian optimization integrated nested 10-fold cross-validation scheme that performs hyperparameter optimization while examining model generalizability and transferability to new data. The evaluation results demonstrated that LightGBM is an effective and highly scalable algorithm offering the best predictive performance while consuming significantly shorter computational time than the other investigated algorithms across all Tox21 and mutagenicity datasets. We recommend LightGBM for applications in in silico safety assessment and also in other areas of cheminformatics to fulfill the ever-growing demand for accurate and rapid prediction of various toxicity or activity related endpoints of large compound libraries present in the pharmaceutical and chemical industry

    Scalable Similarity Search for Molecular Descriptors

    Similarity search over chemical compound databases is a fundamental task in the discovery and design of novel drug-like molecules. Such databases often encode molecules as non-negative integer vectors, called molecular descriptors, which represent rich information on various molecular properties. While there exist efficient indexing structures for searching databases of binary vectors, solutions for more general integer vectors are in their infancy. In this paper we present a time- and space-efficient index for the problem that we call the succinct intervals-splitting tree algorithm for molecular descriptors (SITAd). Our approach extends efficient methods for binary-vector databases, and uses ideas from succinct data structures. Our experiments, on a large database of over 40 million compounds, show SITAd significantly outperforms alternative approaches in practice. Comment: To appear in the Proceedings of SISAP'1

    Substrate-specific clades of active marine methylotrophs associated with a phytoplankton bloom in a temperate coastal environment

    Marine microorganisms that consume one-carbon (C1) compounds are poorly described, despite their impact on global climate via an influence on aquatic and atmospheric chemistry. This study investigated marine bacterial communities involved in the metabolism of C1 compounds. These communities were of relevance to surface seawater and atmospheric chemistry in the context of a bloom that was dominated by phytoplankton known to produce dimethylsulfoniopropionate. In addition to using 16S rRNA gene fingerprinting and clone libraries to characterize samples taken from a bloom transect in July 2006, seawater samples from the phytoplankton bloom were incubated with 13C-labeled methanol, monomethylamine, dimethylamine, methyl bromide, and dimethyl sulfide to identify microbial populations involved in the turnover of C1 compounds, using DNA stable isotope probing. The [13C]DNA samples from a single time point were characterized and compared using denaturing gradient gel electrophoresis (DGGE), fingerprint cluster analysis, and 16S rRNA gene clone library analysis. Bacterial community DGGE fingerprints from 13C-labeled DNA were distinct from those obtained with DNA of the nonlabeled community and suggested some overlap in substrate utilization between active methylotroph populations growing on different C1 substrates. Active methylotrophs were affiliated with Methylophaga spp. and several clades of undescribed Gammaproteobacteria that utilized methanol, methylamines (both monomethylamine and dimethylamine), and dimethyl sulfide. rRNA gene sequences corresponding to populations assimilating 13C-labeled methyl bromide and other substrates were associated with members of the Alphaproteobacteria (e.g., the family Rhodobacteraceae), the Cytophaga-Flexibacter-Bacteroides group, and unknown taxa. This study expands the known diversity of marine methylotrophs in surface seawater and provides a comprehensive data set for focused cultivation and metagenomic analyses in the future