1,106 research outputs found

    Scalable Similarity Search for Molecular Descriptors

    Similarity search over chemical compound databases is a fundamental task in the discovery and design of novel drug-like molecules. Such databases often encode molecules as non-negative integer vectors, called molecular descriptors, which represent rich information on various molecular properties. While efficient indexing structures exist for searching databases of binary vectors, solutions for more general integer vectors are in their infancy. In this paper we present a time- and space-efficient index for this problem, which we call the succinct intervals-splitting tree algorithm for molecular descriptors (SITAd). Our approach extends efficient methods for binary-vector databases and uses ideas from succinct data structures. Our experiments, on a large database of over 40 million compounds, show that SITAd significantly outperforms alternative approaches in practice. Comment: To appear in the Proceedings of SISAP'1
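
    To make the search problem concrete, here is a minimal sketch of a brute-force baseline over non-negative integer descriptor vectors. The abstract does not name the similarity measure, so the generalized Jaccard (min/max) similarity used here is an assumption, and the function names are illustrative. This is the linear scan that an index such as SITAd aims to beat, not the SITAd algorithm itself.

```python
from typing import List, Sequence, Tuple


def generalized_jaccard(x: Sequence[int], y: Sequence[int]) -> float:
    """Sum of element-wise minima over sum of element-wise maxima.

    Assumed similarity measure for non-negative integer vectors;
    it reduces to the ordinary Jaccard/Tanimoto coefficient on 0/1 vectors.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    # Two all-zero vectors are treated as identical.
    return 1.0 if denominator == 0 else numerator / denominator


def linear_scan(
    database: List[Sequence[int]], query: Sequence[int], threshold: float
) -> List[Tuple[int, float]]:
    """Return (index, similarity) for every vector at or above the threshold."""
    hits = []
    for index, vector in enumerate(database):
        similarity = generalized_jaccard(vector, query)
        if similarity >= threshold:
            hits.append((index, similarity))
    return hits
```

    The scan costs time proportional to the database size per query, which is why an index becomes attractive at the scale of tens of millions of compounds.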

    Prediction of Hydrate and Solvate Formation Using Statistical Models

    Novel knowledge-based models for the prediction of hydrate and solvate formation are introduced, which require only the molecular formula as input. A data set of more than 19 000 organic, nonionic, and nonpolymeric molecules was extracted from the Cambridge Structural Database. Molecules that formed solvates were compared with those that did not, using molecular descriptors and statistical methods, which allowed the identification of chemical properties that contribute to solvate formation. The study was conducted for five types of solvates: ethanol, methanol, dichloromethane, chloroform, and water solvates. The identified properties were all related to the size and branching of the molecules and to their hydrogen-bonding ability. The corresponding molecular descriptors were used to fit logistic regression models that predict the probability that a given molecule forms a solvate. The established models predicted the behavior of ∼80% of the data correctly using only two descriptors in the predictive model.
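
    As an illustration of what a two-descriptor logistic model computes, the sketch below evaluates the standard logistic function on a linear combination of two descriptor values. The coefficients are parameters supplied by the caller; no fitted values from the study are given here, and the function and parameter names are hypothetical.

```python
import math


def solvate_probability(x1: float, x2: float, b0: float, b1: float, b2: float) -> float:
    """Logistic model p = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2))).

    x1, x2 are two molecular descriptor values; b0, b1, b2 are
    hypothetical coefficients that would come from a fitted model.
    """
    z = b0 + b1 * x1 + b2 * x2
    # Numerically stable evaluation for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

    A molecule would be classified as a likely solvate former when the returned probability exceeds a chosen cutoff, commonly 0.5.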

    Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: focusing on applicability domain and overfitting by variable selection

    The estimation of the accuracy of predictions is a critical problem in QSAR modeling. The "distance to model" can be defined as a metric that quantifies the similarity between the training set molecules and the test set compound for the given property in the context of a specific model. It can be expressed in many different ways, e.g., using the Tanimoto coefficient, leverage, correlation in the space of models, etc. In this paper we have used mixtures of Gaussian distributions as well as statistical tests to evaluate six types of distances to models with respect to their ability to discriminate compounds with small and large prediction errors. The analysis was performed for twelve QSAR models of aqueous toxicity against T. pyriformis obtained with different machine-learning methods and various types of descriptors. The distances to model based on the standard deviation of predicted toxicity calculated from the ensemble of models afforded the best results. This distance also successfully discriminated molecules with small and large prediction errors for a mechanism-based model developed using log P and the Maximum Acceptor Superdelocalizability descriptors. Thus, the distance to model metric could also be used to augment mechanistic QSAR models by estimating their prediction errors. Moreover, the accuracy of prediction is mainly determined by the training set data distribution in the chemistry and activity spaces and not by the QSAR approaches used to develop the models. We have shown that incorrect validation of a model may result in a wrong estimate of its performance and suggested how this problem could be circumvented. The toxicity of 3182 and 48774 molecules from the EPA High Production Volume (HPV) Challenge Program and EINECS (European chemical Substances Information System), respectively, was predicted, and the accuracy of prediction was estimated. The developed models are available online at the http://www.qspr.org site.
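
    The ensemble-based distance to model mentioned above can be sketched as the standard deviation of the predictions that an ensemble of models gives for one compound. The abstract does not specify the estimator, so the use of the sample standard deviation and the helper names below are assumptions.

```python
from statistics import stdev
from typing import Dict, List, Sequence


def ensemble_std(predictions: Sequence[float]) -> float:
    """Sample standard deviation of one compound's predictions across models."""
    if len(predictions) < 2:
        raise ValueError("need predictions from at least two models")
    return stdev(predictions)


def rank_by_distance(per_compound: Dict[str, Sequence[float]]) -> List[str]:
    """Order compounds from smallest to largest ensemble disagreement."""
    return sorted(per_compound, key=lambda name: ensemble_std(per_compound[name]))
```

    Compounds at the end of the ranking are those on which the models disagree most, which, following the abstract, are the ones expected to have larger prediction errors.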

    Corrected overlap weight and clustering coefficient

    We discuss two well-known network measures: the overlap weight of an edge and the clustering coefficient of a node. It turns out that neither is very useful for the data-analytic task of identifying important elements (nodes or links) of a given network. The reason is that they attain their largest values on maximal subgraphs of relatively small size, which are more likely to appear in a network than larger ones. We show how the definitions of these measures can be corrected so that they give the expected results. We illustrate the proposed corrected measures by applying them to the US Airports network using the program Pajek. Comment: The paper is a detailed and extended version of the talk presented at the CMStatistics (ERCIM) 2015 Conference
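
    For reference, the two uncorrected measures can be sketched as follows on an undirected simple graph stored as adjacency sets. The abstract does not spell out the definitions, so this assumes the overlap weight of an edge is the number of common neighbours of its endpoints (the triangles through the edge) and uses the standard local clustering coefficient. The corrected versions proposed in the paper are not reproduced here.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # node -> set of neighbours


def overlap_weight(graph: Graph, u: Hashable, v: Hashable) -> int:
    """Number of common neighbours of u and v (triangles through edge u-v)."""
    return len(graph[u] & graph[v])


def clustering_coefficient(graph: Graph, v: Hashable) -> float:
    """Fraction of pairs of v's neighbours that are themselves linked."""
    neighbours = graph[v]
    degree = len(neighbours)
    if degree < 2:
        return 0.0
    # Each link among the neighbours is counted from both endpoints.
    links = sum(len(graph[n] & neighbours) for n in neighbours) // 2
    return 2 * links / (degree * (degree - 1))
```

    A node of degree 2 inside a single triangle already reaches the maximum coefficient of 1.0, which illustrates the abstract's point that the largest values occur on small subgraphs.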

    GTI-space : the space of generalized topological indices

    A new extension of the generalized topological indices (GTI) approach is carried out to represent 'simple' and 'composite' topological indices (TIs) in a unified way. This approach defines a GTI-space in which both simple and composite TIs represent particular subspaces. Accordingly, simple TIs such as the Wiener, Balaban, Zagreb, Harary and Randić connectivity indices are expressed by means of the same GTI representation introduced for composite TIs such as the hyper-Wiener index, molecular topological index (MTI), Gutman index and reverse MTI. Using the GTI-space approach we easily identify mathematical relations between some composite and simple indices, such as the relationship between the hyper-Wiener and Wiener indices and the relation between the MTI and the first Zagreb index. The relation of the GTI-space to the sub-structural cluster expansion of property/activity is also analysed, and some routes for applying this approach to QSPR/QSAR are given.
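
    Two of the simple indices named above have short textbook definitions, sketched here for a connected, unweighted molecular graph stored as adjacency sets: the Wiener index (sum of shortest-path distances over all unordered vertex pairs) and the first Zagreb index (sum of squared vertex degrees). This shows the individual indices only, not the GTI representation from the paper, and the function names are illustrative.

```python
from collections import deque
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # vertex -> set of neighbours


def bfs_distances(graph: Graph, source: Hashable) -> Dict[Hashable, int]:
    """Shortest-path lengths (in edges) from source to every reachable vertex."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in distance:
                distance[w] = distance[u] + 1
                queue.append(w)
    return distance


def wiener_index(graph: Graph) -> int:
    """Sum of distances over unordered vertex pairs (graph assumed connected)."""
    total = sum(sum(bfs_distances(graph, v).values()) for v in graph)
    return total // 2  # every pair was counted from both ends


def first_zagreb_index(graph: Graph) -> int:
    """Sum of squared vertex degrees."""
    return sum(len(neighbours) ** 2 for neighbours in graph.values())
```

    For the three-vertex path (the carbon skeleton of propane), the pairwise distances are 1, 1 and 2, and the degrees are 1, 2 and 1.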

    Predicting toxicity through computers: a changing world

    The computational approaches used to predict toxicity are evolving rapidly, a process hastened by the emergence of new ways of describing chemical information. Although this trend offers many opportunities, new regulations, such as the European Community's 'Registration, Evaluation, Authorisation and Restriction of Chemicals' (REACH), demand that models be ever more robust.

    Characterization of two heparan sulphate-binding sites in the mycobacterial adhesin Hlp

    Background: The histone-like Hlp protein is emerging as a key component in mycobacterial pathogenesis, being involved in the initial events of host colonization by interacting with laminin and glycosaminoglycans (GAGs). In the present study, nuclear magnetic resonance (NMR) was used to map the binding site(s) of Hlp to heparan sulfate and identify the nature of the amino acid residues directly involved in this interaction. Results: The capacity of a panel of 30-mer synthetic peptides covering the full length of Hlp to bind to heparin/heparan sulfate was analyzed by solid phase assays, NMR, and affinity chromatography. An additional active region between the residues Gly46 and Ala60 was defined at the N-terminal domain of Hlp, expanding the previously defined heparin-binding site between Thr31 and Phe50. Additionally, the C-terminus, rich in Lys residues, was confirmed as another heparan sulfate binding region. The amino acids in Hlp identified as mediators in the interaction with heparan sulfate were Arg, Val, Ile, Lys, Phe, and Thr. Conclusion: Our data indicate that Hlp interacts with heparan sulfate through two distinct regions of the protein. Both heparan sulfate-binding regions defined here are preserved in all mycobacterial Hlp homologues that have been sequenced, suggesting important but possibly divergent roles for this surface-exposed protein in both pathogenic and saprophytic species.