Search CORE

15 research outputs found

Scalable Similarity Search for Molecular Descriptors

Author: A Leach
AM Bender
B Chen
D Vida
J Chen
M Keiser
M Kotera
R Nasr
R Sawada
R Todeschini
TG Kristensen
Publication date: 09/08/2017

Similarity search over chemical compound databases is a fundamental task in the discovery and design of novel drug-like molecules. Such databases often encode molecules as non-negative integer vectors, called molecular descriptors, which represent rich information on various molecular properties. While efficient indexing structures exist for searching databases of binary vectors, solutions for more general integer vectors are in their infancy. In this paper we present a time- and space-efficient index for this problem, which we call the succinct intervals-splitting tree algorithm for molecular descriptors (SITAd). Our approach extends efficient methods for binary-vector databases and uses ideas from succinct data structures. Our experiments, on a large database of over 40 million compounds, show that SITAd significantly outperforms alternative approaches in practice. Comment: To appear in the Proceedings of SISAP'1

arXiv.org e-Print Archive

Crossref

A comparative study of SMILES-based compound similarity functions for drug-target interaction prediction

Author: A Islam
A Schuffenhauer
AA Toropov
AM Helguera
Arzucan Özgür
D Rognan
D Vidal
D Weininger
DC Kombo
DS Cao
DS Hirschberg
DS Wishart
Elif Ozkirimli
H Geppert
H Gohlke
H Hong
Hakime Öztürk
HP Luhn
I Schomburg
J Schwartz
K Bleakley
KS Jones
L Jacob
M Bilenko
M Campillos
M Fernandez
M Gonen
M Hattori
M Takarabe
MJ Keiser
P Willett
R Sawada
RA Wagner
S Gunther
S Zhu
T van Laarhoven
VI Levenshtein
Y Tabei
Y Yamanishi
Z He
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Global Analysis of Small Molecule Binding to Related Protein Targets

Author: A Bateman
A Halevy
A Hopkins
AJ Vilella
AL Hopkins
AM Davis
CF George
D Tanramluk
D Vidal
DE Gloriam
DL Nersesian
E Koonin
E van der Horst
Felix A. Kruger
Greg Tucker-Kellogg
I Shamovsky
J Cozzi
J Fox
J Overington
JD Wichard
John P. Overington
JP Overington
JS Surgand
JT Metz
JU Bowie
KL McGary
KR Taylor
L Ireland-Denny
L Zhi
LA Black
LJ Bellis
M Davies
M Ekroos
M Kimura
M Vieth
MA Fabian
ME Peterson
MM Hann
MP Gleeson
MS Stephens
MW Karaman
N Brooijmans
PR Caron
RD Finn
RL Tatusov
TA Esbenshade
TW Lovenberg
W Sujansky
WM Fitch
X Ligneau
Y Arens
Publication venue: Public Library of Science
Publication date: 01/01/2012

We report on the integration of pharmacological data and homology information for a large-scale analysis of small molecule binding to related targets. Differences in small molecule binding have been assessed for curated pairs of human to rat orthologs and also for recently diverged human paralogs. Our analysis shows that, in general, small molecule binding is conserved for pairs of human to rat orthologs. Using statistical tests, we identified a small number of cases where small molecule binding differs between human and rat, some of which had previously been reported in the literature. Knowledge of species-specific pharmacology can be advantageous for drug discovery, where rats are frequently used as a model system. For human paralogs, we demonstrate a global correlation between sequence identity and the binding of small molecules with equivalent affinity. Our findings provide an initial general model relating small molecule binding and sequence divergence, containing the foundations for a general model to anticipate and predict within-target-family selectivity.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

FigShare

Chemical Informatics Functionality in R

Author: Guha Rajarshi
Publication venue: 'Foundation for Open Access Statistic'
Publication date: 01/01/2007

The flexibility and scope of the R programming environment have made it a popular choice for statistical modeling and scientific prototyping in a number of fields. In the field of chemistry, R provides several tools for a variety of problems related to statistical modeling of chemical information. However, one aspect common to these tools is that they do not have direct access to the information that is available from chemical structures, such as that contained in molecular descriptors. We describe the rcdk package, which provides the R user with access to the CDK, a Java framework for cheminformatics. As a result, it is possible to read in a variety of molecular formats, calculate molecular descriptors and evaluate fingerprints. In addition, we describe the rpubchem package, which allows access to the data in PubChem, a public repository of molecular structures and associated assay data for approximately 8 million compounds. Currently, the package allows access to structural information as well as some simple molecular properties from PubChem. The package also allows access to bio-assay data from the PubChem FTP servers.

Directory of Open Access Journals

Journal of Statistical Software

Assessing and developing methods to explore the role of molecular shape in computer-aided drug design

Author: Zarnecka JM

Shape-based approaches have many potential areas for future development in application to in silico pharmacology. Further exploration of the role of molecular shape may lead to a better understanding of the substrate specificity of enzymes and the possibility of reducing toxic effects that may be caused by ligands binding to undesired target proteins. Methods exploiting molecular shape for activity and toxicity prediction might have a great influence on the drug discovery process. Different approaches might be used for this purpose, e.g. shape fingerprints and shape multipoles. Both methods describe the shape of molecules using numerical values, discarding any chemical information. Focusing only on shape can lead to the identification of novel core structures of molecules with improved properties. Molecular fingerprints are binary bit strings that encode the structure or shape of compounds; shape is measured indirectly by alignment to a database of standard molecular shapes (the reference shapes). The Shape Database should represent a wide range of possible molecular shapes to produce accurate results. Therefore, this was the main focus of the investigation. The shape multipoles method is a fast computational method that describes the shape of molecules using only numbers; it therefore has low storage requirements, and comparison is performed by simple mathematical operations. To describe the shape, it uses only 13 values (3 quadrupole components and 10 octupole components). The performance of both methods in grouping compounds based on shared biological activity was evaluated using several test sets, with slightly better results in the case of shape fingerprints. However, the shape multipole approach showed potential in finding differences in shape between enantiomers. Among the possible applications of the shape fingerprints method are solubility prediction (at a level comparable to well-established methods) and virtual screening.

LJMU Research Online (Liverpool John Moores University)

Additive SMILES-Based Carcinogenicity Models: Probabilistic Principles in the Search for Robust Predictions

Author: Alla Toropova
Andrey Toropov
Benfenati
Benigni
Contrera
Emilio Benfenati
Fatemi
Marino
Mazzatorta
Peruzzo
Toropov
Vidal
Weininger
Publication venue: Molecular Diversity Preservation International (MDPI)

Optimal descriptors calculated with the simplified molecular input line entry system (SMILES) have been utilized in modeling carcinogenicity as continuous values (logTD50). These descriptors can be calculated using correlation weights of SMILES attributes obtained by the Monte Carlo method. A considerable subset of these attributes are rare. The use of these rare attributes can lead to overtraining. One can avoid the influence of the rare attributes by fixing their correlation weights to zero. A function, limS, has been defined to identify rare attributes. limS defines the minimum number of occurrences in the set of structures of the training (subtraining) set required to accept an attribute as usable. If an attribute occurs fewer than limS times, it is considered "rare" and is not used. Two systems of building up models were examined: (1) the classic training-test system; (2) balance of correlations for the subtraining and calibration sets (together, they form the original training set; the calibration set acts as an imitation of a preliminary test set). Three random splits into subtraining, calibration, and test sets were analysed. Comparison of the abovementioned systems has shown that balance of correlations gives more robust prediction of carcinogenicity for all three splits (split 1: test-set r² = 0.7514, s = 0.684; split 2: test-set r² = 0.7998, s = 0.600; split 3: test-set r² = 0.7192, s = 0.728).
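The rare-attribute rule in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`rare_attributes`, `zero_rare_weights`, `descriptor`), the toy attribute lists, and the weights are all hypothetical. Two assumptions are made: an attribute is counted once per structure, and the descriptor is a plain sum of correlation weights (suggested by "additive" in the title). The Monte Carlo fitting of the weights is not shown.

```python
from collections import Counter

def rare_attributes(training_set, lim_s):
    """Return attributes seen in fewer than lim_s training structures.

    Assumption: each attribute is counted once per structure.
    """
    counts = Counter()
    for attrs in training_set:
        counts.update(set(attrs))
    return {a for a, n in counts.items() if n < lim_s}

def zero_rare_weights(weights, rare):
    """Fix the correlation weight of every rare attribute to zero."""
    return {a: (0.0 if a in rare else w) for a, w in weights.items()}

def descriptor(attrs, weights):
    """Assumed additive descriptor: sum of weights of a structure's attributes."""
    return sum(weights.get(a, 0.0) for a in attrs)

# Hypothetical SMILES attributes for three training structures.
training = [["C", "O", "="], ["C", "N"], ["C", "O", "Cl"]]
# Hypothetical correlation weights (in the paper these come from Monte Carlo optimization).
weights = {"C": 0.5, "O": 1.25, "=": 2.0, "N": -0.75, "Cl": 3.0}

rare = rare_attributes(training, lim_s=2)
cleaned = zero_rare_weights(weights, rare)
print(sorted(rare))
print(descriptor(["C", "O", "Cl"], cleaned))
```

With `lim_s=2`, only "C" and "O" appear in at least two structures, so the other three attributes stop contributing to the descriptor.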

Crossref

PubMed Central

The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching

Author: Alvarsson Jonathan
Berg Arvid
Carlsson Lars
Evelo Chris T.
Guha Rajarshi
Jeliazkova Nina
Kuhn Stefan
Mayfield John W.
Pluskal Tomas
Rojas-Cherto Miquel
Spjuth Ola
Steinbeck Christoph
Torrence Gilleain
Willighagen Egon L.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

Background: The Chemistry Development Kit (CDK) is a widely used open source cheminformatics toolkit, providing data structures to represent chemical concepts along with methods to manipulate such structures and perform computations on them. The library implements a wide variety of cheminformatics algorithms ranging from chemical structure canonicalization to molecular descriptor calculations and pharmacophore perception. It is used in drug discovery, metabolomics, and toxicology. Over the last 10 years, however, the code base has grown significantly, resulting in many complex interdependencies among components and poor performance of many algorithms. Results: We report improvements to the CDK v2.0 since the v1.2 release series, specifically addressing the increased functional complexity and poor performance. We first summarize the addition of new functionality, such as atom typing and molecular formula handling, and improvements to existing functionality that have led to significantly better performance for substructure searching, molecular fingerprints, and rendering of molecules. Second, we outline how the CDK has evolved with respect to quality control and the approaches we have adopted to ensure stability, including a code review mechanism. Conclusions: This paper highlights our continued efforts to provide a community-driven, open source cheminformatics library, and shows that such collaborative projects can thrive over extended periods of time, resulting in a high-quality and performant library. By taking advantage of community support and contributions, we show that an open source cheminformatics project can act as a peer-reviewed publishing platform for scientific computing software.

Maastricht University Research Portal

Publikationer från Uppsala Universitet

Directory of Open Access Journals

De Montfort University Open Research Archive

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Leicester Research Archive

Fingerprints, and Other Molecular Descriptions for Database Analysis and Searching

Author: Bajusz Dávid
Héberger Károly
Rácz Anita
Publication venue: 'Elsevier BV'
Publication date: 01/01/2017

Repository of the Academy's Library