MONA 2: A Light Cheminformatics Platform for Interactive Compound Library Processing

Author: Matthias Hilbig (1502212)
Matthias Rarey (1425577)

Because of the availability of large compound collections on the Web, elementary cheminformatics tasks such as chemical library browsing, analyzing, filtering, or unifying have become widespread in the life science community. Furthermore, the high performance of desktop hardware allows an interactive, problem-driven approach to these tasks, avoiding rigid processing scripts and workflows. Here, we present MONA 2, which is the second major release of our cheminformatics desktop application addressing this need. Using MONA requires neither complex database setups nor expert knowledge of cheminformatics. A new molecular set concept purely based on structural entities rather than individual compounds has allowed the development of an intuitive user interface. Based on a chemically precise, high-performance software library, typical tasks on chemical libraries with up to one million compounds can be performed mostly interactively. This paper describes the functionality of MONA, its fundamental concepts, and a collection of application scenarios ranging from file conversion, compound library curation, and management to the post-processing of large-scale experiments

FigShare

ASCONA: Rapid Detection and Alignment of Protein Binding Site Conformations

Author: Matthias Rarey (1425577)
Stefan Bietz (1425619)

The use of conformational ensembles is a widespread technique for considering protein flexibility in computational biology. When experimental structures are applied for this purpose, alignment techniques are usually required to deal with structural deviations and annotation inconsistencies. Moreover, many application scenarios focus on protein ligand binding sites. Here, we introduce our new alignment algorithm ASCONA, which has been specially geared to the problem of aligning multiple conformations of sequentially similar binding sites. Intense efforts have been directed to an accurate detection of highly flexible backbone deviations, multiple binding site matches within a single structure, and a reliable, but at the same time highly efficient, search algorithm. In contrast, most available alignment methods target other issues, e.g., the global alignment of distantly related proteins that share structurally conserved regions. For conformational ensembles, this might not only result in an overhead of computation time but could also affect the achieved accuracy, especially for more complicated cases such as highly flexible proteins. ASCONA was evaluated on a test set containing 1107 structures of 65 diverse proteins. In all cases, ASCONA was able to correctly align the binding site at an average alignment computation time of 4 ms per target. Furthermore, no false positive matches were observed when searching the same query sites in the structures of other proteins. ASCONA proved to cope with highly deviating backbone structures and to tolerate structural gaps and moderate mutation rates. ASCONA is available free of charge for academic use at http://www.zbh.uni-hamburg.de/ascona

FigShare

FSees: Customized Enumeration of Chemical Subspaces with Limited Main Memory Consumption

Author: Florian Lauck (3113445)
Matthias Rarey (1425577)

In the search for new marketable drugs, new ideas are required constantly. Particularly with regard to challenging targets and previously patented chemical space, designing novel molecules is crucial. This demands efficient and innovative computational tools to generate libraries of promising molecules. Here we present an efficient method to generate such libraries by systematically enumerating all molecules in a specific chemical space. This space is defined by a fragment space and a set of user-defined physicochemical properties (e.g., molecular weight, tPSA, number of H-bond donors and acceptors, or predicted logP). In order to enumerate a very large number of molecules, our algorithm uses file-based data structures instead of memory-based ones, thus overcoming the limitations of computer main memory. The resulting chemical library can be used as a starting point for computational lead-finding technologies, like similarity searching, pharmacophore mapping, docking, or virtual screening. We applied the algorithm in different scenarios, thus creating numerous target-specific libraries. Furthermore, we generated a fragment space from all approved drugs in DrugBank and enumerated it with lead-like constraints, thus generating 0.5 billion molecules in the molecular weight range 250–350
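
The idea of keeping the enumeration frontier on disk rather than in main memory can be sketched as follows. This is only an illustration under stated assumptions, not the FSees algorithm: a "molecule" is reduced to a multiset of fragment names, molecular weight is treated as a simple sum of non-negative fragment weights, and the function name `enumerate_space` and its parameters are hypothetical.

```python
import json
import os
import tempfile


def enumerate_space(fragments, min_weight, max_weight, max_size, out_path):
    """Write every fragment multiset within the weight window to out_path.

    fragments maps a fragment name to a non-negative weight. Only one
    partial product is held in memory at a time; each level's frontier
    lives in a temporary file (one JSON record per line).
    """
    names = sorted(fragments)
    count = 0
    fd, frontier_path = tempfile.mkstemp(suffix=".jsonl")
    with os.fdopen(fd, "w") as f:
        f.write(json.dumps([[], 0.0]) + "\n")  # empty starting product
    with open(out_path, "w") as out:
        for _ in range(max_size):
            fd, next_path = tempfile.mkstemp(suffix=".jsonl")
            with open(frontier_path) as cur, os.fdopen(fd, "w") as nxt:
                for line in cur:
                    combo, weight = json.loads(line)
                    # Extend with non-decreasing indices to avoid duplicates.
                    start = combo[-1] if combo else 0
                    for i in range(start, len(names)):
                        w = weight + fragments[names[i]]
                        if w > max_weight:
                            continue  # weights only grow, so prune here
                        nxt.write(json.dumps([combo + [i], w]) + "\n")
                        if w >= min_weight:
                            out.write(" ".join(names[j] for j in combo + [i]) + "\n")
                            count += 1
            os.remove(frontier_path)
            frontier_path = next_path
    os.remove(frontier_path)
    return count
```

Because weights only increase as fragments are added, a partial product that already exceeds the upper bound can be discarded without losing any valid result, which is what keeps the on-disk frontier small in this toy setting.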

FigShare

SIENA: Efficient Compilation of Selective Protein Binding Site Ensembles

Author: Matthias Rarey (1425577)
Stefan Bietz (1425619)

Structural flexibility of proteins has an important influence on molecular recognition and enzymatic function. In modeling, structure ensembles are therefore often applied as a valuable source of alternative protein conformations. However, their usage is often complicated by structural artifacts and inconsistent data annotation. Here, we present SIENA, a new computational approach for the automated assembly and preprocessing of protein binding site ensembles. Starting with an arbitrarily defined binding site in a single protein structure, SIENA searches for alternative conformations of the same or sequentially closely related binding sites. The method is based on an indexed database for identifying perfect k-mer matches and a recently published algorithm for the alignment of protein binding site conformations. Furthermore, SIENA provides a new algorithm for the interaction-based selection of binding site conformations which aims at covering all known ligand-binding geometries. Various experiments highlight that SIENA is able to generate comprehensive and well selected binding site ensembles improving the compatibility to both known and unconsidered ligand molecules. Starting with the whole PDB as data source, the computation time of the whole ensemble generation takes only a few seconds. SIENA is available via a Web service at www.zbh.uni-hamburg.de/siena
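
A minimal sketch of an index for perfect k-mer matches, to illustrate the kind of lookup mentioned above. The layout of SIENA's actual database is not described in this abstract; the functions `build_kmer_index` and `find_perfect_matches` and the toy sequences are assumptions made for illustration.

```python
from collections import defaultdict


def build_kmer_index(sequences, k):
    """Map each k-mer to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_perfect_matches(index, query, k):
    """Return, per sequence id, the (query position, target position) hits."""
    hits = defaultdict(list)
    for qpos in range(len(query) - k + 1):
        for seq_id, pos in index.get(query[qpos:qpos + k], ()):
            hits[seq_id].append((qpos, pos))
    return dict(hits)


# Toy data: sequences "A" and "B" share the query residues, "C" does not.
sequences = {"A": "MKTAYIAKQR", "B": "GGTAYIAGG", "C": "MKKLLPT"}
index = build_kmer_index(sequences, k=4)
hits = find_perfect_matches(index, "TAYIA", k=4)
```

Building the index once makes each query cost proportional to the query length plus the number of hits, rather than to the size of the whole sequence collection.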

FigShare

Benchmark Data Sets for Structure-Based Computational Target Prediction

Author: Karen T. Schomburg (1760134)
Matthias Rarey (1425577)

Structure-based computational target prediction methods identify potential targets for a bioactive compound. Methods based on protein–ligand docking still face many challenges, the greatest of which is probably the ranking of true targets in a large data set of protein structures. Currently, no standard data sets for evaluation exist, rendering comparison and demonstration of improvements of methods cumbersome. Therefore, we propose two data sets and evaluation strategies for a meaningful evaluation of new target prediction methods, i.e., a small data set consisting of three target classes for detailed proof-of-concept and selectivity studies and a large data set consisting of 7992 protein structures and 72 drug-like ligands allowing statistical evaluation with performance metrics on a drug-like chemical space. Both data sets are built from openly available resources, and any information needed to perform the described experiments is reported. We describe the composition of the data sets, the setup of screening experiments, and the evaluation strategy. Performance metrics capable of measuring the early recognition of enrichment, such as AUC, BEDROC, and NSLR, are proposed. We apply a sequence-based target prediction method to the large data set to analyze its content of nontrivial evaluation cases. The proposed data sets are used for method evaluation of our new inverse screening method iRAISE. The small data set reveals the method’s capability and limitations to selectively distinguish between rather similar protein structures. The large data set simulates real target identification scenarios. iRAISE achieves excellent or good enrichment in 55% of cases, a median AUC of 0.67, and RMSDs below 2.0 Å for 74%, and it was able to predict the first true target in the top 2% of the protein data set of about 8000 structures in 59 out of 72 cases
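
As an illustration of one of the metrics named above, the ROC AUC of a ranked target list can be computed as the fraction of (true target, non-target) pairs in which the true target is ranked higher. The sketch below assumes a ranking without ties and uses a hypothetical function name; it does not reproduce the evaluation code of the paper.

```python
def roc_auc(ranked_labels):
    """ROC AUC for a ranking given best-first; True marks a true target.

    Assumes no tied scores. Raises ValueError if only one class is present.
    """
    positives = sum(1 for label in ranked_labels if label)
    negatives = len(ranked_labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one true target and one non-target")
    negatives_seen = 0
    better_pairs = 0
    for is_target in ranked_labels:
        if is_target:
            # Every non-target still below this rank forms a correct pair.
            better_pairs += negatives - negatives_seen
        else:
            negatives_seen += 1
    return better_pairs / (positives * negatives)
```

A value of 1.0 means every true target is ranked above every non-target, and 0.5 corresponds to a random ranking.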

FigShare

Enhanced Calculation of Property Distributions in Chemical Fragment Spaces

Author: Justin Lübbers (18137484)
Matthias Rarey (1425577)
Uta Lessel (2323663)
Publication date: 11/03/2024

Chemical fragment spaces exceed traditional virtual compound libraries by orders of magnitude, making them ideal search spaces for drug design projects. However, due to their immense size, they are not compatible with traditional analysis and search algorithms that rely on the enumeration of molecules. In this paper, we present SpaceProp2, an evolution of the SpaceProp algorithm, which enables the calculation of exact property distributions for chemical fragment spaces without enumerating them. We extend the original algorithm by the capabilities to compute distributions for the TPSA, the number of rotatable bonds, and the occurrence of user-defined molecular structures in the form of SMARTS patterns. Furthermore, SpaceProp2 produces example molecules for every property bin, enabling a detailed interpretation of the distributions. We demonstrate SpaceProp2 on six established make-on-demand chemical fragment spaces as well as BICLAIM, the in-house fragment space of Boehringer Ingelheim. The possibility to search multiple SMARTS patterns simultaneously as well as the produced example molecules offers previously impossible insights into the composition of these vast combinatorial molecule collections, making it an ideal tool for the analysis and design of chemical fragment spaces
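
To illustrate how a property distribution can be obtained without enumerating products, consider a strongly simplified setting: every product takes exactly one fragment from each pool, and the property is additive (for example, heavy atom count). The distribution over all products is then the convolution of the per-pool histograms. Real fragment spaces carry link compatibility rules, and the actual SpaceProp2 algorithm is not shown in this abstract; the function names below are hypothetical.

```python
from collections import Counter


def convolve(hist_a, hist_b):
    """Combine two histograms of an additive property."""
    out = Counter()
    for value_a, count_a in hist_a.items():
        for value_b, count_b in hist_b.items():
            out[value_a + value_b] += count_a * count_b
    return out


def product_distribution(pools):
    """Histogram of the summed property over all one-per-pool products."""
    dist = Counter({0: 1})  # the empty product
    for pool in pools:
        dist = convolve(dist, Counter(pool))
    return dist


# Toy pools holding the property value of each fragment.
pools = [[5, 6, 6], [3, 4], [7, 7, 8]]
dist = product_distribution(pools)
```

The work scales with the number of distinct property values per pool instead of the number of products, which is what makes the approach attractive when the product count is astronomically large.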

FigShare

The Valence State Combination Model: A Generic Framework for Handling Tautomers and Protonation States

Author: Adrian Kolodzik (1830427)
Matthias Rarey (1425577)
Sascha Urbaczek (1788721)

The consistent handling of molecules is probably the most basic and important requirement in the field of cheminformatics. Reliable results can only be obtained if the underlying calculations are independent of the specific way molecules are represented in the input data. However, ensuring consistency is a complex task with many pitfalls, an important one being the fact that the same molecule can be represented by different valence bond structures. In order to achieve reliability, a cheminformatics system needs to solve two fundamental problems. First, different choices of valence bond structures must be identified as the same molecule. Second, for each molecule all valence bond structures relevant to the context must be taken into consideration. The latter is especially important with regard to tautomers and protonation states, as these have considerable influence on physicochemical properties of molecules. We present a comprehensive method for the rapid and consistent generation of reasonable tautomers and protonation states for molecules relevant in the context of drug design. This method is based on a generic scheme, the Valence State Combination Model, which has been designed for the enumeration and scoring of valence bond structures in large data sets. In order to ensure our method’s consistency, we have developed procedures which can serve as a general validation scheme for similar approaches. The analysis of both the average number of generated structures and the associated runtimes shows that our method is perfectly suited for typical cheminformatics applications. By comparison with frequently used and curated public data sets, we can demonstrate that the tautomers and protonation states produced by our method are chemically reasonable

FigShare

RingDecomposerLib: An Open-Source Implementation of Unique Ring Families and Other Cycle Bases

Author: Florian Flachsenberg (3713857)
Matthias Rarey (1425577)
Niek Andresen (3713854)

Many cheminformatics applications like aromaticity detection, SMARTS matching, or the calculation of atomic coordinates require a chemically meaningful perception of the molecular ring topology. The unique ring families (URFs) were recently introduced as a unique, polynomial, and chemically meaningful description of the ring topology. Here we present the first open-source implementation of the URF concept for ring perception. The C library RingDecomposerLib is easy to use, portable, well-documented, and thoroughly tested. Aside from the URFs, other related ring topology descriptions like the relevant cycles (RCs), relevant cycle prototypes (RCPs), and a smallest set of smallest rings (SSSR) can be calculated. We demonstrate the runtime efficiency of the RingDecomposerLib with computing time benchmarks for the complete PubChem Compound Database and thereby show the applicability in large-scale and interactive applications

FigShare

Unique Ring Families: A Chemically Meaningful Description of Molecular Ring Topologies

Author: Adrian Kolodzik (1830427)
Matthias Rarey (1425577)
Sascha Urbaczek (1788721)

The perception of a set of rings forms the basis for a number of chemoinformatics applications, e.g., the systematic naming of compounds, the calculation of molecular descriptors, the matching of SMARTS expressions, and the generation of atomic coordinates. We introduce the concept of unique ring families (URFs) as an extension of the concept of relevant cycles (RCs). URFs are consistent for different atom orders and represent an intuitive description of the rings of a molecular graph. Furthermore, in contrast to RCs, URFs are polynomial in number. We provide an algorithm to efficiently calculate URFs in polynomial time and demonstrate their suitability for real-time applications by providing computing time benchmarks for the PubChem Database. For the first time, URFs combine three important properties of chemical ring descriptions: being unique, chemically meaningful, and efficient to compute. Therefore, URFs are a valuable alternative to the commonly used concept of the smallest set of smallest rings (SSSR) and would be suited to become the standard measure for ring topologies of small molecules
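
One way to see why an SSSR is not a unique description: the number of rings in any SSSR is fixed by the cyclomatic number of the molecular graph (bonds minus atoms plus connected components), while the choice of which rings to include may be ambiguous. The sketch below computes only this number with a union-find structure; it is not the URF algorithm, and the function name and atom numbering are illustrative assumptions.

```python
def cyclomatic_number(num_atoms, bonds):
    """Return bonds - atoms + components for an undirected molecular graph."""
    parent = list(range(num_atoms))

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = num_atoms
    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            components -= 1
    return len(bonds) - num_atoms + components


# Naphthalene skeleton: a 10-atom perimeter plus one fusion bond.
naphthalene = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5)]

# Cubane skeleton: bottom square, top square, and four vertical bonds.
cubane = (
    [(0, 1), (1, 2), (2, 3), (3, 0)]
    + [(4, 5), (5, 6), (6, 7), (7, 4)]
    + [(0, 4), (1, 5), (2, 6), (3, 7)]
)
```

For the cubane skeleton the cyclomatic number is 5, although the cage has six equivalent four-membered faces, so any SSSR has to leave out one face arbitrarily.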

FigShare

Searching for Recursively Defined Generic Chemical Patterns in Nonenumerated Fragment Spaces

Author: Angela M. Henzler (1933384)
Hans-Christian Ehrlich (1933387)
Matthias Rarey (1425577)

Retrieving molecules with specific structural features is a fundamental requirement of today’s molecular database technologies. Estimates put the chemical space relevant for drug discovery at around 10^60 molecules. This figure is many orders of magnitude larger than the number of molecules conventional databases hold today or will store in the future. An elegant description of such a large chemical space is provided by the concept of fragment spaces. A fragment space comprises fragments, which are molecules with open valences, and rules describing how to connect these fragments into products. Due to the combinatorial nature of fragment spaces, a complete enumeration of their products is intractable. We present an algorithm to search fragment spaces for generic chemical patterns as present in the SMARTS chemical pattern language. Our method allows specification of the chemical surrounding of an atom in a query and, therefore, enables a chemically intuitive search. During the search, the costly enumeration of products is avoided. The result is a fragment space that exactly describes all possible molecules that contain the user-defined pattern. We evaluated the algorithm in three different drug development use cases and performed a large-scale statistical analysis with 738 SMARTS patterns on three publicly available fragment spaces. Our results show the ability of the algorithm to explore the chemical space around known active molecules, to analyze fragment spaces for the presence of likely toxic molecules, and to identify complex macromolecular structures under additional structural constraints. By searching the fragment space in its nonenumerated form, spaces covering up to 10^19 molecules can be examined in times ranging between 47 s and 19 min depending on the complexity of the query pattern

FigShare