The computer storage, retrieval and searching of generic structures in chemical patents: the machine-readable representation of generic structures.
The nature of the generic chemical structures found in patents is
described, with a discussion of the types of statement commonly
found in them. The available representations for such structures
are reviewed, with particular attention to their suitability for
searching files of such structures. Requirements for the
unambiguous representation of generic structures in an "ideal"
storage and retrieval system are discussed.
The basic principles of the theory of formal languages are
reviewed, with particular consideration given to parsing methods
for context-free languages. The grammar and parsing of computer
programming languages, as an example of artificial formal
languages, are discussed. Applications of formal language theory
to chemistry and information work are briefly reviewed.
GENSAL, a formal language for the unambiguous description of
generic structures from patents, is presented. It is designed to
be intelligible to a chemist or patent agent, yet sufficiently
formalised to be amenable to computer analysis. A detailed
description is given of the facilities it provides for generic
structure representation, and there is discussion of its
limitations and the principles behind its design.
A connection-table-based internal representation for generic
structures, called an ECTR (Extended Connection Table
Representation), is presented. It is designed to represent generic
structures unambiguously, and to be generated automatically from
structures encoded in GENSAL. It is compared to other proposed
representations, and its implementation using the data types of
the programming language Pascal is described.
An interpreter program which generates an ECTR from structures
encoded in a subset of the GENSAL language is presented. The
principles of its operation are described.
Possible applications of GENSAL outside the area of patent
documentation are discussed, and suggestions are made for further
work on the development of a generic structure storage and
retrieval system based on GENSAL and ECTRs.
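The ECTR described above builds on the ordinary connection table, which pairs an atom list with a bond list. A minimal sketch in Python of that basic structure follows; the field names are illustrative, not the thesis's actual Pascal record layout, and the real ECTR extends this with the alternatives and partitions needed for generic (Markush) structures:

```python
from dataclasses import dataclass, field

# Minimal connection table: an atom list plus a bond list.
# Field names are illustrative, not the ECTR's actual Pascal records.

@dataclass
class Atom:
    element: str      # element symbol, e.g. "C"
    charge: int = 0   # formal charge

@dataclass
class ConnectionTable:
    atoms: list = field(default_factory=list)
    bonds: list = field(default_factory=list)  # (atom_idx1, atom_idx2, order)

    def add_atom(self, element, charge=0):
        self.atoms.append(Atom(element, charge))
        return len(self.atoms) - 1  # index used to reference this atom in bonds

    def add_bond(self, i, j, order=1):
        self.bonds.append((i, j, order))

# Ethanol heavy-atom skeleton: C-C-O
ct = ConnectionTable()
c1 = ct.add_atom("C")
c2 = ct.add_atom("C")
o = ct.add_atom("O")
ct.add_bond(c1, c2)
ct.add_bond(c2, o)
print(len(ct.atoms), len(ct.bonds))  # 3 2
```

A generic structure cannot be captured by one such table alone, which is why the ECTR layers alternative substituent definitions on top of this core.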
Development and implementation of in silico molecule fragmentation algorithms for the cheminformatics analysis of natural product spaces
Computational methodologies extracting specific substructures like functional groups or molecular scaffolds from input molecules can be grouped under the term "in silico molecule fragmentation". They can be used to investigate what specifically characterises a heterogeneous compound class, like pharmaceuticals or Natural Products (NP), and in which aspects they are similar or dissimilar. The aim is to determine what specifically characterises NP structures in order to transfer patterns favourable for bioactivity to drug development. As part of this thesis, the first algorithmic approach to in silico deglycosylation, the removal of glycosidic moieties for the study of aglycones, was developed with the Sugar Removal Utility (SRU) (Publication A). The SRU has also proven useful for investigating NP glycoside space. It was applied to one of the largest open NP databases, COCONUT (COlleCtion of Open Natural prodUcTs), for this purpose (Publication B). A contribution was made to the Chemistry Development Kit (CDK) by developing the open Scaffold Generator Java library (Publication C). Scaffold Generator can extract different scaffold types and dissect them into smaller parent scaffolds following the scaffold tree or scaffold network approach. Publication D describes the OngLai algorithm, the first automated method to identify homologous series in input datasets, group the member structures of each series, and extract their common core. To support the development of new fragmentation algorithms, the open Java rich client graphical user interface application MORTAR (MOlecule fRagmenTAtion fRamework) was developed as part of this thesis (Publication E). MORTAR allows users to quickly execute the steps of importing a structural dataset, applying a fragmentation algorithm, and visually inspecting the results in different ways. All software developed as part of this thesis is freely and openly available (see https://github.com/JonasSchaub).
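The scaffold extraction mentioned above can be illustrated by the Murcko-style pruning idea: repeatedly delete terminal atoms until only ring systems and their linkers remain. The toy graph below stands in for a molecule; this is a sketch of the general principle, not the Scaffold Generator implementation:

```python
# Core idea behind scaffold extraction (Murcko-style): iteratively prune
# terminal (degree-1) atoms until only rings and linkers remain.
# A plain adjacency dict stands in for a molecular graph.

def extract_scaffold(adjacency):
    """adjacency: dict mapping atom id -> set of neighbour ids.
    Returns the set of atom ids that belong to the scaffold."""
    adj = {a: set(nbrs) for a, nbrs in adjacency.items()}  # defensive copy
    while True:
        terminals = [a for a, nbrs in adj.items() if len(nbrs) <= 1]
        if not terminals:           # nothing terminal left: pure ring/linker system
            break
        for a in terminals:         # delete each terminal atom and its bonds
            for n in adj[a]:
                adj[n].discard(a)
            del adj[a]
    return set(adj)

# Toluene-like graph: a 6-membered ring (atoms 0-5) with a methyl (atom 6).
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
ring[0].add(6)
ring[6] = {0}
print(sorted(extract_scaffold(ring)))  # [0, 1, 2, 3, 4, 5]
```

An acyclic molecule prunes away completely (empty scaffold), which is why chain-like structures need the separate fragmentation strategies the thesis discusses.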
Development and validation of in silico tools for efficient library design and data analysis in high throughput screening campaigns
My PhD project findings have their major application in the early phase of the drug discovery process; in particular, we have developed and validated two computational tools (Molecular Assemblies and LiGen) to support the hit finding and hit-to-lead phases.
I have reported here novel methods to first design chemical libraries optimized for HTS and then profile them against a specific target receptor or enzyme. I also analyzed the generated biochemical data in order to obtain robust SARs and to select the most promising hits for follow-up. The described methods support the iterative process of validated hit series optimization up to the identification of a lead.
In chapter 3, Ligand Generator (LiGen), a de novo tool for structure-based virtual screening, is presented. The development of LiGen is a project based on a collaboration among Dompé Farmaceutici SpA, CINECA and the University of Parma. In this multidisciplinary group, the integration of different skills has allowed the development, from scratch, of a virtual screening tool able to compete in terms of performance with long-standing, well-established molecular docking tools such as Glide, AutoDock and PLANTS.
LiGen, using a novel docking algorithm, is able to perform flexible ligand docking without performing a conformational sampling step. LiGen also has other distinctive features with respect to other molecular docking programs:
• LiGen uses the inverse pharmacophore derived from the binding site to identify the putative bioactive conformation of the molecules, thus avoiding the evaluation of molecular conformations which do not match the key features of the binding site.
• LiGen implements a de novo molecule builder based on the accurate definition of chemical rules taking account of building block (reagent) reactivity.
• LiGen is natively a multi-platform, portable C++ code designed for HPC applications and optimized for the most recent hardware architectures, such as the Xeon Phi accelerators.
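The inverse-pharmacophore idea in the first bullet can be reduced to a sketch: a conformation is kept only if its features can match every key feature of the binding site within a distance tolerance. Feature kinds, coordinates and the tolerance below are invented for illustration and are not LiGen's actual model:

```python
import itertools
import math

# Sketch of pharmacophore-style filtering: every site feature must be
# matched by a distinct molecule feature of the same kind, within `tol`
# angstroms. Brute-force assignment search; toy data only.

def matches_site(site_features, mol_features, tol=1.0):
    """site/mol features: lists of (kind, (x, y, z)) tuples."""
    for assignment in itertools.permutations(mol_features, len(site_features)):
        if all(kind_s == kind_m and math.dist(p_s, p_m) <= tol
               for (kind_s, p_s), (kind_m, p_m) in zip(site_features, assignment)):
            return True   # this conformation satisfies all key site features
    return False          # no feature assignment works: discard conformation

site = [("donor", (0.0, 0.0, 0.0)), ("aromatic", (4.0, 0.0, 0.0))]
good = [("donor", (0.2, 0.0, 0.0)), ("aromatic", (4.3, 0.0, 0.0))]
bad  = [("donor", (0.2, 0.0, 0.0)), ("donor", (4.3, 0.0, 0.0))]
print(matches_site(site, good), matches_site(site, bad))  # True False
```

Filtering conformations this way, before any docking score is computed, is what lets an inverse-pharmacophore approach skip poses that cannot possibly fit the site.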
Chapter 3 also reports the further development and optimization of the software, starting from the results obtained in the first optimization step performed to validate the software and to derive the default parameters.
In chapter 4, the application of LiGen in the discovery and optimization of novel inhibitors of the complement factor 5 receptor (C5aR) is reported. Briefly, the C5a anaphylatoxin, acting on its cognate G protein-coupled receptor C5aR, is a potent pronociceptive mediator in several models of inflammatory and neuropathic pain. Although there has long been interest in the identification of C5aR inhibitors, their development has been complicated, as is the case with many peptidomimetic drugs, mostly by the poor drug-like properties of these molecules. Herein, we report the de novo design of a potent and selective C5aR noncompetitive allosteric inhibitor, DF2593A. The design of DF2593A was guided by the hypothesis that an allosteric site, the "minor pocket", previously characterized in CXCR1 and CXCR2, could be functionally conserved in the GPCR class. DF2593A potently inhibited C5a-induced migration of human and rodent neutrophils in vitro. Moreover, oral administration of DF2593A effectively reduced mechanical hyperalgesia in several models of acute and chronic inflammatory and neuropathic pain in vivo, without any apparent side effects.
Chapter 5 describes another tool: Molecular Assemblies (MA), a novel metric based on a hierarchical representation of the molecule, built from different representations of its scaffold together with pruning rules. The algorithm used by MA, defining a metric (a set of rules) a priori, creates a representation of the chemical structure through hierarchical decomposition of the scaffold into fragments, in a pathway-invariant way (a feature novel with respect to the other algorithms reported in the literature). This decomposition is applied to nine hierarchical representations of the scaffold of the reference molecule, differing in the structural information they retain: atom typing and bond order (also novel with respect to the other algorithms reported in the literature). The algorithm (metric) thus generates a multi-dimensional hierarchical representation of the molecule.
This descriptor, applied to a library of compounds, is able to extract structural (molecules having the same scaffold, wireframe or framework) and substructural (molecules having the same fragments in common) relations among all the molecules.
Lastly, this method generates relations among molecules based on identities (scaffolds or fragments). Such an approach produces a unique representation of the reference chemical space, not biased by the threshold used to define the similarity cut-off between two molecules. This is in contrast to other methods, which generate representations based on similarities.
The MA procedure, retrieving all scaffold representations, fragments and fragmentation patterns (according to the predefined rules) from a molecule, creates a molecular descriptor useful for several cheminformatics applications:
• Visualization of the chemical space. The scaffold relations (Figure 7) and the fragmentation patterns can be plotted using a network representation. The resulting graphs are useful depictions of the chemical space, highlighting the relations that occur among the molecules in a two-dimensional space.
• Clustering of the chemical space. The relations among the molecules are based on identities, which means that the scaffold representations and their fragments can be used as a hierarchical clustering method. This descriptor produces clusters that are independent of the number of, and similarity among, closest neighbours, because belonging to a cluster is a property of the single molecule (Figure 8). This intrinsic feature makes scaffold-based clustering much faster than other methods at producing "stable" clusters: adding or removing molecules increases or decreases the number of clusters and adds or removes relations among the clusters, but these changes do not affect the cluster membership or the relations of the other molecules in the dataset.
• Generation of scaffold-based fingerprints. The descriptor can be used as a fingerprint of the molecule and to generate a similarity index able to compare single molecules, or to compare the diversity of two libraries as a whole.
Chapter 6 reports an application of MA in the design of a diverse, drug-like, scaffold-based library optimized for HTS campaigns. A well-designed, sizeable and properly organized chemical library is a fundamental prerequisite for any HTS project. Building a collection of chemical compounds with high chemical diversity was the aim of the Italian Drug Discovery Network (IDDN) initiative. A structurally diverse collection of about 200,000 chemical molecules was designed and built, taking into account practical aspects of experimental HTS procedures. Algorithms and procedures were developed and implemented to address compound filtering, selection, clusterization and plating.
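The identity-based clustering described in the bullets above can be sketched in a few lines: because cluster membership is a property of the single molecule (its scaffold key), no pairwise similarity computation or cut-off is needed. Molecule names and scaffold labels here are toy data:

```python
from collections import defaultdict

# Identity-based (scaffold) clustering: each molecule's cluster is fully
# determined by its own scaffold key, so clustering is a single pass
# with no pairwise comparisons.

def cluster_by_scaffold(molecules, scaffold_of):
    clusters = defaultdict(list)
    for mol in molecules:
        clusters[scaffold_of(mol)].append(mol)
    return dict(clusters)

# Hypothetical dataset: molecule names mapped to precomputed scaffold labels.
scaffolds = {
    "mol_A": "benzene", "mol_B": "benzene",
    "mol_C": "pyridine", "mol_D": "indole",
}
clusters = cluster_by_scaffold(scaffolds, scaffolds.get)
print(len(clusters))                 # 3
print(sorted(clusters["benzene"]))   # ['mol_A', 'mol_B']
```

Adding or removing a molecule only touches its own scaffold's cluster, which is the "stability" property the text contrasts with neighbour-based clustering.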
Chapter 7 collects concluding remarks and plans for the further development of the tools.
Chemical Information Bulletin
Periodic supplement for "the regular journals of the American Chemical Society," containing annotated bibliographies of chemical documentation literature as well as information about meetings, conferences, awards, scholarships, and other news from the American Chemical Society (ACS) Division of Chemical Literature
Automated analysis and validation of open chemical data
Methods to automatically extract Open Data from the chemical literature,
validate it, and use it to validate theory are examined.
Chemical identifiers which assist the automatic location of chemical structures
using commercial Web search engines are investigated. The IUPAC
International Chemical Identifier (InChI) gives almost 100% recall and precision,
though it is shown to be too long for present search engines. A combination
of InChI and InChIKey, a shorter, fixed-length hash of the InChI
string, is concluded to be the best current method of identifying structures.
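The appeal of the InChIKey here is that a fixed-length hash turns identifiers of arbitrary length into short, constant-width keys that search engines index cleanly. The sketch below illustrates only that principle with a truncated SHA-256; the real InChIKey algorithm has its own block structure, character encoding and check characters:

```python
import hashlib

# Why a fixed-length hash helps: identifiers of any length map to a short,
# constant-width key. Simplified illustration only; NOT the real InChIKey
# algorithm, which uses truncated SHA-256 with its own block layout.

def short_key(identifier, length=14):
    digest = hashlib.sha256(identifier.encode("utf-8")).hexdigest()
    return digest[:length].upper()

inchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"  # ethanol
key = short_key(inchi)
print(len(key))  # 14 -- same length whatever the input InChI's length
```

The key is deterministic (the same structure always yields the same key), which is what makes it usable as a search term across independent web pages.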
The proportion of published, Open Crystallographic Information Files
(CIFs) that are valid with respect to the specification is shown to be improving,
and is around 99% in 2007. The error rate in the conversion of valid
CIFs to Chemical Markup Language (CML) is less than 0.2%. The machine
generation of connection tables from CIFs requires many heuristics, and in
some cases it is impossible to deduce the exact connection table.
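A typical heuristic behind such machine generation of connection tables: treat two atoms as bonded when their separation is below the sum of their covalent radii plus a tolerance. The radii and tolerance below are approximate illustrative values, not the thesis's actual parameters:

```python
import math

# Distance-based bond perception from crystallographic coordinates:
# bond if distance <= r_cov(i) + r_cov(j) + tolerance.
# Radii are approximate values in angstroms, for illustration only.

COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}

def perceive_bonds(atoms, tolerance=0.4):
    """atoms: list of (element, (x, y, z)) tuples. Returns index pairs."""
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            (el_i, pos_i), (el_j, pos_j) = atoms[i], atoms[j]
            dist = math.dist(pos_i, pos_j)
            if dist <= COVALENT_RADII[el_i] + COVALENT_RADII[el_j] + tolerance:
                bonds.append((i, j))
    return bonds

# A C and O 1.13 A apart (bonded), plus a distant, unbonded O.
atoms = [("C", (0.0, 0.0, 0.0)), ("O", (1.13, 0.0, 0.0)), ("O", (5.0, 0.0, 0.0))]
print(perceive_bonds(atoms))  # [(0, 1)]
```

Distance criteria alone cannot fix bond orders or resolve disorder, which is why, as the text says, the exact connection table is sometimes impossible to deduce.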
CrystalEye, a fully-automated system for the reformulation of the fragmented
crystallographic Web into a structured XML-based repository is described.
Published, Open CIFs can be located and aggregated programmatically
with almost 100% recall. It is shown that, by converting CIF data
to CML, software can be created to use the latest Web standards and technologies
to enhance the ability of Web users to browse, find, keep updated,
download and reuse the latest published crystallography.
A workflow for the high-throughput calculation of solid-state geometry
using a semi-empirical method is described. A wide range of organic and
inorganic systems provided by CrystalEye are used to test both the data and
the method. Several errors in the method are discovered, many of which can
be attributed to the parameterization process.
An Open NMR experiment to perform high-throughput prediction of 13C
chemical shifts using a GIAO protocol is described. The data and analysis
were provided on publicly-available webpages to enable crowdsourcing, which
assisted in discovering an error rate of 6.1% in the starting data. The protocol
was refined during the work and shown to have an average unsigned error
of 2.24 ppm for 13C nuclei of small, rigid molecules, comparable to the
errors observed elsewhere for general structures using HOSE and neural
network methods.
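The "average unsigned error" quoted above is simply a mean absolute deviation between predicted and observed shifts; the shift values below are invented for illustration:

```python
# Mean unsigned (absolute) error between predicted and observed 13C shifts.
# All shift values here are made up for illustration.

def mean_unsigned_error(predicted, observed):
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)

predicted = [128.5, 77.2, 21.4, 170.1]   # ppm, hypothetical GIAO predictions
observed  = [126.9, 76.0, 23.0, 171.5]   # ppm, hypothetical experimental values
print(round(mean_unsigned_error(predicted, observed), 2))  # 1.45
```

Using the unsigned error prevents positive and negative deviations from cancelling, which a plain mean error would allow.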
Lead optimization for new antimalarials and successful lead identification for metalloproteinases: a fragment-based approach using virtual screening
Computer-aided drug design is an essential part of modern medicinal
chemistry, and has led to the acceleration of many projects. The thesis
described herein presents examples of its application in the field of lead
optimization and lead identification for three metalloproteins.
DOXP-reductoisomerase (DXR) is a key enzyme of the mevalonate-independent
isoprenoid biosynthesis. Structure-activity relationships for 43 DXR
inhibitors are established, derived from protein-based docking, ligand-based
3D QSAR and a combination of both approaches as realized by AFMoC. As part
of an effort to optimize the properties of the established inhibitor
Fosmidomycin, analogues have been synthesized and tested to gain further
insights into the primary determinants of binding affinity.
Unfortunately, these structures still leave the active Fosmidomycin
conformation and the detailed reaction mechanism undetermined. This fact,
together with the small inhibitor data set, provides a major challenge for
presently available docking programs and 3D QSAR tools. Using the recently
developed protein-tailored scoring protocol AFMoC, precise prediction of
binding affinities for related ligands, as well as the capability to estimate
the affinities of structurally distinct inhibitors, has been achieved.
Farnesyltransferase is a zinc-metallo enzyme that catalyzes the
posttranslational modification of numerous proteins involved in
intracellular signal transduction. The development of farnesyltransferase
inhibitors is directed towards the so-called non-thiol inhibitors because of
adverse drug effects connected to free thiols. A first step on the way to
non-thiol farnesyltransferase inhibitors was the development of a
CAAX-benzophenone peptidomimetic based on a pharmacophore model. On this
basis, bisubstrate analogues were developed as one class of non-thiol
farnesyltransferase inhibitors. In further studies, two aryl-binding and two
distinct specificity sites were postulated. Flexible docking of model
compounds was applied to investigate the sub-pockets and to design highly
active non-thiol farnesyltransferase inhibitors. In addition to affinity,
special attention was paid to in vivo activity and species specificity.
The second part of this thesis describes a possible strategy for
computer-aided lead discovery. Assembling a complex ligand from simple
fragments has recently been introduced as an alternative to traditional HTS.
While frequently applied experimentally, only a few examples are known for
computational fragment-based approaches. Mostly, computational tools are
applied to compile the libraries and to finally assess the assembled
ligands. Using the metalloproteinase thermolysin (TLN) as a model target, a
computational fragment-based screening protocol has been established.
Starting with a data set of commercially available chemical compounds, a
fragment library has been compiled considering (1) fragment likeness and (2)
similarity to known drugs. The library is screened for target specificity,
resulting in 112 fragments to target the zinc binding area and 75 fragments
targeting the hydrophobic specificity pocket of the enzyme. After analyzing
the performance of multiple docking programs and scoring functions for
fragments in general, the fragments derived from the screening are docked,
and the 14 most promising candidates are selected for further analysis. Soaking
experiments were performed for reference fragments to derive a generally
applicable crystallization protocol for TLN and subsequently for new
protein-fragment complex structures. 3-Methylaspirin could be determined to
bind to TLN. Additional studies addressed a retrospective performance
analysis of the applied scoring functions and modifications of the screening
hit. Curious about the differences between aspirin and 3-methylaspirin,
3-chloroaspirin was synthesized, and the affinities could be determined to
be 2.42 mM, 1.73 mM and 522 µM, respectively.
The results of the thesis show that computer-aided drug design approaches
can successfully support projects in lead optimization and lead
identification.
Computational Approaches to Drug Profiling and Drug-Protein Interactions
Despite substantial increases in R&D spending within the pharmaceutical
industry, de novo drug design has become a time-consuming endeavour. High
attrition rates led to a long period of stagnation in drug approvals. Due to
the extreme costs associated with
introducing a drug to the market, locating and understanding the reasons for clinical failure
is key to future productivity. As part of this PhD, three main contributions were made in
this respect. First, the web platform LigNFam enables users to interactively
explore similarity relationships between "drug-like" molecules and the
proteins they bind. Secondly, two deep-learning-based binding site comparison
tools were developed, competing with the state-of-the-art over benchmark
datasets. The models have the ability to predict off-target interactions and
potential candidates for target-based drug repurposing. Finally, the
open-source ScaffoldGraph software was presented for the analysis of hierarchical scaffold
relationships and has already been used in multiple projects, including integration into a
virtual screening pipeline to increase the tractability of ultra-large screening experiments.
Together, and with existing tools, the contributions made will aid in the understanding of
drug-protein relationships, particularly in the fields of off-target prediction and drug
repurposing, helping to design better drugs faster.
Applications and Variations of the Maximum Common Subgraph for the Determination of Chemical Similarity
The Maximum Common Substructure (MCS), along with numerous graph theory techniques, has
been used widely in chemoinformatics. A topic which has been studied at Sheffield is the hyperstructure
concept - a chemical definition of a superstructure, which represents the graph theoretic union
of several molecules. This technique, however, has been little studied in the
context of similarity-based virtual screening: most hyperstructure literature
to date has focused on either construction methodology or property prediction
on small datasets of compounds.
The work in this thesis is divided into two parts. The first part describes a method for constructing
hyperstructures, and then describes the application of a hyperstructure in similarity searching in
large compound datasets, comparing it with extended connectivity fingerprint and MCS similarity.
Since hyperstructures performed significantly worse than fingerprints, additional work is described
concerning various weighting schemes of hyperstructures.
Due to the poor performance of hyperstructure and MCS screening compared to
fingerprints, it was questioned whether the choice of maximum common
substructure algorithm and MCS type had an influence. A series of MCS
algorithms and types were compared for speed, MCS size and virtual screening
ability. A topologically-constrained variant of the MCS was found to be
competitive with fingerprints, and fusion of the two techniques improved
active compound recall overall.
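The fingerprint similarity and score fusion discussed above can be sketched as a Tanimoto coefficient over feature sets plus a simple max-fusion of two rankings; the feature sets below are toy stand-ins for real fingerprints:

```python
# Tanimoto (Jaccard) similarity on feature sets, plus a simple max-fusion
# of two similarity scores (e.g. fingerprint-based and MCS-based).
# Feature labels are toy data, not real fingerprint bits.

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: |intersection| / |union| of two feature sets."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def fuse(score_fp, score_mcs):
    """Max-fusion: take the better of the two similarity scores."""
    return max(score_fp, score_mcs)

query = {"ring6", "C-O", "C=O", "C-N"}
candidate = {"ring6", "C-O", "C-N", "C-Cl"}
print(tanimoto(query, candidate))  # 0.6
print(fuse(0.6, 0.45))             # 0.6
```

Max-fusion is only one of several fusion rules (sum and rank-based rules are also common); it is shown here because it is the simplest to state.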
Structure generation and de novo design using reaction networks
This project is concerned with de novo molecular design whereby novel molecules are built in silico and evaluated against properties relevant to biological activity, such as physicochemical properties and structural similarity to active compounds. The aim is to encourage cost-effective compound design by reducing the number of molecules requiring synthesis and analysis.
One of the main issues in de novo design is ensuring that the molecules generated are synthesisable. In this project, a method is developed that enables virtual synthesis using rules derived from reaction sequences. Individual reactions taken from reaction databases were connected to form reaction networks. Reaction sequences were then extracted by tracing paths through the network and used to create "reaction sequence vectors" (RSVs), which encode the differences between the start and end points of the sequences. RSVs can be applied to molecules to generate virtual products which are based on literature precedents.
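The RSV idea can be sketched with fragment-count vectors: a vector is the difference between the feature counts of a sequence's end point and its start point, and applying it to a new molecule adds that difference. The feature labels below are invented for illustration and are not the thesis's actual descriptors:

```python
from collections import Counter

# Reaction-vector sketch: molecules as feature-count vectors; the vector
# is the product-minus-reactant difference, and "applying" it to a new
# starting material adds that difference. Toy feature labels throughout.

def reaction_vector(reactant, product):
    rv = Counter(product)
    rv.subtract(Counter(reactant))   # features lost get negative counts
    return rv

def apply_vector(molecule, rv):
    result = Counter(molecule)
    result.update(rv)                # add the difference (update sums counts)
    return +result                   # unary + drops zero/negative counts

# Esterification-like change: lose a C-OH, gain an ester linkage.
reactant = {"C-OH": 1, "ring6": 1}
product = {"C(=O)O-C": 1, "ring6": 1}
rv = reaction_vector(reactant, product)

new_start = {"C-OH": 1, "ring5": 1}
print(dict(apply_vector(new_start, rv)))  # {'ring5': 1, 'C(=O)O-C': 1}
```

Because the vector records only the start/end difference, a multi-step sequence collapses into a single transformation, which is what makes RSV-based generation faster than chaining individual reaction vectors.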
The RSVs were applied to structure-activity relationship (SAR) exploration using examples taken from the literature. They were shown to be effective in expanding the chemical space that is accessible from the given starting materials. Furthermore, each virtual product is associated with a potential synthetic route. They were then applied in de novo design scenarios with the aim of generating molecules that are predicted to be active using SAR models. Using a collection of RSVs with a set of small molecules as starting materials for de novo design proved that the method was capable of producing
many useful, synthesisable compounds worthy of future study.
The RSV method was then compared with a previously published method that is based on individual reactions (reaction vectors, or RVs). The RSV approach was shown to be considerably faster than de novo design using RVs; however, the diversity of products was more limited.