14 research outputs found
Information retrieval and text mining technologies for chemistry
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, in particular the CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation, together with text mining applications for linking chemistry with biological information, are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
A.V. and M.K. acknowledge funding from the European Community's Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.
Chemical Information Bulletin
Periodic supplement for "the regular journals of the American Chemical Society," containing annotated bibliographies of chemical documentation literature as well as information about meetings, conferences, awards, scholarships, and other news from the American Chemical Society (ACS) Division of Chemical Literature
Integrative Systems Approaches Towards Brain Pharmacology and Polypharmacology
Polypharmacology is widely regarded as the next paradigm of drug discovery. Traditional drug design is primarily based on a "one target, one drug" paradigm. In polypharmacology, drug molecules interact with multiple targets, which imposes new challenges in designing effective drugs that are less toxic because unexpected drug-target interactions have been eliminated. Although still in its infancy, the use of polypharmacology ideas already appears to have a remarkable impact on modern drug development. This thesis is a detailed study of systems-level pharmacology approaches for understanding polypharmacology in complex brain and neurodegenerative disorders. The research focuses on the design and construction of a dedicated knowledge base for human brain pharmacology. This knowledge base, referred to as the Human Brain Pharmacome (HBP), is a unique and comprehensive resource that aggregates data and knowledge around current drug treatments available for major brain and neurodegenerative disorders. The HBP knowledge base provides data in a single place for building models and supporting hypotheses. The HBP also incorporates new data obtained from similarity computations over drug and protein structures, which were analyzed from various perspectives, including network pharmacology and the application of in silico computational methods for the discovery of novel multi-target drug candidates. Computational tools and machine learning models were developed to characterize protein targets by their polypharmacological profiles and to distinguish indication-specific or target-specific drugs from other drugs.
Systems pharmacology approaches to drug property prediction provided a highly enriched compound library that was virtually screened against an array of protein targets derived from network pharmacology, using combined docking and molecular dynamics simulation workflows. The approaches developed in this work resulted in the identification of novel multi-target drug candidates that are backed up by existing experimental knowledge, and they propose the repositioning of existing drugs that are undergoing further experimental validation.
From Knowledgebases to Toxicity Prediction and Promiscuity Assessment
Polypharmacology marked a paradigm shift in drug discovery from the traditional "one drug, one target" approach to a multi-target perspective, indicating that highly effective drugs favorably modulate multiple biological targets. This ability of drugs to show activity towards many targets is referred to as promiscuity, an essential phenomenon that may as well lead to undesired side-effects. While activity at therapeutic targets provides desired biological response, toxicity often results from non-specific modulation of off-targets. Safety, efficacy and pharmacokinetics have been the primary concerns behind the failure of a majority of candidate drugs. Computer-based (in silico) models that can predict the pharmacological and toxicological profiles complement the ongoing efforts to lower the high attrition rates. High-confidence bioactivity data is a prerequisite for the development of robust in silico models. Additionally, data quality has been a key concern when integrating data from publicly-accessible bioactivity databases. A majority of the bioactivity data originates from high-throughput screening campaigns and medicinal chemistry literature. However, large numbers of screening hits are considered false-positives due to a number of reasons. In stark contrast, many compounds do not demonstrate biological activity despite being tested in hundreds of assays.
This thesis work employs cheminformatics approaches to contribute to the aforementioned diverse, yet highly related, aspects that are crucial in rationalizing and expediting drug discovery. Knowledgebase resources of approved and withdrawn drugs were established and enriched with information integrated from multiple databases. These resources are not only useful in small molecule discovery and optimization, but also in the elucidation of mechanisms of action and off-target effects. In silico models were developed to predict the effects of small molecules on nuclear receptor and stress response pathways and on the potassium channel encoded by the human Ether-à-go-go-Related Gene (hERG). Chemical similarity and machine-learning based methods were evaluated while highlighting the challenges involved in the development of robust models using public domain bioactivity data. Furthermore, the true promiscuity of the potentially frequent hitter compounds was identified and their mechanisms of action were explored at the molecular level by investigating target-ligand complexes. Finally, the chemical and biological spaces of the extensively tested, yet inactive, compounds were investigated to reconfirm their potential to be promising candidates.
Polypharmacology describes a paradigm shift from "one drug, one target" to "one drug, many targets" and at the same time shows that highly effective drugs can exert their full effect only by interacting with multiple targets.
The biological activity of a drug is directly associated with its side effects, which can be explained by its interaction with therapeutic targets and off-targets (promiscuity). An imbalance in these interactions often results in insufficient efficacy, toxicity, or unfavorable pharmacokinetics, which accounts for the failure of many candidate drugs during preclinical and clinical development. Early prediction of the pharmacological and toxicological profile from chemical structure using computer-based (in silico) models can help improve the drug development process. Reliable bioactivity data are a prerequisite for successful prediction. However, data quality is often a central problem in data integration. The cause is the use of different bioassays and readouts, whose data are largely obtained from primary and confirmatory bioassays. While a large proportion of hits from primary assays are classified as false positives, some substances show no biological activity even though they have been tested extensively in both assay types ("extensively assayed compounds").
In this work, various cheminformatics methods were developed and applied to address the problems described above, to point out possible solutions, and ultimately to accelerate drug research. To this end, non-redundant, manually validated knowledge bases of approved and withdrawn drugs were created and enriched with additional information to advance the discovery and optimization of small organic molecules. A decisive tool here is the elucidation of their mechanisms of action and off-target interactions.
For the further characterization of side effects, the main focus was placed on nuclear receptors, pathways involving stress receptors, and the hERG channel, which were simulated with in silico models. These models were built using an integrative approach combining state-of-the-art algorithms such as similarity comparisons and machine learning. To ensure a high level of predictive quality, explicit attention was paid to data quality and chemical diversity when evaluating the datasets. Furthermore, the in silico models were extended by examining substructure filters more closely in order to distinguish true mechanisms of action from nonspecific binding behavior (false-positive substances). Finally, the chemical and biological space of extensively tested yet inactive small organic molecules ("extensively assayed compounds") was investigated and compared with currently approved drugs to confirm their potential as promising candidates.
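The abstract mentions chemical similarity methods without naming a specific measure. As an illustration only, the sketch below ranks compounds by the Tanimoto coefficient, a common choice for binary fingerprints; the choice of measure, the compound names, and the bit sets are all assumptions made for the example, not details taken from the thesis.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient between two sets of 'on' fingerprint bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical fingerprints: each set holds the indices of the bits that are set.
query = {1, 4, 7, 9}
library = {
    "cmpd_A": {1, 4, 7, 8},
    "cmpd_B": {2, 3, 5},
    "cmpd_C": {1, 4, 7, 9},
}

# Rank library compounds from most to least similar to the query.
ranked = sorted(library, key=lambda name: tanimoto(query, library[name]), reverse=True)
print(ranked)
```

In practice the bit sets would come from a fingerprinting toolkit rather than being typed by hand; the ranking logic stays the same.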
Unified processing framework of high-dimensional and overly imbalanced chemical datasets for virtual screening.
Virtual screening in drug discovery involves processing large datasets containing unknown molecules in order to find the ones that are likely to have the desired effects on a biological target, typically a protein receptor or an enzyme. Molecules are thereby classified as active or inactive in relation to the target. Misclassification of molecules in cases such as drug discovery and medical diagnosis is costly, both in time and finances. In the process of discovering a drug, it is mainly the inactive molecules classified as active towards the biological target, i.e. false positives, that cause delays in progress and high late-stage attrition. However, despite the pool of techniques available, the selection of a suitable approach in each situation is still a major challenge. This PhD thesis is designed to develop a pioneering framework which enables the analysis of the virtual screening of chemical compound datasets in a wide range of settings in a unified fashion. The proposed method provides a better understanding of the dynamics of innovatively combining data processing and classification methods in order to screen massive, potentially high-dimensional and overly imbalanced datasets more efficiently.
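To make the cost of imbalance concrete, the sketch below compares plain accuracy with precision, recall, and the Matthews correlation coefficient on a hypothetical screen of 100 molecules containing 5 actives. The data and both "classifiers" are invented for illustration; nothing here reproduces the framework proposed in the thesis.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = active)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summarize(y_true, y_pred):
    """Return accuracy, precision, recall, and MCC as a dictionary."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical screen: 5 actives followed by 95 inactives.
y_true = [1] * 5 + [0] * 95
all_inactive = [0] * 100                           # trivial model: never predicts "active"
screener = [1] * 4 + [0] * 1 + [1] * 10 + [0] * 85  # finds 4 actives, 10 false positives

print(summarize(y_true, all_inactive))
print(summarize(y_true, screener))
```

The trivial model scores higher on accuracy even though it retrieves no actives, which is why imbalance-aware metrics are usually reported alongside accuracy.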
Geometric Learning for Quantum-Informed, Machine Learning and Analysis of Electrostatic Preorganization
This thesis is organized in a slightly unconventional fashion: algorithms lead and applications fill out the content. I think this emphasizes my interests during graduate school -
I built algorithms and tools to address issues that were otherwise inaccessible to different
areas of computational chemistry (including applied machine learning) and enzymology. Two
sets of scientific thrusts underscore the bulk of my work: algorithms to analyze dynamic,
heterogeneous fields in the context of enzymology and flexible machine learning algorithms,
including those that leverage quantum descriptors, for rigorous molecular and reaction-level
properties. Each section will include grounding on applications and broader impacts for
the reader as well. Now we pivot to discussing the main thrusts and outlining each chapter
briefly.
General ML and Quantum Theory of Atoms-in-Molecules (QTAIM): QTAIM serves as a mathematical decomposition algorithm for electronic basins within a molecule.
The algorithm intakes molecular densities, as computed (typically) by density functional
theory (DFT), and uses the flux of density to partition the scalar field into 3-dimensional
regions of density [14, 16]. These objects are known as atomic basins and represent
the quantum atom within a molecule. By constructing these structures, we compute a rich
set of mathematical descriptors that map to many features including energies, bonding,
and electron delocalization. These features have been correlated, in the past, to activation
energies, reactivity, and overall system energies, but these uses largely relied on human
intervention and small datasets [44, 62, 65, 111, 142, 287]. By developing software centered
around high-throughput QTAIM calculations and machine learning, I was able to bring these
descriptors to larger datasets and a wide host of applications.
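The partitioning idea can be pictured with a deliberately simplified toy: on a 1-D grid, each point is assigned to the local density maximum it reaches by always stepping to its highest neighbour. This is an assumed, grid-based steepest-ascent illustration with made-up density values; it is not the thesis software, and real QTAIM analyses work on 3-D DFT densities with far more careful numerics.

```python
def assign_basins(density):
    """Label each grid point with the index of the local maximum reached by steepest ascent."""
    n = len(density)

    def climb(i):
        while True:
            best = i
            for j in (i - 1, i + 1):  # the two neighbours on a 1-D grid
                if 0 <= j < n and density[j] > density[best]:
                    best = j
            if best == i:  # no higher neighbour: this point is a local maximum
                return i
            i = best

    return [climb(i) for i in range(n)]


def basin_populations(density, labels):
    """Sum the density over every basin (a crude stand-in for integrating it)."""
    totals = {}
    for value, label in zip(density, labels):
        totals[label] = totals.get(label, 0.0) + value
    return totals


# Hypothetical 1-D "density" with two peaks, loosely mimicking two atoms.
density = [0.1, 0.5, 1.0, 0.4, 0.2, 0.6, 1.5, 0.7]
labels = assign_basins(density)
print(labels)
print(basin_populations(density, labels))
```

Each distinct label corresponds to one basin, and the per-basin sums play the role that integrated atomic properties play in a real calculation.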
In Chapter 2, I discuss an algorithm I implemented to predict Diels-Alder reaction
barriers from QTAIM signatures alone. In this study, we showed that QTAIM features can be
used to surmise reaction barriers while also using machine learning techniques to understand
what signatures were most informative to our models. Here QTAIM electrostatic potentials
and delocalization indices alone were able to yield great performance on withheld datasets.
In addition, we demonstrated that QTAIM features can allow a machine learning model to
generalize, to an extent, to much larger Diels-Alder reactions. This chapter was adapted from
the following: Machine Learning to Predict Diels–Alder Reaction Barriers from the Reactant
State Electron Density. S. Vargas*, M. Hannefarth, Z. Liu, A.N. Alexandrova. Journal of
Chemical Theory and Computation 2021 17 (10), 6203-6213. 10.1021/acs.jctc.1c00623.
In Chapter 3, I discuss a package developed to perform high-throughput QTAIM
calculations on datasets of molecules and reactions. This package is currently adapted to
work with open-source packages such as ORCA and Multiwfn. These programs, respectively,
compute DFT densities at a user-specified level of theory and subsequently compute QTAIM
descriptors. The package is built with high-performance compute (HPC) in mind as it
can operate on a single dataset with an arbitrary number of concurrent jobs. Here I also
used the package to compute QTAIM values for a diverse set of important and difficult
datasets and developed graph neural networks to predict molecular and reaction properties
leveraging QTAIM as inputs. This chapter was adapted from the following:
High-throughput quantum theory of atoms in molecules (QTAIM) for geometric deep
learning of molecular and reaction properties. Santiago Vargas, Winston Gee, and Anastassia
N. Alexandrova. Digital Discovery 2024 3, 987-998.
Advancing Analysis of Electric Fields in Proteins: The later chapters follow our work in developing algorithms to ingest, interpret, and predict on electric fields in protein
active sites. This work builds on the notion of electrostatic preorganization, a theory that
posits that protein scaffolds arrange to electrostatically catalyse chemical reactions,
thereby destabilizing reactants while lowering transition state energies [299, 301].
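For readers unfamiliar with how such a field is evaluated, the sketch below applies Coulomb's law to sum the field of point charges at a probe position and projects it onto a chosen axis. The single-charge setup, the function names, and the vacuum (no dielectric screening) treatment are assumptions for illustration; the thesis computes fields from full protein environments, which this does not attempt to reproduce.

```python
import math

COULOMB_K = 8.9875517923e9   # Coulomb constant, N m^2 / C^2
E_CHARGE = 1.602176634e-19   # elementary charge, C
ANGSTROM = 1e-10             # metres per angstrom


def electric_field(point, charges):
    """Field (Ex, Ey, Ez) in V/m at `point` from (charge, position) pairs in SI units."""
    ex = ey = ez = 0.0
    for q, (x, y, z) in charges:
        dx, dy, dz = point[0] - x, point[1] - y, point[2] - z
        r = math.sqrt(dx * dx + dy * dy + dz * dz)
        if r == 0.0:
            raise ValueError("field is undefined at the position of a point charge")
        scale = COULOMB_K * q / r**3  # k*q/r^3, multiplied by the displacement below
        ex += scale * dx
        ey += scale * dy
        ez += scale * dz
    return (ex, ey, ez)


def field_along(field, axis):
    """Component of `field` along `axis` (the axis need not be normalised)."""
    norm = math.sqrt(sum(a * a for a in axis))
    return sum(f * a for f, a in zip(field, axis)) / norm


# Hypothetical setup: one positive elementary charge at the origin,
# probed 1 angstrom away along x (e.g. along a bond axis of interest).
charges = [(E_CHARGE, (0.0, 0.0, 0.0))]
probe = (1.0 * ANGSTROM, 0.0, 0.0)
along_x = field_along(electric_field(probe, charges), (1.0, 0.0, 0.0))
print(along_x / 1e10, "V/angstrom")
```

For this setup the projected field is about 14.4 V/Å. Evaluating the same function at many probe points yields the kind of 3-D field that can then be compared across structures.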
Chapter 4 depicts exhaustive efforts to apply heterogeneous electric field analysis to
understanding directed evolution in the context of a protoglobin directed evolution (DE)
trajectory. Previous DE efforts optimized protoglobin to efficiently catalyze carbene transfer
reactions. We show that traditional explanations for increased catalytic activity across the
DE lineage, substrate access and binding, cannot account for the dramatic improvements in
protein activity. By tracking the 3-D electric field and using clustering algorithms, we pinpoint
representative structures for QM/MM calculations and show that changes in the electric field,
along DE, improve carbene transfer reactivity. These findings highlight the role that electrostatic
organization, notably its dynamic effect, plays in determining protein function and point to
its future importance in designing proteins for relevant chemical processes. This chapter is
adapted from Directed Evolution of Protoglobin Optimizes the Enzyme Electric Field. Shobhit
S. Chaturvedi, Santiago Vargas, Pujan Ajmera, and Anastassia N. Alexandrova. Journal of
the American Chemical Society 2024 146 (24), 16670-16680 DOI: 10.1021/jacs.4c03914.
In Chapter 5, I introduce a machine learning framework designed to predict enzyme
functionality directly from the heterogeneous electric fields applied to protein active sites. We
apply this method to a dataset of Heme-Iron Oxidoreductases. Previous studies here, focused
on simple, point electric fields along the Fe-O bond, are insufficient for reasonable accuracy.
On the other hand, our 3-D, heterogeneous model can accurately predict protein activity
without relying on additional protein-specific information. In addition, feature selection
elucidates which electric field components most inform our models and thus highlights
components important to reactivity and selectivity. Finally, we apply the previously mentioned electric
field clustering algorithms and QM/MM calculations to reveal how dynamic complexities in
protein structures can complicate predictions and thus provide a path forward for improved
models in this space. This chapter is adapted from Machine-learning prediction of protein
function from the portrait of its intramolecular electric field. S. Vargas*, S. Chaturvedi, A.N.
Alexandrova. (Accepted, Journal of the American Chemical Society)
Integrative omics approaches for new target identification and therapeutics development
The growing research and commercial pressures for novel therapeutics development accentuate
why better strategies are needed for drug discovery. The costly nature of developing a
pharmaceutical compound as well as the shrinking pool of "easy" targets are some of the key
reasons why there is a research paradigm shift towards integrative and systems biology driven
approaches. Moreover, multifactorial aspects of many diseases require more innovative clinical
strategies rather than just focusing on a single target. Cardiovascular diseases as well as associated
immune components exemplify this complexity well. This thesis aimed to introduce a gradual and
highly integrative analytical framework by incorporating a full range of studies from disease target
selection to high-throughput virtual screening so that a cost-effective and efficient stratification of
targets and associated compounds could be achieved. Heart failure served as a case study for
complex diseases where the first in-depth omics study on cardiomyopathies helped to elucidate new
therapeutic avenues. This research tied in with the development of a novel scoring function and
integrated machine learning approach for multiple therapeutic target classification and exploration.
Finally, all pieces of the introduced research were used to create a highly integrative in silico
screening workflow. Some of the key results included the first reported molecular dynamics
analyses for a complex immunotherapeutic target, c-Rel, as well as 15 new therapeutic compounds
that could potentially modulate this transcription factor subunit. Thus, this dissertation provided
several important improvements for target identification, validation, and drug discovery that could
significantly advance current development strategies and accelerate new therapeutics production