14 research outputs found

    A possible extension to the RInChI as a means of providing machine readable process data

    The algorithmic, large-scale use and analysis of reaction databases such as Reaxys is currently hindered by the absence of widely adopted standards for publishing reaction data in machine-readable formats. Crucial data such as the yields of all products or the stoichiometry are frequently not stated explicitly in published papers and hence not reported in the database entries for those reactions, limiting their usefulness for algorithmic analysis. This paper presents a possible extension to the IUPAC RInChI standard via an auxiliary layer, termed ProcAuxInfo, which is a standardised, extensible form in which to report key reaction parameters such as the declaration of all products, reactants and auxiliaries known in the reaction, the reaction stoichiometry, amounts of substances used, conversion, yield and operating conditions. The standard is demonstrated by creating RInChIs including the ProcAuxInfo layer for three published reactions, and accurate data recoverability is shown via reverse translation of the created strings. Implementation of this or another method of reporting process data by the publishing community would ensure that databases such as Reaxys could abstract crucial data for big-data analysis of their contents. PMJ is grateful to Peterhouse and the Cambridge Trust for scholarships.
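
    The paper's core idea is a round-trippable auxiliary text layer carrying process data. As a minimal sketch of that idea, the snippet below serialises and parses a ProcAuxInfo-style layer; the key names and the key=value syntax are invented for illustration and are not the grammar defined in the paper.

        # Illustrative only: field names and the "key=value;" syntax are assumptions,
        # not the ProcAuxInfo grammar defined in the paper.

        def build_procauxinfo(params: dict) -> str:
            """Serialise reaction process data into a ProcAuxInfo-style layer."""
            return "ProcAuxInfo=" + ";".join(f"{k}={v}" for k, v in sorted(params.items()))

        def parse_procauxinfo(layer: str) -> dict:
            """Recover the process data from the layer (reverse translation)."""
            body = layer.removeprefix("ProcAuxInfo=")
            return dict(item.split("=", 1) for item in body.split(";") if item)

        params = {"yield": "82%", "stoichiometry": "1:2", "temperature": "298 K"}
        assert parse_procauxinfo(build_procauxinfo(params)) == params  # round trip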

    International chemical identifier for reactions (RInChI).

    The Reaction InChI (RInChI) extends the idea of the InChI, which provides a unique descriptor of molecular structures, towards reactions. Prototype versions of the RInChI have been available since 2011. The first official release (RInChI-V1.00), funded by the InChI Trust, is now available for download (http://www.inchi-trust.org/downloads/). This release defines the format and generates hashed representations (RInChIKeys) suitable for database and web operations. The RInChI provides a concise description of the key data in chemical processes, and facilitates the manipulation and analysis of reaction data.
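
    To illustrate what a hashed reaction key is for (a fixed-length identifier derived from the full string, convenient for indexing and web lookups), here is a toy sketch; the official RInChIKey algorithm and encoding are defined by the RInChI software, and this SHA-256 truncation merely mimics the concept.

        import hashlib

        def toy_reaction_key(rinchi: str, length: int = 27) -> str:
            """Derive a fixed-length, database-friendly key from a RInChI string.
            Illustrative only: real RInChIKeys come from the official software."""
            digest = hashlib.sha256(rinchi.encode("utf-8")).hexdigest().upper()
            return digest[:length]

        print(toy_reaction_key("RInChI=1.00.1S/..."))  # elided string, shape only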

    Computer Aided Synthesis Prediction to Enable Augmented Chemical Discovery and Chemical Space Exploration

    The drug-like chemical space is estimated at 10^60 molecules, and the largest generated database (GDB) obtained by the Reymond group contains 165 billion molecules with up to 17 heavy atoms. Furthermore, deep learning techniques to explore regions of chemical space are becoming more popular. However, the key to realising the generated structures experimentally lies in chemical synthesis, whose planning was previously limited to manual work or slow computer-assisted synthesis planning (CASP) models. Despite the 60-year history of CASP, few synthesis planning tools have been open-sourced to the community. In this thesis I co-led the development of, and investigated, AiZynthFinder, one of the only fully open-source synthesis planning tools, trained on both public and proprietary datasets consisting of up to 17.5 million reactions. This enables synthesis-guided exploration of chemical space in a high-throughput manner, bridging the gap between compound generation and experimental realisation. I first investigate both public and proprietary reaction data and their influence on route-finding capability. Furthermore, I develop metrics for the assessment of retrosynthetic predictions, single-step retrosynthesis models, and automated template extraction workflows, supplemented by a comparison of the underlying datasets and their corresponding models. Given the prevalence of ring systems in the GDB and the wider medicinal chemistry domain, I developed ‘Ring Breaker’, a data-driven approach to enable the prediction of ring-forming reactions. I demonstrate its utility on frequently found and unprecedented ring systems, in agreement with literature syntheses. Additionally, I highlight its potential for incorporation into CASP tools and outline methodological improvements that strengthen route-finding capability. To tackle the challenge of model throughput, I report a machine learning (ML) based classifier, the retrosynthetic accessibility score (RAscore), which assesses the likelihood of finding a synthetic route with AiZynthFinder. The RAscore computes at least 4,500 times faster than AiZynthFinder, opening the possibility of pre-screening millions of virtual molecules from enumerated databases or generative models for synthesis-informed compound prioritisation. Finally, I combine chemical library visualisation with synthetic route prediction to facilitate engagement with experimental synthetic chemists. I enable navigation of chemical property space by using interactive visualisation to deliver the associated synthetic data as endpoints, which aids in the prioritisation of compounds. The ability to view synthetic route information alongside structural descriptors provides a feedback mechanism for the improvement of CASP tools and enables rapid hypothesis testing. I demonstrate the workflow as applied to the GDB databases to augment compound prioritisation and synthetic route design.
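
    For orientation, a route search with the open-source AiZynthFinder package typically looks like the sketch below. It follows the project's public documentation, but method and option names may differ between releases, and the configuration file pointing at trained models and stock files is assumed to exist.

        from aizynthfinder.aizynthfinder import AiZynthFinder

        # "config.yml" (assumed present) lists the expansion policy and stock files.
        finder = AiZynthFinder(configfile="config.yml")
        finder.stock.select("zinc")              # names depend on the local config
        finder.expansion_policy.select("uspto")

        finder.target_smiles = "COc1ccc2cc(C(C)C(=O)O)ccc2c1"  # example target
        finder.tree_search()                     # search over retrosynthetic steps
        finder.build_routes()
        stats = finder.extract_statistics()      # e.g. whether the target was solved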

    Learning the Language of Chemical Reactions – Atom by Atom. Linguistics-Inspired Machine Learning Methods for Chemical Reaction Tasks

    Over the last hundred years, little has changed in how organic chemistry is conducted. In most laboratories, the current state is still trial-and-error experimentation guided by human expertise acquired over decades. What if, given all the knowledge published, we could develop an artificial intelligence-based assistant to accelerate the discovery of novel molecules? Although many approaches have recently been developed to generate novel molecules in silico, only a few studies complete the full design-make-test cycle, including the synthesis and the experimental assessment. One reason is that the synthesis part can be tedious and time-consuming and requires years of experience to perform successfully. Hence, synthesis is one of the critical limiting factors in molecular discovery. In this thesis, I take advantage of similarities between human language and organic chemistry to apply linguistic methods to chemical reactions and to develop artificial intelligence-based tools for accelerating chemical synthesis. First, I investigate reaction prediction models, focusing on small data sets of challenging stereo- and regioselective carbohydrate reactions. Second, I develop a multi-step synthesis planning tool predicting reactants and suitable reagents (e.g. catalysts and solvents). Both the forward prediction and retrosynthesis approaches use black-box models. Hence, I then study methods to provide more information about the models’ predictions. I develop a reaction classification model that labels chemical reactions and facilitates the communication of reaction concepts. As a side product of the classification models, I obtain reaction fingerprints that enable efficient similarity searches in chemical reaction space. Moreover, I study approaches for predicting reaction yields. Lastly, after approaching all chemical reaction tasks with atom-mapping-independent models, I demonstrate the generation of accurate atom-mappings from the patterns my models have learned while being trained self-supervised on chemical reactions. My PhD thesis’s leitmotif is the application of the attention-based Transformer architecture to molecules and reactions represented with a text notation. It is as if atoms are my letters, molecules my words, and reactions my sentences. With this analogy, I teach my neural network models the language of chemical reactions, atom by atom. While exploring the link between organic chemistry and language, I make an essential step towards the automation of chemical synthesis, which could significantly reduce the costs and time required to discover and create new molecules and materials.
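
    The letters-words-sentences analogy becomes concrete at the tokenisation step, where a reaction SMILES is split into atom-level tokens before being fed to the Transformer. A sketch using the regular expression widely used for this in the Molecular Transformer line of work:

        import re

        # Atom-level SMILES tokenisation pattern popularised by the
        # Molecular Transformer work (Schwaller et al.).
        SMILES_TOKENIZER = re.compile(
            r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
        )

        def tokenize(smiles: str) -> list[str]:
            """Split a (reaction) SMILES string into atom-level tokens."""
            return SMILES_TOKENIZER.findall(smiles)

        # Esterification: atoms as letters, molecules as words, the reaction as a sentence.
        print(tokenize("CC(=O)O.OCC>>CC(=O)OCC"))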

    The metaRbolomics Toolbox in Bioconductor and beyond

    Metabolomics aims to measure and characterise the complex composition of metabolites in a biological system. Metabolomics studies involve sophisticated analytical techniques such as mass spectrometry and nuclear magnetic resonance spectroscopy, and generate large amounts of high-dimensional, complex experimental data. Open source processing and analysis tools are of major interest in light of innovative, open and reproducible science. The scientific community has developed a wide range of open source software, providing freely available advanced processing and analysis approaches. The programming and statistics environment R has emerged as one of the most popular environments for processing and analysing metabolomics datasets. A major benefit of such an environment is the possibility of connecting different tools into more complex workflows. Combining reusable data-processing R scripts with the experimental data thus allows for open, reproducible research. This review provides an extensive overview of existing R packages for the different steps in a typical computational metabolomics workflow, including data processing, biostatistics, metabolite annotation and identification, and biochemical network and pathway analysis. Multifunctional workflows, possible user interfaces and integration into workflow management systems are also reviewed. In total, this review summarises more than two hundred metabolomics-specific packages, primarily available on CRAN, Bioconductor and GitHub.

    Towards Efficient Novel Materials Discovery

    The discovery of novel materials with specific functional properties is one of the highest goals in materials science. Screening the structural and chemical space for potential new material candidates is often facilitated by high-throughput methods. Fast yet precise computations are a main tool for such screenings and often start with a geometry relaxation to find the nearest low-energy configuration relative to the input structure. In part I of this work, a new constrained geometry relaxation is presented which maintains the perfect symmetry of a crystal, saves time and resources, and enables relaxations of meta-stable phases and systems with local symmetries or distortions. Apart from improving such computations for a quicker screening of the materials space, better use of existing data is another pillar that can accelerate novel materials discovery. While many different databases exist that make computational results accessible, their usability depends largely on how the data are presented. We here investigate how semantic technologies and graph representations can improve data annotation. A number of different ontologies and knowledge graphs are developed, enabling the semantic representation of crystal structures, materials properties, and experimental results in the field of heterogeneous catalysis. We discuss how the approach of separating ontologies and knowledge graphs breaks down when new knowledge is created with artificial intelligence, and propose an intermediate information layer as a solution. The underlying ontologies can provide background knowledge for possible autonomous intelligent agents in the future. We conclude that making materials science data understandable to machines is still a long way off, and that the direct usefulness of semantic technologies in the domain of materials science is at the moment very limited.
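
    A flavour of the knowledge-graph representation discussed here, sketched with rdflib; the namespace and property names below are invented for illustration and do not correspond to the ontologies developed in this work.

        from rdflib import Graph, Literal, Namespace, RDF

        # Hypothetical namespace and terms, for illustration only.
        MAT = Namespace("https://example.org/materials#")

        g = Graph()
        structure = MAT["NaCl_rocksalt"]
        g.add((structure, RDF.type, MAT.CrystalStructure))
        g.add((structure, MAT.spaceGroup, Literal("Fm-3m")))
        g.add((structure, MAT.bandGapEV, Literal(5.0)))

        print(g.serialize(format="turtle"))  # machine-readable triples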

    Information retrieval and text mining technologies for chemistry

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in text, which commonly involves extracting the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, in particular the CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping extracted chemical names onto chemical structures and their subsequent annotation, together with text mining applications for linking chemistry with biological information, are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field. A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of the UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.
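
    As a toy illustration of the recognition step described here (finding chemical mentions in text before mapping them to structures), a minimal dictionary-based tagger follows; production CHEMDNER systems learn entity boundaries with statistical or neural sequence models rather than exact lookup.

        import re

        # Toy lexicon; real systems are trained on annotated corpora such as CHEMDNER.
        LEXICON = {"aspirin", "acetic acid", "ethanol", "ibuprofen"}

        def tag_chemicals(text: str) -> list[tuple[int, int, str]]:
            """Return (start, end, mention) spans for known chemical names."""
            spans = []
            for name in sorted(LEXICON, key=len, reverse=True):  # prefer longer names
                for m in re.finditer(rf"\b{re.escape(name)}\b", text, re.IGNORECASE):
                    spans.append((m.start(), m.end(), m.group()))
            return sorted(spans)

        print(tag_chemicals("Aspirin was synthesised from acetic acid."))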

    Re-use of tests and arguments for assessing dependable mixed-criticality systems

    The safety assessment of mixed-criticality systems (MCS) is a challenging activity due to system heterogeneity, design constraints and increasing complexity. The foundation for MCSs is the integrated architecture paradigm, where a compact hardware platform comprises multiple execution platforms and communication interfaces to implement concurrent functions with different safety requirements. Besides a computing platform providing adequate isolation and fault tolerance mechanisms, the development of an MCS application must also comply with the guidelines defined by the safety standards. One way to lower the overall MCS certification cost is to adopt a platform-based design (PBD) development approach. PBD is a model-based development (MBD) approach, where separate models of logic, hardware and deployment support the analysis of the resulting system properties and behaviour. The PBD development of MCSs benefits from a composition of modular safety properties (e.g. modular safety cases), which supports the derivation of mixed-criticality product lines. Validation and verification (V&V) activities claim a substantial effort during the development of programmable electronics for safety-critical applications. In the MCS dependability assessment, the purpose of V&V is to provide evidence supporting the safety claims. The model-based development of MCSs adds more V&V tasks, because additional analyses (e.g. simulations) need to be carried out during the design phase. During the MCS integration phase, hardware-in-the-loop (HiL) plant simulators typically support the V&V campaigns, where test automation and fault injection are the key to test repeatability and thorough exercise of the safety mechanisms. This dissertation proposes several V&V artefact re-use strategies to perform early verification at system level for a distributed MCS, artefacts that are later reused up to the final stages of the development process: test-code re-use to verify the fault-tolerance mechanisms on a functional model of the system combined with non-intrusive software fault injection; model-to-X-in-the-loop (XiL) and code-to-XiL re-use to provide models of the plant and distributed embedded nodes suited to the HiL simulator; and, finally, an argumentation framework to support the automated composition and staged completion of modular safety cases for dependability assessment, in the context of the platform-based development of mixed-criticality systems relying on the DREAMS harmonized platform.
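
    The test-code re-use strategy can be pictured as one test written against a common interface and executed first on the functional model and later on the HiL rig. A minimal sketch, with invented class and method names that do not come from the dissertation:

        from abc import ABC, abstractmethod

        class TargetSystem(ABC):
            """Common interface so the same test runs on a model or on a HiL rig."""
            @abstractmethod
            def inject_fault(self, fault: str) -> None: ...
            @abstractmethod
            def read_output(self) -> str: ...

        class FunctionalModel(TargetSystem):
            """Functional model with non-intrusive (simulated) fault injection."""
            def inject_fault(self, fault: str) -> None:
                self._fault = fault
            def read_output(self) -> str:
                return "safe_state" if getattr(self, "_fault", None) else "nominal"

        def test_fault_tolerance(system: TargetSystem) -> None:
            """Reusable test: after a bus fault, the system must reach its safe state."""
            system.inject_fault("bus_error")
            assert system.read_output() == "safe_state"

        test_fault_tolerance(FunctionalModel())  # later: the same test on the HiL rig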