Molassembler: Molecular graph construction, modification and conformer generation for inorganic and organic molecules

Author: Reiher Markus
Sobez Jan-Grimo
Publication venue: 'American Chemical Society (ACS)'
Publication date: 17/07/2020

We present the graph-based molecule software Molassembler for building organic and inorganic molecules. Molassembler provides algorithms for the construction of molecules built from any set of elements from the periodic table. In particular, poly-nuclear transition metal complexes and clusters can be considered. Structural information is encoded as a graph. Stereocenter configurations are interpretable from Cartesian coordinates into an abstract permutation index for an extensible set of polyhedral shapes. Substituents are distinguished through a ranking algorithm. Graph and stereocenter representations are freely modifiable, and chiral state is propagated where possible through incurred ranking changes. Conformers are generated with full stereoisomer control by four-spatial-dimension Distance Geometry with a refinement error function including dihedral terms. Molecules are comparable by an extended graph isomorphism, and their representation is canonicalizable. Molassembler is written in C++ and provides Python bindings.
Comment: 81 pages, 26 figures, 3 tables
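
The idea of comparing molecules through a canonical graph representation can be illustrated with a minimal sketch. This is not Molassembler's algorithm or API: the function names `canonical_form` and `is_isomorphic` are hypothetical, stereocenters and bond orders are ignored, and the brute-force search over all atom orderings is only feasible for very small graphs.

```python
from itertools import permutations


def canonical_form(labels, edges):
    """Return a canonical (labels, edges) tuple for a small labeled graph.

    labels: list of element symbols, one per atom index.
    edges:  iterable of (i, j) atom-index pairs, one per bond.
    The canonical form is the lexicographically smallest relabeling.
    """
    n = len(labels)
    best = None
    for perm in permutations(range(n)):
        # perm[old_index] is the new index assigned to that atom.
        new_labels = [None] * n
        for old, new in enumerate(perm):
            new_labels[new] = labels[old]
        new_edges = sorted(
            tuple(sorted((perm[i], perm[j]))) for i, j in edges
        )
        candidate = (tuple(new_labels), tuple(new_edges))
        if best is None or candidate < best:
            best = candidate
    return best


def is_isomorphic(a, b):
    """Two labeled graphs are isomorphic iff their canonical forms match."""
    return canonical_form(*a) == canonical_form(*b)


# Heavy-atom skeleton of ethanol, numbered two different ways.
ethanol_a = (["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = (["O", "C", "C"], [(0, 1), (1, 2)])
# Dimethyl ether has the same atoms but different connectivity.
ether = (["C", "O", "C"], [(0, 1), (1, 2)])

print(is_isomorphic(ethanol_a, ethanol_b))
print(is_isomorphic(ethanol_a, ether))
```

Because the canonical form does not depend on the input numbering, it can also serve as a dictionary key for detecting duplicate structures.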

arXiv.org e-Print Archive

Dynamic homology and phylogenetic systematics: a unified approach using POY

Author: Aagesen Lone
Arango Claudia P.
D’Haese Cyrille
Faivovich Julián
Giribet Gonzalo
Grant Taran
Janies Daniel
Smith William Leo
Varón Andrés
Wheeler Ward C.
Publication venue: 'American Museum of Natural History (BioOne sponsored)'
Publication date: 01/01/2006

KU ScholarWorks

Development of deep learning applications for the automated extraction of chemical information from scientific literature

Author: Brinkhaus Otto
Publication date: 01/01/2023

This dissertation focuses on developing deep learning applications for extracting chemical information from scientific literature, particularly the automated recognition of molecular structures in images. DECIMER Segmentation, a novel application, employs a Mask Region-based Convolutional Neural Network (MRCNN) model to segment chemical structures in documents, aided by a mask expansion algorithm, marking a significant advancement in processing chemical literature. The Optical Chemical Structure Recognition (OCSR) tool DECIMER Image Transformer uses an encoder-decoder architecture to convert chemical structure depictions into the machine-readable SMILES format. The model has been trained on over 450 million pairs of images and SMILES representations. Its ability to interpret various depiction styles, including hand-drawn structures, sets a new standard in OCSR. RanDepict was developed to artificially generate large and diverse OCSR training datasets using multiple cheminformatics toolkits. The diversification of training data ensures robust model generalisation across different chemical structure depictions. A unique dataset of hand-drawn molecule images was created to evaluate the model's performance in interpreting these challenging depictions. This dataset further contributes to the understanding of automated structure recognition from diverse styles. The integration of these technologies led to the creation of DECIMER.ai, an open-source web application that combines segmentation and interpretation tools, allowing users to extract and process chemical information from literature efficiently. The work concludes with a discussion of the significance of open data in advancing molecular informatics, highlighting its potential to benefit broader chemical research domains. By adhering to FAIR data standards and open-source principles, the tools developed for this dissertation are designed for adaptability and future development within the community.

Digitale Bibliothek Thüringen

The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4

Author: AI4Science Microsoft Research
Quantum Microsoft Azure
Publication date: 08/12/2023

In recent years, groundbreaking advancements in natural language processing have culminated in the emergence of powerful large language models (LLMs), which have showcased remarkable capabilities across a vast array of domains, including the understanding, generation, and translation of natural language, and even tasks that extend beyond language processing. In this report, we delve into the performance of LLMs within the context of scientific discovery, focusing on GPT-4, the state-of-the-art language model. Our investigation spans a diverse range of scientific areas encompassing drug discovery, biology, computational chemistry (density functional theory (DFT) and molecular dynamics (MD)), materials design, and partial differential equations (PDE). Evaluating GPT-4 on scientific tasks is crucial for uncovering its potential across various research domains, validating its domain-specific expertise, accelerating scientific progress, optimizing resource allocation, guiding future model development, and fostering interdisciplinary research. Our exploration methodology primarily consists of expert-driven case assessments, which offer qualitative insights into the model's comprehension of intricate scientific concepts and relationships, and occasionally benchmark testing, which quantitatively evaluates the model's capacity to solve well-defined domain-specific problems. Our preliminary exploration indicates that GPT-4 exhibits promising potential for a variety of scientific applications, demonstrating its aptitude for handling complex problem-solving and knowledge integration tasks. Broadly speaking, we evaluate GPT-4's knowledge base, scientific understanding, scientific numerical calculation abilities, and various scientific prediction capabilities.
Comment: 230-page report; 181 pages for main content

arXiv.org e-Print Archive

Chemoinformatics approaches for new drugs discovery

Author: Fanton Marco
Publication date: 22/01/2013

Chemoinformatics uses computational methods and technologies to solve chemical problems. It works on molecular structures, their representations, properties and related data. The first and most important phase in this field is the translation of interconnected atomic systems into in-silico models, ensuring complete and correct transfer of chemical information. In the last 20 years, chemical databases have evolved from molecular repositories into research tools for identifying new drugs, while modern high-throughput technologies allow chemical libraries to keep growing, as highlighted by publicly available repositories such as PubChem [http://pubchem.ncbi.nlm.nih.gov/], ZINC [http://zinc.docking.org/] and ChemSpider [http://www.chemspider.com/]. The fundamental requirements of chemical libraries are molecular uniqueness, absence of ambiguity, chemical correctness (related to atoms, bonds, chemical orthography), and standardized storage and registration formats. The aim of this work is the development of chemoinformatics tools and data for the drug discovery process. The first part of the research project focused on analysis of the accessible commercial chemical space, looking for molecular redundancy and checking the correctness of in-silico models in order to identify a unique and univocal molecular descriptor for indexing chemical libraries. This achieved 0% redundancy on a library of 42 million compounds. The protocol was implemented as MMsDusty, a web-based tool for cleaning molecular databases. The major protocol developed is MMsINC, a chemoinformatics platform based on an initial set of 4 million non-redundant, high-quality, annotated and biomedically relevant chemical structures; the library is now being expanded to 460 million compounds. MMsINC can perform various types of queries, such as substructure or similarity searches and descriptor filtering. MMsINC is interfaced with the PDB (Protein Data Bank) [http://www.rcsb.org/pdb/home/home.do] and related to approved drugs. The second protocol developed is called pepMMsMIMIC, a peptidomimetic screening tool based on multiconformational chemical libraries; the screening process uses pharmacophoric fingerprint similarity to identify small molecules able to geometrically and chemically mimic endogenous peptides or proteins. The last part of this project led to the implementation of an optimized and exhaustive conformational space analysis protocol for small-molecule libraries; this is crucial for the high-quality 3D molecular model prediction required in chemoinformatics applications. The torsional exploration was optimized in the range of the most frequent dihedral angles seen in X-ray-solved small-molecule structures in the CSD (Cambridge Structural Database); applying this to a library of 89 million structures generated a library of 2.6 × 10^7 high-quality conformers. The tools, protocols and platforms developed in this work allow chemoinformatics analysis and screening of large chemical libraries, yielding high-quality, correct and unique chemical data and in-silico models.
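
A fingerprint-based similarity search of the kind mentioned in this abstract can be sketched as follows. The abstract does not state which similarity measure MMsINC or pepMMsMIMIC uses; the Tanimoto coefficient is assumed here because it is a common choice for fingerprints. The function names, the toy fingerprints (sets of on-bit indices) and the 0.5 threshold are all hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient between two sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 1.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similarity_search(query, library, threshold=0.5):
    """Return (name, score) pairs scoring at least threshold, best first."""
    scored = ((name, tanimoto(query, fp)) for name, fp in library.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy library: each molecule is represented only by its fingerprint bits.
library = {
    "mol_a": {1, 4, 7, 9},
    "mol_b": {1, 4, 8},
    "mol_c": {2, 3},
}
query = {1, 4, 7}

hits = similarity_search(query, library)
print(hits)  # → [('mol_a', 0.75), ('mol_b', 0.5)]
```

In a real library the fingerprints would come from a cheminformatics toolkit, and a search over millions of compounds would need an indexed or vectorized implementation instead of this linear scan.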

Archivio istituzionale della ricerca - Università di Padova

Biomedical text mining: State-of-the-art, open problems and future challenges

Author: Holzinger Andreas
Schantl Johannes
Schroettner Miriam
Seifert Christin
Verspoor Karin
Publication venue: Springer
Publication date: 01/01/2014

Crossref

University of Twente Research Information

Repository/R-Forge/DateTimeStamp 2012-12-11 16:03:18

Author: Anamaria Necsulea
Delphine Charif
Guy Perriere
Jean R. Lobry
Leonor Palmeira
Simon Penel
Depends: R
LazyData: Yes
NeedsCompilation: Yes

Suggests: ade4, segmented
Description: Exploratory data analysis and data visualization for biological sequence (DNA and protein) data. Also includes utilities for sequence data management under the ACNUC system.
License: GPL (>= 2

CiteSeerX