193 research outputs found

User interface for extracting chemical structure information from the systematic name of an organic compound

Author: L. A. Grigoryan
Publication venue: Belarusian National Technical University
Publication date: 04/10/2023

The "Nomenclature Generator" user interface, designed to automatically extract chemical structure information from the systematic name of an organic compound given in IUPAC nomenclature, was developed at the All-Russian Institute for Scientific and Technical Information of the Russian Academy of Sciences (VINITI RAS).

System analysis and applied information science (E-Journal) / Системный анализ и прикладная информатика

User interface for extracting chemical structure information from the systematic name of an organic compound

Author: L. A. Grigoryan
Publication venue: BNTU

The "Nomenclature Generator" user interface, designed to automatically extract chemical structure information from the systematic name of an organic compound given in IUPAC nomenclature, was developed at VINITI RAS.

Repository of Belarusian National Technical University (BNTU)

SELFIES and the future of molecular string representations

Author: Ai Qianxiang
Barthel Senja
Carson Nessa
Frei Angelo
Frey Nathan C.
Friederich Pascal
Gaudin Théophile
Gayle Alberto Alexander
Krenn Mario
Moosavi Seyed Mohamad
Publication date: 01/01/2022

Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings; most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELFIES (SELF-referencIng Embedded Strings). SELFIES has since simplified and enabled numerous new applications in chemistry. In this perspective, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete future projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages, and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.
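
The robustness problem described in this abstract can be illustrated with a small sketch. The checker below is not a SMILES parser and is not taken from the paper: it is a hypothetical, deliberately tiny test of necessary syntax conditions (branches balanced, ring-closure digits paired, no dangling bond) over an assumed eight-symbol alphabet. It only shows why randomly assembled symbol strings are mostly rejected.

```python
import random

# Assumed toy alphabet; real SMILES has many more symbols.
ATOMS = ("C", "N", "O")
ALPHABET = ATOMS + ("(", ")", "=", "1", "2")


def is_plausible_smiles(s: str) -> bool:
    """Check necessary (not sufficient) syntax conditions for the toy alphabet."""
    if not s or s[0] not in ATOMS:
        return False
    depth = 0            # current branch nesting level
    open_rings = set()   # ring-closure digits seen an odd number of times
    prev = ""
    for ch in s:
        if ch == "(":
            if prev in ("(", "="):
                return False
            depth += 1
        elif ch == ")":
            if depth == 0 or prev in ("(", "="):
                return False
            depth -= 1
        elif ch == "=":
            if prev == "=":
                return False
        elif ch.isdigit():
            if prev == "(":
                return False
            open_rings ^= {ch}  # toggle: first use opens, second use closes
        prev = ch
    return depth == 0 and not open_rings and prev != "="


def invalid_fraction(n_samples: int = 10_000, length: int = 8, seed: int = 0) -> float:
    """Fraction of random fixed-length strings rejected by the toy checker."""
    rng = random.Random(seed)
    rejected = sum(
        not is_plausible_smiles("".join(rng.choices(ALPHABET, k=length)))
        for _ in range(n_samples)
    )
    return rejected / n_samples


if __name__ == "__main__":
    print(is_plausible_smiles("C(=O)O"), is_plausible_smiles("C1CC"))
    print(f"rejected: {invalid_fraction():.1%}")
```

Because only three of the eight symbols can start a string, well over half of the random samples fail on the first character alone. A representation in which every symbol sequence decodes to a molecule, as the abstract describes for SELFIES, sidesteps this class of failure by construction.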

Institutional Repository of the Freie Universität Berlin

SELFIES and the future of molecular string representations

arXiv.org e-Print Archive

VU Research Portal

Proceedings - University of Groningen

KITopen

ARTS repository - University of Groningen

PubMed Central

eScholarship - University of California

MPG.PuRe

Dissertations of the University of Groningen

Extraction of chemical structures and reactions from the literature

Author: Lowe Daniel Mark
Publication venue: University of Cambridge
Publication date: 09/10/2012

The ever-increasing quantity of chemical literature necessitates the creation of automated techniques for extracting relevant information. This work focuses on two aspects: the conversion of chemical names to computer-readable structure representations and the extraction of chemical reactions from text. Chemical names are a common way of communicating chemical structure information. OPSIN (Open Parser for Systematic IUPAC Nomenclature), an open source, freely available algorithm for converting chemical names to structures, was developed. OPSIN employs a regular grammar to direct tokenisation and parsing, leading to the generation of an XML parse tree. Nomenclature operations are applied successively to the tree, with many requiring the manipulation of an in-memory connection table representation of the structure under construction. Areas of nomenclature supported are described, with attention being drawn to difficulties that may be encountered in name-to-structure conversion. Results on sets of generated names and names extracted from patents are presented. On generated names, recall of between 96.2% and 99.0% was achieved with a lower bound of 97.9% on precision, with all results being either comparable or superior to the tested commercial solutions. On the patent names, OPSIN's recall was 2-10% higher than the tested solutions when the patent names were processed as found in the patents. The uses of OPSIN as a web service and as a tool for identifying chemical names in text are shown to demonstrate the direct utility of this algorithm.
A software system for extracting chemical reactions from the text of chemical patents was developed. The system relies on the output of ChemicalTagger, a tool for tagging words and identifying phrases of importance in experimental chemistry text. Improvements to this tool required to facilitate this task are documented. The structures of chemical entities are, where possible, determined using OPSIN in conjunction with a dictionary of name-to-structure relationships. Extracted reactions are atom mapped to confirm that they are chemically consistent. 424,621 atom-mapped reactions were extracted from 65,034 organic chemistry USPTO patents. On a sample of 100 of these extracted reactions, chemical entities were identified with 96.4% recall and 88.9% precision. Quantities could be associated with reagents in 98.8% of cases and with products in 64.9% of cases, whilst the correct role was assigned to chemical entities in 91.8% of cases. Qualitatively, the system captured the essence of the reaction in 95% of cases. This system is expected to be useful in the creation of searchable databases of reactions from chemical patents and in facilitating analysis of the properties of large populations of reactions.
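
The idea of a regular grammar directing tokenisation into an XML parse tree can be sketched in a few lines. This is not OPSIN's grammar or code: the token classes, the tiny vocabulary (`STEMS`, `SUBSTITUENTS`) and the function `parse_name` are hypothetical, and the sketch handles only simple names such as "2-methylpropane".

```python
import re
import xml.etree.ElementTree as ET

# Assumed toy vocabulary; a real nomenclature parser needs far more.
STEMS = ("meth", "eth", "prop", "but", "pent", "hex")
SUBSTITUENTS = ("methyl", "ethyl", "chloro", "bromo")

# Substituents are tried before stems so "methyl" is not split into "meth" + "yl".
TOKEN_RE = re.compile(
    r"(?P<locant>\d+)-"
    r"|(?P<substituent>" + "|".join(SUBSTITUENTS) + r")"
    r"|(?P<stem>" + "|".join(STEMS) + r")"
    r"|(?P<suffix>ane)"
)

# One letter per token class, so the allowed token order is itself a regular expression:
# any number of optionally numbered substituents, then a stem, then a suffix.
KIND_CODE = {"locant": "l", "substituent": "s", "stem": "r", "suffix": "x"}
NAME_GRAMMAR = re.compile(r"(l?s)*rx")


def parse_name(name: str) -> ET.Element:
    """Tokenise a simple alkane name and return a flat XML parse tree."""
    root = ET.Element("molecule", name=name)
    pos = 0
    while pos < len(name):
        match = TOKEN_RE.match(name, pos)
        if match is None:
            raise ValueError(f"cannot tokenise {name!r} at position {pos}")
        kind = match.lastgroup
        ET.SubElement(root, kind).text = match.group(kind)
        pos = match.end()
    codes = "".join(KIND_CODE[child.tag] for child in root)
    if not NAME_GRAMMAR.fullmatch(codes):
        raise ValueError(f"token order {codes!r} is not a valid name")
    return root


if __name__ == "__main__":
    print(ET.tostring(parse_name("2-methylpropane"), encoding="unicode"))
```

A real system would then walk this tree and apply nomenclature operations to build a connection table, which the sketch leaves out.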

Apollo (Cambridge)

SELFIES and the future of molecular string representations

Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings; most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELFIES (SELF-referencIng Embedded Strings). SELFIES has since simplified and enabled numerous new applications in chemistry. In this manuscript, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete Future Projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages, and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science. Comment: 34 pages, 15 figures; comments and suggestions for additional references are welcome.

arXiv.org e-Print Archive

MPG.PuRe

KITopen

Bioinformatics and molecular modeling in glycobiology

Author: Martin Frank
Siegfried Schloissnig
Publication venue: SP Birkhäuser Verlag Basel
Publication date: 01/01/2010

The field of glycobiology is concerned with the study of the structure, properties, and biological functions of the family of biomolecules called carbohydrates. Bioinformatics for glycobiology is a particularly challenging field, because carbohydrates exhibit a high structural diversity and their chains are often branched. Significant improvements in experimental analytical methods over recent years have led to a tremendous increase in the amount of carbohydrate structure data generated. Consequently, the availability of databases and tools to store, retrieve and analyze these data in an efficient way is of fundamental importance to progress in glycobiology. In this review, the various graphical representations and sequence formats of carbohydrates are introduced, and an overview of newly developed databases, the latest developments in sequence alignment and data mining, and tools to support experimental glycan analysis is presented. Finally, the fields of structural glycoinformatics and molecular modeling of carbohydrates, glycoproteins, and protein–carbohydrate interactions are reviewed.

Crossref

Springer - Publisher Connector

PubMed Central

Automatic Analysis and Validation of the Chemical Literature

Author: Townsend JA
Publication venue: University of Cambridge
Publication date: 01/02/2008

Thesis. Methods to automatically extract data from the chemical literature in legacy formats, validate it, and convert it to machine-understandable forms are examined. The work focuses on three types of data: analytical data reported in articles, computational chemistry output files and crystallographic information files (CIFs). It is shown that machines are capable of reading and extracting analytical data from the current legacy formats with high recall and precision. Regular expressions cannot identify chemical names with high precision or recall, but non-deterministic methods perform significantly better. The lack of machine-understandable connection tables in the literature has been identified as the major issue preventing molecule-based data-driven science from being performed in the area. The extraction of data from computational chemistry output files using parser-like approaches is shown to be not generally possible, although such methods work well for input files. A hierarchical regular-expression-based approach can parse more than 99.9% of the output files correctly, although significant human input is required to prepare the templates. CIFs may be parsed with extremely high recall and precision, contain connection tables, and the data are of high quality. The comparison of bond lengths calculated by two computational chemistry programs shows good agreement in general, but structures containing specific moieties cause discrepancies. An initial protocol for the high-throughput geometry optimisation of molecules extracted from the CIFs is presented and the refinement of this protocol is discussed. Differences in bond length between calculated and experimentally determined values from the CIFs of less than 0.03 Angstrom are shown to be expected by random error. The final protocol is used to find high-quality structures from crystallography which can be reused for further science. Unilever Centre for Molecular Science Informatics.
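
A two-level regular-expression pass of the kind this abstract alludes to might look as follows for analytical data in article text. The patterns, the sample sentence and the function `extract_nmr` are assumptions made for illustration; they are not the templates used in the thesis and cover only one common way of reporting 1H NMR data.

```python
import re

# Outer pattern: locate the NMR block and capture its header fields.
BLOCK_RE = re.compile(
    r"1H NMR \((?P<freq>\d+) MHz, (?P<solvent>[^)]+)\)\s*δ\s*"
    r"(?P<peaks>.+?)\.(?=\s|$)"   # block ends at a full stop followed by space or end
)

# Inner pattern: one peak, e.g. "7.26 (d, J = 8.0 Hz, 2H)" or "3.81 (s, 3H)".
PEAK_RE = re.compile(
    r"(?P<shift>\d+\.\d+) \((?P<mult>[a-z]+),"
    r"(?: J = (?P<J>\d+\.\d+) Hz,)? (?P<nH>\d+)H\)"
)


def extract_nmr(text: str):
    """Return frequency, solvent and peak list for the first NMR block, or None."""
    block = BLOCK_RE.search(text)
    if block is None:
        return None
    peaks = [
        {
            "shift_ppm": float(m["shift"]),
            "multiplicity": m["mult"],
            "J_Hz": float(m["J"]) if m["J"] else None,
            "nH": int(m["nH"]),
        }
        for m in PEAK_RE.finditer(block["peaks"])
    ]
    return {
        "frequency_MHz": int(block["freq"]),
        "solvent": block["solvent"],
        "peaks": peaks,
    }


# Invented sample text in a typical experimental-section style.
SAMPLE = (
    "The product was obtained as a white solid. "
    "1H NMR (400 MHz, CDCl3) δ 7.26 (d, J = 8.0 Hz, 2H), 3.81 (s, 3H), 2.35 (s, 3H). "
    "MS m/z 151."
)

if __name__ == "__main__":
    print(extract_nmr(SAMPLE))
```

Splitting the work into an outer block pattern and an inner peak pattern keeps each expression short. Extracted values can then be validated, for example by checking that the proton counts sum to the number expected from the molecular formula.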

Apollo (Cambridge)

Terminology of bioanalytical methods (IUPAC Recommendations 2018)

Author: Bowater Richard P.
Fojta Miroslav
Gauglitz Günter
Glatz Zdeněk
Hapala Ivan
Havliš Jan
Hibbert David Brynn
Kilar Aniko
Kilar Ferenc
Labuda Ján
Malinovská Lenka
Sirén Heli M. M.
Skládal Petr
Torta Federico
Valachovič Martin
Wimmerová Michaela
Zdráhal Zbyněk
Publication venue: Walter de Gruyter GmbH
Publication date: 01/07/2018

Recommendations are given concerning the terminology of methods in bioanalytical chemistry. In view of the dynamic development of the field, particularly in the analysis and investigation of biomacromolecules, terms are introduced relating to bioanalytical samples, enzymatic methods, immunoanalytical methods, methods used in genomics and nucleic acid analysis, proteomics, metabolomics, glycomics, lipidomics, and biomolecular interaction studies.

Crossref

University of East Anglia digital repository