4 research outputs found

Consistency of systematic chemical identifiers within and between small-molecule databases

Author: Akhondi S.A. (Saber)
Kors J.A. (Jan)
Muresan C. (Cornelia)
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 13/12/2012

Background: Correctness of structures and associated metadata within public and commercial chemical databases greatly impacts drug discovery research activities such as quantitative structure-property relationship modelling and compound novelty checking. MOL files, SMILES notations, IUPAC names, and InChI strings are ubiquitous file formats and systematic identifiers for chemical structures. While these are interchangeable for many cheminformatics purposes, there have been no studies on the inconsistency of these structure identifiers due to various approaches for data integration, including the use of different software and different rules for structure standardisation. We have investigated the consistency of systematic identifiers of small molecules within and between some of the commonly used chemical resources, with and without structure standardisation. Results: The consistency between systematic chemical identifiers and their corresponding MOL representation varies greatly between data sources (37.2% to 98.5%). We observed the lowest overall consistency for MOL-IUPAC names. Disregarding stereochemistry increases the consistency (84.8% to 99.9%). A wide variation in consistency also exists between MOL representations of compounds linked via cross-references (25.8% to 93.7%). Removing stereochemistry improved the consistency (47.6% to 95.6%). Conclusions: We have shown that considerable inconsistency exists in structural representation and systematic chemical identifiers within and between databases. This can have a great influence, especially when merging data and when systematic identifiers are used as a key index for structure integration or for cross-querying several databases. Regenerating systematic identifiers starting from their MOL representation and applying well-defined and documented chemistry standardisation rules to all compounds prior to creating them can dramatically increase internal consistency.

Erasmus University Digital Repository

State-of-the-art report: Intergenerational linkages in families

Author: Abramowska-Kmon A. (Anita)
Broek M.P.B. (Thijs) van den
Dykstra P.A. (Pearl)
Haragus M. (Mihaela)
Haragus P-T. (Paul-Theodor)
Kotowska I.E. (Irena)
Muresan C. (Cornelia)
Publication date: 01/01/2014

We present a state-of-the-art review of the literature on linkages between generations within families. We focus specifically on intergenerational coresidence, upward and downward intergenerational transfers in families, and the relationship between norms of family obligation and intergenerational transfers. An overview of the academic literature on these topics is provided, as well as suggestions for future research.

Erasmus University Digital Repository

Annotated chemical patent corpus: A gold standard for text mining

Author: Akhondi S.A. (Saber)
Boppana K. (Kiran)
Jagarlapudi S.A.R.P. (Sarma A. R. P.)
Klenner A.G. (Alexander G.)
Kors J.A. (Jan)
Lowe D. (Daniel)
Manchala A.K. (Anil K.)
Muresan C. (Cornelia)
Sayle R. (Roger)
Tyrchan C. (Christian)
Zimmermann M. (Marc)
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2014

Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups, each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, t

Directory of Open Access Journals

Fraunhofer-ePrints

PubMed Central

EUR Research Repository

Erasmus University Digital Repository

Ambiguity of non-systematic chemical identifiers within and between small-molecule databases

Author: Akhondi S.A. (Saber)
Kors J.A. (Jan)
Muresan C. (Cornelia)
Williams A.J. (Antony)
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/12/2015

Background: A wide range of chemical compound databases are currently available for pharmaceutical research. To retrieve compound information, including structures, researchers can query these chemical databases using non-systematic identifiers. These are source-dependent identifiers (e.g., brand names, generic names), which are usually assigned to the compound at the point of registration. The correctness of non-systematic identifiers (i.e., whether an identifier matches the associated structure) can only be assessed manually, which is cumbersome, but it is possible to automatically check their ambiguity (i.e., whether an identifier matches more than one structure). In this study we have quantified the ambiguity of non-systematic identifiers within and between eight widely used chemical databases. We also studied the effect of chemical structure standardization on reducing the ambiguity of non-systematic identifiers. Results: The ambiguity of non-systematic identifiers within databases varied from 0.1% to 15.2% (median 2.5%). Standardization reduced the ambiguity only to a small extent for most databases. A wide range of ambiguity existed for non-systematic identifiers that are shared between databases (17.7% to 60.2%, median 40.3%). Removing stereochemistry information provided the largest reduction in ambiguity across databases (median reduction 13.7 percentage points). Conclusions: Ambiguity of non-systematic identifiers within chemical databases is generally low, but ambiguity of non-systematic identifiers that are shared between databases is high. Chemical structure standardization reduces the ambiguity to a limited extent. Our findings can help to improve database integration, curation, and maintenance.

Erasmus University Digital Repository
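
The ambiguity measure defined in the last abstract above (an identifier is ambiguous if it matches more than one structure) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `ambiguity_percentage`, the toy records, the placeholder structure keys, and the choice to compare names case-insensitively are all assumptions made for illustration only.

```python
from collections import defaultdict


def ambiguity_percentage(records):
    """Return the percentage of distinct identifiers linked to more than one structure.

    `records` is an iterable of (identifier, structure_key) pairs, where
    structure_key is any hashable stand-in for a structure (for example a
    standardized structure string). Case-insensitive matching of names is an
    assumption of this sketch, not something stated in the abstract.
    """
    structures_by_name = defaultdict(set)
    for name, structure_key in records:
        structures_by_name[name.strip().lower()].add(structure_key)

    if not structures_by_name:
        return 0.0

    # An identifier is ambiguous when it maps to two or more distinct structures.
    ambiguous = sum(1 for keys in structures_by_name.values() if len(keys) > 1)
    return 100.0 * ambiguous / len(structures_by_name)


# Hypothetical toy data: the structure keys are placeholders, not real structures.
records = [
    ("aspirin", "STRUCT-A"),
    ("Aspirin", "STRUCT-A"),      # same name, same structure: not ambiguous
    ("ibuprofen", "STRUCT-B"),
    ("ibuprofen", "STRUCT-C"),    # same name, two structures: ambiguous
    ("caffeine", "STRUCT-D"),
    ("paracetamol", "STRUCT-E"),
]

print(ambiguity_percentage(records))  # 1 of 4 distinct names is ambiguous -> 25.0
```

Under these assumptions, the effect of standardization or stereochemistry removal described in the abstract would correspond to mapping structure keys to a coarser form before calling the function, so that structures differing only in the removed detail collapse to the same key.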