
    Evaluating gold standard corpora against gene/protein tagging solutions and lexical resources

    Motivation: The identification of protein and gene names (PGNs) from the scientific literature requires semantic resources: terminological and lexical resources deliver the term candidates to PGN tagging solutions, and gold standard corpora (GSCs) train them to identify term parameters and contextual features. Ideally all three resources, i.e. corpora, lexica and taggers, cover the same domain knowledge and thus support identification of the same types of PGNs, covering all of them. Unfortunately, none of the three serves as a predominant standard, and for this reason it is worth exploring how these three resources comply with each other. We systematically compare different PGN taggers against publicly available corpora and analyze the impact of the included lexical resource on their performance. In particular, we determine the performance gains through false positive filtering, which contributes to the disambiguation of identified PGNs.

    Results: In general, machine learning approaches (ML-Tag) for PGN tagging show higher F1-measure performance against the BioCreative-II and JNLPBA GSCs (exact matching), whereas the lexicon-based approaches (LexTag) in combination with disambiguation methods show better results on FsuPrge and PennBio. The ML-Tag solutions balance precision and recall, whereas the LexTag solutions have different precision and recall profiles at the same F1-measure across all corpora. Higher recall is achieved with larger lexical resources, which also introduce more noise (false positive results). The ML-Tag solutions perform best if the test corpus is from the same GSC as the training corpus. As expected, the false negative errors characterize the test corpora while the profiles of the false positive mistakes characterize the tagging solutions. LexTag solutions that are based on a large terminological resource in combination with false positive filtering produce better results and, in addition, provide concept identifiers from a knowledge source, in contrast to ML-Tag solutions.

    Conclusion: The standard ML-Tag solutions achieve high performance, but not across all corpora, and thus should be trained using several different corpora to reduce possible biases. The LexTag solutions have different profiles for their precision and recall performance, but similar F1-measure. This result is surprising and suggests that they cover a portion of the most common naming standards but cope differently with the term variability across the corpora. The false positive filtering applied to LexTag solutions does improve the results by increasing their precision without significantly compromising their recall. The harmonisation of the annotation schemes, in combination with standardized lexical resources in the tagging solutions, will enable their comparability and will pave the way for a shared standard.
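    To make the exact-matching evaluation above concrete, the sketch below computes precision, recall and F1-measure for a tagger's PGN annotations against a gold standard, and shows how false positive filtering raises precision without touching recall. This is a minimal sketch, not the evaluation code used in the study; the (document id, start offset, end offset) tuple layout is an assumption for illustration.

```python
# Minimal sketch of exact-match NER evaluation: an annotation counts as a
# true positive only if its (doc_id, start, end) span matches the gold
# standard exactly. The tuple layout is an assumption for illustration.

def evaluate_exact(gold: set, predicted: set) -> dict:
    tp = len(gold & predicted)   # spans found by the tagger and in the GSC
    fp = len(predicted - gold)   # tagger spans absent from the GSC
    fn = len(gold - predicted)   # GSC spans the tagger missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: one false positive lowers precision; filtering it out raises
# precision without changing recall, mirroring the effect described above.
gold = {("doc1", 10, 18), ("doc1", 42, 47)}
pred = {("doc1", 10, 18), ("doc1", 42, 47), ("doc1", 90, 95)}
print(evaluate_exact(gold, pred))                       # before FP filtering
print(evaluate_exact(gold, pred - {("doc1", 90, 95)}))  # after FP filtering
```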

    Vec2SPARQL: integrating SPARQL queries and knowledge graph embeddings

    Recent developments in machine learning have led to a rise in the number of methods for extracting features from structured data. The features are represented as vectors and may encode some semantic aspects of the data. They can be used in machine learning models for different tasks or to compute similarities between the entities of the data. SPARQL is a query language for structured data originally developed for querying Resource Description Framework (RDF) data. It has been in use for over a decade as a standardized NoSQL query language. Many different tools have been developed to enable data sharing with SPARQL. For example, SPARQL endpoints make data interoperable and available to the world, and SPARQL queries can be executed across multiple endpoints. We have developed Vec2SPARQL, a general framework for integrating structured data and their vector space representations. Vec2SPARQL allows jointly querying vector functions, such as computing similarities (cosine, correlations) or classifications with machine learning models, within a single SPARQL query. We demonstrate applications of our approach for biomedical and clinical use cases. Our source code is freely available at https://github.com/bio-ontology-research-group/vec2sparql and we make a Vec2SPARQL endpoint available at http://sparql.bio2vec.net/
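    To illustrate the idea of calling a vector function inside a single SPARQL query, the sketch below issues a similarity-ranked query against the published endpoint. The abstract does not specify the function names Vec2SPARQL exposes, so the bio2vec: prefix, the similarity() function, and the entity URIs are hypothetical placeholders showing the shape of usage, not the framework's confirmed API.

```python
# Illustrative sketch only: the bio2vec prefix, similarity() function and
# entity URIs below are ASSUMED names, not confirmed Vec2SPARQL API. The
# endpoint URL is the one published with the paper.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX bio2vec: <http://bio2vec.net/functions#>    # hypothetical namespace
SELECT ?other ?sim WHERE {
  ?other a <http://example.org/Gene> .             # hypothetical class
  BIND(bio2vec:similarity(<http://example.org/Gene/BRCA1>, ?other) AS ?sim)
}
ORDER BY DESC(?sim)
LIMIT 10
"""

sparql = SPARQLWrapper("http://sparql.bio2vec.net/")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["other"]["value"], row["sim"]["value"])
```

    The design point is that similarity computation over embeddings and graph pattern matching happen in one query, instead of exporting query results and ranking them in a separate script.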

    What is the right sequencing approach? Solo vs. extended family analysis in consanguineous populations.

    Choosing the right testing strategy is crucial for genetics clinics and testing laboratories. In this study, we compared the hit rate between solo, trio and trio-plus testing, and between trio and sibship testing. Finally, we studied the impact of extended family analysis, mainly in complex and unsolved cases. Three cohorts were used for this analysis: one cohort to assess the hit rate of solo, trio and trio-plus testing, another cohort to examine the impact of sibship genome testing vs trio-based analysis, and a third cohort to test the impact of an extended family analysis of up to eight family members on lowering the number of candidate variants. The hit rates of solo, trio and trio-plus testing were 39, 40 and 41%, respectively. The total number of candidate variants in the sibship testing strategy was 117, compared to 59 in the trio-based analysis. The average trio-based analysis yielded 1,192 coding and 26,454 noncoding candidate variants, and these numbers were lowered by 50-75% after adding further family members, down to as few as two coding and 66 noncoding homozygous variants in families with eight members. There was no difference in the hit rate between solo and extended family testing. Trio-based analysis was a better approach than sibship testing, even in a consanguineous population. Finally, each additional family member helped to narrow down the number of candidate variants by 50-75%. Our findings could help clinicians, researchers and testing laboratories select the most cost-effective and appropriate sequencing approach for their patients. Furthermore, extended family analysis is a very useful tool for complex cases with novel genes.
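    The variant-narrowing effect described above can be sketched as segregation filtering: each additional genotyped relative removes candidates whose genotypes are inconsistent with a recessive model. The sketch below is a minimal illustration under an assumed data layout (variant id mapped to per-person genotype strings); real pipelines operate on VCFs with many more filters.

```python
# Minimal sketch of how adding family members narrows candidate variants in
# a consanguineous, recessive scenario: keep variants homozygous in every
# affected member and not homozygous in any unaffected member.
# The data layout (variant -> {person: genotype}) is an assumption.

def narrow_candidates(genotypes, affected, unaffected):
    candidates = set()
    for variant, calls in genotypes.items():
        hom_in_affected = all(calls.get(p) == "1/1" for p in affected)
        hom_in_unaffected = any(calls.get(p) == "1/1" for p in unaffected)
        if hom_in_affected and not hom_in_unaffected:
            candidates.add(variant)
    return candidates

genotypes = {
    "chr1:12345A>G": {"proband": "1/1", "father": "0/1",
                      "mother": "0/1", "sib": "0/1"},
    "chr2:67890C>T": {"proband": "1/1", "father": "0/1",
                      "mother": "0/1", "sib": "1/1"},
}
# Trio alone keeps both variants; adding a healthy sibling excludes the
# variant the sibling also carries homozygously, halving the candidate list.
print(narrow_candidates(genotypes, ["proband"], ["father", "mother"]))
print(narrow_candidates(genotypes, ["proband"], ["father", "mother", "sib"]))
```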

    Assessment of NER solutions against the first and second CALBC Silver Standard Corpus

    Background: Competitions in text mining have been used to measure the performance of automatic text processing solutions against a manually annotated gold standard corpus (GSC). The preparation of the GSC is time-consuming and costly, and the final corpus consists at most of a few thousand documents annotated with a limited set of semantic groups. To overcome these shortcomings, the CALBC project partners (PPs) have produced a large-scale annotated biomedical corpus with four different semantic groups through the harmonisation of annotations from automatic text mining solutions: the first version of the Silver Standard Corpus (SSC-I). The four semantic groups are chemical entities and drugs (CHED), genes and proteins (PRGE), diseases and disorders (DISO) and species (SPE). This corpus has been used for the First CALBC Challenge, which asked the participants to annotate the corpus with their text processing solutions.

    Results: All four PPs from the CALBC project and, in addition, 12 challenge participants (CPs) contributed annotated data sets for an evaluation against the SSC-I. CPs could ignore the training data and deliver the annotations from their genuine annotation system, or could train a machine-learning approach on the provided pre-annotated data. In general, the performances of the annotation solutions were lower for entities from the categories CHED and PRGE in comparison to the identification of entities categorized as DISO and SPE. The best performance over all semantic groups was achieved by two annotation solutions that had been trained on the SSC-I. The data sets from participants were used to generate the harmonised Silver Standard Corpus II (SSC-II), provided the participant had not made use of the annotated data set from the SSC-I for training purposes. The performances of the participants' solutions were then measured against the SSC-II and again showed better results for DISO and SPE in comparison to CHED and PRGE.

    Conclusions: The SSC-I delivers a large set of annotations (1,121,705) for a large number of documents (100,000 Medline abstracts). The annotations cover four different semantic groups and are sufficiently homogeneous to be reproduced with a trained classifier, leading to an average F-measure of 85%. Benchmarking the annotation solutions against the SSC-II leads to better performance for the CPs' annotation solutions in comparison to the SSC-I.
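    The harmonisation step that turns several systems' annotations into a silver standard can be sketched as majority voting over annotated spans: a span enters the corpus when enough independent systems agree on it. The sketch below is a minimal illustration under assumed structures; the actual CALBC harmonisation also reconciled overlapping span boundaries, so exact-span voting here is a simplifying assumption.

```python
# Minimal sketch of silver-standard harmonisation by majority voting:
# a span is accepted when at least `min_votes` systems proposed it with the
# same semantic group. Exact-span agreement is a simplifying assumption;
# the real CALBC pipeline also reconciled overlapping boundaries.
from collections import Counter

def harmonise(system_annotations, min_votes=2):
    votes = Counter()
    for annotations in system_annotations:   # one list per tagging system
        for span in set(annotations):        # (doc, start, end, group)
            votes[span] += 1
    return {span for span, n in votes.items() if n >= min_votes}

tagger_a = [("doc1", 0, 7, "PRGE"), ("doc1", 20, 28, "DISO")]
tagger_b = [("doc1", 0, 7, "PRGE"), ("doc1", 40, 45, "SPE")]
tagger_c = [("doc1", 20, 28, "DISO"), ("doc1", 40, 45, "SPE")]
# Every span proposed by >= 2 of the 3 systems enters the silver standard.
print(harmonise([tagger_a, tagger_b, tagger_c]))
```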