
    Is the crowd better as an assistant or a replacement in ontology engineering? An exploration through the lens of the Gene Ontology

    Biomedical ontologies contain errors. Crowdsourcing, defined as taking a job traditionally performed by a designated agent and outsourcing it to an undefined large group of people, provides scalable access to humans. The crowd therefore has the potential to overcome the limited accuracy and scalability of current ontology quality assurance approaches. Crowd-based methods have identified errors in SNOMED CT, a large clinical ontology, with an accuracy similar to that of experts, suggesting that crowdsourcing is indeed a feasible approach for identifying ontology errors. This work uses that same crowd-based methodology, as well as a panel of experts, to verify a subset of the Gene Ontology (200 relationships). Experts identified 16 errors, generally in relationships referencing acids and metals. The crowd performed poorly in identifying those errors, with an area under the receiver operating characteristic curve ranging from 0.44 to 0.73, depending on the method's configuration. However, when the crowd verified what experts considered to be easy relationships with useful definitions, they performed reasonably well. Notably, there are significantly fewer Google search results for Gene Ontology concepts than for SNOMED CT concepts. This disparity may account for the difference in performance: fewer search results indicate a more difficult task for the worker. The number of Internet search results could serve as a method to assess which tasks are appropriate for the crowd. These results suggest that the crowd fits better as an expert assistant, helping experts with their verification by completing the easy tasks and allowing experts to focus on the difficult tasks, rather than as an expert replacement.
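    The crowd's performance above is summarized by the area under the ROC curve over its verification judgements. As a rough illustration only (not the study's code), the following sketch computes that figure with scikit-learn from hypothetical expert labels and aggregated crowd scores:

```python
# Minimal sketch: AUC of crowd verification votes against expert judgements.
# The labels and crowd scores below are hypothetical placeholders.
from sklearn.metrics import roc_auc_score

# 1 = relationship judged erroneous by experts, 0 = judged correct
expert_labels = [1, 0, 0, 1, 0, 0, 1, 0]
# Fraction of crowd workers flagging each relationship as erroneous
crowd_scores = [0.6, 0.2, 0.4, 0.3, 0.1, 0.5, 0.7, 0.2]

auc = roc_auc_score(expert_labels, crowd_scores)
print(f"Crowd verification AUC: {auc:.2f}")
```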

    Towards Best Practices for Crowdsourcing Ontology Alignment Benchmarks

    Ontology alignment systems establish the semantic links between ontologies that enable knowledge from various sources and domains to be used by automated applications in many different ways. Unfortunately, these systems are not perfect. Currently, the results of even the best-performing automated alignment systems need to be manually verified in order to be fully trusted. Ontology alignment researchers have turned to crowdsourcing platforms such as Amazon's Mechanical Turk to accomplish this. However, there has been little systematic analysis of the accuracy of crowdsourcing for alignment verification, and few best practices have been established. In this work, we analyze how the presentation of the context of potential matches, and the way in which the question is posed to workers, affect the accuracy of crowdsourcing for alignment verification. Our overall recommendation is that users interested in high precision are likely to achieve the best results by presenting the definitions of the entity labels and allowing workers to respond with true/false to the question of whether or not an equivalence relationship exists. Conversely, if the alignment researcher is interested in high recall, they are better off presenting workers with a graphical depiction of the entity relationships and a set of options about the type of relation that exists, if any.
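    The two recommended task designs can be pictured as two ways of packaging the same candidate match for workers. The sketch below is a hypothetical illustration of those formats, not the paper's implementation; all field and function names are invented:

```python
# Hypothetical sketch of the two verification-task formats compared in the study.
from dataclasses import dataclass, field

@dataclass
class VerificationTask:
    source_label: str
    target_label: str
    source_definition: str
    target_definition: str
    question: str = ""
    options: list = field(default_factory=list)

def true_false_task(src, tgt, src_def, tgt_def):
    """High-precision setup: show definitions, ask a true/false equivalence question."""
    return VerificationTask(
        src, tgt, src_def, tgt_def,
        question=f"Do '{src}' and '{tgt}' refer to the same concept?",
        options=["true", "false"],
    )

def relation_choice_task(src, tgt, src_def, tgt_def):
    """High-recall setup: ask which relation, if any, holds between the entities."""
    return VerificationTask(
        src, tgt, src_def, tgt_def,
        question=f"What is the relationship between '{src}' and '{tgt}'?",
        options=["equivalent", "more specific than", "more general than",
                 "related but not equivalent", "unrelated"],
    )
```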

    Knowledge-based Biomedical Data Science 2019

    Knowledge-based biomedical data science (KBDS) involves the design and implementation of computer systems that act as if they knew about biomedicine. Such systems depend on formally represented knowledge, often in the form of knowledge graphs. Here we survey the progress in the last year in systems that use formally represented knowledge to address data science problems in both clinical and biological domains, as well as approaches for creating knowledge graphs. Major themes include the relationships between knowledge graphs and machine learning, the use of natural language processing, and the expansion of knowledge-based approaches to novel domains, such as Traditional Chinese Medicine and biodiversity.

    Linked Data Quality Assessment and its Application to Societal Progress Measurement

    In recent years, the Linked Data (LD) paradigm has emerged as a simple mechanism for employing the Web as a medium for data and knowledge integration, where both documents and data are linked. Moreover, the semantics and structure of the underlying data are kept intact, making this the Semantic Web. LD essentially entails a set of best practices for publishing and connecting structured data on the Web, which allows publishing and exchanging information in an interoperable and reusable fashion. Many different communities on the Internet, such as geographic, media, life sciences and government, have already adopted these LD principles. This is confirmed by the dramatically growing Linked Data Web, where currently more than 50 billion facts are represented. With the emergence of the Web of Linked Data, several use cases become possible due to the rich and disparate data integrated into one global information space. Linked Data, in these cases, not only assists in building mashups by interlinking heterogeneous and dispersed data from multiple sources but also empowers the uncovering of meaningful and impactful relationships. These discoveries have paved the way for scientists to explore the existing data and uncover meaningful outcomes that they might not have been aware of previously.

    In all these use cases utilizing LD, one crippling problem is the underlying data quality. Incomplete, inconsistent or inaccurate data affects the end results gravely, making them unreliable. Data quality is commonly conceived as fitness for use, be it for a certain application or use case. There are cases in which datasets that contain quality problems are still useful for certain applications, depending on the use case at hand. Thus, LD consumption has to deal with the problem of getting the data into a state in which it can be exploited for real use cases. Insufficient data quality can be caused by the LD publication process or can be intrinsic to the data source itself. A key challenge is to assess the quality of datasets published on the Web and make this quality information explicit. Assessing data quality is particularly challenging in LD because the underlying data stems from a set of multiple, autonomous and evolving data sources. Moreover, the dynamic nature of LD makes quality assessment crucial for measuring how accurately the data represents the real world. On the document Web, data quality can only be indirectly or vaguely defined, but there is a requirement for more concrete and measurable data quality metrics for LD. Such data quality metrics include correctness of facts with respect to the real world, adequacy of semantic representation, quality of interlinks, interoperability, timeliness, and consistency with regard to implicit information. Even though data quality is an important concept in LD, few methodologies have been proposed to assess the quality of these datasets.

    Thus, in this thesis, we first unify 18 data quality dimensions and provide a total of 69 metrics for the assessment of LD. The first methodology employs LD experts for the assessment. This assessment is performed with the help of the TripleCheckMate tool, which was developed specifically to assist LD experts in assessing the quality of a dataset, in this case DBpedia. The second methodology is a semi-automatic process, in which the first phase involves the detection of common quality problems through the automatic creation of an extended schema for DBpedia. The second phase involves the manual verification of the generated schema axioms. Thereafter, we employ the wisdom of the crowd, i.e., workers on online crowdsourcing platforms such as Amazon Mechanical Turk (MTurk), to assess the quality of DBpedia. We then compare the two approaches (the previous assessment by LD experts and the assessment by MTurk workers in this study) in order to measure the feasibility of each type of user-driven data quality assessment methodology. Additionally, we evaluate another semi-automated methodology for LD quality assessment, which also involves human judgement. In this semi-automated methodology, selected metrics are formally defined and implemented as part of a tool, namely R2RLint. The user is provided not only with the results of the assessment but also with the specific entities that cause the errors, which helps users understand the quality issues and fix them. Finally, we consider a domain-specific use case that consumes LD and relies on data quality. In particular, we identify four LD sources, assess their quality using the R2RLint tool and then utilize them in building the Health Economic Research (HER) Observatory. We show the advantages of this semi-automated assessment over the other types of quality assessment methodologies discussed earlier. The Observatory aims at evaluating the impact of research development on the economic and healthcare performance of each country per year. We illustrate the usefulness of LD in this use case and the importance of quality assessment for any data analysis.
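    Many of the metrics mentioned above can be implemented as automated checks over an RDF graph. The following sketch illustrates one commonly used metric of this kind (literals lacking a datatype or language tag) with rdflib; it is an illustration under assumed inputs, not the R2RLint or TripleCheckMate implementation:

```python
# Minimal sketch of one automated Linked Data quality check, for illustration only.
from rdflib import Graph, Literal

def untyped_literal_ratio(graph: Graph) -> float:
    """Fraction of literal objects lacking both a datatype and a language tag."""
    literals = [o for _, _, o in graph if isinstance(o, Literal)]
    if not literals:
        return 0.0
    untyped = [o for o in literals if o.datatype is None and o.language is None]
    return len(untyped) / len(literals)

g = Graph()
g.parse("dataset.ttl", format="turtle")  # hypothetical input file
print(f"Untyped literal ratio: {untyped_literal_ratio(g):.2%}")
```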

    The Computer Science Ontology: A Comprehensive Automatically-Generated Taxonomy of Research Areas

    Ontologies of research areas are important tools for characterising, exploring, and analysing the research landscape. Some fields of research are comprehensively described by large-scale taxonomies, e.g., MeSH in Biology and PhySH in Physics. Conversely, current Computer Science taxonomies are coarse-grained and tend to evolve slowly. For instance, the ACM classification scheme contains only about 2K research topics and its last version dates back to 2012. In this paper, we introduce the Computer Science Ontology (CSO), a large-scale, automatically generated ontology of research areas, which includes about 14K topics and 162K semantic relationships. It was created by applying the Klink-2 algorithm on a very large dataset of 16M scientific articles. CSO presents two main advantages over the alternatives: i) it includes a very large number of topics that do not appear in other classifications, and ii) it can be updated automatically by running Klink-2 on recent corpora of publications. CSO powers several tools adopted by the editorial team at Springer Nature and has been used to enable a variety of solutions, such as classifying research publications, detecting research communities, and predicting research trends. To facilitate the uptake of CSO, we have also released the CSO Classifier, a tool for automatically classifying research papers, and the CSO Portal, a web application that enables users to download, explore, and provide granular feedback on CSO. Users can use the portal to navigate and visualise sections of the ontology, rate topics and relationships, and suggest missing ones. The portal will support the publication of and access to regular new releases of CSO, with the aim of providing a comprehensive resource to the various research communities engaged with scholarly data.
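    Since CSO is distributed as an ontology, sections of it can also be navigated programmatically rather than only through the portal. The sketch below shows one way this might look with rdflib; the file name, the topic URI scheme and the cso:superTopicOf predicate are assumptions about the CSO release, not verified details:

```python
# Hypothetical sketch: listing direct subtopics of a CSO topic with rdflib.
from rdflib import Graph, Namespace, URIRef

CSO = Namespace("http://cso.kmi.open.ac.uk/schema/cso#")  # assumed schema namespace

g = Graph()
g.parse("CSO.ttl", format="turtle")  # hypothetical local copy of a CSO dump

topic = URIRef("https://cso.kmi.open.ac.uk/topics/semantic_web")  # assumed URI scheme
for _, _, subtopic in g.triples((topic, CSO.superTopicOf, None)):
    print(subtopic)
```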

    Crowdsourcing and the Semantic Web: A Research Manifesto


    Ensemble labeling towards scientific information extraction (ELSIE)

    Extracting scientific facts from unstructured text is difficult due to challenges specific to scientific writing: the ambiguity of the language and the complexity of the named entities and relations to be extracted. This problem is well illustrated by the extraction of polymer names and their properties. Even in cases where the property is a temperature, identifying the polymer name associated with it may require expertise, owing to the use of acronyms, synonyms and complicated naming conventions, and because new polymer names are continually being “introduced” to the vernacular as polymer science advances. While domain-specific machine learning toolkits exist that address these challenges, perhaps the greatest challenge is the lack of labeled data for training these machine learning models, since producing such data is time-consuming, error-prone and costly. Our work repurposes Snorkel, a data programming tool, in a novel way to identify sentences that contain the relation of interest, both to generate training data and as a first step towards extracting the entities themselves. We achieve 94% recall and demonstrate the importance of identifying the complex sentences prior to extraction by comparing against a state-of-the-art domain-aware natural language processing toolkit. We also show that our system captures sentences missed by both the toolkit and the expert labelers.
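    In Snorkel's data-programming style, the sentence-selection step described above amounts to writing labeling functions whose noisy votes are combined by a label model. The sketch below is a hypothetical illustration of that pattern; the heuristics and example sentences are invented and are not ELSIE's actual labeling functions:

```python
# Hypothetical sketch of Snorkel labeling functions for flagging sentences that
# may express a polymer-property relation. Heuristics are illustrative only.
import re
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

@labeling_function()
def lf_has_temperature(x):
    # Vote POSITIVE if the sentence mentions a temperature value, e.g. "100 °C"
    return POSITIVE if re.search(r"\d+\s*(°\s*C|K\b)", x.sentence) else ABSTAIN

@labeling_function()
def lf_mentions_glass_transition(x):
    # "glass transition" / "Tg" are strong cues for a polymer-property relation
    return POSITIVE if re.search(r"glass transition|\bTg\b", x.sentence) else ABSTAIN

@labeling_function()
def lf_no_polymer_cue(x):
    # Sentences without a polymer-like token are unlikely to express the relation
    return NEGATIVE if not re.search(r"poly\w+|PMMA|\bPS\b", x.sentence) else ABSTAIN

df = pd.DataFrame({"sentence": [
    "The glass transition temperature of polystyrene is around 100 °C.",
    "Samples were stored overnight at room temperature.",
    "PMMA exhibits a Tg near 105 °C under these conditions.",
]})

lfs = [lf_has_temperature, lf_mentions_glass_transition, lf_no_polymer_cue]
L = PandasLFApplier(lfs=lfs).apply(df)

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=100)
print(label_model.predict(L))  # 1 = sentence likely contains the relation
```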

    Towards natural language question generation for the validation of ontologies and mappings

    The increasing number of open-access ontologies and their key role in applications such as decision-support systems highlight the importance of their validation. Human expertise is crucial for the validation of ontologies from a domain point of view. However, the growing number of ontologies and their fast evolution over time make manual validation challenging. Methods: We propose a novel semi-automatic approach based on the generation of natural language (NL) questions to support the validation of ontologies and their evolution. The proposed approach includes the automatic generation, factorization and ordering of NL questions from medical ontologies. The final validation and correction are performed by submitting these questions to domain experts and automatically analyzing their feedback. We also propose a second approach for the validation of mappings impacted by ontology changes. This method exploits the context of the changes to propose correction alternatives presented as multiple-choice questions. Results: This research provides a question optimization strategy to maximize the validation of ontology entities with a reduced number of questions. We evaluate our approach on the validation of three medical ontologies. We also evaluate the feasibility and efficiency of our mapping validation approach in the context of ontology evolution. These experiments are performed with different versions of SNOMED CT and ICD-9. Conclusions: The experimental results suggest the feasibility and adequacy of our approach for supporting the validation of interconnected and evolving ontologies. The results also suggest that taking RDFS and OWL entailment into account helps reduce the number of questions and the validation time. Applying our approach to validate mapping evolution also shows the difficulty of adapting mappings over time and highlights the importance of semi-automatic validation. Funding: Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP).
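    To give a concrete sense of the question-generation idea, a subsumption axiom can be rendered as a yes/no question for a domain expert. The sketch below is a simplified illustration, not the system described in the paper; the axiom labels are invented:

```python
# Hypothetical sketch: turning subclass axioms into yes/no validation questions.
def subsumption_question(sub_label: str, super_label: str) -> str:
    """Render an rdfs:subClassOf axiom as a natural-language validation question."""
    return f"Is every {sub_label} a kind of {super_label}?"

axioms = [
    ("viral pneumonia", "pneumonia"),
    ("myocardial infarction", "kidney disease"),  # deliberately wrong; expert should answer 'no'
]

for sub, sup in axioms:
    print(subsumption_question(sub, sup))
```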
