518 research outputs found
Web based knowledge extraction and consolidation for automatic ontology instantiation
The Web is probably the largest and richest information repository available today. Search engines are the common access routes to this valuable source. However, the role of these search engines is often limited to the retrieval of lists of potentially relevant documents. The burden of analysing the returned documents and identifying the knowledge of interest is therefore left to the user. The Artequakt system aims to deploy natural language tools to automatically ex-tract and consolidate knowledge from web documents and instantiate a given ontology, which dictates the type and form of knowledge to extract. Artequakt focuses on the domain of artists, and uses the harvested knowledge to gen-erate tailored biographies. This paper describes the latest developments of the system and discusses the problem of knowledge consolidation
Ontology Population via NLP Techniques in Risk Management
In this paper we propose an NLP-based method for Ontology Population from texts and apply it to semi automatic instantiate a Generic Knowledge Base (Generic Domain Ontology) in the risk management domain. The approach is semi-automatic and uses a domain expert intervention for validation. The proposed approach relies on a set of Instances Recognition Rules based on syntactic structures, and on the predicative power of verbs in the instantiation process. It is not domain dependent since it heavily relies on linguistic knowledge. A description of an experiment performed on a part of the ontology of the PRIMA project (supported by the European community) is given. A first validation of the method is done by populating this ontology with Chemical Fact Sheets from Environmental Protection Agency . The results of this experiment complete the paper and support the hypothesis that relying on the predicative power of verbs in the instantiation process improves the performance.Information Extraction, Instance Recognition Rules, Ontology Population, Risk Management, Semantic Analysis
An Architecture for Data and Knowledge Acquisition for the Semantic Web: the AGROVOC Use Case
We are surrounded by ever growing volumes of unstructured and weakly-structured information, and for a human being, domain expert or not, it is nearly impossible to read, understand and categorize such information in a fair amount of time. Moreover, different user categories have different expectations: final users need easy-to-use tools and services for specific tasks, knowledge engineers require robust tools for knowledge acquisition, knowledge categorization and semantic resources development, while semantic applications developers demand for flexible frameworks for fast and easy, standardized development of complex applications. This work represents an experience report on the use of the CODA framework for rapid prototyping and deployment of knowledge acquisition systems for RDF. The system integrates independent NLP tools and custom libraries complying with UIMA standards. For our experiment a document set has been processed to populate the AGROVOC thesaurus with two new relationships
An Automated Method to Enrich and Expand Consumer Health Vocabularies Using GloVe Word Embeddings
Clear language makes communication easier between any two parties. However, a layman may have difficulty communicating with a professional due to not understanding the specialized terms common to the domain. In healthcare, it is rare to find a layman knowledgeable in medical jargon, which can lead to poor understanding of their condition and/or treatment. To bridge this gap, several professional vocabularies and ontologies have been created to map laymen medical terms to professional medical terms and vice versa. Many of the presented vocabularies are built manually or semi-automatically requiring large investments of time and human effort and consequently the slow growth of these vocabularies. In this dissertation, we present an automatic method to enrich existing concepts in a medical ontology with additional laymen terms and also to expand the number of concepts in the ontology that do not have associated laymen terms. Our work has the benefit of being applicable to vocabularies in any domain.
Our entirely automatic approach uses machine learning, specifically Global Vectors for Word Embeddings (GloVe), on a corpus collected from a social media healthcare platform to extend and enhance consumer health vocabularies. We improve these vocabularies by incorporating synonyms and hyponyms from the WordNet ontology. By performing iterative feedback using GloVe’s candidate terms, we can boost the number of word occurrences in the co-occurrence matrix allowing our approach to work with a smaller training corpus.
Our novel algorithms and GloVe were evaluated using two laymen datasets from the National Library of Medicine (NLM), the Open-Access and Collaborative Consumer Health Vocabulary (OAC CHV) and the MedlinePlus Healthcare Vocabulary. For our first goal, enriching concepts, the results show that GloVe was able to find new laymen terms with an F-score of 48.44%. Our best algorithm enhanced the corpus with synonyms from WordNet, outperformed GloVe with an F-score relative improvement of 25%. For our second goal, expanding the number of concepts with related laymen’s terms, our synonym-enhanced GloVe outperformed GloVe with a relative F-score relative improvement of 63%.
The results of the system were in general promising and can be applied not only to enrich and expand laymen vocabularies for medicine but any ontology for a domain, given an appropriate corpus for the domain. Our approach is applicable to narrow domains that may not have the huge training corpora typically used with word embedding approaches. In essence, by incorporating an external source of linguistic information, WordNet, and expanding the training corpus, we are getting more out of our training corpus. Our system can help building an application for patients where they can read their physician\u27s letters more understandably and clearly. Moreover, the output of this system can be used to improve the results of healthcare search engines, entity recognition systems, and many others
A Survey of Volunteered Open Geo-Knowledge Bases in the Semantic Web
Over the past decade, rapid advances in web technologies, coupled with
innovative models of spatial data collection and consumption, have generated a
robust growth in geo-referenced information, resulting in spatial information
overload. Increasing 'geographic intelligence' in traditional text-based
information retrieval has become a prominent approach to respond to this issue
and to fulfill users' spatial information needs. Numerous efforts in the
Semantic Geospatial Web, Volunteered Geographic Information (VGI), and the
Linking Open Data initiative have converged in a constellation of open
knowledge bases, freely available online. In this article, we survey these open
knowledge bases, focusing on their geospatial dimension. Particular attention
is devoted to the crucial issue of the quality of geo-knowledge bases, as well
as of crowdsourced data. A new knowledge base, the OpenStreetMap Semantic
Network, is outlined as our contribution to this area. Research directions in
information integration and Geographic Information Retrieval (GIR) are then
reviewed, with a critical discussion of their current limitations and future
prospects
Extracting Synonyms from Bilingual Dictionaries
We present our progress in developing a novel algorithm to extract synonyms
from bilingual dictionaries. Identification and usage of synonyms play a
significant role in improving the performance of information access
applications. The idea is to construct a translation graph from translation
pairs, then to extract and consolidate cyclic paths to form bilingual sets of
synonyms. The initial evaluation of this algorithm illustrates promising
results in extracting Arabic-English bilingual synonyms. In the evaluation, we
first converted the synsets in the Arabic WordNet into translation pairs (i.e.,
losing word-sense memberships). Next, we applied our algorithm to rebuild these
synsets. We compared the original and extracted synsets obtaining an F-Measure
of 82.3% and 82.1% for Arabic and English synsets extraction, respectively.Comment: In Proceedings - 11th International Global Wordnet Conference
(GWC2021). Global Wordnet Association (2021
Ontology Enrichment from Free-text Clinical Documents: A Comparison of Alternative Approaches
While the biomedical informatics community widely acknowledges the utility of domain ontologies, there remain many barriers to their effective use. One important requirement of domain ontologies is that they achieve a high degree of coverage of the domain concepts and concept relationships. However, the development of these ontologies is typically a manual, time-consuming, and often error-prone process. Limited resources result in missing concepts and relationships, as well as difficulty in updating the ontology as domain knowledge changes. Methodologies developed in the fields of Natural Language Processing (NLP), Information Extraction (IE), Information Retrieval (IR), and Machine Learning (ML) provide techniques for automating the enrichment of ontology from free-text documents. In this dissertation, I extended these methodologies into biomedical ontology development. First, I reviewed existing methodologies and systems developed in the fields of NLP, IR, and IE, and discussed how existing methods can benefit the development of biomedical ontologies. This previously unconducted review was published in the Journal of Biomedical Informatics. Second, I compared the effectiveness of three methods from two different approaches, the symbolic (the Hearst method) and the statistical (the Church and Lin methods), using clinical free-text documents. Third, I developed a methodological framework for Ontology Learning (OL) evaluation and comparison. This framework permits evaluation of the two types of OL approaches that include three OL methods. The significance of this work is as follows: 1) The results from the comparative study showed the potential of these methods for biomedical ontology enrichment. For the two targeted domains (NCIT and RadLex), the Hearst method revealed an average of 21% and 11% new concept acceptance rates, respectively. The Lin method produced a 74% acceptance rate for NCIT; the Church method, 53%. As a result of this study (published in the Journal of Methods of Information in Medicine), many suggested candidates have been incorporated into the NCIT; 2) The evaluation framework is flexible and general enough that it can analyze the performance of ontology enrichment methods for many domains, thus expediting the process of automation and minimizing the likelihood that key concepts and relationships would be missed as domain knowledge evolves
- …