2,053 research outputs found

    Ontologies and Information Extraction

    Full text link
    This report argues that, even in the simplest cases, IE is an ontology-driven process. It is not a mere text filtering method based on simple pattern matching and keywords, because the extracted pieces of texts are interpreted with respect to a predefined partial domain model. This report shows that depending on the nature and the depth of the interpretation to be done for extracting the information, more or less knowledge must be involved. This report is mainly illustrated in biology, a domain in which there are critical needs for content-based exploration of the scientific literature and which becomes a major application domain for IE

    A knowledge-based approach to information extraction for semantic interoperability in the archaeology domain

    Get PDF
    The paper presents a method for automatic semantic indexing of archaeological grey-literature reports using empirical (rule-based) Information Extraction techniques in combination with domain-specific knowledge organization systems. Performance is evaluated via the Gold Standard method. The semantic annotation system (OPTIMA) performs the tasks of Named Entity Recognition, Relation Extraction, Negation Detection and Word Sense disambiguation using hand-crafted rules and terminological resources for associating contextual abstractions with classes of the standard ontology (ISO 21127:2006) CIDOC Conceptual Reference Model (CRM) for cultural heritage and its archaeological extension, CRM-EH, together with concepts from English Heritage thesauri and glossaries.Relation Extraction performance benefits from a syntactic based definition of relation extraction patterns derived from domain oriented corpus analysis. The evaluation also shows clear benefit in the use of assistive NLP modules relating to word-sense disambiguation, negation detection and noun phrase validation, together with controlled thesaurus expansion.The semantic indexing results demonstrate the capacity of rule-based Information Extraction techniques to deliver interoperable semantic abstractions (semantic annotations) with respect to the CIDOC CRM and archaeological thesauri. Major contributions include recognition of relevant entities using shallow parsing NLP techniques driven by a complimentary use of ontological and terminological domain resources and empirical derivation of context-driven relation extraction rules for the recognition of semantic relationships from phrases of unstructured text. The semantic annotations have proven capable of supporting semantic query, document study and cross-searching via the ontology framework

    An Ontology-Driven Methodology To Derive Cases From Structured And Unstructured Sources

    Get PDF
    The problem-solving capability of a Case-Based Reasoning (CBR) system largely depends on the richness of its knowledge stored in the form of cases, i.e. the CaseBase (CB). Populating and subsequently maintaining a critical mass of cases in a CB is a tedious manual activity demanding vast human and operational resources. The need for human involvement in populating a CB can be drastically reduced as case-like knowledge already exists in the form of databases and documents and harnessed and transformed into cases that can be operationalized. Nevertheless, the transformation process poses many hurdles due to the disparate structure and the heterogeneous coding standards used. The featured work aims to address knowledge creation from heterogeneous sources and structures. To meet this end, this thesis presents a Multi-Source Case Acquisition and Transformation Info-Structure (MUSCATI). MUSCATI has been implemented as a multi-layer architecture using state-of-the-practice tools and can be perceived as a functional extension to traditional CBR-systems. In principle, MUSCATI can be applied in any domain but in this thesis healthcare was chosen. Thus, Electronic Medical Records (EMRs) were used as the source to generate the knowledge. The results from the experiments showed that the volume and diversity of cases improves the reasoning outcome of the CBR engine. The experiments showed that knowledge found in medical records (regardless of structure) can be leveraged and standardized to enhance the (medical) knowledge of traditional medical CBR systems. Subsequently, the Google search engine proved to be very critical in “fixing” and enriching the domain ontology on-the-fly

    The Development of a Temporal Information Dictionary for Social Media Analytics

    Get PDF
    Dictionaries have been used to analyze text even before the emergence of social media and the use of dictionaries for sentiment analysis there. While dictionaries have been used to understand the tonality of text, so far it has not been possible to automatically detect if the tonality refers to the present, past, or future. In this research, we develop a dictionary containing time-indicating words in a wordlist (T-wordlist). To test how the dictionary performs, we apply our T-wordlist on different disaster related social media datasets. Subsequently we will validate the wordlist and results by a manual content analysis. So far, in this research-in-progress, we were able to develop a first dictionary and will also provide some initial insight into the performance of our wordlist

    ArCo: the Italian Cultural Heritage Knowledge Graph

    Full text link
    ArCo is the Italian Cultural Heritage knowledge graph, consisting of a network of seven vocabularies and 169 million triples about 820 thousand cultural entities. It is distributed jointly with a SPARQL endpoint, a software for converting catalogue records to RDF, and a rich suite of documentation material (testing, evaluation, how-to, examples, etc.). ArCo is based on the official General Catalogue of the Italian Ministry of Cultural Heritage and Activities (MiBAC) - and its associated encoding regulations - which collects and validates the catalogue records of (ideally) all Italian Cultural Heritage properties (excluding libraries and archives), contributed by CH administrators from all over Italy. We present its structure, design methods and tools, its growing community, and delineate its importance, quality, and impact

    The underpinnings of a composite measure for automatic term extraction: The case of SRC

    Full text link
    The corpus-based identification of those lexical units which serve to describe a given specialized domain usually becomes a complex task, where an analysis oriented to the frequency of words and the likelihood of lexical associations is often ineffective. The goal of this article is to demonstrate that a user-adjustable composite metric such as SRC can accommodate to the diversity of domain-specific glossaries to be constructed from small-and medium-sized specialized corpora of non-structured texts. Unlike for most of the research in automatic term extraction, where single metrics are usually combined indiscriminately to produce the best results, SRC is grounded on the theoretical principles of salience, relevance and cohesion, which have been rationally implemented in the three components of this metric.Financial support for this research has been provided by the DGI, Spanish Ministry of Education and Science, grants FFI2011-29798-C02-01 and FFI2014-53788-C3-1-P.Periñán Pascual, JC. (2015). The underpinnings of a composite measure for automatic term extraction: The case of SRC. Terminology. 21(2):151-179. doi:10.1075/term.21.2.02perS15117921
    corecore