3,777 research outputs found

    Learning Terminological Knowledge with High Confidence from Erroneous Data

    Description logic knowledge bases are a popular approach to represent terminological and assertional knowledge in a form suitable for computers to work with. Despite that, the practicality of description logics is impaired by the difficulties one has to overcome to construct such knowledge bases. Previous work has addressed this issue by providing methods to learn valid terminological knowledge from data, making use of ideas from formal concept analysis. A basic assumption here is that the data is free of errors, an assumption that in general cannot be made for practical applications. This thesis presents extensions of these results that allow errors in the data to be handled. For this, knowledge that is "almost valid" in the data is retrieved, where the notion of "almost valid" is formalized using the notion of confidence from data mining. This thesis presents two algorithms which achieve this retrieval. The first algorithm simply extracts all almost valid knowledge from the data, while the second algorithm utilizes expert interaction to distinguish errors from rare but valid counterexamples.
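
    For intuition, the confidence measure used to formalize "almost valid" knowledge can be sketched as follows; the toy data, attribute names, and threshold are invented for illustration and are not the thesis's actual algorithm or data.

    def confidence(premise, conclusion, objects):
        """Confidence of the implication premise -> conclusion over a binary
        data set given as {object name: set of attributes}."""
        supporting = [attrs for attrs in objects.values() if premise <= attrs]
        if not supporting:
            return 1.0  # vacuously valid: no object satisfies the premise
        confirming = [attrs for attrs in supporting if conclusion <= attrs]
        return len(confirming) / len(supporting)

    # Toy data: "mislabeled_item" is an error, "dolphin" a rare but valid counterexample.
    data = {
        "cat": {"mammal", "has_fur", "four_legs"},
        "dog": {"mammal", "has_fur", "four_legs"},
        "dolphin": {"mammal"},
        "mislabeled_item": {"has_fur"},
    }
    # "mammal -> has_fur" holds with confidence 2/3, so it counts as "almost valid"
    # for any confidence threshold of at most 2/3.
    print(confidence({"mammal"}, {"has_fur"}, data))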

    On the Usability of Probably Approximately Correct Implication Bases

    We revisit the notion of probably approximately correct implication bases from the literature and present a first formulation in the language of formal concept analysis, with the goal of investigating whether such bases represent a suitable substitute for exact implication bases in practical use cases. To this end, we quantitatively examine the behavior of probably approximately correct implication bases on artificial and real-world data sets and compare their precision and recall with respect to their corresponding exact implication bases. Using a small example, we also provide qualitative insight that implications from probably approximately correct bases can still represent meaningful knowledge from a given data set.
    Comment: 17 pages, 8 figures; typos added, corrected x-label on graph
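
    One plausible way to read such a precision/recall comparison is in terms of entailment between the approximate and exact bases; the following sketch reflects that assumed reading, not necessarily the paper's exact definitions, and uses the standard naive implication closure.

    def closure(attrs, implications):
        """Close a set of attributes under (premise, conclusion) implications."""
        closed, changed = set(attrs), True
        while changed:
            changed = False
            for premise, conclusion in implications:
                if premise <= closed and not conclusion <= closed:
                    closed |= conclusion
                    changed = True
        return closed

    def entails(base, implication):
        premise, conclusion = implication
        return conclusion <= closure(premise, base)

    def precision_recall(approx_base, exact_base):
        precision = sum(entails(exact_base, i) for i in approx_base) / len(approx_base)
        recall = sum(entails(approx_base, i) for i in exact_base) / len(exact_base)
        return precision, recall

    exact = [({"a"}, {"b"}), ({"b"}, {"c"})]
    approx = [({"a"}, {"c"}), ({"a"}, {"d"})]   # hypothetical approximate base
    print(precision_recall(approx, exact))       # (0.5, 0.0)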

    Multiple Retrieval Models and Regression Models for Prior Art Search

    This paper presents the system PATATRAS (PATent and Article Tracking, Retrieval and AnalysiS) developed for the IP track of CLEF 2009. Our approach has three main characteristics: 1. the usage of multiple retrieval models (KL, Okapi) and term index definitions (lemma, phrase, concept) for the three languages considered in the track (English, French, German), producing ten different sets of ranked results; 2. the merging of the different results based on multiple regression models, using an additional validation set created from the patent collection; 3. the exploitation of patent metadata and of the citation structures for creating restricted initial working sets of patents and for producing a final re-ranking regression model. As we exploit the specific metadata of the patent documents and the citation relations only at the creation of the initial working sets and during the final post-ranking step, our architecture remains generic and easy to extend.
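
    As a rough sketch of the regression-based merging idea in item 2, one could fit fusion weights on a validation set and combine the per-run scores linearly; the data layout and the use of ordinary least squares below are assumptions for illustration, not PATATRAS's actual regression models.

    import numpy as np

    def fit_fusion_weights(run_scores, relevance):
        """run_scores: (n_runs, n_candidates) score matrix from the retrieval runs
        (e.g. KL/lemma, Okapi/phrase, ...); relevance: 0/1 labels from a
        validation set built from the patent collection."""
        X = np.asarray(run_scores, dtype=float).T        # candidates x runs
        y = np.asarray(relevance, dtype=float)
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        return w

    def fuse(run_scores, w):
        # One merged score per candidate, used for the final ranking.
        return np.asarray(run_scores, dtype=float).T @ w

    scores = [[2.1, 0.4, 1.3],     # run 1
              [1.8, 0.2, 1.9]]     # run 2
    weights = fit_fusion_weights(scores, [1, 0, 1])
    print(fuse(scores, weights))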

    An Ontology-Driven Methodology to Derive Cases from Structured and Unstructured Sources

    The problem-solving capability of a Case-Based Reasoning (CBR) system largely depends on the richness of its knowledge stored in the form of cases, i.e. the CaseBase (CB). Populating and subsequently maintaining a critical mass of cases in a CB is a tedious manual activity demanding vast human and operational resources. The need for human involvement in populating a CB can be drastically reduced, as case-like knowledge already exists in the form of databases and documents and can be harnessed and transformed into cases that can be operationalized. Nevertheless, the transformation process poses many hurdles due to the disparate structures and the heterogeneous coding standards used. The featured work aims to address knowledge creation from heterogeneous sources and structures. To this end, this thesis presents a Multi-Source Case Acquisition and Transformation Info-Structure (MUSCATI). MUSCATI has been implemented as a multi-layer architecture using state-of-the-practice tools and can be perceived as a functional extension to traditional CBR systems. In principle, MUSCATI can be applied in any domain, but in this thesis healthcare was chosen; thus, Electronic Medical Records (EMRs) were used as the source to generate the knowledge. The results from the experiments showed that the volume and diversity of cases improve the reasoning outcome of the CBR engine. The experiments also showed that knowledge found in medical records (regardless of structure) can be leveraged and standardized to enhance the (medical) knowledge of traditional medical CBR systems. Subsequently, the Google search engine proved to be critical in “fixing” and enriching the domain ontology on the fly.
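
    Purely as an illustration of the source-to-case transformation idea (the field names, the mapping, and the case structure below are hypothetical; MUSCATI's layered architecture is far richer), one might map heterogeneous EMR fields onto canonical case attributes like this:

    # Source field -> (case part, canonical attribute); purely hypothetical mapping.
    ONTOLOGY_MAP = {
        "dx": ("problem", "diagnosis"),
        "diagnosis_code": ("problem", "diagnosis"),
        "rx": ("solution", "treatment"),
        "medication": ("solution", "treatment"),
    }

    def record_to_case(record):
        """Transform one heterogeneous source record into a problem/solution case."""
        case = {"problem": {}, "solution": {}}
        for field, value in record.items():
            if field in ONTOLOGY_MAP:
                part, attribute = ONTOLOGY_MAP[field]
                case[part][attribute] = value
        return case

    print(record_to_case({"dx": "type 2 diabetes", "rx": "metformin"}))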

    Towards reducing communication gaps in multicultural and global requirements elicitation

    This paper focuses on the collaborative aspects of requirements elicitation, in the context of software, systems, and service development. The aim is to identify and understand challenges of requirements elicitation in general and in distributed environments. We focus on human, social, and cultural factors that have an impact on communication in the requirements elicitation process. More specifically, we aim to i) unfold potential cultural impediments that hamper the requirements elicitation process; ii) highlight cultural factors that should be taken into account in the requirements elicitation process in order to avoid incomplete and inconsistent requirements; and iii) make recommendations for alleviating the problems. In this paper, our first step is to report on the findings of a review of the literature regarding culture in RE. The results suggest that cultural studies in the field of RE are insufficient and that more empirical studies are required. Secondly, we look at current solutions that are being adopted to assist in improving the cultural aspect of the requirements elicitation process. In the following step, we map the identified communication gaps to the SPI Manifesto Values, revealing how the problems manifest. Finally, we prescribe a set of recommendations that actors in the requirements elicitation process could follow in order to improve cultural considerations in the RE process. These recommendations address the shortcomings that were identified in the literature review and mapped to the Values of the SPI Manifesto. The proposals concern technologies, platforms, methods, and frameworks that are readily available. A requirements elicitation process that adopts one or more of these proposals can help alleviate the challenges introduced by stakeholders’ cultural diversity, thus leading to systems development and deployment that better reflects the requirements and needs of diverse stakeholders and users.

    Variation and Semantic Relation Interpretation: Linguistic and Processing Issues

    Studies in linguistics define lexico-syntactic patterns to characterize the linguistic utterances that can be interpreted with semantic relations. Because patterns are assumed to reflect linguistic regularities that have a stable interpretation, several software tools implement such patterns to extract semantic relations from text. Nevertheless, a thorough analysis of pattern occurrences in various corpora showed that variation may affect their interpretation. In this paper, we report the linguistic variations that impact relation interpretation in language and may lead to errors in relation extraction systems. We analyze several features of state-of-the-art pattern-based relation extraction tools, mainly how patterns are represented and matched against text, and discuss their role in the tools' ability to manage variation.
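
    As a minimal illustration of a rigid lexico-syntactic pattern and of how surface variation can defeat it (the pattern and sentences are invented for this example, not taken from the paper's corpora):

    import re

    # A rigid Hearst-style pattern for hypernymy: "Y such as X" => X is-a Y.
    PATTERN = re.compile(r"(\w+) such as (\w+)")

    sentences = [
        "diseases such as diabetes are chronic",            # canonical form: matched
        "diseases , such as , for example , diabetes",      # inserted variation: missed
    ]
    for s in sentences:
        m = PATTERN.search(s)
        print(s, "->", (m.group(2), "is-a", m.group(1)) if m else "no relation found")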

    Fusing Automatically Extracted Annotations for the Semantic Web

    This research focuses on the problem of semantic data fusion. Although various solutions have been developed in the research communities focusing on databases and formal logic, the choice of an appropriate algorithm is non-trivial because the performance of each algorithm and its optimal configuration parameters depend on the type of data to which the algorithm is applied. In order to be reusable, the fusion system must be able to select appropriate techniques and use them in combination. Moreover, because of the varying reliability of data sources and of the algorithms performing fusion subtasks, uncertainty is an inherent feature of semantically annotated data and has to be taken into account by the fusion system. Finally, the issue of schema heterogeneity can have a negative impact on fusion performance. To address these issues, we propose KnoFuss: an architecture for Semantic Web data integration based on the principles of problem-solving methods. Algorithms dealing with different fusion subtasks are represented as components of a modular architecture, and their capabilities are described formally. This allows the architecture to select appropriate methods and configure them depending on the processed data. In order to handle uncertainty, we propose a novel algorithm based on Dempster-Shafer belief propagation. KnoFuss employs this algorithm to reason about uncertain data and method results in order to refine the fused knowledge base. Tests show that these solutions lead to improved fusion performance. Finally, we addressed the problem of data fusion in the presence of schema heterogeneity: we extended the KnoFuss framework to exploit the results of automatic schema alignment tools and proposed our own schema matching algorithm aimed at facilitating data fusion in the Linked Data environment. We conducted experiments with this approach and obtained a substantial improvement in performance in comparison with public data repositories.
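
    For intuition only, the core of Dempster-Shafer evidence combination (the basic combination rule, not KnoFuss's full belief propagation over the fused knowledge base) can be sketched as below; the frame of discernment and the mass values are invented for the example.

    from itertools import product

    def dempster_combine(m1, m2):
        """Dempster's rule for two mass functions with frozenset focal elements."""
        combined, conflict = {}, 0.0
        for (a, p), (b, q) in product(m1.items(), m2.items()):
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + p * q
            else:
                conflict += p * q
        return {k: v / (1.0 - conflict) for k, v in combined.items()}

    THETA = frozenset({"same", "different"})
    # Hypothetical evidence about whether two instances denote the same entity:
    matcher = {frozenset({"same"}): 0.7, THETA: 0.3}          # string-similarity method
    functional = {frozenset({"different"}): 0.4, THETA: 0.6}  # functional-property check
    print(dempster_combine(matcher, functional))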

    Assessing and Improving Domain Knowledge Representation in DBpedia

    With the development of knowledge graphs and the billions of triples generated on the Linked Data cloud, it is paramount to ensure the quality of data. In this work, we focus on one of the central hubs of the Linked Data cloud, DBpedia. In particular, we assess the quality of DBpedia for domain knowledge representation. Our results show that DBpedia still has much room for improvement in this regard, especially for the description of concepts and their linkage with the DBpedia ontology. Based on this analysis, we leverage open relation extraction and the information already available on DBpedia to partly correct the issue, by providing novel relations extracted from Wikipedia abstracts and discovering entity types using the dbo:type predicate. Our results show that open relation extraction can indeed help enrich domain knowledge representation in DBpedia.
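
    A minimal sketch of looking up dbo:type values for a resource on the public DBpedia endpoint (the endpoint URL and the example resource are assumptions of this illustration; the chosen resource may return no rows):

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?type WHERE { <http://dbpedia.org/resource/Insulin> dbo:type ?type }
    """)
    sparql.setReturnFormat(JSON)

    # Print each dbo:type value found for the resource as a candidate entity type.
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["type"]["value"])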

    Structural and Lexical Methods for Auditing Biomedical Terminologies

    Biomedical terminologies serve as knowledge sources for a wide variety of biomedical applications, including information extraction and retrieval, data integration and management, and decision support. Quality issues in biomedical terminologies, if not addressed, could affect all downstream applications that use them as knowledge sources. Therefore, Terminology Quality Assurance (TQA) has become an integral part of the terminology management lifecycle. However, identification of potential quality issues is challenging due to the ever-growing size and complexity of biomedical terminologies. Auditing them manually is time-consuming and labor-intensive, and hence automated TQA methods are highly desirable. In this dissertation, systematic and scalable methods are proposed to audit biomedical terminologies using both their structural and their lexical information. Two inference-based methods, two non-lattice-based methods, and a deep learning-based method are developed to identify potentially missing hierarchical (or is-a) relations, erroneous is-a relations, and missing concepts in biomedical terminologies including the Gene Ontology (GO), the National Cancer Institute thesaurus (NCIt), and SNOMED CT.

    In the first inference-based method, GO concept names are represented using a set-of-words model and a sequence-of-words model. Inconsistencies between hierarchically linked and unlinked concept pairs are leveraged to detect potentially missing or erroneous is-a relations. The set-of-words model detects a total of 5,359 potential inconsistencies in the 03/28/2017 release of GO and the sequence-of-words model detects 4,959. Domain experts’ evaluation shows that the set-of-words model achieves a precision of 53.78% (128 out of 238) and the sequence-of-words model achieves a precision of 57.55% (122 out of 212) in identifying inconsistencies. In the second inference-based method, a Subsumption-based Sub-term Inference Framework (SSIF) is developed by introducing a novel term algebra on top of a sequence-based representation of GO concepts. The sequence-based representation utilizes the part of speech of concept names, sub-concepts (concept names appearing inside another concept name), and antonyms appearing in concept names. Three conditional rules (monotonicity, intersection, and sub-concept rules) are developed for backward subsumption inference. Applying SSIF to the 10/03/2018 release of GO suggests 1,938 potentially missing is-a relations. Domain experts’ evaluation of 210 randomly selected potentially missing is-a relations shows that SSIF achieves a precision of 60.61%, 60.49%, and 46.03% for the monotonicity, intersection, and sub-concept rules, respectively.

    In the first non-lattice-based method, lexical patterns of concepts in Non-Lattice Subgraphs (NLSs: graph fragments with a higher tendency to contain quality issues) are mined to detect potentially missing is-a relations and missing concepts in NCIt. Six lexical patterns are leveraged: containment, union, intersection, union-intersection, inference-contradiction, and inference-union. Each pattern indicates a specific type of potential error and suggests a potential type of remediation. This method identifies 809 NLSs exhibiting these patterns in the 16.12d version of NCIt, achieving a precision of 66% (33 out of 50). In the second non-lattice-based method, enriched lexical attributes from concept ancestors are leveraged to identify potentially missing is-a relations in NLSs. The lexical attributes of a concept are inherited in two ways: from ancestors within the NLS, and from all of its ancestors. For a pair of concepts without a hierarchical relation, if the lexical attributes of one concept are a subset of those of the other, a potentially missing is-a relation between the two concepts is suggested. This method identifies a total of 1,022 potentially missing is-a relations in the 19.01d release of NCIt, with a precision of 84.44% (76 out of 90) when inheriting lexical attributes from ancestors within the NLS and 89.02% (73 out of 82) when inheriting from all ancestors.

    For the non-lattice-based methods, similar NLSs may contain similar quality issues, and thus exhaustive examination of NLSs would involve redundant work. A hybrid method is therefore introduced to identify similar NLSs and avoid redundant analyses. Given an input NLS, a graph isomorphism algorithm is used to obtain its structurally identical NLSs. A similarity score between the input NLS and each of its structurally identical NLSs is then computed based on the semantic similarity between their corresponding concept names: the concept names are converted to vectors using the Doc2Vec document embedding model, and the cosine similarity of the two vectors is computed. All structurally identical NLSs with a similarity score above 0.85 are considered similar to the input NLS. Applying this method to 10 different structures of NLSs in the 02/12/2018 release of GO reveals that 38.43% of these NLSs have at least one similar NLS.

    Finally, a deep learning-based method is explored to facilitate the suggestion of missing is-a relations in NCIt and SNOMED CT, focusing on concept pairs exhibiting the containment pattern. The problem is framed as a binary classification task: given a pair of concepts, the deep learning model learns to predict whether the two concepts have an is-a relation or not. Positive training samples are existing is-a relations in the terminology that exhibit the containment pattern; negative training samples are concept pairs without is-a relations that also exhibit the containment pattern. A graph neural network model is constructed for this task and trained with subgraphs enclosing the pairs of concepts in the samples. To evaluate the models trained on the two terminologies, two evaluation sets are created, using newer releases of each terminology as a partial reference standard. The model trained on NCIt achieves a precision of 0.5, a recall of 0.75, and an F1 score of 0.6; the model trained on SNOMED CT achieves a precision of 0.51, a recall of 0.64, and an F1 score of 0.56.
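
    The lexical-containment intuition behind several of these methods can be sketched as below; the concept names and the known is-a set are invented for illustration, and the actual methods additionally inherit attributes from ancestors and restrict attention to non-lattice subgraphs.

    def words(name):
        """Set-of-words (bag of lexical attributes) representation of a concept name."""
        return frozenset(name.lower().split())

    concepts = ["skin melanoma", "recurrent skin melanoma", "melanoma"]
    known_isa = {("skin melanoma", "melanoma")}   # hierarchical relations already in the terminology

    # If one concept's word set strictly contains another's and no is-a relation
    # is recorded between them, flag a potentially missing is-a relation.
    for child in concepts:
        for parent in concepts:
            if child != parent and words(parent) < words(child) \
                    and (child, parent) not in known_isa:
                print(f"potentially missing: '{child}' is-a '{parent}'")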