
    Distributed Holistic Clustering on Linked Data

    Link discovery is an active field of research to support data integration in the Web of Data. Due to the huge size and number of available data sources, efficient and effective link discovery is a very challenging task. Common pairwise link discovery approaches do not scale to many sources with very large entity sets. Here, we propose a distributed holistic approach to link many data sources based on a clustering of entities that represent the same real-world object. Our clustering approach provides a compact and fused representation of entities, and can identify errors in existing links as well as many new links. We support a distributed execution of the clustering approach to achieve faster execution times and scalability for large real-world data sets. We provide a novel gold standard for multi-source clustering, and evaluate our methods with respect to effectiveness and efficiency on large data sets from the geographic and music domains.
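    The pairwise-similarity-plus-grouping idea at the core of such entity clustering can be illustrated with a small sketch. Everything below is an assumed, simplified stand-in for the paper's distributed approach: a plain string similarity and a single-machine union-find grouping, with hypothetical entity ids and labels.

```python
# Minimal sketch: group entities whose label similarity passes a threshold,
# so each union-find cluster stands for one real-world object.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a real system would use richer features."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_entities(entities: dict[str, str], threshold: float = 0.85) -> dict[str, int]:
    """Map each entity id to a cluster id via union-find over similar pairs."""
    parent = {e: e for e in entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    ids = list(entities)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if similarity(entities[a], entities[b]) >= threshold:
                parent[find(a)] = find(b)  # union the two clusters

    roots = {e: find(e) for e in entities}
    labels = {r: i for i, r in enumerate(sorted(set(roots.values())))}
    return {e: labels[r] for e, r in roots.items()}

if __name__ == "__main__":
    # hypothetical entity ids from three geographic sources
    ents = {"gn:1": "Leipzig", "dbp:1": "Leipzig, Germany", "fb:1": "Dresden"}
    print(cluster_entities(ents, threshold=0.6))  # gn:1 and dbp:1 fuse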

    Building an effective and efficient background knowledge resource to enhance ontology matching

    Ontology matching is critical for data integration and interoperability. Original ontology matching approaches relied solely on the content of the ontologies to align. However, these approaches are less effective when equivalent concepts have dissimilar labels and are structured under different modeling views. To overcome this semantic heterogeneity, the community has turned to external background knowledge resources. Several methods have been proposed to select ontologies, other than the ones to align, as background knowledge to enhance a given ontology-matching task. However, these methods return a set of complete ontologies, while, in most cases, only fragments of the returned ontologies are effective for discovering new mappings. In this article, we propose an approach to select and build a background knowledge resource with just the right concepts chosen from a set of ontologies, which improves efficiency without loss of effectiveness. The use of background knowledge in ontology matching is a double-edged sword: while it may increase recall (i.e., retrieve more correct mappings), it may lower precision (i.e., produce more incorrect mappings). Therefore, we propose two methods to select the most relevant mappings from the candidates: (1) a selection based on a set of rules and (2) a selection based on supervised machine learning. Our experiments, conducted on two Ontology Alignment Evaluation Initiative (OAEI) datasets, confirm the effectiveness and efficiency of our approach. Moreover, the F-measure values obtained with our approach are very competitive with those of the state-of-the-art matchers exploiting background knowledge resources.
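    The rule-based mapping selection (method (1) above) can be sketched as follows. The data shapes, rules, and thresholds here are illustrative assumptions, not the article's actual rule set: each candidate mapping is assumed to carry a label-similarity score and the length of its derivation path through background-knowledge concepts.

```python
# Hedged sketch of rule-based selection of candidate mappings derived
# via background knowledge; rules protect precision by dropping risky ones.
from dataclasses import dataclass

@dataclass
class Candidate:
    source: str        # concept in ontology A
    target: str        # concept in ontology B
    similarity: float  # lexical similarity of the anchoring labels, in [0, 1]
    path_len: int      # hops through background concepts used to derive it

def select(candidates: list[Candidate]) -> list[Candidate]:
    """Keep mappings that a simple (assumed) rule set deems trustworthy."""
    kept = []
    for c in candidates:
        if c.similarity >= 0.9:                        # rule 1: near-exact labels
            kept.append(c)
        elif c.similarity >= 0.7 and c.path_len <= 2:  # rule 2: short derivation path
            kept.append(c)
        # everything else is dropped as too risky (precision over recall)
    # rule 3: at most one mapping per source concept, keep the best-scored one
    best: dict[str, Candidate] = {}
    for c in kept:
        if c.source not in best or c.similarity > best[c.source].similarity:
            best[c.source] = c
    return list(best.values())
```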

    Automatic Generation of Trace Links in Model-driven Software Development

    Traceability data provides knowledge about the dependencies and logical relations existing among the artefacts created during software development. By reasoning over traceability data, conclusions can be drawn to increase the quality of software. The paradigm of Model-driven Software Development (MDSD) promotes the generation of software from models, which are specified in different modelling languages. In subsequent model transformations, these models are used to generate programming code automatically. Traceability data about the artefacts involved in an MDSD process can be used to increase software quality by providing the knowledge described above. Existing traceability solutions in MDSD generate traceability data from the model mappings of the transformation execution. Yet, these solutions still entail a range of open challenges. One challenge is that the collected traceability data does not adhere to a unified formal definition, which leads to poorly integrated traceability data and complicates reasoning over it. Furthermore, these traceability solutions all depend on access to a transformation engine. However, a transformation engine cannot be accessed in all MDSD settings, for instance with proprietary transformation engines or manually implemented transformations. In these cases it is not possible to instrument the transformation engine to generate traceability data, resulting in a lack of traceability data. In this work, we address these shortcomings by proposing a generic traceability framework for augmenting arbitrary transformation approaches with a traceability mechanism. To integrate traceability data from different transformation approaches, our approach features a methodology, based on a design pattern, that supplies the engineer with recommendations for designing the traceability mechanism and for modelling traceability data. Additionally, to provide a traceability mechanism for inaccessible transformation engines, we leverage parallel model matching to generate traceability data for arbitrary source and target models. This approach is based on a language-agnostic concept of three similarity measures for matching; to realise the similarity measures, we exploit metamodel matching techniques for graph-based model matching. Finally, we evaluate our approach on a set of transformations from an SAP business application and from the MDSD domain.
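    The model-matching fallback based on combined similarity measures can be sketched as below. The three concrete measures (name, type, and neighborhood similarity), the weights, and the dict-based model elements are assumed stand-ins for the thesis's language-agnostic measures, not a faithful reproduction.

```python
# Sketch: propose trace links between source and target model elements
# by a weighted combination of several similarity measures.
from difflib import SequenceMatcher

def name_sim(a, b):
    """Lexical similarity of element names."""
    return SequenceMatcher(None, a["name"], b["name"]).ratio()

def type_sim(a, b):
    """1.0 if the metamodel types agree, else 0.0."""
    return 1.0 if a["type"] == b["type"] else 0.0

def neighbor_sim(a, b):
    """Jaccard overlap of the elements' graph neighborhoods."""
    na, nb = set(a["neighbors"]), set(b["neighbors"])
    return len(na & nb) / len(na | nb) if na | nb else 0.0

def trace_links(src, tgt, weights=(0.5, 0.2, 0.3), threshold=0.6):
    """Return (source, target, score) links whose weighted score passes the threshold."""
    links = []
    for s in src:
        for t in tgt:
            score = (weights[0] * name_sim(s, t)
                     + weights[1] * type_sim(s, t)
                     + weights[2] * neighbor_sim(s, t))
            if score >= threshold:
                links.append((s["name"], t["name"], round(score, 2)))
    return links

if __name__ == "__main__":
    # hypothetical elements of a source and a generated target model
    src = [{"name": "Customer", "type": "Class", "neighbors": ["Order"]}]
    tgt = [{"name": "CustomerEntity", "type": "Class", "neighbors": ["Order"]}]
    print(trace_links(src, tgt))
```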

    Matching Biomedical Knowledge Graphs with Neural Embeddings

    Master's thesis (Tese de mestrado), Ciência de Dados, Universidade de Lisboa, Faculdade de Ciências, 2020. Knowledge graphs are data structures that have become essential for organizing the biomedical data produced at an exponential rate in recent years. The broad adoption of this method of structuring and describing data has led to the development of data mining approaches that take advantage of these information structures to advance scientific knowledge. However, due to human idiosyncrasy and the impossibility of isolating knowledge domains from one another, knowledge graphs constructed by different individuals often contain equivalent concepts described differently, which obstructs an integrated analysis of data described by multiple knowledge graphs. Multiple knowledge graph matching systems have been developed to address this challenge. Nevertheless, the performance of these systems on biomedical knowledge graphs has stagnated in the last four years, despite highly tailored algorithms and external resources applied to the task.
In this dissertation, we present two novel knowledge graph matching approaches employing neural embeddings: one using plain embedding similarity based on word and graph models; the other training a more complex word-based model that requires training data to refine the embeddings. The proposed methodology integrates these approaches into the regular matching process, using the AgreementMakerLight system as a foundation. These new components extend the system's matching algorithms, discover new mappings, and yield a matching procedure that is more generalizable and less dependent on external biomedical ontologies. This methodology was evaluated on three biomedical ontology matching test cases provided by the Ontology Alignment Evaluation Initiative. The results showed that although neither embedding approach exceeds the state of the art, both perform well on the matching tasks, surpassing all systems that do not use external ontologies and even some that do, which demonstrates the value of neural embeddings for matching biomedical knowledge graphs.
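    The plain embedding-similarity approach can be illustrated with a minimal sketch. To keep it self-contained, the "embedding" below is a trivial character-n-gram stand-in rather than a trained word or graph model, and the threshold is an assumption; the structure (embed labels, map concepts across graphs by cosine similarity) is the point.

```python
# Sketch: match concepts across two knowledge graphs by cosine similarity
# of their label embeddings.
import math
from collections import Counter

def embed(label: str, n: int = 3) -> Counter:
    """Character n-gram vector (illustrative stand-in for a neural embedding)."""
    s = f" {label.lower()} "
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def match(graph_a: list[str], graph_b: list[str], threshold: float = 0.8):
    """Return (a, b, score) mappings whose embedding similarity passes the threshold."""
    emb_b = {b: embed(b) for b in graph_b}
    mappings = []
    for a in graph_a:
        ea = embed(a)
        best = max(graph_b, key=lambda b: cosine(ea, emb_b[b]))
        score = cosine(ea, emb_b[best])
        if score >= threshold:
            mappings.append((a, best, round(score, 2)))
    return mappings

if __name__ == "__main__":
    print(match(["myocardial infarction"], ["Myocardial Infarction", "stroke"]))
```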

    Advanced Methods for Entity Linking in the Life Sciences

    The amount of knowledge increases rapidly due to the increasing number of available data sources. However, the autonomy of data sources and the resulting heterogeneity prevent comprehensive data analysis and applications. Data integration aims to overcome heterogeneity by unifying different data sources and enriching unstructured data. The enrichment of data consists of different subtasks, among them the annotation process, which links document phrases to terms of a standardized vocabulary. Annotated documents enable effective retrieval methods, comparability of different documents, and comprehensive data analysis, such as finding adversarial drug effects based on patient data. A vocabulary enables comparability through standardized terms; an ontology can also serve as a vocabulary, but is additionally defined by concepts, relationships, and logical constraints. The annotation process is applicable in different domains. Nevertheless, generic and specialized domains differ with respect to the annotation process; this thesis emphasizes these differences and addresses the identified challenges. Most annotation approaches are evaluated on general domains, such as Wikipedia. This thesis evaluates the developed annotation approaches on case report forms, i.e., medical documents for examining clinical trials. Natural language poses various challenges, such as expressing similar meanings with different phrases. The proposed annotation method, AnnoMap, considers this fuzziness of natural language. A further challenge is the reuse of verified annotations: existing annotations represent knowledge that can be reused for further annotation processes, and AnnoMap includes a reuse strategy that utilizes verified annotations to link new documents to appropriate concepts. Due to the broad spectrum of areas in the biomedical domain, different annotation tools exist, and they perform differently depending on the particular domain. This thesis therefore proposes a combination approach to unify results from different tools: existing tool results are used to build a classification model that classifies new annotations as correct or incorrect. The results show that the reuse strategy and the machine learning-based combination improve annotation quality compared to existing approaches focusing on the biomedical domain. A further part of data integration is entity resolution, which builds unified knowledge bases from different data sources. A data source consists of a set of records characterized by attributes, and the goal of entity resolution is to identify records representing the same real-world entity. Many methods focus on linking such flat, attribute-based records; only a few can handle graph-structured knowledge bases or consider temporal aspects. Temporal aspects are essential for identifying the same entities across different time intervals, since entities and their attribute values change over time. Moreover, records can be related to other records, so that a small graph structure exists for each record; these small graphs can be linked to each other if they represent the same entity. This thesis proposes an entity resolution approach for census data consisting of person records for different time intervals, which also considers the graph structure of persons given by family relationships.
To achieve high-quality results, current methods apply machine learning techniques to classify record pairs as representing the same entity or not. The classification uses a model generated from training data, in this case a set of record pairs labeled as duplicates or non-duplicates. Nevertheless, generating training data is time-consuming, so active learning techniques are relevant for reducing the number of required training examples. The entity resolution method for temporal graph-structured data shows an improvement over previous collective entity resolution approaches. The developed active learning approach achieves results comparable to supervised learning methods and outperforms other limited-budget active learning methods. Besides the entity resolution approach, the thesis introduces the concept of evolution operators for communities. These operators can express the dynamics of communities and individuals; for instance, we can state that two communities merged or split over time, and the operators also allow observing the history of individuals. Overall, the presented annotation approaches generate high-quality annotations for medical forms, which enable comprehensive analysis across different data sources as well as accurate queries. The proposed entity resolution approaches improve on existing ones and thus contribute to the generation of high-quality knowledge graphs and to data analysis tasks.
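    The limited-budget active learning idea can be sketched as an uncertainty-sampling loop. scikit-learn's LogisticRegression is used here as a placeholder classifier; the feature vectors (pairwise record similarities), the seed set, and the oracle labels are assumptions, not the thesis's actual setup.

```python
# Sketch: train on a small labeled seed of record pairs, then repeatedly
# query the label of the pair the model is least certain about.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning(X_pool, y_oracle, seed_idx, budget=20):
    """Label up to `budget` pairs via uncertainty sampling; return the final model."""
    labeled = list(seed_idx)          # seed must contain both classes
    unlabeled = [i for i in range(len(X_pool)) if i not in labeled]
    model = LogisticRegression()
    for _ in range(budget):
        if not unlabeled:
            break
        model.fit(X_pool[labeled], y_oracle[labeled])
        # P(duplicate) for every still-unlabeled pair
        proba = model.predict_proba(X_pool[unlabeled])[:, 1]
        # query the pair the model is least certain about (closest to 0.5)
        pick = unlabeled[int(np.argmin(np.abs(proba - 0.5)))]
        labeled.append(pick)          # a human oracle would supply this label
        unlabeled.remove(pick)
    return model.fit(X_pool[labeled], y_oracle[labeled])
```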