8 research outputs found

    Measuring Clusters of Labels in an Embedding Space to Refine Relations in Ontology Alignment

    Get PDF
    Ontology alignment plays a key role in the management of heterogeneous data sources and metadata. In this context, various ontology alignment techniques have been proposed to discover correspondences between the entities of different ontologies. This paper proposes a new ontology alignment approach based on a set of rules that exploit the embedding space and measure clusters of labels to discover relationships between entities. We tested our system on the OAEI conference complex alignment benchmark track and then applied it to aligning ontologies in a real-world case study. The experimental results show that combining word embeddings with a measure of the dispersion of label clusters, which we call the radius measure, makes it possible to determine with good accuracy not only equivalence relations but also hierarchical relations between entities.
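The radius measure can be read as a dispersion statistic over the embedded labels of each cluster. A minimal sketch, assuming (hypothetically) that the radius is the mean distance of a cluster's label vectors to their centroid and that broader clusters correspond to more general concepts; the paper's exact definition and decision rules may differ:

```python
import math

def radius(vectors):
    """Mean Euclidean distance of a cluster's label vectors to their
    centroid -- one plausible reading of the paper's dispersion
    ("radius") measure; the exact definition is an assumption here."""
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    dists = [math.dist(v, centroid) for v in vectors]
    return sum(dists) / len(dists)

# Toy 2-d label embeddings (hypothetical): a tight cluster of "person"
# labels versus a broader cluster of "agent" labels.
person = [(0.90, 0.10), (0.92, 0.12), (0.88, 0.09)]
agent  = [(0.90, 0.10), (0.50, 0.50), (0.10, 0.90)]
print(radius(person) < radius(agent))  # True: the broader cluster is more dispersed
```

Comparing the radii of two matched clusters could then, under this assumption, hint at a hierarchical rather than an equivalence relation.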

    OMT, an Ontology Matching System

    Get PDF
    Master's dissertation in Informatics Engineering. In recent years, ontologies have become an integral part of storing information in a structured and formal manner, and a way of sharing that information. With this rise in usage, it was only a matter of time before different people tried to use ontologies to represent the same knowledge domain. The area of Ontology Matching was created with the purpose of finding correspondences between different ontologies that represent information in the same domain. This document reports on a Master's project that began with a study of existing ontology matching techniques and tools, in order to learn what techniques exist and to understand the advantages and disadvantages of each. Building on this literature study, a new web-based tool called OMT was created to automatically merge two given ontologies. The OMT tool processes ontologies written in different ontology representation languages, such as the OWL family or any language conforming to the RDF web standards. It provides the user with basic information about the submitted ontologies and, after the matching occurs, with a simplified version of the results focusing on the number of objects that were matched and merged. The user can also download a log file containing a detailed description of the matching process and the reasoning behind the decisions the OMT tool made. The OMT tool was tested throughout its development against a variety of potential inputs to assess its accuracy. Lastly, a web application was developed to host the OMT tool, to facilitate access to and use of the tool.

    Exploiting general-purpose background knowledge for automated schema matching

    Full text link
    The schema matching task is an integral part of the data integration process and is usually its first step. Schema matching is typically very complex and time-consuming, and it is therefore carried out mostly by humans. One reason for the low degree of automation is that schemas are often defined with deep background knowledge that is not itself present within the schemas. Overcoming the problem of missing background knowledge is a core challenge in automating the data integration process. In this dissertation, the task of matching semantic models, so-called ontologies, with the help of external background knowledge is investigated in depth in Part I. Throughout this thesis, the focus lies on large, general-purpose resources, since domain-specific resources are rarely available for most domains. Besides new knowledge resources, this thesis also explores new strategies to exploit such resources. A technical base for the development and comparison of matching systems is presented in Part II. The framework introduced there allows for simple and modularized matcher development (with background knowledge sources) and for extensive evaluation of matching systems. One of the largest structured sources of general-purpose background knowledge are knowledge graphs, which have grown significantly in size in recent years. However, exploiting such graphs is not trivial. In Part III, knowledge graph embeddings are explored, analyzed, and compared, and multiple improvements to existing approaches are presented. In Part IV, numerous concrete matching systems that exploit general-purpose background knowledge are presented, and exploitation strategies and resources are analyzed and compared. This dissertation closes with a perspective on real-world applications.

    Evolution von ontologiebasierten Mappings in den Lebenswissenschaften

    Get PDF
    In the life sciences, there is an increasing number of heterogeneous data sources that need to be integrated and combined in comprehensive analysis tasks. Often ontologies and other structured vocabularies are used to provide a formal representation of knowledge and to facilitate data exchange between different applications. Ontologies are used in different domains like molecular biology or chemistry; one of their most important applications is the annotation of real-world objects like genes or publications. Since different ontologies can contain overlapping knowledge, it is necessary to determine mappings between them (ontology mappings). Manual mapping creation can be very time-consuming or even infeasible, so (semi-)automatic ontology matching methods are typically applied. Ontologies are not static but undergo continuous modification due to new research insights and changing user requirements. The evolution of ontologies can in turn affect dependent data such as annotation and ontology mappings.
    This thesis presents novel methods and algorithms to deal with the evolution of ontology-based mappings. To this end, the generic infrastructure GOMMA is used and extended to manage and analyze the evolution of ontologies and mappings. First, a comparative evolution analysis for ontologies and mappings from three life-science domains shows heavy changes in ontologies and mappings, as well as an impact of ontology changes on the mappings. Hence, existing ontology mappings can become invalid and need to be migrated to current ontology versions, while an expensive recomputation of the mappings should be avoided. This thesis introduces two generic algorithms to (semi-)automatically adapt ontology mappings: (1) a composition-based adaptation that relies on the principle of mapping composition, and (2) a diff-based adaptation algorithm that handles individual change operations to update mappings. Both approaches reuse unaffected mapping parts and adapt only the parts affected by changes. An evaluation for very large biomedical ontologies and mappings shows that both approaches produce ontology mappings of high quality.
    Similarly, ontology changes may also affect ontology-based annotation mappings. The thesis introduces a generic evaluation approach to assess the quality of annotation mappings based on their evolution. Different quality measures allow for the identification of reliable annotations, e.g., based on their stability or provenance information. A comprehensive analysis of large annotation data sources shows numerous instabilities, e.g., due to the temporary absence of annotations. Such modifications may influence the results of dependent applications such as functional enrichment analyses, which describe experimental data in terms of ontological groupings. The question arises to what degree ontology and annotation changes may affect such analyses. Based on different stability measures, the evaluation assesses the change intensity of application results and gives insight into whether users need to expect significant changes in their analysis results.
    Moreover, GOMMA is extended with large-scale ontology matching techniques. Such techniques are useful, among other things, for matching new concepts during ontology mapping adaptation. Many existing match systems do not scale to very large ontologies, e.g., from the life-science domain. An efficient composition-based approach indirectly computes ontology mappings by reusing and combining existing mappings to intermediate ontologies. Intermediate ontologies can contain useful background knowledge, so the mapping quality can improve compared to a direct match approach. The thesis also introduces general strategies for matching ontologies in parallel using several computing nodes. A size-based partitioning of the input ontologies enables good load balancing and scalability, since smaller match tasks can be processed in parallel. The evaluation within the Ontology Alignment Evaluation Initiative (OAEI) compares GOMMA and other systems on matching ontologies from different domains. Using parallel and composition-based matching, GOMMA achieves very good results with respect to efficiency and effectiveness, especially for ontologies from the life-science domain.
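The composition-based idea described in this abstract can be illustrated in a few lines: given a mapping from a source ontology to an intermediate (mediator) ontology and one from the mediator to a target, an indirect source-to-target mapping falls out by composition. A minimal sketch with hypothetical concept IDs; GOMMA's actual algorithm also aggregates similarity scores and handles 1:n correspondences:

```python
def compose(map_ab, map_bc):
    """Derive an A->C mapping by composing A->B with B->C mappings
    through an intermediate (mediator) ontology B. Source concepts
    whose mediator concept has no onward mapping are dropped."""
    return {a: map_bc[b] for a, b in map_ab.items() if b in map_bc}

# Toy example with hypothetical concept IDs: a mediator ontology bridges
# two anatomy ontologies.
src_to_mediator = {"MA:0000168": "UMLS:C0018787", "MA:0001234": "UMLS:C0000001"}
mediator_to_tgt = {"UMLS:C0018787": "NCI:C12727"}
print(compose(src_to_mediator, mediator_to_tgt))  # {'MA:0000168': 'NCI:C12727'}
```

Reuse of existing mappings to the mediator is what avoids a full, expensive direct match of the two input ontologies.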

    Methods for Matching of Linked Open Social Science Data

    Get PDF
    In recent years, the concept of Linked Open Data (LOD) has gained popularity and acceptance across various communities and domains. Science policy makers and organizations claim that the potential of semantic technologies, and of data exposed in this manner, may support and enhance research processes and infrastructures providing research information and services. In this thesis, we investigate whether these expectations can be met in the domain of the social sciences. In particular, we analyse and develop methods for matching social scientific data that is published as Linked Data, which we introduce as Linked Open Social Science Data. Based on expert interviews and a prototype application, we investigate the current consumption of LOD in the social sciences and its requirements. Following these insights, we first focus on the complete publication of Linked Open Social Science Data by extending and developing domain-specific ontologies for representing research communities, research data, and thesauri. In the second part, methods for matching Linked Open Social Science Data are developed that address the particular patterns and characteristics of the data typically used in social research. The results of this work contribute towards enabling a meaningful application of Linked Data in a scientific domain.

    Bioinformatička platforma za izvršavanje Federated SPARQL upita nad ontološkim bazama podataka i detektovanje sličnih podataka utvrđivanjem njihove semantičke povezanosti

    Get PDF
    The importance of bioinformatics, as an interdisciplinary field, rests on the large volume of biological data that can be adequately used and processed with current information technology. What is of vital importance in bioinformatics today is the availability of data relevant to research, as well as knowledge of whether such data already exist. An important prerequisite is that the necessary data are publicly available and integrated, and that mechanisms for searching them have been developed. To address these problems, the bioinformatics community uses semantic web technologies, and many semantic repositories and software solutions have been developed that significantly support research activities on the bioinformatics scene. However, these approaches often face problems because many databases have been developed in isolation, without respecting the basic standards of the bioinformatics community. These heterogeneous databases, which link a number of highly specialized and independent resources, often use different conventions, vocabularies, and formats for representing data. Current software solutions therefore face various challenges in searching for and discovering relevant data. Also, many databases overlap, covering or concealing similar data and thus forming homogeneous or semi-homogeneous data sources. In such cases the semantic correlation of these databases is often unclear, and appropriate data analysis methods must be applied to identify similar data. This dissertation is the result of research aimed at overcoming the shortcomings of existing solutions.
    The dissertation contributes to the development of a bioinformatics platform through a number of original software approaches that form the basis of its key functionalities: executing federated SPARQL queries over initial (and user-selected) databases to discover data relevant to bioinformatics research, and detecting similar data by determining their semantic relatedness. Federated SPARQL queries are executed over databases that use the Resource Description Framework (RDF) as their data model. Query results can be subsequently filtered, improving their relevance; filtering involves selecting specific properties (predicates) during a dynamic projection of the RDF database structure and executing dynamically generated star-shaped SPARQL queries. The algorithm developed for detecting similar data presents an original approach and is applied to instances of ontological databases. It uses the principles of ontology alignment, text mining, the vector space model for the mathematical representation of data, and the cosine similarity measure for the numerical determination of data similarity. The Platform is the result of long-term research within CPCTAS (Centre for PreClinical Testing of Active Substances) and the Laboratory for Cellular and Molecular Biology, part of the Institute of Biology and Ecology at the Faculty of Science, University of Kragujevac. The Laboratory's activity covers one of the important subfields of bioinformatics: preclinical testing of bioactive substances (potential cancer drugs). The primary goal of the Platform is to make the Laboratory's research more productive and efficient. Platform validation was conducted over test and real bioinformatics data sources, indicating high resource utilization. The Platform's efficient methods open the way for new research in bioinformatics, as well as in any other field involving ontological data modelling.
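The similarity detection step described in this abstract combines a vector space model with the cosine similarity measure. A minimal sketch using raw term frequencies; the Platform's actual pipeline also applies ontology alignment principles and text-mining preprocessing, which are omitted here:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts under a simple term-frequency
    vector space model: dot product of the term vectors divided by
    the product of their Euclidean norms."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity("preclinical testing of bioactive substances",
                        "testing bioactive substances in preclinical trials"))  # ≈ 0.73
```

Scores near 1 indicate near-duplicate instances across databases; a threshold on this score is one simple way to flag semantically related records.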

    Kelowna Courier

    Get PDF