
    A Data-driven Approach to Large Knowledge Graph Matching

    In the last decade, a remarkable number of open Knowledge Graphs (KGs) have been developed, such as DBpedia, NELL, and YAGO. While some of these KGs are curated via crowdsourcing platforms, others are semi-automatically constructed. This has resulted in a significant degree of semantic heterogeneity and overlapping facts. KGs are highly complementary; thus, mapping them can benefit intelligent applications that require integrating different KGs, such as recommendation systems, query answering, and Semantic Web navigation. Although the problem of ontology matching has been investigated and a significant number of systems have been developed, the challenges of mapping large-scale KGs remain substantial. KG matching has been a topic of interest in the Semantic Web community since it was introduced to the Ontology Alignment Evaluation Initiative (OAEI) in 2018. Nonetheless, a major limitation of the current benchmarks is their lack of representation of real-world KGs. This work also highlights a number of limitations of current matching methods: (i) they are highly dependent on string-based similarity measures, and (ii) they are primarily built to handle well-formed ontologies. These characteristics make them unsuitable for large, (semi- or fully) automatically constructed KGs with hundreds of classes and millions of instances. Another limitation of current work is the lack of benchmark datasets that represent the challenging task of matching real-world KGs. This work addresses the limitations of current datasets by first introducing two gold-standard datasets for matching the schemas of large, automatically constructed, less-well-structured KGs based on common KGs such as NELL, DBpedia, and Wikidata. We believe that the datasets we make public in this work constitute the largest domain-independent benchmarks for matching KG classes. As many state-of-the-art methods are not suitable for matching large-scale, cross-domain KGs that often suffer from highly imbalanced class distributions, recent studies have revisited instance-based matching techniques to address this task. This is because such large KGs often lack a well-defined structure and descriptive metadata about their classes, but contain numerous class instances. Therefore, inspired by the role of instances in KGs, we propose a hybrid matching approach: our method combines an instance-based matcher, which casts the schema-matching process as a text classification task by exploiting instances of KG classes, with a string-based matcher. Our method is domain-independent and able to handle KG classes with imbalanced populations. Further, we show that combining an instance-based approach with an appropriate data-balancing strategy yields significant improvements in matching large, common KG classes.
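
    The hybrid scheme sketched in this abstract can be illustrated with a minimal, hedged example: an instance-based matcher that treats the class-matching decision as text classification over class instances, combined with a simple string-based matcher over class names. The class labels, instance texts, and combination weight below are invented for illustration and are not the datasets, features, or models used in this work.

        # Minimal, illustrative sketch of a hybrid class matcher:
        # an instance-based matcher (text classification over class instances)
        # combined with a string-based matcher. All data below is made up.
        from difflib import SequenceMatcher
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        # Instances of classes from the source KG (class label -> example instance texts).
        source_instances = {
            "sportsteam": ["fc barcelona football club", "chicago bulls basketball team"],
            "politician": ["angela merkel german chancellor", "barack obama us president"],
        }

        # Train a text classifier that predicts the source class of an instance.
        texts, labels = [], []
        for cls, insts in source_instances.items():
            texts.extend(insts)
            labels.extend([cls] * len(insts))
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(texts, labels)

        def instance_score(target_class_instances, source_class):
            """Fraction of the target class's instances classified into source_class."""
            preds = clf.predict(target_class_instances)
            return sum(p == source_class for p in preds) / len(preds)

        def string_score(a, b):
            """Simple string similarity between class names."""
            return SequenceMatcher(None, a.lower(), b.lower()).ratio()

        def hybrid_score(source_class, target_class, target_class_instances, w=0.7):
            """Weighted combination of instance-based and string-based evidence."""
            return (w * instance_score(target_class_instances, source_class)
                    + (1 - w) * string_score(source_class, target_class))

        # Example: how strongly does a target class "SoccerClub" match "sportsteam"?
        print(hybrid_score("sportsteam", "SoccerClub",
                           ["real madrid football club", "liverpool fc english club"]))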

    Yavaa: supporting data workflows from discovery to visualization

    Recent years have witnessed an increasing number of data silos being opened up, both within organizations and to the general public: Scientists publish their raw data as supplements to articles or even as standalone artifacts to enable others to verify and extend their work. Governments pass laws to open up formerly protected data treasures to improve accountability and transparency, as well as to enable new business ideas based on this public good. Even companies share structured information about their products and services to advertise their use and thus increase revenue. Exploiting this wealth of information holds many challenges for users, though. Oftentimes, data is provided as tables whose seemingly endless rows of daunting numbers are barely accessible. Information visualization (InfoVis) can mitigate this gap. However, the visualization options offered are generally very limited, and next to no support is given in applying any of them. The same holds true for data wrangling: only very few options exist to adjust the data to the current needs, and barely any safeguards are in place to prevent even the most obvious mistakes. When it comes to data from multiple providers, the situation gets even bleaker. Only recently have tools emerged that can reasonably search for datasets across institutional borders. Easy-to-use ways to combine these datasets are still missing, though. Finally, results generally lack proper documentation of their provenance, so even the most compelling visualizations can be called into question when it remains unclear how they came about. The foundations for a lively exchange and exploitation of open data are in place, but the barrier to entry remains relatively high, especially for non-expert users. This thesis aims to lower that barrier by providing tools and assistance that reduce the amount of prior experience and skills required. It covers the whole workflow, from identifying suitable datasets, through possible transformations, to exporting the result in the form of suitable visualizations.
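
    As a toy illustration of the provenance concern mentioned above (not Yavaa's actual implementation), each transformation in a workflow can be wrapped so that the resulting artifact carries a log of the steps that produced it. The table contents and operations below are invented.

        # Toy sketch of provenance tracking for a data workflow (illustrative only).
        class TrackedTable:
            def __init__(self, rows, provenance=None):
                self.rows = rows                      # list of dicts
                self.provenance = provenance or []    # ordered log of applied steps

            def _derive(self, rows, step):
                return TrackedTable(rows, self.provenance + [step])

            def filter(self, predicate, description):
                return self._derive([r for r in self.rows if predicate(r)],
                                    f"filter: {description}")

            def select(self, columns):
                return self._derive([{c: r[c] for c in columns} for r in self.rows],
                                    f"select: {columns}")

        # Invented example data: population figures per country and year.
        table = TrackedTable([
            {"country": "A", "year": 2020, "population": 5_000_000},
            {"country": "B", "year": 2020, "population": 9_000_000},
        ])
        result = (table
                  .filter(lambda r: r["year"] == 2020, "year == 2020")
                  .select(["country", "population"]))

        print(result.rows)
        print(result.provenance)   # documents how the result came about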

    Tailored retrieval of health information from the web for facilitating communication and empowerment of elderly people

    Nowadays, a patient acquires health information from the Web mainly through a “human-to-machine” communication process with a generic search engine. This, in turn, affects, positively or negatively, his/her empowerment level and the “human-to-human” communication process that occurs between a patient and a healthcare professional, such as a doctor. A generic communication process can be modelled by considering its syntactic-technical, semantic-meaning, and pragmatic-effectiveness levels, and communication is efficacious when all of these levels are fully addressed. In the case of retrieving health information from the Web, although a generic search engine is able to work at the syntactic-technical level, the semantic and pragmatic aspects are left to the user, and this can be challenging, especially for elderly people. This work presents a custom search engine, FACILE, that works at all three communication levels and helps users overcome the challenges encountered during the search process. A patient can specify his/her information requirements in a simple way, and FACILE retrieves the “right” amount of Web content in language that he/she can easily understand. This facilitates comprehension of the retrieved information and positively affects the empowerment process and communication with healthcare professionals.

    Language complexity in on-line health information retrieval

    The number of people searching for on-line health information has been steadily growing over the years, so it is crucial to understand their specific requirements in order to help them find the specific information they are looking for easily and quickly. Although generic search engines are typically used by health information seekers as the starting point for searching information, they have been shown to be limited and unsatisfactory because they perform generic searches and often overload the user with the sheer amount of results. Moreover, they are not able to provide specific information to different types of users. At the same time, specialized search engines mostly work on medical literature and provide extracts from medical journals that are mainly useful for medical researchers and experts, but not for non-experts. A question then arises: is it possible to facilitate the search of on-line health/medical information based on specific user requirements? In this paper, after analysing the main characteristics and requirements of on-line health seeking, we provide a first answer to this question by exploiting the structured data available on the Web for the health domain and presenting a system that allows different types of users, i.e., non-medical experts and medical experts, to retrieve Web pages with language complexity levels suitable to their expertise. Furthermore, we apply our methodology to the results of a generic search engine, such as Google, in order to re-rank them and provide different users with health/medical Web pages that are appropriate in terms of language complexity.
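
    A minimal sketch of the re-ranking idea, under the assumption that language complexity can be approximated by a readability formula such as Flesch Reading Ease; the result snippets and per-user target scores below are invented and do not reflect the measures used in the paper.

        # Illustrative sketch: re-rank search results so that pages whose estimated
        # reading level is closest to the user's preferred level come first.
        import re

        def count_syllables(word):
            """Rough syllable estimate: count groups of vowels."""
            return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

        def flesch_reading_ease(text):
            sentences = max(1, len(re.findall(r"[.!?]+", text)))
            words = re.findall(r"[A-Za-z]+", text)
            syllables = sum(count_syllables(w) for w in words)
            n = max(1, len(words))
            return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)

        # Invented snippets standing in for search-engine results.
        results = [
            ("page_for_clinicians", "Myocardial infarction necessitates immediate revascularisation."),
            ("page_for_patients", "A heart attack needs fast care. Call for help right away."),
        ]

        # Hypothetical target scores: higher Flesch score = easier text.
        target_score = {"non_expert": 80, "expert": 30}

        def rerank(results, user_type):
            return sorted(results,
                          key=lambda r: abs(flesch_reading_ease(r[1]) - target_score[user_type]))

        print([name for name, _ in rerank(results, "non_expert")])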

    On the synthesis of metadata tags for HTML files

    RDFa, JSON-LD, Microdata, and Microformats make it possible to endow the data in HTML files with metadata tags that help software agents understand them. Unfortunately, many HTML files do not have any metadata tags, which has motivated many authors to work on proposals to synthesize them. However, these proposals have some problems: the authors either provide an overall picture of their designs without many details on the underlying techniques, or focus on the techniques but do not describe the design of the software systems that support them; many of them cannot deal with data encoded in semi-structured formats like forms, listings, or tables; and the few proposals that can work on tables deal with horizontal listings only. In this article, we describe the design of a system that overcomes the previous limitations using a novel embedding approach that has proven to outperform four state-of-the-art techniques on a repository of randomly selected HTML files from 40 different sites. According to our experimental analysis, our proposal achieves an F1 score that outperforms the others by 10.14%; this difference was confirmed to be statistically significant at the standard confidence level. Funding: Junta de Andalucía P18-RT-1060; Ministerio de Economía y Competitividad TIN2013-40848-R; Ministerio de Economía y Competitividad TIN2016-75394-
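
    The article's embedding technique is not detailed in this abstract; purely as a generic illustration of embedding-based tag synthesis, one could embed text fragments and suggest the schema.org property of the most similar labelled fragment. The labelled fragments, the character n-gram embedding, and the nearest-neighbour assignment below are assumptions for illustration, not the system described here.

        # Generic illustration of embedding-based tag synthesis (not the article's model):
        # embed text fragments and suggest the schema.org property of the most similar
        # labelled fragment. All training fragments and labels are invented.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        labeled_fragments = [
            ("John Smith", "https://schema.org/name"),
            ("$1,299.00", "https://schema.org/price"),
            ("info@example.com", "https://schema.org/email"),
            ("4.5 out of 5 stars", "https://schema.org/ratingValue"),
        ]
        texts = [t for t, _ in labeled_fragments]
        labels = [l for _, l in labeled_fragments]

        # Character n-gram embeddings are a cheap stand-in for learned embeddings.
        vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
        matrix = vectorizer.fit_transform(texts)

        def suggest_tag(fragment):
            """Return the schema.org property of the most similar labelled fragment."""
            sims = cosine_similarity(vectorizer.transform([fragment]), matrix)[0]
            return labels[sims.argmax()]

        print(suggest_tag("sales@shop.example"))   # likely suggests schema.org/email
        print(suggest_tag("€49.90"))               # likely suggests schema.org/price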

    Provision of tailored health information for patient empowerment: an initial study

    Searching for the “right” health information is an important step towards the empowerment of patients/citizens. The number of health information seekers on the Internet has been steadily increasing over the years, so it is crucial to understand their information needs and the challenges they face during the search process. However, generic search engines do not make any distinction among users and overload them with information. Moreover, specialized search engines/sites mostly work on medical literature and are built by hand. This paper analyses the possibility of providing the user with tailored web information by exploiting the semantic capabilities of the Web and, in particular, those of schema.org and its health-lifesci extension. After presenting a short review of the main user requirements when searching for health information on the Internet, an analysis of schema.org and its health-lifesci extension is presented to identify the main properties and semantic capabilities in the health/medical domain. Finally, an initial mapping between user requirements and schema.org elements is presented in order to provide expert and non-expert user categories with web pages that satisfy their specific requirements.

    Verification and Validation of Semantic Annotations

    In this paper, we propose a framework to perform verification and validation of semantically annotated data. The annotations, extracted from websites, are verified against the schema.org vocabulary and Domain Specifications to ensure their syntactic correctness and completeness. The Domain Specifications allow checking the compliance of annotations against corresponding domain-specific constraints. The validation mechanism detects errors and inconsistencies between the content of the analyzed schema.org annotations and the content of the web pages where the annotations were found. Comment: Accepted for the A.P. Ershov Informatics Conference 2019 (the PSI Conference Series, 12th edition) proceedings.
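
    A minimal sketch of the verification step, assuming a Domain Specification can be reduced to a list of required properties per schema.org type (real Domain Specifications are richer, e.g. expressed as constraint shapes); the type names, required properties, and JSON-LD snippet below are invented.

        # Minimal sketch of verifying a schema.org annotation for completeness
        # against a hypothetical, simplified domain specification.
        import json

        # Hypothetical domain specification: required properties for selected types.
        DOMAIN_SPEC = {
            "Hotel": ["name", "address", "telephone"],
            "Event": ["name", "startDate", "location"],
        }

        def verify(annotation_jsonld):
            annotation = json.loads(annotation_jsonld)
            a_type = annotation.get("@type")
            errors = []
            if a_type not in DOMAIN_SPEC:
                errors.append(f"type {a_type!r} not covered by the domain specification")
                return errors
            for prop in DOMAIN_SPEC[a_type]:
                if prop not in annotation:
                    errors.append(f"missing required property {prop!r} for type {a_type!r}")
            return errors

        snippet = """
        {
          "@context": "https://schema.org",
          "@type": "Hotel",
          "name": "Alpine Example Hotel"
        }
        """
        print(verify(snippet))   # reports the missing 'address' and 'telephone'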

    Observing LOD using Equivalent Set Graphs: it is mostly flat and sparsely linked

    This paper presents an empirical study aimed at understanding the modeling style and the overall semantic structure of Linked Open Data. We observe how classes, properties, and individuals are used in practice. We also investigate how hierarchies of concepts are structured and how densely they are linked. In addition to discussing the results, this paper contributes (i) a conceptual framework, including a set of metrics, which generalises over the observable constructs, and (ii) an open-source implementation that facilitates its application to other Linked Data knowledge graphs. Comment: 18 pages.
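
    One kind of measurement discussed above, the depth of concept hierarchies, can be illustrated with a small sketch over rdfs:subClassOf statements; the class names and edges below are invented, and the metric is a simplification of the paper's Equivalent Set Graph metrics.

        # Illustrative sketch: depth of a class hierarchy built from rdfs:subClassOf edges.
        from collections import defaultdict
        from functools import lru_cache

        # (subclass, superclass) pairs, i.e. rdfs:subClassOf statements (invented).
        subclass_of = [
            ("ex:Cat", "ex:Mammal"),
            ("ex:Dog", "ex:Mammal"),
            ("ex:Mammal", "ex:Animal"),
            ("ex:Car", "ex:Vehicle"),
        ]

        parents = defaultdict(list)
        for sub, sup in subclass_of:
            parents[sub].append(sup)

        @lru_cache(maxsize=None)
        def depth(cls):
            """Length of the longest rdfs:subClassOf chain from cls to a root class."""
            if not parents[cls]:
                return 0
            return 1 + max(depth(p) for p in parents[cls])

        classes = {c for pair in subclass_of for c in pair}
        depths = {c: depth(c) for c in classes}
        print(max(depths.values()))       # maximum hierarchy depth (here: 2)
        print(sorted(depths.items()))     # a flat hierarchy is dominated by depths 0 and 1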

    Facilitating access to health web pages with different language complexity levels

    The number of people looking for health information on the Internet is constantly growing. When searching for health information, different types of users, such as patients, clinicians, or medical researchers, have different needs and should be able to easily find the information they are looking for based on their specific requirements. However, generic search engines do not make any distinction among users and often overload them with information. On the other hand, specialized search engines mostly work on medical literature, and specialized websites are often not free and contain focused information built by hand. This paper presents a method to facilitate the search of health information on the web so that users can easily and quickly find information based on their specific requirements. In particular, it allows different types of users to find health web pages with the required language complexity levels. To this end, we first use the structured data available on the Web to classify health web pages based on different audience types, such as patients, clinicians, and medical researchers. Next, we evaluate the language complexity levels of the different web pages. Finally, we propose a mapping between the language complexity levels and the different audience types that allows us to provide different types of users, e.g., experts and non-experts, with web pages tailored in terms of language complexity.
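
    As an illustration of the first step, classifying pages by audience type from their structured data, the sketch below reads a JSON-LD block from a page and returns its declared audience. The HTML snippet, the schema.org audience markup, and the extraction by regular expression are illustrative assumptions, not the paper's implementation.

        # Toy sketch: read a page's JSON-LD structured data and classify the page
        # by its declared audience. Snippet and labels are invented.
        import json
        import re

        html = """
        <html><head>
        <script type="application/ld+json">
        {"@context": "https://schema.org",
         "@type": "MedicalWebPage",
         "audience": {"@type": "MedicalAudience", "audienceType": "Clinician"}}
        </script>
        </head><body>...</body></html>
        """

        def declared_audience(html_text):
            """Return the audienceType found in the page's JSON-LD, if any."""
            match = re.search(r'<script type="application/ld\+json">(.*?)</script>',
                              html_text, re.DOTALL)
            if not match:
                return None
            data = json.loads(match.group(1))
            audience = data.get("audience", {})
            return audience.get("audienceType")

        print(declared_audience(html))   # -> "Clinician"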