
    Entity Matching and Disambiguation Across Multiple Knowledge Graphs

    Knowledge graphs are considered an important representation that lies between free text on the one hand and fully structured relational data on the other. Knowledge graphs are a backbone of many applications on the Web. With the rise of many large-scale open-domain knowledge graphs like Freebase, DBpedia, and Yago, various applications including document retrieval, question answering, and data integration have come to rely on them. In this thesis, we are primarily interested in knowledge graphs from the perspective of integrating disparate heterogeneous sources, with an eye towards applications such as document retrieval and question answering. Integrating different knowledge graphs is very important for enriching the knowledge shared among them. The core part of this integration process is matching entities across the knowledge graphs. The biggest challenge to entity matching is ambiguity. The obvious solution is to make use of the graph structure and entity neighbourhoods for matching and disambiguating entities. We formalize the entity matching problem and present the first large-scale dataset, Ambiguous DBpedia-Wikidata, for this task based on existing cross-ontology links between DBpedia and Wikidata, focused on several hundred thousand ambiguous entities. We propose an entity matching framework that is capable of disambiguating entities across different knowledge graphs. The framework consists of a fuzzy string matcher and a graph embedding-based matcher. Using a classification-based approach, we find that a simple multi-layered perceptron based on representations derived from RDF2VEC graph embeddings of entities in each knowledge graph is sufficient to achieve high accuracy, with only limited training data. The contribution of our work is both a large dataset for examining this problem and strong baselines on which future work can be based. We also present SimpleDBpediaQA, a new benchmark dataset for simple question answering over knowledge graphs that was created by mapping SimpleQuestions entities and predicates from Freebase to DBpedia. We show how entity matching using manual annotations can be used for migrating datasets across knowledge graphs. Although this mapping is conceptually straightforward, there are a number of nuances that make the task non-trivial, owing to the different conceptual organizations of the two knowledge graphs. Finally, if manual annotations are scarce, we show how our entity matching framework can be used to generate free annotations to train our model and then use it for disambiguation. In that vein, we introduce SimpleQuestions++, a new question answering benchmark that has all questions linked to Freebase, DBpedia, and Wikidata.
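    As a rough illustration of the classification-based matcher described above, the sketch below trains a small multi-layer perceptron on concatenated entity embeddings. The entity names, embedding dimensionality, and random vectors are invented stand-ins for actual RDF2VEC representations, and scikit-learn's MLPClassifier stands in for whatever implementation the thesis uses.

```python
# Minimal sketch of an embedding-based entity matcher, assuming precomputed
# RDF2VEC-style vectors for each knowledge graph; names, dimensions, and
# candidate pairs are illustrative, not the thesis's actual configuration.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
dim = 64

# Hypothetical embeddings: one vector per entity in each knowledge graph.
dbpedia_emb = {"dbr:Paris": rng.normal(size=dim), "dbr:Paris_Texas": rng.normal(size=dim)}
wikidata_emb = {"wd:Q90": rng.normal(size=dim), "wd:Q830149": rng.normal(size=dim)}

# Labelled candidate pairs: (DBpedia entity, Wikidata entity, is_match).
pairs = [
    ("dbr:Paris", "wd:Q90", 1),
    ("dbr:Paris", "wd:Q830149", 0),
    ("dbr:Paris_Texas", "wd:Q830149", 1),
    ("dbr:Paris_Texas", "wd:Q90", 0),
]

# Represent each candidate pair by concatenating the two entity vectors.
X = np.array([np.concatenate([dbpedia_emb[d], wikidata_emb[w]]) for d, w, _ in pairs])
y = np.array([label for _, _, label in pairs])

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0).fit(X, y)
print(clf.predict(X))  # match / non-match decisions for the candidate pairs
```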

    Graph-Based Weakly-Supervised Methods for Information Extraction & Integration

    The variety and complexity of potentially related data resources available for querying --- webpages, databases, data warehouses --- has been growing ever more rapidly. There is a growing need to pose integrative queries across multiple such sources, exploiting foreign keys and other means of interlinking data to merge information from diverse sources. This has traditionally been the focus of research within the Information Extraction (IE) and Information Integration (II) communities, with IE focusing on converting unstructured sources into structured sources, and II focusing on providing a unified view of diverse structured data sources. However, most of the current IE and II methods, which can potentially be applied to the problem of integration across sources, require large amounts of human supervision, often in the form of annotated data. This need for extensive supervision makes existing methods expensive to deploy and difficult to maintain. In this thesis, we develop techniques that generalize from limited human input, via weakly-supervised methods for IE and II. In particular, we argue that graph-based representation of data and learning over such graphs can result in effective and scalable methods for large-scale Information Extraction and Integration. Within IE, we focus on the problem of assigning semantic classes to entities. First we develop a context pattern induction method to extend small initial entity lists of various semantic classes. We also demonstrate that features derived from such extended entity lists can significantly improve the performance of state-of-the-art discriminative taggers. The output of pattern-based class-instance extractors is often high-precision and low-recall in nature, which is inadequate for many real-world applications. We use Adsorption, a graph-based label propagation algorithm, to significantly increase the recall of an initial high-precision, low-recall pattern-based extractor by combining evidence from unstructured and structured text corpora. Building on Adsorption, we propose a new label propagation algorithm, Modified Adsorption (MAD), and demonstrate its effectiveness on various real-world datasets. Additionally, we show how class-instance acquisition performance in the graph-based SSL setting can be improved by incorporating additional semantic constraints available in independently developed knowledge bases. Within Information Integration, we develop a novel system, Q, which draws ideas from machine learning and databases to help a non-expert user construct data-integrating queries based on keywords (across databases) and interactive feedback on answers. We also present an information need-driven strategy for automatically incorporating new sources and their information in Q. We also demonstrate that Q's learning strategy is highly effective in combining the outputs of "black box" schema matchers and in re-weighting bad alignments. This removes the need to develop an expensive mediated schema, which has been necessary for most previous systems.
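    The following toy sketch illustrates the general idea behind graph-based propagation of class labels from a few seed instances, in the spirit of Adsorption/MAD but without their per-node weighting and regularization terms; the graph, seed labels, and number of sweeps are invented for illustration.

```python
# Toy label propagation over a small graph: seed nodes keep their class
# distributions, all other nodes repeatedly average their neighbours'.
from collections import defaultdict

edges = {  # undirected weighted graph: node -> {neighbour: weight}
    "apple": {"fruit_ctx": 1.0, "company_ctx": 1.0},
    "banana": {"fruit_ctx": 1.0},
    "microsoft": {"company_ctx": 1.0},
    "fruit_ctx": {"apple": 1.0, "banana": 1.0},
    "company_ctx": {"apple": 1.0, "microsoft": 1.0},
}
seeds = {"banana": {"FRUIT": 1.0}, "microsoft": {"COMPANY": 1.0}}

labels = {n: dict(seeds.get(n, {})) for n in edges}
for _ in range(20):  # fixed number of propagation sweeps
    new = {}
    for node, nbrs in edges.items():
        if node in seeds:            # seed labels are clamped
            new[node] = dict(seeds[node])
            continue
        acc, total = defaultdict(float), sum(nbrs.values())
        for nbr, w in nbrs.items():
            for cls, score in labels[nbr].items():
                acc[cls] += w * score / total
        new[node] = dict(acc)
    labels = new

print(labels["apple"])  # ends up with mass on both FRUIT and COMPANY
```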

    Privacy protection in context aware systems.

    Smartphones, loaded with users' personal information, are a primary computing device for many. The advent of 4G networks, IPv6, and an increased number of subscribers has prompted a host of application developers to develop software that is easy to install on mobile devices. During the application download process, users accept terms and conditions that permit the revelation of private information. The free application markets are sustainable because the revenue model for most of these service providers is based on profiling users and pushing advertisements to them. This creates a serious threat to users' privacy, and hence it is important that privacy protection mechanisms be in place. Most of the existing solutions falsify or modify the information in the service request and starve the developers of their revenue. In this dissertation, we attempt to bridge the gap by proposing a novel integrated CLOPRO framework (Context Cloaking Privacy Protection) that achieves Identity privacy, Context privacy, and Query privacy without depriving the service provider of the sustainable revenue made from CAPPA (Context Aware Privacy Preserving Advertising). Each service request has three parameters: identity, context, and the actual query. The CLOPRO framework reduces the risk of an adversary linking all three parameters. The main objective is to ensure that no single entity in the system has all the information about the user, the queries, or the link between them, even though the user gets the desired service in a viable time frame. The proposed comprehensive framework for privacy protection does not require the user to use a modified OS or the service provider to change the way an application developer designs and deploys the application, while at the same time protecting the revenue model of the service provider. The system consists of two non-colluding servers: one to process the location coordinates (Location server) and the other to process the original query (Query server). This approach makes several inherent algorithmic and research contributions. First, we propose a formal definition of privacy and of the attack. We identify and formalize that privacy is protected if the transformation functions used are non-invertible. Second, we propose the use of clustering of every component of the service request to provide anonymity to the user. We use a unique encrypted identity for every service request and a unique id for each cluster of users, which ensures Identity privacy. We have designed a Split Clustering Anonymization Algorithm (SCAA) that consists of two algorithms: the Location Anonymization Algorithm (LAA) and the Query Anonymization Algorithm (QAA). The application of LAA replaces the actual location of the users in a cluster with the centroid of the location coordinates of all users in that cluster to achieve Location privacy. The time of initiation of the query is not part of the message string sent to the service provider, although it is used for identifying timed-out requests. Thus, Context privacy is achieved. To ensure Query privacy, generic queries (created using QAA) are used that cover the set of possible queries, based on the feature variations between the queries. The proposed CLOPRO framework associates the ads/coupons relevant to the generic query and the location of the users, and they are sent to the user along with the result without revealing the actual user, the initiation time of the query, the location, or the query to the service provider. Lastly, we introduce the use of caching in query processing to improve the response time for repetitive queries; the Query processing server caches the query result. We use multiple approaches to prove that privacy is preserved in the CLOPRO system. We demonstrate, using the properties of the transformation functions and also using graph-theoretic approaches, that the user's Identity, Context, and Query are protected against the curious-but-honest adversary attack, fake query attacks, and replay attacks with the use of the CLOPRO framework. The proposed system not only provides 'k' anonymity, but also satisfies the <k; s> and <k; T> anonymity properties required for privacy protection. The complexity of our proposed algorithm is O(n).
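    A minimal sketch of the centroid-based location cloaking idea behind LAA is given below; the grouping strategy, cluster size k, and coordinates are illustrative assumptions, not the algorithm as specified in the dissertation.

```python
# Sketch of centroid-based location cloaking: every user in a cluster of at
# least k users reports the cluster centroid instead of their true position.
import random

def cloak_locations(points, k):
    """Group points into clusters of size >= k and replace each point
    with its cluster's centroid."""
    pts = sorted(points)                       # crude spatial grouping by sort order
    clusters = [pts[i:i + k] for i in range(0, len(pts), k)]
    if len(clusters) > 1 and len(clusters[-1]) < k:
        clusters[-2].extend(clusters.pop())    # merge an undersized tail cluster
    cloaked = {}
    for cluster in clusters:
        cx = sum(x for x, _ in cluster) / len(cluster)
        cy = sum(y for _, y in cluster) / len(cluster)
        for p in cluster:
            cloaked[p] = (cx, cy)
    return cloaked

random.seed(0)
users = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(7)]
print(cloak_locations(users, k=3))             # each true location maps to a centroid
```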

    Scalable Data Integration for Linked Data

    Linked Data describes an extensive set of structured but heterogeneous data sources where entities are connected by formal semantic descriptions. In the vision of the Semantic Web, these semantic links are extended towards the World Wide Web to provide as much machine-readable data as possible for search queries. The resulting connections allow an automatic evaluation to find new insights into the data. Identifying these semantic connections between two data sources with automatic approaches is called link discovery. We derive common requirements and a generic link discovery workflow based on similarities between entity properties and associated properties of ontology concepts. Most of the existing link discovery approaches disregard the fact that in times of Big Data, an increasing volume of data sources poses new demands on link discovery. In particular, the problem of complex and time-consuming link determination escalates with an increasing number of intersecting data sources. To overcome the restriction of pairwise linking of entities, holistic clustering approaches are needed to link equivalent entities of multiple data sources to construct integrated knowledge bases. In this context, the focus on efficiency and scalability is essential. For example, reusing existing links or background information can help to avoid redundant calculations. However, when dealing with multiple data sources, additional data quality problems must also be dealt with. This dissertation addresses these comprehensive challenges by designing holistic linking and clustering approaches that enable reuse of existing links. Unlike previous systems, we execute the complete data integration workflow via a distributed processing system. At first, the LinkLion portal will be introduced to provide existing links for new applications. These links act as a basis for a physical data integration process to create a unified representation for equivalent entities from many data sources. We then propose a holistic clustering approach to form consolidated clusters for the same real-world entities from many different sources. At the same time, we exploit the semantic type of entities to improve the quality of the result. The process identifies errors in existing links and can find numerous additional links. Additionally, the entity clustering has to react to the high dynamics of the data. In particular, this requires scalable approaches for continuously growing data sources with many entities as well as additional new sources. Previous entity clustering approaches are mostly static, focusing on the one-time linking and clustering of entities from few sources. Therefore, we propose and evaluate new approaches for incremental entity clustering that support the continuous addition of new entities and data sources. To cope with the ever-increasing number of Linked Data sources, efficient and scalable methods based on distributed processing systems are required. Thus we propose distributed holistic approaches to link many data sources based on a clustering of entities that represent the same real-world object. The implementation is realized on Apache Flink. In contrast to previous approaches, we utilize efficiency-enhancing optimizations for both distributed static and dynamic clustering. An extensive comparative evaluation of the proposed approaches with various distributed clustering strategies shows high effectiveness for datasets from multiple domains as well as scalability on a multi-machine Apache Flink cluster.
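    The sketch below illustrates one ingredient of such holistic clustering: merging entities from several sources into clusters via union-find over existing equivalence links, while using semantic types to reject likely erroneous links. The entity identifiers, types, and links are fabricated examples, and the distributed Flink-based implementation is not shown.

```python
# Union-find over existing links, with a type check to filter bad links.
from collections import defaultdict

types = {
    "dbp:Leipzig": "City", "wd:Q2079": "City", "geo:6559171": "City",
    "dbp:Leipzig_University": "Organisation",
}
links = [
    ("dbp:Leipzig", "wd:Q2079"),
    ("wd:Q2079", "geo:6559171"),
    ("dbp:Leipzig", "dbp:Leipzig_University"),   # type conflict -> rejected
]

parent = {e: e for e in types}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]            # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for a, b in links:
    if types[a] == types[b]:                     # exploit semantic type to reject bad links
        union(a, b)

clusters = defaultdict(set)
for e in types:
    clusters[find(e)].add(e)
print(list(clusters.values()))                   # one cluster per real-world entity
```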

    Doctor of Philosophy

    The explosion of structured Web data (e.g., online databases, Wikipedia infoboxes) creates many opportunities for integrating and querying these data that go far beyond the simple search capabilities provided by search engines. Although much work has been devoted to data integration in the database community, the Web brings new challenges: Web scale (e.g., the large and growing volume of data) and the heterogeneity in Web data. Because there is so much data, scalable techniques that require little or no manual intervention and that are robust to noisy data are needed. In this dissertation, we propose a new and effective approach for matching Web-form interfaces and for matching multilingual Wikipedia infoboxes. As a further step toward these problems, we propose a general prudent schema-matching framework that matches a large number of schemas effectively. Our comprehensive experiments for Web-form interfaces and Wikipedia infoboxes show that it can enable on-the-fly, automatic integration of large collections of structured Web data. Another problem we address in this dissertation is schema discovery. While existing integration approaches assume that the relevant data sources and their schemas have been identified in advance, schemas are not always available for structured Web data. Approaches exist that exploit information in Wikipedia to discover entity types and their associated schemas. However, due to inconsistencies, sparseness, and noise from community contributions, these approaches are error prone and require substantial human intervention. Given the schema heterogeneity in Wikipedia infoboxes, we developed a new approach that uses the structured information available in infoboxes to cluster similar infoboxes and infer the schemata for entity types. Our approach is unsupervised and resilient to the unpredictable skew in the entity class distribution. Our experiments, using over one hundred thousand infoboxes extracted from Wikipedia, indicate that our approach is effective and produces accurate schemata for Wikipedia entities.
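    A simplified sketch of clustering infoboxes by attribute overlap is shown below; the attribute sets, the Jaccard threshold, and the greedy grouping strategy are illustrative assumptions rather than the dissertation's actual method.

```python
# Group infoboxes whose attribute sets overlap strongly; the union of a
# cluster's attributes serves as the inferred schema for that entity type.
def jaccard(a, b):
    return len(a & b) / len(a | b)

infoboxes = {
    "Infobox_city_A": {"name", "country", "population", "mayor"},
    "Infobox_city_B": {"name", "country", "population", "area"},
    "Infobox_person": {"name", "birth_date", "occupation"},
}

threshold = 0.5
clusters = []
for box, attrs in infoboxes.items():
    for cluster in clusters:
        if jaccard(attrs, cluster["schema"]) >= threshold:
            cluster["members"].append(box)
            cluster["schema"] |= attrs        # grow the inferred schema
            break
    else:                                     # no sufficiently similar cluster found
        clusters.append({"members": [box], "schema": set(attrs)})

for c in clusters:
    print(c["members"], sorted(c["schema"]))
```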

    Graph-Based ETL Processes For Warehousing Statistical Open Data

    Warehousing is a promising means to cross and analyse Statistical Open Data (SOD). But extracting structures, integrating, and defining multidimensional schemas from several scattered and heterogeneous tables in the SOD are major problems challenging traditional ETL (Extract-Transform-Load) processes. In this paper, we present a three-step ETL process that relies on RDF graphs to address all these problems. In the first step, we automatically extract table structures and values using a table anatomy ontology. This phase converts structurally heterogeneous tables into a unified RDF graph representation. The second step performs a holistic integration of several semantically heterogeneous RDF graphs. The optimal integration is performed through an Integer Linear Program (ILP). In the third step, the system interacts with users to incrementally transform the integrated RDF graph into a multidimensional schema.
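    The toy sketch below illustrates the spirit of the first step, lifting a heterogeneous statistical table into a graph of triples; the ex: predicates and the sample table are invented, whereas the paper relies on a dedicated table anatomy ontology.

```python
# Turn a small statistical table into RDF-like triples describing its cells.
table = {
    "title": "Unemployment rate by region",
    "header": ["Region", "2013", "2014"],
    "rows": [["North", "7.1", "6.8"], ["South", "9.4", "9.0"]],
}

triples = []
for r, row in enumerate(table["rows"]):
    row_node = f"_:row{r}"
    triples.append((row_node, "ex:partOf", table["title"]))
    for c, value in enumerate(row):
        cell_node = f"_:cell{r}_{c}"
        triples.append((cell_node, "ex:inRow", row_node))
        triples.append((cell_node, "ex:inColumn", table["header"][c]))
        triples.append((cell_node, "ex:value", value))

for t in triples[:6]:
    print(t)   # a uniform graph view of the table's structure and values
```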

    Semantic Systems. The Power of AI and Knowledge Graphs

    This open access book constitutes the refereed proceedings of the 15th International Conference on Semantic Systems, SEMANTiCS 2019, held in Karlsruhe, Germany, in September 2019. The 20 full papers and 8 short papers presented in this volume were carefully reviewed and selected from 88 submissions. They cover topics such as: web semantics and linked (open) data; machine learning and deep learning techniques; semantic information management and knowledge integration; terminology, thesaurus and ontology management; data mining and knowledge discovery; semantics in blockchain and distributed ledger technologies.

    A Pure Embedding of Roles: Exploring 4-dimensional Dispatch for Roles in Structured Contexts

    Present-day software systems have to fulfill an increasing number of requirements, which makes them more and more complex. Many systems need to anticipate changing contexts or need to adapt to changing business rules or requirements. The challenge of 21st-century software development will be to cope with these aspects. We believe that the role concept offers a simple way to adapt an object-oriented program to its changing context. In a role-based application, an object plays multiple roles during its lifetime. If the contexts are represented as first-class entities, they provide dynamic views of the object-oriented program, and if a context changes, the dynamic views can be switched easily, and the software system adapts automatically. However, although the concepts of roles and dynamic contexts have been discussed for a long time in many areas of computer science, their employment in an existing object-oriented language has so far required a specific runtime environment. Also, classical object-oriented languages and their runtime systems are not able to cope with essential role-specific features, such as true delegation or dynamic binding of roles. In addition to that, contexts and views seem to be important in software development. The traditional code-oriented approach to software engineering becomes less and less satisfactory. The support for multiple views of a software system scales much better to the needs of today's systems. However, it relies on programming languages to provide roles for the construction of views. As a solution, this thesis presents an implementation pattern for role-playing objects that does not require a specific runtime system, the SCala ROles Language (SCROLL). Via this library approach, roles are embedded in a statically typed base language as dynamically evolving objects. The approach is pure in the sense that there is no need for an additional compiler or tooling. The implementation pattern is demonstrated on the basis of the Scala language. As technical support from Scala, the pattern requires dynamic mixins, compiler-translated function calls, and implicit conversions. The details of how roles are implemented are hidden in a Scala library and are therefore transparent to SCROLL programmers. The SCROLL library supports roles embedded in structured contexts. Additionally, a four-dimensional, context-aware dispatch at runtime is presented; it overcomes the subtle ambiguities introduced by the rich semantics of role-playing objects. SCROLL is written in Scala, which blends a modern object-oriented language with a functional programming language. The size of the library is below 1,400 lines of code, so it can be considered to have a minimalistic design and to be easy to maintain. Our approach solves several practical problems arising in the area of dynamic extensibility and adaptation.
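    As a language-neutral illustration of the role concept (not SCROLL's actual Scala API), the Python sketch below attaches roles to an object at runtime and dispatches unknown calls to the roles it currently plays; all class and method names are invented.

```python
# Conceptual analogue of role-playing objects: a core object gains and sheds
# behaviour by playing roles, and method calls are dispatched to those roles.
class Person:
    def __init__(self, name):
        self.name = name
        self.roles = []

    def play(self, role):
        self.roles.append(role)
        return self

    def __getattr__(self, item):
        # Dispatch unknown attribute lookups to the most recently bound role first.
        for role in reversed(self.roles):
            if hasattr(role, item):
                return getattr(role, item)
        raise AttributeError(item)

class Employee:
    def work(self):
        return "working"

class Customer:
    def buy(self, item):
        return f"buying {item}"

p = Person("Ada").play(Employee()).play(Customer())
print(p.work(), p.buy("book"))   # behaviour depends on the roles currently played
```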

    Algorithms for Context-Aware Trajectory Analysis
