
    From Appearance to Essence: Comparing Truth Discovery Methods without Using Ground Truth

    Full text link
    Truth discovery has been widely studied in recent years as a fundamental means for resolving conflicts in multi-source data. Although many truth discovery methods have been proposed based on different considerations and intuitions, investigations show that no single method consistently outperforms the others. To select the right truth discovery method for a specific application scenario, it becomes essential to evaluate and compare the performance of different methods. A drawback of current research efforts is that they commonly assume the availability of certain ground truth for the evaluation of methods. However, the ground truth may be very limited or even out of reach in practice, rendering the evaluation biased by the small amount of ground truth or even infeasible. In this paper, we present CompTruthHyp, a general approach for comparing the performance of truth discovery methods without using ground truth. In particular, our approach calculates the probability of the observations in a dataset based on the output of different methods. The methods are then ranked by this probability to reflect their performance. We review and compare twelve existing truth discovery methods and consider both single-valued and multi-valued objects. Empirical studies on both real-world and synthetic datasets demonstrate the effectiveness of our approach for comparing truth discovery methods.
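
    A minimal, hypothetical sketch of the ranking idea described above (not the CompTruthHyp model itself): each method is scored by the log-likelihood that its inferred truths and estimated source accuracies assign to the observed claims, and methods are then ranked by that score. All names and the fallback prior of 0.5 are illustrative assumptions.

        import math

        def rank_methods_by_observation_likelihood(claims, method_outputs):
            """claims: list of (source, object, value) observations.
            method_outputs: {method_name: (truths, source_accuracy)}, where truths maps
            object -> inferred true value and source_accuracy maps source -> estimated
            probability that the source is correct. Returns methods ranked best first."""
            scores = {}
            for name, (truths, accuracy) in method_outputs.items():
                log_lik = 0.0
                for source, obj, value in claims:
                    p = accuracy.get(source, 0.5)  # uninformative fallback (assumption)
                    # a claim agreeing with the inferred truth is explained with prob p,
                    # a disagreeing claim with prob (1 - p); clamp to avoid log(0)
                    prob = p if truths.get(obj) == value else 1.0 - p
                    log_lik += math.log(max(prob, 1e-12))
                scores[name] = log_lik
            return sorted(scores, key=scores.get, reverse=True)

        # toy usage
        claims = [("s1", "o1", "a"), ("s2", "o1", "b"), ("s1", "o2", "c")]
        outputs = {
            "method_A": ({"o1": "a", "o2": "c"}, {"s1": 0.9, "s2": 0.6}),
            "method_B": ({"o1": "b", "o2": "c"}, {"s1": 0.5, "s2": 0.5}),
        }
        print(rank_methods_by_observation_likelihood(claims, outputs))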

    Crowdsourced Truth Discovery in the Presence of Hierarchies for Knowledge Fusion

    Full text link
    Existing works for truth discovery in categorical data usually assume that claimed values are mutually exclusive and that only one among them is correct. However, many claimed values are not mutually exclusive, even for functional predicates, due to their hierarchical structures. Thus, we need to consider the hierarchical structure to effectively estimate the trustworthiness of the sources and infer the truths. We propose a probabilistic model to utilize the hierarchical structures and an inference algorithm to find the truths. In addition, in knowledge fusion, the step of automatically extracting information from unstructured data (e.g., text) generates many false claims. To take advantage of human cognitive abilities in understanding unstructured data, we utilize crowdsourcing to refine the results of the truth discovery. We propose a task assignment algorithm to maximize the accuracy of the inferred truths. A performance study with real-life datasets confirms the effectiveness of our truth inference and task assignment algorithms.
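
    To make the hierarchy intuition concrete, here is a minimal, hypothetical sketch (not the paper's probabilistic model or its crowdsourcing component): a claim is treated as supporting a candidate value if it equals that candidate or is one of its ancestors in the value hierarchy, so the most specific value consistent with the claims wins. The toy hierarchy is an illustrative assumption.

        def ancestors(node, parent):
            """Return the node together with all of its ancestors in the hierarchy."""
            result = {node}
            while node in parent:
                node = parent[node]
                result.add(node)
            return result

        def hierarchical_vote(claims, parent):
            """claims: claimed values for one object; parent: child -> parent edges.
            A claim supports a candidate if it is the candidate or an ancestor of it,
            so e.g. a claim of 'USA' does not contradict 'New_York_City'."""
            best, best_support = None, -1
            for candidate in set(claims):
                support = sum(1 for c in claims if c in ancestors(candidate, parent))
                if support > best_support:
                    best, best_support = candidate, support
            return best

        # toy hierarchy: New_York_City -> New_York_State -> USA
        parent = {"New_York_City": "New_York_State", "New_York_State": "USA"}
        print(hierarchical_vote(["New_York_City", "USA", "New_York_City", "USA"], parent))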

    MedTruth: A Semi-supervised Approach to Discovering Knowledge Condition Information from Multi-Source Medical Data

    Full text link
    A Knowledge Graph (KG) contains entities and the relations between them. Due to its representation ability, KG has been successfully applied to support many medical/healthcare tasks. However, in the medical domain, knowledge holds only under certain conditions. For example, the symptom "runny nose" highly indicates the presence of the disease "whooping cough" when the patient is a baby rather than a person of another age. Such conditions for medical knowledge are crucial for decision-making in various medical applications, yet they are missing in existing medical KGs. In this paper, we aim to discover medical knowledge conditions from texts to enrich KGs. Electronic Medical Records (EMRs) are systematized collections of clinical data and contain detailed information about patients, so EMRs can be a good resource for discovering medical knowledge conditions. Unfortunately, the amount of available EMRs is limited due to reasons such as regulation. Meanwhile, a large amount of medical question answering (QA) data is available, which can greatly help the studied task. However, the quality of medical QA data is quite diverse, which may degrade the quality of the discovered medical knowledge conditions. In light of these challenges, we propose a new truth discovery method, MedTruth, for medical knowledge condition discovery, which incorporates prior source quality information into the source reliability estimation procedure and also utilizes knowledge triple information for trustworthy information computation. We conduct a series of experiments on real-world medical datasets to demonstrate that the proposed method can discover meaningful and accurate conditions for medical knowledge by leveraging both EMR and QA data. Further, the proposed method is tested on synthetic datasets to validate its effectiveness under various scenarios. Comment: Accepted as a CIKM 2019 long paper.
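
    The idea of folding prior source quality into reliability estimation can be sketched with a generic weighted-voting loop; this is an illustrative approximation, not the MedTruth model, and the prior weights and pseudo-count strength are assumptions.

        def truth_discovery_with_priors(claims, prior_quality, iterations=10, prior_weight=5.0):
            """claims: {object: {source: claimed_value}}.
            prior_quality: {source: prior reliability in (0, 1)}, e.g. higher for
            EMR-derived sources than for QA-derived ones. Returns (truths, reliability)."""
            reliability = dict(prior_quality)
            truths = {}
            for _ in range(iterations):
                # 1) pick, per object, the value with the highest reliability-weighted vote
                for obj, by_source in claims.items():
                    votes = {}
                    for source, value in by_source.items():
                        votes[value] = votes.get(value, 0.0) + reliability.get(source, 0.5)
                    truths[obj] = max(votes, key=votes.get)
                # 2) update reliability as a prior-smoothed agreement rate with current truths
                for source, prior in prior_quality.items():
                    agree = total = 0
                    for obj, by_source in claims.items():
                        if source in by_source:
                            total += 1
                            agree += int(by_source[source] == truths[obj])
                    # pseudo-counts keep low-volume sources close to their prior quality
                    reliability[source] = (agree + prior_weight * prior) / (total + prior_weight)
            return truths, reliability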

    TXtract: Taxonomy-Aware Knowledge Extraction for Thousands of Product Categories

    Full text link
    Extracting structured knowledge from product profiles is crucial for various applications in e-Commerce. State-of-the-art approaches for knowledge extraction were each designed for a single category of product, and thus do not apply to real-life e-Commerce scenarios, which often contain thousands of diverse categories. This paper proposes TXtract, a taxonomy-aware knowledge extraction model that applies to thousands of product categories organized in a hierarchical taxonomy. Through category-conditional self-attention and multi-task learning, our approach is both scalable, as it trains a single model for thousands of categories, and effective, as it extracts category-specific attribute values. Experiments on products from a taxonomy with 4,000 categories show that TXtract outperforms state-of-the-art approaches by up to 10% in F1 and 15% in coverage across all categories. Comment: Accepted to ACL 2020 (Long Paper).
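
    As a hypothetical illustration of conditioning attention on a product category (a simplified stand-in, not the actual TXtract architecture or its multi-task heads), the sketch below adds a learned category embedding to the attention queries so that different categories attend to a token sequence differently. Dimensions and the conditioning scheme are assumptions.

        import torch
        import torch.nn as nn

        class CategoryConditionalAttention(nn.Module):
            """Self-attention whose queries are shifted by a category embedding."""
            def __init__(self, num_categories, dim):
                super().__init__()
                self.cat_emb = nn.Embedding(num_categories, dim)
                self.q = nn.Linear(dim, dim)
                self.k = nn.Linear(dim, dim)
                self.v = nn.Linear(dim, dim)

            def forward(self, tokens, category_id):
                # tokens: (batch, seq_len, dim); category_id: (batch,)
                cat = self.cat_emb(category_id).unsqueeze(1)        # (batch, 1, dim)
                q = self.q(tokens + cat)                            # category-conditioned queries
                k, v = self.k(tokens), self.v(tokens)
                scores = torch.softmax(q @ k.transpose(-2, -1) / tokens.size(-1) ** 0.5, dim=-1)
                return scores @ v                                   # (batch, seq_len, dim)

        # toy usage: a batch of 2 token sequences from two different categories
        layer = CategoryConditionalAttention(num_categories=4000, dim=64)
        out = layer(torch.randn(2, 12, 64), torch.tensor([3, 17]))  # -> (2, 12, 64)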

    Temporal graph-based clustering for historical record linkage

    Full text link
    Research in the social sciences is increasingly based on large and complex data collections, where individual data sets from different domains are linked and integrated to allow advanced analytics. Popular types of data used in such a context are historical censuses, as well as birth, death, and marriage certificates. Individually, however, such data sets limit the types of studies that can be conducted. Specifically, it is impossible to track individuals, families, or households over time. Once such data sets are linked and family trees spanning several decades are available, it is possible, for example, to investigate how education, health, mobility, employment, and social status influence each other and the lives of people over two or even more generations. A major challenge, however, is the accurate linkage of historical data sets, owing to data quality issues and, commonly, the lack of available ground truth data. Unsupervised techniques need to be employed, which can be based on similarity graphs generated by comparing individual records. In this paper we present initial results from clustering birth records from Scotland, where we aim to identify all births by the same mother and group siblings into clusters. We extend an existing clustering technique for record linkage by incorporating temporal constraints that must hold between births by the same mother, and we propose a novel greedy temporal clustering technique. Experimental results show improvements over non-temporal approaches; however, further work is needed to obtain links of high quality.
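
    A minimal sketch of greedy clustering with temporal constraints, under assumed thresholds (the constraints, thresholds, and algorithmic details in the paper differ): record pairs are merged in decreasing order of similarity, but only if every pair of births in the merged cluster is at least roughly nine months apart and the whole cluster fits in a plausible fertility window.

        MIN_GAP_DAYS = 270         # assumption: consecutive births by one mother ~9 months apart
        MAX_SPAN_DAYS = 365 * 25   # assumption: all births fall within a ~25-year window

        def temporally_consistent(birth_dates):
            dates = sorted(birth_dates)
            gaps_ok = all(b - a >= MIN_GAP_DAYS for a, b in zip(dates, dates[1:]))
            return gaps_ok and (dates[-1] - dates[0] <= MAX_SPAN_DAYS)

        def greedy_temporal_clustering(records, scored_pairs):
            """records: {record_id: birth date in days since some epoch}.
            scored_pairs: [(similarity, id1, id2)] from pairwise record comparison.
            Most similar pairs are merged first, provided the merged cluster still
            satisfies the temporal constraints above."""
            cluster_of = {r: {r} for r in records}
            for _, a, b in sorted(scored_pairs, reverse=True):
                ca, cb = cluster_of[a], cluster_of[b]
                if ca is cb:
                    continue
                merged = ca | cb
                if temporally_consistent([records[r] for r in merged]):
                    for r in merged:
                        cluster_of[r] = merged
            # return the distinct clusters
            return list({id(c): sorted(c) for c in cluster_of.values()}.values())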

    Integration of Probabilistic Uncertain Information

    Full text link
    We study the problem of data integration from sources that contain probabilistic uncertain information. Data is modeled by possible worlds with a probability distribution, compactly represented in the probabilistic relation model. Integration is achieved efficiently using the extended probabilistic relation model. We study the problem of determining the probability distribution of the integration result. It has been shown that, in general, only probability ranges can be determined for the result of integration. In this paper we concentrate on a subclass of extended probabilistic relations, those that are obtainable through integration. We show that under intuitive and reasonable assumptions we can determine the exact probability distribution of the result of integration. Comment: 30 pages.
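
    The fact that, in general, only probability ranges can be determined is easy to illustrate with the classical Fréchet bounds (a generic illustration, not the paper's extended probabilistic relation model): if two sources assert a tuple with marginal probabilities p1 and p2, the probability that both assertions hold jointly is bounded but not pinned down unless extra assumptions, such as independence, are made.

        def frechet_bounds(p1, p2):
            """Bounds on P(both) when only the marginals p1, p2 are known:
            max(0, p1 + p2 - 1) <= P(both) <= min(p1, p2).
            Under an additional independence assumption it collapses to p1 * p2."""
            return max(0.0, p1 + p2 - 1.0), min(p1, p2)

        print(frechet_bounds(0.7, 0.8))   # (0.5, 0.7); with independence it would be 0.56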

    Restricted Boltzmann Machines for Robust and Fast Latent Truth Discovery

    Full text link
    We address the problem of latent truth discovery, LTD for short, where the goal is to discover the underlying true values of entity attributes in the presence of noisy, conflicting, or incomplete information. Despite the multitude of algorithms addressing the LTD problem that can be found in the literature, little is known about their overall performance with respect to effectiveness (in terms of truth discovery capabilities), efficiency, and robustness. A practical LTD approach should satisfy all these characteristics so that it can be applied to heterogeneous datasets of varying quality and degrees of cleanliness. We propose a novel algorithm for LTD that satisfies the above requirements. The proposed model is based on Restricted Boltzmann Machines and is thus coined LTD-RBM. In extensive experiments on various heterogeneous and publicly available datasets, LTD-RBM is superior to state-of-the-art LTD techniques in terms of an overall consideration of effectiveness, efficiency, and robustness.
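
    For readers unfamiliar with the underlying model class, the sketch below is a plain binary Restricted Boltzmann Machine trained with one step of contrastive divergence (CD-1); it is a generic illustration and does not reproduce LTD-RBM's specific encoding of sources and claims.

        import numpy as np

        rng = np.random.default_rng(0)

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        class RBM:
            """Binary RBM with visible units v and hidden units h, trained by CD-1."""
            def __init__(self, n_visible, n_hidden, lr=0.05):
                self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
                self.b_v = np.zeros(n_visible)
                self.b_h = np.zeros(n_hidden)
                self.lr = lr

            def _sample(self, p):
                return (rng.random(p.shape) < p).astype(float)

            def cd1_step(self, v0):
                # positive phase
                p_h0 = sigmoid(v0 @ self.W + self.b_h)
                h0 = self._sample(p_h0)
                # negative phase: one Gibbs step
                v1 = self._sample(sigmoid(h0 @ self.W.T + self.b_v))
                p_h1 = sigmoid(v1 @ self.W + self.b_h)
                # gradient updates
                self.W += self.lr * (v0.T @ p_h0 - v1.T @ p_h1) / len(v0)
                self.b_v += self.lr * (v0 - v1).mean(axis=0)
                self.b_h += self.lr * (p_h0 - p_h1).mean(axis=0)

        # toy training: each visible vector could, for instance, encode sources' claims
        data = rng.integers(0, 2, size=(32, 10)).astype(float)
        rbm = RBM(n_visible=10, n_hidden=4)
        for _ in range(100):
            rbm.cd1_step(data)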

    ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages

    Full text link
    In many documents, such as semi-structured webpages, textual semantics are augmented with additional information conveyed using visual elements including layout, font size, and color. Prior work on information extraction from semi-structured websites has required learning an extraction model specific to a given template via either manually labeled or distantly supervised data from that template. In this work, we propose a solution for "zero-shot" open-domain relation extraction from webpages with a previously unseen template, including from websites with little overlap with existing sources of knowledge for distant supervision and websites in entirely new subject verticals. Our model uses a graph neural network-based approach to build a rich representation of text fields on a webpage and the relationships between them, enabling generalization to new templates. Experiments show this approach provides a 31% F1 gain over a baseline for zero-shot extraction in a new subject vertical. Comment: Accepted to ACL 2020.
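
    A hypothetical, highly simplified illustration of message passing over a graph of webpage text fields (not the ZeroShotCeres architecture): each field's representation is updated from its own features and the mean of its neighbours' features, where edges encode layout/DOM relationships. Shapes and weights are illustrative assumptions.

        import numpy as np

        def gnn_layer(node_feats, edges, W_self, W_neigh):
            """node_feats: (n, d) text-field representations; edges: list of (src, dst)
            pairs expressing relationships between fields. One mean-aggregation step."""
            n, _ = node_feats.shape
            agg = np.zeros_like(node_feats)
            deg = np.zeros(n)
            for src, dst in edges:
                agg[dst] += node_feats[src]
                deg[dst] += 1
            deg[deg == 0] = 1.0                    # isolated nodes keep a zero message
            agg /= deg[:, None]
            return np.maximum(node_feats @ W_self + agg @ W_neigh, 0.0)  # ReLU

        # toy usage: 4 text fields with 8-dim features, connected by layout adjacency
        rng = np.random.default_rng(1)
        feats = rng.standard_normal((4, 8))
        edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
        out = gnn_layer(feats, edges, rng.standard_normal((8, 8)), rng.standard_normal((8, 8)))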

    Limiting the Spread of Fake News on Social Media Platforms by Evaluating Users' Trustworthiness

    Full text link
    Today's social media platforms allow both authentic and fake news to spread very quickly. Some approaches have been proposed to automatically detect such "fake" news based on their content, but it is difficult to agree on universal criteria of authenticity (which can be bypassed by adversaries once known). Besides, it is obviously impossible to have each news item checked by a human. In this paper, we propose a mechanism to limit the spread of fake news that is not based on content. It can be implemented as a plugin on a social media platform. The principle is as follows: a team of fact-checkers reviews a small number of news items (the most popular ones), which makes it possible to estimate each user's inclination to share fake news items. Then, using a Bayesian approach, we estimate the trustworthiness of future news items and treat accordingly those that pass a certain "untrustworthiness" threshold. We then evaluate the effectiveness and overhead of this technique on a large Twitter graph. We show that having a few thousand users exposed to a given news item is enough to reach a very precise estimate of its reliability. We thus identify more than 99% of fake news items with no false positives. The performance impact is very small: the induced overhead on the 90th percentile latency is less than 3%, and less than 8% on the throughput of user operations. Comment: 10 pages, 9 figures.
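
    A minimal sketch of the general Bayesian idea (not the paper's exact model; the Beta prior, the naive independence assumption across sharers, and the threshold are all illustrative): each user's propensity to share fake items is estimated from the fact-checked items they shared, and a new item's "untrustworthiness" is scored by aggregating log-likelihood ratios over the users who shared it.

        import math

        def user_fake_rates(shares, labels, alpha=1.0, beta=1.0):
            """shares: {user: set of item_ids the user shared}.
            labels: {item_id: True if fact-checkers marked the item as fake}.
            Returns each user's posterior-mean probability of sharing a fake item,
            using a Beta(alpha, beta) prior so sparsely observed users stay near the prior."""
            rates = {}
            for user, items in shares.items():
                labeled = [labels[i] for i in items if i in labels]
                rates[user] = (alpha + sum(labeled)) / (alpha + beta + len(labeled))
            return rates

        def item_fakeness_score(sharers, rates, prior_fake=0.5):
            """Naive-Bayes style log-odds that a new item is fake, given who shared it;
            the item would be treated as untrustworthy above some chosen threshold."""
            log_odds = math.log(prior_fake / (1.0 - prior_fake))
            for user in sharers:
                r = min(max(rates.get(user, 0.5), 1e-6), 1.0 - 1e-6)
                log_odds += math.log(r / (1.0 - r))
            return log_odds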

    Learning relationships between data obtained independently

    Full text link
    The aim of this paper is to provide a new method for learning the relationships between data that have been obtained independently. Unlike existing methods such as matching, the proposed technique does not require any contextual information, provided that the dependency between the variables of interest is monotone. It can therefore be easily combined with matching in order to exploit the advantages of both methods. The technique can be described as a mix of quantile matching and deconvolution. We provide both a theoretical and an empirical validation for it.
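
    The quantile-matching half of the technique can be sketched directly: under a monotone increasing dependence, the Y value associated with a given x is estimated as F_Y^{-1}(F_X(x)) using the two independent samples. This sketch omits the deconvolution step that the paper combines with it to handle noise; function names are illustrative.

        import numpy as np

        def quantile_match(x_samples, y_samples, x_query):
            """Estimate the Y value associated with x_query under a monotone increasing
            dependence, by mapping x_query's empirical quantile among the X samples onto
            the same quantile of the Y samples. The two samples need not be paired."""
            x_sorted = np.sort(x_samples)
            q = np.searchsorted(x_sorted, x_query, side="right") / len(x_sorted)
            return np.quantile(np.sort(y_samples), np.clip(q, 0.0, 1.0))

        # toy check with Y = 2X + 1, using independently drawn samples
        rng = np.random.default_rng(0)
        x = rng.normal(size=5000)
        y = 2 * rng.normal(size=5000) + 1
        print(quantile_match(x, y, 0.0))   # close to 1.0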