From Appearance to Essence: Comparing Truth Discovery Methods without Using Ground Truth
Truth discovery has been widely studied in recent years as a fundamental
means for resolving the conflicts in multi-source data. Although many truth
discovery methods have been proposed based on different considerations and
intuitions, investigations show that no single method consistently outperforms
the others. To select the right truth discovery method for a specific
application scenario, it becomes essential to evaluate and compare the
performance of different methods. A drawback of current research efforts is
that they commonly assume the availability of certain ground truth for the
evaluation of methods. However, the ground truth may be very limited or even
out of reach in practice, rendering the evaluation biased by the limited ground
truth, or even infeasible. In this paper, we present CompTruthHyp, a general
approach for comparing the performance of truth discovery methods without using
ground truth. In particular, our approach calculates the probability of
observations in a dataset based on the output of different methods. The
probability is then ranked to reflect the performance of these methods. We
review and compare twelve existing truth discovery methods and consider both
single-valued and multi-valued objects. Empirical studies on both real-world
and synthetic datasets demonstrate the effectiveness of our approach for
comparing truth discovery methods.
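The core idea can be illustrated with a small sketch (hypothetical data and a simplified likelihood, not CompTruthHyp's actual estimator): from each method's output, estimate how accurate every source would have to be, compute the likelihood of the observed claims under that output, and rank methods by this likelihood.

```python
# Hypothetical sketch: rank truth discovery methods by the likelihood that
# their outputs assign to the observed multi-source claims (no ground truth).
import math

# observations[source][object] = claimed value
observations = {
    "s1": {"o1": "a", "o2": "x"},
    "s2": {"o1": "a", "o2": "y"},
    "s3": {"o1": "b", "o2": "x"},
}

def log_likelihood(truths, domain_size=5):  # assume 5 possible values per object
    """Score one method's output: estimate each source's accuracy against the
    predicted truths, then sum the log-probability of every observation."""
    total = 0.0
    for claims in observations.values():
        correct = sum(truths[obj] == val for obj, val in claims.items())
        acc = (correct + 1) / (len(claims) + 2)  # smoothed source accuracy
        for obj, val in claims.items():
            # wrong claims are spread uniformly over the remaining values
            p = acc if truths[obj] == val else (1 - acc) / (domain_size - 1)
            total += math.log(p)
    return total

# Outputs of two hypothetical methods; the higher likelihood ranks first.
method_outputs = {
    "majority_voting": {"o1": "a", "o2": "x"},
    "other_method": {"o1": "b", "o2": "y"},
}
ranking = sorted(method_outputs, key=lambda m: log_likelihood(method_outputs[m]),
                 reverse=True)
print(ranking)  # ['majority_voting', 'other_method']
```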
Crowdsourced Truth Discovery in the Presence of Hierarchies for Knowledge Fusion
Existing works for truth discovery in categorical data usually assume that
claimed values are mutually exclusive and only one among them is correct.
However, many claimed values are not mutually exclusive even for functional
predicates due to their hierarchical structures. Thus, we need to consider the
hierarchical structure to effectively estimate the trustworthiness of the
sources and infer the truths. We propose a probabilistic model to utilize the
hierarchical structures and an inference algorithm to find the truths. In
addition, in knowledge fusion, the step of automatically extracting
information from unstructured data (e.g., text) generates many false
claims. To take advantage of human cognitive abilities in understanding
unstructured data, we utilize crowdsourcing to refine the result of the truth
discovery. We propose a task assignment algorithm to maximize the accuracy of
the inferred truths. The performance study with real-life datasets confirms the
effectiveness of our truth inference and task assignment algorithms.
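A minimal sketch of the hierarchy issue (hypothetical location data, not the paper's probabilistic model): rather than treating claimed values as mutually exclusive, a claim that is an ancestor of the estimated truth can receive partial credit when source trustworthiness is scored.

```python
# Hypothetical sketch: hierarchy-aware scoring of a claim against the estimated
# truth, giving partial credit to claims that are coarser but not contradictory.
parents = {"Manhattan": "New York City", "New York City": "New York State",
           "New York State": "USA"}

def ancestors(value):
    chain = []
    while value in parents:
        value = parents[value]
        chain.append(value)
    return chain

def claim_score(claimed, truth):
    if claimed == truth:
        return 1.0                      # exact match
    if claimed in ancestors(truth):
        return 0.5                      # coarser than the truth: partial credit
    return 0.0                          # contradicts the truth

print(claim_score("New York State", "Manhattan"))  # 0.5, not simply "wrong"
```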
MedTruth: A Semi-supervised Approach to Discovering Knowledge Condition Information from Multi-Source Medical Data
A Knowledge Graph (KG) contains entities and the relations between entities.
Due to its representation ability, KG has been successfully applied to support
many medical/healthcare tasks. However, in the medical domain, knowledge often
holds only under certain conditions. For example, the symptom \emph{runny nose}
strongly indicates the disease \emph{whooping cough} when the patient is an
infant rather than a person of another age. Such conditions for medical
knowledge are crucial for decision-making in various medical applications, but
they are missing from existing medical KGs. In this paper, we aim to discover
medical knowledge conditions from texts to enrich KGs.
Electronic Medical Records (EMRs) are systematized collections of clinical
data and contain detailed information about patients, so EMRs can be a good
resource for discovering medical knowledge conditions. Unfortunately, the amount
of available EMRs is limited for reasons such as regulation. Meanwhile, a
large amount of medical question answering (QA) data is available, which can
greatly help the studied task. However, the quality of medical QA data is quite
diverse, which may degrade the quality of the discovered medical knowledge
conditions. In light of these challenges, we propose a new truth discovery
method, MedTruth, for medical knowledge condition discovery, which incorporates
prior source quality information into the source reliability estimation
procedure, and also utilizes the knowledge triple information for trustworthy
information computation. We conduct a series of experiments on real-world medical
datasets to demonstrate that the proposed method can discover meaningful and
accurate conditions for medical knowledge by leveraging both EMR and QA data.
Further, the proposed method is tested on synthetic datasets to validate its
effectiveness under various scenarios.
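The role of prior source quality can be sketched roughly as follows (hypothetical data and weighting scheme, not MedTruth itself): weighted voting in which each source's reliability starts from a prior quality score, for instance trusting EMRs more than QA forums, and is then refined by agreement with the current truth estimates.

```python
# Hypothetical sketch (not MedTruth itself): iterative weighted voting where each
# source's reliability starts from a prior quality score and is then refined.
from collections import defaultdict

prior_quality = {"emr": 0.9, "qa_forum": 0.6}
claims = [  # (source, knowledge triple, claimed condition)
    ("emr", ("runny nose", "indicates", "whooping cough"), "patient is an infant"),
    ("qa_forum", ("runny nose", "indicates", "whooping cough"), "patient is an infant"),
    ("qa_forum", ("runny nose", "indicates", "whooping cough"), "any age"),
]

weights = dict(prior_quality)
for _ in range(3):                       # alternate truth and reliability updates
    votes = defaultdict(float)
    for src, triple, cond in claims:
        votes[(triple, cond)] += weights[src]
    truths, best = {}, {}
    for (triple, cond), w in votes.items():
        if w > best.get(triple, 0.0):
            best[triple], truths[triple] = w, cond
    for src in weights:                  # agreement with the current truths
        own = [(t, c) for s, t, c in claims if s == src]
        agree = sum(truths[t] == c for t, c in own) / len(own)
        weights[src] = 0.5 * prior_quality[src] + 0.5 * agree  # keep prior influence

print(truths)   # the condition "patient is an infant" wins
print(weights)  # EMR ends up more trusted than the QA forum
```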
TXtract: Taxonomy-Aware Knowledge Extraction for Thousands of Product Categories
Extracting structured knowledge from product profiles is crucial for various
applications in e-Commerce. State-of-the-art approaches for knowledge
extraction were each designed for a single category of product, and thus do not
apply to real-life e-Commerce scenarios, which often contain thousands of
diverse categories. This paper proposes TXtract, a taxonomy-aware knowledge
extraction model that applies to thousands of product categories organized in a
hierarchical taxonomy. Through category conditional self-attention and
multi-task learning, our approach is both scalable, as it trains a single model
for thousands of categories, and effective, as it extracts category-specific
attribute values. Experiments on products from a taxonomy with 4,000 categories
show that TXtract outperforms state-of-the-art approaches by up to 10% in F1
and 15% in coverage across all categories.
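One way to picture category conditional self-attention (an assumed simplification, not the published TXtract architecture) is to shift the attention queries by a category embedding, so that a single set of weights produces category-specific token representations.

```python
# Assumed simplification of category-conditional self-attention: queries are
# shifted by a category embedding, so one model yields category-specific outputs.
import numpy as np

def category_conditional_attention(tokens, category, rng=np.random.default_rng(0)):
    """tokens: (seq_len, d) token embeddings; category: (d,) category embedding."""
    d = tokens.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
    q = (tokens + category) @ Wq              # condition the queries on the category
    k, v = tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)  # softmax over the token positions
    return attn @ v                           # category-specific token representations

rng = np.random.default_rng(1)
out = category_conditional_attention(rng.random((5, 16)), rng.random(16))
print(out.shape)  # (5, 16)
```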
Temporal graph-based clustering for historical record linkage
Research in the social sciences is increasingly based on large and complex
data collections, where individual data sets from different domains are linked
and integrated to allow advanced analytics. A popular type of data used in such
a context are historical censuses, as well as birth, death, and marriage
certificates. Individually, however, such data sets limit the types of studies
that can be conducted. Specifically, it is impossible to track individuals,
families, or households over time. Once such data sets are linked and family
trees spanning several decades are available it is possible to, for example,
investigate how education, health, mobility, employment, and social status
influence each other and the lives of people over two or even more generations.
A major challenge, however, is the accurate linkage of historical data sets,
which is hindered by data quality issues and, commonly, by the lack of available
ground truth data. Unsupervised techniques therefore need to be employed, which can be based on
similarity graphs generated by comparing individual records. In this paper we
present initial results from clustering birth records from Scotland where we
aim to identify all births of the same mother and group siblings into clusters.
We extend an existing clustering technique for record linkage by incorporating
temporal constraints that must hold between births by the same mother, and
propose a novel greedy temporal clustering technique. Experimental results show
improvements over non-temporal approaches; however, further work is needed to
obtain links of high quality.
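The kind of temporal constraint involved can be illustrated with a small check (the thresholds are assumptions, not those used in the paper): a merge of two birth records into the same mother's cluster is rejected when the implied birth dates are biologically implausible.

```python
# Hypothetical sketch with assumed thresholds: a temporal plausibility check
# applied before merging a birth record into an existing mother's cluster.
from datetime import date

MIN_GAP_DAYS = 270       # consecutive births by one mother are at least ~9 months
                         # apart (twins would need a special case)
MAX_SPAN_YEARS = 40      # all births must fall within a plausible fertile period

def temporally_compatible(cluster_dates, candidate):
    dates = sorted(cluster_dates + [candidate])
    span_ok = (dates[-1] - dates[0]).days <= MAX_SPAN_YEARS * 365
    gaps_ok = all((b - a).days >= MIN_GAP_DAYS for a, b in zip(dates, dates[1:]))
    return span_ok and gaps_ok

cluster = [date(1861, 3, 2), date(1863, 7, 15)]
print(temporally_compatible(cluster, date(1863, 9, 1)))   # False: births too close
print(temporally_compatible(cluster, date(1865, 1, 10)))  # True
```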
Integration of Probabilistic Uncertain Information
We study the problem of data integration from sources that contain
probabilistic uncertain information. Data is modeled by possible worlds with a
probability distribution, compactly represented in the probabilistic relation
model. Integration is achieved efficiently using the extended probabilistic
relation model. We study the problem of determining the probability
distribution of the integration result. It has been shown that, in general,
only probability ranges can be determined for the result of integration. In
this paper we concentrate on a subclass of extended probabilistic relations,
those that are obtainable through integration. We show that under intuitive and
reasonable assumptions we can determine the exact probability distribution of
the result of integration.
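A tiny worked example of why only ranges may be determinable (Fréchet-style bounds, not the paper's extended probabilistic relation model): if two sources assert the same tuple with probabilities p_a and p_b but the joint distribution over their possible worlds is unknown, only bounds on the integrated probability follow.

```python
# Hypothetical sketch (Fréchet bounds, not the paper's exact construction):
# the probability of a tuple surviving integration of two uncertain sources.
def integration_bounds(p_a, p_b):
    lower = max(0.0, p_a + p_b - 1.0)   # minimal possible overlap of worlds
    upper = min(p_a, p_b)               # maximal possible overlap of worlds
    return lower, upper

print(integration_bounds(0.8, 0.7))     # (0.5, 0.7): only a range is determined
```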
Restricted Boltzmann Machines for Robust and Fast Latent Truth Discovery
We address the problem of latent truth discovery, LTD for short, where the
goal is to discover the underlying true values of entity attributes in the
presence of noisy, conflicting or incomplete information. Despite the multitude
of algorithms addressing the LTD problem that can be found in the literature,
little is known about their overall performance with respect to effectiveness
(in terms of truth discovery capabilities), efficiency and robustness. A
practical LTD approach should satisfy all these characteristics so that it can
be applied to heterogeneous datasets of varying quality and degrees of
cleanliness.
We propose a novel algorithm for LTD that satisfies the above requirements.
The proposed model is based on Restricted Boltzmann Machines, thus coined
LTD-RBM. In extensive experiments on various heterogeneous and publicly
available datasets, LTD-RBM is superior to state-of-the-art LTD techniques in
terms of an overall consideration of effectiveness, efficiency, and robustness.
ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages
In many documents, such as semi-structured webpages, textual semantics are
augmented with additional information conveyed using visual elements including
layout, font size, and color. Prior work on information extraction from
semi-structured websites has required learning an extraction model specific to
a given template via either manually labeled or distantly supervised data from
that template. In this work, we propose a solution for "zero-shot" open-domain
relation extraction from webpages with a previously unseen template, including
from websites with little overlap with existing sources of knowledge for
distant supervision and websites in entirely new subject verticals. Our model
uses a graph neural network-based approach to build a rich representation of
text fields on a webpage and the relationships between them, enabling
generalization to new templates. Experiments show this approach provides a 31%
F1 gain over a baseline for zero-shot extraction in a new subject vertical.
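The graph-neural-network ingredient can be sketched generically (mean-aggregation message passing, not the ZeroShotCeres model): text fields on a page are nodes, layout relations are edges, and one propagation step mixes each field's features with those of its neighbours.

```python
# Hypothetical sketch (generic message passing, not the ZeroShotCeres model):
# one graph-convolution step over webpage text fields linked by layout relations.
import numpy as np

def gnn_layer(node_feats, edges, W):
    """node_feats: (n, d) field embeddings; edges: (i, j) pairs; W: (d, d)."""
    agg = node_feats.copy()                 # self-loop keeps each field's own text
    deg = np.ones(len(node_feats))
    for i, j in edges:                      # undirected layout adjacency
        agg[i] += node_feats[j]
        agg[j] += node_feats[i]
        deg[i] += 1
        deg[j] += 1
    return np.maximum(0.0, (agg / deg[:, None]) @ W)  # mean-aggregate, then ReLU

rng = np.random.default_rng(0)
feats = rng.random((4, 8))                  # 4 text fields on one page
edges = [(0, 1), (1, 2), (2, 3)]            # e.g. fields adjacent in the layout
print(gnn_layer(feats, edges, rng.random((8, 8))).shape)  # (4, 8)
```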
Limiting the Spread of Fake News on Social Media Platforms by Evaluating Users' Trustworthiness
Today's social media platforms enable both authentic and fake news to spread
very quickly. Some approaches have been proposed to automatically detect such
"fake" news based on their content, but it is difficult to agree on universal
criteria of authenticity (which can be bypassed by adversaries once known).
Besides, it is obviously impossible to have each news item checked by a human.
In this paper, we propose a mechanism to limit the spread of fake news that is not
based on content. It can be implemented as a plugin on a social media platform.
The principle is as follows: a team of fact-checkers reviews a small number of
news items (the most popular ones), which makes it possible to estimate each
user's inclination to share fake news items. Then, using a Bayesian approach,
we estimate the trustworthiness of future news items, and treat accordingly
those that exceed a certain "untrustworthiness" threshold.
We then evaluate the effectiveness and overhead of this technique on a large
Twitter graph. We show that having a few thousand users exposed to a given
news item enables a very precise estimation of its reliability. We
thus identify more than 99% of fake news items with no false positives. The
performance impact is very small: the induced overhead on the 90th percentile
latency is less than 3%, and less than 8% on the throughput of user operations.
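The Bayesian step can be sketched as follows (the priors, counts, and naive independence assumption are illustrative, not the paper's exact model): estimate each user's fake-share rate from the fact-checked items, then combine the sharers of a new item into a posterior fakeness score.

```python
# Hypothetical sketch (made-up priors and counts, not the paper's exact model):
# score a new item from the users who shared it, given their checked history.
import math

# From the fact-checked sample: per user, (# fake items shared, # real items shared).
history = {"u1": (8, 2), "u2": (1, 9), "u3": (5, 5)}
prior_fake = 0.3                        # assumed prior that an unchecked item is fake

def share_likelihood(user, item_is_fake):
    n_fake, n_real = history[user]
    p = (n_fake + 1) / (n_fake + n_real + 2)   # smoothed fake-share rate
    return p if item_is_fake else 1 - p

def p_fake(sharers):
    log_fake = math.log(prior_fake) + sum(
        math.log(share_likelihood(u, True)) for u in sharers)
    log_real = math.log(1 - prior_fake) + sum(
        math.log(share_likelihood(u, False)) for u in sharers)
    return 1 / (1 + math.exp(log_real - log_fake))

print(round(p_fake(["u1", "u3"]), 2))   # ~0.56: pushed above the prior by fake-prone sharers
print(round(p_fake(["u2"]), 2))         # ~0.08: shared only by a reliable user
```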
Learning relationships between data obtained independently
The aim of this paper is to provide a new method for learning the
relationships between data that have been obtained independently. Unlike
existing methods like matching, the proposed technique does not require any
contextual information, provided that the dependency between the variables of
interest is monotone. It can therefore be easily combined with matching in
order to exploit the advantages of both methods. This technique can be
described as a mix of quantile matching and deconvolution. We provide both a
theoretical and an empirical validation for it.
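The quantile-matching half can be sketched in a few lines (the deconvolution step is omitted and the linear monotone relationship is a made-up example): when Y = g(X) for a monotone increasing g, sorting each independently observed sample and pairing equal ranks traces an estimate of g.

```python
# Hypothetical sketch (quantile matching only; the deconvolution step is omitted):
# recover a monotone relationship from two independent, unpaired samples.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=2000)                 # X observed in one data set
y = 2 * rng.normal(size=2000) + 3         # Y = 2*X' + 3 observed in another, unpaired

# Sorting each sample pairs equal empirical quantiles; for a monotone increasing
# g with Y = g(X), this pairing traces an estimate of g.
q_x, q_y = np.sort(x), np.sort(y)
slope, intercept = np.polyfit(q_x, q_y, 1)
print(round(slope, 2), round(intercept, 2))  # approximately (2.0, 3.0)
```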