A Survey on Truth Discovery
Thanks to the information explosion, data about objects of interest can be collected from increasingly many sources. However, for the same object, the information collected from multiple sources often conflicts. To tackle this challenge, truth discovery, which integrates multi-source noisy information by estimating the reliability of each source, has emerged as a hot topic. Several truth discovery methods have been proposed for various scenarios, and they have been successfully applied in diverse application domains. In this survey, we provide a comprehensive overview of truth discovery methods and summarize them from different aspects. We also discuss some future directions for truth discovery research. We hope that this survey will promote a better understanding of the current progress in truth discovery and offer guidelines on how to apply these approaches in application domains.
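To make the mechanism concrete, below is a minimal sketch of the iterative loop shared by many truth discovery methods, assuming continuous claims and a CRH-style weight update; the function and variable names are illustrative, not from any particular surveyed paper.

```python
import numpy as np

def truth_discovery(claims, n_iters=20):
    """claims: array of shape (n_sources, n_objects) with numeric claims."""
    n_sources, n_objects = claims.shape
    weights = np.ones(n_sources) / n_sources           # start with equal trust
    for _ in range(n_iters):
        # Truth step: weighted average of the claims for each object.
        truths = weights @ claims / weights.sum()
        # Weight step: sources with smaller total error get larger weight.
        errors = ((claims - truths) ** 2).sum(axis=1) + 1e-12
        weights = -np.log(errors / errors.sum())
    return truths, weights

# Example: three sources report two numeric facts; the third source is noisy
# and should end up with the lowest weight.
claims = np.array([[8.4, 2.7], [8.5, 2.7], [12.0, 4.0]])
truths, weights = truth_discovery(claims)
print(truths, weights)
```

The two steps reinforce each other: better truth estimates expose unreliable sources, and down-weighting those sources in turn sharpens the truth estimates.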
MedTruth: A Semi-supervised Approach to Discovering Knowledge Condition Information from Multi-Source Medical Data
A Knowledge Graph (KG) contains entities and the relations between them. Owing to this representational power, KGs have been successfully applied to support many medical and healthcare tasks. In the medical domain, however, knowledge often holds only under certain conditions. For example, the symptom "runny nose" strongly indicates the disease "whooping cough" when the patient is a baby rather than a person of another age. Such conditions on medical knowledge are crucial for decision-making in various medical applications, yet they are missing from existing medical KGs. In this paper, we aim to discover medical knowledge conditions from text to enrich KGs.
Electronic Medical Records (EMRs) are systematized collections of clinical data that contain detailed information about patients, so EMRs can be a good resource for discovering medical knowledge conditions. Unfortunately, the amount of available EMR data is limited for reasons such as privacy regulation. Meanwhile, a large amount of medical question answering (QA) data is available and can greatly help the studied task. However, the quality of medical QA data varies widely, which may degrade the quality of the discovered medical knowledge conditions. In light of these challenges, we propose a new truth discovery method, MedTruth, for medical knowledge condition discovery, which incorporates prior source quality information into the source reliability estimation procedure and also utilizes knowledge triple information when computing trustworthy information. We conduct a series of experiments on real-world medical datasets to demonstrate that the proposed method can discover meaningful and accurate conditions for medical knowledge by leveraging both EMR and QA data. Further, the proposed method is tested on synthetic datasets to validate its effectiveness under various scenarios.
Comment: Accepted as a CIKM 2019 long paper.
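The abstract does not spell out MedTruth's exact update rule, so the following is only a hypothetical illustration of the general idea of folding a prior source quality score into a reliability estimate, here via Beta-prior-style pseudo-count smoothing; all names are illustrative.

```python
# Hypothetical illustration (not MedTruth's actual update): blend a source's
# observed accuracy with a prior quality score in [0, 1], interpreted as the
# mean of a Beta prior with `prior_strength` pseudo-observations.
def estimate_reliability(n_correct, n_claims, prior_quality, prior_strength=10.0):
    return (n_correct + prior_strength * prior_quality) / (n_claims + prior_strength)

# A source with a strong prior (e.g., EMR-like) stays trusted on little
# evidence; one with a weak prior (e.g., QA-like) needs more agreement.
print(estimate_reliability(n_correct=3, n_claims=4, prior_quality=0.9))  # ~0.86
print(estimate_reliability(n_correct=3, n_claims=4, prior_quality=0.3))  # ~0.43
```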
Heterogeneous Supervision for Relation Extraction: A Representation Learning Approach
Relation extraction is a fundamental task in information extraction. Most existing methods rely heavily on annotations labeled by human experts, which are costly and time-consuming to obtain. To overcome this drawback, we propose a novel framework, REHession, that learns a relation extractor from annotations provided by heterogeneous information sources, e.g., knowledge bases and domain heuristics. These annotations, referred to as heterogeneous supervision, often conflict with each other, which adds a new challenge to the original relation extraction task: inferring the true label of a given instance from its noisy labels. Identifying context information as the backbone of both relation extraction and true label discovery, we adopt embedding techniques to learn distributed representations of context, which bridge all components so that they mutually enhance one another in an iterative fashion. Extensive experimental results demonstrate the superiority of REHession over the state of the art.
Comment: EMNLP 2017.
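As a toy sketch of the true-label inference step (an illustrative parameterization, not the paper's exact model), one can score each supervision source's proficiency as a function of the instance's context embedding and take a proficiency-weighted vote over the conflicting labels:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def infer_label(context, labels, source_params):
    """context: (d,) embedding; labels: {source: label};
    source_params: {source: (d,) vector scoring that source's proficiency."""
    votes = {}
    for source, label in labels.items():
        proficiency = sigmoid(source_params[source] @ context)
        votes[label] = votes.get(label, 0.0) + proficiency
    return max(votes, key=votes.get)

# Hypothetical usage: a knowledge base and a heuristic disagree on one instance.
rng = np.random.default_rng(0)
ctx = rng.normal(size=8)
params = {"kb": rng.normal(size=8), "heuristic": rng.normal(size=8)}
print(infer_label(ctx, {"kb": "born_in", "heuristic": "works_at"}, params))
```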
Crowdsourced Truth Discovery in the Presence of Hierarchies for Knowledge Fusion
Existing work on truth discovery in categorical data usually assumes that claimed values are mutually exclusive and that only one of them is correct. However, many claimed values are not mutually exclusive, even for functional predicates, because of their hierarchical structure. We therefore need to consider the hierarchical structure to effectively estimate the trustworthiness of the sources and infer the truths. We propose a probabilistic model that utilizes hierarchical structures and an inference algorithm that finds the truths. In addition, in knowledge fusion, the step of automatically extracting information from unstructured data (e.g., text) generates many false claims. To take advantage of human cognitive abilities in understanding unstructured data, we utilize crowdsourcing to refine the result of the truth discovery, and we propose a task assignment algorithm to maximize the accuracy of the inferred truths. A performance study with real-life datasets confirms the effectiveness of our truth inference and task assignment algorithms.
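The key observation fits in a few lines: with a value hierarchy, a claim should count as consistent with the truth when it is an ancestor of (or equal to) the true value, not only when it matches exactly. A minimal sketch with an illustrative hierarchy:

```python
# Toy hierarchy for a "place of birth" predicate; names are illustrative.
parents = {"Manhattan": "New York City", "New York City": "USA", "USA": None}

def is_consistent(claim, truth, parents):
    """True if `claim` equals `truth` or is an ancestor of it in the hierarchy."""
    node = truth
    while node is not None:
        if node == claim:
            return True
        node = parents.get(node)
    return False

print(is_consistent("USA", "Manhattan", parents))   # True: coarser, still correct
print(is_consistent("Manhattan", "USA", parents))   # False: over-specific
```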
Finding News Citations for Wikipedia
An important editing policy in Wikipedia is to provide citations for statements added to Wikipedia pages, where a statement can be an arbitrary piece of text ranging from a sentence to a paragraph. In many cases, citations are either outdated or missing altogether.
In this work we address the problem of finding and updating news citations for statements in entity pages. We propose a two-stage supervised approach. In the first stage, we construct a classifier that determines whether a statement needs a news citation or another kind of citation (web, book, journal, etc.). In the second stage, we develop a news citation algorithm for Wikipedia statements that recommends appropriate citations from a given news collection. Apart from IR techniques that use the statement to query the news collection, we also formalize three properties of an appropriate citation: (i) the citation should entail the Wikipedia statement, (ii) the statement should be central to the citation, and (iii) the citation should come from an authoritative source.
We perform an extensive evaluation of both stages using 20 million articles from a real-world news collection. Our results are promising and show that we can perform this task with high precision and at scale.
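As a rough sketch of the second-stage ranking idea, the three stated properties can be combined into a single weighted score; the scoring functions, weights, and data below are placeholders, not the paper's actual features.

```python
def rank_citations(statement, articles, entail, central, authority, w=(0.5, 0.3, 0.2)):
    """Rank candidate news articles for a statement by a weighted combination
    of entailment, centrality, and source authority scores."""
    score = lambda a: (w[0] * entail(a, statement)
                       + w[1] * central(a, statement)
                       + w[2] * authority(a))
    return sorted(articles, key=score, reverse=True)

# Toy usage with placeholder scorers: word overlap, keyword presence, and a
# hypothetical allow-list of trusted outlets.
trusted = {"reuters.com", "apnews.com"}
entail = lambda a, s: len(set(a["text"].split()) & set(s.split())) / max(len(s.split()), 1)
central = lambda a, s: 1.0 if s.split()[0].lower() in a["text"].lower() else 0.0
authority = lambda a: 1.0 if a["source"] in trusted else 0.0
articles = [{"text": "The mayor resigned on Monday", "source": "reuters.com"},
            {"text": "Cats are great pets", "source": "blog.example"}]
print(rank_citations("mayor resigned Monday", articles, entail, central, authority)[0]["source"])
```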
Understanding Infographics through Textual and Visual Tag Prediction
We introduce the problem of visual hashtag discovery for infographics: extracting the visual elements of an infographic that are diagnostic of its topic. Given an infographic as input, our computational approach automatically outputs the textual and visual elements predicted to be representative of the infographic's content. Concretely, using a curated dataset of 29K large infographic images sampled across 26 categories and 391 tags, we present an automated two-step approach. First, we extract the text from an infographic and use it to predict text tags indicative of the infographic's content. Second, we use these predicted text tags as a supervisory signal to localize the most diagnostic visual elements within the infographic, i.e., the visual hashtags. We report performance on a categorization and multi-label tag prediction problem and compare our proposed visual hashtags to human annotations.
Object Discovery with a Copy-Pasting GAN
We tackle the problem of object discovery, in which objects are segmented in a given input image and the system is trained without any direct supervision whatsoever. We propose a novel copy-pasting GAN framework in which the generator learns to discover an object in one image by compositing it into another image such that the discriminator cannot tell that the resulting image is fake. After carefully addressing subtle issues, such as preventing the generator from 'cheating', this game results in the generator learning to select objects, since copy-pasting objects is the most likely way to fool the discriminator. The system is shown to work well on four very different datasets, including ones with large variations in object appearance against challenging cluttered backgrounds.
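The compositing operation at the heart of the framework is simple to state; in this sketch the mask is given directly, whereas in the paper the generator predicts it with a network.

```python
import numpy as np

def copy_paste(src, dst, mask):
    """src, dst: (H, W, C) images; mask: (H, W, 1) soft mask in [0, 1].
    Copies the masked region of `src` into `dst`."""
    return mask * src + (1.0 - mask) * dst

src = np.ones((4, 4, 3))            # "object" image
dst = np.zeros((4, 4, 3))           # background image
mask = np.zeros((4, 4, 1))
mask[1:3, 1:3] = 1.0                # the generator's discovered object region
fake = copy_paste(src, dst, mask)   # the composite the discriminator must judge
print(fake[:, :, 0])
```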
Modeling Precursors for Event Forecasting via Nested Multi-Instance Learning
Forecasting events such as civil unrest movements, disease outbreaks, financial market movements, and government elections from open-source indicators such as news feeds and social media streams is an important and challenging problem. From the perspective of human analysts and policy makers, forecasting algorithms need to provide supporting evidence and identify the causes related to the event of interest. We develop a novel multiple-instance learning approach that jointly identifies evidence-based precursors and forecasts future events. Specifically, given a collection of streaming news articles from multiple sources, we develop a nested multiple-instance learning approach to forecast significant societal events across three countries in Latin America. Our algorithm is able to identify news articles that can be considered precursors of a protest. Our empirical evaluation shows the strengths of our proposed approach in filtering candidate precursors, forecasting the occurrence of events with a lead time, and predicting the characteristics of different events, in comparison to several other formulations. Through case studies, we demonstrate the effectiveness of our proposed model in filtering candidate precursors for inspection by a human analyst.
Comment: The conference version of this paper has been submitted for publication.
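A minimal sketch of the nested multi-instance pooling idea (illustrative, not the paper's exact model): per-article scores are pooled into day-level scores, which are in turn pooled into a single forecast score, and the highest-scoring articles serve as the candidate precursors shown to an analyst.

```python
def nested_mil_score(days):
    """days: list of days, each a list of per-article relevance scores.
    Instance -> bag pooling (articles -> day), then bag -> super-bag
    pooling (days -> forecast), both with max pooling for simplicity."""
    day_scores = [max(articles) for articles in days]
    return max(day_scores), day_scores

days = [[0.1, 0.2], [0.05, 0.9, 0.3], [0.4]]
forecast, per_day = nested_mil_score(days)
print(forecast, per_day)   # 0.9 comes from the day containing the precursor
```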
RUSSE'2020: Findings of the First Taxonomy Enrichment Task for the Russian language
This paper describes the results of the first shared task on taxonomy enrichment for the Russian language. The participants were asked to extend an existing taxonomy with previously unseen words: for each new word, their systems had to provide a ranked list of candidate hypernyms. In comparison to previous tasks for other languages, our competition has a more realistic setting: the new words were provided without definitions. Instead, we provided a textual corpus in which these new terms occur. For this evaluation campaign, we developed a new evaluation dataset based on unpublished RuWordNet data. The shared task features two tracks, "nouns" and "verbs". Sixteen teams participated in the task, demonstrating strong results, with more than half of them outperforming the provided baseline.
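As a hedged sketch of a simple baseline for such a task (not one of the participant systems), candidate hypernyms for a new word can be ranked by cosine similarity between word vectors learned from the provided corpus; the vectors below are random stand-ins.

```python
import numpy as np

def rank_hypernyms(word_vec, candidates):
    """candidates: {hypernym: vector}; returns hypernyms sorted by
    cosine similarity to the new word's vector, best first."""
    def cos(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return sorted(candidates, key=lambda h: cos(word_vec, candidates[h]), reverse=True)

rng = np.random.default_rng(1)
new_word = rng.normal(size=16)
cands = {"animal": rng.normal(size=16), "device": rng.normal(size=16)}
print(rank_hypernyms(new_word, cands))
```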
End-to-end Text-to-speech for Low-resource Languages by Cross-Lingual Transfer Learning
End-to-end text-to-speech (TTS) has shown great success given large quantities of paired text and speech data. However, such laborious data collection remains difficult for at least 95% of the world's languages, which hinders the development of TTS for them. In this paper, we aim to build TTS systems for such low-resource (target) languages, where only very limited paired data are available. We show that such TTS systems can be effectively constructed by transferring knowledge from a high-resource (source) language. Since a model trained on the source language cannot be applied directly to the target language due to the input-space mismatch, we propose a method to learn a mapping between source and target linguistic symbols. Thanks to this learned mapping, pronunciation information can be preserved throughout the transfer procedure. Preliminary experiments show that we need only around 15 minutes of paired data to obtain a relatively good TTS system. Furthermore, analytic studies demonstrate that the automatically discovered mapping correlates well with phonetic expertise.
Comment: Accepted to Interspeech 2019.
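A hedged sketch of the symbol-mapping idea (illustrative, not necessarily the paper's exact method): given source- and target-language phoneme embeddings in a shared space, map each target symbol to its nearest source symbol so that the pretrained source-language front end can consume target-language text.

```python
import numpy as np

def map_symbols(target_emb, source_emb):
    """target_emb, source_emb: {symbol: vector}; returns, for each target
    symbol, the nearest source symbol by Euclidean distance."""
    src_syms = list(source_emb)
    src_mat = np.stack([source_emb[s] for s in src_syms])
    mapping = {}
    for t, v in target_emb.items():
        dists = np.linalg.norm(src_mat - v, axis=1)
        mapping[t] = src_syms[int(dists.argmin())]
    return mapping

# Hypothetical embeddings: target phonemes lie close to their source cognates.
rng = np.random.default_rng(2)
src = {"AA": rng.normal(size=8), "IY": rng.normal(size=8)}
tgt = {"a": src["AA"] + 0.01, "i": src["IY"] + 0.01}
print(map_symbols(tgt, src))   # {'a': 'AA', 'i': 'IY'}
```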