Co-evolution of RDF Datasets
Linking Data initiatives have fostered the publication of a large number of RDF
datasets in the Linked Open Data (LOD) cloud, as well as the development of
query processing infrastructures to access these data in a federated fashion.
However, different experimental studies have shown that the availability of LOD
datasets cannot always be ensured, so RDF data replication is required for
reliable federated query frameworks. Albeit enhancing data
availability, RDF data replication requires synchronization and conflict
resolution when replicas and source datasets are allowed to change data over
time, i.e., co-evolution management needs to be provided to ensure consistency.
In this paper, we tackle the problem of RDF data co-evolution and devise an
approach for conflict resolution during co-evolution of RDF datasets. Our
proposed approach is property-oriented and allows for exploiting semantics
about RDF properties during co-evolution management. The quality of our
approach is empirically evaluated in different scenarios on the DBpedia-live
dataset. Experimental results suggest that the proposed techniques have a
positive impact on the quality of data in source datasets and replicas.
Comment: 18 pages, 4 figures, Accepted in ICWE, 201
DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents
Vision-language pretraining models have achieved great success in supporting
multimedia applications by understanding the alignments between images and
text. While existing vision-language pretraining models primarily focus on
understanding a single image associated with a single piece of text, they often
ignore the alignment at the intra-document level, consisting of multiple
sentences with multiple images. In this work, we propose DocumentCLIP, a
salience-aware contrastive learning framework that encourages vision-language
pretraining models to comprehend the interaction between images and longer text
within documents. Our model benefits real-world multimodal understanding of
documents such as news articles, magazines, and product descriptions,
which contain linguistically and visually richer content. To the best of our
knowledge, we are the first to explore multimodal intra-document links by
contrastive learning. In addition, we collect a large Wikipedia dataset for
pretraining, which provides various topics and structures. Experiments show that
DocumentCLIP not only outperforms the state-of-the-art baselines in the
supervised setting, but also achieves the best zero-shot performance in the
wild after human evaluation. Our code is available at
https://github.com/FuxiaoLiu/DocumentCLIP.
Comment: 8 pages, 5 figures. In submissio
A Coherent Unsupervised Model for Toponym Resolution
Toponym Resolution, the task of assigning a location mention in a document to
a geographic referent (i.e., latitude/longitude), plays a pivotal role in
analyzing location-aware content. However, the ambiguities of natural language
and a huge number of possible interpretations for toponyms constitute
insurmountable hurdles for this task. In this paper, we study the problem of
toponym resolution with no additional information other than a gazetteer and no
training data. We demonstrate that a dearth of sufficiently large annotated data
makes supervised methods less capable of generalizing. Our proposed method
estimates the geographic scope of documents and leverages the connections
between nearby place names as evidence to resolve toponyms. We explore the
interactions between multiple interpretations of mentions and the relationships
between different toponyms in a document to build a model that finds the most
coherent resolution. Our model is evaluated on three news corpora, two from the
literature and one collected and annotated by us; then, we compare our methods
to the state-of-the-art unsupervised and supervised techniques. We also examine
three commercial products including Reuters OpenCalais, Yahoo! YQL Placemaker,
and Google Cloud Natural Language API. The evaluation shows that our method
outperforms the unsupervised technique as well as Reuters OpenCalais and Google
Cloud Natural Language API on all three corpora; also, our method shows a
performance close to that of the state-of-the-art supervised method and
outperforms it when the test data has 40% or more toponyms that are not seen in
the training data.
Comment: 9 pages (+1 page reference), WWW '18 Proceedings of the 2018 World
Wide Web Conferenc
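The idea of choosing the most coherent resolution from nearby place names can be illustrated with a small sketch. This is not the paper's model: it swaps in a much simpler stand-in that tries every combination of candidate interpretations and keeps the one with the smallest total pairwise great-circle distance. The function names, the toy gazetteer, and its approximate coordinates are all assumptions made for illustration.

```python
from itertools import product
from math import asin, cos, radians, sin, sqrt


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def resolve_coherently(mentions, gazetteer):
    """Pick one candidate per mention so that total pairwise distance is minimal.

    Exhaustive search over all combinations; only suitable for a few mentions.
    """
    candidate_lists = [gazetteer[m] for m in mentions]
    best, best_cost = None, float("inf")
    for combo in product(*candidate_lists):
        cost = sum(
            haversine_km(combo[i], combo[j])
            for i in range(len(combo))
            for j in range(i + 1, len(combo))
        )
        if cost < best_cost:
            best, best_cost = combo, cost
    return dict(zip(mentions, best))


# Hypothetical toy gazetteer with approximate coordinates (illustrative only).
TOY_GAZETTEER = {
    "Paris": [(48.86, 2.35), (33.66, -95.56)],  # Paris, France / Paris, Texas
    "Dallas": [(32.78, -96.80)],                # Dallas, Texas
}

resolution = resolve_coherently(["Paris", "Dallas"], TOY_GAZETTEER)
```

In this toy document, the mention of Dallas pulls "Paris" toward the Texas interpretation, because that pairing is geographically closer than Paris, France.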
From Data Fusion to Knowledge Fusion
The task of data fusion is to identify the true values of data items
(e.g., the true date of birth for Tom Cruise) among multiple observed
values drawn from different sources (e.g., Web sites) of varying (and unknown)
reliability. A recent survey [LDL+12] has provided a detailed comparison of
various fusion methods on Deep Web data. In this paper, we study the
applicability and limitations of different fusion techniques on a more
challenging problem: knowledge fusion. Knowledge fusion identifies true
subject-predicate-object triples extracted by multiple information extractors
from multiple information sources. These extractors perform the tasks of entity
linkage and schema alignment, thus introducing an additional source of noise
that is quite different from that traditionally considered in the data fusion
literature, which only focuses on factual errors in the original sources. We
adapt state-of-the-art data fusion techniques and apply them to a knowledge
base with 1.6B unique knowledge triples extracted by 12 extractors from over 1B
Web pages, which is three orders of magnitude larger than the data sets used in
previous data fusion papers. We show that data fusion approaches hold great
promise for solving the knowledge fusion problem, and suggest interesting
research directions through a detailed error analysis of the methods.
Comment: VLDB'201
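The basic data fusion setting described above, picking one value per data item from conflicting source claims, can be sketched with a simple voting baseline. This is a generic illustration, not one of the techniques evaluated in the paper. The `fuse` function, the source names, and the weights are hypothetical, and the sketch assumes reliability weights are supplied, whereas the abstract treats source reliability as unknown.

```python
from collections import defaultdict


def fuse(claims, source_weight=None):
    """Return the highest-scoring value per data item.

    claims: iterable of (source, item, value) tuples.
    source_weight: optional dict mapping source -> reliability weight;
                   sources not listed default to 1.0 (plain majority vote).
    """
    source_weight = source_weight or {}
    scores = defaultdict(lambda: defaultdict(float))
    for source, item, value in claims:
        scores[item][value] += source_weight.get(source, 1.0)
    return {item: max(votes, key=votes.get) for item, votes in scores.items()}


# Made-up claims from hypothetical sources; the values are placeholders.
claims = [
    ("site_a", "item_1", "X"),
    ("site_b", "item_1", "Y"),
    ("site_c", "item_1", "Y"),
]

majority = fuse(claims)                   # every source counts equally
weighted = fuse(claims, {"site_a": 3.0})  # site_a is assumed more reliable
```

With equal weights the two sources agreeing on "Y" win; once `site_a` is given a weight of 3.0, its single claim outweighs them, which shows why reliability estimates matter for the outcome.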
Materializing the editing history of Wikipedia as linked data in DBpedia
We describe a DBpedia extractor materializing the editing history of Wikipedia pages as linked data to support queries and indicators on the history
Augmenting cross-domain knowledge bases using web tables
Cross-domain knowledge bases are increasingly used for a large variety of applications. As the usefulness of a knowledge base for many of these applications increases with its completeness, augmenting knowledge bases with new knowledge is an important task. A source for this new knowledge could be in the form of web tables, which are relational HTML tables extracted from the Web.
This thesis researches data integration methods for cross-domain knowledge base augmentation from web tables. Existing methods have focused on the task of slot filling static data. We research methods that additionally enable augmentation in the form of slot filling time-dependent data and entity expansion.
When augmenting knowledge bases using time-dependent web table data, we require time-aware fusion methods. They identify from a set of conflicting web table values the one that is valid given a certain temporal scope. A primary concern of time-aware fusion is therefore the estimation of temporal scope annotations, which web table data lacks. We introduce two time-aware fusion approaches. In the first, we extract timestamps from the table and its context to exploit as temporal scopes, additionally introducing approaches to reduce the sparsity and noisiness of these timestamps. We introduce a second time-aware fusion method that exploits a temporal knowledge base to propagate temporal scopes to web table data, reducing the dependence on noisy and sparse timestamps.
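As a rough illustration of what time-aware fusion has to decide, the sketch below selects, from conflicting values annotated with temporal scopes, the most frequent value that is valid for a requested year. It assumes the temporal scopes are already available, which is exactly the part the thesis has to estimate; the function name, the tuple layout, and the sample values are hypothetical.

```python
from collections import Counter


def time_aware_fuse(candidates, year):
    """Return the most frequent value whose temporal scope covers `year`.

    candidates: list of (value, start_year, end_year) tuples, bounds inclusive;
                a None bound means the scope is open on that side.
    Returns None when no candidate is valid for the requested year.
    """
    valid = [
        value
        for value, start, end in candidates
        if (start is None or start <= year) and (end is None or year <= end)
    ]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# Placeholder values with made-up scopes, standing in for web table cells.
candidates = [
    ("club_a", 2001, 2004),
    ("club_a", 2002, 2004),
    ("club_b", 2005, 2009),
    ("club_b", 2005, None),  # open-ended scope
]

value_2003 = time_aware_fuse(candidates, 2003)
value_2007 = time_aware_fuse(candidates, 2007)
```

The same set of conflicting values yields different answers for different years, which is why a fusion method that ignores temporal scope cannot handle time-dependent attributes.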
Entity expansion enriches a knowledge base with previously unknown long-tail entities. It is a task that to our knowledge has not been researched before. We introduce the Long-Tail Entity Extraction Pipeline, the first system that can perform entity expansion from web table data. The pipeline works by employing identity resolution twice, once to disambiguate between entity occurrences within web tables, and once between entities created from web tables and existing entities in the knowledge base. In addition to identifying new long-tail entities, the pipeline also creates their descriptions according to the knowledge base schema.
By running the pipeline on a large-scale web table corpus, we profile the potential of web tables for the task of entity expansion. We find that, for certain classes, we can enrich a knowledge base with tens or even hundreds of thousands of new entities and corresponding facts.
Finally, we introduce a weak supervision approach for long-tail entity extraction, where supervision in the form of a large number of manually labeled matching and non-matching pairs is substituted with a small set of bold matching rules built using the knowledge base schema. Using this, we can reduce the supervision effort required to train our pipeline to enable cross-domain entity expansion at web-scale.
In the context of this research, we created and published two datasets. The Time-Dependent Ground Truth contains time-dependent knowledge with more than one million temporal facts and corresponding temporal scope annotations. It could potentially be employed for a large variety of tasks that consider the temporal aspect of data. We also built the Web Tables for Long-Tail Entity Extraction gold standard, the first benchmark for the task of entity expansion from web tables