Linked Data - the story so far
The term “Linked Data” refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions - the Web of Data. In this article, the authors present the concept and technical principles of Linked Data, and situate these within the broader context of related technological developments. They describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward.
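A minimal sketch of the Linked Data principles in practice, assuming the Python rdflib library, network access, and the public DBpedia URI http://dbpedia.org/resource/Berlin as an example (none of which are part of the article itself): dereferencing an HTTP URI returns RDF that describes the named thing and links it to other datasets.

    from rdflib import Graph, URIRef
    from rdflib.namespace import RDFS, OWL

    # an HTTP URI that names a real-world thing
    uri = URIRef("http://dbpedia.org/resource/Berlin")

    g = Graph()
    g.parse(uri)  # dereferencing the URI yields RDF describing the resource

    # labels of the resource and owl:sameAs links into other data sets
    for label in g.objects(uri, RDFS.label):
        print("label:", label)
    for other in g.objects(uri, OWL.sameAs):
        print("linked to:", other)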
Using ChatGPT for Entity Matching
Entity Matching is the task of deciding if two entity descriptions refer to the same real-world entity. State-of-the-art entity matching methods often rely on fine-tuning Transformer models such as BERT or RoBERTa. Two major drawbacks of using these models for entity matching are that (i) the models require significant amounts of fine-tuning data to reach good performance and (ii) the fine-tuned models are not robust concerning out-of-distribution entities. In this paper, we investigate using ChatGPT for entity matching as a more robust, training data-efficient alternative to traditional Transformer models. We perform experiments along three dimensions: (i) general prompt design, (ii) in-context learning, and (iii) provision of higher-level matching knowledge. We show that ChatGPT is competitive with a fine-tuned RoBERTa model, reaching an average zero-shot performance of 83% F1 on a challenging matching task on which RoBERTa requires 2000 training examples to reach similar performance. Adding in-context demonstrations to the prompts further improves the F1 by up to 5%, even when using only a small set of 20 handpicked examples. Finally, we show that guiding the zero-shot model by stating higher-level matching rules leads to similar gains as providing in-context examples.
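As an illustration of what such a zero-shot matching prompt can look like, the sketch below sends a pair of product offers to a ChatGPT model via the OpenAI API; the prompt wording, model name, and example offers are assumptions for illustration and not the prompts evaluated in the paper.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def match(offer_a: str, offer_b: str) -> bool:
        """Ask the model whether two offers describe the same real-world product."""
        prompt = (
            "Do the following two product offers refer to the same real-world product? "
            "Answer with 'Yes' or 'No'.\n"
            f"Offer A: {offer_a}\nOffer B: {offer_b}"
        )
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return response.choices[0].message.content.strip().lower().startswith("yes")

    print(match("DYMO D1 tape 12mm x 7m black on white",
                "Dymo D1 45013 label tape, 12 mm, black/white"))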
Column Type Annotation using ChatGPT
Column type annotation is the task of annotating the columns of a relational table with the semantic type of the values contained in each column. Column type annotation is a crucial pre-processing step for data search and integration in the context of data lakes. State-of-the-art column type annotation methods either rely on matching table columns to properties of a knowledge graph or fine-tune pre-trained language models such as BERT for the column type annotation task. In this work, we take a different approach and explore using ChatGPT for column type annotation. We evaluate different prompt designs in zero- and few-shot settings and experiment with providing task definitions and detailed instructions to the model. We further implement a two-step table annotation pipeline which first determines the class of the entities described in the table and, depending on this class, asks ChatGPT to annotate columns using only the relevant subset of the overall vocabulary. Using instructions as well as the two-step pipeline, ChatGPT reaches F1 scores of over 85% in zero- and one-shot setups. To reach a similar F1 score, a RoBERTa model needs to be fine-tuned with 300 examples. This comparison shows that ChatGPT is able to deliver competitive results for the column type annotation task given no or only a minimal amount of task-specific demonstrations.
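The two-step pipeline described above can be sketched roughly as follows; the vocabulary, prompt wording, and helper functions are hypothetical and only illustrate the idea of first asking for the entity class and then annotating with the class-specific subset of types.

    from openai import OpenAI

    client = OpenAI()

    # hypothetical type vocabulary, keyed by entity class
    VOCABULARY = {
        "Product": ["name", "brand", "price", "currency", "category"],
        "Event": ["name", "startDate", "location", "organizer"],
    }

    def ask(prompt: str) -> str:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return response.choices[0].message.content.strip()

    def annotate(table_rows: list[list[str]]) -> str:
        serialized = "\n".join(" | ".join(row) for row in table_rows)
        # Step 1: determine the class of the entities described in the table
        entity_class = ask(
            f"Which class of entities does this table describe? "
            f"Answer with one of {list(VOCABULARY)}.\nTable:\n{serialized}"
        )
        # Step 2: annotate columns using only the relevant subset of the vocabulary
        types = VOCABULARY.get(entity_class, sum(VOCABULARY.values(), []))
        return ask(
            f"Annotate each column of the table with one semantic type from {types}. "
            f"Answer as a comma-separated list.\nTable:\n{serialized}"
        )

    print(annotate([["iPhone 13", "Apple", "799", "USD"],
                    ["Galaxy S22", "Samsung", "749", "USD"]]))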
Product Attribute Value Extraction using Large Language Models
E-commerce applications such as faceted product search or product comparison are based on structured product descriptions like attribute/value pairs. The vendors on e-commerce platforms do not provide structured product descriptions but describe offers using titles or descriptions. To process such offers, it is necessary to extract attribute/value pairs from these textual product attributes. State-of-the-art attribute/value extraction techniques rely on pre-trained language models (PLMs), such as BERT. Two major drawbacks of these models for attribute/value extraction are that (i) the models require significant amounts of task-specific training data and (ii) the fine-tuned models face challenges in generalizing to attribute values not included in the training data. This paper explores the potential of large language models (LLMs) as a training data-efficient and robust alternative to PLM-based attribute/value extraction methods. We consider hosted LLMs, such as GPT-3.5 and GPT-4, as well as open-source LLMs based on Llama2. We evaluate the models in a zero-shot scenario and in a scenario where task-specific training data is available. In the zero-shot scenario, we compare various prompt designs for representing information about the target attributes of the extraction. In the scenario with training data, we investigate (i) the provision of example attribute values, (ii) the selection of in-context demonstrations, and (iii) the fine-tuning of GPT-3.5. Our experiments show that GPT-4 achieves an average F1-score of 85% on the two evaluation datasets while the best PLM-based techniques perform on average 5% worse using the same amount of training data. GPT-4 achieves a 10% higher F1-score than the best open-source LLM. The fine-tuned GPT-3.5 model reaches a similar performance as GPT-4 while being significantly more cost-efficient.
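A minimal sketch of one of the zero-shot prompt designs discussed above, in which the target attributes are listed explicitly and the model is asked for a JSON answer; the attribute list, prompt wording, and model name are illustrative assumptions rather than the exact setup used in the paper.

    import json
    from openai import OpenAI

    client = OpenAI()

    TARGET_ATTRIBUTES = ["Brand", "Color", "Capacity"]  # hypothetical target schema

    def extract(offer_title: str) -> dict:
        prompt = (
            f"Extract the attributes {TARGET_ATTRIBUTES} from the product offer below. "
            "Return a JSON object with one key per attribute; use null if a value is "
            f"not mentioned.\nOffer: {offer_title}"
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            response_format={"type": "json_object"},
        )
        return json.loads(response.choices[0].message.content)

    print(extract("SanDisk Ultra 128GB microSDXC memory card, red/grey"))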
SC-Block: Supervised Contrastive Blocking within Entity Resolution Pipelines
The goal of entity resolution is to identify records in multiple datasets that represent the same real-world entity. However, comparing all records across datasets can be computationally intensive, leading to long runtimes. To reduce these runtimes, entity resolution pipelines consist of two parts: a blocker that applies a computationally cheap method to select candidate record pairs, and a matcher that afterwards identifies matching pairs from this set using more expensive methods. This paper presents SC-Block, a blocking method that utilizes supervised contrastive learning for positioning records in the embedding space, and nearest neighbour search for candidate set building. We benchmark SC-Block against eight state-of-the-art blocking methods. In order to relate the training time of SC-Block to the reduction of the overall runtime of the entity resolution pipeline, we combine SC-Block with four matching methods into complete pipelines. For measuring the overall runtime, we determine candidate sets with 99.5% pair completeness and pass them to the matcher. The results show that SC-Block is able to create smaller candidate sets, and pipelines with SC-Block execute 1.5 to 2 times faster compared to pipelines with other blockers, without sacrificing F1 score. Blockers are often evaluated using relatively small datasets, which might cause runtime effects resulting from a large vocabulary size to be overlooked. In order to measure runtimes in a more challenging setting, we introduce a new benchmark dataset that requires large numbers of product offers to be blocked. On this large-scale benchmark dataset, pipelines utilizing SC-Block and the best-performing matcher execute 8 times faster than pipelines utilizing another blocker with the same matcher, reducing the runtime from 2.5 hours to 18 minutes and clearly compensating for the 5 minutes required for training SC-Block.
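The blocking step can be sketched as follows; SC-Block trains its encoder with a supervised contrastive loss, so the off-the-shelf sentence encoder below is only a stand-in for that trained model, and the toy records and the value of k are made up for illustration.

    import faiss
    from sentence_transformers import SentenceTransformer

    records_a = ["dymo d1 tape 12mm black on white", "hp 301 ink cartridge black"]
    records_b = ["Dymo D1 45013 label tape 12 mm", "HP 301 black ink", "HP 302 colour ink"]

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the trained blocker encoder
    emb_a = encoder.encode(records_a, normalize_embeddings=True)
    emb_b = encoder.encode(records_b, normalize_embeddings=True)

    index = faiss.IndexFlatIP(emb_b.shape[1])  # inner product equals cosine for normalized vectors
    index.add(emb_b)

    k = 2  # candidates per record; tuned in practice to reach the desired pair completeness
    _, neighbours = index.search(emb_a, k)

    candidate_pairs = [(i, j) for i, row in enumerate(neighbours) for j in row]
    print(candidate_pairs)  # pairs handed on to the (more expensive) matcher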
WDC Products: A Multi-Dimensional Entity Matching Benchmark
The difficulty of an entity matching task depends on a combination of multiple factors such as the amount of corner-case pairs, the fraction of entities in the test set that have not been seen during training, and the size of the development set. Current entity matching benchmarks usually represent single points in the space along such dimensions or they provide for the evaluation of matching methods along a single dimension, for instance the amount of training data. This paper presents WDC Products, an entity matching benchmark which provides for the systematic evaluation of matching systems along combinations of three dimensions while relying on real-world data. The three dimensions are (i) the amount of corner-cases, (ii) generalization to unseen entities, and (iii) development set size. Generalization to unseen entities is a dimension not covered by any of the existing benchmarks yet, but it is crucial for evaluating the robustness of entity matching systems. WDC Products is based on heterogeneous product data from thousands of e-shops which mark up product offers using schema.org annotations. Instead of learning how to match entity pairs, entity matching can also be formulated as a multi-class classification task that requires the matcher to recognize individual entities. WDC Products is the first benchmark that provides a pair-wise and a multi-class formulation of the same tasks and thus allows the two alternatives to be compared directly. We evaluate WDC Products using several state-of-the-art matching systems, including Ditto, HierGAT, and R-SupCon. The evaluation shows that all matching systems struggle with unseen entities to varying degrees. It also shows that some systems are more training data-efficient than others.
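The two task formulations provided by the benchmark can be contrasted with a small, made-up example; the records and labels below are illustrative and not taken from WDC Products.

    # Pair-wise formulation: binary classification over offer pairs
    pairwise_example = {
        "offer_left": "Lenovo ThinkPad T14 Gen 2, 16GB RAM",
        "offer_right": "ThinkPad T14 G2 notebook 16 GB",
        "label": 1,  # 1 = match, 0 = non-match
    }

    # Multi-class formulation: each offer is assigned to one of N known product entities
    multiclass_example = {
        "offer": "Lenovo ThinkPad T14 Gen 2, 16GB RAM",
        "label": "product_0042",  # identifier of the real-world product the offer describes
    }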
The Web Data Commons Structured Data Extraction
More and more websites annotate their content using different markup formats. These annotations cover a large number of topics such as persons, events, products, hotels, organizations and cities. The purpose of embedding structured data in HTML pages is to make the content of those pages understandable to web applications. In this way, the retrieval and integration of data deriving from different web pages is greatly facilitated. The presented poster gives an overview of the Web Data Commons structured data project for the year 2016. The Web Data Commons project extracts structured data from the web corpus provided by Common Crawl, the largest public web corpus, and offers the extracted data for public download. In order to process these huge amounts of data, Web Data Commons builds upon its Extraction Framework and the Amazon Web Services.
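As a rough illustration of the kind of markup the project extracts, the sketch below pulls schema.org annotations out of a small HTML snippet using the extruct library; the project itself uses its own Extraction Framework on the Common Crawl corpus, so this is only a stand-in with a made-up page.

    import extruct

    html = """
    <html><body>
      <div itemscope itemtype="http://schema.org/Product">
        <span itemprop="name">Acme Anvil</span>
        <span itemprop="brand">Acme</span>
      </div>
    </body></html>
    """

    # extract Microdata, RDFa and JSON-LD annotations embedded in the page
    data = extruct.extract(html, base_url="http://example.com/",
                           syntaxes=["microdata", "json-ld", "rdfa"])
    for item in data["microdata"]:
        print(item["type"], item["properties"])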
Column type annotation using ChatGPT
Column type annotation is the task of annotating the columns of a relational table with the semantic type of the values contained in each column. Column type annotation is an important pre-processing step for data search and data integration in the context of data lakes. State-of-the-art column type annotation methods either rely on matching table columns to properties of a knowledge graph or fine-tune pre-trained language models such as BERT for column type annotation. In this work, we take a different approach and explore using ChatGPT for column type annotation. We evaluate different prompt designs in zero- and few-shot settings and experiment with providing task definitions and detailed instructions to the model.
We further implement a two-step table annotation pipeline which first determines the class of the entities described in the table and depending on this class asks ChatGPT to annotate columns using only the relevant subset of the overall vocabulary.
Using instructions as well as the two-step pipeline, ChatGPT reaches F1 scores of over 85% in zero- and one-shot setups. To reach a similar F1 score, a RoBERTa model needs to be fine-tuned with 356 examples. This comparison shows that ChatGPT is able to deliver competitive results for the column type annotation task given no or only a minimal amount of task-specific demonstrations.
Integrating product data using deep learning
Product matching is the task of deciding whether two product descriptions refer to the same real-world product. Product matching is a central task in e-commerce applications such as online marketplaces and price comparison portals, as these applications need to find out which offers refer to the same product before they can integrate data from the offers or compare product prices. Product matching is a non-trivial task as merchants describe products in different ways and as small differences in the product descriptions matter for distinguishing between different variants of the same product. A successful approach for dealing with the heterogeneity of product offers is to combine deep learning-based matching techniques with large amounts of training data which can be extracted from Web corpora such as the Common Crawl. Training deep learning methods involving millions of parameters for use cases such as product matching requires access to large compute resources. In this extended abstract, we report how we trained different RNN- and BERT-based models for product matching using the bwHPC infrastructure and how this extended training allowed us to reach peak performance. Afterwards, we describe how we use the bwHPC infrastructure for our ongoing research on table representation learning for data integration.
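A compact sketch of the BERT-based pair classification setup referred to above, using the Hugging Face transformers library; the model choice, the toy offers, and the single training step are illustrative assumptions rather than the exact configuration trained on bwHPC.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    offer_a = "Apple iPhone 13 128GB blue"
    offer_b = "iPhone 13, 128 GB, colour: blue"
    inputs = tokenizer(offer_a, offer_b, return_tensors="pt", truncation=True)

    # one illustrative training step on a single labelled pair (label 1 = match)
    labels = torch.tensor([1])
    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()

    # at inference time, the argmax over the two logits gives the match decision
    prediction = outputs.logits.argmax(dim=-1).item()
    print("match" if prediction == 1 else "non-match")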