Multilingual Transformers for Product Matching -- Experiments and a New Benchmark in Polish
Product matching corresponds to the task of matching identical products
across different data sources. It typically employs available product features
which, apart from being multimodal, i.e., comprised of various data types,
might be non-homogeneous and incomplete. The paper shows that pre-trained,
multilingual Transformer models, after fine-tuning, are suitable for solving
the product matching problem using textual features in both English and
Polish. We tested the multilingual mBERT and XLM-RoBERTa models in English on
the Web Data Commons training dataset and gold standard for large-scale product
matching. The obtained results show that these models perform similarly to the
latest solutions tested on this set, and in some cases, the results were even
better.
Additionally, for research purposes, we prepared a new dataset entirely in
Polish, based on offers in selected categories obtained from several online
stores. It is the first open dataset for product matching tasks in Polish,
which allows comparing the effectiveness of pre-trained models. Thus, we also
report baseline results obtained by the fine-tuned mBERT and XLM-RoBERTa
models on the Polish dataset. Comment: 11 pages, 5 figures
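The pair-classification setup described above serializes the textual features of two offers into a single input pair before scoring. A minimal sketch of that framing, using Python's `difflib` similarity as a toy stand-in for the fine-tuned mBERT/XLM-RoBERTa matcher (the attribute names and example offers are illustrative, not from the paper):

```python
from difflib import SequenceMatcher

def serialize_offer(offer: dict) -> str:
    """Flatten an offer's textual attributes into one string, as is
    commonly done before feeding offer pairs to a Transformer."""
    # Attribute names here are illustrative, not from the paper.
    return " ".join(str(offer.get(k, "")) for k in ("title", "brand", "description"))

def match_score(offer_a: dict, offer_b: dict) -> float:
    """Toy stand-in for a fine-tuned matcher: a real model would consume
    the pair '<text_a> [SEP] <text_b>' and output a match probability;
    here simple string similarity plays that role."""
    return SequenceMatcher(None, serialize_offer(offer_a),
                           serialize_offer(offer_b)).ratio()

a = {"title": "Lenovo ThinkPad T14 16GB", "brand": "Lenovo"}
b = {"title": "ThinkPad T14 Gen 1, 16 GB RAM", "brand": "Lenovo"}
c = {"title": "Apple iPhone 13 128GB", "brand": "Apple"}
```

With this framing, fine-tuning only changes the scoring function; the pair construction stays the same.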
WDC Products: A Multi-Dimensional Entity Matching Benchmark
The difficulty of an entity matching task depends on a combination of
multiple factors such as the amount of corner-case pairs, the fraction of
entities in the test set that have not been seen during training, and the size
of the development set. Current entity matching benchmarks usually represent
single points in the space along such dimensions or they provide for the
evaluation of matching methods along a single dimension, for instance the
amount of training data. This paper presents WDC Products, an entity matching
benchmark which provides for the systematic evaluation of matching systems
along combinations of three dimensions while relying on real-world data. The
three dimensions are (i) the amount of corner cases, (ii) generalization to
unseen entities, and (iii) development set size. Generalization to unseen
entities is a dimension not yet covered by any existing benchmark but is
crucial for evaluating the robustness of entity matching systems. WDC Products
is based on heterogeneous product data from thousands of e-shops which mark up
product offers using schema.org annotations. Instead of learning how to match
entity
pairs, entity matching can also be formulated as a multi-class classification
task that requires the matcher to recognize individual entities. WDC Products
is the first benchmark that provides a pair-wise and a multi-class formulation
of the same tasks, which allows the two alternatives to be compared directly.
We evaluate WDC Products using several state-of-the-art matching systems,
including Ditto, HierGAT, and R-SupCon. The evaluation shows that all matching
systems struggle with unseen entities to varying degrees. It also shows that
some systems are more training-data-efficient than others.
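The two task formulations contrasted above can be derived from the same labeled corpus. The sketch below (with invented toy offers, not actual WDC Products records) shows how a set of offers grouped by product ID yields both a pair-wise binary task and a multi-class task:

```python
from itertools import combinations

# Toy corpus of (offer_text, product_id); not actual WDC Products records.
offers = [
    ("iphone 13 128 gb black", "P1"),
    ("apple iphone 13, 128gb", "P1"),
    ("galaxy s21 ultra 256gb", "P2"),
    ("samsung galaxy s21 ultra", "P2"),
]

# Pair-wise formulation: every offer pair is labeled match (1) / non-match (0).
pairwise = [
    ((a_text, b_text), int(a_pid == b_pid))
    for (a_text, a_pid), (b_text, b_pid) in combinations(offers, 2)
]

# Multi-class formulation: the matcher must recognize the entity itself,
# i.e. predict one class per product.
classes = sorted({pid for _, pid in offers})
multiclass = [(text, classes.index(pid)) for text, pid in offers]
```

The pair-wise view grows quadratically with the number of offers, while the multi-class view stays linear but fixes the set of known entities, which is exactly why generalization to unseen entities matters.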
ProMap: Datasets for Product Mapping in E-commerce
The goal of product mapping is to decide whether two listings from two
different e-shops describe the same product. Existing datasets of matching and
non-matching pairs of products, however, often suffer from incomplete product
information or contain only very distant non-matching products. Therefore,
while predictive models trained on these datasets achieve good results on them,
in practice, they are unusable as they cannot distinguish very similar but
non-matching pairs of products. This paper introduces two new datasets for
product mapping: ProMapCz consisting of 1,495 Czech product pairs and ProMapEn
consisting of 1,555 English product pairs of matching and non-matching products
manually scraped from two pairs of e-shops. The datasets contain both images
and textual descriptions of the products, including their specifications,
making them one of the most complete datasets for product mapping.
Additionally, the non-matching products were selected in two phases, creating
two types of non-matches -- close non-matches and medium non-matches. Even the
medium non-matches are pairs of products that are much more similar than
non-matches in other datasets -- for example, they still need to have the same
brand and a similar name and price. After simple data preprocessing, several
machine learning algorithms were trained on these and two other datasets to
demonstrate the complexity and completeness of the ProMap datasets. The ProMap
datasets are presented as a gold standard for further research on product
mapping, filling the gaps in existing datasets.
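The two-phase selection of non-matches described above can be illustrated with a small sketch. The fields, thresholds, and catalog entries below are invented for illustration and are not the authors' exact selection procedure:

```python
from difflib import SequenceMatcher

# Toy catalog; fields and thresholds are illustrative only.
target = {"name": "Bosch WAN28281 washing machine", "brand": "Bosch", "price": 450.0}
candidates = [
    {"name": "Bosch WAN28282 washing machine", "brand": "Bosch", "price": 470.0},
    {"name": "Bosch Serie 4 washer",           "brand": "Bosch", "price": 520.0},
    {"name": "AEG L6FBG841 washing machine",   "brand": "AEG",   "price": 460.0},
]

def name_sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def non_match_type(target: dict, cand: dict) -> str:
    """Classify a non-matching candidate as a 'close' or 'medium' non-match.
    Medium non-matches still share the brand and a similar name and price;
    close non-matches are nearly indistinguishable from real matches."""
    same_brand = cand["brand"] == target["brand"]
    price_close = abs(cand["price"] - target["price"]) / target["price"] < 0.2
    sim = name_sim(target["name"], cand["name"])
    if same_brand and price_close and sim > 0.9:
        return "close"
    if same_brand and price_close and sim > 0.4:
        return "medium"
    return "distant"
```

Training on such hard negatives is what forces a model to separate genuinely similar-but-different products, rather than just distant ones.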
AdapterEM: Pre-trained Language Model Adaptation for Generalized Entity Matching using Adapter-tuning
Entity Matching (EM) involves identifying different data representations
referring to the same entity from multiple data sources and is typically
formulated as a binary classification problem. It is a challenging problem in
data integration due to the heterogeneity of data representations.
State-of-the-art solutions have adopted NLP techniques based on pre-trained
language models (PrLMs) via the fine-tuning paradigm; however, sequential
fine-tuning of overparameterized PrLMs can lead to catastrophic forgetting,
especially in low-resource scenarios. In this study, we propose a
parameter-efficient paradigm for fine-tuning PrLMs based on adapters, small
neural networks encapsulated between the layers of a PrLM, by optimizing only
the adapter and classifier weights while the PrLM's parameters remain frozen.
Adapter-based methods have been successfully applied to multilingual speech
problems with promising results; however, the effectiveness of these methods
when applied to EM is not yet well understood, particularly for generalized EM
with heterogeneous data. Furthermore, we explore using (i)
pre-trained adapters and (ii) invertible adapters to capture token-level
language representations and demonstrate their benefits for transfer learning
on the generalized EM benchmark. Our results show that our solution achieves
comparable or superior performance to full-scale PrLM fine-tuning and
prompt-tuning baselines while training only a small fraction of the PrLM's
parameters, yielding a significantly smaller computational footprint.
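The parameter savings behind adapter-tuning can be made concrete with back-of-the-envelope arithmetic. The sketch below counts the weights of a standard bottleneck adapter relative to a full model, using BERT-base-like dimensions as an assumption (these figures are illustrative, not from the paper):

```python
def adapter_params(hidden: int, bottleneck: int) -> int:
    """Weights of one bottleneck adapter: a down-projection
    (hidden -> bottleneck) and an up-projection (bottleneck -> hidden),
    each with a bias; the non-linearity in between has no weights."""
    down = hidden * bottleneck + bottleneck
    up = bottleneck * hidden + hidden
    return down + up

# BERT-base-like dimensions (an assumption for illustration).
hidden, layers, total_params = 768, 12, 110_000_000
bottleneck = 64

# Two adapters per Transformer layer, a common placement.
trainable = 2 * layers * adapter_params(hidden, bottleneck)
fraction = trainable / total_params

print(f"trainable adapter weights: {trainable:,} "
      f"({fraction:.2%} of the full model)")
```

Only a few percent of the weights receive gradient updates, which is what makes the approach attractive in low-resource EM settings.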
Reducing the labeling effort for entity resolution using distant supervision and active learning
Entity resolution is the task of identifying records in one or more data sources which refer to the same real-world object. It is often treated as a supervised binary classification task in which a labeled set of matching and non-matching record pairs is used for training a machine learning model. Acquiring labeled data for training machine learning models is expensive and time-consuming, as it typically involves one or more human annotators who need to manually inspect and label the data. It is thus considered a major limitation of supervised entity resolution methods. In this thesis, we research two approaches, relying on distant supervision and active learning, for reducing the labeling effort involved in constructing training sets for entity resolution tasks with different profiling characteristics. Our first approach investigates the utility of semantic annotations found in HTML pages as a source of distant supervision. We profile the adoption growth of semantic annotations over multiple years and focus on product-related schema.org annotations. We develop a pipeline for cleansing and grouping semantically annotated offers describing the same products, thus creating the WDC Product Corpus, the largest publicly available training set for entity resolution. The high predictive performance of entity resolution models trained on offer pairs from the WDC Product Corpus clearly demonstrates the usefulness of semantic annotations as distant supervision for product-related entity resolution tasks. Our second approach focuses on active learning techniques, which have been widely used for reducing the labeling effort for entity resolution in related work. Yet, we identify two research gaps: the inefficient initialization of active learning and the lack of active learning methods tailored to multi-source entity resolution. We address the first research gap by developing an unsupervised method for initializing and further assisting the complete active learning workflow. 
Compared to active learning baselines that use random sampling or transfer learning for initialization, our method guarantees high anytime performance within a limited labeling budget for tasks with different profiling characteristics. We address the second research gap by developing ALMSER, the first active learning method which uses signals inherent to multi-source entity resolution tasks for query selection and model training. Our evaluation results indicate that exploiting such signals for query selection alone has a varying effect on model performance across different multi-source entity resolution tasks. We further investigate this finding by analyzing the impact of the profiling characteristics of multi-source entity resolution tasks on the performance of active learning methods which use different signals for query selection.
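Active learning reduces labeling effort by querying human annotators only for the most informative pairs. A minimal sketch of one generic query-selection strategy, uncertainty sampling (used here as a general illustration; ALMSER's actual signals are specific to multi-source settings and are not reproduced here):

```python
def select_queries(scored_pairs, budget):
    """Pick the `budget` unlabeled pairs whose predicted match
    probability is closest to 0.5, i.e. where the current model is
    least certain; these are the pairs sent to a human annotator."""
    return sorted(scored_pairs, key=lambda p: abs(p[1] - 0.5))[:budget]

# Toy pool of (pair_id, predicted match probability) from an interim model.
pool = [("a-b", 0.97), ("a-c", 0.52), ("b-d", 0.08), ("c-d", 0.45), ("a-d", 0.71)]
queries = select_queries(pool, budget=2)
```

After each labeling round the model is retrained and the pool is re-scored, so the queried region shifts as the decision boundary improves.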
Integrating product data using deep learning: Art.-Nr. 11
Product matching is the task of deciding whether two product descriptions refer to the same real-world product. Product matching is a central task in e-commerce applications such as online marketplaces and price comparison portals, as these applications need to find out which offers refer to the same product before they can integrate data from the offers or compare product prices. Product matching is a non-trivial task, as merchants describe products in different ways and as small differences in the product descriptions matter for distinguishing between different variants of the same product. A successful approach for dealing with the heterogeneity of product offers is to combine deep learning-based matching techniques with large amounts of training data which can be extracted from Web corpora such as the Common Crawl. Training deep learning methods involving millions of parameters for use cases such as product matching requires access to large compute resources. In this extended abstract, we report how we trained different RNN- and BERT-based models for product matching using the bwHPC infrastructure and how this extended training allowed us to reach peak performance. Afterwards, we describe how we use the bwHPC infrastructure for our ongoing research on table representation learning for data integration.
An exploratory study on utilising the web of linked data for product data mining
The Linked Open Data practice has led to a significant growth of structured data on the Web. While this has created an unprecedented opportunity for research in the field of Natural Language Processing, there is a lack of systematic studies on how such data can be used to support downstream NLP tasks. This work focuses on the e-commerce domain and explores how we can use such structured data to create language resources for product data mining tasks. To do so, we process billions of structured data points in the form of RDF n-quads to create multi-million-word product-related corpora that are later used in three different ways for creating language resources: training word-embedding models, continued pre-training of BERT-like language models, and training machine translation models that are used as a proxy to generate product-related keywords. These language resources are then evaluated in three downstream tasks -- product classification, linking, and fake review detection -- using an extensive set of benchmarks. Our results show word embeddings to be the most reliable and consistent method for improving accuracy on all tasks (by up to 6.9 percentage points in macro-average F1 on some datasets). Contrary to some earlier studies that suggest rather simple but effective approaches, such as building domain-specific language models by pre-training on in-domain corpora, our work serves as a lesson that adapting these methods to new domains may not be as easy as it seems. We further analyse our datasets and reflect on how our findings can inform future research and practice.
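The pipeline above starts from structured Web data serialized as RDF n-quads. A minimal sketch of pulling product names out of such data with a simplified pattern (the example lines are invented, and real Web-scale dumps need a proper n-quads parser rather than a regex):

```python
import re

# Simplified n-quad lines (real Web Data Commons dumps are far messier).
nquads = """\
_:n0 <http://schema.org/name> "Acme Phone X 128GB" <http://shop-a.example/p1> .
_:n0 <http://schema.org/price> "299.00" <http://shop-a.example/p1> .
_:n1 <http://schema.org/name> "Acme Phone X (128 GB)" <http://shop-b.example/p9> .
"""

# Pull the literal object of every schema.org/name triple; this regex
# deliberately ignores datatype/language tags and escaped quotes.
NAME_RE = re.compile(r'<http://schema\.org/name> "([^"]*)"')

product_names = [m.group(1) for m in NAME_RE.finditer(nquads)]
```

Text extracted this way is what feeds the corpora used for word-embedding training and continued pre-training.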