
    Using ChatGPT for Entity Matching

    Entity matching is the task of deciding whether two entity descriptions refer to the same real-world entity. State-of-the-art entity matching methods often rely on fine-tuning Transformer models such as BERT or RoBERTa. Two major drawbacks of using these models for entity matching are that (i) the models require significant amounts of fine-tuning data to reach good performance and (ii) the fine-tuned models are not robust to out-of-distribution entities. In this paper, we investigate using ChatGPT for entity matching as a more robust, training-data-efficient alternative to traditional Transformer models. We perform experiments along three dimensions: (i) general prompt design, (ii) in-context learning, and (iii) provision of higher-level matching knowledge. We show that ChatGPT is competitive with a fine-tuned RoBERTa model, reaching an average zero-shot performance of 83% F1 on a challenging matching task on which RoBERTa requires 2000 training examples to reach similar performance. Adding in-context demonstrations to the prompts further improves F1 by up to 5%, even when using only a small set of 20 handpicked examples. Finally, we show that guiding the zero-shot model by stating higher-level matching rules leads to similar gains as providing in-context examples.
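    The zero-shot prompting idea described above can be sketched as a simple prompt-construction function. This is an illustrative template only; the exact prompt wording and the example product descriptions are assumptions, not the prompts used in the paper.

```python
def build_matching_prompt(offer_a: str, offer_b: str) -> str:
    """Build a zero-shot entity-matching prompt for a chat LLM.

    Illustrative template; the paper experiments with several
    prompt designs that differ from this wording.
    """
    return (
        "Do the following two product descriptions refer to the same "
        "real-world product? Answer with 'Yes' or 'No'.\n"
        f"Product 1: {offer_a}\n"
        f"Product 2: {offer_b}"
    )

# Hypothetical offer pair for illustration
prompt = build_matching_prompt(
    "DYMO D1 Tape 12mm x 7m black on white",
    "Dymo D1 12mm label tape, black/white, 7 m",
)
```

    The returned string would then be sent to the model; in-context learning, as evaluated in the paper, amounts to prepending a few labeled demonstration pairs in the same format.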

    WDC Products: A Multi-Dimensional Entity Matching Benchmark

    The difficulty of an entity matching task depends on a combination of multiple factors, such as the amount of corner-case pairs, the fraction of entities in the test set that have not been seen during training, and the size of the development set. Current entity matching benchmarks usually represent single points in the space along such dimensions, or they provide for the evaluation of matching methods along a single dimension, for instance the amount of training data. This paper presents WDC Products, an entity matching benchmark which provides for the systematic evaluation of matching systems along combinations of three dimensions while relying on real-world data. The three dimensions are (i) the amount of corner cases, (ii) generalization to unseen entities, and (iii) development set size. Generalization to unseen entities is a dimension not yet covered by any existing benchmark but is crucial for evaluating the robustness of entity matching systems. WDC Products is based on heterogeneous product data from thousands of e-shops which mark up product offers using schema.org annotations. Instead of learning how to match entity pairs, entity matching can also be formulated as a multi-class classification task that requires the matcher to recognize individual entities. WDC Products is the first benchmark that provides a pair-wise and a multi-class formulation of the same tasks and thus allows a direct comparison of the two alternatives. We evaluate WDC Products using several state-of-the-art matching systems, including Ditto, HierGAT, and R-SupCon. The evaluation shows that all matching systems struggle with unseen entities to varying degrees. It also shows that some systems are more training-data efficient than others.
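    The difference between the pair-wise and the multi-class formulation can be sketched as follows. The record layout (offer text plus entity id) is a hypothetical simplification, not the benchmark's actual file format.

```python
from itertools import combinations

# Hypothetical records: (offer_text, entity_id)
offers = [
    ("apple iphone 12 64gb black", "E1"),
    ("iPhone 12, 64 GB, schwarz", "E1"),
    ("samsung galaxy s21 128gb", "E2"),
]

def pairwise_examples(offers):
    """Pair-wise formulation: label each offer pair as match (1) or non-match (0)."""
    return [
        (a_text, b_text, int(a_id == b_id))
        for (a_text, a_id), (b_text, b_id) in combinations(offers, 2)
    ]

def multiclass_examples(offers):
    """Multi-class formulation: predict the entity id of each single offer."""
    return [(text, entity_id) for text, entity_id in offers]

pairs = pairwise_examples(offers)  # 3 pairs, exactly one of them a match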

    Cross-language learning for product matching

    Transformer-based entity matching methods have significantly moved the state of the art for less-structured matching tasks such as matching product offers in e-commerce. In order to excel at these tasks, Transformer-based matching methods require a substantial number of training pairs. Providing enough training data can be challenging, especially if a matcher for non-English product descriptions is to be learned. Using the use case of matching product offers from different e-shops, this poster explores to what extent it is possible to improve the performance of Transformer-based matchers by complementing a small set of training pairs in the target language, German in our case, with a larger set of English-language training pairs. Our experiments using different Transformers show that extending the German set with English pairs improves the matching performance in all cases. The impact of adding the English pairs is especially high in low-resource settings in which only a rather small number of non-English pairs is available. As it is often possible to automatically gather English training pairs from the Web by exploiting schema.org annotations, our results are relevant for many product matching scenarios targeting low-resource languages.
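    The augmentation strategy, complementing a small target-language set with English pairs, can be sketched in a few lines. The pair tuples and example offers are hypothetical; the paper's actual preprocessing and sampling may differ.

```python
import random

def build_training_set(target_pairs, english_pairs, seed=0):
    """Combine a small target-language training set (e.g. German)
    with a larger English set and shuffle for training.

    A sketch of the augmentation idea, not the paper's exact setup.
    """
    combined = list(target_pairs) + list(english_pairs)
    random.Random(seed).shuffle(combined)
    return combined

# Hypothetical (offer_a, offer_b, label) pairs
german = [("Bosch Akkuschrauber 12V", "Bosch 12 V Schrauber", 1)]
english = [
    ("Bosch cordless drill 12V", "Bosch 12V drill/driver", 1),
    ("Makita drill 18V", "Bosch 12V drill/driver", 0),
]
train = build_training_set(german, english)
```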

    Integrating product data using deep learning: Art.-Nr. 11

    Product matching is the task of deciding whether two product descriptions refer to the same real-world product. Product matching is a central task in e-commerce applications such as online marketplaces and price comparison portals, as these applications need to determine which offers refer to the same product before they can integrate data from the offers or compare product prices. Product matching is a non-trivial task, as merchants describe products in different ways and as small differences in the product descriptions matter for distinguishing between different variants of the same product. A successful approach for dealing with the heterogeneity of product offers is to combine deep learning-based matching techniques with large amounts of training data, which can be extracted from Web corpora such as the Common Crawl. Training deep learning methods involving millions of parameters for use cases such as product matching requires access to large compute resources. In this extended abstract, we report how we trained different RNN- and BERT-based models for product matching using the bwHPC infrastructure and how this extended training allowed us to reach peak performance. Afterwards, we describe how we use the bwHPC infrastructure for our ongoing research on table representation learning for data integration.
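    A common way to feed product offer pairs into a BERT-based matcher is to serialize each offer's attribute-value pairs into one string, as popularized by the Ditto system ("COL ... VAL ..." markers). The sketch below assumes hypothetical attribute names; the abstract does not specify the serialization used in this work.

```python
def serialize_offer(offer: dict) -> str:
    """Serialize a product offer into a single string for a
    Transformer matcher (Ditto-style 'COL ... VAL ...' format)."""
    return " ".join(f"COL {k} VAL {v}" for k, v in offer.items())

def serialize_pair(a: dict, b: dict) -> str:
    # The two serialized offers form one sequence pair; '[SEP]'
    # marks the boundary, mirroring BERT's input convention.
    return f"{serialize_offer(a)} [SEP] {serialize_offer(b)}"

# Hypothetical offers for illustration
s = serialize_pair(
    {"title": "Logitech MX Master 3", "brand": "Logitech"},
    {"title": "MX Master 3 wireless mouse", "brand": "Logitech"},
)
```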

    Dual-objective fine-tuning of BERT for entity matching

    An increasing number of data providers have adopted shared numbering schemes such as GTIN, ISBN, DUNS, or ORCID numbers for identifying entities in the respective domain. For data integration, this means that shared identifiers are often available for a subset of the entity descriptions to be integrated, while such identifiers are not available for others. The challenge in these settings is to learn a matcher for entity descriptions without identifiers, using the entity descriptions containing identifiers as training data. The task can be approached by learning a binary classifier which distinguishes pairs of entity descriptions for the same real-world entity from descriptions of different entities. The task can also be modeled as a multi-class classification problem by learning classifiers for identifying descriptions of individual entities. We present a dual-objective training method for BERT, called JointBERT, which combines binary matching and multi-class classification, forcing the model to predict the entity identifier for each entity description in a training pair in addition to the match/non-match decision. Our evaluation across five entity matching benchmark datasets shows that dual-objective training can increase the matching performance for seen products by 1% to 5% F1 compared to single-objective Transformer-based methods, given that enough training data is available for both objectives. In order to gain a deeper understanding of the strengths and weaknesses of the proposed method, we compare JointBERT to several other BERT-based matching methods as well as baseline systems along a set of specific matching challenges. This evaluation shows that JointBERT, given enough training data for both objectives, outperforms the other methods on tasks involving seen products, while it underperforms for unseen products. Using a combination of LIME explanations and domain-specific word classes, we analyze the matching decisions of the different deep learning models and conclude that BERT-based models are better at focusing on relevant word classes compared to RNN-based models.
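    The dual-objective idea, a binary match loss plus an entity-identifier loss for each description in the pair, can be sketched as a combined loss function. The equal-style weighting via `alpha` and the exact formulation are assumptions; JointBERT's actual loss may be weighted differently.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example (log-sum-exp form)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def joint_loss(match_logits, match_label,
               id_logits_a, id_a, id_logits_b, id_b, alpha=0.5):
    """Dual-objective loss: binary match/non-match decision plus
    multi-class entity-id prediction for both descriptions.

    A sketch of the idea behind JointBERT; the paper's weighting
    scheme is not specified in the abstract.
    """
    binary = cross_entropy(match_logits, match_label)
    multiclass = cross_entropy(id_logits_a, id_a) + cross_entropy(id_logits_b, id_b)
    return alpha * binary + (1 - alpha) * multiclass

loss = joint_loss([0.0, 2.0], 1, [3.0, 0.0], 0, [3.0, 0.0], 0)
```

    Training on both objectives forces the encoder to carry entity-identity information, which the evaluation above shows helps for seen products but not for unseen ones.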

    Robert Conquest on the Holocaust and the Holodomor

    Purpose – Previous research has demonstrated strong relations between work characteristics (e.g. job demands and job resources) and work outcomes such as work performance and work engagement. So far, little attention has been given to the role of authenticity (i.e. employees’ ability to experience their true selves) in these relations. The purpose of this paper is to explore the relationship of state authenticity at work with job demands and resources on the one hand and work engagement, job satisfaction, and subjective performance on the other hand. Design/methodology/approach – In total, 680 Dutch bank employees participated in the study. Structural equation modelling was used to test the goodness-of-fit of the hypothesized model. Bootstrapping (Preacher and Hayes, 2008) was used to examine the mediating effect of state authenticity. Findings – Results showed that job resources were positively associated with authenticity and, in turn, that authenticity was positively related to work engagement, job satisfaction, and performance. Moreover, state authenticity partially mediated the relationship between job resources and the three occupational outcomes. Research limitations/implications – The main limitations of this study were the use of self-report questionnaires, the cross-sectional design, and the homogeneous sample. However, the significant relationships between workplace characteristics, occupational outcomes, and state authenticity enhance our current understanding of the JD-R model. Practical implications – Managers might consider enhancing the state authenticity of employees by investing in job resources, since high levels of authenticity were found to be strongly linked to positive occupational outcomes. Originality/value – This study is among the first to examine the role of authenticity in the workplace and highlights the importance of state authenticity for work-related outcomes.

    SOTAB: The WDC Schema.org table annotation benchmark

    Understanding the semantics of table elements is a prerequisite for many data integration and data discovery tasks. Table annotation is the task of labeling table elements with terms from a given vocabulary. This paper presents the WDC Schema.org Table Annotation Benchmark (SOTAB) for comparing the performance of table annotation systems. SOTAB covers the column type annotation (CTA) and column property annotation (CPA) tasks. SOTAB provides ∼50,000 annotated tables for each of the tasks, containing Schema.org data from different websites. The tables cover 17 different types of entities, such as movie, event, local business, recipe, job posting, or product. The tables stem from the WDC Schema.org Table Corpus, which was created by extracting Schema.org annotations from the Common Crawl. Consequently, the labels used for annotating columns in SOTAB are part of the Schema.org vocabulary. The benchmark covers 91 types for CTA and 176 properties for CPA, distributed across textual, numerical and date/time columns. The tables are split into fixed training, validation and test sets. The test sets are further divided into subsets focusing on specific challenges, such as columns with missing values or different value formats, in order to allow a more fine-grained comparison of annotation systems. The evaluation of SOTAB using Doduo and TURL shows that the benchmark is difficult to solve for current state-of-the-art systems.
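    Scoring a column type annotation system against such a benchmark reduces to comparing predicted labels with gold labels per column. The metric below is an illustrative single-label micro-F1 (which equals accuracy when each column has exactly one label); the label strings are hypothetical and the benchmark's official evaluation script may differ.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over column annotations; with exactly one
    label per column this coincides with accuracy."""
    assert len(gold) == len(pred)
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)

# Hypothetical CTA predictions: one Schema.org-derived label per column
gold = ["Movie/name", "Movie/duration", "Movie/datePublished"]
pred = ["Movie/name", "Movie/duration", "Movie/director"]
score = micro_f1(gold, pred)  # 2 of 3 columns labeled correctly
```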

    Effects of two submerged macrophyte species on microbes and metazoans in rooftop water-storage ponds with different labile carbon loadings

    Nature-based solutions, including rooftop water-storage ponds, are increasingly adopted in cities as new ecodesigns to address climate change issues such as water scarcity and storm-water runoff. Macrophytes may be valuable additions for treating stored rooftop waters and provisioning other services, including aquaponics, esthetic and wildlife-conservation values. However, the efficacy of macrophyte treatments has not been tested with influxes of different labile carbon loadings such as those occurring in storms. Moreover, little is known about how macrophytes affect communities of metazoans and microbes, including protozoans, which are key players in the water-treatment process. Here, we experimentally investigated the effectiveness of two widely distributed macrophytes, Ceratophyllum demersum and Egeria densa, for treating drained rooftop water fed with two types of leaf litter, namely Quercus robur (high C lability) and Quercus rubra (low C lability). C. demersum was better than E. densa at reducing water conductivity (by 10-40 μS/cm), TDS (by 10-18 mg/L) and DOC (by 4-5 mg/L), and at increasing water transparency (by 4-9%), water O2 levels (by 19-27%) and daylight pH (by 0.9-1.3), compared to leaf-litter-only microcosms after 30 days. Each treatment developed a different community of algae, protozoa and metazoa. Greater plant mass and epiphytic chlorophyll-a suggested that C. demersum was better at providing supporting habitat than E. densa. The two macrophytes did not differ in detritus accumulation, but E. densa was more prone to develop filamentous bacteria, which cause sludge bulking in water-treatment systems. Our study highlights the superior capacity of C. demersum and the usefulness of whole-ecosystem experiments in choosing the most adequate macrophyte species for nature-based engineered solutions.

    Correlations of health status indicators with perceived neuropsychological impairment and cognitive processing speed in multiple sclerosis

    Background: Comorbidity and health behaviours may explain heterogeneity in cognitive performance in multiple sclerosis. Patient-reported cognitive difficulties have an impact but do not consistently correlate with objective cognitive performance. Our study aims to investigate whether health status indicators, including comorbidities, body mass index, physical activity, smoking, sleeping behaviour and consumption patterns for fish, alcohol and caffeinated drinks, are associated with measures of subjective and objective cognitive performance. Methods: Survey data on self-reported cognitive performance, assessed with the MS Neuropsychological Screening Questionnaire (MSNQ), were related to the presence of arterial hypertension, diabetes mellitus, cardiovascular and chronic renal diseases, hypercholesterolemia, depression based on a 2-question screening tool, and health and consumption behaviours. We included the Symbol Digit Modalities Test (SDMT), when available within 6 months, as an objective, performance-based metric of cognitive processing speed. We investigated the interrelation between all variables with a Spearman correlation matrix and corrected for multiple testing. Regression models were built and controlled for age, sex and phenotype. Results: We used available data from 751 patients with definite MS, including 290 SDMT scores within a time window of 6 months, to study relations between variables. MSNQ and SDMT scores were not significantly correlated. Correlation patterns for subjective and objective performance differed. Age, disease duration and physical disability correlated with SDMT scores only. Regression analyses could be performed for MSNQ scores in 595/751 (79.2%) and for SDMT scores in 234/751 (31.2%) participants. After restricting variables to avoid collinearity and adjusting for the number of variables, the regression models explained 15% of the variance for subjective and 14% of the variance for objective cognitive performance. A higher number of physical comorbidities, reporting depressive symptoms, sleeping 9 h or more, and daily use of sleeping medication were associated with lower subjective cognitive performance, whereas increasing age was associated with reduced processing speed. These associations persisted after correction for multiple testing. Conclusion: Increasing age is associated with reduced cognitive processing speed, whereas comorbidities and sleep behaviours contribute to subjective cognitive performance.

    Unusually Rapid Development of Pulmonary Hypertension and Right Ventricular Failure after COVID-19 Pneumonia

    COVID-19 is a novel viral disease caused by SARS-CoV-2. The mid- and long-term outcomes have not yet been determined. COVID-19 infection is increasingly being associated with systemic and multi-organ involvement, encompassing cytokine release syndrome and thromboembolic, vascular and cardiac events. The patient described experienced an unusually rapid development of pulmonary hypertension (PH) and right ventricular failure after recent severe COVID-19 pneumonia with cytokine release syndrome, which initially was successfully treated with methylprednisolone and tocilizumab. The development of pulmonary hypertension and right ventricular failure – in the absence of emboli on multiple CT angiograms – was most likely caused by progressive pulmonary parenchymal abnormalities combined with microvascular damage of the pulmonary arteries (group III and group IV pulmonary hypertension, respectively). To the best of our knowledge, these complications have not previously been described, and therefore awareness of PH as a complication of COVID-19 is warranted.