96 research outputs found

    RefConcile – automated online reconciliation of bibliographic references

    Get PDF
    Comprehensive bibliographies often rely on community contributions. In such a setting, de-duplication is mandatory for the bibliography to be useful. Ideally, it works online, i.e., during the addition of new references, so the bibliography remains duplicate-free at all times. While de-duplication is well researched, generic approaches do not achieve the result quality required for automated reconciliation. To overcome this problem, we propose a new duplicate detection and reconciliation technique called RefConcile. Aimed specifically at bibliographic references, it uses dedicated blocking and matching techniques tailored to this type of data. Our evaluation based on a large real-world collection of bibliographic references shows that RefConcile scales well, and that it detects and reconciles duplicates highly accurately

    On Feeding Business Systems with Linked Resources from the Web of Data

    Get PDF
    Business systems that are fed with data from the Web of Data require transparent interoperability. The Linked Data principles establish that different resources that represent the same real-world entities must be linked for such purpose. Link rules are paramount to transparent interoperability since they produce the links between resources. State-of-the-art link rules are learnt by genetic programming and build on comparing the values of the attributes of the resources. Unfortunately, this approach falls short in cases in which resources have similar values for their attributes, but represent different real-world entities. In this paper, we present a proposal that leverages a genetic programming that learns link rules and an ad-hoc filtering technique that boosts them to decide whether the links that they produce must be selected or not. Our analysis of the literature reveals that our approach is novel and our experimental analysis confirms that it helps improve the F1 score by increasing precision without a significant penalty on recall.Ministerio de Economía y Competitividad TIN2013-40848-RMinisterio de Economía y Competitividad TIN2016- 75394-

    Improving Record Linkage Accuracy with Hierarchical Feature Level Information and Parsed Data

    Get PDF
    Probabilistic record linkage is a well established topic in the literature. Fellegi-Sunter probabilistic record linkage and its enhanced versions are commonly used methods, which calculate match and non- match weights for each pair of records. Bayesian network classifiers – naive Bayes classifier and TAN have also been successfully used here. Recently, an extended version of TAN (called ETAN) has been developed and proved superior in classification accuracy to conventional TAN. However, no previous work has applied ETAN to record linkage and investigated the benefits of using naturally existing hierarchical feature level information and parsed fields of the datasets. In this work, we ex- tend the naive Bayes classifier with such hierarchical feature level information. Finally we illustrate the benefits of our method over previously proposed methods on 4 datasets in terms of the linkage performance (F1 score). We also show the results can be further improved by evaluating the benefit provided by additionally parsing the fields of these datasets

    Brazilian Consensus on Photoprotection

    Full text link
    corecore