6 research outputs found

    Record Duplication Detection in Database: A Review

    The recognition of similar entities in databases has gained substantial attention in many application areas. Despite the several techniques proposed to recognize and locate duplicate database records, there is a dearth of studies that rate the effectiveness of the diverse techniques used for duplicate record detection. Calculating the time complexity of the proposed methods reveals their relative performance, and this calculation shows that their efficiency improves when blocking and windowing are applied. Some domain-specific methods train systems to optimize results and improve efficiency and scalability, but they are prone to errors. Most existing methods either fail to discuss scalability or lack thoroughness in considering it. Sorting and searching form an essential part of duplication detection, but they are time-consuming. This paper therefore proposes eliminating the sorting process by utilizing a tree structure, which reduces the time required for record duplication detection and offers a probable increase in scalability. For database systems, scalability is an essential property of any proposed solution because data sizes are huge. Improving the efficiency of identifying duplicate records in databases is an essential step for data cleaning and data integration. This paper reveals that the currently proposed methods lack solutions that are scalable and highly accurate and that reduce the processing time of detecting duplicate records in a database. The ability to solve this problem will improve the quality of the data used in the decision-making process.
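    The windowing technique the review credits with improving efficiency can be illustrated with a short sketch of the classic sorted-neighborhood approach: sort records on a key, then compare each record only with its near neighbors. The sort key, similarity measure, and threshold below are hypothetical illustration choices, not taken from the paper.

    ```python
    import difflib

    # Sketch of sorted-neighborhood (windowing) duplicate detection: sorting
    # brings likely duplicates close together, so comparing only within a
    # sliding window cuts the comparison count from O(n^2) to roughly O(n*w).
    def sorted_neighborhood_duplicates(records, window=3, threshold=0.8):
        order = sorted(range(len(records)), key=lambda i: records[i])
        pairs = []
        for pos, i in enumerate(order):
            for j in order[pos + 1 : pos + window]:
                # Placeholder similarity; real systems use domain-specific matchers.
                sim = difflib.SequenceMatcher(None, records[i], records[j]).ratio()
                if sim >= threshold:
                    pairs.append(tuple(sorted((i, j))))
        return pairs

    records = ["john smith", "jane doe", "jon smith", "peter pan"]
    dups = sorted_neighborhood_duplicates(records)  # → [(0, 2)]
    ```

    The tree structure the paper proposes would replace the explicit sort: inserting each record key into an ordered tree locates its near neighbors incrementally, avoiding a separate sorting pass.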

    Enriching product ads with Metadata from HTML annotations


    APFA: Automated Product Feature Alignment for Duplicate Detection

    To keep up with the growing interest in using Web shops for product comparison, we have developed a method that targets the problem of product duplicate detection. If duplicates can be discovered correctly and quickly, customers can compare products in an efficient manner. We build upon the state-of-the-art Multi-component Similarity Method (MSM) for product duplicate detection by developing an automated pre-processing phase that occurs before the similarities between products are calculated. Specifically, in this prior phase the features of products are aligned between Web shops, using metrics such as the data type, coverage, and diversity of each key, as well as the distribution and measurement units of their corresponding values. With this information, the values of these keys can be employed more meaningfully and efficiently in the process of comparing products. Applying our method to a real-world dataset of 1,629 TVs across 4 Web shops, we find that we increase the speed of the product similarity phase by roughly a factor of 3 due to fewer meaningless comparisons, an improved brand analyzer, and a renewed title analyzer. Moreover, in terms of quality of duplicate detection, we significantly outperform MSM, with an F1-measure of 0.746 versus 0.525.
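    The key-alignment idea in the pre-processing phase can be sketched as profiling each feature key by its data type and coverage, then pairing keys with compatible profiles across shops. This is a hedged illustration, not the authors' APFA implementation; the field names, coverage threshold, and tolerance are invented for the example.

    ```python
    # Illustrative key alignment across two Web shops, using two of the metrics
    # the abstract names: the data type and coverage of each key.
    def key_profile(products, key):
        values = [p[key] for p in products if key in p]
        coverage = len(values) / len(products)  # fraction of products carrying the key
        numeric = bool(values) and all(v.replace(".", "", 1).isdigit() for v in values)
        return {"coverage": coverage, "type": "numeric" if numeric else "text"}

    def align_keys(shop_a, shop_b, min_coverage=0.5, tolerance=0.3):
        """Pair keys whose type matches and whose coverage is similar (heuristic)."""
        keys_a = {k for p in shop_a for k in p}
        keys_b = {k for p in shop_b for k in p}
        matches = []
        for ka in keys_a:
            pa = key_profile(shop_a, ka)
            if pa["coverage"] < min_coverage:
                continue
            for kb in keys_b:
                pb = key_profile(shop_b, kb)
                if pb["type"] == pa["type"] and abs(pb["coverage"] - pa["coverage"]) < tolerance:
                    matches.append((ka, kb))
        return matches

    shop_a = [{"screen size": "40", "brand": "Sony"}, {"screen size": "55", "brand": "LG"}]
    shop_b = [{"diagonal": "40", "maker": "Sony"}]
    matches = align_keys(shop_a, shop_b)
    ```

    Once "screen size" and "diagonal" are aligned, their values can be compared directly instead of being treated as unrelated attributes, which is where the reported speedup comes from.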

    An exploratory study on utilising the web of linked data for product data mining

    The Linked Open Data practice has led to a significant growth of structured data on the Web. While this has created an unprecedented opportunity for research in the field of Natural Language Processing, there is a lack of systematic studies on how such data can be used to support downstream NLP tasks. This work focuses on the e-commerce domain and explores how we can use such structured data to create language resources for product data mining tasks. To do so, we process billions of structured data points in the form of RDF n-quads to create multi-million-word product-related corpora, which are then used in three different ways to create language resources: training word-embedding models, continued pre-training of BERT-like language models, and training machine translation models that are used as a proxy to generate product-related keywords. These language resources are then evaluated on three downstream tasks, product classification, linking, and fake review detection, using an extensive set of benchmarks. Our results show word embeddings to be the most reliable and consistent method for improving accuracy on all tasks (by up to 6.9 percentage points in macro-average F1 on some datasets). Contrary to some earlier studies that suggest a rather simple but effective approach, such as building domain-specific language models by pre-training on in-domain corpora, our work serves as a lesson that adapting these methods to new domains may not be as easy as it seems. We further analyse our datasets and reflect on how our findings can inform future research and practice.
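    The corpus-building step, extracting product-related text from RDF n-quads, can be sketched minimally as follows. This is an assumption-laden stand-in for the paper's pipeline: real n-quads parsing should use a proper RDF library (e.g. rdflib), and the `schema.org/name` predicate filter is only an illustrative choice.

    ```python
    import re

    # Capture a quoted literal, allowing escaped characters inside it.
    LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"')

    def extract_literals(nquad_lines, predicate_filter="schema.org/name"):
        """Collect literal objects whose predicate mentions `predicate_filter`."""
        texts = []
        for line in nquad_lines:
            parts = line.split(None, 2)  # subject, predicate, remainder
            if len(parts) < 3 or predicate_filter not in parts[1]:
                continue
            m = LITERAL.search(parts[2])
            if m:
                texts.append(m.group(1))
        return texts

    quads = [
        '<http://a.com/p1> <http://schema.org/name> "Samsung 55-inch TV" <http://a.com> .',
        '<http://a.com/p1> <http://schema.org/price> "499.99" <http://a.com> .',
    ]
    corpus = extract_literals(quads)  # → ["Samsung 55-inch TV"]
    ```

    Text gathered this way would then feed the word-embedding and continued pre-training steps the abstract describes.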

    Intelligent Information Systems for Web Product Search

    Over the last few years, we have experienced an increase in online shopping. Consequently, there is a need for efficient and effective product search engines. The rapid growth of e-commerce, however, has also introduced some challenges. Studies show that users can get overwhelmed by the information and offerings presented online while searching for products. In an attempt to lighten this information-overload burden on consumers, several product search engines aggregate product descriptions and price information from the Web and allow the user to easily query this information. Most of these search engines expect to receive the data from the participating Web shops in a specific format, which means Web shops need to transform their data more than once, as each product search engine requires a different format. Because most product information aggregation services currently rely on Web shops to send them their data, there is a big opportunity for solutions that tackle this problem using a more automated approach. This dissertation addresses key aspects of implementing such a system, including hierarchical product classification, entity resolution, ontology population and schema mapping, and, lastly, the optimization of faceted user interfaces. The findings of this work show how one can design Web product search engines that automatically aggregate product information while allowing users to perform effective and efficient queries.
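    Of the components listed above, the faceted user interface is the most compact to illustrate: for each facet field, the engine counts how many matching products carry each value so the UI can display filter options with counts. This sketch assumes nothing about the dissertation's actual design; the catalog and field names are invented.

    ```python
    from collections import Counter

    # Compute facet counts for a faceted product-search UI: for every facet
    # field, tally how many products carry each distinct value.
    def facet_counts(products, facets):
        return {f: Counter(p[f] for p in products if f in p) for f in facets}

    catalog = [
        {"brand": "Sony", "type": "LED"},
        {"brand": "Sony", "type": "OLED"},
        {"brand": "LG",   "type": "OLED"},
    ]
    counts = facet_counts(catalog, ["brand", "type"])
    # counts["brand"] → Counter({"Sony": 2, "LG": 1})
    ```

    Optimizing which facets to show, and in what order, is the kind of decision the dissertation's faceted-interface work addresses.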