425 research outputs found

    On Demand Quality of web services using Ranking by multi criteria

    In the Web database scenario, the records to match are highly query-dependent, since they can only be obtained through online queries. Moreover, they are only a partial and biased portion of all the data in the source Web databases. Consequently, hand-coding or offline-learning approaches are not appropriate for two reasons. First, the full data set is not available beforehand, and therefore, good representative data for training are hard to obtain. Second, and most importantly, even if good representative data are found and labeled for learning, the rules learned on the representatives of a full data set may not work well on a partial and biased part of that data set. Keywords: SOA, Web Services, Network

    INDEPENDENT DE-DUPLICATION IN DATA CLEANING

    Many organizations collect large amounts of data to support their business and decision-making processes. The data originate from a variety of sources that may have inherent data-quality problems. These problems become more pronounced when heterogeneous data sources are integrated (for example, in data warehouses). A major problem that arises from integrating different databases is the existence of duplicates. The challenge of de-duplication is identifying "equivalent" records within the database. Most published research in de-duplication proposes techniques that rely heavily on domain knowledge. A few propose solutions that are partially domain-independent. This paper identifies two levels of domain-independence in de-duplication, namely domain-independence at the attribute level and domain-independence at the record level. The paper then proposes a positional algorithm that achieves domain-independent de-duplication at the attribute level, and a technique for field weighting by data profiling which, when used with the positional algorithm, achieves domain-independence at the record level. Experiments show that the proposed techniques achieve more accurate de-duplication than existing algorithms.
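
    The abstract does not give the positional algorithm's internals, so the following is only a minimal sketch of the two levels it distinguishes: a domain-independent attribute-level comparison and record-level weighting derived from simple data profiling. The position-wise character comparison, the distinctiveness-based weights, and all function names are illustrative assumptions, not the paper's actual method.

```python
def attribute_similarity(a, b):
    """Attribute-level, domain-independent similarity in [0, 1]:
    position-wise character agreement with a token-overlap fallback
    (an illustrative stand-in for the paper's positional algorithm)."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0
    positional = sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))
    tokens_a, tokens_b = set(a.split()), set(b.split())
    token_overlap = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
    return max(positional, token_overlap)


def profile_field_weights(records):
    """Record-level weighting by data profiling: fields with more distinct
    values are more discriminative and receive a larger weight."""
    fields = list(records[0].keys())
    raw = {f: len({r[f] for r in records}) / len(records) for f in fields}
    total = sum(raw.values()) or 1.0
    return {f: w / total for f, w in raw.items()}


def record_similarity(r1, r2, weights):
    """Weighted sum of attribute similarities over all profiled fields."""
    return sum(w * attribute_similarity(r1[f], r2[f]) for f, w in weights.items())


# Two records that likely describe the same customer despite formatting differences.
records = [
    {"name": "Jon Smith", "city": "New York", "phone": "212-555-0199"},
    {"name": "John Smith", "city": "new york", "phone": "2125550199"},
]
weights = profile_field_weights(records)
print(record_similarity(records[0], records[1], weights))
# Higher scores suggest duplicates; the decision threshold would be tuned empirically.
```

    Distinctiveness is only one plausible profiling signal; the paper's own field-weighting technique may use different statistics.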

    Improving Database Quality through Eliminating Duplicate Records

    Redundant or duplicate data are among the most troublesome problems in database management and applications. Approximate field matching is the key to resolving the problem by identifying semantically equivalent string values in syntactically different representations. This paper considers token-based solutions and proposes a general field matching framework that generalizes the field matching problem across different domains. By introducing the concept of String Matching Points (SMP) in string comparison, string matching accuracy and efficiency are improved compared with other commonly applied field matching algorithms. The paper discusses the development of field matching algorithms from the proposed general framework. The framework and the corresponding algorithm are tested on a public data set of the NASA publication abstract database. The approach can be applied to address similar problems in other databases.
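
    The abstract introduces String Matching Points (SMP) without defining them in detail, so the sketch below is only a generic token-based field matcher in that spirit: exact token matches act as anchor points and leftover tokens are paired by character-level similarity. The anchoring interpretation, the threshold, and the function names are assumptions for illustration, not the paper's algorithm.

```python
from difflib import SequenceMatcher


def token_similarity(t1, t2):
    """Character-level similarity between two tokens, in [0, 1]."""
    return SequenceMatcher(None, t1, t2).ratio()


def field_match(s1, s2, fuzzy_threshold=0.5):
    """Token-based field matching: exact token matches are consumed first as
    anchor points (a rough stand-in for String Matching Points), then the
    remaining tokens are paired greedily by best character-level similarity."""
    tokens1, tokens2 = s1.lower().split(), s2.lower().split()
    if not tokens1 or not tokens2:
        return 0.0

    # Anchor pass: consume exact token matches first.
    matched, leftover1, leftover2 = 0.0, [], list(tokens2)
    for t in tokens1:
        if t in leftover2:
            matched += 1.0
            leftover2.remove(t)
        else:
            leftover1.append(t)

    # Fuzzy pass: pair remaining tokens with their most similar counterpart.
    for t in leftover1:
        if not leftover2:
            break
        best = max(leftover2, key=lambda u: token_similarity(t, u))
        score = token_similarity(t, best)
        if score >= fuzzy_threshold:
            matched += score
            leftover2.remove(best)

    return matched / max(len(tokens1), len(tokens2))


# Syntactically different but semantically equivalent field values:
print(field_match("Dept. of Computer Science", "Department of Computer Science"))
```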

    The role of Industry 4.0 enabling technologies for safety management: A systematic literature review

    Innovations introduced during the Industry 4.0 era consist of the integration of the so-called "nine pillars of technology" in manufacturing, transforming the conventional factory into a smart factory. The aim of this study is to investigate the enabling technologies of Industry 4.0, focusing on those with the greatest impact on safety management. The main characteristics of these technologies are identified and described according to their use in an industrial environment. To do this, we chose a systematic literature review (SLR) to answer the research question comprehensively. Results show that the articles can be grouped according to different criteria. Moreover, we found that Industry 4.0 can increase safety levels in warehousing and logistics, and that several solutions are available for the building sector.

    A METHOD TO IDENTIFY DUPLICATE REFRESH RECORDS WITH CONTINUOUS QUERY BASED MULTIPLE WEB DATABASES

    Record matching, which identifies records that represent the same real-world entity, is an important step in data integration. In information retrieval, one of the main problems is to retrieve a set of documents that is semantically related to a given user query. Most existing work requires human-labelled training data (positive, negative, or both), which places a heavy burden on users. Existing supervised record matching methods require users to provide training data and therefore cannot be applied to web databases, where query results are generated on the fly. A new record matching method named Unsupervised Duplicate Refresh Elimination (UDRE) is proposed for identifying and eliminating duplicates among refresh records in dynamic query results. The idea of this research is to adjust the field weights used by the classifiers when calculating similarities among refresh records. Three classifiers, namely a weighted component similarity summing time-bound classifier, a support vector machine classifier, and a threshold-based support vector machine classifier, are iteratively employed within UDRE, where the first classifier uses weights concentrated on string similarity measures for comparing records from different data sources. We also design a new record alignment algorithm that aligns the attributes used to identify duplicate refresh records.
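
    The abstract names the three classifiers but not their internals, so the following only sketches the kind of first-stage step it describes: a weighted component similarity summing rule that labels the most and least similar pairs, leaving the rest for the SVM stages. The similarity function, the weights, and the thresholds are illustrative assumptions.

```python
from difflib import SequenceMatcher


def component_similarities(r1, r2, fields):
    """Per-field string similarities between two query-result records."""
    return [SequenceMatcher(None, str(r1[f]).lower(), str(r2[f]).lower()).ratio()
            for f in fields]


def wcss_label(r1, r2, fields, weights, dup_threshold=0.85, nondup_threshold=0.4):
    """Weighted component similarity summing: label a record pair as a
    duplicate, a non-duplicate, or unknown from the weighted sum of field
    similarities. The 'unknown' pairs are those a downstream SVM classifier
    would decide once trained on the confidently labelled pairs."""
    sims = component_similarities(r1, r2, fields)
    score = sum(w * s for w, s in zip(weights, sims))
    if score >= dup_threshold:
        return "duplicate", sims
    if score <= nondup_threshold:
        return "non-duplicate", sims
    return "unknown", sims


# Two refresh records returned by different web databases for the same query.
fields = ["title", "author", "year"]
weights = [0.5, 0.3, 0.2]  # would be re-estimated on each iteration
a = {"title": "Data Cleaning Basics", "author": "J. Smith", "year": "2010"}
b = {"title": "Data cleaning basics", "author": "John Smith", "year": "2010"}
label, sims = wcss_label(a, b, fields, weights)
print(label, [round(s, 2) for s in sims])
```

    In the iterative scheme the abstract describes, the pairs labelled confidently by this first stage could serve as training data for the support vector machine and threshold-based SVM classifiers, with the field weights re-estimated between rounds.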

    Quality and complexity measures for data linkage and deduplication

    Summary. Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when linking or deduplicating very large data sets. Different measures have been used to characterise the quality and complexity of data linkage algorithms, and several new metrics have been proposed. An overview of the issues involved in measuring data linkage and deduplication quality and complexity is presented in this chapter. It is shown that measures in the space of record pair comparisons can produce deceptive quality results. Various measures are discussed and recommendations are given on how to assess data linkage and deduplication quality and complexity. Key words: data or record linkage, data integration and matching, deduplication, data mining pre-processing, quality and complexity measures
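
    The chapter's warning that measures defined over the full space of record pair comparisons can be deceptive can be illustrated with the standard pairwise measures; the counts below are hypothetical, and the exact metrics the chapter recommends may differ.

```python
def linkage_quality(tp, fp, tn, fn):
    """Standard pairwise quality measures for data linkage / deduplication.
    Because the comparison space grows quadratically with data set size,
    true negatives dominate and accuracy can look deceptively high."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall,
            "f-measure": f_measure, "accuracy": accuracy}


# Linking two hypothetical data sets of 1,000 records each gives 1,000,000
# candidate record pairs, of which only 500 are true matches.
print(linkage_quality(tp=400, fp=100, tn=999_400, fn=100))
# Accuracy is ~0.9998 even though precision and recall are both only 0.8.
```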

    Correlation-based methods for data cleaning, with application to biological databases

    Ph.D. thesis (Doctor of Philosophy)
