An Improving Genetic Programming Approach Based Deduplication Using KFINDMR

Abstract

Abstract-The record deduplication is the task of identifying, in a data repository, records that refer to the same real world entity or object in spite of misspelling words, types, different writing styles or even different schema representations or data types. In existing system aims at providing Unsupervised Duplication Detection (UDD) method which can be used to identify and remove the duplicate records from different data sources. Starting from the non duplicate set, the two cooperating classifiers, a Weighted Component Similarity Summing Classifier (WCSS) and Support Vector Machine (SVM) are used to iteratively identify the duplicate records from the non duplicate record and present a genetic programming (GP) approach to record deduplication. Their GP-based approach is also able to automatically find effective deduplication functions. The genetic programming approach is time consuming task so we propose new algorithm KFINDMR (KFIND using Most Represented data samples) to find the most represented data samples to improve the accuracy of the classifier. The proposed system calculates the mean value of the most represented data samples in centroid of the record members; it selects the first most represented data sample that closest to the mean value calculates the minimum distance. The system Remove the duplicate dataset samples in the system and find the optimization solution to deduplication of records or data samples

    Similar works