Design an Approach for Finding the Similarity between the Documents

Abstract

Data management is now a very important issue. The volume of data in the cloud is very large, and web users need tools to manage information easily. Doing this manually is cumbersome and time-consuming because there are many near-duplicate results. Efficient detection of near-duplicate articles is therefore important in many applications that draw on a large amount of data for a specific task. We introduce an algorithm that extracts key phrases and matches signatures to detect near-duplicate articles. Key phrases are extracted with an N-gram algorithm (bigrams and trigrams), and the similarity between documents is computed with Jaccard similarity. The algorithms are applied to articles and text documents, and the results show that the proposed methods are more effective than other existing methods.
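The idea of combining N-gram key phrases with Jaccard similarity can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`ngrams`, `signature`, `jaccard`), the simple lowercase alphanumeric tokenizer, and the choice to treat the full set of bigrams and trigrams as the document signature are all assumptions made for illustration.

```python
import re


def ngrams(text, n):
    """Return the set of word n-grams in text (assumed tokenizer: lowercase alphanumerics)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def signature(text):
    """Hypothetical document signature: the union of its bigrams and trigrams."""
    return ngrams(text, 2) | ngrams(text, 3)


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = "the quick brown fox jumps"
doc_b = "the quick brown dog jumps"
score = jaccard(signature(doc_a), signature(doc_b))
print(f"similarity = {score:.3f}")
```

A score close to 1 suggests near-duplicate documents, while a score close to 0 suggests unrelated ones. Any threshold used to declare two documents near-duplicates would have to be chosen for the data at hand.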
