
    Efficient partial-duplicate detection based on sequence matching


    Opal: In Vivo Based Preservation Framework for Locating Lost Web Pages

    We present Opal, a framework for interactively locating missing web pages (HTTP status code 404). Opal is an example of in vivo preservation: harnessing the collective behavior of web archives, commercial search engines, and research projects for the purpose of preservation. Opal servers learn from their experiences and are able to share their knowledge with other Opal servers using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Using cached copies that can be found on the web, Opal creates lexical signatures, which are then used to search for similar versions of the web page. Using OAI-PMH to facilitate inter-Opal learning extends the protocol in a novel manner. We present the architecture of the Opal framework, discuss a reference implementation of the framework, and present a quantitative analysis indicating that Opal could be effectively deployed.
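    As a rough illustration of the lexical-signature idea, the sketch below ranks the terms of a cached page by TF-IDF and keeps the top few as a search query. The function names, the 5-term signature size, and the toy corpus are assumptions for illustration, not Opal's actual implementation.

```python
# Sketch of lexical-signature generation as described for Opal: pick the
# terms that best distinguish a cached page, then use them as a search-
# engine query to find relocated copies of the missing page.
import math
import re
from collections import Counter

def lexical_signature(cached_page, corpus_pages, k=5):
    """Return the top-k terms of `cached_page` ranked by TF-IDF."""
    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    tf = Counter(tokenize(cached_page))
    corpus_sets = [set(tokenize(p)) for p in corpus_pages]
    n_docs = len(corpus_sets) + 1          # corpus plus the cached page

    def idf(term):
        df = 1 + sum(term in s for s in corpus_sets)
        return math.log(n_docs / df)

    scored = {t: count * idf(t) for t, count in tf.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

# The resulting terms would be joined into a quoted query and submitted
# to a commercial search engine to hunt for similar versions of the page.
```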

    Detecting the Origin of Text Segments Efficiently

    In the origin detection problem an algorithm is given a set S of documents, ordered by creation time, and a query document D. It must output, for every consecutive sequence of k alphanumeric terms in D, the earliest document in S in which the sequence appeared (if such a document exists). Algorithms for the origin detection problem can, for example, be used to detect the "origin" of text segments in D and thus to detect novel content in D. They can also find the document from which the author of D has copied the most (or show that D is mostly original). We propose novel algorithms for this problem and evaluate them together with a large number of previously published algorithms. Our results show that (1) the origin of text segments can be detected efficiently and with very high accuracy even when the space used is less than 1% of the size of the documents in S, (2) the precision degrades smoothly with the amount of available space, and (3) various estimation techniques can be used to increase the performance of the algorithms.
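    The problem statement lends itself to a simple, if space-hungry, baseline: index every k-term shingle of the documents in S by the earliest document containing it, then look up each shingle of D. The sketch below implements that baseline; the paper's algorithms instead fingerprint or sample shingles to use well under 1% of the collection's size, and all names here are illustrative.

```python
# Exact-index baseline for origin detection: for every window of k
# consecutive alphanumeric terms in query document D, report the earliest
# document in S (ordered by creation time) containing that window.
import re

def shingles(text, k):
    """All consecutive k-term sequences of alphanumeric terms in `text`."""
    terms = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [" ".join(terms[i:i + k]) for i in range(len(terms) - k + 1)]

def detect_origins(S, D, k=8):
    earliest = {}                      # shingle -> index of earliest doc in S
    for idx, doc in enumerate(S):      # S is assumed ordered by creation time
        for sh in shingles(doc, k):
            earliest.setdefault(sh, idx)
    # None marks novel content that appears in no earlier document.
    return [(sh, earliest.get(sh)) for sh in shingles(D, k)]
```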

    Identifying the Effects of Unexpected Change in a Distributed Collection of Web Documents

    It is not unusual for digital collections to degrade and suffer from problems associated with unexpected change. In previous analyses, I have found that categorizing the degree of change affecting a digital collection over time is a difficult task. More specifically, I found that categorizing this degree of change is not a binary problem in which documents are either unchanged or have changed so dramatically that they no longer fit within the scope of the collection; it is, in part, a characterization of the intent of the change. In this dissertation, I present a study that compares change detection methods based on machine learning algorithms against the assessments made by human subjects in a user study.

    Consequently, this dissertation focuses on two research questions. First, how can we categorize the various degrees of change that documents can endure? This question becomes increasingly interesting given that the resources found in a digital library are often curated and maintained by experts with affiliations to professionally managed institutions. Second, how do the automatic detection methods fare against the human assessment of change in the ACM conference list?

    The results of this dissertation are threefold. First, I provide a categorization framework that highlights the different instances of change found in an analysis of the Association for Computing Machinery conference list. Second, I present a set of procedures to classify documents according to the characteristics of change that they exhibit. Finally, I evaluate the classification procedures against the assessments of human subjects. Taking into account the results of the user evaluation and the inability of the test subjects to recognize some instances of change, the main conclusion I derive from my dissertation is that managing the effects of unexpected change is a more serious problem than had previously been anticipated, one requiring the immediate attention of collection managers and curators.
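    To make the non-binary view of change concrete, here is a deliberately simplified sketch that grades the difference between two snapshots of a document into rough categories. The similarity measure, thresholds, and category names are invented for illustration; they are not the categorization framework or classifiers developed in the dissertation.

```python
# Toy illustration of grading degree of change between two snapshots of a
# document, rather than treating change as a changed/unchanged binary.
from difflib import SequenceMatcher

def categorize_change(old_text, new_text):
    sim = SequenceMatcher(None, old_text, new_text).ratio()
    if sim > 0.95:
        return "unchanged"
    if sim > 0.60:
        return "revised"                 # edited but still within scope
    return "off-topic or replaced"       # candidate for curator attention

print(categorize_change("CFP: ACM DL 2004, June 7-11",
                        "CFP: ACM DL 2004, June 7-11, Tucson AZ"))
```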

    Detection of near-duplicates in large image collections

    The vast numbers of images on the Web include many duplicates, and an even larger number of near-duplicate variants derived from the same original. These include thumbnails stored by search engines, copies shared by various news portals, and images that appear on multiple web sites, legitimately or otherwise. Such near-duplicates appear in the results of many web image searches, constitute redundancy, and may also represent infringements of copyright. Digital images can be easily altered through simple manipulations such as conversion to grey-scale, colour-balance change, rescaling, rotation, and cropping. Any of these operations defeats simple duplicate-detection methods such as bit-level hashing. The ability to detect such variants with a reasonable degree of reliability and accuracy would support reduction of redundancy in collections and in the presentation of search results, and would also allow detection of possible copyright violations.

    Some existing methods for identifying near-duplicates are derived from computer vision techniques; these have shown high effectiveness for this domain, but are computationally expensive and therefore impractical for large image collections. Other methods address the problem using conventional CBIR approaches that are more efficient but typically not as robust. None of the previous methods has addressed the problem in its entirety, and none has addressed the large-scale near-duplicate problem on the Web; there has been no analysis of the kinds of alterations that are common on the Web, nor any evaluation of whether real cases of near-duplication can in fact be identified.

    In this thesis, we analyse the different types of alterations and near-duplicates present in a range of popular web image searches, and establish a collection and evaluation ground truth using real-world near-duplicate examples. We present a simple ranking approach that reduces the number of local descriptors and therefore improves the efficiency of the descriptor-based retrieval method for near-duplicate detection. The descriptor-based method has been shown to produce near-perfect detection of near-duplicates, but was previously computationally very expensive. We show that, while maintaining comparable effectiveness, our method scales well to collections of hundreds of thousands of images. We also explore a more compact indexing structure to support near-duplicate image detection.

    We develop a method to automatically detect the pair-wise near-duplicate relationship of images without the use of a query. We adapt the hash-based probabilistic counting method --- originally used for near-duplicate text document detection --- to local descriptors; our adaptation offers the first effective and efficient non-query-based approach to this domain. We further incorporate our pair-wise detection approach into clustering of near-duplicates. We present a clustering method specifically for near-duplicate images, arguably the first clustering method to achieve a high level of effectiveness in this domain. We also show that near-duplicates within a collection of a million images can be clustered effectively with our approach in less than an hour, using relatively modest computational resources. Overall, our proposed methods provide practical approaches to the detection and management of near-duplicate images in large collections.
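    As a rough sketch of the hash-based probabilistic counting idea the thesis adapts from text to images, the code below builds a min-hash signature over a set of quantized local-descriptor ids and estimates set overlap from signature agreement. The hashing scheme, sketch size, and candidate threshold are assumptions for illustration, not the thesis's actual method.

```python
# Min-hash sketch over quantized descriptor ids: two images whose
# signatures agree on many coordinates are near-duplicate candidates,
# enabling pair-wise detection without a query image.
import random

def minhash_signature(feature_ids, num_hashes=64, seed=42):
    """Min-hash signature of a set of integer feature ids."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, 2**31), rng.randrange(2**31))
              for _ in range(num_hashes)]
    prime = 2**31 - 1
    return [min((a * f + b) % prime for f in feature_ids)
            for a, b in params]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing coordinates estimates the Jaccard overlap."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Images whose estimated overlap exceeds a threshold (say 0.35) would be
# linked pair-wise and then grouped by a clustering pass.
a = minhash_signature({1, 5, 9, 200, 777})
b = minhash_signature({1, 5, 9, 200, 778})
print(estimated_jaccard(a, b))
```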