4 research outputs found

    Detection and management of redundancy for information retrieval

    Get PDF
    The growth of the web, authoring software, and electronic publishing has led to the emergence of a new type of document collection that is decentralised, amorphous, dynamic, and anarchic. In such collections, redundancy is a significant issue. Documents can spread and propagate across such collections without any control or moderation. Redundancy can interfere with the information retrieval process, leading to decreased user amenity in accessing information from these collections, and thus must be effectively managed. The precise definition of redundancy varies with the application. We restrict ourselves to documents that are co-derivative: those that share a common heritage, and hence contain passages of common text. We explore document fingerprinting, a well-known technique for the detection of co-derivative document pairs. Our new lossless fingerprinting algorithm improves the effectiveness of a range of document fingerprinting approaches. We empirically show that our algorithm can be highly effective at discovering co-derivative document pairs in large collections. We study the occurrence and management of redundancy in a range of application domains. On the web, we find that document fingerprinting is able to identify widespread redundancy, and that this redundancy has a significant detrimental effect on the quality of search results. Based on user studies, we suggest that redundancy is most appropriately managed as a postprocessing step on the ranked list and explain how and why this should be done. In the genomic area of sequence homology search, we explain why the existing techniques for redundancy discovery are increasingly inefficient, and present a critique of the current approaches to redundancy management. We show how document fingerprinting with a modified version of our algorithm provides significant efficiency improvements, and propose a new approach to redundancy management based on wildcards. We demonstrate that our scheme provides the benefits of existing techniques but does not have their deficiencies. Redundancy in distributed information retrieval systems - where different parts of the collection are searched by autonomous servers - cannot be effectively managed using traditional fingerprinting techniques. We thus propose a new data structure, the grainy hash vector, for redundancy detection and management in this environment. We show in preliminary tests that the grainy hash vector is able to accurately detect a good proportion of redundant document pairs while maintaining low resource usage

    Algorithmic Analysis of Infinite-State Systems

    Get PDF
    Many important software systems, including communication protocols and concurrent and distributed algorithms generate infinite state-spaces. Model-checking which is the most prominent algorithmic technique for the verification of concurrent systems is restricted to the analysis of finite-state models. Algorithmic analysis of infinite-state models is complicated--most interesting properties are undecidable for sufficiently expressive classes of infinite-state models. In this thesis, we focus on the development of algorithmic analysis techniques for two important classes of infinite-state models: FIFO Systems and Parameterized Systems. FIFO systems consisting of a set of finite-state machines that communicate via unbounded, perfect, FIFO channels arise naturally in the analysis of distributed protocols. We study the problem of computing the set of reachable states of a FIFO system composed of piecewise components. This problem is closely related to calculating the set of all possible channel contents, i.e. the limit language. We present new algorithms for calculating the limit language of a system with a single communication channel and important subclasses of multi-channel systems. We also discuss the complexity of these algorithms. Furthermore, we present a procedure that translates a piecewise FIFO system to an abridged structure, representing an expressive abstraction of the system. We show that we can analyze the infinite computations of the more concrete model by analyzing the computations of the finite, abridged model. Parameterized systems are a common model of computation for concurrent systems consisting of an arbitrary number of homogenous processes. We study the reachability problem in parameterized systems of infinite-state processes. We describe a framework that combines Abstract Interpretation with a backward-reachability algorithm. Our key idea is to create an abstract domain in which each element (a) represents the lower bound on the number of processes at a control location and (b) employs a numeric abstract domain to capture arithmetic relations among variables of the processes. We also provide an extrapolation operator for the domain to guarantee sound termination of the backward-reachability algorithm

    Program Analysis as Model Checking

    Get PDF
    corecore