3,983 research outputs found

    Universal Compressed Text Indexing

    The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform and on the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let $\gamma$ be the size of a string attractor for a text of length $n$. Our index takes $O(\gamma\log(n/\gamma))$ words of space and supports locating the $occ$ occurrences of any pattern of length $m$ in $O(m\log n + occ\log^{\epsilon} n)$ time, for any constant $\epsilon>0$. This is, in particular, the first index for general macro schemes and collage systems. Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries.
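To make the attractor definition in the abstract above concrete, here is a minimal brute-force sketch, not the paper's index. It assumes 0-based positions and the usual reading of the definition: every distinct substring must have at least one occurrence that spans a position in the set. The function name `is_string_attractor` is hypothetical.

```python
def is_string_attractor(text, positions):
    """Brute-force check that `positions` (0-based) is a string attractor of `text`.

    A set of positions is an attractor if every distinct substring of the text
    has at least one occurrence that covers some position in the set.
    Runs in polynomial time but is far too slow for real inputs.
    """
    n = len(text)
    pos = sorted(set(positions))
    for length in range(1, n + 1):
        # Substrings of this length with an occurrence crossing an attractor position.
        covered = set()
        for start in range(n - length + 1):
            end = start + length  # exclusive
            if any(start <= p < end for p in pos):
                covered.add(text[start:end])
        # Every substring of this length must appear in the covered set.
        for start in range(n - length + 1):
            if text[start:start + length] not in covered:
                return False
    return True


if __name__ == "__main__":
    print(is_string_attractor("abababab", {0, 1}))
    print(is_string_attractor("abababab", {0}))
```

On a highly repetitive text such as "abababab", two positions suffice, which illustrates why $\gamma$ can be much smaller than $n$.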

    VISUALIZATION OF GENETIC ALGORITHM BASED ON 2-D GRAPH TO ACCELERATE THE SEARCHING WITH HUMAN INTERVENTIONS.

    The Genetic Algorithm is an area in the field of Artificial Intelligence that is founded on the principles of biological evolution. Visualization techniques help in understanding the searching behaviour of the Genetic Algorithm. They also make user interaction possible during the search process. It is noted that active user intervention accelerates the Genetic Algorithm towards an optimal solution. In the proposed research work, the user is aided by a visualization based on the representation of multidimensional Genetic Algorithm data in 2-D space. The aim of the proposed approach is to study the benefit of using visualization techniques to explore Genetic Algorithm data based on gene values. The user participates in the search by proposing a new individual. This is different from the existing Interactive Genetic Algorithm, in which selection and evaluation of solutions is done by the users. A tool termed VIGA-2D (Visualization of Genetic Algorithm using 2-D Graph) is implemented to accomplish this goal. This visual tool displays the evolution of gene values from generation to generation in order to observe and analyse the behaviour of the search space with user interactions. Individuals for the next generation are selected by using the objective function. Hence, a novel human-machine interaction is developed in the proposed approach. The efficiency of the proposed approach is evaluated on two benchmark functions. The analysis and comparison of VIGA-2D is based on a convergence test against the results obtained from the Simple Genetic Algorithm. This comparison is based on the same parameters except for the interactions of the user. The proposed approach is applied to modelling branching structures by deriving a rule from the best solution of VIGA-2D.
The comparison of results is based on the different users' perceptions, their involvement in VIGA-2D, and the difference in fitness convergence as compared to the Simple Genetic Algorithm.
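The interaction described above, where a person proposes an individual while the objective function still performs selection, can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the VIGA-2D tool: the `propose` callback stands in for the human user, and `run_ga`, `sphere`, the crossover and mutation operators, and all parameter values are hypothetical choices.

```python
import random


def sphere(x):
    """Toy benchmark to minimise: sum of squared gene values."""
    return sum(v * v for v in x)


def run_ga(fitness, dim=2, pop_size=20, generations=30, propose=None, seed=0):
    """Minimal elitist GA in which an external proposer may inject one individual
    per generation; selection is always done by the objective function."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    history = []
    for gen in range(generations):
        if propose is not None:
            candidate = propose(gen, pop)  # stands in for the human user
            if candidate is not None:
                pop.append(list(candidate))
        pop.sort(key=fitness)
        pop = pop[:pop_size]               # objective function decides who survives
        history.append(fitness(pop[0]))
        parents = pop[:pop_size // 2]      # the best half is carried over unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover followed by small Gaussian mutation.
            children.append([rng.choice(pair) + rng.gauss(0, 0.1) for pair in zip(a, b)])
        pop = parents + children
    return min(pop, key=fitness), history


if __name__ == "__main__":
    _, plain = run_ga(sphere)
    _, guided = run_ga(sphere, propose=lambda gen, pop: [0.0, 0.0] if gen == 0 else None)
    print(plain[0], guided[0])
```

Comparing the two `history` lists with identical parameters mirrors the kind of convergence comparison the abstract describes, with the proposer as the only difference.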

    Reimagining the SSMinT Software Package

    We examine two proposed indexing algorithms taking advantage of the new SSMinT libraries. The two algorithms primarily differ in their selection of documents for learning. The batch indexing method selects some random number of documents for learning. The iterative indexing method uses a single randomly selected document to discover semantic signatures, which are then used to find additional related documents. The batch indexing method discovers one to three semantic signatures per document, resulting in poor clustering performance as evaluated by human cross-validation of clusters using the Adjusted Rand Index. The iterative indexing method discovers more semantic signatures per document, resulting in far better clustering performance using the same cross-validation method. Our new tools enable faster development of new experiments, forensic applications, and more. The experiments show that SSMinT can provide effective indexing for text data such as e-mail or web pages. We conclude with areas of future research which may benefit from utilizing SSMinT. (Abstract shortened by ProQuest.)

    Signature Files: An Integrated Access Method for Formatted and Unformatted Databases

    The signature file approach is one of the most powerful information storage and retrieval techniques for finding the data objects that are relevant to user queries. The main idea of all signature-based schemes is to reflect the essence of the data items in bit patterns (descriptors or signatures) and store them in a separate file, which acts as a filter to eliminate the non-qualifying data items for an information request. It provides an integrated access method for both formatted and unformatted databases. A comparative overview and discussion of the proposed signature generation methods and the major signature file organization schemes are presented. Applications of the signature techniques to formatted and unformatted databases, single and multiterm query cases, serial and parallel architectures, and static and dynamic environments are provided, with a special emphasis on multimedia databases, where the pioneering prototype systems using signatures yield highly encouraging results.
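The filtering idea in the abstract above can be illustrated with superimposed coding, one common signature generation method. The sketch below is a minimal example under assumptions, not any specific scheme from the surveyed work: the signature width, the bits set per word, the SHA-256-based hashing, and the names `word_signature`, `record_signature`, and `search` are all illustrative choices.

```python
import hashlib

SIG_BITS = 64        # assumed signature width
BITS_PER_WORD = 3    # assumed number of bits each word sets


def word_signature(word):
    """Map a word to a bit pattern with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    sig = 0
    for i in range(BITS_PER_WORD):
        # Two bytes of the digest select one bit position.
        pos = int.from_bytes(digest[2 * i:2 * i + 2], "big") % SIG_BITS
        sig |= 1 << pos
    return sig


def record_signature(words):
    """Superimpose (bitwise OR) the word signatures of a record."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig


def search(records, signatures, query_words):
    """Return indexes of records containing all query words.

    The signature test only filters: a record whose signature lacks a query bit
    cannot match, while a record that passes may still be a false drop and is
    verified against the actual text.
    """
    q = record_signature(query_words)
    hits = []
    for idx, (rec, sig) in enumerate(zip(records, signatures)):
        if sig & q != q:
            continue  # eliminated by the filter without reading the record
        words = {w.lower() for w in rec.split()}
        if all(w.lower() in words for w in query_words):
            hits.append(idx)
    return hits


if __name__ == "__main__":
    records = ["signature files filter records",
               "genetic algorithm search",
               "compressed text index"]
    signatures = [record_signature(r.split()) for r in records]
    print(search(records, signatures, ["genetic", "search"]))
```

Multiterm queries fall out naturally: the query signature is the OR of its word signatures, so a single bitwise test filters on all terms at once.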