
    Universal Indexes for Highly Repetitive Document Collections

    Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also include compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower.

    Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
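
    A minimal sketch (not the paper's implementation, and with an invented posting list) of the contrast the abstract draws: the usual practice stores one variable-length code per gap of an inverted list, while run-length compression of the same differential (gap) sequence exploits the long runs of identical gaps that near-copy documents produce.

        from itertools import groupby

        def gaps(postings):
            """Differential form of a sorted inverted list: first id, then gaps."""
            return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

        def run_length(seq):
            """Run-length encode a sequence as (value, run_length) pairs."""
            return [(v, sum(1 for _ in g)) for v, g in groupby(seq)]

        # In a versioned collection, ranges of consecutive doc ids are near-copies,
        # so a term tends to occur in long runs of documents and most gaps are 1.
        postings = [3, 4, 5, 6, 7, 8, 20, 21, 22, 23, 40]
        d = gaps(postings)    # [3, 1, 1, 1, 1, 1, 12, 1, 1, 1, 17]
        print(run_length(d))  # [(3, 1), (1, 5), (12, 1), (1, 3), (17, 1)]

    Five runs replace eleven gap codes; Lempel-Ziv or grammar compression of the same gap sequence exploits repeated patterns across lists as well as within them.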

    Indexing Highly Repetitive String Collections

    Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated applications like bioinformatics, the string collections experienced a growth that outpaces Moore's Law and challenges our ability to handle them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed in order to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey we cover the algorithmic developments that have led to these data structures. We describe the distinct compression paradigms that have been used to exploit repetitiveness, the fundamental algorithmic ideas that form the base of all the existing indexes, and the various structures that have been proposed, comparing them in both theoretical and practical aspects. We conclude with the current challenges in this fascinating field.
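
    A toy illustration (my own, not taken from the survey) of why statistical compression is blind to repetitiveness: the order-0 empirical entropy of a text is unchanged when the text is repeated, while a Lempel-Ziv-style parse grows by only a few phrases per doubling.

        import math
        from collections import Counter

        def h0(s):
            """Empirical order-0 entropy in bits per symbol."""
            n = len(s)
            return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

        def lz_phrases(s):
            """Greedy LZ-style parse: extend each phrase while it still occurs
            in the already-parsed prefix, so a phrase is the longest earlier-
            occurring prefix plus one fresh symbol. Returns the phrase count."""
            i, phrases = 0, 0
            while i < len(s):
                ell = 1
                while i + ell <= len(s) and s[:i].find(s[i:i + ell]) != -1:
                    ell += 1
                i += ell
                phrases += 1
            return phrases

        t = "abracadabra"
        for k in (1, 4, 16):
            print(k, round(h0(t * k), 3), lz_phrases(t * k))

    The entropy column is identical for every k, so a statistical compressor gains nothing from the repetitions, while the phrase count stays nearly flat as the text grows 16-fold.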

    WAQS: a web-based approximate query system

    The Web is often viewed as a gigantic database holding vast stores of information and providing ubiquitous accessibility to end-users. Since its inception, the Internet has experienced explosive growth both in the number of users and the amount of content available on it. However, searching for information on the Web has become increasingly difficult. Although query languages have long been part of database management systems, the standard Structured Query Language is not suitable for Web content retrieval. In this dissertation, a new technique for document retrieval on the Web is presented. This technique is designed to allow a detailed retrieval and hence reduce the number of matches returned by typical search engines. The main objective of this technique is to allow the query to be based on not just keywords but also the location of the keywords within the logical structure of a document. In addition, the technique also provides approximate search capabilities based on the notion of Distance and Variable Length Don't Cares. The proposed techniques have been implemented in a system, called the Web-Based Approximate Query System, which contains an SQL-like query language called the Web-Based Approximate Query Language. The Web-Based Approximate Query Language has also been integrated with EnviroDaemon, an environmental domain-specific search engine, providing EnviroDaemon with more detailed searching capabilities than keyword-based search alone. Implementation details, technical results and future work are presented in this dissertation.
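
    A minimal sketch of one idea from the abstract: approximate matching with Variable Length Don't Cares. WAQL's actual syntax is not given in the abstract, so the '*' notation, the bounded-gap parameter, and the helper name below are assumptions for illustration.

        import re

        def vldc_to_regex(pattern, max_gap=None):
            """Compile a pattern whose '*' is a variable-length don't care.
            If max_gap is given, each don't care is bounded, approximating a
            distance constraint between the keywords."""
            gap = ".*?" if max_gap is None else ".{0,%d}?" % max_gap
            parts = [re.escape(p) for p in pattern.split("*")]
            return re.compile(gap.join(parts), re.DOTALL)

        doc = "Approximate query languages make Web document retrieval easier."
        print(bool(vldc_to_regex("query*retrieval").search(doc)))              # True
        print(bool(vldc_to_regex("query*retrieval", max_gap=10).search(doc)))  # False

    The second call fails because the text between the two keywords exceeds the ten-character bound, which is the flavor of distance-constrained matching the abstract describes.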

    THE USE OF RECOMMENDER SYSTEMS IN WEB APPLICATIONS – THE TROI CASE

    In the digital age, neglecting digital marketing, surveys, reviews, and the analysis of online user behavior is a key reason even powerful businesses fail, and such systems should be underpinned by artificial intelligence techniques. In this direction, the use of data mining to recommend relevant items is a state-of-the-art technique that increases user satisfaction as well as business revenue, alongside related information-gathering approaches that let our systems think and act more like humans. To that end, this thesis elaborates a Recommender System: how people interact, and how to accurately calculate and identify what people like or dislike based on their previous online behavior. The thesis also covers the methodologies recommender systems use and how mathematical equations help them model user behavior and similarity. Filters are central to recommender systems: if similar users like a product or item, what is the probability that a neighboring user will like it too? Using collaborative filtering, neighborhood-based filtering, and hybrid recommender systems built on various algorithms, a recommender system can predict whether a particular user would prefer an item, based on the user's profile and activity. Recommender systems benefit both service providers and users. The thesis also covers the strengths and weaknesses of recommender systems and how ontologies can improve them: ontology-based methods can reduce problems that content-based recommender systems are known to suffer from. Given Kosovo's GDP and the job prospects of its young people, improvement is needed, since demand exceeds supply. I therefore built an intelligent system that makes it easier for Kosovars to find a job that suits their profile, skills, knowledge, character, and location. That system is the TROI search engine, which indexes and merges all locally operating job-seeking websites into one platform with intelligent features. The thesis presents the design, implementation, testing, and evaluation of the TROI search engine; testing was done through user experiments in the running TROI environment. Results show that the functionality of the recommender system is satisfactory and helpful.
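
    A minimal sketch (hypothetical data and names, not TROI's code) of the neighborhood-based collaborative filtering the abstract describes: cosine similarity between users' rating vectors, then a similarity-weighted prediction for an item the target user has not yet rated.

        import math

        ratings = {  # user -> {item: rating}; invented job-listing feedback
            "arta":   {"job1": 5, "job2": 3, "job3": 4},
            "blerim": {"job1": 4, "job2": 3, "job3": 5, "job4": 4},
            "driton": {"job1": 1, "job2": 5, "job4": 2},
        }

        def cosine(u, v):
            """Cosine similarity over the items both users rated."""
            common = set(u) & set(v)
            if not common:
                return 0.0
            dot = sum(u[i] * v[i] for i in common)
            nu = math.sqrt(sum(u[i] ** 2 for i in common))
            nv = math.sqrt(sum(v[i] ** 2 for i in common))
            return dot / (nu * nv)

        def predict(user, item):
            """Similarity-weighted average over the neighbours who rated item."""
            sims = [(cosine(ratings[user], r), r[item])
                    for other, r in ratings.items()
                    if other != user and item in r]
            total = sum(s for s, _ in sims)
            return sum(s * r for s, r in sims) / total if total else None

        print(round(predict("arta", "job4"), 2))  # ~3.19, weighted toward blerim's 4

    The hybrid and ontology-based variants the thesis mentions layer content features and domain knowledge on top of this same neighborhood prediction.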

    Compressing and indexing aligned readsets

    Compressed full-text indexes are one of the main success stories of bioinformatics data structures, but even they struggle to handle some DNA readsets. This may seem surprising since, at least when dealing with short reads from the same individual, the readset will be highly repetitive and, thus, highly compressible. If we are not careful, however, this advantage can be more than offset by two disadvantages: first, since most base pairs are included in at least tens of reads each, the uncompressed readset is likely to be at least an order of magnitude larger than the individual's uncompressed genome; second, these indexes usually pay some space overhead for each string they store, and the total overhead can be substantial when dealing with millions of reads. The most successful compressed full-text indexes for readsets so far are based on the Extended Burrows-Wheeler Transform (EBWT) and use a sorting heuristic to try to reduce the space overhead per read, but they still treat the reads as separate strings and thus may not take full advantage of the readset's structure. For example, if we have already assembled an individual's genome from the readset, then we can usually use it to compress the readset well: e.g., we store the gap-coded list of the reads' starting positions; we store the list of their lengths, which is often highly compressible; and we store information about the sequencing errors, which are rare with short reads. There is no way, however, to plug an assembled genome into the EBWT. In this paper we show how to use one or more assembled or partially assembled genomes as the basis for a compressed full-text index of the readset. Specifically, we build a labelled tree by taking the assembled genome as a trunk and grafting onto it the reads that align to it, at the starting positions of their alignments. Next, we compute the eXtended Burrows-Wheeler Transform (XBWT) of the resulting labelled tree and build a compressed full-text index on that. Although this index can occasionally return false positives, it is usually much more compact than the alternatives. Following the established practice for datasets with many repetitions, we compare different full-text indexes by the number of runs in the transformed strings. For a human Chr19 readset, our preliminary experiments show that eliminating separator characters from the EBWT reduces the number of runs by 19%, from 220 million to 178 million, and using the XBWT reduces it by a further 15%, to 150 million.
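
    A small sketch (my own helpers, not the paper's code) of two concrete pieces of the abstract: gap-coding the reads' starting positions once a genome is assembled, and counting equal-letter runs, the measure used to compare the EBWT- and XBWT-based indexes.

        from itertools import groupby

        def gap_code(starts):
            """Gap-code sorted alignment start positions; dense coverage by
            short reads yields small, highly compressible gaps."""
            return [starts[0]] + [b - a for a, b in zip(starts, starts[1:])]

        def runs(s):
            """Number of maximal equal-letter runs in a transformed string,
            the r that run-length compressed BWT/XBWT indexes grow with."""
            return sum(1 for _ in groupby(s))

        print(gap_code([100, 103, 103, 110]))  # [100, 3, 0, 7]
        print(runs("AAACCGGGGT"))              # 4 runs: AAA CC GGGG T

    The paper's Chr19 experiments compare indexes by exactly this run count, applied to the EBWT and XBWT outputs instead of a toy string.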