6 research outputs found

    Universal Indexes for Highly Repetitive Document Collections

    Get PDF
    Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower.Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sk{\l}odowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094

    Web Archive Services Framework for Tighter Integration Between the Past and Present Web

    Get PDF
    Web archives have contained the cultural history of the web for many years, but they still have a limited capability for access. Most of the web archiving research has focused on crawling and preservation activities, with little focus on the delivery methods. The current access methods are tightly coupled with web archive infrastructure, hard to replicate or integrate with other web archives, and do not cover all the users\u27 needs. In this dissertation, we focus on the access methods for archived web data to enable users, third-party developers, researchers, and others to gain knowledge from the web archives. We build ArcSys, a new service framework that extracts, preserves, and exposes APIs for the web archive corpus. The dissertation introduces a novel categorization technique to divide the archived corpus into four levels. For each level, we will propose suitable services and APIs that enable both users and third-party developers to build new interfaces. The first level is the content level that extracts the content from the archived web data. We develop ArcContent to expose the web archive content processed through various filters. The second level is the metadata level; we extract the metadata from the archived web data and make it available to users. We implement two services, ArcLink for temporal web graph and ArcThumb for optimizing the thumbnail creation in the web archives. The third level is the URI level that focuses on using the URI HTTP redirection status to enhance the user query. Finally, the highest level in the web archiving service framework pyramid is the archive level. In this level, we define the web archive by the characteristics of its corpus and building Web Archive Profiles. The profiles are used by the Memento Aggregator for query optimization

    Indexing methods for web archives

    Get PDF
    There have been numerous efforts recently to digitize previously published content and preserving born-digital content leading to the widespread growth of large text reposi- tories. Web archives are such continuously growing text collections which contain ver- sions of documents spanning over long time periods. Web archives present many op- portunities for historical, cultural and political analyses. Consequently there is a grow- ing need for tools which can efficiently access and search them. In this work, we are interested in indexing methods for supporting text-search work- loads over web archives like time-travel queries and phrase queries. To this end we make the following contributions: • Time-travel queries are keyword queries with a temporal predicate, e.g., “mpii saarland” @ [06/2009], which return versions of documents in the past. We in- troduce a novel index organization strategy, called index sharding, for efficiently supporting time-travel queries without incurring additional index-size blowup. We also propose index-maintenance approaches which scale to such continuously growing collections. • We develop query-optimization techniques for time-travel queries called partition selection which maximizes recall at any given query-execution stage. • We propose indexing methods to support phrase queries, e.g., “to be or not to be that is the question”. We index multi-word sequences and devise novel query- optimization methods over the indexed sequences to efficiently answer phrase queries. We demonstrate the superior performance of our approaches over existing methods by extensive experimentation on real-world web archives.In der jüngsten Vergangenheit gab es zahlreiche Bemühungen zuvor veröffentlichte Inhalte zu digitalisieren und elektronisch erstellte Inhalte zu erhalten. Dies führte zu einem weit verbreitenden Anstieg großer Textdatenbestände. Webarchive sind eine solche Art konstant ansteigender Textdatensammlung. Sie enthalten mehrere Versionen von Dokumenten, welche sich über längere Zeiträume erstrecken. Darüber hinaus bieten sie viele Möglichkeiten für historische, kulturelle und politische Analysen. Infolgedessen gibt es einen wachsenden Bedarf an Werkzeugen, die eine effiziente Suche in Webarchiven und einen effizienten Zugriff auf die Daten erlauben. Der Fokus dieser Arbeit liegt auf Indexierungsverfahren, um die Arbeitslast von Textsuche auf Webarchiven zu unterstützen, wie zum Beispiel time-travel queries oder phrase queries. Zu diesem Zweck leisten wir folgende Beiträge: • Time-travel queries sind Suchwortanfragen mit einem temporalen Prädikat. Zum Beispiel liefert die Anfrage “mpii saarland” @ [06/2009] Versionen des Dokuments aus der Vergangenheit als Ergebnis. Zur effizienten Unterstützung solcher Anfragen ohne die Indexgröße aufzublasen, stellen wir eine neue Strategie zur Organisation von Indizes dar, so genanntes index sharding. Des Weiteren schlagen wir Wartungsverfahren für Indizes vor, die für solch konstant wachsende Datensätze skalieren. • WirentwickelnTechnikenzurAnfrageoptimierungvontime-travelqueries, nachstehend partition selection genannt. Diese maximieren den Recall in jeder Phase der Anfrageverarbeitung. • Wir stellen Indexierungsmethoden vor, die phrase queries unterstützen, z. B. “Sein oder Nichtsein, das ist hier die Frage”. Wir indexieren Sequenzen bestehend aus mehreren Wörtern und entwerfen neue Optimierungsverfahren für die indexierten Sequenzen, um phrase queries effizient zu beantworten. Die Performanz dieser Verfahren wird anhand von ausführlichen Experimenten auf realen Webarchiven demonstriert
    corecore