
    Inverted index compression based on term and document identifier reassignment

    Ankara: Department of Computer Engineering and the Institute of Engineering and Science, Bilkent University, 2008. Thesis (Master's). Includes bibliographical references (leaves 43-46).
    Compression of inverted indexes has received great attention in recent years. An inverted index consists of a list of document identifiers, also referred to as a posting list, for each term. Compressing an inverted index reduces its size, which also improves query performance by reducing disk access times. Recent studies have shown that reassigning document identifiers has a great effect on the compression of an inverted index. In this work, we propose a novel technique that reassigns both the term and document identifiers of an inverted index by transforming the matrix representation of the index into block-diagonal form, which improves the compression ratio dramatically. We adapt the row-net hypergraph-partitioning model for this transformation into block-diagonal form, improving the compression ratio by as much as 50%. To the best of our knowledge, this method is more effective than previous inverted index compression techniques.
    Baykan, İzzet Çağrı (M.S.)
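
    For intuition on why identifier reassignment pays off, here is a hedged sketch (the helper names vbyte_encode and encoded_size are invented for illustration; the thesis's hypergraph-partitioning method itself is not shown): posting lists are stored as d-gaps, and a variable-byte coder spends fewer bytes on small gaps, so packing a term's documents under nearby identifiers shrinks the encoded list.

        # Toy illustration of d-gap compression, not the thesis's algorithm.
        def vbyte_encode(n: int) -> bytes:
            """Variable-byte code: 7 payload bits per byte, high bit ends a value."""
            out = bytearray()
            while n >= 128:
                out.append(n & 0x7F)
                n >>= 7
            out.append(n | 0x80)
            return bytes(out)

        def encoded_size(doc_ids: list[int]) -> int:
            """Bytes needed for a sorted posting list stored as vbyte d-gaps."""
            prev, total = 0, 0
            for d in sorted(doc_ids):
                total += len(vbyte_encode(d - prev))
                prev = d
            return total

        scattered = [3, 1500, 4096, 90000]  # ids spread across the collection
        clustered = [3, 4, 5, 6]            # same list after a good reassignment
        print(encoded_size(scattered), encoded_size(clustered))  # 8 4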

    Index compression for information retrieval systems

    Given the increasing amount of information available today, there is a clear need for Information Retrieval (IR) systems that can process this information in an efficient and effective way. Efficient processing means minimising the amount of time and space required to process data, whereas effective processing means identifying accurately which information is relevant to the user and which is not. Traditionally, efficiency and effectiveness are at opposite ends (what is beneficial to efficiency is usually harmful to effectiveness, and vice versa), so the challenge for IR systems is to find a compromise between efficient and effective data processing. This thesis investigates the efficiency of IR systems. It proposes several novel strategies that render IR systems more efficient by reducing their index size, an approach referred to as index compression. The index is the data structure that stores the information handled in the retrieval process. Two different approaches are proposed for index compression, namely document reordering and static index pruning. Both approaches exploit document collection characteristics in order to reduce the size of indexes, either by reassigning the document identifiers in the collection, or by selectively discarding information that is less relevant to the retrieval process by pruning the index. The index compression strategies proposed in this thesis can be grouped into two categories: (i) strategies which extend the state of the art in efficiency methods in novel ways; (ii) strategies which are derived from properties pertaining to the effectiveness of IR systems. The latter are novel both because they are derived from effectiveness as opposed to efficiency principles, and because they show that efficiency and effectiveness can be successfully combined for retrieval. The main contributions of this work are in indicating principled extensions of the state of the art in index compression, and in suggesting novel theoretically-driven index compression techniques derived from principles of IR effectiveness. All these techniques are evaluated extensively, in thorough experiments involving established datasets and baselines, which allow for a straightforward comparison with the state of the art. Moreover, the optimality of the proposed approaches is addressed from a theoretical perspective.
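
    On the static-pruning side, a minimal sketch assuming one simple uniform threshold rule: for each term, keep only postings scoring within a fraction of that term's best posting. The name prune_index and the parameter epsilon are illustrative assumptions, not the thesis's actual criteria.

        def prune_index(index: dict[str, list[tuple[int, float]]],
                        epsilon: float = 0.8) -> dict[str, list[tuple[int, float]]]:
            """index maps term -> [(doc_id, score), ...]; keep near-top postings."""
            pruned = {}
            for term, postings in index.items():
                if not postings:
                    continue
                top = max(score for _, score in postings)
                # Keep a posting only if it scores at least epsilon * best score.
                pruned[term] = sorted((d, s) for d, s in postings
                                      if s >= epsilon * top)
            return pruned

        tiny = {"web": [(1, 2.0), (4, 0.3), (9, 1.9)], "rare": [(2, 1.0)]}
        print(prune_index(tiny))  # {'web': [(1, 2.0), (9, 1.9)], 'rare': [(2, 1.0)]}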

    Integrating Skips and Bitvectors for List Intersection

    This thesis examines space-time optimizations of in-memory search engines. Search engines can answer queries quickly, but this is accomplished using significant resources in the form of multiple machines running concurrently. Improving the performance of search engines means reducing resource costs, such as hardware, energy, and cooling. These saved resources can then be used to improve the effectiveness of the search engine or provide additional value to the system. We improve the space-time performance of search engines in the context of in-memory conjunctive intersection of ordered document identifier lists. We show that reordering document identifiers can produce dense regions in these lists, where bitvectors can be used to improve the efficiency of conjunctive list intersection. Since list intersection is a fundamental building block and a major performance bottleneck for search engines, this work is relevant to all search engine researchers and developers. Our results are presented in three stages. First, we show how to combine multiple existing techniques for list intersection to improve space-time performance. We combine bitvectors for large lists with skips over compressed values for the other lists. When the skips are large and overlaid on the compressed lists, space-time performance is superior to existing techniques, such as using skips or bitvectors separately. Second, we show that grouping documents by size and ordering by URL within groups combines the skewed clustering that results from document size ordering with the tight clustering that results from URL ordering. We propose a new semi-bitvector data structure that encodes the front of a list, including groups with large documents, as a bitvector and the rest of the list as skips over compressed values. This combination produces significant space-time performance gains on top of the gains from the first stage. Third, we show how partitioning by document size into separate indexes can also produce high-density regions that can be exploited by bitvectors, with benefits similar to grouping by document size within one index. This partitioning technique requires no modification of the intersection algorithms, and it is therefore broadly applicable. We further show that any of our partitioning approaches can be combined with semi-bitvectors and grouping within each partition to effectively exploit skewed clustering and tight clustering in our dataset. A hierarchy of partitioning approaches may be required to exploit clustering in very large document collections.
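
    A minimal sketch of the core trick, under the assumption that the largest list is materialised as a plain uncompressed bitvector (the thesis's semi-bitvectors and overlaid skips are not reproduced here): conjunctive intersection then reduces to constant-time membership probes driven by the smaller list.

        def to_bitvector(doc_ids: list[int], universe: int) -> bytearray:
            """Uncompressed bitvector with one bit per document identifier."""
            bv = bytearray((universe + 7) // 8)
            for d in doc_ids:
                bv[d >> 3] |= 1 << (d & 7)
            return bv

        def intersect(small: list[int], big: bytearray) -> list[int]:
            """Drive the intersection with the smaller sorted list: each of its
            ids costs one O(1) bit probe instead of a search in the big list."""
            return [d for d in small if big[d >> 3] & (1 << (d & 7))]

        big = to_bitvector([2, 3, 5, 7, 11, 13, 17, 19], universe=32)
        print(intersect([1, 2, 3, 4, 5], big))  # [2, 3, 5]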

    Search engine optimisation using past queries

    World Wide Web search engines process millions of queries per day from users all over the world. Efficient query evaluation is achieved through the use of an inverted index, where, for each word in the collection, the index maintains a list of the documents in which that word occurs. Query processing may also require access to document-specific statistics, such as document length; word statistics, such as the number of unique documents in which a word occurs; and collection-specific statistics, such as the number of documents in the collection. The index maintains individual data structures for each of these sources of information, and repeatedly accesses each to process a query. A by-product of a web search engine is a list of all queries entered into the engine: a query log. Analyses of query logs have shown repetition of query terms in the requests made to the search system. In this work we explore techniques that take advantage of the repetition of user queries to improve the accuracy or efficiency of text search. We introduce an index organisation scheme that favours those documents most frequently requested by users and show that, in combination with early termination heuristics, query processing time can be dramatically reduced without reducing the accuracy of the search results. We examine the stability of such an ordering and show that an index based on as little as 100,000 training queries can support at least 20 million requests. We show the correlation between frequently accessed documents and relevance, and attempt to exploit this relationship to improve search effectiveness. Finally, we deconstruct the search process to show that query-time redundancy can be exploited at various levels of the search process. We develop a model that illustrates the improvements in query processing time that can be achieved by caching different components of a search system. This model is then validated by simulation using a document collection and query log. Results on our test data show that a well-designed cache can reduce disk activity by more than 30%, with a cache that is one tenth the size of the collection.
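
    The access-ordering idea can be sketched roughly as follows; this is a simplified illustration whose names (access_order, evaluate, budget) are invented here, not the thesis's exact scheme: postings are sorted by how often past queries retrieved each document, so evaluation can stop after a fixed budget with the popular documents already scored.

        from collections import Counter

        def access_order(postings: list[int], hits: Counter) -> list[int]:
            """Reorder a posting list by descending past-access frequency."""
            return sorted(postings, key=lambda d: -hits[d])

        def evaluate(postings: list[int], budget: int) -> list[int]:
            """Early termination: score only the first `budget` postings."""
            return postings[:budget]

        hits = Counter({7: 120, 3: 45, 9: 2})        # from a training query log
        ordered = access_order([3, 7, 9, 15], hits)  # [7, 3, 9, 15]
        print(evaluate(ordered, budget=2))           # [7, 3]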

    Binary RDF for Scalable Publishing, Exchanging and Consumption in the Web of Data

    The current data deluge is flooding the Web with large volumes of data represented in RDF, giving rise to the so-called "Web of Data". In this thesis we first propose an in-depth study aimed at a global understanding of the real structure of RDF datasets, and then HDT, which addresses the efficient representation of large volumes of RDF data through structures optimized for storage and transmission over networks. HDT efficiently represents an RDF dataset by splitting it into three components: the Header, the Dictionary, and the Triples structure. We then focus on providing efficient data structures for these components, occupying compressed space while allowing direct access to any piece of data.
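
    A toy sketch of the three-part split described above; real HDT uses compact dictionaries and bit-sequence triple encodings, so this mock-up (all names are illustrative) only mirrors the Header / Dictionary / Triples decomposition with plain Python containers.

        from dataclasses import dataclass, field

        @dataclass
        class HDT:
            header: dict = field(default_factory=dict)      # dataset metadata
            dictionary: dict = field(default_factory=dict)  # term -> integer id
            triples: list = field(default_factory=list)     # (s, p, o) id triples

            def add(self, s: str, p: str, o: str) -> None:
                """Map each term to an id once, then store the compact triple."""
                ids = [self.dictionary.setdefault(t, len(self.dictionary) + 1)
                       for t in (s, p, o)]
                self.triples.append(tuple(ids))

        hdt = HDT(header={"format": "toy-HDT"})
        hdt.add("ex:alice", "foaf:knows", "ex:bob")
        print(hdt.dictionary, hdt.triples)  # ids 1..3 and the triple (1, 2, 3)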

    Compact and efficient representations of graphs

    In this thesis we study the problem of creating compact and efficient representations of graphs. We propose new data structures to store and query graph data from diverse domains, paying special attention to the design of efficient solutions for attributed and RDF graphs. We have designed a new tool to generate graphs from arbitrary data through a rule definition system. It is a general-purpose solution that, to the best of our knowledge, is the first with these characteristics. Another contribution of this work is a very compact representation for attributed graphs, providing efficient access to the properties and links of the graph. We also study the problem of graph distribution in a parallel environment using compact structures, proposing nine different alternatives that are experimentally compared. We also propose a novel RDF indexing technique that supports basic SPARQL resolution in compressed space. Finally, we present a new compact structure to store ternary relationships whose design is focused on the efficient representation of RDF data. All of these proposals were experimentally evaluated with widely accepted datasets, obtaining competitive results when compared against other state-of-the-art alternatives.
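
    As a hedged illustration of how a ternary RDF relation can serve basic SPARQL triple patterns (plain Python, not the compact structure proposed in the thesis), keeping the triples sorted in more than one component order turns each pattern with fixed components into a binary search plus a range scan.

        import bisect

        class TripleIndex:
            """Triples kept in two sorted orders; every pattern with fixed
            components becomes a binary search plus a sequential range scan."""

            def __init__(self, triples):
                self.spo = sorted(triples)                           # (s, p, o)
                self.pos = sorted((p, o, s) for s, p, o in triples)  # (p, o, s)

            def match_predicate(self, p):
                """Answer the pattern (?s, p, ?o) as (subject, object) pairs."""
                i = bisect.bisect_left(self.pos, (p,))
                out = []
                while i < len(self.pos) and self.pos[i][0] == p:
                    _, o, s = self.pos[i]
                    out.append((s, o))
                    i += 1
                return out

        idx = TripleIndex([("alice", "knows", "bob"),
                           ("bob", "knows", "carol"),
                           ("alice", "age", "30")])
        print(idx.match_predicate("knows"))  # [('alice', 'bob'), ('bob', 'carol')]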

    A Search Engine Architecture Based on Collection Selection

    In this thesis, we present a distributed architecture for a Web search engine, based on the concept of collection selection. We introduce a novel approach to partitioning the collection of documents that greatly improves the effectiveness of standard collection selection techniques (CORI), and a new selection function that outperforms the state of the art. Our technique is based on the novel query-vector (QV) document model, built from the analysis of query logs, and on our strategy of co-clustering queries and documents at the same time. Incidentally, our partitioning strategy is able to identify documents that can be safely moved out of the main index (into a supplemental index) with a minimal loss in result accuracy. In our tests, we could move 50% of the collection to the supplemental index with a minimal loss in recall. By suitably partitioning the documents in the collection, our system is able to select the subset of servers containing the most relevant documents for each query. Instead of broadcasting the query to every server in the computing platform, only the most relevant servers are polled, thereby reducing the average computing cost of solving a query. We introduce a novel strategy that uses the instant load at each server to drive the query routing. We also describe a new approach to caching, able to incrementally improve the quality of the stored results. Our caching strategy is effective both in reducing computing load and in improving result quality. By combining these innovations, we can achieve extremely high precision with a reduced load w.r.t. full query broadcasting. Our system can cover 65% of the results offered by a centralized reference index (competitive recall at 5) with a computing load of only 15.6%, i.e., a peak of 156 queries out of a sliding window of 1000 queries. This is about 1/4 of the peak load reached when broadcasting queries. With a slightly higher load (24.6%), the system can cover 78% of the reference results. Overall, the proposed architecture presents a trade-off between computing cost and result quality, and we show how to guarantee very precise results in the face of a dramatic reduction in computing load. This means that, with the same computing infrastructure, our system can serve more users, more queries, and more documents.
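
    The selection step can be sketched roughly as follows. The scoring function here is a generic term-statistics heuristic standing in for the thesis's query-vector model, and all names are illustrative: each server's partition is scored for the query and only the top-k servers are polled instead of broadcasting.

        import math

        def select_servers(query_terms, partitions, k=2):
            """partitions: one dict per server mapping term -> document
            frequency in that server's partition. Returns the k best servers."""
            scores = []
            for i, df in enumerate(partitions):
                s = sum(math.log1p(df.get(t, 0)) for t in query_terms)
                scores.append((s, i))
            return [i for s, i in sorted(scores, reverse=True)[:k] if s > 0]

        parts = [{"jaguar": 40, "car": 10},   # server 0
                 {"jaguar": 2, "cat": 90},    # server 1
                 {"car": 80}]                 # server 2
        print(select_servers(["jaguar", "car"], parts, k=2))  # [0, 2]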

    Scalable succinct indexing for large text collections

    Self-indexes save space by emulating the operations of traditional data structures using basic operations on bitvectors. Succinct text indexes provide the full-text search functionality traditionally provided by suffix trees and suffix arrays, while using space equivalent to the compressed representation of the text. Succinct text indexes can therefore provide full-text search over inputs much larger than what is viable using traditional uncompressed suffix-based data structures. Fields such as Information Retrieval involve the processing of massive text collections. However, the in-memory space requirements of succinct text indexes during construction have hampered their adoption for large text collections. One promising approach to supporting larger data sets is to avoid constructing the full suffix array by using alternative indexing representations. This thesis focuses on several aspects of the scalability of text indexes to larger data sets. We identify practical improvements in the core building blocks of all succinct text indexing algorithms, and subsequently improve index performance on large data sets. We evaluate our findings using several standard text collections and demonstrate: (1) the practical applications of our improved indexing techniques; and (2) that succinct text indexes are a practical alternative to inverted indexes for a variety of top-k ranked document retrieval problems.
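
    The "basic operations on bitvectors" underlying self-indexes are rank and select. The sketch below is an unoptimised reference version: a succinct index answers both in O(1) time using o(n) extra bits, which these linear scans deliberately do not attempt.

        def rank1(bits: list[int], i: int) -> int:
            """Number of 1-bits in bits[0..i)."""
            return sum(bits[:i])

        def select1(bits: list[int], j: int) -> int:
            """Position of the j-th 1-bit (1-based), or -1 if there is none."""
            seen = 0
            for pos, b in enumerate(bits):
                seen += b
                if b and seen == j:
                    return pos
            return -1

        bits = [1, 0, 1, 1, 0, 1]
        print(rank1(bits, 4))    # 3 ones strictly before position 4
        print(select1(bits, 3))  # the third 1-bit sits at position 3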

    Assigning document identifiers to enhance compressibility of Web Search Engines indexes

    Granting efficient access to the index is a key issue for the performance of Web Search Engines (WSEs). In order to enhance memory utilization and favor fast query resolution, WSEs use Inverted File (IF) indexes where the posting lists are stored as sequences of d-gaps (i.e., differences between successive document identifiers) compressed using variable-length encoding methods. This paper describes the use of a lightweight clustering algorithm aimed at assigning identifiers to documents in a way that minimizes the average value of the d-gaps. Simulations performed on a real dataset, the Google contest collection, show that our approach obtains an IF index which is, depending on the d-gap encoding chosen, up to 23% smaller than one built over randomly assigned document identifiers. Moreover, we show, both analytically and empirically, that the complexity of our algorithm is linear in space and time.
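
    A one-pass bucketing sketch of the reassignment idea, with an invented content signature (the paper's own clustering algorithm differs): grouping similar documents and handing out consecutive identifiers bucket by bucket keeps co-occurring documents close together, hence small d-gaps.

        from collections import defaultdict

        def reassign_ids(docs: dict) -> dict:
            """docs: old_id -> set of terms. Returns old_id -> new_id such that
            documents sharing a signature get consecutive identifiers."""
            buckets = defaultdict(list)
            for old_id, terms in docs.items():
                signature = min(terms) if terms else ""  # one cheap pass per doc
                buckets[signature].append(old_id)
            mapping, next_id = {}, 1
            for sig in sorted(buckets):
                for old_id in buckets[sig]:
                    mapping[old_id] = next_id
                    next_id += 1
            return mapping

        docs = {10: {"apple", "fruit"}, 42: {"zebra"}, 7: {"apple", "pie"}}
        print(reassign_ids(docs))  # {10: 1, 7: 2, 42: 3}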