7 research outputs found
Diversification Based Static Index Pruning - Application to Temporal Collections
Web archives now preserve the history of large portions of the web. As
media shift from print to digital editions, access to these huge
information sources is drawing increasing attention from national and
international institutions, as well as from the research community. These
collections are intrinsically large, leading to index files that do not fit into
memory and to increased query response times. Decreasing the index size is a
direct way to decrease query response time.
Static index pruning methods reduce the size of indexes by removing a part of
the postings. In the context of web archives, it is necessary to remove
postings while preserving the temporal diversity of the archive. None of the
existing pruning approaches take (temporal) diversification into account.
In this paper, we propose a diversification-based static index pruning
method. It differs from the existing pruning approaches by integrating
diversification within the pruning context. We aim to prune the index while
preserving retrieval effectiveness and diversity, by pruning so as to maximize a
given IR evaluation metric such as DCG. We show how to apply this approach in the
context of web archives. Finally, we show on two collections that search
effectiveness in temporal collections after pruning can be improved using our
approach rather than diversity-oblivious approaches.
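The abstract names DCG as the metric to maximize but does not say which variant is used. As a minimal sketch, assuming the common formulation in which the gain at rank i is divided by log2(i + 1), the metric can be computed as follows. The function name `dcg` and the cutoff parameter `k` are illustrative choices and do not come from the paper.

```python
import math


def dcg(relevances, k=None):
    """Discounted cumulative gain of a ranked list of relevance grades.

    Assumes the log2(i + 1) discount; the paper may use another variant.
    """
    if k is not None:
        relevances = relevances[:k]
    # The result at 1-based rank i contributes rel / log2(i + 1).
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))
```

Under this definition a relevant document contributes less the lower it is ranked, so a pruned index that pushes relevant documents down the ranking, or drops them, scores lower than the original index.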
Static index pruning in web search engines: Combining term and document popularities with query views
Static index pruning techniques permanently remove a presumably redundant part of an inverted file, to reduce the file size and query processing time. These techniques differ in deciding which parts of an index can be removed safely; that is, without changing the top-ranked query results. As defined in the literature, the query view of a document is the set of query terms that access this particular document, that is, retrieve this document among their top results. In this paper, we first propose using query views to improve the quality of the top results compared against the original results. We incorporate query views in a number of static pruning strategies, namely term-centric, document-centric, term popularity based and document access popularity based approaches, and show that the new strategies considerably outperform their counterparts, especially for the higher levels of pruning and for both disjunctive and conjunctive query processing. Additionally, we combine the notions of term and document access popularity to form new pruning strategies, and further extend these strategies with the query views. The new strategies improve the result quality especially for conjunctive query processing, which is the default and most common search mode of a search engine.
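The notion of a query view can be illustrated with a short sketch. The abstract defines the view of a document as the set of query terms that retrieve it among their top results, but it does not spell out how views enter each pruning strategy. The code below assumes one plausible reading: a posting (term, document) is protected from pruning whenever the term belongs to the document's query view. The input format and the names `build_query_views` and `protect_postings` are hypothetical.

```python
from collections import defaultdict


def build_query_views(query_log_results):
    """Map each document to the set of query terms that retrieved it.

    query_log_results: iterable of (query_terms, top_doc_ids) pairs,
    a hypothetical stand-in for results derived from a query log.
    """
    views = defaultdict(set)
    for terms, top_docs in query_log_results:
        for doc_id in top_docs:
            views[doc_id].update(terms)
    return views


def protect_postings(index, views):
    """Return, per term, the postings whose term is in the document's view.

    Assumption: a view-aware pruner would never remove these postings.
    index: dict mapping term -> list of doc ids.
    """
    return {
        term: [d for d in postings if term in views.get(d, ())]
        for term, postings in index.items()
    }
```

Any base strategy (term-centric, document-centric, or popularity based) could then choose what to prune only among the postings that are not protected.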
Static index pruning in web search engines
Static index pruning techniques permanently remove a presumably redundant part of an inverted file, to reduce the file size and query processing time. These techniques differ in deciding which parts of an index can be removed safely; that is, without changing the top-ranked query results. As defined in the literature, the query view of a document is the set of query terms that access this particular document, that is, retrieve this document among their top results. In this paper, we first propose using query views to improve the quality of the top results compared against the original results. We incorporate query views in a number of static pruning strategies, namely term-centric, document-centric, term popularity based and document access popularity based approaches, and show that the new strategies considerably outperform their counterparts, especially for the higher levels of pruning and for both disjunctive and conjunctive query processing. Additionally, we combine the notions of term and document access popularity to form new pruning strategies, and further extend these strategies with the query views. The new strategies improve the result quality especially for conjunctive query processing, which is the default and most common search mode of a search engine.
Static index pruning in web search engines: Combining term and document popularities with query views
Static index pruning techniques permanently remove a presumably redundant part of an inverted file, to reduce the file size and query processing time. These techniques differ in deciding which parts of an index can be removed safely; that is, without changing the top-ranked query results. As defined in the literature, the query view of a document is the set of query terms that access this particular document, that is, retrieve this document among their top results. In this paper, we first propose using query views to improve the quality of the top results compared against the original results. We incorporate query views in a number of static pruning strategies, namely term-centric, document-centric, term popularity based and document access popularity based approaches, and show that the new strategies considerably outperform their counterparts, especially for the higher levels of pruning and for both disjunctive and conjunctive query processing. Additionally, we combine the notions of term and document access popularity to form new pruning strategies, and further extend these strategies with the query views. The new strategies improve the result quality especially for conjunctive query processing, which is the default and most common search mode of a search engine. © 2012 ACM
Query-driven indexing in large-scale distributed systems
Efficient and effective search in large-scale data repositories requires complex indexing solutions deployed on a large number of servers. Web search engines such as Google and Yahoo! already rely upon complex systems to be able to return relevant query results and keep processing times within the comfortable sub-second limit. Nevertheless, the exponential growth of the amount of content on the Web poses serious challenges with respect to scalability. Coping with these challenges requires novel indexing solutions that not only remain scalable but also preserve the search accuracy. In this thesis we introduce and explore the concept of query-driven indexing, an index construction strategy that uses caching techniques to adapt to the querying patterns expressed by users. We suggest abandoning the strict distinction between indexing and caching and building a distributed indexing structure, or a distributed cache, that is optimized for the current query load. Our experimental and theoretical analysis shows that employing query-driven indexing is especially beneficial when the content is (geographically) distributed in a Peer-to-Peer network. In such a setting extensive bandwidth consumption has been identified as one of the major obstacles for efficient large-scale search. Our indexing mechanisms combat this problem by maintaining the query popularity statistics and by indexing (caching) intermediate query results that are requested frequently. We present several indexing strategies for processing multi-keyword and XPath queries over distributed collections of textual and XML documents respectively. Experimental evaluations show significant overall traffic reduction compared to the state-of-the-art approaches. We also study possible query-driven optimizations for Web search engine architectures. Contrary to the Peer-to-Peer setting, Web search engines use centralized caching of query results to reduce the processing load on the main index.
We analyze real search engine query logs and show that the changes in query traffic that such a results cache induces fundamentally affect indexing performance. In particular, we study its impact on index pruning efficiency. We show that combining both techniques efficiently reduces query processing costs and is thus practical to use in Web search engines.
Improving the efficiency of search engines : strategies for focused crawling, searching, and index pruning
Ankara: The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2009. Thesis (Ph.D.) -- Bilkent University, 2009. Includes bibliographical references, leaves 157-169.
Search engines are the primary means of retrieval for text data that is abundantly
available on the Web. A standard search engine should carry out three
fundamental tasks, namely crawling the Web, indexing the crawled content, and
finally processing the queries using the index. Devising efficient methods for these
tasks is an important research topic. In this thesis, we introduce efficient strategies
related to all three tasks involved in a search engine. Most of the proposed
strategies are essentially applicable when a grouping of documents in its broadest
sense (i.e., in terms of automatically obtained classes/clusters, or manually
edited categories) is readily available or can be constructed in a feasible manner.
Additionally, we also introduce static index pruning strategies that are based on
the query views.
For the crawling task, we propose a rule-based focused crawling strategy that
exploits interclass rules among the document classes in a topic taxonomy. These
rules capture the probability of having hyperlinks between two classes. The rule-based
crawler can tunnel toward the on-topic pages by following a path of off-topic
pages, and thus yields a higher harvest rate for crawling on-topic pages.
In the context of indexing and query processing tasks, we concentrate on conducting
efficient search, again, using document groups; i.e., clusters or categories.
In typical cluster-based retrieval (CBR), first, clusters that are most similar to a
given free-text query are determined, and then documents from these clusters are
selected to form the final ranked output. For efficient CBR, we first identify and
evaluate some alternative query processing strategies. Next, we introduce a new
index organization, so-called cluster-skipping inverted index structure (CS-IIS).
It is shown that typical-CBR with CS-IIS outperforms previous CBR strategies
(with an ordinary index) for a number of datasets and under varying search parameters.
In this thesis, an enhanced version of CS-IIS is further proposed, in
which all information to compute query-cluster similarities during query evaluation
is stored. We introduce an incremental-CBR strategy that operates on top
of this latter index structure, and demonstrate its search efficiency for different
scenarios.
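The two steps of typical cluster-based retrieval described above, selecting the best clusters and then ranking only their documents, can be sketched as follows. This is a simplified illustration that assumes sparse term-weight dictionaries and dot-product similarity; it does not model the cluster-skipping inverted index (CS-IIS) itself, and the names `dot` and `typical_cbr` and all parameters are hypothetical.

```python
def dot(a, b):
    """Dot product of two sparse vectors stored as term -> weight dicts."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def typical_cbr(query, centroids, members, doc_vectors, n_clusters=2, n_docs=10):
    """Rank documents drawn only from the clusters most similar to the query.

    centroids: cluster id -> centroid vector
    members: cluster id -> list of doc ids in that cluster
    doc_vectors: doc id -> document vector
    """
    # Step 1: pick the clusters whose centroids are most similar to the query.
    best = sorted(centroids, key=lambda c: dot(query, centroids[c]),
                  reverse=True)[:n_clusters]
    # Step 2: score only the documents that belong to the selected clusters.
    candidates = {d for c in best for d in members[c]}
    scored = [(dot(query, doc_vectors[d]), d) for d in candidates]
    scored = [(s, d) for s, d in scored if s > 0]
    # Highest score first; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [d for _, d in scored[:n_docs]]
```

Documents outside the selected clusters are never scored, which is where the efficiency gain comes from, at the risk of missing a good document that sits in a poorly matching cluster.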
Finally, we exploit query views that are obtained from the search engine query
logs to tailor more effective static pruning techniques. This is also related to the
indexing task involved in a search engine. In particular, the query view approach
is incorporated into a set of existing pruning strategies, as well as some new
variants proposed by us. We show that query view based strategies significantly
outperform the existing approaches in terms of the query output quality, for both
disjunctive and conjunctive evaluation of queries.
Altıngövde, İsmail Sengör. Ph.D.
Index compression for information retrieval systems
[Abstract]
Given the increasing amount of information that is available today, there is a clear need for Information
Retrieval (IR) systems that can process this information in an efficient and effective way. Efficient
processing means minimising the amount of time and space required to process data, whereas
effective processing means identifying accurately which information is relevant to the user and
which is not. Traditionally, efficiency and effectiveness are at opposite ends (what is beneficial to
efficiency is usually harmful to effectiveness, and vice versa), so the challenge of IR systems is to find
a compromise between efficient and effective data processing.
This thesis investigates the efficiency of IR systems. It suggests several novel strategies that
can render IR systems more efficient by reducing the index size of IR systems, referred to as index
compression. The index is the data structure that stores the information handled in the retrieval
process. Two different approaches are proposed for index compression, namely document reordering
and static index pruning. Both of these approaches exploit document collection characteristics in
order to reduce the size of indexes, either by reassigning the document identifiers in the collection in
the index, or by selectively discarding information that is less relevant to the retrieval process by
pruning the index.
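The abstract does not say why reassigning document identifiers shrinks an index. A common explanation, assumed here and not taken from the thesis, is that posting lists are stored as gaps between consecutive identifiers and that small gaps need fewer bytes under a variable-byte code. The sketch below measures that effect; all function names are illustrative.

```python
def d_gaps(doc_ids):
    """Turn doc ids into the first id followed by successive differences."""
    ids = sorted(doc_ids)
    if not ids:
        return []
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]


def vbyte_len(n):
    """Bytes needed to store n with 7 payload bits per byte."""
    length = 1
    while n >= 128:
        n >>= 7
        length += 1
    return length


def posting_list_bytes(doc_ids):
    """Size in bytes of a posting list stored as variable-byte coded gaps."""
    return sum(vbyte_len(g) for g in d_gaps(doc_ids))


def reorder(index, mapping):
    """Apply a doc id reassignment (old id -> new id) to every posting list."""
    return {t: [mapping[d] for d in docs] for t, docs in index.items()}
```

For example, a term occurring in documents 5, 1000 and 200000 has gaps 5, 995 and 199000; if a reordering maps those documents to identifiers 1, 2 and 3, every gap becomes 1 and each fits in a single byte.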
The index compression strategies proposed in this thesis can be grouped into two categories: (i)
Strategies which extend state of the art in the field of efficiency methods in novel ways. (ii) Strategies
which are derived from properties pertaining to the effectiveness of IR systems; these are novel
strategies, because they are derived from effectiveness as opposed to efficiency principles, and also
because they show that efficiency and effectiveness can be successfully combined for retrieval.
The main contributions of this work are in indicating principled extensions of state of the art in
index compression, and also in suggesting novel theoretically-driven index compression techniques
which are derived from principles of IR effectiveness. All these techniques are evaluated extensively, in
thorough experiments involving established datasets and baselines, which allow for a straight-forward
comparison with state of the art. Moreover, the optimality of the proposed approaches is addressed
from a theoretical perspective.
[Summary]
Given the growing amount of information available today, there is a clear need for Information Retrieval (IR) systems capable of processing that information effectively and efficiently. In this context, efficient refers to the amount of time and space required to process data, whereas effective means accurately identifying which information is relevant to the user and which is not. Traditionally, efficiency and effectiveness sit at opposite poles (what benefits efficiency usually harms effectiveness, and vice versa), so a challenge for IR systems is to find a suitable compromise between effective and efficient data processing.
This thesis investigates the problem of efficiency in IR systems. It suggests several novel strategies that allow the indexes of IR systems to be reduced, framed within the techniques known as index compression. The index is the data structure that stores the information used in the retrieval process. Two different approaches to index compression are presented, referred to as document reordering and static index pruning. Both approaches exploit characteristics of document collections to reduce the final size of the indexes, either by reassigning the identifiers of the documents in the collection or by selectively discarding the information that is "less relevant" to the retrieval process.
The compression strategies proposed in this thesis can be grouped into two categories: (i) strategies that extend the state of the art in efficiency in a novel way, and (ii) strategies derived from properties related to the principles of effectiveness in IR systems; these strategies are novel because they are derived from effectiveness principles as opposed to efficiency principles, and because they reveal how efficiency and effectiveness can be combined effectively for information retrieval.
The contributions of this thesis cover the development of state-of-the-art techniques in index compression and the derivation of compression techniques based on theoretical foundations derived from the principles of effectiveness of IR systems. All these techniques have been evaluated extensively in numerous experiments involving well-established datasets and baseline techniques, which allow a direct comparison with the state of the art. Finally, the optimality of the presented approaches is addressed from a theoretical perspective.