Incremental cluster-based retrieval using compressed cluster-skipping inverted files
We propose a unique cluster-based retrieval (CBR) strategy using a new cluster-skipping inverted file for improving query processing efficiency. The new inverted file incorporates cluster membership and centroid information along with the usual document information into a single structure. In our incremental-CBR strategy, during query evaluation, both best(-matching) clusters and the best(-matching) documents of such clusters are computed together with a single posting-list access per query term. As we switch from term to term, the best clusters are recomputed and can dynamically change. During query-document matching, only relevant portions of the posting lists corresponding to the best clusters are considered and the rest are skipped. The proposed approach is essentially tailored for environments where inverted files are compressed, and provides substantial efficiency improvement while yielding comparable, or sometimes better, effectiveness figures. Our experiments with various collections show that the incremental-CBR strategy using a compressed cluster-skipping inverted file significantly improves CPU time efficiency, regardless of query length. The new compressed inverted file imposes an acceptable storage overhead in comparison to a typical inverted file. We also show that our approach scales well with the collection size. © 2008 ACM
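The query evaluation loop described in this abstract can be illustrated with a minimal sketch. The index layout (one block per cluster, each carrying a centroid weight and its document postings), the additive scoring, and all names below are assumptions made for illustration, not the authors' actual file format or weighting scheme; compression is omitted. The two loops over `blocks` stand in for reading the centroid entries and then the per-cluster portions of the same posting list.

```python
from collections import defaultdict

# Hypothetical layout: term -> list of blocks, one block per cluster.
# Each block is (cluster_id, centroid_weight, [(doc_id, term_weight), ...]).
INDEX = {
    "inverted": [
        (0, 0.75, [(1, 0.5), (2, 0.25)]),
        (1, 0.25, [(7, 0.5)]),
    ],
    "file": [
        (0, 0.5, [(1, 0.25)]),
        (2, 0.5, [(9, 0.75)]),
    ],
}


def incremental_cbr(index, query_terms, num_best_clusters, num_best_docs):
    """Score clusters and documents together, one posting list per term."""
    cluster_scores = defaultdict(float)
    doc_scores = defaultdict(float)
    for term in query_terms:
        blocks = index.get(term, [])
        # Update cluster scores from the centroid information of this term.
        for cluster_id, centroid_weight, _ in blocks:
            cluster_scores[cluster_id] += centroid_weight
        # Recompute the best clusters; they may change from term to term.
        ranked_clusters = sorted(
            cluster_scores, key=lambda c: (-cluster_scores[c], c)
        )
        best = set(ranked_clusters[:num_best_clusters])
        # Accumulate document scores only for best clusters; skip the rest.
        for cluster_id, _, postings in blocks:
            if cluster_id not in best:
                continue
            for doc_id, weight in postings:
                doc_scores[doc_id] += weight
    ranked_docs = sorted(doc_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked_docs[:num_best_docs]


if __name__ == "__main__":
    print(incremental_cbr(INDEX, ["inverted", "file"], 1, 2))
```

With a single best cluster, the blocks of clusters 1 and 2 are never scored, which is the skipping behaviour the abstract refers to.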
Site-based dynamic pruning for query processing in search engines
Web search engines typically index and retrieve at the page level. In this study, we investigate a dynamic pruning strategy that allows the query processor to first determine the most promising websites and then proceed with the similarity computations only for the pages within those sites.
Efficient processing of category-restricted queries for web directories
We show that a cluster-skipping inverted index (CS-IIS) is a practical and efficient file structure to support category-restricted queries for searching Web directories. The query processing strategy with CS-IIS improves CPU time efficiency without imposing any limitations on the directory size. © 2008 Springer-Verlag Berlin Heidelberg
Incremental Cluster-Based Retrieval using Compressed Cluster-Skipping Inverted Files
We propose a unique cluster-based retrieval (CBR) strategy using a new cluster-skipping inverted file for improving query processing efficiency. The new inverted file incorporates cluster membership and centroid information along with the usual document information into a single structure. In our incremental-CBR strategy, during query evaluation, both the best(-matching) clusters and the best(-matching) documents of such clusters are computed together with a single posting-list access per query term. As we switch from term to term, the best clusters are recomputed and can dynamically change. During query-document matching, only relevant portions of the posting lists corresponding to the best clusters are considered and the rest are skipped. The proposed approach is essentially tailored for environments where inverted files are compressed, and provides substantial efficiency improvements while yielding comparable or sometimes better effectiveness figures. Our experiments with various collections show that the incremental-CBR strategy using a compressed cluster-skipping inverted file significantly improves CPU time efficiency regardless of the query length. The new compressed inverted file imposes an acceptable storage overhead in comparison to a typical inverted file. We also show that our approach scales well with the collection size.
Performance comparison of query evaluation techniques in parallel text retrieval systems
Ankara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2008. Thesis (Master's) -- Bilkent University, 2008. Includes bibliographical references leaves 47-52.
Today’s state-of-the-art search engines utilize the inverted index data structure
for fast text retrieval on large document collections. To parallelize the retrieval
process, the inverted index should be distributed among multiple index servers.
Generally the distribution of the inverted index is done in either a term-based or a
document-based fashion. The performances of both schemes depend on the total
number of disk accesses and the total volume of communication in the system.
The classical approach for both distributions is to use the Central Broker
Query Evaluation Scheme (CB) for parallel text retrieval. It is known that in this
approach the central broker is heavily loaded and becomes a bottleneck. Recently,
an alternative query evaluation technique, named Pipelined Query Evaluation
Scheme (PPL), has been proposed to alleviate this problem by performing the
merge operation on the index servers. In this study, we analyze the scalability
and relative performance of CB and PPL under various query loads to report
the benefits and drawbacks of each method.
Tokuç, A Aylin, M.S.
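The difference between the two schemes lies in where partial scores are merged. The following is a minimal sketch under simplifying assumptions: a term-based distribution, additive document scores, and in-process "servers" represented as dictionaries. All names and data are hypothetical, and disk access and communication costs, which the thesis actually measures, are not modelled.

```python
from collections import Counter

# Hypothetical term-based distribution: each server holds the posting
# lists (doc_id -> weight) of a disjoint subset of terms.
SERVERS = [
    {"index": {1: 2, 2: 1}},
    {"server": {2: 3, 3: 1}},
]


def central_broker(servers, query_terms):
    """CB: every server returns partial scores; the broker merges them."""
    partials = []
    for server in servers:
        partial = Counter()
        for term in query_terms:
            partial.update(server.get(term, {}))
        partials.append(partial)
    merged = Counter()
    for partial in partials:  # the merge happens at the broker
        merged.update(partial)
    return dict(merged)


def pipelined(servers, query_terms):
    """PPL: accumulators travel from server to server and are merged there."""
    accumulators = Counter()
    for server in servers:  # each server adds its own contribution
        for term in query_terms:
            accumulators.update(server.get(term, {}))
    return dict(accumulators)


if __name__ == "__main__":
    print(central_broker(SERVERS, ["index", "server"]))
    print(pipelined(SERVERS, ["index", "server"]))
```

Both functions return the same scores; what changes is which node performs the merge and how much data each node receives.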
Bilkent News Portal : a system with new event detection and tracking capabilities
Ankara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2009. Thesis (Master's) -- Bilkent University, 2009. Includes bibliographical references leaves 65-71.
News portal services such as browsing, retrieving, and filtering have become an
important research and application area as a result of the information explosion on the
Internet. In this work, we give implementation details of Bilkent News Portal that
contains various novel features ranging from personalization to new event detection and
tracking capabilities aiming at addressing the needs of news-consumers. The thesis
presents the architecture, data and file structures, and experimental foundations of the
news portal. For the implementation and evaluation of the new event detection and
tracking component, we developed a test collection: BilCol2005. The collection
contains 209,305 documents from the entire year of 2005 and involves several events,
eighty of which are annotated by humans. It enables empirical assessment of new
event detection and tracking algorithms on Turkish. For the construction of our test
collection, a web application, ETracker, is developed by following the guidelines of the
TDT research initiative. Furthermore, we experimentally evaluated the impact of
various parameters in information retrieval (IR) that have to be decided during the
implementation of a news portal that provides filtering and retrieval capabilities. For
this purpose, we investigated the effects of stemming, document length, query length,
and scalability issues.
Öcalan, Hüseyin Çağdaş, M.S.
New event detection and tracking in Turkish
Ankara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2009. Thesis (Master's) -- Bilkent University, 2009. Includes bibliographical references leaves 66-73.
The amount of information and the number of information resources on the Internet
have been growing rapidly for over a decade. This is also true for on-line
news and news providers. To overcome information overload, news consumers
prefer to track the topics that they are interested in. Topic detection and tracking
(TDT) applications aim to organize the temporally ordered stories of a news
stream according to the events. Two major problems in TDT are new event
detection (NED) and topic tracking (TT). These problems respectively focus on
finding the first stories of previously unseen new events and all subsequent stories
on a certain topic defined by a small number of initial stories. In this thesis,
the NED and TT problems are investigated in detail using the first large-scale
test collection (BilCol2005) developed by Bilkent Information Retrieval Group.
The collection contains 209,305 documents from the entire year of 2005 and involves
several events, eighty of which are annotated by humans. The
experimental results show that a simple word truncation stemming method can
statistically compete with a sophisticated stemming approach that pays attention
to the morphological structure of the language. Our statistical findings illustrate
that word stopping and the contents of the associated stopword list are important
and removing the stopwords from content can significantly improve the system
performance. We demonstrate that the confidence scores of two different similarity
measures can be combined in a straightforward manner for improving the
effectiveness.
Kardaş, Süleyman, M.S.
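The ingredients named in this abstract (word truncation stemming, stopword removal, and combining two similarity measures for new event detection) can be put together in a minimal sketch. The prefix length, the choice of cosine and Jaccard as the two measures, the plain averaging, the threshold, and the English sample stories are all assumptions made for illustration; the thesis works on Turkish news and its actual measures and parameters are not given here.

```python
import math
from collections import Counter


def truncate_stem(word, prefix_len=5):
    """Word truncation stemming: keep only the first prefix_len characters."""
    return word[:prefix_len]


def tokenize(text, stopwords=frozenset(), prefix_len=5):
    """Lowercase, drop stopwords, then apply truncation stemming."""
    words = text.lower().split()
    return [truncate_stem(w, prefix_len) for w in words if w not in stopwords]


def cosine(a, b):
    """Cosine similarity over raw term-frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard similarity over the sets of distinct stems."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def detect_new_events(stories, threshold=0.3, stopwords=frozenset()):
    """Flag a story as a new event if no earlier story is similar enough."""
    seen, flags = [], []
    for text in stories:
        tokens = tokenize(text, stopwords)
        # Combine the two similarity scores by simple averaging (assumed).
        best = max(
            ((cosine(tokens, old) + jaccard(tokens, old)) / 2 for old in seen),
            default=0.0,
        )
        flags.append(best < threshold)
        seen.append(tokens)
    return flags


if __name__ == "__main__":
    stories = [
        "earthquake hits coastal city",
        "earthquakes hit the coastal cities",
        "parliament passes budget law",
    ]
    print(detect_new_events(stories, stopwords=frozenset({"the"})))
```

In this toy stream the second story shares the stems of "earthquake" and "coastal" with the first and is therefore treated as a follow-up, while the first and third stories are flagged as new events.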