21 research outputs found

    Dublin City University at the TREC 2005 terabyte track

    For the 2005 Terabyte track in TREC, Dublin City University participated in all three tasks: Adhoc, Efficiency and Named Page Finding. Our runs in all tasks focussed primarily on the application of "Top Subset Retrieval" to the Terabyte Track. This retrieval approach uses different types of sorted inverted indices so that fewer documents are processed, reducing query times while minimising the loss of effectiveness in terms of query precision. We also compare a distributed version of our Físréal search system [1][2] against the same system deployed on a single machine.

    Dublin City University at the TREC 2006 terabyte track

    For the 2006 Terabyte track in TREC, Dublin City University’s participation was focussed on the ad hoc search task. As in the previous two years [7, 4], our experiments on the Terabyte track have concentrated on the evaluation of a sorted inverted index, the aim of which is to sort the postings within each posting list so that only a limited number of postings need to be processed from each list, while minimising the loss of effectiveness in terms of query precision. This is done using the Físréal search system, developed at Dublin City University [4, 8].
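
    The idea of processing only a limited number of postings from each sorted list can be illustrated with a minimal sketch. The sort key (descending term frequency), the raw term-frequency scoring, and the names `build_sorted_index`, `search` and `max_postings` are assumptions made for illustration; the abstracts do not specify how Físréal orders or scores postings.

```python
from collections import defaultdict


def build_sorted_index(docs):
    """Build an inverted index whose posting lists are sorted by
    descending term frequency (an assumed sort key)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {
        term: sorted(postings.items(), key=lambda p: (-p[1], p[0]))
        for term, postings in index.items()
    }


def search(index, query, max_postings, k=10):
    """Score documents using only the first `max_postings` entries
    of each query term's posting list."""
    scores = defaultdict(float)
    for term in query.lower().split():
        # Truncation is where the time saving comes from: the tail of
        # each list (the lowest-frequency postings) is never read.
        for doc_id, tf in index.get(term, [])[:max_postings]:
            scores[doc_id] += tf
    return sorted(scores.items(), key=lambda s: (-s[1], s[0]))[:k]


docs = {
    "d1": "terabyte track terabyte",
    "d2": "terabyte retrieval",
    "d3": "track track track",
}
index = build_sorted_index(docs)
truncated = search(index, "terabyte track", max_postings=1)
full = search(index, "terabyte track", max_postings=2)
```

    With `max_postings=1` only two postings are read and `d2` never enters the result set, which is the kind of effectiveness loss the experiments aim to keep small.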

    Setting per-field normalisation hyper-parameters for the named-page finding search task

    Per-field normalisation has been shown to be effective for Web search tasks, e.g. named-page finding. However, per-field normalisation also suffers from having hyper-parameters to tune on a per-field basis. In this paper, we argue that the purpose of per-field normalisation is to adjust the linear relationship between field length and term frequency. We experiment with standard Web test collections, using three document fields, namely the body of the document, its title, and the anchor text of its incoming links. From our experiments, we find that across different collections, the linear correlation values, given by the optimised hyper-parameter settings, are proportional to the maximum negative linear correlation. Based on this observation, we devise an automatic method for setting the per-field normalisation hyper-parameter values without the use of relevance assessment for tuning. According to the evaluation results, this method is shown to be effective for the body and title fields. In addition, the difficulty in setting the per-field normalisation hyper-parameter for the anchor text field is explained.

    Distributed Information Retrieval using Keyword Auctions

    This report motivates the need for large-scale distributed approaches to information retrieval, and proposes solutions based on keyword auctions.

    Multi-Stage Search Architectures for Streaming Documents

    The web is becoming more dynamic due to the increasing engagement and contribution of Internet users in the age of social media. A more dynamic web presents new challenges for web search--an important application of Information Retrieval (IR). A stream of new documents constantly flows into the web at a high rate, adding to the old content. In many cases, documents quickly lose their relevance. In these time-sensitive environments, finding relevant content in response to user queries requires a real-time search service; immediate availability of content for search and a fast ranking, which requires an optimized search architecture. These aspects of today's web are at odds with how academic IR researchers have traditionally viewed the web, as a collection of static documents. Moreover, search architectures have received little attention in the IR literature. Therefore, academic IR research, for the most part, does not provide a mechanism to efficiently handle a high-velocity stream of documents, nor does it facilitate real-time ranking. This dissertation addresses the aforementioned shortcomings. We present an efficient mechanism to index a stream of documents, thereby enabling immediate availability of content. Our indexer works entirely in main memory and provides a mechanism to control inverted list contiguity, thereby enabling faster retrieval. Additionally, we consider document ranking with a machine-learned model, dubbed "Learning to Rank" (LTR), and introduce a novel multi-stage search architecture that enables fast retrieval and allows for more design flexibility. The stages of our architecture include candidate generation (top k retrieval), feature extraction, and document re-ranking. We compare this architecture with a traditional monolithic architecture where candidate generation and feature extraction occur together. As we lay out our architecture, we present optimizations to each stage to facilitate low-latency ranking. These optimizations include a fast approximate top k retrieval algorithm, document vectors for feature extraction, architecture-conscious implementations of tree ensembles for LTR using predication and vectorization, and algorithms to train tree-based LTR models that are fast to evaluate. We also study the efficiency-effectiveness tradeoffs of these techniques, and empirically evaluate our end-to-end architecture on microblog document collections. We show that our techniques improve efficiency without degrading quality.
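
    The three stages named in this abstract (candidate generation, feature extraction, re-ranking) can be sketched as a small pipeline. Everything concrete here is an assumption for illustration: the term-frequency candidate scorer, the two features, and the linear model standing in for the tree ensembles the dissertation describes. All function names are hypothetical.

```python
from collections import Counter


def generate_candidates(index, query_terms, k):
    """Stage 1: cheap top-k retrieval by summed term frequency."""
    scores = Counter()
    for term in query_terms:
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    ranked = sorted(scores.items(), key=lambda s: (-s[1], s[0]))
    return [doc_id for doc_id, _ in ranked[:k]]


def extract_features(doc_vectors, query_terms, doc_id):
    """Stage 2: features computed from a per-document term vector,
    and only for documents that survived stage 1."""
    vec = doc_vectors[doc_id]
    length = sum(vec.values())
    matched = sum(vec.get(t, 0) for t in query_terms)
    coverage = sum(1 for t in query_terms if t in vec) / len(query_terms)
    return [matched / length, coverage]


def rerank(candidates, features, weights):
    """Stage 3: re-score candidates with a stand-in linear model
    (an assumption; the source uses tree-based LTR models)."""
    scored = [
        (sum(w * f for w, f in zip(weights, features[d])), d)
        for d in candidates
    ]
    return [d for _, d in sorted(scored, key=lambda s: (-s[0], s[1]))]


def multi_stage_search(docs, query, k, weights):
    terms = query.lower().split()
    if not terms:
        return []
    doc_vectors = {d: Counter(t.lower().split()) for d, t in docs.items()}
    index = {}
    for d, vec in doc_vectors.items():
        for term, tf in vec.items():
            index.setdefault(term, {})[d] = tf
    candidates = generate_candidates(index, terms, k)
    features = {d: extract_features(doc_vectors, terms, d) for d in candidates}
    return rerank(candidates, features, weights)


docs = {
    "a": "fast search fast search fast extra extra extra extra extra",
    "b": "fast search",
    "c": "slow index",
    "d": "fast x y z w v",
}
result = multi_stage_search(docs, "fast search", k=2, weights=[1.0, 1.0])
```

    Stage 1 ranks `a` ahead of `b` on raw counts, while the length-aware features in stage 2 let stage 3 reverse that order. Keeping feature extraction separate means it only runs on the k surviving candidates.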

    The Feasibility of Brute Force Scans for Real-Time Tweet Search

    The real-time search problem requires making ingested documents immediately searchable, which presents architectural challenges for systems built around inverted indexing. In this paper, we explore a radical proposition: What if we abandon document inversion and instead adopt an architecture based on brute force scans of document representations? In such a design, “indexing” simply involves appending the parsed representation of an ingested document to an existing buffer, which is simple and fast. Quite surprisingly, experiments with TREC Microblog test collections show that query evaluation with brute force scans is feasible and performance compares favorably to a traditional search architecture based on an inverted index, especially if we take advantage of vectorized SIMD instructions and multiple cores in modern processor architectures. We believe that such a novel design is worth further exploration by IR researchers and practitioners.
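
    A minimal sketch of the scan-based design follows: ingestion appends a parsed representation to a buffer, and a query is answered by scanning every entry. The class name `ScanIndex`, the term-id encoding and the match-count scoring are assumptions for illustration; the SIMD and multi-core optimisations mentioned in the abstract are not reproduced here.

```python
class ScanIndex:
    """Append-only buffer of parsed documents, searched by a full scan."""

    def __init__(self):
        self.vocab = {}   # term -> integer id
        self.buffer = []  # list of (doc_id, tuple of term ids)

    def add(self, doc_id, text):
        """'Indexing' is a single append of the parsed representation."""
        ids = tuple(
            self.vocab.setdefault(term, len(self.vocab))
            for term in text.lower().split()
        )
        self.buffer.append((doc_id, ids))

    def search(self, query, k=10):
        """Scan every buffered document and count query-term matches."""
        query_ids = {
            self.vocab[t] for t in query.lower().split() if t in self.vocab
        }
        if not query_ids:
            return []
        hits = []
        for doc_id, ids in self.buffer:
            score = sum(1 for i in ids if i in query_ids)
            if score:
                hits.append((doc_id, score))
        hits.sort(key=lambda h: (-h[1], h[0]))
        return hits[:k]


idx = ScanIndex()
idx.add(1, "real time search")
idx.add(2, "tweet search search")
idx.add(3, "nothing here")
before = idx.search("search")
idx.add(4, "search search search")  # searchable as soon as it is appended
after = idx.search("search", k=1)
```

    Because there is no inversion step, a document is visible to the very next query after `add` returns. The cost is that every query touches the whole buffer.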

    Out of the box phrase indexing

    We present a method for optimizing inverted-index-based search engines with respect to phrase querying performance. Our approach adds carefully selected two-term phrases to an existing index. While competitive previous work is mainly based on the analysis of query logs, our approach works out of the box and uses just the information already contained in the index. Even so, our method can compete with previous work in terms of querying performance, and it can actually outperform that work for difficult queries. Moreover, our selection process gives performance guarantees for arbitrary queries. In a further step, we propose to use a phrase index as a substitute for the positional index of an in-memory search engine containing just short documents. We confirm all of our considerations by experiments on a high-performance main-memory search engine. However, we believe that our approach can be applied to classical disk-based systems as well.

    Within-Document Term-Based Index Pruning with Statistical Hypothesis Testing

    Document-centric static index pruning methods provide smaller indexes and faster query times by dropping some within-document term information from inverted lists. We present a method of pruning inverted lists derived from the formulation of unigram language models for retrieval. Our method is based on the statistical significance of term frequency ratios: using the two-sample two-proportion (2P2N) test, we statistically compare the frequency of occurrence of a word within a given document to the frequency of its occurrence in the collection to decide whether to prune it. Experimental results show that this technique can be used to significantly decrease the size of the index and increase querying speed with less compromise to retrieval effectiveness than similar heuristic methods. Furthermore, we give a formal statistical justification for such methods.
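
    The comparison described in this abstract can be sketched with a pooled two-proportion z-statistic. The pooled form, the one-sided threshold of 1.645, the rule "keep a term only if it is significantly more frequent in the document than in the collection", and the function names are all assumptions for illustration; the abstract does not give the exact test statistic or decision rule.

```python
import math


def two_proportion_z(tf, doc_len, cf, coll_len):
    """Pooled two-proportion z-statistic comparing a term's relative
    frequency in one document with its relative frequency in the
    collection."""
    p_doc = tf / doc_len
    p_coll = cf / coll_len
    pooled = (tf + cf) / (doc_len + coll_len)
    se = math.sqrt(pooled * (1 - pooled) * (1 / doc_len + 1 / coll_len))
    return (p_doc - p_coll) / se if se > 0 else 0.0


def prune_document(term_freqs, coll_freqs, coll_len, z_threshold=1.645):
    """Keep only postings whose z-statistic reaches the threshold
    (assumed decision rule); all other terms are pruned."""
    doc_len = sum(term_freqs.values())
    return {
        term: tf
        for term, tf in term_freqs.items()
        if two_proportion_z(tf, doc_len, coll_freqs[term], coll_len)
        >= z_threshold
    }


# Hypothetical counts: "the" and "filler" occur in the document at the
# same rate as in the collection, while "pruning" is strongly over-represented.
term_freqs = {"the": 6, "pruning": 5, "filler": 89}
coll_freqs = {"the": 60000, "pruning": 100, "filler": 890000}
kept = prune_document(term_freqs, coll_freqs, coll_len=1_000_000)
```

    In this toy example only the posting for "pruning" survives, since the other two terms carry no evidence that the document is about them beyond the collection-wide rate.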