
    JHU/APL at TREC 2004: Robust and Terabyte Tracks

    For initial ranked retrieval, we continue to use a statistical language model to compute query/document similarity values. Hiemstra and de Vries [3] describe such a linguistically motivated probabilistic model and explain how it relates to both the Boolean and vector space models. The model has also been cast as a rudimentary Hidden Markov Model [4]. Although the model does not explicitly incorporate inverse document frequency, it does favor documents that contain more of the rare query terms. The similarity measure can be computed as

        Sim(q,d) = ∏_{t∈q} [ α · f(t,d) + (1−α) · f(t,C) ]    (Equation 1: similarity calculation)

    where α is the probability that a query word is generated by a document-specific model, and (1−α) is the probability that it is generated by a generic language model. f(t,C) denotes the mean relative document frequency of term t. We have observed that aggregate performance using this model is fairly insensitive to the precise value of α that is used; however, higher values of α tend to result in selecting documents that contain a greater number of the query terms. Robust Track
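    As a concrete illustration, here is a minimal sketch of Equation 1 in Python. The function and argument names, the default α of 0.3, and the small floor for unseen collection terms are assumptions made for the example, not details from the paper.

        def similarity(query_terms, doc_term_freqs, collection_freqs, alpha=0.3):
            # Product over query terms of the two-way mixture in Equation 1:
            # alpha * f(t,d) + (1 - alpha) * f(t,C).
            score = 1.0
            for t in query_terms:
                f_td = doc_term_freqs.get(t, 0.0)     # relative frequency of t in d
                f_tc = collection_freqs.get(t, 1e-9)  # assumed floor so an unseen
                                                      # term does not zero the product
                score *= alpha * f_td + (1 - alpha) * f_tc
            return score

    Raising α weights the document-specific model more heavily, so documents missing query terms are penalized more, which matches the observation above that higher values of α select documents containing more of the query terms.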

    Search engine optimisation using past queries

    World Wide Web search engines process millions of queries per day from users all over the world. Efficient query evaluation is achieved through the use of an inverted index, where, for each word in the collection, the index maintains a list of the documents in which the word occurs. Query processing may also require access to document-specific statistics, such as document length; word statistics, such as the number of unique documents in which a word occurs; and collection-specific statistics, such as the number of documents in the collection. The index maintains individual data structures for each of these sources of information, and repeatedly accesses each to process a query. A by-product of a web search engine is a list of all queries entered into the engine: a query log. Analyses of query logs have shown repetition of query terms in the requests made to the search system. In this work we explore techniques that take advantage of the repetition of user queries to improve the accuracy or efficiency of text search. We introduce an index organisation scheme that favours those documents that are most frequently requested by users and show that, in combination with early-termination heuristics, query processing time can be dramatically reduced without reducing the accuracy of the search results. We examine the stability of such an ordering and show that an index based on as few as 100,000 training queries can support at least 20 million requests. We show the correlation between frequently accessed documents and relevance, and attempt to exploit the demonstrated relationship to improve search effectiveness. Finally, we deconstruct the search process to show that query-time redundancy can be exploited at various levels of the search process. We develop a model that illustrates the improvements that can be achieved in query processing time by caching different components of a search system. This model is then validated by simulation using a document collection and query log. Results on our test data show that a well-designed cache can reduce disk activity by more than 30%, with a cache that is one tenth the size of the collection.
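    To make the access-ordered index organisation concrete, here is a minimal sketch under assumed in-memory structures; the class name, the fixed-prefix cutoff, and the simple coordination-level scoring are illustrative stand-ins for the ordering and early-termination heuristics evaluated in this work.

        from collections import defaultdict

        class AccessOrderedIndex:
            # Postings lists are sorted so documents most frequently requested
            # in past queries come first; evaluation then visits only a fixed
            # prefix of each list (an early-termination heuristic).
            def __init__(self, postings, access_counts):
                # postings: term -> iterable of doc ids containing the term
                # access_counts: doc id -> retrieval frequency in the query log
                self.index = {
                    term: sorted(docs, key=lambda d: access_counts.get(d, 0),
                                 reverse=True)
                    for term, docs in postings.items()
                }

            def query(self, terms, prefix=1000):
                # Score only the first `prefix` postings per term, trading a
                # little effectiveness for far less list traversal.
                scores = defaultdict(int)
                for t in terms:
                    for doc in self.index.get(t, [])[:prefix]:
                        scores[doc] += 1  # coordination-level scoring, for brevity
                return sorted(scores, key=scores.get, reverse=True)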
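    The caching model discussed above spans several components of a search system; as one example, here is a minimal sketch of whole-result caching, assuming an LRU eviction policy and a hypothetical search_fn that evaluates a query against the index. The repetition observed in query logs is what makes such a cache effective.

        from collections import OrderedDict

        class QueryResultCache:
            # Caches complete result lists for repeated queries (LRU policy).
            def __init__(self, capacity):
                self.capacity = capacity
                self.cache = OrderedDict()  # query string -> ranked result list

            def get(self, query, search_fn):
                if query in self.cache:
                    self.cache.move_to_end(query)   # refresh recency on a hit
                    return self.cache[query]
                results = search_fn(query)          # miss: evaluate the query
                self.cache[query] = results
                if len(self.cache) > self.capacity:
                    self.cache.popitem(last=False)  # evict least recently used
                return results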