
    Real-time Text Queries with Tunable Term Pair Indexes

    Term proximity scoring is an established means of improving the result quality of full-text queries in information retrieval. Integrating such proximity scores into efficient query processing, however, has not been studied equally well. Existing methods use precomputed lists of documents in which tuples of terms, usually pairs, occur together; these lists usually incur a huge index size compared to term-only indexes. This paper introduces a joint framework for trading off index size and result quality, and provides optimization techniques for tuning precomputed indexes towards either maximal result quality or maximal query processing performance, given an upper bound for the index size. The framework allows lists for pairs to be selectively materialized based on a query log to further reduce index size. Extensive experiments with two large text collections demonstrate runtime improvements of several orders of magnitude over existing text-based processing techniques with reasonable index sizes.
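
As a rough illustration of the term pair lists described in this abstract, the sketch below precomputes, for each pair of terms that occur close together, a per-document proximity score. The function name `build_pair_index`, the `window` parameter, the inverse-square scoring, and the `wanted_pairs` filter (standing in for pairs taken from a query log) are all assumptions made for this example, not the paper's actual design.

```python
from collections import defaultdict


def build_pair_index(docs, window=3, wanted_pairs=None):
    """Map each sorted (term_a, term_b) pair to {doc_id: proximity score}.

    docs: dict of doc_id -> list of tokens.
    window: largest token distance that still counts as co-occurrence (assumed).
    wanted_pairs: optional set of sorted term pairs, e.g. drawn from a query
        log; when given, only those pair lists are materialized.
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for i in range(len(tokens)):
            # Only look ahead as far as the window allows.
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                if tokens[i] == tokens[j]:
                    continue
                pair = tuple(sorted((tokens[i], tokens[j])))
                if wanted_pairs is not None and pair not in wanted_pairs:
                    continue
                dist = j - i
                # Assumed scoring: closer occurrences contribute more.
                index[pair][doc_id] = index[pair].get(doc_id, 0.0) + 1.0 / dist ** 2
    return dict(index)


docs = {
    1: ["fast", "text", "query"],
    2: ["text", "index", "fast", "query"],
}
full_index = build_pair_index(docs, window=2)
small_index = build_pair_index(docs, window=2, wanted_pairs={("fast", "query")})
```

Restricting materialization to `wanted_pairs` shrinks the index to the pairs that are expected to be queried, which is the kind of size-versus-quality trade-off the abstract refers to.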

    Efficient Indexing for Structured and Unstructured Data

    The collection of digital data is growing at an exponential rate. Data originates from a wide range of sources such as text feeds, biological sequencers, internet traffic over routers, sensors, and many others. To mine intelligent information from these sources, users have to query the data. Indexing techniques aim to reduce query time by preprocessing the data. The diversity of data sources in the real world makes it imperative to develop application-specific indexing solutions based on the data to be queried. Data can be structured (i.e., relational tables) or unstructured (i.e., free text). Moreover, increasingly many applications need to seamlessly analyze both kinds of data, making data integration a central issue. Integrating text with structured data needs to account for missing values, errors in the data, etc. Probabilistic models have been proposed recently for this purpose. These models are also useful for applications where uncertainty is inherent in the data, e.g. sensor networks. This dissertation proposes efficient indexing solutions for several problems that lie at the intersection of databases and information retrieval, such as joining ranked inputs and full-text document search. Other well-known problems of ranked retrieval and pattern matching are also studied under probabilistic settings. For each problem, the worst-case theoretical bounds of the proposed solutions are established and/or their practicality is demonstrated by thorough experimentation.

    A DFT-Based Running Time Prediction Algorithm for Web Queries

    Web search engines are built from components capable of processing large numbers of user queries per second in a distributed way. Among them, the index service computes the top-k documents that best match each incoming query by means of a document ranking operation. To achieve high performance, dynamic pruning techniques such as the WAND and BM-WAND algorithms are used to avoid fully processing all of the documents related to a query during the ranking operation. Additionally, the index service distributes the ranking operations among clusters of processors, where each processor applies multi-threading to speed up query processing. In this scenario, a query running time prediction algorithm has practical applications in the efficient assignment of processors and threads to incoming queries. We propose a prediction algorithm for the WAND and BM-WAND algorithms. We experimentally show that our proposal achieves accurate prediction results while significantly reducing execution time and memory consumption compared with an alternative prediction algorithm. Our proposal applies the discrete Fourier transform (DFT) to represent key features affecting query running time, and the resulting vectors are used to train a feed-forward neural network with back-propagation.
    Affiliation: Rojas, Oscar. Universidad de Santiago de Chile; Chile. Affiliation: Gil Costa, Graciela Verónica. Universidad Nacional de San Luis. Facultad de Ciencias Físico- Matemáticas y Naturales; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis; Argentina. Affiliation: Marín, Mauricio. Universidad de Chile; Chil
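
To make the DFT step in this abstract more concrete, the sketch below turns a variable-length numeric sequence into a fixed-length vector of DFT coefficient magnitudes, which is the kind of input a feed-forward network needs. The abstract does not say which sequences are transformed or how many coefficients are kept, so the input (here a hypothetical list of per-query values), the name `dft_features`, and the `n_coeffs` parameter are assumptions for illustration only.

```python
import cmath


def dft_features(values, n_coeffs=4):
    """Return the magnitudes of the first n_coeffs DFT coefficients of values."""
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    features = []
    for k in range(n_coeffs):
        # Direct evaluation of the k-th DFT coefficient.
        coeff = sum(
            v * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, v in enumerate(values)
        )
        features.append(abs(coeff))
    return features


flat = dft_features([1.0, 1.0, 1.0, 1.0], n_coeffs=2)
spike = dft_features([1.0, 0.0, 0.0, 0.0], n_coeffs=3)
```

A constant sequence puts all of its energy in coefficient 0, while a single spike spreads it evenly, so sequences of different lengths and shapes map to comparable fixed-size vectors.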

    Diversification Based Static Index Pruning - Application to Temporal Collections

    Nowadays, web archives preserve the history of large portions of the web. As media shift from printed to digital editions, accessing these huge information sources is drawing increasing attention from national and international institutions, as well as from the research community. These collections are intrinsically big, leading to index files that do not fit into memory and to increased query response time. Decreasing the index size is a direct way to decrease this query response time. Static index pruning methods reduce the size of indexes by removing a part of the postings. In the context of web archives, it is necessary to remove postings while preserving the temporal diversity of the archive. None of the existing pruning approaches takes (temporal) diversification into account. In this paper, we propose a diversification-based static index pruning method. It differs from the existing pruning approaches by integrating diversification within the pruning context. We aim at pruning the index while preserving retrieval effectiveness and diversity, by pruning while maximizing a given IR evaluation metric such as DCG. We show how to apply this approach in the context of web archives. Finally, we show on two collections that search effectiveness in temporal collections after pruning can be improved using our approach rather than diversity-oblivious approaches.
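
For readers unfamiliar with static index pruning, the sketch below shows a generic score-based baseline: each posting list keeps only its highest-scoring entries. This is a diversity-oblivious approach of the kind the abstract contrasts itself with, not the diversification-based method the paper proposes. The function name `prune_index`, the `(doc_id, score)` posting layout, and the `keep` parameter are assumptions for this example.

```python
def prune_index(index, keep):
    """Keep only the `keep` highest-scoring postings of every term.

    index: dict of term -> list of (doc_id, score) postings.
    """
    pruned = {}
    for term, postings in index.items():
        top = sorted(postings, key=lambda p: p[1], reverse=True)[:keep]
        pruned[term] = sorted(top)  # restore doc-id order for query processing
    return pruned


index = {
    "web": [(1, 0.2), (2, 0.9), (3, 0.5)],
    "archive": [(2, 0.1)],
}
pruned = prune_index(index, keep=2)
```

Because the cut is made on score alone, all surviving postings for a term could come from the same time period, which is the loss of temporal diversity the abstract is concerned with.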

    COMET: A Recipe for Learning and Using Large Ensembles on Massive Data

    COMET is a single-pass MapReduce algorithm for learning on large-scale data. It builds multiple random forest ensembles on distributed blocks of data and merges them into a mega-ensemble. This approach is appropriate when learning from massive-scale data that is too large to fit on a single machine. To get the best accuracy, IVoting should be used instead of bagging to generate the training subset for each decision tree in the random forest. Experiments with two large datasets (5GB and 50GB compressed) show that COMET compares favorably (in both accuracy and training time) to learning on a subsample of data using a serial algorithm. Finally, we propose a new Gaussian approach for lazy ensemble evaluation which dynamically decides how many ensemble members to evaluate per data point; this can reduce evaluation cost by 100X or more.
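
The idea of lazy ensemble evaluation can be sketched as follows: evaluate ensemble members one at a time and stop as soon as the majority vote is statistically clear. The abstract does not give the paper's stopping rule, so the normal-approximation test below, the function name `lazy_vote`, and the parameters `z` and `min_members` are assumptions for a binary-classification illustration rather than the authors' procedure.

```python
import math


def lazy_vote(members, x, z=2.58, min_members=10):
    """Evaluate binary classifiers on x until the majority is clear.

    members: sequence of callables returning 0 or 1.
    Returns (predicted_label, number_of_members_evaluated).
    """
    ones = 0
    n = 0
    for n, member in enumerate(members, start=1):
        ones += member(x)
        if n >= min_members:
            p = ones / n
            # Normal approximation to the standard error of the vote fraction.
            se = math.sqrt(max(p * (1 - p), 1e-12) / n)
            if abs(p - 0.5) > z * se:
                break  # remaining members are unlikely to flip the majority
    if n == 0:
        raise ValueError("members must not be empty")
    return (1 if 2 * ones >= n else 0), n


unanimous = [lambda x: 1] * 100
split = [(lambda x: 1) if i % 2 == 0 else (lambda x: 0) for i in range(100)]
easy_label, easy_used = lazy_vote(unanimous, None)
hard_label, hard_used = lazy_vote(split, None)
```

Easy points, where members agree, stop after `min_members` evaluations, while contested points fall back to evaluating the whole ensemble.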

    Efficient query processing for scalable web search

    Search engines are exceptionally important tools for accessing information in today’s world. In satisfying the information needs of millions of users, the effectiveness (the quality of the search results) and the efficiency (the speed at which the results are returned to the users) of a search engine are two goals that form a natural trade-off, as techniques that improve the effectiveness of the search engine can also make it less efficient. Meanwhile, search engines continue to evolve rapidly, with larger indexes, more complex retrieval strategies and growing query volumes. Hence, there is a need for efficient query processing infrastructures that make appropriate sacrifices in effectiveness in order to make gains in efficiency. This survey comprehensively reviews the foundations of search engines, from index layouts to basic term-at-a-time (TAAT) and document-at-a-time (DAAT) query processing strategies, while also covering the latest trends in the literature on efficient query processing, including coherent and systematic reviews of techniques such as dynamic pruning and impact-sorted posting lists as well as their variants and optimisations. Our explanations of query processing strategies, for instance the WAND and BMW dynamic pruning algorithms, are presented with illustrative figures showing how the processing state changes as the algorithms progress. Moreover, acknowledging the recent trends in applying a cascading infrastructure within search systems, this survey describes techniques for efficiently integrating effective learned models, such as those obtained from learning-to-rank techniques. The survey also covers the selective application of query processing techniques, often achieved by predicting the response times of the search engine (known as query efficiency prediction), and making per-query trade-offs between efficiency and effectiveness to ensure that the required retrieval speed targets can be met. Finally, the survey concludes with a summary of open directions in efficient search infrastructures, namely the use of signatures, real-time, energy-efficient and modern hardware and software architectures.
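
The two basic strategies named in this abstract can be contrasted with a minimal sketch: TAAT traverses one posting list at a time and accumulates partial scores per document, while DAAT advances all posting lists together in document-id order and finishes each document's score before moving on. The index layout (term to doc-id-sorted list of `(doc_id, score)` postings) and the function names are assumptions for this example, and no dynamic pruning such as WAND or BMW is attempted.

```python
import heapq
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def taat(index, query, k):
    """Term-at-a-time: fully process one posting list before the next."""
    accumulators = defaultdict(float)
    for term in query:
        for doc_id, score in index.get(term, []):
            accumulators[doc_id] += score
    # Highest score first; ties go to the lower doc id.
    return heapq.nlargest(k, accumulators.items(), key=lambda d: (d[1], -d[0]))


def daat(index, query, k):
    """Document-at-a-time: score each document completely, in doc-id order."""
    if k <= 0:
        return []
    lists = [index[t] for t in query if t in index]
    merged = heapq.merge(*lists, key=itemgetter(0))
    heap = []  # min-heap of (score, -doc_id) holding the current top k
    for doc_id, postings in groupby(merged, key=itemgetter(0)):
        entry = (sum(score for _, score in postings), -doc_id)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)
    return [(-neg_id, score) for score, neg_id in sorted(heap, reverse=True)]


index = {
    "web": [(1, 1.0), (3, 2.0)],
    "search": [(1, 0.5), (2, 1.0), (3, 0.5)],
}
taat_top = taat(index, ["web", "search"], k=2)
daat_top = daat(index, ["web", "search"], k=2)
```

TAAT needs an accumulator for every candidate document, whereas DAAT only keeps the current top-k heap, which is one reason dynamic pruning techniques are typically described on top of DAAT-style traversal.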