14 research outputs found

    Efficient query processing for scalable web search

    Get PDF
    Search engines are exceptionally important tools for accessing information in today’s world. In satisfying the information needs of millions of users, the effectiveness (the quality of the search results) and the efficiency (the speed at which the results are returned to the users) of a search engine are two goals that form a natural trade-off, as techniques that improve the effectiveness of the search engine can also make it less efficient. Meanwhile, search engines continue to rapidly evolve, with larger indexes, more complex retrieval strategies and growing query volumes. Hence, there is a need for the development of efficient query processing infrastructures that make appropriate sacrifices in effectiveness in order to make gains in efficiency. This survey comprehensively reviews the foundations of search engines, from index layouts to basic term-at-a-time (TAAT) and document-at-a-time (DAAT) query processing strategies, while also providing the latest trends in the literature in efficient query processing, including the coherent and systematic reviews of techniques such as dynamic pruning and impact-sorted posting lists as well as their variants and optimisations. Our explanations of query processing strategies, for instance the WAND and BMW dynamic pruning algorithms, are presented with illustrative figures showing how the processing state changes as the algorithms progress. Moreover, acknowledging the recent trends in applying a cascading infrastructure within search systems, this survey describes techniques for efficiently integrating effective learned models, such as those obtained from learning-to-rank techniques. The survey also covers the selective application of query processing techniques, often achieved by predicting the response times of the search engine (known as query efficiency prediction), and making per-query tradeoffs between efficiency and effectiveness to ensure that the required retrieval speed targets can be met. Finally, the survey concludes with a summary of open directions in efficient search infrastructures, namely the use of signatures, real-time, energy-efficient and modern hardware and software architectures

    Algorithms for Constructing Exact Nearest Neighbor Graphs

    Get PDF
    University of Minnesota Ph.D. dissertation.June 2016. Major: Computer Science. Advisor: George Karypis. 1 computer file (PDF); xi, 151 pages.Nearest neighbor graphs (NNGs) contain the set of closest neighbors, and their similarities, for each of the objects in a set of objects. They are widely used in many real-world applications, such as clustering, online advertising, recommender systems, data cleaning, and query refinement. A brute-force method for constructing the graph requires O(n^2) similarity comparisons for a set of n objects. One way to reduce the number of comparisons is to ignore object pairs with low similarity, which are unimportant in many domains. Current methods for construction of the graph tackle the problem by either pruning the similarity search space, avoiding comparisons of objects that can be determined to not meet the similarity bounding conditions, or they solve the problem approximately, which can miss some of the neighbors. This thesis addresses the problem of efficiently constructing the exact nearest neighbor graph for a large set of objects, i.e., the graph that would be found by comparing each object against all other objects in the set. In this context, we address two specific problems. The epsilon-nearest neighbor graph (epsilon-NNG) construction problem, also known as all-pairs similarity search (APSS), seeks to find, for each object, all other objects with a similarity of at least some threshold epsilon. On the other hand, the k-nearest neighbor graph (k-NNG) construction problem seeks to find the k closest other objects to each object in the set. For both problems, we propose filtering techniques that are more effective than previous ones, and efficient serial and parallel algorithms to construct the graph. Our methods are ideally suited for sparse high dimensional data

    IMPROVING BWA-MEM WITH GPU PARALLEL COMPUTING

    Get PDF
    Due to the many advances made in designing algorithms, especially the ones used in bioinformatics, it is becoming harder and harder to improve their efficiencies. Therefore, hardware acceleration using General-Purpose computing on Graphics Processing Unit has become a popular choice. BWA-MEM is an important part of the BWA software package for sequence mapping. Because of its high speed and accuracy, we choose to parallelize the popular short DNA sequence mapper. BWA has been a prevalent single node tool in genome alignment, and it has been widely studied for acceleration for a long time since the first version of the BWA package came out. This thesis presents the Big Data GPGPU distributed BWA-MEM, a tool that combines GPGPU acceleration and distributed computing. The four hardware parallelization techniques used are CPU multi-threading, GPU paralleled, CPU distributed, and GPU distributed. The GPGPU distributed software typically outperforms other parallelization versions. The alignment is performed on a distributed network, and each node in the network executes a separate GPGPU paralleled version of the software. We parallelize the chain2aln function in three levels. In Level 1, the function ksw\_extend2, an algorithm based on Smith-Waterman, is parallelized to handle extension on one side of the seed. In Level 2, the function chain2aln is parallelized to handle chain extension, where all seeds within the same chain are extended. In Level 3, part of the function mem\_align1\_core is parallelized for extending multiple chains. Due to the program's complexity, the parallelization work was limited at the GPU version of ksw\_extend2 parallelization Level 3. However, we have successfully combined Spark with BWA-MEM and ksw\_extend2 at parallelization Level 1, which has shown that the proposed framework is possible. The paralleled Level 3 GPU version of ksw\_extend2 demonstrated noticeable speed improvement with the test data set

    Diverse Intrusion-tolerant Systems

    Get PDF
    Over the past 20 years, there have been indisputable advances on the development of Byzantine Fault-Tolerant (BFT) replicated systems. These systems keep operational safety as long as at most f out of n replicas fail simultaneously. Therefore, in order to maintain correctness it is assumed that replicas do not suffer from common mode failures, or in other words that replicas fail independently. In an adversarial setting, this requires that replicas do not include similar vulnerabilities, or otherwise a single exploit could be employed to compromise a significant part of the system. The thesis investigates how this assumption can be substantiated in practice by exploring diversity when managing the configurations of replicas. The thesis begins with an analysis of a large dataset of vulnerability information to get evidence that diversity can contribute to failure independence. In particular, we used the data from a vulnerability database to devise strategies for building groups of n replicas with different Operating Systems (OS). Our results demonstrate that it is possible to create dependable configurations of OSes, which do not share vulnerabilities over reasonable periods of time (i.e., a few years). Then, the thesis proposes a new design for a firewall-like service that protects and regulates the access to critical systems, and that could benefit from our diversity management approach. The solution provides fault and intrusion tolerance by implementing an architecture based on two filtering layers, enabling efficient removal of invalid messages at early stages in order to decrease the costs associated with BFT replication in the later stages. The thesis also presents a novel solution for managing diverse replicas. It collects and processes data from several data sources to continuously compute a risk metric. Once the risk increases, the solution replaces a potentially vulnerable replica by another one, trying to maximize the failure independence of the replicated service. Then, the replaced replica is put on quarantine and updated with the available patches, to be prepared for later re-use. We devised various experiments that show the dependability gains and performance impact of our prototype, including key benchmarks and three BFT applications (a key-value store, our firewall-like service, and a blockchain).Unidade de investigação LASIGE (UID/CEC/00408/2019) e o projeto PTDC/EEI-SCR/1741/2041 (Abyss

    Pretrained Transformers for Text Ranking: BERT and Beyond

    Get PDF
    The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query. Although the most common formulation of text ranking is search, instances of the task can also be found in many natural language processing applications. This survey provides an overview of text ranking with neural network architectures known as transformers, of which BERT is the best-known example. The combination of transformers and self-supervised pretraining has been responsible for a paradigm shift in natural language processing (NLP), information retrieval (IR), and beyond. In this survey, we provide a synthesis of existing work as a single point of entry for practitioners who wish to gain a better understanding of how to apply transformers to text ranking problems and researchers who wish to pursue work in this area. We cover a wide range of modern techniques, grouped into two high-level categories: transformer models that perform reranking in multi-stage architectures and dense retrieval techniques that perform ranking directly. There are two themes that pervade our survey: techniques for handling long documents, beyond typical sentence-by-sentence processing in NLP, and techniques for addressing the tradeoff between effectiveness (i.e., result quality) and efficiency (e.g., query latency, model and index size). Although transformer architectures and pretraining techniques are recent innovations, many aspects of how they are applied to text ranking are relatively well understood and represent mature techniques. However, there remain many open research questions, and thus in addition to laying out the foundations of pretrained transformers for text ranking, this survey also attempts to prognosticate where the field is heading

    Correction to: RNA Bioinformatics.

    Get PDF
    n/

    XXI Workshop de Investigadores en Ciencias de la Computación - WICC 2019: libro de actas

    Get PDF
    Trabajos presentados en el XXI Workshop de Investigadores en Ciencias de la Computación (WICC), celebrado en la provincia de San Juan los días 25 y 26 de abril 2019, organizado por la Red de Universidades con Carreras en Informática (RedUNCI) y la Facultad de Ciencias Exactas, Físicas y Naturales de la Universidad Nacional de San Juan.Red de Universidades con Carreras en Informátic

    XX Workshop de Investigadores en Ciencias de la Computación - WICC 2018 : Libro de actas

    Get PDF
    Actas del XX Workshop de Investigadores en Ciencias de la Computación (WICC 2018), realizado en Facultad de Ciencias Exactas y Naturales y Agrimensura de la Universidad Nacional del Nordeste, los dìas 26 y 27 de abril de 2018.Red de Universidades con Carreras en Informática (RedUNCI
    corecore