182 research outputs found

    Comprehensive and Practical Policy Compliance in Data Retrieval Systems

    Get PDF
    Data retrieval systems such as online search engines and online social networks process many data items coming from different sources, each subject to its own data use policy. Ensuring compliance with these policies in a large and fast-evolving system presents a significant technical challenge since bugs, misconfigurations, or operator errors can cause (accidental) policy violations. To prevent such violations, researchers and practitioners develop policy compliance systems. Existing policy compliance systems, however, are either not comprehensive or not practical. To be comprehensive, a compliance system must be able to enforce users' policies regarding their personal privacy preferences, the service provider's own policies regarding data use such as auditing and personalization, and regulatory policies such as data retention and censorship. To be practical, a compliance system needs to meet stringent requirements: (1) runtime overhead must be low; (2) existing applications must run with few modifications; and (3) bugs, misconfigurations, or actions by unprivileged operators must not cause policy violations. In this thesis, we present the design and implementation of two comprehensive and practical compliance systems: Thoth and Shai. Thoth relies on pure runtime monitoring: it tracks data flows by intercepting processes' I/O, and then it checks the associated policies to allow only policy-compliant flows at runtime. Shai, on the other hand, combines offline analysis and light-weight runtime monitoring: it pushes as many policy checks as possible to an offline (flow) analysis by predicting the policies that data-handling processes will be subject to at runtime, and then it compiles those policies into a set of fine-grained I/O capabilities that can be enforced directly by the underlying operating system

    High-Performance Computing Algorithms for Constructing Inverted Files on Emerging Multicore Processors

    Get PDF
    Current trends in processor architectures increasingly include more cores on a single chip and more complex memory hierarchies, and such a trend is likely to continue in the foreseeable future. These processors offer unprecedented opportunities for speeding up demanding computations if the available resources can be effectively utilized. Simultaneously, parallel programming languages such as OpenMP and MPI have been commonly used on clusters of multicore CPUs while newer programming languages such as OpenCL and CUDA have been widely adopted on recent heterogeneous systems and GPUs respectively. The main goal of this dissertation is to develop techniques and methodologies for exploiting these emerging parallel architectures and parallel programming languages to solve large scale irregular applications such as the construction of inverted files. The extraction of inverted files from large collections of documents forms a critical component of all information retrieval systems including web search engines. In this problem, the disk I/O throughput is the major performance bottleneck especially when intermediate results are written onto disks. In addition to the I/O bottleneck, a number of synchronization and consistency issues must be resolved in order to build the dictionary and postings lists efficiently. To address these issues, we introduce a dictionary data structure using a hybrid of trie and B-trees and a high-throughput pipeline strategy that completely avoids the use of disks as temporary storage for intermediate results, while ensuring the consumption of the input data at a high rate. The high-throughput pipelined strategy produces parallel parsed streams that are consumed at the same rate by parallel indexers. The pipelined strategy is implemented on a single multicore CPU as well as on a cluster of such nodes. We were able to achieve a throughput of more than 262MB/s on the ClueWeb09 dataset on a single node. On a cluster of 32 nodes, our experimental results show scalable performance using different metrics, significantly improving on prior published results. On the other hand, we develop a new approach for handling time-evolving documents using additional small temporal indexing structures. The lifetime of the collection is partitioned into multiple time windows, which guarantees a very fast temporal query response time at a small space overhead relative to the non-temporal case. Extensive experimental results indicate that the overhead in both indexing and querying is small in this more complicated case, and the query performance can indeed be improved using finer temporal partitioning of the collection. Finally, we employ GPUs to accelerate the indexing process for building inverted files and to develop a very fast algorithm for the highly irregular list ranking problem. For the indexing problem, the workload is split between CPUs and GPUs in such a way that the strengths of both architectures are exploited. For the list ranking problem involved in the decompression of inverted files, an optimized GPU algorithm is introduced by reducing the problem to a large number of fine grain computations in such a way that the processing cost per element is shown to be close to the best possible

    Question Type Recognition Using Natural Language Input

    Get PDF
    Recently, numerous specialists are concentrating on the utilization of Natural Language Processing (NLP) systems in various domains, for example, data extraction and content mining. One of the difficulties with these innovations is building up a precise Question and Answering (QA) System. Question type recognition is the most significant task in a QA system, for example, chat bots. Organization such as National Institute of Standards (NIST) hosts a conference series called as Text REtrieval Conference (TREC) series which keeps a competition every year to encourage and improve the technique of information retrieval from a large corpus of text. When a user asks a question, he/she expects a correct form of answer in reply. The undertaking of classifying a question type is to anticipate the sort of a question which is composed in common dialect. The question is then classified to one of the predefined question types. The objective of this project is to build a question type recognition system using big data and machine learning techniques. The system will comprise of a supervised learning model that will receive a question in a natural language input and it can recognize and classify a given question based upon its question type. Extracting important textual features and building a model using those features is the most important task of this project. The training and testing data has been obtained from the TREC website. Training data comprises of a corpus of unique questions and the labels associated with it. The model is tested and evaluated using the testing data. This project also achieves the goal of making a scalable system using big data technologies

    Question Type Recognition Using Natural Language Input

    Get PDF
    Recently, numerous specialists are concentrating on the utilization of Natural Language Processing (NLP) systems in various domains, for example, data extraction and content mining. One of the difficulties with these innovations is building up a precise Question and Answering (QA) System. Question type recognition is the most significant task in a QA system, for example, chat bots. Organization such as National Institute of Standards (NIST) hosts a conference series called as Text REtrieval Conference (TREC) series which keeps a competition every year to encourage and improve the technique of information retrieval from a large corpus of text. When a user asks a question, he/she expects a correct form of answer in reply. The undertaking of classifying a question type is to anticipate the sort of a question which is composed in common dialect. The question is then classified to one of the predefined question types. The objective of this project is to build a question type recognition system using big data and machine learning techniques. The system will comprise of a supervised learning model that will receive a question in a natural language input and it can recognize and classify a given question based upon its question type. Extracting important textual features and building a model using those features is the most important task of this project. The training and testing data has been obtained from the TREC website. Training data comprises of a corpus of unique questions and the labels associated with it. The model is tested and evaluated using the testing data. This project also achieves the goal of making a scalable system using big data technologies

    Code Generation for an Application-Specific VLIW Processor With Clustered, Addressable Register Files

    Get PDF
    International audienceModern compilers integrate recent advances in compiler construction, intermediate representations, algorithms and programming language front-ends. Yet code generation for appli\-cation-specific architectures benefits only marginally from this trend, as most of the effort is oriented towards popular general-purpose architectures. Historically, non-orthogonal architectures have relied on custom compiler technologies, some retargettable, but largely decoupled from the evolution of mainstream tool flows. Very Long Instruction Word (VLIW) architectures have introduced a variety of interesting problems such as clusterization, packetization or bundling, instruction scheduling for exposed pipelines, long delay slots, software pipelining, etc. These have been addressed in the literature, with a focus on the exploitation of Instruction Level Parallelism (ILP). While these are well known solutions already embedded into existing compilers, they rely on common hardware functionalities that are expected to be present in a fairly large subset of VLIW architectures. This paper presents our work on back-end compiler for Mephisto, a high performance low-power application-specific processor, based on LLVM. Mephisto is specialized enough to challenge established code generation solutions for VLIW and DSP processors, calling for an innovative compilation flow. Conversely, even though Mephisto might be seen a somewhat exotic processor, its hardware characteristics such as addressable register files benefit from existing analyses and transformations in LLVM. We describe our model of the Mephisto architecture, the difficulties we encountered, and the associated compilation methods, some of them new and specific to Mephisto

    EHI: End-to-end Learning of Hierarchical Index for Efficient Dense Retrieval

    Full text link
    Dense embedding-based retrieval is now the industry standard for semantic search and ranking problems, like obtaining relevant web documents for a given query. Such techniques use a two-stage process: (a) contrastive learning to train a dual encoder to embed both the query and documents and (b) approximate nearest neighbor search (ANNS) for finding similar documents for a given query. These two stages are disjoint; the learned embeddings might be ill-suited for the ANNS method and vice-versa, leading to suboptimal performance. In this work, we propose End-to-end Hierarchical Indexing -- EHI -- that jointly learns both the embeddings and the ANNS structure to optimize retrieval performance. EHI uses a standard dual encoder model for embedding queries and documents while learning an inverted file index (IVF) style tree structure for efficient ANNS. To ensure stable and efficient learning of discrete tree-based ANNS structure, EHI introduces the notion of dense path embedding that captures the position of a query/document in the tree. We demonstrate the effectiveness of EHI on several benchmarks, including de-facto industry standard MS MARCO (Dev set and TREC DL19) datasets. For example, with the same compute budget, EHI outperforms state-of-the-art (SOTA) in by 0.6% (MRR@10) on MS MARCO dev set and by 4.2% (nDCG@10) on TREC DL19 benchmarks

    Scalable Architecture for Integrated Batch and Streaming Analysis of Big Data

    Get PDF
    Thesis (Ph.D.) - Indiana University, Computer Sciences, 2015As Big Data processing problems evolve, many modern applications demonstrate special characteristics. Data exists in the form of both large historical datasets and high-speed real-time streams, and many analysis pipelines require integrated parallel batch processing and stream processing. Despite the large size of the whole dataset, most analyses focus on specific subsets according to certain criteria. Correspondingly, integrated support for efficient queries and post- query analysis is required. To address the system-level requirements brought by such characteristics, this dissertation proposes a scalable architecture for integrated queries, batch analysis, and streaming analysis of Big Data in the cloud. We verify its effectiveness using a representative application domain - social media data analysis - and tackle related research challenges emerging from each module of the architecture by integrating and extending multiple state-of-the-art Big Data storage and processing systems. In the storage layer, we reveal that existing text indexing techniques do not work well for the unique queries of social data, which put constraints on both textual content and social context. To address this issue, we propose a flexible indexing framework over NoSQL databases to support fully customizable index structures, which can embed necessary social context information for efficient queries. The batch analysis module demonstrates that analysis workflows consist of multiple algorithms with different computation and communication patterns, which are suitable for different processing frameworks. To achieve efficient workflows, we build an integrated analysis stack based on YARN, and make novel use of customized indices in developing sophisticated analysis algorithms. In the streaming analysis module, the high-dimensional data representation of social media streams poses special challenges to the problem of parallel stream clustering. Due to the sparsity of the high-dimensional data, traditional synchronization method becomes expensive and severely impacts the scalability of the algorithm. Therefore, we design a novel strategy that broadcasts the incremental changes rather than the whole centroids of the clusters to achieve scalable parallel stream clustering algorithms. Performance tests using real applications show that our solutions for parallel data loading/indexing, queries, analysis tasks, and stream clustering all significantly outperform implementations using current state-of-the-art technologies

    Sample Collection and DNA Extraction Methods for Environmental DNA Metabarcoding in Headwater Streams

    Get PDF
    DNA is found to be free and ubiquitous in the environment where it is no longer associated with the source organism, and is also known as environmental-DNA (eDNA). Methods optimized for specific environments may be able supplement insight to local taxa richness. With the advent of high throughput sequencing and the proliferation of sequence data in public repositories, insights to the biodiversity of communities at the molecular level have been possible. Thus, this study compared commonly used DNA capture (water precipitation and filtration) and extraction (MoBio\u27s PowerWater, Qiagen\u27s DNeasy Blood and Tissue Kit, and a CTAB protocol) methods in their ability to isolate eDNA for the purpose of metabarcoding a section of the ribosomal small subunit 18 S (18s) and the cytochrome oxidase I (COI) gene regions. The 18s sequence data is non-reportable due to lack of sequence quality, and MoBio\u27s PowerWater did not yield DNA suitable concentrations. CTAB and DNeasy extractions yielded successful PCR reactions and high-throughput sequencing (HTS). When combined with their respective replicates, CTAB and DNeasy were determined to have genus richness ( -diversity) of 25 and 24, respectively of benthic macroinvertebrates with 20 taxonomic determinations being shared between the two methods. After conducting Jaccard\u27s dissimilarity index and constructing ordination plots using non-metric multidimensional scaling (NMDS), this study was not able to reveal differences in the amount of taxa richness between CTAB and DNeasy, which implied extraction methods may not be a limiting factor in detected taxa richness

    Multi-Stage Search Architectures for Streaming Documents

    Get PDF
    The web is becoming more dynamic due to the increasing engagement and contribution of Internet users in the age of social media. A more dynamic web presents new challenges for web search--an important application of Information Retrieval (IR). A stream of new documents constantly flows into the web at a high rate, adding to the old content. In many cases, documents quickly lose their relevance. In these time-sensitive environments, finding relevant content in response to user queries requires a real-time search service; immediate availability of content for search and a fast ranking, which requires an optimized search architecture. These aspects of today's web are at odds with how academic IR researchers have traditionally viewed the web, as a collection of static documents. Moreover, search architectures have received little attention in the IR literature. Therefore, academic IR research, for the most part, does not provide a mechanism to efficiently handle a high-velocity stream of documents, nor does it facilitate real-time ranking. This dissertation addresses the aforementioned shortcomings. We present an efficient mech- anism to index a stream of documents, thereby enabling immediate availability of content. Our indexer works entirely in main memory and provides a mechanism to control inverted list con- tiguity, thereby enabling faster retrieval. Additionally, we consider document ranking with a machine-learned model, dubbed "Learning to Rank" (LTR), and introduce a novel multi-stage search architecture that enables fast retrieval and allows for more design flexibility. The stages of our architecture include candidate generation (top k retrieval), feature extraction, and docu- ment re-ranking. We compare this architecture with a traditional monolithic architecture where candidate generation and feature extraction occur together. As we lay out our architecture, we present optimizations to each stage to facilitate low-latency ranking. These optimizations include a fast approximate top k retrieval algorithm, document vectors for feature extraction, architecture- conscious implementations of tree ensembles for LTR using predication and vectorization, and algorithms to train tree-based LTR models that are fast to evaluate. We also study the efficiency- effectiveness tradeoffs of these techniques, and empirically evaluate our end-to-end architecture on microblog document collections. We show that our techniques improve efficiency without degrading quality
    corecore