232 research outputs found

    Opportunistic linked data querying through approximate membership metadata

    Get PDF
    Between URI dereferencing and the SPARQL protocol lies a largely unexplored axis of possible interfaces to Linked Data, each with its own combination of trade-offs. One of these interfaces is Triple Pattern Fragments, which allows clients to execute SPARQL queries against low-cost servers, at the cost of higher bandwidth. Increasing a client's efficiency means lowering the number of requests, which can among others be achieved through additional metadata in responses. We noted that typical SPARQL query evaluations against Triple Pattern Fragments require a significant portion of membership subqueries, which check the presence of a specific triple, rather than a variable pattern. This paper studies the impact of providing approximate membership functions, i.e., Bloom filters and Golomb-coded sets, as extra metadata. In addition to reducing HTTP requests, such functions allow to achieve full result recall earlier when temporarily allowing lower precision. Half of the tested queries from a WatDiv benchmark test set could be executed with up to a third fewer HTTP requests with only marginally higher server cost. Query times, however, did not improve, likely due to slower metadata generation and transfer. This indicates that approximate membership functions can partly improve the client-side query process with minimal impact on the server and its interface

    Approximate Data Analytics Systems

    Get PDF
    Today, most modern online services make use of big data analytics systems to extract useful information from the raw digital data. The data normally arrives as a continuous data stream at a high speed and in huge volumes. The cost of handling this massive data can be significant. Providing interactive latency in processing the data is often impractical due to the fact that the data is growing exponentially and even faster than Moore’s law predictions. To overcome this problem, approximate computing has recently emerged as a promising solution. Approximate computing is based on the observation that many modern applications are amenable to an approximate, rather than the exact output. Unlike traditional computing, approximate computing tolerates lower accuracy to achieve lower latency by computing over a partial subset instead of the entire input data. Unfortunately, the advancements in approximate computing are primarily geared towards batch analytics and cannot provide low-latency guarantees in the context of stream processing, where new data continuously arrives as an unbounded stream. In this thesis, we design and implement approximate computing techniques for processing and interacting with high-speed and large-scale stream data to achieve low latency and efficient utilization of resources. To achieve these goals, we have designed and built the following approximate data analytics systems: • StreamApprox—a data stream analytics system for approximate computing. This system supports approximate computing for low-latency stream analytics in a transparent way and has an ability to adapt to rapid fluctuations of input data streams. In this system, we designed an online adaptive stratified reservoir sampling algorithm to produce approximate output with bounded error. • IncApprox—a data analytics system for incremental approximate computing. This system adopts approximate and incremental computing in stream processing to achieve high-throughput and low-latency with efficient resource utilization. In this system, we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error. • PrivApprox—a data stream analytics system for privacy-preserving and approximate computing. This system supports high utility and low-latency data analytics and preserves user’s privacy at the same time. The system is based on the combination of privacy-preserving data analytics and approximate computing. • ApproxJoin—an approximate distributed joins system. This system improves the performance of joins — critical but expensive operations in big data systems. In this system, we employed a sketching technique (Bloom filter) to avoid shuffling non-joinable data items through the network as well as proposed a novel sampling mechanism that executes during the join to obtain an unbiased representative sample of the join output. Our evaluation based on micro-benchmarks and real world case studies shows that these systems can achieve significant performance speedup compared to state-of-the-art systems by tolerating negligible accuracy loss of the analytics output. In addition, our systems allow users to systematically make a trade-off between accuracy and throughput/latency and require no/minor modifications to the existing applications

    Efficient processing of similarity queries with applications

    Get PDF
    Today, a myriad of data sources, from the Internet to business operations to scientific instruments, produce large and different types of data. Many application scenarios, e.g., marketing analysis, sensor networks, and medical and biological applications, call for identifying and processing similarities in big data. As a result, it is imperative to develop new similarity query processing approaches and systems that scale from low dimensional data to high dimensional data, from single machine to clusters of hundreds of machines, and from disk-based to memory-based processing. This dissertation introduces and studies several similarity-aware query operators, analyzes and optimizes their performance. The first contribution of this dissertation is an SQL-based Similarity Group-by operator (SGB, for short) that extends the semantics of the standard SQL Group-by operator to group data with similar but not necessarily equal values. We realize these SGB operators by extending the Standard SQL Group-by and introduce two new SGB operators for multi-dimensional data. We implement and test the new SGB operators and their algorithms inside an open-source centralized database server (PostgreSQL). In the second contribution of this dissertation, we study how to efficiently process Hamming-distance-based similarity queries (Hamming-distance select and Hamming-distance join) that are crucial to many applications. We introduce a new index, termed the HA-Index, that speeds up distance comparisons and eliminates redundancies when performing the two flavors of Hamming distance range queries (namely, the selects and joins). In the third and last contribution of this dissertation, we develop a system for similarity query processing and optimization in an in-memory and distributed setup for big spatial data. We propose a query scheduler and a distributed query optimizer that use a new cost model to optimize the cost of similarity query processing in this in-memory distributed setup. The scheduler and query optimizer generates query execution plans that minimize the effect of query skew. The query scheduler employs new spatial indexing techniques based on bloom filters to forward queries to the appropriate local sites. The proposed query processing and optimization techniques are prototyped inside Spark, a distributed main-memory computation system

    맵리듀스 클러스터에서 필터링 기법을 사용한 조인 처리

    Get PDF
    학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2014. 2. 김형주.The join operation is one of the essential operations for data analysis because it is necessary to join large datasets to analyze heterogeneous data collected from different sources. MapReduce is a very useful framework for large-scale data analysis, but it is not suitable for joining multiple datasets. This is because it may produce a large number of redundant intermediate results, irrespective of the size of the joined records. Several existing approaches have been employed to improve the join performance, but they can only be used in specific circumstances or they may require multiple MapReduce jobs. To alleviate this problem, MFR-Join is proposed in this dissertation, which is a general join framework for processing equi-joins with filtering techniques in MapReduce. MFR-Join filters out redundant intermediate records within a single MapReduce job by applying filters in the map phase. To achieve this, the MapReduce framework is modified in two ways. First, map tasks are scheduled according to the processing order of the input datasets. Second, filters are created dynamically with the join keys of the datasets in a distributed manner. Various filtering techniques that support specific desirable operations can be plugged into MFR-Join. If the performance of join processing with filters is worse than that without filters, adaptive join processing methods are also proposed. The filters can be applied according to their performance, which is estimated in terms of the false positive rate. Furthermore, two map task scheduling policies are also provided: synchronous and asynchronous scheduling. The concept of filtering techniques is extended to multi-way joins. Methods for filter applications are proposed for the two types of multi-way joins: common attribute joins and distinct attribute joins. The experimental results showed that the proposed approach outperformed existing join algorithms and reduced the size of intermediate results when small portions of input datasets were joined.Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix 1 Introduction 1 1.1 Research Background and Motivation . . . . . . . . . . . . . . . . . . . . 1 1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.1 Join Processing with Filtering Techniques in MapReduce . . . . . . 4 1.2.2 Adaptive Join Processing with Filtering Techniques in MFR-Join . 5 1.2.3 Multi-way Join Processing in MFR-Join . . . . . . . . . . . . . . . 6 1.3 Dissertation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2 Preliminaries and Related Work 9 2.1 MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2 Parallel and Distributed Join Algorithms in DBMS . . . . . . . . . . . . . 11 2.3 Join Algorithms in MapReduce . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.1 Map-side joins . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3.2 Reduce-side joins . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.4 Multi-way Joins in MapReduce . . . . . . . . . . . . . . . . . . . . . . . . 17 2.5 Filtering Techniques for Join Processing . . . . . . . . . . . . . . . . . . . 19 3 MFR-Join: A General Join Framework with Filtering Techniques in MapReduce 23 3.1 MFR-Join Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.1.1 Execution Overview . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.1.2 Map Task Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.1.3 Filter Construction . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.1.4 Filtering Techniques Applicable to MFR-Join . . . . . . . . . . . . 29 3.1.5 API and Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.2 Cost Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2.1 Cost Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.2.2 Effects of the Filters . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.3.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 43 4 Adaptive Join Processing with Filtering Techniques in MFR-Join 53 4.1 Adaptive join processing in MFR-Join . . . . . . . . . . . . . . . . . . . . 54 4.1.1 Execution Overview . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.1.2 Additional Filter Operations for Adaptive Joins . . . . . . . . . . . 57 4.1.3 Early Detection of FPR Threshold Being Exceeded . . . . . . . . . 58 4.1.4 Map Task Scheduling Policies . . . . . . . . . . . . . . . . . . . . 59 4.1.5 Additional Parameters for Adaptive Joins . . . . . . . . . . . . . . 60 4.2 Join Cost and FPR Threshold Analysis . . . . . . . . . . . . . . . . . . . . 61 4.2.1 Cost of Adaptive Join . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.2.2 Effects of FPR Threshold . . . . . . . . . . . . . . . . . . . . . . . 62 4.2.3 Effects of Map Task Scheduling Policy . . . . . . . . . . . . . . . 63 4.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.3.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 65 5 Multi-way Join Processing in MFR-Join 77 5.1 Applying filters to multi-way joins . . . . . . . . . . . . . . . . . . . . . . 78 5.1.1 Common Attribute Joins . . . . . . . . . . . . . . . . . . . . . . . 79 5.1.2 Distinct Attribute Joins . . . . . . . . . . . . . . . . . . . . . . . . 80 5.1.3 General Multi-way Joins . . . . . . . . . . . . . . . . . . . . . . . 83 5.1.4 Cost Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.2 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.2.1 Partition Assignment . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.2.2 MapReduce Functions . . . . . . . . . . . . . . . . . . . . . . . . 88 5.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.3.1 Common Attribute Joins . . . . . . . . . . . . . . . . . . . . . . . 90 5.3.2 Distinct attribute joins . . . . . . . . . . . . . . . . . . . . . . . . 91 6 Conclusions and Future Work 99 6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 6.2.1 Integration with Data Warehouse Systems . . . . . . . . . . . . . . 100 6.2.2 Join-based Applications . . . . . . . . . . . . . . . . . . . . . . . 101 6.2.3 Improving Scalability . . . . . . . . . . . . . . . . . . . . . . . . . 102 References 105 Summary (in Korean) 113Docto

    The End of Slow Networks: It's Time for a Redesign

    Full text link
    Next generation high-performance RDMA-capable networks will require a fundamental rethinking of the design and architecture of modern distributed DBMSs. These systems are commonly designed and optimized under the assumption that the network is the bottleneck: the network is slow and "thin", and thus needs to be avoided as much as possible. Yet this assumption no longer holds true. With InfiniBand FDR 4x, the bandwidth available to transfer data across network is in the same ballpark as the bandwidth of one memory channel, and it increases even further with the most recent EDR standard. Moreover, with the increasing advances of RDMA, the latency improves similarly fast. In this paper, we first argue that the "old" distributed database design is not capable of taking full advantage of the network. Second, we propose architectural redesigns for OLTP, OLAP and advanced analytical frameworks to take better advantage of the improved bandwidth, latency and RDMA capabilities. Finally, for each of the workload categories, we show that remarkable performance improvements can be achieved
    corecore