Search CORE

119 research outputs found

Scalable and fault-tolerant data stream processing on multi-core architectures

Author: Theodorakis George Rafail
Publication venue: Computing, Imperial College London
Publication date: 01/06/2023
Field of study

With increasing data volumes and velocity, many applications are shifting from the classical “process-after-store” paradigm to a stream processing model: data is produced and consumed as continuous streams. Stream processing captures latency-sensitive applications as diverse as credit card fraud detection and high-frequency trading. These applications are expressed as queries of algebraic operations (e.g., aggregation) over the most recent data using windows, i.e., finite evolving views over the input streams. To guarantee correct results, streaming applications require precise window semantics (e.g., temporal ordering) for operations that maintain state. While high processing throughput and low latency are performance desiderata for stateful streaming applications, achieving both poses challenges. Computing the state of overlapping windows causes redundant aggregation operations: incremental execution (i.e., reusing previous results) reduces latency but prevents parallelization; at the same time, parallelizing window execution for stateful operations with precise semantics demands ordering guarantees and state access coordination. Finally, streams and state must be recovered to produce consistent and repeatable results in the event of failures. Given the rise of shared-memory multi-core CPU architectures and high-speed networking, we argue that it is possible to address these challenges in a single node without compromising window semantics, performance, or fault-tolerance. In this thesis, we analyze, design, and implement stream processing engines (SPEs) that achieve high performance on multi-core architectures. To this end, we introduce new approaches for in-memory processing that address the previous challenges: (i) for overlapping windows, we provide a family of window aggregation techniques that enable computation sharing based on the algebraic properties of aggregation functions; (ii) for parallel window execution, we balance parallelism and incremental execution by developing abstractions for both and combining them to a novel design; and (iii) for reliable single-node execution, we enable strong fault-tolerance guarantees without sacrificing performance by reducing the required disk I/O bandwidth using a novel persistence model. We combine the above to implement an SPE that processes hundreds of millions of tuples per second with sub-second latencies. These results reveal the opportunity to reduce resource and maintenance footprint by replacing cluster-based SPEs with single-node deployments.Open Acces

Spiral - Imperial College Digital Repository

Annotation of logic programs for independent AND-Parallelism by partial evaluation

Author: Benkerimi
Gallagher
Gallagher
GERMAN VIDAL
Jones
Jones
Leuschel
Sahlin
Sperber
Surati
Vidal
Publication venue: 'Cambridge University Press (CUP)'
Publication date: 01/07/2012
Field of study

Traditional approaches to automatic AND-parallelization of logic programs rely on some static analysis to identify independent goals that can be safely and efficiently run in parallel in any possible execution. In this paper, we present a novel technique for generating annotations for independent AND-parallelism that is based on partial evaluation. Basically, we augment a simple partial evaluation procedure with (run-time) groundness and variable sharing information so that parallel conjunctions are added to the residual clauses when the conditions for independence are met. In contrast to previous approaches, our partial evaluator is able to transform the source program in order to expose more opportunities for parallelism. To the best of our knowledge, we present the first approach to a parallelizing partial evaluator.This work has been partially supported by the Spanish Ministerio de Economia y Competitividad (Secretaria de Estado de Investigacion, Desarrollo e Innovacion) under grant TIN2008-06622-C03-02 and by the Generalitat Valenciana under grant PROMETEO/2011/052.Vidal Oriola, GF. (2012). Annotation of logic programs for independent AND-Parallelism by partial evaluation. Theory and Practice of Logic Programming. 12(4-5):583-600. https://doi.org/10.1017/S1471068412000191S583600124-5Gras, D. C., & Hermenegildo, M. V. (2009). Non-strict independence-based program parallelization using sharing and freeness information. Theoretical Computer Science, 410(46), 4704-4723. doi:10.1016/j.tcs.2009.07.044Muthukumar, K., Bueno, F., García de la Banda, M., & Hermenegildo, M. (1999). Automatic compile-time parallelization of logic programs for restricted, goal level, independent and parallelism. The Journal of Logic Programming, 38(2), 165-218. doi:10.1016/s0743-1066(98)10022-5Leuschel, M., Elphick, D., Varea, M., Craig, S.-J., & Fontaine, M. (2006). The Ecce and Logen partial evaluators and their web interfaces. Proceedings of the 2006 ACM SIGPLAN symposium on Partial evaluation and semantics-based program manipulation - PEPM ’06. doi:10.1145/1111542.1111557SWI Prolog 2012. URL: http://www.swi-prolog.org/.Consel, C., & Danvy, O. (1993). Partial evaluation in parallel. Lisp and Symbolic Computation, 5(4), 327-342. doi:10.1007/bf01806309De Schreye, D., Glück, R., Jørgensen, J., Leuschel, M., Martens, B., & Sørensen, M. H. (1999). Conjunctive partial deduction: foundations, control, algorithms, and experiments. The Journal of Logic Programming, 41(2-3), 231-277. doi:10.1016/s0743-1066(99)00030-8Debois, S. (2004). Imperative program optimization by partial evaluation. Proceedings of the 2004 ACM SIGPLAN symposium on Partial evaluation and semantics-based program manipulation - PEPM ’04. doi:10.1145/1014007.1014019Debray, S. K. (1989). Static inference of modes and data dependencies in logic programs. ACM Transactions on Programming Languages and Systems, 11(3), 418-450. doi:10.1145/65979.65983Gallagher, J. P. (1993). Tutorial on specialisation of logic programs. Proceedings of the 1993 ACM SIGPLAN symposium on Partial evaluation and semantics-based program manipulation - PEPM ’93. doi:10.1145/154630.154640Gurr C. 1994. A Self-Applicable Partial Evaluator for the Logic Programming Language Goedel. PhD Thesis, Department of Computer Science, University of Bristol.Jones, N. D. (2004). Transformation by interpreter specialisation. Science of Computer Programming, 52(1-3), 307-339. doi:10.1016/j.scico.2004.03.010Leuschel, M., & Vidal, G. (2009). Fast Offline Partial Evaluation of Large Logic Programs. Lecture Notes in Computer Science, 119-134. doi:10.1007/978-3-642-00515-2_9Lloyd, J. W., & Shepherdson, J. C. (1991). Partial evaluation in logic programming. The Journal of Logic Programming, 11(3-4), 217-242. doi:10.1016/0743-1066(91)90027-

arXiv.org e-Print Archive

Crossref

RiuNet

Recommended from our members

GPU-Acceleration of In-Memory Data Analytics

Author: Sitaridi Evangelia
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2016
Field of study

Hardware advances strongly influence the database system design. The flattening speed of CPU cores makes many-core accelerators, such as GPUs, a vital alternative to explore for processing the ever-increasing amounts of data. GPUs have a significantly higher degree of parallelism than multi-core CPUs but their cores are simpler. As a result, they do not face the power constraints limiting the parallelism of CPUs. Their trade-off, however, is the increased implementation complexity. This thesis adapts and redesigns data analytics operators to better exploit the GPU special memory and threading model. Due to the increasing memory capacity and also the user's need for fast interaction with the data, we focus on in-memory analytics. Our techniques span different steps of the data processing pipeline: (1) Data preprocessing, (2) Query compilation, and (3) Algorithmic optimization of the operators. Our data preprocessing techniques adapt the data layout for numeric and string columns to maximize the achieved GPU memory bandwidth. Our query compilation techniques compute the optimal execution plan for conjunctive filters. We formulate \textit{memory divergence} for string matching algorithms and suggest how to eliminate it. Finally, we parallelize decompression algorithms in our compression framework \textit{Gompresso} to fit more data into the limited GPU memory. Gompresso achieves high speed-ups on GPUs over multi-core CPU state-of-the-art libraries and is suitable for any massively parallel processor

Columbia University Academic Commons

Acceleration of Computational Geometry Algorithms for High Performance Computing Based Geo-Spatial Big Data Analysis

Author: Paudel Anmol
Publication venue: e-Publications@Marquette
Publication date: 01/04/2022
Field of study

Geo-Spatial computing and data analysis is the branch of computer science that deals with real world location-based data. Computational geometry algorithms are algorithms that process geometry/shapes and is one of the pillars of geo-spatial computing. Real world map and location-based data can be huge in size and the data structures used to process them extremely big leading to huge computational costs. Furthermore, Geo-Spatial datasets are growing on all V’s (Volume, Variety, Value, etc.) and are becoming larger and more complex to process in-turn demanding more computational resources. High Performance Computing is a way to breakdown the problem in ways that it can run in parallel on big computers with massive processing power and hence reduce the computing time delivering the same results but much faster.This dissertation explores different techniques to accelerate the processing of computational geometry algorithms and geo-spatial computing like using Many-core Graphics Processing Units (GPU), Multi-core Central Processing Units (CPU), Multi-node setup with Message Passing Interface (MPI), Cache optimizations, Memory and Communication optimizations, load balancing, Algorithmic Modifications, Directive based parallelization with OpenMP or OpenACC and Vectorization with compiler intrinsic (AVX). This dissertation has applied at least one of the mentioned techniques to the following problems. Novel method to parallelize plane sweep based geometric intersection for GPU with directives is presented. Parallelization of plane sweep based Voronoi construction, parallelization of Segment tree construction, Segment tree queries and Segment tree-based operations has been presented. Spatial autocorrelation, computation of getis-ord hotspots are also presented. Acceleration performance and speedup results are presented in each corresponding chapter

epublications@Marquette

SCABBARD: single-node fault-tolerant stream processing

Author: Kounelis F
Pietzuch P
Pirk H
Theodorakis G
Publication venue: VLDB Endowment
Publication date: 01/09/2022
Field of study

Single-node multi-core stream processing engines (SPEs) can process hundreds of millions of tuples per second. Yet making them fault-tolerant with exactly-once semantics while retaining this performance is an open challenge: due to the limited I/O bandwidth of a single-node, it becomes infeasible to persist all stream data and operator state during execution. Instead, single-node SPEs rely on upstream distributed systems, such as Apache Kafka, to recover stream data after failure, necessitating complex cluster-based deployments. This lack of built-in fault-tolerance features has hindered the adoption of single-node SPEs.We describe Scabbard, the first single-node SPE that supports exactly-once fault-tolerance semantics despite limited local I/O bandwidth. Scabbard achieves this by integrating persistence operations with the query workload. Within the operator graph, Scabbard determines when to persist streams based on the selectivity of operators: by persisting streams after operators that discard data, it can substantially reduce the required I/O bandwidth. As part of the operator graph, Scabbard supports parallel persistence operations and uses markers to decide when to discard persisted data. The persisted data volume is further reduced using workload-specific compression: Scabbard monitors stream statistics and dynamically generates computationally efficient compression operators. Our experiments show that Scabbard can execute stream queries that process over 200 million tuples per second while recovering from failures with sub-second latencies

Spiral - Imperial College Digital Repository

Cost-Based Optimization of Integration Flows

Author: Böhm Matthias
Publication venue
Publication date: 15/03/2011
Field of study

Integration flows are increasingly used to specify and execute data-intensive integration tasks between heterogeneous systems and applications. There are many different application areas such as real-time ETL and data synchronization between operational systems. For the reasons of an increasing amount of data, highly distributed IT infrastructures, and high requirements for data consistency and up-to-dateness of query results, many instances of integration flows are executed over time. Due to this high load and blocking synchronous source systems, the performance of the central integration platform is crucial for an IT infrastructure. To tackle these high performance requirements, we introduce the concept of cost-based optimization of imperative integration flows that relies on incremental statistics maintenance and inter-instance plan re-optimization. As a foundation, we introduce the concept of periodical re-optimization including novel cost-based optimization techniques that are tailor-made for integration flows. Furthermore, we refine the periodical re-optimization to on-demand re-optimization in order to overcome the problems of many unnecessary re-optimization steps and adaptation delays, where we miss optimization opportunities. This approach ensures low optimization overhead and fast workload adaptation

Technische Universität Dresden: Qucosa

10381 Summary and Abstracts Collection -- Robust Query Processing

Author: Kuno Harumi Anne
Markl Volker
Sattler Kai-Uwe
Publication venue: Dagstuhl Seminar Proceedings. 10381 - Robust Query Processing
Publication date: 01/01/2011
Field of study

Dagstuhl seminar 10381 on robust query processing (held 19.09.10 - 24.09.10) brought together a diverse set of researchers and practitioners with a broad range of expertise for the purpose of fostering discussion and collaboration regarding causes, opportunities, and solutions for achieving robust query processing. The seminar strove to build a unified view across the loosely-coupled system components responsible for the various stages of database query processing. Participants were chosen for their experience with database query processing and, where possible, their prior work in academic research or in product development towards robustness in database query processing. In order to pave the way to motivate, measure, and protect future advances in robust query processing, seminar 10381 focused on developing tests for measuring the robustness of query processing. In these proceedings, we first review the seminar topics, goals, and results, then present abstracts or notes of some of the seminar break-out sessions. We also include, as an appendix, the robust query processing reading list that was collected and distributed to participants before the seminar began, as well as summaries of a few of those papers that were contributed by some participants

Dagstuhl Research Online Publication Server

Cache-Efficient Aggregation: Hashing Is Sorting

Author: Färber Franz
Lacurie Arnaud
Lehner Wolfgang
Müller Ingo
Sanders Peter
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 14/06/2022
Field of study

For decades researchers have studied the duality of hashing and sorting for the implementation of the relational operators, especially for efficient aggregation. Depending on the underlying hardware and software architecture, the specifically implemented algorithms, and the data sets used in the experiments, different authors came to different conclusions about which is the better approach. In this paper we argue that in terms of cache efficiency, the two paradigms are actually the same. We support our claim by showing that the complexity of hashing is the same as the complexity of sorting in the external memory model. Furthermore we make the similarity of the two approaches obvious by designing an algorithmic framework that allows to switch seamlessly between hashing and sorting during execution. The fact that we mix hashing and sorting routines in the same algorithmic framework allows us to leverage the advantages of both approaches and makes their similarity obvious. On a more practical note, we also show how to achieve very low constant factors by tuning both the hashing and the sorting routines to modern hardware. Since we observe a complementary dependency of the constant factors of the two routines to the locality of the input, we exploit our framework to switch to the faster routine where appropriate. The result is a novel relational aggregation algorithm that is cache-efficient---independently and without prior knowledge of input skew and output cardinality---, highly parallelizable on modern multi-core systems, and operating at a speed close to the memory bandwidth, thus outperforming the state-of-the-art by up to 3.7x

Qucosa

HSSS - Hochschulschriftenserver der SLUB

Technische Universität Dresden: Qucosa

Enhancing productivity and performance portability of opencl applications on heterogeneous systems using runtime optimizations

Author: Lutz Thibaut
Publication venue: The University of Edinburgh
Publication date: 29/06/2015
Field of study

Initially driven by a strong need for increased computational performance in science and engineering, heterogeneous systems have become ubiquitous and they are getting increasingly complex. The single processor era has been replaced with multi-core processors, which have quickly been surrounded by satellite devices aiming to increase the throughput of the entire system. These auxiliary devices, such as Graphics Processing Units, Field Programmable Gate Arrays or other specialized processors have very different architectures. This puts an enormous strain on programming models and software developers to take full advantage of the computing power at hand. Because of this diversity and the unachievable flexibility and portability necessary to optimize for each target individually, heterogeneous systems remain typically vastly under-utilized. In this thesis, we explore two distinct ways to tackle this problem. Providing automated, non intrusive methods in the form of compiler tools and implementing efficient abstractions to automatically tune parameters for a restricted domain are two complementary approaches investigated to better utilize compute resources in heterogeneous systems. First, we explore a fully automated compiler based approach, where a runtime system analyzes the computation flow of an OpenCL application and optimizes it across multiple compute kernels. This method can be deployed on any existing application transparently and replaces significant software engineering effort spent to tune application for a particular system. We show that this technique achieves speedups of up to 3x over unoptimized code and an average of 1.4x over manually optimized code for highly dynamic applications. Second, a library based approach is designed to provide a high level abstraction for complex problems in a specific domain, stencil computation. Using domain specific techniques, the underlying framework optimizes the code aggressively. We show that even in a restricted domain, automatic tuning mechanisms and robust architectural abstraction are necessary to improve performance. Using the abstraction layer, we demonstrate strong scaling of various applications to multiple GPUs with a speedup of up to 1.9x on two GPUs and 3.6x on four

Edinburgh Research Archive