    GreedyDual-Join: Locality-Aware Buffer Management for Approximate Join Processing Over Data Streams

    We investigate adaptive buffer management techniques for approximate evaluation of sliding-window joins over multiple data streams. In many applications, data stream processing systems have limited memory or must handle very high-speed streams. In both cases, computing the exact join results may not be feasible, mainly because the buffers used to compute the joins hold far fewer tuples than the sliding windows contain, so a stream buffer management policy is required. We show that the buffer replacement policy is an important determinant of the quality of the produced results. To that end, we propose GreedyDual-Join (GDJ), an adaptive, locality-aware technique for managing these buffers. GDJ exploits temporal correlations (at both long and short time scales), which we found to be prevalent in many real data streams. The algorithm is readily applicable to multiple data streams and multiple joins and requires almost no additional system resources. We report the results of an experimental study using both synthetic and real-world data sets, demonstrating the superiority and flexibility of our approach compared to other recently proposed techniques.
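
    The GreedyDual-style replacement idea described above can be sketched in a few lines. The code below is a generic GreedyDual cache adapted to a join buffer, not the paper's GDJ algorithm: the per-tuple benefit (a fixed credit restored on each probe match) and the class interface are assumptions made for illustration.

```python
# Minimal sketch of a GreedyDual-style replacement policy for a stream-join
# buffer. Illustrative assumptions (not from the paper): each buffered tuple
# carries a credit that is restored whenever the tuple produces a match, and
# the tuple with the lowest credit is evicted when the buffer is full.

class GreedyDualBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.credit = {}      # tuple key -> current credit H
        self.tuples = {}      # tuple key -> tuple payload
        self.inflation = 0.0  # global aging offset L

    def on_match(self, key, benefit=1.0):
        # A probe hit suggests temporal locality: restore the tuple's credit.
        if key in self.credit:
            self.credit[key] = self.inflation + benefit

    def insert(self, key, payload, benefit=1.0):
        if len(self.tuples) >= self.capacity:
            # Evict the lowest-credit tuple and age the survivors by raising
            # the global offset to the evicted credit (classic GreedyDual).
            victim = min(self.credit, key=self.credit.get)
            self.inflation = self.credit.pop(victim)
            self.tuples.pop(victim)
        self.credit[key] = self.inflation + benefit
        self.tuples[key] = payload
```

    Tuples that stop matching keep their old credit while the offset rises, so they naturally become the next eviction candidates; this is one way temporal locality in the stream can be rewarded.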

    Processing Exact Results for Queries over Data Streams

    In a growing number of information-processing applications, such as network-traffic monitoring, sensor networks, financial analysis, and data mining for e-commerce, data takes the form of continuous data streams rather than traditional stored relations. These applications share common features: the need for real-time analysis, huge volumes of data, and unpredictable, bursty arrivals of stream elements. In all of these applications, it is infeasible to process queries over data streams by loading the data into a traditional database management system (DBMS) or into main memory; such an approach does not scale with high stream rates. As a consequence, systems that can manage streaming data have gained tremendous importance. The need to process a large number of continuous queries over bursty, high-volume online data streams, potentially in real time, makes it imperative to design algorithms that use limited resources. This dissertation focuses on producing exact results for join queries over high-speed data streams using limited resources, and proposes several novel techniques that incorporate secondary storage and non-dedicated computers. Existing approaches to stream joins either (a) deal with memory limitations by shedding load, and therefore cannot produce exact or highly accurate results over streams with time-varying tuple arrivals, or (b) suffer from large I/O overheads due to random disk accesses. The proposed techniques exploit the high bandwidth of the disk subsystem by rendering the data access pattern largely sequential, eliminating small, random disk accesses. The dissertation proposes an I/O-efficient algorithm to process hybrid join queries that join a fast, time-varying or bursty data stream with a persistent disk relation; such a hybrid join is the crux of a number of common transformations in an active data warehouse. Experimental results demonstrate that the proposed scheme reduces response time by exploiting spatio-temporal locality within the input stream and minimizes disk overhead through I/O amortization. The dissertation also proposes an algorithm to parallelize a stream join operator over a shared-nothing system. The proposed algorithm distributes the processing load across a number of independent, non-dedicated nodes based on a fixed or predefined communication pattern; dynamically maintains the degree of declustering in order to minimize communication and processing overheads; and provides mechanisms for reducing storage and communication overheads while scaling to a large number of nodes. We present experimental results showing the efficacy of the proposed algorithms.
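
    The I/O-amortization idea behind the hybrid join can be illustrated with a rough sketch: buffer a batch of stream tuples, then serve the whole batch with one sequential scan of the disk relation. The function names, dictionary-shaped tuples, and batching policy below are assumptions made for the example; this is not the dissertation's actual algorithm.

```python
# Rough sketch of stream/relation hybrid-join amortization: one sequential
# scan of the disk relation is shared by an entire batch of buffered stream
# tuples, replacing many small random reads with large sequential ones.
# Names and tuple layout are illustrative assumptions.

def hybrid_join(stream_batches, scan_relation_pages, join_key):
    """stream_batches: iterable of lists of stream tuples (dicts).
    scan_relation_pages: callable returning an iterator over pages, where
    each page is a list of relation tuples read sequentially from disk."""
    for batch in stream_batches:
        # Hash the buffered stream batch once per relation scan.
        index = {}
        for s in batch:
            index.setdefault(s[join_key], []).append(s)
        # One sequential pass over the relation serves the whole batch.
        for page in scan_relation_pages():
            for r in page:
                for s in index.get(r[join_key], []):
                    yield {**s, **r}
```

    Larger batches amortize each scan over more stream tuples at the cost of higher result latency, which is the kind of trade-off the abstract alludes to when it mentions exploiting locality while amortizing disk I/O.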

    Metadata-Aware Query Processing over Data Streams

    Many modern applications need to process queries over potentially infinite data streams to provide answers in real time. This dissertation proposes novel techniques to optimize CPU and memory utilization in stream processing by exploiting metadata on streaming data or queries. It focuses on four topics: 1) exploiting stream metadata to optimize SPJ query operators via operator configuration, 2) exploiting stream metadata to optimize SPJ query plans via query rewriting, 3) exploiting workload metadata to optimize parameterized queries via indexing, and 4) exploiting event constraints to optimize event stream processing via run-time early termination. The first part of this dissertation proposes algorithms for one of the most common and expensive query operators, the join, that identify and purge no-longer-needed data from the operator state at runtime based on punctuations; exploiting the combination of punctuations and commonly used window constraints is also studied. Extensive experimental evaluations demonstrate both reduced memory usage and improved execution time due to the proposed strategies. The second part proposes herald-driven runtime query plan optimization techniques. We identify four query optimization techniques and design a lightweight algorithm that efficiently detects optimization opportunities at runtime upon receiving heralds. We also propose a novel execution paradigm that supports multiple concurrent logical plans by maintaining a single physical plan. An extensive experimental study confirms that our techniques significantly reduce query execution times. The third part deals with the shared execution of parameterized queries instantiated from a query template. We design a lightweight index mechanism that provides multiple access paths to the data to facilitate a wide range of parameterized queries. To withstand workload fluctuations, we propose an index tuning framework that adjusts the index configurations in a timely manner. Extensive experimental evaluations demonstrate the effectiveness of the proposed strategies. The last part proposes event query optimization techniques that exploit event constraints, such as exclusiveness or ordering relationships among events, extracted from workflows. Our constraint-aware event processing techniques are shown to achieve significant performance gains.
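
    The punctuation-based state purging mentioned in the first part can be sketched as follows. The class below is a simplified, hypothetical join-state store, not the dissertation's implementation; it assumes a punctuation declares that a given join-key value will never appear again on the opposite stream, so state kept only to match that value can be dropped before any window constraint would expire it.

```python
# Simplified sketch of punctuation-driven purging for one side of a
# symmetric hash join. Assumption (illustrative): a punctuation on the
# opposite stream guarantees that the given join-key value will not occur
# again, so buffered tuples for that key can no longer produce matches.

class PunctuatedJoinState:
    def __init__(self):
        self.state = {}  # join key -> list of buffered tuples

    def insert(self, key, tup):
        self.state.setdefault(key, []).append(tup)

    def probe(self, key):
        # Return all buffered tuples that join with the probing tuple's key.
        return list(self.state.get(key, []))

    def on_punctuation(self, key):
        # No future matches are possible for this key: purge its state early,
        # freeing memory before the window constraint alone would have.
        self.state.pop(key, None)
```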

    Memory-Limited Execution of Windowed Stream Joins


    Approximate Data Analytics Systems

    Today, most modern online services make use of big data analytics systems to extract useful information from raw digital data. The data normally arrives as a continuous stream, at high speed and in huge volumes, and the cost of handling this massive data can be significant. Providing interactive latency is often impractical because data volumes grow exponentially, even faster than Moore's law predicts. To overcome this problem, approximate computing has recently emerged as a promising solution. Approximate computing is based on the observation that many modern applications are amenable to an approximate rather than an exact output. Unlike traditional computing, approximate computing tolerates lower accuracy to achieve lower latency by computing over a partial subset of the input data instead of the entire input. Unfortunately, advancements in approximate computing are primarily geared towards batch analytics and cannot provide low-latency guarantees in the context of stream processing, where new data continuously arrives as an unbounded stream. In this thesis, we design and implement approximate computing techniques for processing and interacting with high-speed, large-scale stream data to achieve low latency and efficient utilization of resources. To achieve these goals, we have designed and built the following approximate data analytics systems:
    • StreamApprox, a data stream analytics system for approximate computing. It supports approximate computing for low-latency stream analytics in a transparent way and can adapt to rapid fluctuations of the input data streams. For this system, we designed an online adaptive stratified reservoir sampling algorithm that produces approximate output with bounded error.
    • IncApprox, a data analytics system for incremental approximate computing. It combines approximate and incremental computing in stream processing to achieve high throughput and low latency with efficient resource utilization. For this system, we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error.
    • PrivApprox, a data stream analytics system for privacy-preserving and approximate computing. It supports high-utility, low-latency data analytics while preserving users' privacy, based on the combination of privacy-preserving data analytics and approximate computing.
    • ApproxJoin, an approximate distributed join system. It improves the performance of joins, which are critical but expensive operations in big data systems. Here we employed a sketching technique (Bloom filters) to avoid shuffling non-joinable data items through the network (a simplified sketch of this filtering step appears below), and we proposed a novel sampling mechanism that executes during the join to obtain an unbiased, representative sample of the join output.
    Our evaluation based on micro-benchmarks and real-world case studies shows that these systems achieve significant performance speedups compared to state-of-the-art systems while tolerating negligible accuracy loss in the analytics output. In addition, our systems allow users to systematically trade off accuracy against throughput and latency, and require no or only minor modifications to existing applications.
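
    The Bloom-filter pre-filtering step used by ApproxJoin can be sketched generically as below. The filter sizing, hash construction, and prefilter helper are illustrative assumptions rather than the system's actual implementation, and the join-time sampling step is omitted.

```python
# Generic sketch of Bloom-filter pre-filtering before a distributed join:
# each side builds a filter over its join keys and ships it to the other
# side, which then shuffles only tuples whose keys may have a match.
# False positives are possible, false negatives are not, so no join
# results are lost. Sizing and hashing below are illustrative choices.

import hashlib

class BloomFilter:
    def __init__(self, size=1 << 16, num_hashes=3):
        self.size, self.num_hashes = size, num_hashes
        self.bits = bytearray(size)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

def prefilter(partition, other_side_filter, key_fn):
    # Keep only tuples whose join keys may appear on the other side,
    # avoiding the network shuffle of clearly non-joinable items.
    return [t for t in partition if other_side_filter.might_contain(key_fn(t))]
```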

    A Survey on the Evolution of Stream Processing Systems

    Stream processing has been an active research field for more than 20 years, but it is now experiencing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamental aspects of stream processing systems and their evolution in the functional areas of out-of-order data management, state management, fault tolerance, high availability, load management, elasticity, and reconfiguration. We review noteworthy past research findings, outline the similarities and differences between early ('00-'10) and modern ('11-'18) streaming systems, and discuss recent trends and open problems.

    Performance Evaluation and Benchmarking of Event Processing Systems

    Doctoral thesis in Information Sciences and Technologies presented to the Faculty of Sciences and Technology of the University of Coimbra. This thesis aims at studying, comparing, and improving the performance and scalability of event processing (EP) systems. In the last 15 years, event processing systems have gained increasing attention from academia and industry, having found application in a number of mission-critical scenarios and having motivated several research projects and specialized startups. Nonetheless, there has been a general lack of information, evaluation methodologies, and tools concerning the performance of EP platforms. Until recently, it was not clear which factors impact their performance most, whether the systems scale well and adapt to changes in load conditions, or whether they have any serious limitations. Moreover, the lack of standardized benchmarks hindered any objective comparison among the diverse platforms. In this thesis, we tackle these problems on several fronts. First, we developed FINCoS, a set of benchmarking tools for load generation and performance measurement of event processing systems. The framework was designed to be independent of any particular workload or product so that it can be reused in multiple performance studies and benchmark kits. FINCoS has been made publicly available under the terms of the GNU General Public License and is currently hosted in the Standard Performance Evaluation Corporation (SPEC) repository of peer-reviewed tools for quantitative system evaluation and analysis. We then defined a set of microbenchmarks and used them to conduct an extensive performance study of three EP systems. This analysis helped identify critical factors affecting the performance of event processing platforms and exposed important limitations of the products, such as poor resource utilization, thrashing or failures under memory shortage, and absent or incipient query-plan sharing capabilities. With these results in hand, we moved our focus to performance enhancement. To improve resource utilization, we proposed novel algorithms and evaluated alternative data organization schemes that not only substantially reduce memory consumption but are also significantly more efficient at the microarchitectural level. Our experimental evaluation corroborated the efficacy of the proposed optimizations: together they provided a 6-fold reduction in memory usage and an order-of-magnitude increase in query throughput. In addition, we addressed the problem of memory-constrained applications by introducing SlideM, an optimal buffer management algorithm that selectively offloads sliding-window state to disk when main memory becomes insufficient. We also developed a strategy based on SlideM to share computational resources when processing multiple aggregation queries over overlapping sliding windows. Our experimental results demonstrate that, contrary to common intuition, storing window data on disk can be appropriate even for applications with very high event arrival rates. We concluded this thesis by proposing the Pairs benchmark, designed to assess the ability of EP platforms to answer quickly while processing increasingly large numbers of simultaneous queries at increasingly high event arrival rates. The benchmark workload exercises operations that appear repeatedly in real event processing applications, including event filtering, aggregation, correlation, and pattern detection. Furthermore, unlike previous proposals in related areas, Pairs allows evaluating other fundamental aspects of event processing systems, such as adaptivity and scalability with respect to the number of queries. In general, we hope that the findings and proposals presented in this thesis broaden the understanding of the performance of event processing platforms and open avenues for further improvements in the current generation of EP systems. FCT Nº 45121/200
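
    The window-offloading idea attributed to SlideM can be illustrated with a minimal sketch that spills the oldest segment of a sliding window to a temporary file using sequential writes once an in-memory budget is exceeded. The segmenting, spill schedule, and class interface are assumptions made for the example, not the algorithm's actual (optimal) strategy.

```python
# Minimal sketch of spilling sliding-window state to disk under memory
# pressure. Illustrative assumptions: the oldest tuples are spilled first,
# segments are written and read back strictly sequentially, and a full
# read-back only happens when the entire window must be processed.

import pickle
import tempfile

class SpillableWindow:
    def __init__(self, mem_budget, segment_size=1024):
        self.mem_budget = mem_budget           # max tuples kept in memory
        self.segment_size = segment_size       # tuples per spilled segment
        self.memory = []                       # newest (in-memory) tuples
        self.spill = tempfile.TemporaryFile()  # sequentially written segments
        self.spilled_segments = 0

    def append(self, tup):
        self.memory.append(tup)
        if len(self.memory) > self.mem_budget:
            # One sequential write moves the oldest segment to disk.
            segment = self.memory[:self.segment_size]
            self.memory = self.memory[self.segment_size:]
            pickle.dump(segment, self.spill)
            self.spilled_segments += 1

    def full_window(self):
        # Sequentially read spilled segments back for a full-window pass.
        self.spill.seek(0)
        tuples = []
        for _ in range(self.spilled_segments):
            tuples.extend(pickle.load(self.spill))
        return tuples + self.memory
```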