
    Recurring Query Processing on Big Data

    Advances in hardware, software, and networks have enabled applications, from business enterprises and scientific and engineering disciplines to social networks, to generate data at unprecedented volume, variety, velocity, and veracity. Innovation in these domains is now hindered by the ability to analyze and discover knowledge from the collected data in a timely and scalable fashion. To facilitate such large-scale big data analytics, the MapReduce computing paradigm and its open-source implementation Hadoop are among the most popular and widely used technologies. Hadoop's success as a competitor to traditional parallel database systems lies in its simplicity, ease of use, flexibility, automatic fault tolerance, superior scalability, and cost effectiveness, owing to its use of inexpensive commodity hardware that can scale to petabytes of data over thousands of machines. Recurring queries, repeatedly executed for long periods of time over rapidly evolving high-volume data, have become a bedrock component of most of these analytic applications. Efficient execution and optimization techniques must be designed to ensure the responsiveness and scalability of these recurring queries. This dissertation thoroughly investigates topics in the area of recurring query processing on big data.

    First, we propose a novel scalable infrastructure called Redoop that treats recurring queries over big evolving data as first-class citizens during query processing. This is in contrast to state-of-the-art MapReduce/Hadoop systems, which experience significant challenges when faced with recurring queries, including redundant computations, significant latencies, and huge application development efforts. Redoop offers innovative window-aware optimization techniques for recurring query execution, including adaptive window-aware data partitioning, window-aware task scheduling, and inter-window caching mechanisms. Redoop also retains the fault tolerance of MapReduce via automatic cache recovery and task re-execution support.

    Second, we address the crucial need to accommodate hundreds or even thousands of recurring analytics queries that periodically execute over frequently updated data sets, e.g., the latest stock transactions, new log files, or recent news feeds. For many applications, such recurring queries come with user-specified service-level agreements (SLAs), commonly expressed as the maximum allowed latency for producing results before their merits decay. On top of Redoop, we built Helix, a scalable multi-query sharing engine tailored for recurring workloads in the MapReduce infrastructure. Helix deploys new sliced window-alignment techniques to create sharing opportunities among recurring queries without introducing additional I/O overheads or unnecessary data scans. Furthermore, Helix introduces a cost/benefit model for creating a sharing plan among the recurring queries, and a scheduling strategy for executing them so as to maximize SLA satisfaction.

    Third, recurring analytics queries tend to be expensive, especially when query processing consumes data sets in the hundreds of terabytes or more. Time-sensitive recurring queries, such as fraud detection, often come with tight response-time constraints as query deadlines. Data sampling is a popular technique for computing approximate results with an acceptable error bound while reducing high-demand resource consumption and thus improving query turnaround times. We propose Faro, the first fast approximate query engine for recurring workloads in the MapReduce infrastructure. Faro introduces two key innovations: (1) a deadline-aware sampling strategy that builds samples from the original data with reduced sample sizes compared to uniform sampling, and (2) adaptive resource allocation strategies that maximally improve the approximate results while still meeting the response-time requirements specified in recurring queries. In a comprehensive experimental study of each part of this dissertation, we demonstrate the superiority of the proposed strategies over state-of-the-art techniques in scalability, effectiveness, and robustness.
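
    To make the inter-window caching idea concrete, the following is a minimal Python sketch, not Redoop's actual MapReduce implementation: a recurring windowed average caches per-pane partial aggregates so that consecutive executions only process newly arrived panes. The PaneCache class and the fixed pane granularity are illustrative assumptions.

        from collections import OrderedDict

        PANE = 10  # pane length, one pane per slide interval (illustrative choice)

        class PaneCache:
            """Partial (sum, count) aggregates keyed by pane start, kept across runs."""
            def __init__(self):
                self.partials = OrderedDict()

            def evict_before(self, t):
                for start in [s for s in self.partials if s < t]:
                    del self.partials[start]

        def pane_start(ts):
            return (ts // PANE) * PANE

        def run_window_avg(new_records, window_start, cache):
            """One execution of the recurring query over the window at window_start.

            Only newly arrived records are folded into per-pane partials; panes
            cached by earlier executions are reused, and expired panes evicted."""
            for ts, value in new_records:
                s, c = cache.partials.get(pane_start(ts), (0.0, 0))
                cache.partials[pane_start(ts)] = (s + value, c + 1)
            cache.evict_before(window_start)
            total = sum(s for s, _ in cache.partials.values())
            count = sum(c for _, c in cache.partials.values())
            return total / count if count else None

        # Two consecutive executions: the second reuses the overlapping panes
        # and only folds in the newly arrived slice of the stream.
        cache = PaneCache()
        print(run_window_avg([(0, 1.0), (12, 3.0), (25, 5.0)], 0, cache))   # 3.0
        print(run_window_avg([(31, 7.0)], 10, cache))                       # 5.0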

    A PROCRUSTEAN APPROACH TO STREAM PROCESSING

    The increasing demand for real-time data processing and the constantly growing data volume have contributed to the rapid evolution of Stream Processing Engines (SPEs), which are designed to continuously process data as it arrives. Low operational cost and timely delivery of results are both objectives of paramount importance for SPEs. Given the volatile and uncharted nature of data streams, achieving these goals under fixed resources is a challenge, which calls for adaptable SPEs that can react to fluctuations in processing demands. In the past, three techniques have been developed to improve an SPE's ability to adapt, classified by whether applications require exact or approximate results: stream partitioning and re-partitioning target exact processing, while load shedding targets approximate processing. Stream partitioning strives to balance load among processors, but previous techniques neglected hidden costs of distributed execution. Load shedding lowers the accuracy of results by dropping part of the input, but previous techniques did not cope with evolving streams. Stream re-partitioning reconfigures execution while processing takes place, but previous techniques did not fully utilize window semantics. In this dissertation, we put stream processing in a procrustean bed, in terms of the manner in which and the degree to which processing takes place. To this end, we present new approaches for window-based aggregate operators that are applicable to both exact and approximate stream processing in modern SPEs. Our stream partitioning, re-partitioning, and load shedding solutions offer improvements in performance and accuracy on real-world data by exploiting the semantics of both data and operations. In addition, we present SPEAr, the design of an SPE that accelerates processing by delivering approximate results with accuracy guarantees while avoiding unnecessary load. Finally, we contribute a hybrid technique, ShedPart, which can further improve the load balance and performance of an SPE.
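
    As a concrete illustration of the load-shedding trade-off, here is a minimal Python sketch, a generic random-shedding estimator rather than the dissertation's semantics-aware techniques: it estimates a window SUM while dropping a fraction of tuples and scaling the survivors to keep the estimate unbiased.

        import random

        def shed_and_sum(window, shed_fraction, rng=random.random):
            """Estimate a window SUM while dropping a fraction of the input.

            Each tuple survives with probability keep = 1 - shed_fraction and
            the result is scaled by 1/keep, so the estimate stays unbiased
            (a Horvitz-Thompson style scale-up)."""
            keep = 1.0 - shed_fraction
            if keep <= 0:
                raise ValueError("cannot shed the entire window")
            kept = [v for v in window if rng() < keep]
            return sum(kept) / keep

        random.seed(7)
        window = list(range(1, 1001))        # true sum = 500500
        print(shed_and_sum(window, 0.5))     # approximate sum under 50% shedding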

    Benchmarking RDF Storage Engines

    In this deliverable, we present version V1.0 of SRBench, the first benchmark for streaming RDF engines, designed in the context of Task 1.4 of PlanetData and based entirely on real-world datasets. Faced with the growing problem of having too much streaming data but not enough knowledge, researchers have sought solutions in which Semantic Web technologies are adapted and extended for publishing, sharing, analysing, and understanding such data, and various approaches are emerging. To help researchers and users compare streaming RDF engines in a standardised application scenario, we propose SRBench, with which one can assess the ability of a streaming RDF engine to cope with a broad range of use cases typically encountered in real-world scenarios. We offer a set of queries that cover the major aspects of streaming RDF engines, ranging from simple pattern-matching queries to queries with complex reasoning tasks. To give a first baseline and illustrate the state of the art, we show results obtained from implementing SRBench using the SPARQLStream query-processing engine developed by UPM.
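
    To illustrate the simplest end of this query spectrum, the following Python sketch evaluates a single triple-pattern query over a time-based sliding window of an RDF stream; the sensor-observation identifiers are illustrative stand-ins, not SRBench's actual data or the SPARQLStream syntax.

        from collections import deque

        class TripleWindow:
            """Time-based sliding window over a stream of RDF triples."""
            def __init__(self, width):
                self.width = width
                self.buf = deque()  # (timestamp, (s, p, o))

            def insert(self, ts, triple):
                self.buf.append((ts, triple))
                while self.buf and self.buf[0][0] <= ts - self.width:
                    self.buf.popleft()  # evict triples that fell out of the window

            def match(self, pattern):
                """Match a triple pattern; None acts as a variable (wildcard)."""
                return [t for _, t in self.buf
                        if all(p is None or p == x for p, x in zip(pattern, t))]

        # Hypothetical sensor-observation triples in the spirit of SRBench's
        # weather-sensor data (identifiers are illustrative, not the benchmark's).
        w = TripleWindow(width=60)
        w.insert(10, ("obs1", "observedProperty", "AirTemperature"))
        w.insert(40, ("obs2", "observedProperty", "WindSpeed"))
        w.insert(80, ("obs3", "observedProperty", "AirTemperature"))
        # Pattern query: ?obs observedProperty AirTemperature, over the last 60s.
        print(w.match((None, "observedProperty", "AirTemperature")))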

    TOWARDS AN EFFICIENT MULTI-CLOUD OBSERVABILITY FRAMEWORK OF CONTAINERIZED MICROSERVICES IN KUBERNETES PLATFORM

    A recent trend in software development adopts the paradigm of distributed microservices architecture (MA). Kubernetes, a container-based virtualization platform, has become a de facto environment in which to run MA applications. Organizations may choose to run microservices at several cloud providers to optimize cost and satisfy security concerns, which leads to increased complexity due to the need to observe the performance characteristics of distributed MA systems. Following a decision guidance models (DGM) approach, this research proposes a decentralized and scalable framework to monitor containerized microservices that run on the same or on distributed Kubernetes clusters. The framework introduces efficient techniques to gather, distribute, and analyze the observed runtime telemetry data. It offers extensible and cloud-agnostic modules that exchange data using a multiplexing, reactive, and non-blocking data streaming approach. An experiment observing sample microservices deployed across different cloud platforms was used to evaluate the efficacy and usefulness of the framework. The proposed framework offers development and operations (DevOps) practitioners an innovative approach to observing services across different Kubernetes platforms. It could also serve as a reference architecture for researchers, guiding further design options and analysis techniques.
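
    The following minimal Python asyncio sketch illustrates the multiplexing, non-blocking gathering idea under simplifying assumptions: per-cluster collectors (the cluster names and simulated scrapes are hypothetical) push telemetry onto a single queue that one analyzer consumes.

        import asyncio
        import random

        async def collect(cluster, queue):
            """Non-blocking collector for one Kubernetes cluster. The scrape is
            simulated with a sleep; a real module would query the cluster's
            metrics endpoint instead."""
            for _ in range(3):
                await asyncio.sleep(random.uniform(0.01, 0.05))  # simulated scrape
                await queue.put((cluster, {"cpu": random.random()}))
            await queue.put((cluster, None))  # signal that this collector is done

        async def analyze(queue, n_clusters):
            """Single consumer multiplexing telemetry from all clusters."""
            done = 0
            while done < n_clusters:
                cluster, sample = await queue.get()
                if sample is None:
                    done += 1
                else:
                    print(f"{cluster}: cpu={sample['cpu']:.2f}")

        async def main():
            queue = asyncio.Queue()
            clusters = ["aws-cluster", "gcp-cluster"]  # illustrative names
            await asyncio.gather(analyze(queue, len(clusters)),
                                 *(collect(c, queue) for c in clusters))

        asyncio.run(main())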

    Window Queries Over Data Streams

    Evaluating queries over data streams has become an appealing way to support various stream-processing applications. Window queries are commonly used in many stream applications. In a window query, certain query operators, especially blocking operators and stateful operators, appear in their windowed versions. Previous research on evaluating window queries typically requires ordered streams; this order requirement limits the implementations of window operators and also carries performance penalties. This thesis presents efficient and flexible algorithms for evaluating window queries. We first present a new data model for streams, progressing streams, that separates stream progress from physical-arrival order. Then, we present our window semantic definitions for the most commonly used window operators: window aggregation and window join. Unlike previous research, which often requires ordered streams when describing window semantics, our window semantic definitions do not rely on physical-stream arrival properties. Based on these definitions, we present new implementations of window aggregation and window join, WID and OA-Join. Compared to existing implementations of stream query operators, our implementations do not require special stream-arrival properties, particularly stream order. In addition, for window aggregation, we present two other implementations extended from WID: Paned-WID, which improves execution time by sharing sub-aggregates, and AdaptWID, which improves memory usage for input with data-distribution skew. Leveraging our order-insensitive implementations of window operators, we present a new architecture for stream systems, OOP (Out-of-Order Processing). Instead of relying on ordered streams to indicate stream progress, OOP explicitly communicates stream progress to query operators, and thus is more flexible than the previous in-order processing (IOP) approach, which requires maintaining stream order. We implemented our order-insensitive window query operators and the OOP architecture in NiagaraST and Gigascope. Our performance study in both systems confirms the benefits of our window operator implementations and the OOP architecture, compared to commonly used approaches, in terms of memory usage, execution time, and latency.
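
    A minimal Python sketch of the out-of-order processing idea, simplified from the thesis's operators, appears below: tuples update per-window state in any arrival order, and a window is emitted only once an explicit progress signal (a punctuation) passes its end boundary.

        from collections import defaultdict

        class OOPWindowCount:
            """Order-insensitive tumbling-window COUNT in the spirit of WID/OOP:
            tuples may arrive in any order; a window is emitted only when an
            explicit progress signal (punctuation) passes its end boundary."""
            def __init__(self, width):
                self.width = width
                self.counts = defaultdict(int)  # window start -> count

            def on_tuple(self, ts):
                self.counts[(ts // self.width) * self.width] += 1

            def on_punctuation(self, progress_ts):
                """All future tuples are guaranteed to have ts >= progress_ts."""
                closed = sorted(w for w in self.counts
                                if w + self.width <= progress_ts)
                return [(w, self.counts.pop(w)) for w in closed]

        op = OOPWindowCount(width=10)
        for ts in [3, 14, 7, 1, 12]:       # out-of-order arrivals
            op.on_tuple(ts)
        print(op.on_punctuation(20))       # -> [(0, 3), (10, 2)]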

    Airlift scheduling for the upgraded command and control system of Military Airlift Command.

    April 1984. This report describes a conceptual design for automating the scheduling of airlift activities as part of the current upgrade of the MAC C2 System. It defines the airlift scheduling problem in generic terms, reviews the current procedures used by MAC, and then introduces a new scheduling system aimed at handling a very busy and dynamic wartime scenario. The new system proposes "Airlift Scheduling Workstations" where MAC airlift schedulers would be able to manipulate symbolic information on a computer display to create and quickly modify schedules for aircraft, crews, and stations. For certain sub-problems in generating schedules, automated decision support algorithms would be used interactively to speed the search for feasible and efficient solutions.

    Airlift Scheduling Workstations are proposed to exist at each "Scheduling Cell", a conceptual organizational unit given sole and complete responsibility for developing the schedule of activities for a specific set of airlift resources: aircraft by tail number, aircrews by name, and stations by location. A Mission Scheduling Database is located at each cell to support the Airlift Scheduling Workstation; it requires information communicated by Airlift Task Planners and Airlift Operators at many other locations. These locations would have smaller workstations with local databases and database management software to assist Task Planners and Operators in viewing current committed and planned schedule information of particular interest to them, and to allow them to send information to the Mission Scheduling Database.

    The command and control processes for airlift have been structured into a three-level hierarchy in this report: Task Planning, Mission Scheduling, and Schedule Execution. Task Planners deal with Airlift Users and Mission Schedulers, but not Airlift Operators. Task Planning has three sub-processes: Processing User Requests, Assigning Requirements and Resources, and Monitoring Task Status. Task Planning does not create missions, schedule the missions, or route aircraft. Mission Schedulers deal with Task Planners and Airlift Operators, but not Airlift Users. Mission Scheduling combines several sub-processes to allow efficient schedules to be quickly generated at the ASW (Airlift Scheduling Workstation): Mission Generation, Schedule Map Generation (for each type of aircraft), Crew Mission Sequence Generation, Station Schedule Generation, Management of Schedule Status, and Monitoring Schedule Execution and Resource Status. It is important that all these processes be co-located and handled by the Airlift Scheduling Cell. Schedule Execution is performed by Airlift Operators assigned by the scheduling process. It has three sub-processes: Monitor Assigned Schedules, Report Resources Assigned to Schedule, and Report Local Capability Status. The assignment of local resources, such as aircraft by tail number and crews by name, is actually another scheduling process, but it has not been studied in this report. Airlift Operators do not deal with Task Planners, but may deal with Airlift Users to finalize details of the scheduled operations.

    This three-level hierarchy is compatible with the current organizational structures of MAC Command and Control. However, it is clear that both the current organizational structures and the procedures of MAC Command and Control for both tactical and strategic airlift will be significantly affected by the introduction of the automated scheduling systems envisioned here. These changes will occur in an evolutionary manner after the upgraded MAC C2 system is introduced.

    Prepared for the Electronic Systems Division, Air Force Systems Command, USAF, Hanscom Air Force Base, Bedford, MA.

    Generic windowing support for extensible stream processing systems

    Stream processing applications process high-volume, continuous feeds from live data sources, employ data-in-motion analytics to analyze these feeds, and produce near real-time insights with low latency. One of the fundamental characteristics of such applications is the on-the-fly nature of the computation, which does not require access to disk-resident data. Stream processing applications store the most recent history of streams in memory and use it to perform the necessary modeling and analysis tasks. This recent history is often managed using windows. All data stream management systems provide some form of windowing functionality. Windowing makes it possible to implement streaming versions of the traditionally blocking relational operators, such as streaming aggregations, joins, and sorts, as well as any other analytic operator that requires keeping the most recent tuples as state, such as time series analysis operators and signal processing operators. In this paper, we provide a categorization of the different window types and policies employed in stream processing applications and give detailed operational semantics for various window configurations. We describe an extensibility mechanism that makes it possible to integrate windowing support into user-defined operators, enabling consistent syntax and semantics across system-provided and third-party toolkits of streaming operators. We describe the design and implementation of a runtime windowing library that significantly simplifies the construction of window-based operators by decoupling the handling of window policies from the operator logic. We present our experience using the windowing library to implement a relational operators toolkit and compare the efficacy of the solution to an earlier implementation that did not employ a common windowing library. Copyright (c) 2013 John Wiley & Sons, Ltd.
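
    The following minimal Python sketch, an illustration rather than the paper's actual runtime library, shows the decoupling idea: a generic Window class handles count-based eviction and trigger policies, while the operator contributes only an on_trigger callback.

        from collections import deque

        class CountPolicy:
            """Count-based policy, usable for both eviction and triggering."""
            def __init__(self, size):
                self.size = size

        class Window:
            """Generic window that decouples policy handling from operator logic:
            eviction and trigger bookkeeping live in the library, and the
            operator only supplies the on_trigger callback."""
            def __init__(self, eviction, trigger, on_trigger):
                self.eviction, self.trigger = eviction, trigger
                self.on_trigger = on_trigger
                self.buf = deque()
                self.since_trigger = 0

            def insert(self, tup):
                self.buf.append(tup)
                while len(self.buf) > self.eviction.size:   # count-based eviction
                    self.buf.popleft()
                self.since_trigger += 1
                if self.since_trigger >= self.trigger.size:  # count-based trigger
                    self.since_trigger = 0
                    self.on_trigger(list(self.buf))

        # A sliding aggregate: keep the last 4 tuples, fire every 2 insertions.
        w = Window(CountPolicy(4), CountPolicy(2),
                   on_trigger=lambda win: print("sum =", sum(win)))
        for v in [1, 2, 3, 4, 5, 6]:
            w.insert(v)          # prints sum = 3, sum = 10, sum = 18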

    Approximate Data Analytics Systems

    Today, most modern online services make use of big data analytics systems to extract useful information from raw digital data. The data normally arrives as a continuous stream, at high speed and in huge volumes, and the cost of handling this massive data can be significant. Providing interactive latency in processing the data is often impractical because the data is growing exponentially, even faster than Moore's law predicts. To overcome this problem, approximate computing has recently emerged as a promising solution. Approximate computing is based on the observation that many modern applications are amenable to an approximate, rather than exact, output. Unlike traditional computing, approximate computing tolerates lower accuracy to achieve lower latency by computing over a partial subset instead of the entire input data. Unfortunately, advancements in approximate computing are primarily geared towards batch analytics and cannot provide low-latency guarantees in the context of stream processing, where new data continuously arrives as an unbounded stream. In this thesis, we design and implement approximate computing techniques for processing and interacting with high-speed and large-scale stream data, to achieve low latency and efficient utilization of resources. To this end, we have designed and built the following approximate data analytics systems:
    • StreamApprox: a data stream analytics system for approximate computing. It supports approximate computing for low-latency stream analytics in a transparent way and can adapt to rapid fluctuations of input data streams. For this system, we designed an online adaptive stratified reservoir sampling algorithm to produce approximate output with bounded error.
    • IncApprox: a data analytics system for incremental approximate computing. It combines approximate and incremental computing in stream processing to achieve high throughput and low latency with efficient resource utilization. For this system, we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error.
    • PrivApprox: a data stream analytics system for privacy-preserving and approximate computing. It supports high-utility, low-latency data analytics while preserving users' privacy, based on the combination of privacy-preserving data analytics and approximate computing.
    • ApproxJoin: an approximate distributed joins system. It improves the performance of joins, critical but expensive operations in big data systems, by employing a sketching technique (Bloom filters) to avoid shuffling non-joinable data items through the network, together with a novel sampling mechanism that executes during the join to obtain an unbiased representative sample of the join output.
    Our evaluation, based on micro-benchmarks and real-world case studies, shows that these systems achieve significant performance speedups compared to state-of-the-art systems while tolerating negligible accuracy loss in the analytics output. In addition, our systems allow users to systematically trade accuracy for throughput and latency, and require no or only minor modifications to existing applications. A sketch of the core sampling idea follows below.
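
    As one concrete example of these techniques, here is a minimal Python sketch of stratified reservoir sampling, simplified from StreamApprox's adaptive algorithm: each stratum keeps a fixed-size reservoir, and an aggregate is estimated by weighting each stratum's sample mean by its observed size.

        import random

        class StratifiedReservoir:
            """Per-stratum reservoir sampling over a stream of (stratum, value)
            pairs, in the spirit of StreamApprox's online stratified sampling
            (the real algorithm adapts sample sizes; this sketch fixes them)."""
            def __init__(self, k):
                self.k = k
                self.reservoirs = {}   # stratum -> list of sampled values
                self.seen = {}         # stratum -> number of items observed

            def insert(self, stratum, value):
                n = self.seen.get(stratum, 0) + 1
                self.seen[stratum] = n
                res = self.reservoirs.setdefault(stratum, [])
                if len(res) < self.k:
                    res.append(value)
                else:
                    j = random.randrange(n)     # classic reservoir replacement
                    if j < self.k:
                        res[j] = value

            def estimate_sum(self):
                """Weight each stratum's sample mean by its observed size."""
                return sum(self.seen[s] * (sum(r) / len(r))
                           for s, r in self.reservoirs.items())

        random.seed(1)
        sr = StratifiedReservoir(k=100)
        for i in range(10000):
            sr.insert("hot" if i % 10 else "cold", float(i % 7))
        print(sr.estimate_sum())   # approximate sum over the full stream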