2,686 research outputs found

    Speculative Approximations for Terascale Analytics

    Full text link
    Model calibration is a major challenge faced by the plethora of statistical analytics packages that are increasingly used in Big Data applications. Identifying the optimal model parameters is a time-consuming process that has to be executed from scratch for every dataset/model combination even by experienced data scientists. We argue that the incapacity to evaluate multiple parameter configurations simultaneously and the lack of support to quickly identify sub-optimal configurations are the principal causes. In this paper, we develop two database-inspired techniques for efficient model calibration. Speculative parameter testing applies advanced parallel multi-query processing methods to evaluate several configurations concurrently. The number of configurations is determined adaptively at runtime, while the configurations themselves are extracted from a distribution that is continuously learned following a Bayesian process. Online aggregation is applied to identify sub-optimal configurations early in the processing by incrementally sampling the training dataset and estimating the objective function corresponding to each configuration. We design concurrent online aggregation estimators and define halting conditions to accurately and timely stop the execution. We apply the proposed techniques to distributed gradient descent optimization -- batch and incremental -- for support vector machines and logistic regression models. We implement the resulting solutions in GLADE PF-OLA -- a state-of-the-art Big Data analytics system -- and evaluate their performance over terascale-size synthetic and real datasets. The results confirm that as many as 32 configurations can be evaluated concurrently almost as fast as one, while sub-optimal configurations are detected accurately in as little as a 1/20th1/20^{\text{th}} fraction of the time

    Knowledge-infused and Consistent Complex Event Processing over Real-time and Persistent Streams

    Full text link
    Emerging applications in Internet of Things (IoT) and Cyber-Physical Systems (CPS) present novel challenges to Big Data platforms for performing online analytics. Ubiquitous sensors from IoT deployments are able to generate data streams at high velocity, that include information from a variety of domains, and accumulate to large volumes on disk. Complex Event Processing (CEP) is recognized as an important real-time computing paradigm for analyzing continuous data streams. However, existing work on CEP is largely limited to relational query processing, exposing two distinctive gaps for query specification and execution: (1) infusing the relational query model with higher level knowledge semantics, and (2) seamless query evaluation across temporal spaces that span past, present and future events. These allow accessible analytics over data streams having properties from different disciplines, and help span the velocity (real-time) and volume (persistent) dimensions. In this article, we introduce a Knowledge-infused CEP (X-CEP) framework that provides domain-aware knowledge query constructs along with temporal operators that allow end-to-end queries to span across real-time and persistent streams. We translate this query model to efficient query execution over online and offline data streams, proposing several optimizations to mitigate the overheads introduced by evaluating semantic predicates and in accessing high-volume historic data streams. The proposed X-CEP query model and execution approaches are implemented in our prototype semantic CEP engine, SCEPter. We validate our query model using domain-aware CEP queries from a real-world Smart Power Grid application, and experimentally analyze the benefits of our optimizations for executing these queries, using event streams from a campus-microgrid IoT deployment.Comment: 34 pages, 16 figures, accepted in Future Generation Computer Systems, October 27, 201

    Power efficiency through tuple ranking in wireless sensor network monitoring

    Get PDF
    In this paper, we present an innovative framework for efficiently monitoring Wireless Sensor Networks (WSNs). Our framework, coined KSpot, utilizes a novel top-k query processing algorithm we developed, in conjunction with the concept of in-network views, in order to minimize the cost of query execution. For ease of exposition, consider a set of sensors acquiring data from their environment at a given time instance. The generated information can conceptually be thought as a horizontally fragmented base relation R. Furthermore, the results to a user-defined query Q, registered at some sink point, can conceptually be thought as a view V . Maintaining consistency between V and R is very expensive in terms of communication and energy. Thus, KSpot focuses on a subset V′ (⊆ V ) that unveils only the k highest-ranked answers at the sink, for some user defined parameter k. To illustrate the efficiency of our framework, we have implemented a real system in nesC, which combines the traditional advantages of declarative acquisition frameworks, like TinyDB, with the ideas presented in this work. Extensive real-world testing and experimentation with traces from University of California-Berkeley, the University of Washington and Intel Research Berkeley, show that KSpot provides an up to 66% of energy savings compared to TinyDB, minimizes both the size and number of packets transmitted over the network (up to 77%), and prolongs the longevity of a WSN deployment to new scales

    Handling Tradeoffs between Performance and Query-Result Quality in Data Stream Processing

    Get PDF
    Data streams in the form of potentially unbounded sequences of tuples arise naturally in a large variety of domains including finance markets, sensor networks, social media, and network traffic management. The increasing number of applications that require processing data streams with high throughput and low latency have promoted the development of data stream processing systems (DSPS). A DSPS processes data streams with continuous queries, which are issued once and return query results to users continuously as new tuples arrive. For stream-based applications, both the query-execution performance (in terms of, e.g., throughput and end-to-end latency) and the quality of produced query results (in terms of, e.g., accuracy and completeness) are important. However, a DSPS often needs to make tradeoffs between these two requirements, either because of the data imperfection within the streams, or because of the limited computation capacity of the DSPS itself. Performance versus result-quality tradeoffs caused by data imperfection are inevitable, because the quality of the incoming data is beyond the control of a DSPS, whereas tradeoffs caused by system limitations can be alleviated—even erased—by enhancing the DSPS itself. This dissertation seeks to advance the state of the art on handling the performance versus result-quality tradeoffs in data stream processing caused by the above two aspects of reasons. For tradeoffs caused by data imperfection, this dissertation focuses on the typical data-imperfection problem of stream disorder and proposes the concept of quality-driven disorder handling (QDDH). QDDH enables a DSPS to make flexible and user-configurable tradeoffs between the end-to-end latency and the query-result quality when dealing with stream disorder. Moreover, compared to existing disorder handling approaches, QDDH can significantly reduce the end-to-end latency, and at the same time provide users with desired query-result quality. In this dissertation, a generic buffer-based QDDH framework and three instantiations of the generic framework for distinct query types are presented. For tradeoffs caused by system limitations, this dissertation proposes a system-enhancement approach that combines the row-oriented and the column-oriented data layout and processing techniques in data stream processing to improve the throughput. To fully exploit the potential of such hybrid execution of continuous queries, a static, cost-based query optimizer is introduced. The optimizer works at the operator level and takes the unique property of execution plans of continuous queries—feasibility—into account
    corecore