36 research outputs found

    VerdictDB: Universalizing Approximate Query Processing

    Full text link
    Despite 25 years of research in academia, approximate query processing (AQP) has had little industrial adoption. One of the major causes of this slow adoption is the reluctance of traditional vendors to make radical changes to their legacy codebases, and the preoccupation of newer vendors (e.g., SQL-on-Hadoop products) with implementing standard features. Additionally, the few AQP engines that are available are each tied to a specific platform and require users to completely abandon their existing databases---an unrealistic expectation given the infancy of the AQP technology. Therefore, we argue that a universal solution is needed: a database-agnostic approximation engine that will widen the reach of this emerging technology across various platforms. Our proposal, called VerdictDB, uses a middleware architecture that requires no changes to the backend database, and thus, can work with all off-the-shelf engines. Operating at the driver-level, VerdictDB intercepts analytical queries issued to the database and rewrites them into another query that, if executed by any standard relational engine, will yield sufficient information for computing an approximate answer. VerdictDB uses the returned result set to compute an approximate answer and error estimates, which are then passed on to the user or application. However, lack of access to the query execution layer introduces significant challenges in terms of generality, correctness, and efficiency. This paper shows how VerdictDB overcomes these challenges and delivers up to 171×\times speedup (18.45×\times on average) for a variety of existing engines, such as Impala, Spark SQL, and Amazon Redshift, while incurring less than 2.6% relative error. VerdictDB is open-sourced under Apache License.Comment: Extended technical report of the paper that appeared in Proceedings of the 2018 International Conference on Management of Data, pp. 1461-1476. ACM, 201

    Approximate Data Analytics Systems

    Get PDF
    Today, most modern online services make use of big data analytics systems to extract useful information from the raw digital data. The data normally arrives as a continuous data stream at a high speed and in huge volumes. The cost of handling this massive data can be significant. Providing interactive latency in processing the data is often impractical due to the fact that the data is growing exponentially and even faster than Moore’s law predictions. To overcome this problem, approximate computing has recently emerged as a promising solution. Approximate computing is based on the observation that many modern applications are amenable to an approximate, rather than the exact output. Unlike traditional computing, approximate computing tolerates lower accuracy to achieve lower latency by computing over a partial subset instead of the entire input data. Unfortunately, the advancements in approximate computing are primarily geared towards batch analytics and cannot provide low-latency guarantees in the context of stream processing, where new data continuously arrives as an unbounded stream. In this thesis, we design and implement approximate computing techniques for processing and interacting with high-speed and large-scale stream data to achieve low latency and efficient utilization of resources. To achieve these goals, we have designed and built the following approximate data analytics systems: • StreamApprox—a data stream analytics system for approximate computing. This system supports approximate computing for low-latency stream analytics in a transparent way and has an ability to adapt to rapid fluctuations of input data streams. In this system, we designed an online adaptive stratified reservoir sampling algorithm to produce approximate output with bounded error. • IncApprox—a data analytics system for incremental approximate computing. This system adopts approximate and incremental computing in stream processing to achieve high-throughput and low-latency with efficient resource utilization. In this system, we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error. • PrivApprox—a data stream analytics system for privacy-preserving and approximate computing. This system supports high utility and low-latency data analytics and preserves user’s privacy at the same time. The system is based on the combination of privacy-preserving data analytics and approximate computing. • ApproxJoin—an approximate distributed joins system. This system improves the performance of joins — critical but expensive operations in big data systems. In this system, we employed a sketching technique (Bloom filter) to avoid shuffling non-joinable data items through the network as well as proposed a novel sampling mechanism that executes during the join to obtain an unbiased representative sample of the join output. Our evaluation based on micro-benchmarks and real world case studies shows that these systems can achieve significant performance speedup compared to state-of-the-art systems by tolerating negligible accuracy loss of the analytics output. In addition, our systems allow users to systematically make a trade-off between accuracy and throughput/latency and require no/minor modifications to the existing applications

    The role of big data analytics in industrial Internet of Things

    Get PDF
    Big data production in industrial Internet of Things (IIoT) is evident due to the massive deployment of sensors and Internet of Things (IoT) devices. However, big data processing is challenging due to limited computational, networking and storage resources at IoT device-end. Big data analytics (BDA) is expected to provide operational- and customer-level intelligence in IIoT systems. Although numerous studies on IIoT and BDA exist, only a few studies have explored the convergence of the two paradigms. In this study, we investigate the recent BDA technologies, algorithms and techniques that can lead to the development of intelligent IIoT systems. We devise a taxonomy by classifying and categorising the literature on the basis of important parameters (e.g. data sources, analytics tools, analytics techniques, requirements, industrial analytics applications and analytics types). We present the frameworks and case studies of the various enterprises that have benefited from BDA. We also enumerate the considerable opportunities introduced by BDA in IIoT.We identify and discuss the indispensable challenges that remain to be addressed as future research directions as well

    The role of big data analytics in industrial internet of things

    Get PDF
    Big data production in industrial Internet of Things (IIoT) is evident due to the massive deployment of sensors and Internet of Things (IoT) devices. However, big data processing is challenging due to limited computational, networking and storage resources at IoT device-end. Big data analytics (BDA) is expected to provide operational- and customer-level intelligence in IIoT systems. Although numerous studies on IIoT and BDA exist, only a few studies have explored the convergence of the two paradigms. In this study, we investigate the recent BDA technologies, algorithms and techniques that can lead to the development of intelligent IIoT systems. We devise a taxonomy by classifying and categorising the literature on the basis of important parameters (e.g. data sources, analytics tools, analytics techniques, requirements, industrial analytics applications and analytics types). We present the frameworks and case studies of the various enterprises that have benefited from BDA. We also enumerate the considerable opportunities introduced by BDA in IIoT. We identify and discuss the indispensable challenges that remain to be addressed, serving as future research directions. © 2019 Elsevier B.V
    corecore