39 research outputs found

    Fully decentralized computation of aggregates over data streams

    Get PDF
    In several emerging applications, data is collected in massive streams at several distributed points of observation. A basic and challenging task is to allow every node to monitor a neighbourhood of interest by issuing continuous aggregate queries on the streams observed in its vicinity. This class of algorithms is fully decentralized and diffusive in nature: collecting all data at few central nodes of the network is unfeasible in networks of low capability devices or in the presence of massive data sets. The main difficulty in designing diffusive algorithms is to cope with duplicate detections. These arise both from the observation of the same event at several nodes of the network and/or receipt of the same aggregated information along multiple paths of diffusion. In this paper, we consider fully decentralized algorithms that answer locally continuous aggregate queries on the number of distinct events, total number of events and the second frequency moment in the scenario outlined above. The proposed algorithms use in the worst case or on realistic distributions sublinear space at every node. We also propose strategies that minimize the communication needed to update the aggregates when new events are observed. We experimentally evaluate for the efficiency and accuracy of our algorithms on realistic simulated scenarios

    Selecting adequate samples for approximate decision support queries

    Get PDF
    For highly selective queries, a simple random sample of records drawn from a large data warehouse may not contain sufficient number of records that satisfy the query conditions. Efficient sampling schemes for such queries require innovative techniques that can access records that are relevant to each specific query. In drawing the sample, it is advantageous to know what would be an adequate sample size for a given query. This paper proposes methods for picking adequate samples that ensure approximate query results with a desired level of accuracy. A special index based on a structure known as the k-MDI Tree is used to draw samples. An unbiased estimator named inverse simple random sampling without replacement is adapted to estimate adequate sample sizes for queries. The methods are evaluated experimentally on a large real life data set. The results of evaluation show that adequate sample sizes can be determined such that errors in outputs of most queries are wtihin the acceptable limit of 5%

    Synopsis data structures for massive data sets

    Full text link

    Efficient Algorithms to Compute Hierarchical Summaries from Big Data Streams

    Full text link
    Many data stream applications have hierarchical data; containing time, geographic locations, product information, clickstreams, server logs, IP addresses. A hierarchical summary of such volumous data offers multiple advantages including compactness, quick understanding, and abstraction. The goal of this thesis is to design algorithmic approaches for summarizing hierarchical data streams. First, this thesis provides a theoretical analysis of the benchmark hierarchical heavy hitters' algorithms and uncovers their shortcomings such as requiring high theoretical memory, updates and coverage problem. To address these shortcomings, this thesis proposes efficient algorithms which offer deterministic estimation accuracy using O(η/Δ) worst-case memory and O(η) worst-case time complexity per item, where Δ ∈ [0,1] is a user defined parameter and η is a small constant derived from the data. The proposed hierarchical heavy hitters' algorithms are shown to have improved significantly over existing algorithms both theoretically as well as empirically. Next, this thesis introduces a new concept called hierarchically correlated heavy hitters, which is different from existing hierarchical summarization techniques. The thesis provides a formal definition of the proposed concept and compares it with existing hierarchical summarization approaches both at definition level and empirically. It also proposes an efficient hierarchy-aware algorithm for computing hierarchically correlated heavy hitters. The proposed algorithm offers deterministic estimation accuracy using O(η / (Δ_p * Δ_s )) worst-case memory and O(η) worst-case time complexity per item, where η is as defined previously, and Δ_p ∈ [0,1], Δ_s ∈ [0,1] are other user defined parameters. Finally, the thesis proposes a special hierarchical data structure and algorithm to summarize spatiotemporal data. It can be used to extract interesting and useful patterns from high-speed spatiotemporal data streams at multiple spatial and temporal granularities. Theoretical and empirical analysis are provided, which show that the proposed data structure is very efficient concerning data storage and response to queries. It updates a single item in O(1) time and responds to a point query in O(1) time. Importantly, the memory requirement of the proposed data structure is independent of the size of the data and only depends on user-supplied parameters ψ ⃗ and φ ⃗. In summary, this thesis provides a general framework consisting of a set of algorithms and data structures to compute hierarchical summaries of the big data streams. All of the proposed algorithms exploit a lattice structure built from the hierarchical attributes of the data to compute different hierarchical summaries, which can be used to address various data analytic issues in many emerging applications

    Technologies and Applications for Big Data Value

    Get PDF
    This open access book explores cutting-edge solutions and best practices for big data and data-driven AI applications for the data-driven economy. It provides the reader with a basis for understanding how technical issues can be overcome to offer real-world solutions to major industrial areas. The book starts with an introductory chapter that provides an overview of the book by positioning the following chapters in terms of their contributions to technology frameworks which are key elements of the Big Data Value Public-Private Partnership and the upcoming Partnership on AI, Data and Robotics. The remainder of the book is then arranged in two parts. The first part “Technologies and Methods” contains horizontal contributions of technologies and methods that enable data value chains to be applied in any sector. The second part “Processes and Applications” details experience reports and lessons from using big data and data-driven approaches in processes and applications. Its chapters are co-authored with industry experts and cover domains including health, law, finance, retail, manufacturing, mobility, and smart cities. Contributions emanate from the Big Data Value Public-Private Partnership and the Big Data Value Association, which have acted as the European data community's nucleus to bring together businesses with leading researchers to harness the value of data to benefit society, business, science, and industry. The book is of interest to two primary audiences, first, undergraduate and postgraduate students and researchers in various fields, including big data, data science, data engineering, and machine learning and AI. Second, practitioners and industry experts engaged in data-driven systems, software design and deployment projects who are interested in employing these advanced methods to address real-world problems

    Technologies and Applications for Big Data Value

    Get PDF
    This open access book explores cutting-edge solutions and best practices for big data and data-driven AI applications for the data-driven economy. It provides the reader with a basis for understanding how technical issues can be overcome to offer real-world solutions to major industrial areas. The book starts with an introductory chapter that provides an overview of the book by positioning the following chapters in terms of their contributions to technology frameworks which are key elements of the Big Data Value Public-Private Partnership and the upcoming Partnership on AI, Data and Robotics. The remainder of the book is then arranged in two parts. The first part “Technologies and Methods” contains horizontal contributions of technologies and methods that enable data value chains to be applied in any sector. The second part “Processes and Applications” details experience reports and lessons from using big data and data-driven approaches in processes and applications. Its chapters are co-authored with industry experts and cover domains including health, law, finance, retail, manufacturing, mobility, and smart cities. Contributions emanate from the Big Data Value Public-Private Partnership and the Big Data Value Association, which have acted as the European data community's nucleus to bring together businesses with leading researchers to harness the value of data to benefit society, business, science, and industry. The book is of interest to two primary audiences, first, undergraduate and postgraduate students and researchers in various fields, including big data, data science, data engineering, and machine learning and AI. Second, practitioners and industry experts engaged in data-driven systems, software design and deployment projects who are interested in employing these advanced methods to address real-world problems

    Data bases and data base systems related to NASA's aerospace program. A bibliography with indexes

    Get PDF
    This bibliography lists 1778 reports, articles, and other documents introduced into the NASA scientific and technical information system, 1975 through 1980
    corecore