6 research outputs found

    Efficient Indexing and Query Processing of Model-View Sensor Data in the Cloud

    As the number of sensors that pervade our lives increases (e.g., environmental sensors, phone sensors, etc.), the efficient management of massive amounts of sensor data is becoming increasingly important. The infinite nature of sensor data poses a serious challenge for query processing even in a cloud infrastructure. Traditional raw sensor data management systems based on relational databases lack the scalability to accommodate large-scale sensor data efficiently. Thus, distributed key-value stores in the cloud are becoming a prime tool to manage sensor data. Model-view sensor data management, which stores the sensor data in the form of modeled segments, brings the additional advantages of data compression and value interpolation. However, there are currently no techniques for indexing and/or query optimization of model-view sensor data in the cloud; in the worst case, query processing requires a full table scan. In this paper, we propose an innovative index for modeled segments in key-value stores, namely KVI-index. KVI-index consists of two interval indices, on the time and sensor value dimensions respectively, each of which has an in-memory search tree and a secondary list materialized in the key-value store. Then, we introduce a KVI-index–Scan–MapReduce hybrid approach to perform efficient query processing upon modeled data streams. As shown by a series of experiments on a private cloud infrastructure, our approach outperforms both Hadoop-based parallel processing of the raw sensor data and multiple alternative indexing approaches for model-view data in query response time and index-updating efficiency.
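To make the idea of modeled segments concrete, the following is a minimal sketch, not the KVI-index itself. It assumes each segment is a linear model over a closed time interval, and it answers time-range and value-range queries with a plain scan plus an interval-overlap test, which is the full-scan baseline the abstract says an index should avoid. The names `Segment`, `time_range_query`, and `value_range_query` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    """One modeled segment: a linear model valid on [t_start, t_end] (assumed form)."""
    t_start: float
    t_end: float
    slope: float
    intercept: float

    def value_at(self, t: float) -> float:
        # Value interpolation: evaluate the model instead of storing raw samples.
        return self.slope * t + self.intercept

    def value_range(self) -> Tuple[float, float]:
        # A linear model reaches its extremes at the interval endpoints.
        a, b = self.value_at(self.t_start), self.value_at(self.t_end)
        return (min(a, b), max(a, b))


def overlaps(lo1: float, hi1: float, lo2: float, hi2: float) -> bool:
    """True if the closed intervals [lo1, hi1] and [lo2, hi2] intersect."""
    return lo1 <= hi2 and lo2 <= hi1


def time_range_query(segments: List[Segment], t_lo: float, t_hi: float) -> List[Segment]:
    # Baseline: scan every segment and keep those whose time interval overlaps the query.
    return [s for s in segments if overlaps(s.t_start, s.t_end, t_lo, t_hi)]


def value_range_query(segments: List[Segment], v_lo: float, v_hi: float) -> List[Segment]:
    # Baseline: scan every segment and keep those whose value interval overlaps the query.
    return [s for s in segments if overlaps(*s.value_range(), v_lo, v_hi)]
```

Both queries reduce to interval-overlap tests, one on the time dimension and one on the value dimension, which is why the abstract describes two interval indices rather than a single point index.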

    Time Series Management Systems: A Survey

    The collection of time series data increases as more monitoring and automation are being deployed. These deployments range in scale from an Internet of Things (IoT) device located in a household to enormous distributed Cyber-Physical Systems (CPSs) producing large volumes of data at high velocity. To store and analyze these vast amounts of data, specialized Time Series Management Systems (TSMSs) have been developed to overcome the limitations of general-purpose Database Management Systems (DBMSs) for time series management. In this paper, we present a thorough analysis and classification of TSMSs developed through academic or industrial research and documented through publications. Our classification is organized into categories based on the architectures observed during our analysis. In addition, we provide an overview of each system with a focus on the motivational use case that drove the development of the system, the functionality for storage and querying of time series that the system implements, the components the system is composed of, and the capabilities of each system with regard to Stream Processing and Approximate Query Processing (AQP). Last, we provide a summary of research directions proposed by other researchers in the field and present our vision for a next-generation TSMS. Comment: 20 Pages, 15 Figures, 2 Tables, Accepted for publication in IEEE TKD

    Distributed Time Series Analytics

    In recent years, time series data has become ubiquitous thanks to affordable sensors and advances in embedded technology. Large amounts of time series data are continuously produced in a wide spectrum of applications, such as sensor networks, medical monitoring and so on. The availability of such large-scale time series data highlights the importance of scalable data management, efficient querying and analysis. Meanwhile, in the online setting, time series carry invaluable information and knowledge about the real-time status of involved entities or monitored phenomena, which calls for online time series data mining to serve timely decision making or event detection. In this thesis we aim to address these important issues pertaining to scalable and distributed analytics techniques for massive time series data. Concretely, this thesis is centered around the following three topics: As the number of sensors that pervade our lives significantly increases (e.g., environmental sensors, mobile phone sensors, IoT applications, etc.), the efficient management of massive amounts of time series from such sensors is becoming increasingly important. The infinite nature of sensor data poses a serious challenge for query processing even in a cloud infrastructure. Traditional raw sensor data management systems based on relational databases lack the scalability to accommodate large-scale sensor data efficiently. Thus, distributed key-value stores in the cloud are becoming a prime tool to manage sensor data. However, there are currently no techniques for indexing and/or query optimization of model-view sensor time series data in the cloud. In Chapter 2, we propose an innovative index for modeled segments in key-value stores, namely KVI-index. KVI-index consists of two interval indices, on the time and sensor value dimensions respectively, each of which has an in-memory search tree and a secondary list materialized in the key-value store.
The dramatic increase in the availability of data streams fuels the development of many distributed real-time computation engines (e.g., Storm, Samza, Spark Streaming, S4, etc.). In Chapter 3, we focus on a fundamental time series mining task in such a new computation paradigm, namely continuously mining dynamic (lagged) correlations in time series via a distributed real-time computation engine. Correlations reveal the hidden and temporal interactions across time series and are widely used in scientific data analysis, data-driven event detection, financial markets and so on. We propose the P2H framework, consisting of a parallelism-partitioning based data shuffling method and a hypercube structure based computation pruning method, so as to enhance both the communication and computation efficiency of mining correlations in the distributed context. In numerous real-world applications, large datasets collected from observations and measurements of physical entities are inevitably noisy and contain outliers. The outliers in such large and noisy datasets can dramatically degrade the performance of standard distributed machine learning approaches such as regression trees. In Chapter 4 we present a novel distributed regression tree approach that utilizes robust regression statistics, i.e., statistics that are less sensitive to outliers, for handling large and noisy datasets. Then we present an adaptive gradient learning method for recurrent neural networks (RNNs) to forecast streaming time series in the presence of both outliers and change points.
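As an illustration of what a lagged correlation between two time series means, here is a minimal single-machine sketch. It is a naive computation, not the P2H framework described in the abstract, and it assumes Pearson correlation as the correlation measure. The function names `pearson`, `lagged_correlation`, and `best_lag` are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def lagged_correlation(x: Sequence[float], y: Sequence[float], lag: int) -> float:
    """Correlation between x[t] and y[t + lag] for a non-negative lag."""
    if len(x) != len(y):
        raise ValueError("series must have equal length")
    if lag < 0 or lag > len(x) - 2:
        raise ValueError("lag out of range")
    # Align the two series: drop the last `lag` points of x and the first `lag` of y.
    return pearson(x[: len(x) - lag], y[lag:])


def best_lag(x: Sequence[float], y: Sequence[float], max_lag: int) -> int:
    """Lag in [0, max_lag] at which y follows x most strongly (naive exhaustive search)."""
    return max(range(max_lag + 1), key=lambda lag: lagged_correlation(x, y, lag))
```

The exhaustive search over every lag and every pair of series is what makes this task expensive at scale, which motivates the partitioning and pruning the abstract mentions.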

    Online indexing and distributed querying model-view sensor data in the cloud

    As various kinds of sensors penetrate our daily life (e.g., sensor networks for environmental monitoring, GPS for localization and navigation), the efficient management of massive amounts of sensor data is becoming increasingly important. Many sensor data management systems are implemented on top of key-value stores in the cloud, because traditional solutions based on relational databases lack the scalability to accommodate large-scale sensor data efficiently. Meanwhile, model-view sensor data management, which stores the sensor data in the form of modelled segments, largely reduces the amount of raw data. However, there are currently no index or query optimizations for these modelled segments in the cloud, which results in a full table scan for query processing in the worst case. In this paper, we propose an innovative model index for sensor data segments in key-value stores (KVM-index). KVM-index consists of two interval indices, on the time and sensor value dimensions respectively, each of which has an in-memory search tree and a secondary list materialized in the key-value store. This composite in-memory and key-value structure allows new incoming sensor data segments to be indexed with constant network I/O. Second, for time (or value) range and point queries, a MapReduce-based approach is designed to process only the discrete predicate-related ranges of the KVM-index table, thereby eliminating the computation and communication overheads incurred by accessing irrelevant parts of the index table in conventional MapReduce programs. Finally, we propose a cost-based adaptive strategy for the KVM-index-MapReduce framework to process composite queries on both the time and value dimensions. As shown by extensive experiments in a private cloud, our approach outperforms both MapReduce-based processing of the raw sensor data and multiple alternative approaches to querying model-view sensor data in query response time.

    Multicloud Resource Allocation: Cooperation, Optimization and Sharing

    Nowadays our daily life is powered not only by water, electricity, gas and telephony but by the "cloud" as well. Big cloud vendors such as Amazon, Microsoft and Google have built large-scale centralized data centers to achieve economies of scale, on-demand resource provisioning, high resource availability and elasticity. However, those massive data centers also bring about many other problems, e.g., bandwidth bottlenecks, privacy, security, huge energy consumption, and legal and physical vulnerabilities. One possible solution to those problems is to employ multicloud architectures. In this thesis, our work provides research contributions to multicloud resource allocation from three perspectives: cooperation, optimization and data sharing. We address the following problems in the multicloud: how resource providers cooperate in a multicloud, how to reduce information leakage in a multicloud storage system and how to share big data in a cost-effective way. More specifically, we make the following contributions: Cooperation in the decentralized cloud. We propose a decentralized cloud model in which a group of SDCs can cooperate with each other to improve performance. Moreover, we design a general strategy function for SDCs to evaluate the performance of cooperation based on different dimensions of resource sharing. Through extensive simulations using a realistic data center model, we show that the strategies based on reciprocity are more effective than other strategies, e.g., those using prediction based on historical data. Our results show that the reciprocity-based strategy can thrive in a heterogeneous environment with competing strategies. Multicloud optimization on information leakage. In this work, we first study an important information leakage problem caused by unplanned data distribution in multicloud storage services. Then, we present StoreSim, an information leakage aware storage system in the multicloud.
StoreSim aims to store syntactically similar data on the same cloud, thereby minimizing the user's information leakage across multiple clouds. We design an approximate algorithm to efficiently generate similarity-preserving signatures for data chunks based on MinHash and Bloom filters, and also design a function to compute the information leakage based on these signatures. Next, we present an effective clustering-based storage plan generation algorithm for distributing data chunks with minimal information leakage across multiple clouds. Finally, we evaluate our scheme using two real datasets from Wikipedia and GitHub. We show that our scheme can reduce the information leakage by up to 60% compared to unplanned placement. Furthermore, our analysis in terms of system attackability demonstrates that our scheme makes attacks on information much more complex. Smart data sharing. Moving large amounts of distributed data into the cloud or from one cloud to another can incur high costs in both time and bandwidth. The optimization of data sharing in the multicloud can be conducted from two different angles: inter-cloud scheduling and intra-cloud optimization. We first present CoShare, a P2P-inspired, decentralized, cost-effective sharing system for data replication that optimizes network transfer among small data centers. Then we propose a data summarization method to reduce the total size of the dataset, thereby reducing network transfer.
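The notion of a similarity-preserving signature can be illustrated with a minimal MinHash sketch. This is a generic textbook-style version and not StoreSim's algorithm: it omits the Bloom filter component, and the shingle length, the number of hash functions, and the seeded SHA-1 hashing are assumptions chosen for illustration. The function names are hypothetical.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 4) -> Set[str]:
    """Set of k-character shingles of a data chunk (the whole text if shorter than k)."""
    if len(text) < k:
        return {text}
    return {text[i : i + k] for i in range(len(text) - k + 1)}


def _seeded_hash(seed: int, item: str) -> int:
    # Assumed hash family: SHA-1 over "seed:item", truncated to 64 bits.
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items: Set[str], num_hashes: int = 128) -> List[int]:
    """For each seeded hash function, keep the minimum hash value over the set."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Because two signatures can be compared without access to the original chunks, similar chunks can be grouped for placement using only these short fingerprints, which is the property the abstract relies on.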

    Model-Based Time Series Management at Scale
