
    System Support For Stream Processing In Collaborative Cloud-Edge Environment

    Stream processing is a critical technique for processing huge amounts of data in real time. Cloud computing has been used for stream processing due to its virtually unlimited computation resources. At the same time, we are entering the era of the Internet of Everything (IoE). Emerging edge computing benefits low-latency applications by leveraging computation resources in the proximity of data sources. Billions of sensors and actuators are being deployed worldwide, and the huge amounts of data generated by things are immersed in our daily life. It has become essential for organizations to be able to stream and analyze data, and to provide low-latency analytics on streaming data. However, cloud computing is inefficient for processing all data in a centralized environment in terms of network bandwidth cost and response latency. Although edge computing offloads computation from the cloud to the edge of the Internet, there is no data sharing and processing framework that efficiently utilizes computation resources in both the cloud and the edge. Furthermore, the heterogeneity of edge devices adds difficulty to the development of collaborative cloud-edge applications. To explore and attack the challenges of stream processing systems in a collaborative cloud-edge environment, in this dissertation we design and develop a series of systems to support stream processing applications in hybrid cloud-edge analytics. Specifically, we develop a hierarchical and hybrid outlier detection model for multivariate time series streams that automatically selects the best model for different time series. We optimize one of the stream processing systems (i.e., Spark Streaming) to reduce its end-to-end latency. To facilitate the development of collaborative cloud-edge applications, we propose and implement a new computing framework, Firework, which allows stakeholders to share and process data by leveraging both the cloud and the edge. A vision-based cloud-edge application is implemented to demonstrate the capabilities of Firework. Combining all these studies, we provide comprehensive system support for stream processing in a collaborative cloud-edge environment.
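
    The dissertation's per-series model selection can be illustrated with a minimal sketch: candidate detectors are scored on a labeled validation window and the best one is kept for that series. The candidate detectors, window sizes, and F1-based selection criterion below are illustrative assumptions, not the authors' actual design.

```python
# Hypothetical sketch of per-series outlier-detector selection.
import numpy as np

def zscore_outliers(x, thresh=3.0):
    # Flag points far from the global mean in standard-deviation units.
    z = np.abs((x - x.mean()) / (x.std() + 1e-9))
    return z > thresh

def residual_outliers(x, window=10, thresh=3.0):
    # Flag points far from a moving-average forecast.
    pad = np.concatenate([np.full(window - 1, x[0]), x])
    ma = np.convolve(pad, np.ones(window) / window, mode="valid")
    resid = np.abs(x - ma)
    return resid > thresh * (resid.std() + 1e-9)

def select_detector(x, labeled_outliers):
    # Pick the candidate with the best F1 on a labeled validation window.
    best, best_f1 = None, -1.0
    for det in (zscore_outliers, residual_outliers):
        pred = det(x)
        tp = np.sum(pred & labeled_outliers)
        prec = tp / max(pred.sum(), 1)
        rec = tp / max(labeled_outliers.sum(), 1)
        f1 = 2 * prec * rec / max(prec + rec, 1e-9)
        if f1 > best_f1:
            best, best_f1 = det, f1
    return best

rng = np.random.default_rng(0)
series = rng.normal(size=500)
series[[50, 300]] = 8.0                       # inject two outliers
labels = np.zeros(500, dtype=bool)
labels[[50, 300]] = True
print(select_detector(series, labels).__name__)
```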

    Generalizable Resource Allocation in Stream Processing via Deep Reinforcement Learning

    This paper considers the problem of resource allocation in stream processing, where continuous data flows must be processed in real time in a large distributed system. To maximize system throughput, the resource allocation strategy that partitions the computation tasks of a stream processing graph onto computing devices must simultaneously balance the workload distribution and minimize communication. Since this graph partitioning problem is known to be NP-complete yet crucial to practical streaming systems, many heuristic-based algorithms have been developed to find reasonably good solutions. In this paper, we present a graph-aware encoder-decoder framework to learn a generalizable resource allocation strategy that can properly distribute the computation tasks of stream processing graphs unseen in the training data. We propose, for the first time, to leverage graph embedding to learn the structural information of stream processing graphs. Jointly trained with the graph-aware decoder using deep reinforcement learning, our approach can effectively find optimized solutions for unseen graphs. Our experiments show that the proposed model outperforms both METIS, a state-of-the-art graph partitioning algorithm, and an LSTM-based encoder-decoder model in about 70% of the test cases.
    Comment: Accepted by AAAI 202
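
    A toy sketch of the paper's two-stage idea follows: a graph embedding summarizes each operator's structural context, and a decoder maps operators to devices. Here the embedding is plain neighbor aggregation and the decoder is a greedy heuristic trading off load balance against cut edges; the trained RL policy of the actual paper is not reproduced.

```python
# Illustrative embedding + decoding for stream-graph partitioning.
import numpy as np

def embed(adj, feats, hops=2):
    # Simple neighbor aggregation: h <- normalize(A @ h + h), repeated.
    h = feats
    for _ in range(hops):
        h = adj @ h + h
        h = h / (np.linalg.norm(h, axis=1, keepdims=True) + 1e-9)
    return h

def greedy_decode(adj, emb, n_devices, balance=1.0, comm=1.0):
    # Place operators one by one, trading off device load vs. cut edges.
    n = len(emb)
    place = -np.ones(n, dtype=int)
    load = np.zeros(n_devices)
    for v in np.argsort(-emb.sum(axis=1)):    # "important" nodes first
        scores = []
        for d in range(n_devices):
            cut = sum(adj[v, u] for u in range(n)
                      if place[u] >= 0 and place[u] != d)
            scores.append(balance * load[d] + comm * cut)
        place[v] = int(np.argmin(scores))
        load[place[v]] += 1
    return place

adj = np.array([[0, 1, 1, 0], [1, 0, 0, 1],
                [1, 0, 0, 1], [0, 1, 1, 0]], dtype=float)
emb = embed(adj, np.eye(4))
print(greedy_decode(adj, emb, n_devices=2))   # device id per operator
```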

    Prediction based scaling in a distributed stream processing cluster

    The proliferation of IoT sensors and applications has enabled us to monitor and analyze scientific and social phenomena with continuously arriving voluminous data. To provide real-time processing capabilities over streaming data, distributed stream processing engines (DSPEs) such as Apache Storm and Apache Flink have been widely deployed. These frameworks support computations over large-scale, high-frequency streaming data. However, the current on-demand auto-scaling features in these systems may result in inefficient resource utilization, which is closely related to cost effectiveness in popular cloud-based computing environments. We propose ARSTREAM, an auto-scaling computing environment that manages fluctuating throughputs for data from sensor networks while ensuring efficient resource utilization. We have built an artificial neural network model for predicting data processing queues; this model captures the non-linear relationships between data arrival rates, resource utilization, and the size of the data processing queue. If a bottleneck is predicted, ARSTREAM scales out the current cluster automatically for current jobs without halting them at the user level. In addition, ARSTREAM incorporates threshold-based re-balancing to minimize data loss during extreme peak traffic that could not be predicted by our model. Our empirical benchmarks show that ARSTREAM forecasts data processing queue sizes with an RMSE of 0.0429 when tested on real-time data.
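
    The core loop described above can be sketched as follows: a small neural network maps (arrival rate, resource utilization) to a predicted queue size, and the cluster scales out when a bottleneck is predicted. The feature choices, synthetic data, threshold, and network shape are assumptions for illustration, not ARSTREAM's actual configuration.

```python
# Hedged sketch of prediction-based scale-out, with synthetic training data.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(2000, 2))       # [arrival_rate, cpu_util]
y = (X[:, 0] ** 2) * (1 + 5 * X[:, 1])      # synthetic non-linear queue size

model = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=2000,
                     random_state=0)
model.fit(X, y)

def should_scale_out(arrival_rate, cpu_util, queue_threshold=2.0):
    # Scale out when the predicted queue size exceeds the threshold.
    predicted_queue = model.predict([[arrival_rate, cpu_util]])[0]
    return predicted_queue > queue_threshold

print(should_scale_out(0.9, 0.95))   # heavy load -> likely True
print(should_scale_out(0.2, 0.10))   # light load -> likely False
```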

    A Distributed Stream Processing Middleware Framework for Real-Time Analysis of Heterogeneous Data on Big Data Platform: Case of Environmental Monitoring

    In recent years, the application and wide adoption of Internet of Things (IoT)-based technologies have increased the proliferation of monitoring systems, which has consequently exponentially increased the amount of heterogeneous data generated. Processing and analysing the massive amount of data produced is cumbersome, and is gradually moving from the classical 'batch' extract, transform, load (ETL) technique to real-time processing. For instance, in the environmental monitoring and management domain, time-series data and historical datasets are crucial for prediction models. However, the environmental monitoring domain still utilises legacy systems, which complicates real-time analysis of the essential data, integration with big data platforms, and reliance on batch processing. Herein, as a solution, a distributed stream processing middleware framework for real-time analysis of heterogeneous environmental monitoring and management data is presented and tested on a cluster using open source technologies in a big data environment. The system ingests datasets from legacy systems and sensor data from heterogeneous automated weather systems, irrespective of the data types, into Apache Kafka topics using Kafka Connect APIs for processing by the Kafka streaming processing engine. The stream processing engine executes the predictive numerical models and algorithms, represented in event processing (EP) languages, for real-time analysis of the data streams. To prove the feasibility of the proposed framework, we implemented the system using a case study scenario of drought prediction and forecasting based on the Effective Drought Index (EDI) model. Firstly, we transform the predictive model into a form that can be executed by the streaming engine for real-time computing. Secondly, the model is applied to the ingested data streams and datasets to predict drought through persistent querying of the infinite streams to detect anomalies. To conclude the study, a performance evaluation of the distributed stream processing middleware infrastructure is conducted to determine the real-time effectiveness of the framework.
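
    A minimal sketch of the streaming EDI computation follows, consuming daily precipitation values from a Kafka topic. The topic and broker names are hypothetical, and the index shown is the simplified Byun-Wilhite form of EDI (effective precipitation standardized against its own history), not necessarily the paper's exact model; the paper also uses Kafka Streams rather than a plain consumer.

```python
# Hedged sketch: streaming drought detection over a Kafka topic.
import json
import numpy as np
from kafka import KafkaConsumer      # pip install kafka-python

def effective_precipitation(daily, horizon=365):
    # EP = sum over n of (sum of the most recent n daily values) / n.
    p = np.asarray(daily[-horizon:], dtype=float)
    return sum(p[-n:].sum() / n for n in range(1, len(p) + 1))

history, ep_history = [], []
consumer = KafkaConsumer("weather.precipitation",        # hypothetical topic
                         bootstrap_servers="localhost:9092",
                         value_deserializer=lambda v: json.loads(v.decode()))
for msg in consumer:
    history.append(msg.value["precip_mm"])               # hypothetical payload
    ep_history.append(effective_precipitation(history))
    if len(ep_history) > 30:                             # need some history
        dep = ep_history[-1] - np.mean(ep_history)
        edi = dep / (np.std(ep_history) + 1e-9)
        if edi < -1.0:                                   # illustrative threshold
            print(f"drought signal: EDI={edi:.2f}")
```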

    IoTSim-Stream: Modeling stream graph application in cloud simulation

    In the era of big data, the high velocity of data imposes a demand for processing it in real time to gain real-time insights. Various real-time big data platforms/services (e.g., Apache Storm, Amazon Kinesis) allow developers to build real-time big data applications that process continuous data to produce incremental results. Composing those applications into a workflow designed to accomplish a certain goal is becoming more important nowadays. Given the current need to compose those applications into data pipelines forming stream workflow applications (aka stream graph applications) to support decision making, a simulation toolkit is required to simulate the behaviour of such graph applications in a cloud computing environment. Therefore, in this paper, we propose an IoT simulator for stream processing on big data (named IoTSim-Stream) that offers an environment to model complex stream graph applications in a multi-cloud environment, where large-scale simulation-based studies can be conducted to evaluate and analyse these applications. The experimental results show that IoTSim-Stream is effective in modelling and simulating different structures of complex stream graph applications with excellent performance and scalability.
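
    What such a simulator models can be illustrated in miniature: operators with service rates are placed on VMs, and the achievable throughput of a dataflow path is capped by its slowest stage and by any inter-VM link it crosses. The rates, placement, and link bandwidth below are invented for illustration and are far simpler than IoTSim-Stream's actual model.

```python
# Toy throughput model for a placed stream graph path.
service_rate = {"src": 1000, "filter": 800, "agg": 300, "sink": 900}  # events/s
vm_of = {"src": "vm1", "filter": "vm1", "agg": "vm2", "sink": "vm2"}
link_bw = 500                        # events/s between distinct VMs

def path_throughput(path):
    # Bottleneck = slowest operator, further capped by cross-VM links.
    rate = min(service_rate[op] for op in path)
    for a, b in zip(path, path[1:]):
        if vm_of[a] != vm_of[b]:
            rate = min(rate, link_bw)
    return rate

print(path_throughput(["src", "filter", "agg", "sink"]))  # 300 events/s
```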

    Geo-distributed Edge and Cloud Resource Management for Low-latency Stream Processing

    The proliferation of Internet-of-Things (IoT) devices is rapidly increasing the demand for efficient processing of low-latency stream data generated close to the edge of the network. Edge computing provides a layer of infrastructure to fill latency gaps between the IoT devices and the back-end cloud computing infrastructure. A large number of IoT applications require continuous processing of data streams in real time. Edge computing-based stream processing techniques that carefully consider the heterogeneity of the computing and network resources available in the geo-distributed infrastructure provide significant benefits in optimizing the throughput and end-to-end latency of the data streams. Managing geo-distributed resources operated by individual service providers raises new challenges in terms of effective global resource sharing and achieving global efficiency in the resource allocation process. In this dissertation, we present a distributed stream processing framework that optimizes the performance of stream processing applications through a careful allocation of computing and network resources available at the edge of the network. The proposed approach differentiates itself from the state of the art through its careful consideration of data locality and resource constraints during physical plan generation and operator placement for the stream queries. Additionally, it considers co-flow dependencies that exist between the data streams to optimize the network resource allocation through an application-level rate control mechanism. The proposed framework incorporates resilience through a cost-aware partial active replication strategy that minimizes the recovery cost when applications incur failures. The framework employs a reinforcement learning-based online learning model for dynamically determining the level of parallelism to adapt to changing workload conditions. The second dimension of this dissertation proposes a novel model for allocating computing resources in edge and cloud computing environments. In edge computing environments, it allows service providers to establish resource sharing contracts with infrastructure providers a priori in a latency-aware manner. In geo-distributed cloud environments, it allows cloud service providers to establish resource sharing contracts with individual datacenters a priori for defined time intervals in a cost-aware manner. Based on these mechanisms, we develop a decentralized implementation of the contract-based resource allocation model for geo-distributed resources using Smart Contracts in Ethereum.
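
    The online learning of parallelism levels mentioned above can be sketched as a simple bandit loop: an epsilon-greedy agent picks a parallelism level and updates its value estimate from the observed latency. The dissertation's actual RL formulation (state, reward, algorithm) is not given in this abstract, so everything below is an illustrative assumption.

```python
# Hypothetical epsilon-greedy controller for operator parallelism.
import random

class ParallelismAgent:
    def __init__(self, levels=(1, 2, 4, 8), epsilon=0.1):
        self.levels = levels
        self.epsilon = epsilon
        self.value = {p: 0.0 for p in levels}   # running avg of -latency
        self.count = {p: 0 for p in levels}

    def choose(self):
        # Explore with probability epsilon, otherwise exploit.
        if random.random() < self.epsilon:
            return random.choice(self.levels)
        return max(self.levels, key=lambda p: self.value[p])

    def update(self, level, latency_ms):
        # Incremental average of the reward (negative latency).
        self.count[level] += 1
        reward = -latency_ms
        self.value[level] += (reward - self.value[level]) / self.count[level]

agent = ParallelismAgent()
for _ in range(200):                            # simulated feedback loop
    p = agent.choose()
    observed = 100 / p + random.gauss(0, 2)     # latency improves with p
    agent.update(p, observed)
print(max(agent.value, key=agent.value.get))    # likely 8
```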

    Resource provisioning and scheduling algorithms for hybrid workflows in edge cloud computing

    In recent years, Internet of Things (IoT) technology has been involved in a wide range of application domains to provide real-time monitoring, tracking and analysis services. The worldwide number of IoT-connected devices is projected to increase to 43 billion by 2023, and IoT technologies are expected to be engaged in 25% of the business sector. Latency-sensitive applications in the scope of intelligent video surveillance, smart homes, autonomous vehicles and augmented reality are all emergent research directions in industry and academia. These applications require connecting large numbers of sensing devices to attain the desired level of service quality for decision accuracy in a time-sensitive manner. Moreover, continuous data streams impose the processing of large amounts of data, which adds a huge overhead on computing and network resources. Thus, latency-sensitive and resource-intensive applications introduce new challenges for current computing models, i.e., batch and stream. In this thesis, we refer to the integrated application model of stream and batch applications as a hybrid workflow model. The main challenge of the hybrid model is achieving the quality of service (QoS) requirements of the two computation systems. This thesis provides systematic and detailed modelling of hybrid workflows, describing the internal structure of each application type for the purposes of resource estimation, system tuning, and cost modelling. To optimize the execution of hybrid workflows, this thesis proposes algorithms, techniques and frameworks for resource provisioning and task scheduling on various computing systems, including cloud, edge cloud and cooperative edge cloud. Overall, the experimental results provided in this thesis offer strong evidence for the proposed understanding and vision of integrating stream and batch applications, and for how edge computing and other emergent technologies such as 5G networks and IoT will contribute to more sophisticated and intelligent solutions in many life disciplines, for a safer, more secure, healthy, smart and sustainable society.

    RedEdge: A Novel Architecture for Big Data Processing in Mobile Edge Computing Environments

    We are witnessing the emergence of new big data processing architectures due to the convergence of the Internet of Things (IoT), edge computing and cloud computing. Existing big data processing architectures are underpinned by the transfer of raw data streams to the cloud computing environment for processing and analysis. This operation is expensive and fails to meet the real-time processing needs of IoT applications. In this article, we present and evaluate a novel big data processing architecture named RedEdge (i.e., data reduction on the edge) that incorporates mechanisms to facilitate the processing of big data streams near the source of the data. The RedEdge model leverages mobile IoT devices, termed mobile edge devices, as primary data processing platforms. However, when computational and battery power resources are unavailable, it offloads data streams to nearby mobile edge devices or to the cloud. We evaluate the RedEdge architecture and the related mechanism in a real-world experimental setting involving 12 mobile users. The experimental evaluation reveals that the RedEdge model has the capability to reduce big data streams by up to 92.86% without compromising energy and memory consumption on mobile edge devices.
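
    The placement rule summarized above (process locally, otherwise offload to a neighbor, otherwise fall back to the cloud) can be sketched as a small decision function. The thresholds and decision order are assumptions for illustration, not RedEdge's actual parameters.

```python
# Hedged sketch of a RedEdge-style offloading decision.
def choose_processing_site(cpu_free, battery_pct, neighbors):
    """neighbors: list of (device_id, cpu_free, battery_pct) tuples."""
    if cpu_free > 0.3 and battery_pct > 0.2:
        return "local"                      # enough headroom: reduce on-device
    for dev, n_cpu, n_batt in neighbors:
        if n_cpu > 0.3 and n_batt > 0.2:
            return dev                      # offload to a nearby edge device
    return "cloud"                          # last resort: send stream to cloud

print(choose_processing_site(0.1, 0.5, [("edge-7", 0.6, 0.8)]))  # edge-7
```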

    Latency-Aware Strategies for Placing Data Stream Analytics onto Edge Computing

    Much of the "big data" generated today is received in near real-time and requires quick analysis. In the Internet of Things (IoT) [1, 9], for instance, continuous data streams produced by multiple sources must be handled under very short delays. As a result, several stream processing engines have been proposed. Under several engines, a stream processing application is a directed graph or dataflow whose vertices are operators that execute a function over the incoming data and whose edges define how data flows between them. A dataflow has one or multiple sources (i.e., sensors, gateways or actuators), operators that perform transformations on the data (e.g., filtering, mapping, and aggregation) and sinks (i.e., queries that consume or store the data). In a traditional cloud deployment, the whole application is placed in the cloud to benefit from virtually unlimited resources. However, processing all the data in the cloud can introduce latency due to data transfer, which makes near real-time processing difficult to achieve. In contrast, edge computing has become an attractive solution for performing certain stream processing operators, as many edge devices have non-trivial compute capacity. The deployment of data stream processing applications onto heterogeneous infrastructure has been proved to be NP-hard [2]. Moving operators from the cloud to edge devices is challenging due to the limitations of edge devices [5]. Existing work often proposes placement strategies that require user intervention [8]. Many models do not support memory and communication constraints [6, 4], while others consider all data sinks to be located in the cloud, with no feedback loop to actuators located at the edge of the network [3, 7]. There is a lack of solutions covering scenarios involving smart cities, precision agriculture, and smart homes comprising various heterogeneous sensors and actuators, as well as time-constrained applications. We model the data stream processing placement problem considering heterogeneous computational and network resources, with computation and communication modelled as M/M/1 queues (i.e., Poisson arrival distribution, exponential service time and a single server). Events are handled in a first-come, first-served fashion both by the computation and communication services, guaranteeing the time order of events, an important requirement in many data stream processing applications. The model allows us to calculate the waiting and service times for each message in the computation and communication queues, allowing us to estimate the response time. We then propose two strategies to minimize the application response time by splitting the dataflow graph dynamically and distributing the operators across the cloud and edge infrastructure. We focus on real-time analytics applications with multiple sources and sinks distributed across resources. In particular, we first decompose the application graph by considering behaviors such as forks and joins (i.e., split points), and by identifying the operator dependencies recursively. The Response Time Rate (RTR) strategy takes the decomposed graph, organizes the deployment sequence, and consecutively calculates the response time for each operator by considering the previous mappings, resource capabilities, and operator requirements. The RTR with Region Patterns (RTR+RP) strategy extends RTR by exploiting the split points to first find candidate operators for the edge or the cloud and then estimating the response time for the edge operators.
    Comprehensive simulations considering multiple application configurations demonstrate that our approach can improve the response time by up to 50%. For future work, we will investigate further techniques to deal with CPU-intensive operators and their energy consumption.
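
    A small worked example of the M/M/1 model used above: a stable queue with arrival rate lambda and service rate mu has mean sojourn time (waiting plus service) of 1 / (mu - lambda), and the response time of a dataflow path is the sum over its computation and communication queues. The rates below are illustrative, not from the paper.

```python
# Worked M/M/1 response-time estimate for one edge-to-cloud path.
def mm1_sojourn(lam, mu):
    assert lam < mu, "queue must be stable (lambda < mu)"
    return 1.0 / (mu - lam)     # mean waiting time + service time

# source -> (uplink) -> edge operator -> (WAN) -> cloud operator
path = [  # (arrival rate, service rate) in events/s for each queue
    (50, 200),     # uplink from sensor to edge
    (50, 80),      # operator on a constrained edge device
    (50, 500),     # WAN transfer edge -> cloud
    (50, 1000),    # operator on a cloud VM
]
response = sum(mm1_sojourn(lam, mu) for lam, mu in path)
print(f"end-to-end response time: {response * 1000:.1f} ms")   # ~43 ms
```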

    Continuous Learning of HPC Infrastructure Models using Big Data Analytics and In-Memory processing Tools

    This work was supported, in part, by the FP7 ERC Advanced project MULTITHERMAN (g.a. 291125), by the EU H2020 FETHPC project ANTAREX (g.a. 67623) and by the EU H2020 FETHPC project Exanode (g.a. 671578). Exascale computing represents the next leap in the HPC race. Reaching this level of performance is subject to several engineering challenges such as energy consumption, equipment cooling, reliability and massive parallelism. Model-based optimization is an essential tool in the design process and control of energy-efficient, reliable and thermally constrained systems. However, in the Exascale domain, model learning techniques tailored to the specific supercomputer require real measurements and must therefore handle and analyze a massive amount of data coming from the HPC monitoring infrastructure. This rapidly becomes a 'big data' scale problem. The common approach, where measurements are first stored in large databases and then processed, is no longer affordable due to the increasing storage costs and lack of real-time support. Nowadays, instead, cloud-based machine learning techniques aim to build on-line models using real-time approaches such as 'stream processing' and 'in-memory' computing, which avoid storage costs and enable fast data processing. Moreover, the fast delivery and adaptation of the models to quick data variations make the decision stage of the optimization loop more effective and reliable. In this paper, we leverage scalable, lightweight and flexible IoT technologies, such as the MQTT protocol, to build a highly scalable HPC monitoring infrastructure able to handle the massive sensor data produced by next-gen HPC components. We then show how state-of-the-art tools for big data computing and analysis, such as Apache Spark, can be used to manage the huge amount of data delivered by the monitoring layer and to build adaptive models in real time using on-line machine learning techniques.
    Beneventi, Francesco; Bartolini, Andrea; Cavazzoni, Carlo; Benini, Luca
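
    The storage-free pipeline described above can be sketched in miniature: sensor readings arrive over MQTT and update an online regression model incrementally, so no measurements need to be stored first. The broker, topic, and payload format are hypothetical, and the paper itself feeds MQTT data into Apache Spark rather than scikit-learn.

```python
# Hedged sketch: MQTT sensor stream feeding an incremental model.
import json
import numpy as np
import paho.mqtt.client as mqtt              # pip install paho-mqtt
from sklearn.linear_model import SGDRegressor

model = SGDRegressor()                       # e.g., predict node temperature

def on_message(client, userdata, msg):
    # Hypothetical payload: {"power_w": ..., "load": ..., "temp_c": ...}
    sample = json.loads(msg.payload)
    X = np.array([[sample["power_w"], sample["load"]]])
    y = np.array([sample["temp_c"]])
    model.partial_fit(X, y)                  # online update, no storage needed

# paho-mqtt 1.x call style; v2 takes mqtt.CallbackAPIVersion.VERSION1 first.
client = mqtt.Client()
client.on_message = on_message
client.connect("hpc-monitor.example.org", 1883)   # hypothetical broker
client.subscribe("hpc/+/sensors")
client.loop_forever()
```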