3,267 research outputs found

    Model-driven Scheduling for Distributed Stream Processing Systems

    Full text link
    Distributed Stream Processing frameworks are being commonly used with the evolution of Internet of Things(IoT). These frameworks are designed to adapt to the dynamic input message rate by scaling in/out.Apache Storm, originally developed by Twitter is a widely used stream processing engine while others includes Flink, Spark streaming. For running the streaming applications successfully there is need to know the optimal resource requirement, as over-estimation of resources adds extra cost.So we need some strategy to come up with the optimal resource requirement for a given streaming application. In this article, we propose a model-driven approach for scheduling streaming applications that effectively utilizes a priori knowledge of the applications to provide predictable scheduling behavior. Specifically, we use application performance models to offer reliable estimates of the resource allocation required. Further, this intuition also drives resource mapping, and helps narrow the estimated and actual dataflow performance and resource utilization. Together, this model-driven scheduling approach gives a predictable application performance and resource utilization behavior for executing a given DSPS application at a target input stream rate on distributed resources.Comment: 54 page

    On the Benefits of Transparent Compression for Cost-Effective Cloud Data Storage

    Get PDF
    International audienceInfrastructure-as-a-Service (IaaS) cloud computing has revolutionized the way we think of acquiring computational resources: it allows users to deploy virtual machines (VMs) at large scale and pay only for the resources that were actually used throughout the runtime of the VMs. This new model raises new challenges in the design and development of IaaS middleware: excessive storage costs associated with both user data and VM images might make the cloud less attractive, especially for users that need to manipulate huge data sets and a large number of VM images. Storage costs result not only from storage space utilization, but also from bandwidth consumption: in typical deployments, a large number of data transfers between the VMs and the persistent storage are performed, all under high performance requirements. This paper evaluates the trade-off resulting from transparently applying data compression to conserve storage space and bandwidth at the cost of slight computational overhead. We aim at reducing the storage space and bandwidth needs with minimal impact on data access performance. Our solution builds on BlobSeer, a distributed data management service specifically designed to sustain a high throughput for concurrent accesses to huge data sequences that are distributed at large scale. Extensive experiments demonstrate that our approach achieves large reductions (at least 40%) of bandwidth and storage space utilization, while still attaining high performance levels that even surpass the original (no compression) performance levels in several data-intensive scenarios

    Bringing Introspection Into the BlobSeer Data-Management System Using the MonALISA Distributed Monitoring Framework

    Get PDF
    Held in conjunction with CISIS 2010 ConferenceInternational audienceIntrospection is the prerequisite of an autonomic behavior, the first step towards a performance improvement and a resource-usage optimization for large-scale distributed systems. In grid environments, the task of observing the application behavior is assigned to monitoring systems. However, most of them are designed to provide general resource information and do not consider specific information for higher-level services. More specifically, in the context of data-intensive applications, a specific introspection layer is required in order to collect data about the usage of storage resources, about data access patterns, etc. This paper discusses the requirements for an introspection layer in a data-management system for large-scale distributed infrastructures. We focus on the case of BlobSeer, a large-scale distributed system for storing massive data. The paper explains why and how to enhance BlobSeer with introspective capabilities and proposes a three-layered architecture relying on the MonALISA monitoring framework. This approach has been evaluated on the Grid'5000 testbed, with experiments that prove the feasibility of generating relevant information related to the state and the behavior of the system

    Using Global Behavior Modeling to Improve QoS in Cloud Data Storage Services

    Get PDF
    The cloud computing model aims to make large-scale data-intensive computing affordable even for users with limited financial resources, that cannot invest into expensive infrastructures necesssary to run them. In this context, MapReduce is emerging as a highly scalable programming paradigm that enables high-throughput data-intensive processing as a cloud service. Its performance is highly dependent on the underlying storage service, responsible to efficiently support massively parallel data accesses by guaranteeing a high throughput under heavy access concurrency. In this context, quality of service plays a crucial role: the storage service needs to sustain a stable throughput for each individual accesss, in addition to achieving a high aggregated throughput under concurrency. In this paper we propose a technique to address this problem using component monitoring, application-side feedback and behavior pattern analysis to automatically infer useful knowledge about the causes of poor quality of service and provide an easy way to reason in about potential improvements. We apply our proposal to Blob Seer, a representative data storage service specifically designed to achieve high aggregated throughputs and show through extensive experimentation substantial improvements in the stability of individual data read accesses under MapReduce workloads
    corecore