    Unifying data and replica placement for data-intensive services in geographically distributed clouds

    The increased reliance of data management applications on cloud computing technologies has made research into solutions for the data placement problem one of paramount importance. The objective of the classical data placement problem is to optimally partition the set of data-items across distributed data centers, while also allowing for replication, so as to minimize the overall network communication cost. Despite significant advances in data placement research, replica placement has seldom been studied in unison with data placement. More specifically, most existing solutions employ a two-phase approach: 1) data placement, followed by 2) replication. Replication, however, should be seen as an integral part of data placement and studied jointly with it. In this paper, we propose a unified paradigm of data placement, called CPR, which combines data placement and replication of data-intensive services in geographically distributed clouds into a single joint optimization problem. Underneath CPR lies an overlapping correlation clustering algorithm capable of assigning a data-item to multiple data centers, thereby enabling us to jointly solve data placement and replication. Experiments on a real-world trace-based online social network dataset show that CPR is effective and scalable. Empirically, it is approximately 35% better on the evaluated metrics, while being up to 8 times faster in execution time than state-of-the-art techniques.
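
    The overlapping assignment idea at the heart of CPR can be illustrated with a minimal sketch. The function and parameter names below (place_with_replicas, access_freq, comm_cost, budget_factor) and the simplified cost model are assumptions made for illustration, not the paper's actual algorithm; the point is only that a single pass can yield replicated (overlapping) placements.

    ```python
    # Minimal sketch of joint data/replica placement via overlapping assignment.
    # All names and the cost model are illustrative assumptions, not CPR's API.

    def place_with_replicas(items, data_centers, access_freq, comm_cost, budget_factor=1.2):
        """Assign each data-item to one or more data centers in a single pass.

        access_freq[item][dc]: how often users served by `dc` access `item`.
        comm_cost[src][dst]:   network cost of shipping one access between data centers.
        An item is replicated to every data center whose serving cost is within
        `budget_factor` of the best single placement, so replication is not a
        separate second phase.
        """
        placement = {}
        for item in items:
            # Cost of serving all accesses of `item` from a single candidate data center.
            def serving_cost(dc):
                return sum(access_freq[item][src] * comm_cost[src][dc] for src in data_centers)

            best = min(serving_cost(dc) for dc in data_centers)
            # Overlapping assignment: keep every data center close enough to the optimum.
            placement[item] = [dc for dc in data_centers
                               if serving_cost(dc) <= budget_factor * best]
        return placement

    # Example with two data centers and one item accessed mostly from "eu".
    placement = place_with_replicas(
        items=["profile_42"],
        data_centers=["eu", "us"],
        access_freq={"profile_42": {"eu": 9, "us": 1}},
        comm_cost={"eu": {"eu": 0, "us": 5}, "us": {"eu": 5, "us": 0}},
    )
    # placement == {"profile_42": ["eu"]}  (the "us" copy exceeds the budget)
    ```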

    UnifyDR: A Generic Framework for Unifying Data and Replica Placement

    The advent of (big) data management applications operating at Cloud scale has led to extensive research on the data placement problem. The key objective of data placement is to obtain a partitioning (possibly allowing for replicas) of a set of data-items into distributed nodes that minimizes the overall network communication cost. Although replication is intrinsic to data placement, it has seldom been studied in combination with the latter. On the contrary, most of the existing solutions treat them as two independent problems, and employ a two-phase approach: (1) data placement, followed by (2) replica placement. We address this by proposing a new paradigm, CDR, with the objective of combining data and replica placement as a single joint optimization problem. Specifically, we study two variants of the CDR problem: (1) CDR-Single, where the objective is to minimize the communication cost alone, and (2) CDR-Multi, which performs a multi-objective optimization to also minimize traffic and storage costs. To unify data and replica placement, we propose a generic framework called UnifyDR, which leverages overlapping correlation clustering to assign a data-item to multiple nodes, thereby facilitating data and replica placement to be performed jointly. We establish the generic nature of UnifyDR by portraying its ability to address the CDR problem in two real-world use-cases, that of join-intensive online analytical processing (OLAP) queries and a location-based online social network (OSN) service. The effectiveness and scalability of UnifyDR are showcased by experiments performed on data generated using the TPC-DS benchmark and a trace of the Gowalla OSN for the OLAP queries and OSN service use-case, respectively. Empirically, the presented approach obtains an improvement of approximately 35% in terms of the evaluated metrics and a speed-up of 8 times in comparison to state-of-the-art techniques. This work was supported by the Agentschap Innoveren & Ondernemen (VLAIO) Strategic Fundamental Research (SBO) under Grant 150038 (DiSSeCt).
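
    The difference between the two CDR variants can be pictured as a change of objective function. The sketch below is illustrative only: the weights and cost terms are assumptions, not UnifyDR's actual formulation.

    ```python
    # Illustrative objective functions for the two CDR variants described above.
    # Weights and cost terms are assumptions for the sketch, not UnifyDR's formulation.

    def cdr_single_cost(comm_cost):
        """CDR-Single: minimize communication cost alone."""
        return comm_cost

    def cdr_multi_cost(comm_cost, traffic_cost, storage_cost,
                       w_comm=1.0, w_traffic=0.5, w_storage=0.25):
        """CDR-Multi: scalarized multi-objective cost over communication, traffic and storage."""
        return w_comm * comm_cost + w_traffic * traffic_cost + w_storage * storage_cost

    # Example: the two variants can prefer different candidate placements.
    candidates = {
        "placement_a": {"comm": 120.0, "traffic": 40.0, "storage": 10.0},
        "placement_b": {"comm": 100.0, "traffic": 90.0, "storage": 30.0},
    }
    best_single = min(candidates, key=lambda p: cdr_single_cost(candidates[p]["comm"]))
    best_multi = min(candidates, key=lambda p: cdr_multi_cost(candidates[p]["comm"],
                                                              candidates[p]["traffic"],
                                                              candidates[p]["storage"]))
    # best_single == "placement_b", best_multi == "placement_a"
    ```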

    Boosting big data streaming applications in clouds with burstFlow

    The rapid growth of stream applications in financial markets, health care, education, social media, and sensor networks represents a remarkable milestone for data processing and analytics in recent years, leading to new challenges in handling Big Data in real time. Traditionally, Stream Processing applications are deployed on a single cloud infrastructure because it offers extensive and adaptive virtual computing resources. Data sources therefore send data from distant and diverse locations to the cloud infrastructure, increasing application latency. The cloud infrastructure may be geographically distributed, and it requires running a set of frameworks to handle communication. These frameworks often comprise a Message Queue System and a Stream Processing Framework. They exploit multi-cloud deployments, placing each service in a different cloud and communicating via high-latency network links. This creates challenges in meeting real-time application requirements, because the data streams have different and unpredictable latencies, forcing cloud providers' communication systems to adjust continually to environment changes. Previous works explore static micro-batching, demonstrating its potential to overcome communication issues. This paper introduces BurstFlow, a tool for enhancing communication between data sources located at the edges of the Internet and Big Data Stream Processing applications located in cloud infrastructures. BurstFlow introduces a strategy for adjusting micro-batch sizes dynamically according to the time required for communication and computation. BurstFlow also presents an adaptive data partition policy for distributing incoming streams across available machines by considering memory and CPU capacities. The experiments use a real-world multi-cloud deployment, showing that BurstFlow can reduce execution time by up to 77% compared to state-of-the-art solutions, while improving CPU efficiency by up to 49%.
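
    The dynamic micro-batch idea described above can be approximated by a simple feedback rule: grow the batch when communication time dominates (to amortize network overhead) and shrink it when computation lags. The controller below is a hypothetical sketch with illustrative thresholds and step sizes, not BurstFlow's actual policy.

    ```python
    # Hypothetical feedback controller for micro-batch sizing, loosely inspired by
    # the idea described above; thresholds and step sizes are illustrative only.

    def adjust_batch_size(batch_size, comm_time, compute_time,
                          min_size=64, max_size=65536, step=1.25):
        """Return a new micro-batch size based on the last batch's timings.

        comm_time:    seconds spent shipping the batch over the WAN link.
        compute_time: seconds the stream processor needed to process it.
        """
        if comm_time > compute_time:
            # Network overhead dominates: larger batches amortize per-batch latency.
            batch_size = min(int(batch_size * step), max_size)
        elif compute_time > comm_time:
            # Processing is the bottleneck: smaller batches keep end-to-end latency low.
            batch_size = max(int(batch_size / step), min_size)
        return batch_size

    # Example: measured timings from the previous batch drive the next batch size.
    size = 1024
    size = adjust_batch_size(size, comm_time=0.8, compute_time=0.3)  # grows to 1280
    ```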

    Efficient Resource Management for Cloud Computing Environments

    Cloud computing has recently gained popularity as a cost-effective model for hosting and delivering services over the Internet. In a cloud computing environment, a cloud provider packages its physical resources in data centers into virtual resources and offers them to service providers using a pay-as-you-go pricing model. Meanwhile, a service provider uses the rented virtual resources to host its services. This large-scale multi-tenant architecture of cloud computing systems raises key challenges regarding how data center resources should be controlled and managed by both service and cloud providers. This thesis addresses several key challenges pertaining to resource management in cloud environments. From the perspective of service providers, we address the problem of selecting appropriate data centers for service hosting with consideration of resource price, service quality, and dynamic reconfiguration costs. From the perspective of cloud providers, as it has been reported that workloads in real data centers can typically be divided into server-based applications and MapReduce applications with different performance and scheduling criteria, we provide separate resource management solutions for each type of workload. For server-based applications, we provide a dynamic capacity provisioning scheme that dynamically adjusts the number of active servers to achieve the best trade-off between energy savings and scheduling delay, while considering the heterogeneous resource characteristics of both workloads and physical machines. For MapReduce applications, we first analyzed the run-time resource consumption of a large variety of MapReduce jobs and discovered that it can vary significantly over time, depending on the phase the task is currently executing. We then present a novel scheduling algorithm that controls task execution at the level of phases with the aim of improving both job running time and resource utilization. Through detailed simulations and experiments using real cloud clusters, we found that our proposed solutions achieve substantial gains compared to current state-of-the-art resource management solutions, and therefore have strong implications for the design of real cloud resource management systems in practice.
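
    The dynamic capacity provisioning idea, trading energy savings against scheduling delay, can be sketched as a controller that powers servers on or off based on observed queueing delay. The names and thresholds below are assumptions for illustration, not the thesis's actual scheme.

    ```python
    # Illustrative capacity controller: scale active servers with scheduling delay.
    # Thresholds and parameter names are assumptions, not the thesis's actual algorithm.

    def target_active_servers(active, avg_queue_delay, energy_pressure,
                              delay_slo=2.0, min_servers=1, max_servers=100):
        """Decide how many servers should stay powered on.

        avg_queue_delay: observed average scheduling delay (seconds).
        energy_pressure: True when the operator wants to favor energy savings.
        delay_slo:       target scheduling delay; exceeding it triggers scale-out.
        """
        if avg_queue_delay > delay_slo:
            active += 1          # delay too high: power on another server
        elif energy_pressure and avg_queue_delay < 0.5 * delay_slo:
            active -= 1          # ample headroom: power one down to save energy
        return max(min_servers, min(active, max_servers))

    # Example: delay above the SLO forces a scale-out from 10 to 11 servers.
    servers = target_active_servers(active=10, avg_queue_delay=3.1, energy_pressure=True)
    ```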

    Big Data and Large-scale Data Analytics: Efficiency of Sustainable Scalability and Security of Centralized Clouds and Edge Deployment Architectures

    One of the significant shifts in next-generation computing technologies will certainly be in the development of Big Data (BD) deployment architectures. Apache Hadoop, the BD landmark, has evolved into a widely deployed BD operating system. Its new features include a federation structure and many associated frameworks, which provide Hadoop 3.x with the maturity to serve different markets. This dissertation addresses two leading issues involved in exploiting BD and large-scale data analytics using the Hadoop platform: (i) scalability, which directly affects system performance and overall throughput, addressed using portable Docker containers; and (ii) security, which spreads the adoption of data protection practices among practitioners, addressed using access controls. An Enhanced MapReduce Environment (EME), an OPportunistic and Elastic Resource Allocation (OPERA) scheduler, a BD Federation Access Broker (BDFAB), and a multi-tier Secure Intelligent Transportation System (SITS) for data streaming to the cloud are the main contributions of this thesis.

    Building interactive distributed processing applications at a global scale

    Along with the continuous engagement with technology, many latency-sensitive interactive applications have emerged, e.g., global content sharing in social networks, adaptive lights/temperatures in smart buildings, and online multi-user games. These applications typically process a massive amount of data at a global scale. In these cases, distributing storage and processing is key to handling the large scale. Distribution necessitates handling two main aspects: a) the placement of data/processing and b) the data motion across the distributed locations. However, handling the distribution while meeting latency guarantees at large scale comes with many challenges around hiding the heterogeneity and diversity of devices and workloads, handling dynamism in the environment, providing continuous availability despite failures, and supporting persistent large state. In this thesis, we show how latency-driven designs for placement and data motion can be used to build production infrastructures for interactive applications at a global scale, while also addressing myriad challenges of heterogeneity, dynamism, state, and availability. We demonstrate that a latency-driven approach is general and applicable at all layers of the stack: from storage, to processing, down to networking. We designed and built four distinct systems across this spectrum. First, we developed Ambry (in collaboration with LinkedIn), a geo-distributed storage system for interactive data sharing across the globe. Ambry is LinkedIn's mainstream production system for all its media content, running across 4 datacenters and serving over 500 million users. Ambry minimizes user-perceived latency via smart data placement and propagation. Second, we built two processing systems: a traditional model, Samza, and the avant-garde model, Steel. Samza (in collaboration with LinkedIn) is a production stream processing framework used at 15 companies (including LinkedIn, Uber, Netflix, and TripAdvisor), powering >200 pipelines at LinkedIn alone. Samza minimizes the impact of data motion on end-to-end latency, thus enabling large persistent state (100s of TB) along with processing. Steel (in collaboration with Microsoft) extends processing to the emerging edge. Integrated with Azure, Steel dynamically optimizes placement and data motion across the entire edge-cloud environment. Finally, we designed FreeFlow, a high-performance networking mechanism for containers. Using container placement, FreeFlow opportunistically bypasses networking layers, minimizing data motion and reducing latency (by up to 3 orders of magnitude).
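
    As a toy illustration of the latency-driven placement theme, the sketch below picks, for each object, the replica set minimizing the worst user-perceived latency across the regions that access it. All names and the brute-force search are hypothetical and are not how Ambry, Steel, or FreeFlow actually work.

    ```python
    # Toy latency-driven replica selection; all names and the brute-force search
    # are illustrative only, not the placement algorithms of the systems above.
    from itertools import combinations

    def choose_replica_set(regions, datacenters, latency_ms, replicas=2):
        """Pick the set of `replicas` datacenters minimizing the worst per-region latency.

        latency_ms[region][dc]: measured round-trip latency from a user region to a DC.
        Each region is assumed to be served by its nearest chosen replica.
        """
        def worst_latency(dcs):
            return max(min(latency_ms[r][dc] for dc in dcs) for r in regions)

        return min(combinations(datacenters, replicas), key=worst_latency)

    # Example with made-up latencies (milliseconds).
    latency = {
        "eu":   {"dc_us": 120, "dc_eu": 15,  "dc_asia": 210},
        "us":   {"dc_us": 20,  "dc_eu": 110, "dc_asia": 180},
        "asia": {"dc_us": 190, "dc_eu": 200, "dc_asia": 25},
    }
    best = choose_replica_set(latency.keys(), ["dc_us", "dc_eu", "dc_asia"], latency, replicas=2)
    # best == ("dc_eu", "dc_asia"): worst-case latency is 110 ms (the "us" region)
    ```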

    Storage QoS provisioning for execution programming of data-intensive applications

    In this paper, a method for execution programming of data-intensive applications is presented. The method is based on storage Quality of Service (SQoS) provisioning. SQoS provisioning relies on semantics-based storage monitoring, built on a storage resource model, and on storage performance management. Test results show the gain in execution time when using the QStorMan toolkit, which implements the presented method. Taking into account the SQoS provisioning opportunity on the one hand, and ever-growing user demands on the other, we believe that execution programming of data-intensive applications can bring a new quality to application execution.
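
    A minimal sketch of the SQoS provisioning idea: monitored performance metrics of each storage resource are matched against an application's declared requirements before execution is scheduled. The names and metrics below are hypothetical, not the QStorMan toolkit's actual interface.

    ```python
    # Minimal sketch of storage QoS (SQoS) provisioning; names and metrics are
    # hypothetical, not the QStorMan toolkit's actual interface.

    def select_storage(resources, requirements):
        """Pick a storage resource whose monitored metrics satisfy the QoS requirements.

        resources:    {name: {"throughput_mbps": float, "latency_ms": float, "free_gb": float}}
        requirements: {"throughput_mbps": min, "latency_ms": max, "free_gb": min}
        """
        candidates = [
            name for name, m in resources.items()
            if m["throughput_mbps"] >= requirements["throughput_mbps"]
            and m["latency_ms"] <= requirements["latency_ms"]
            and m["free_gb"] >= requirements["free_gb"]
        ]
        # Prefer the candidate with the most throughput headroom; None if nothing qualifies.
        return max(candidates, key=lambda n: resources[n]["throughput_mbps"], default=None)

    # Example: two monitored volumes, one application requirement profile.
    chosen = select_storage(
        resources={
            "nfs_pool_a": {"throughput_mbps": 400, "latency_ms": 8, "free_gb": 900},
            "nfs_pool_b": {"throughput_mbps": 150, "latency_ms": 3, "free_gb": 200},
        },
        requirements={"throughput_mbps": 200, "latency_ms": 10, "free_gb": 500},
    )
    # chosen == "nfs_pool_a"
    ```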