
    Scheduling Data Intensive Workloads through Virtualization on MapReduce based Clouds

    MapReduce has become a popular programming model for running data-intensive applications on the cloud. Completion time goals, or deadlines, that users set for MapReduce jobs are becoming crucial in existing cloud-based data processing environments such as Hadoop. There is a conflict between scheduling MapReduce jobs to meet deadlines and preserving "data locality" (assigning tasks to nodes that hold their input data): to meet a deadline, a task may be scheduled on a node that does not hold its input data, causing expensive data transfer from a remote node. In this paper, a novel scheduler is proposed to address this problem, based primarily on dynamic resource reconfiguration. It has two components: 1) a Resource Predictor, which dynamically determines the number of Map/Reduce slots each job needs to meet its completion time guarantee; and 2) a Resource Reconfigurator, which dynamically increases or decreases the CPU resources of individual VMs, without violating users' completion time goals, to maximize data locality and the utilization of system resources across the active jobs. The proposed scheduler has been evaluated against the Fair Scheduler on a virtual cluster built on a physical cluster of 20 machines. The results demonstrate a gain of about 12% in job throughput.
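
    The abstract does not give the Resource Predictor's slot-estimation rule; as a rough, hedged illustration of the idea only, the sketch below estimates the number of slots needed to meet a deadline from the remaining task count and an average task duration (the function name, the simple work/time model, and the example numbers are assumptions, not the paper's method).

```python
import math

def required_slots(remaining_tasks, avg_task_seconds, seconds_to_deadline):
    """Estimate how many Map/Reduce slots a job needs so that its remaining
    tasks finish before the deadline, assuming identical slots processing
    tasks of roughly equal duration."""
    if seconds_to_deadline <= 0:
        raise ValueError("deadline already passed")
    total_work = remaining_tasks * avg_task_seconds      # serial work left, in seconds
    return math.ceil(total_work / seconds_to_deadline)   # slots needed to finish in time

# Example: 120 map tasks of ~30 s each, with 10 minutes left before the deadline.
print(required_slots(120, 30.0, 600))  # -> 6
```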

    Scheduling in MapReduce Clusters

    MapReduce is a framework proposed by Google for processing huge amounts of data in a distributed environment. The simplicity of its programming model and the fault tolerance of the framework make it very popular for Big Data processing. As MapReduce clusters grow in popularity, their scheduling becomes increasingly important. On one hand, many MapReduce applications have high performance requirements, for example on response time and/or throughput. On the other hand, with the increasing size of MapReduce clusters, energy-efficient scheduling of these clusters becomes essential. These scheduling challenges, however, have not been studied systematically. The objective of this dissertation is to provide MapReduce applications with low cost and low energy consumption through the development of scheduling theory and algorithms, energy models, and energy-aware resource management. In particular, we investigate energy-efficient scheduling in hybrid CPU-GPU MapReduce clusters. This research is expected to achieve a breakthrough in Big Data processing, particularly in providing green computing to Big Data applications such as social network analysis, medical care data mining, and financial fraud detection. The tools we propose to develop are expected to increase utilization and reduce energy consumption in MapReduce clusters. In this PhD dissertation, we propose to address the aforementioned challenges by investigating and developing 1) a matchmaking scheduling algorithm for improving the data locality of MapReduce applications, 2) a real-time scheduling algorithm for heterogeneous MapReduce clusters, and 3) an energy-efficient scheduler for hybrid CPU-GPU MapReduce clusters. Advisers: Ying Lu and David Swanson
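
    The matchmaking algorithm itself is not described in this abstract; the sketch below is only a minimal illustration, under assumed data structures, of how a data-locality-aware scheduler in that spirit might assign tasks: prefer a task whose input split is stored on the requesting node, and give each node one marked pass before accepting a remote task.

```python
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    input_locations: set   # hosts that store this task's input split

def assign_task(host, pending_tasks, marked_hosts):
    """Prefer a task whose input split is stored on the requesting host;
    otherwise mark the host and skip one round, so a later heartbeat may
    find a local match, before falling back to a remote task."""
    for task in pending_tasks:
        if host in task.input_locations:
            pending_tasks.remove(task)
            marked_hosts.discard(host)
            return task                      # local assignment
    if host not in marked_hosts:
        marked_hosts.add(host)
        return None                          # skip this round
    marked_hosts.discard(host)
    return pending_tasks.pop(0) if pending_tasks else None   # remote assignment

# Example: node2 holds no local split, so it is skipped once, then served remotely.
tasks = [Task("m_000001", {"node1"}), Task("m_000002", {"node3"})]
marked = set()
print(assign_task("node2", tasks, marked))   # None (node marked, waits one round)
print(assign_task("node2", tasks, marked))   # Task m_000001 assigned remotely
```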

    Big Data and Large-scale Data Analytics: Efficiency of Sustainable Scalability and Security of Centralized Clouds and Edge Deployment Architectures

    One of the significant shifts in next-generation computing technologies will certainly be in the development of Big Data (BD) deployment architectures. Apache Hadoop, the BD landmark, has evolved into a widely deployed BD operating system. Its new features include a federation structure and many associated frameworks, which give Hadoop 3.x the maturity to serve different markets. This dissertation addresses two leading issues involved in exploiting BD and large-scale data analytics on the Hadoop platform: (i) scalability, which directly affects system performance and overall throughput, addressed using portable Docker containers; and (ii) security, which spreads the adoption of data protection practices among practitioners, addressed using access controls. An Enhanced MapReduce Environment (EME), an OPportunistic and Elastic Resource Allocation (OPERA) scheduler, a BD Federation Access Broker (BDFAB), and a Secure Intelligent Transportation System (SITS) with a multi-tier architecture for streaming data to cloud computing are the main contributions of this thesis.
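
    The abstract only names the BD Federation Access Broker (BDFAB) as an access-control contribution; the snippet below is a hypothetical sketch of the kind of role-based check such a broker could perform, with every role, action, and path invented for illustration rather than taken from the thesis.

```python
# Hypothetical role-based policy, only to illustrate the kind of check an
# access broker such as BDFAB might perform; not the design from the thesis.
POLICY = {
    "analyst":  {"read"},
    "engineer": {"read", "write"},
    "admin":    {"read", "write", "grant"},
}

def broker_allows(role: str, action: str, dataset: str) -> bool:
    """Return True if the role's policy permits the requested action."""
    allowed = POLICY.get(role, set())
    decision = action in allowed
    print(f"{role} -> {action} on {dataset}: {'ALLOW' if decision else 'DENY'}")
    return decision

broker_allows("analyst", "write", "hdfs://federation/traffic/streams")   # DENY
broker_allows("engineer", "write", "hdfs://federation/traffic/streams")  # ALLOW
```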

    DyScale: A MapReduce Job Scheduler for Heterogeneous Multicore Processors

    The functionality of modern multi-core processors is often driven by a given power budget that requires designers to evaluate different design trade-offs, e.g., to choose between many slow, power-efficient cores, fewer fast, power-hungry cores, or a combination of the two. Here, we prototype and evaluate a new Hadoop scheduler, called DyScale, that exploits the capabilities offered by heterogeneous cores within a single multi-core processor to achieve a variety of performance objectives. A typical MapReduce workload contains jobs with different performance goals: large, batch jobs that are throughput oriented, and smaller interactive jobs that are response time sensitive. Heterogeneous multi-core processors enable creating virtual resource pools based on slow and fast cores for multi-class priority scheduling. Since the same data can be accessed with either slow or fast slots, spare resources (slots) can be shared between the different resource pools. Using measurements on an actual experimental setup and via simulation, we argue in favor of heterogeneous multi-core processors, as they achieve faster (up to 40 percent) processing of small, interactive MapReduce jobs while offering improved throughput (up to 40 percent) for large, batch jobs. We evaluate the performance benefits of DyScale versus the FIFO and Capacity job schedulers that are widely used in the Hadoop community.
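
    Based only on the pool-and-sharing behavior described above, the following is a toy sketch of how jobs might be placed onto fast-core and slow-core slot pools with borrowing of spare slots; the pool sizes, job classes, and borrowing rule are assumptions, not DyScale's actual policy.

```python
class SlotPools:
    """Toy two-pool placement rule: interactive jobs prefer fast-core slots,
    batch jobs prefer slow-core slots, and each class may borrow the other
    pool's spare slots when its preferred pool is exhausted."""
    def __init__(self, fast_slots, slow_slots):
        self.free = {"fast": fast_slots, "slow": slow_slots}

    def acquire(self, job_class):
        preferred = "fast" if job_class == "interactive" else "slow"
        fallback = "slow" if preferred == "fast" else "fast"
        for pool in (preferred, fallback):
            if self.free[pool] > 0:
                self.free[pool] -= 1
                return pool      # pool the slot was taken from
        return None              # no free slot; the job waits

pools = SlotPools(fast_slots=1, slow_slots=2)
print(pools.acquire("interactive"))  # 'fast'
print(pools.acquire("interactive"))  # 'slow' (borrows a spare slow slot)
print(pools.acquire("batch"))        # 'slow'
print(pools.acquire("batch"))        # None (all slots busy)
```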