51 research outputs found

    Scheduling MapReduce Jobs under Multi-Round Precedences

    Full text link
    We consider non-preemptive scheduling of MapReduce jobs with multiple tasks in the practical scenario where each job requires several map-reduce rounds. We seek to minimize the average weighted completion time and consider scheduling on identical and unrelated parallel processors. For identical processors, we present LP-based O(1)-approximation algorithms. For unrelated processors, the approximation ratio naturally depends on the maximum number of rounds of any job. Since the number of rounds per job in typical MapReduce algorithms is a small constant, our scheduling algorithms achieve a small approximation ratio in practice. For the single-round case, we substantially improve on previously best known approximation guarantees for both identical and unrelated processors. Moreover, we conduct an experimental analysis and compare the performance of our algorithms against a fast heuristic and a lower bound on the optimal solution, thus demonstrating their promising practical performance

    Energy Efficient Scheduling of MapReduce Jobs

    Full text link
    MapReduce is emerged as a prominent programming model for data-intensive computation. In this work, we study power-aware MapReduce scheduling in the speed scaling setting first introduced by Yao et al. [FOCS 1995]. We focus on the minimization of the total weighted completion time of a set of MapReduce jobs under a given budget of energy. Using a linear programming relaxation of our problem, we derive a polynomial time constant-factor approximation algorithm. We also propose a convex programming formulation that we combine with standard list scheduling policies, and we evaluate their performance using simulations.Comment: 22 page

    Offline Scheduling of Map and Reduce Tasks on Hadoop Systems

    Get PDF
    International audienceMapReduce is a model to manage quantities massive of data. It is based on the distributed and parallel execution of tasks over the cluster of machines. Hadoop is an implementation of MapReduce model, it is used to offer BigData services on the cloud. In this paper, we expose the scheduling problem on Hadoop systems. We focus on the offline-scheduling, expose the problem in a mathematic model and use the time-indexed formulation. We aim consider the maximum of constraints of the MapReduce environment. Solutions for the presented model would be a reference for the on-line Schedules in the case of low and medium instances. Our work is useful in term of the problem definition: constraints are based on observations and take into account resources consumption, data locality, heterogeneous machines and workflow management; this paper defines boundaries references to evaluate the online model

    Doctor of Philosophy

    Get PDF
    dissertationDataflow pipeline models are widely used in visualization systems. Despite recent advancements in parallel architecture, most systems still support only a single CPU or a small collection of CPUs such as a SMP workstation. Even for systems that are specifically tuned towards parallel visualization, their execution models only provide support for data-parallelism while ignoring taskparallelism and pipeline-parallelism. With the recent popularization of machines equipped with multicore CPUs and multi-GPU units, these visualization systems are undoubtedly falling further behind in reaching maximum efficiency. On the other hand, there exist several libraries that can schedule program executions on multiple CPUs and/or multiple GPUs. However, due to differences in executing a task graph and a pipeline along with their APIs being considerably low-level, it still remains a challenge to integrate these run-time libraries into current visualization systems. Thus, there is a need for a redesigned dataflow architecture to fully support and exploit the power of highly parallel machines in large-scale visualization. The new design must be able to schedule executions on heterogeneous platforms while at the same time supporting arbitrarily large datasets through the use of streaming data structures. The primary goal of this dissertation work is to develop a parallel dataflow architecture for streaming large-scale visualizations. The framework includes supports for platforms ranging from multicore processors to clusters consisting of thousands CPUs and GPUs. We achieve this in our system by introducing the notion of Virtual Processing Elements and Task-Oriented Modules along with a highly customizable scheduler that controls the assignment of tasks to elements dynamically. This creates an intuitive way to maintain multiple CPU/GPU kernels yet still provide coherency and synchronization across module executions. We have implemented these techniques into HyperFlow which is made of an API with all basic dataflow constructs described in the dissertation, and a distributed run-time library that can be used to deploy those pipelines on multicore, multi-GPU and cluster-based platforms

    High Performance Computing for DNA Sequence Alignment and Assembly

    Get PDF
    Recent advances in DNA sequencing technology have dramatically increased the scale and scope of DNA sequencing. These data are used for a wide variety of important biological analyzes, including genome sequencing, comparative genomics, transcriptome analysis, and personalized medicine but are complicated by the volume and complexity of the data involved. Given the massive size of these datasets, computational biology must draw on the advances of high performance computing. Two fundamental computations in computational biology are read alignment and genome assembly. Read alignment maps short DNA sequences to a reference genome to discover conserved and polymorphic regions of the genome. Genome assembly computes the sequence of a genome from many short DNA sequences. Both computations benefit from recent advances in high performance computing to efficiently process the huge datasets involved, including using highly parallel graphics processing units (GPUs) as high performance desktop processors, and using the MapReduce framework coupled with cloud computing to parallelize computation to large compute grids. This dissertation demonstrates how these technologies can be used to accelerate these computations by orders of magnitude, and have the potential to make otherwise infeasible computations practical

    Heuristics for periodical batch job scheduling in a MapReduce computing framework

    Full text link
    Task scheduling has a significant impact on the performance of the MapReduce computing framework. In this paper, a scheduling problem of periodical batch jobs with makespan minimization is considered. The problem is modeled as a general two-stage hybrid flow shop scheduling problem with schedule-dependent setup times. The new model incorporates the data locality of tasks and is formulated as an integer program. Three heuristics are developed to solve the problem and an improvement policy based on data locality is presented to enhance the methods. A lower bound of the makespan is derived. 150 instances are randomly generated from data distributions drawn from a real cluster. The parameters involved in the methods are set according to different cluster setups. The proposed heuristics are compared over different numbers of jobs and cluster setups. Computational results show that the performance of the methods is highly dependent on both the number of jobs and the cluster setups. The proposed improvement policy is effective and the impact of the input data distribution on the policy is analyzed and tested.This work is supported by the National Natural Science Foundation of China (No. 61272377) and the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20120092110027). Ruben Ruiz is partially supported by the Spanish Ministry of Economy and Competitiveness, under the project "RESULT - Realistic Extended Scheduling Using Light Techniques" (No. DPI2012-36243-C02-01) partially financed with FEDER funds.Xiaoping Li; Tianze Jiang; Ruiz García, R. (2016). Heuristics for periodical batch job scheduling in a MapReduce computing framework. Information Sciences. 326:119-133. https://doi.org/10.1016/j.ins.2015.07.040S11913332

    Learning workload behaviour models from monitored time-series for resource estimation towards data center optimization

    Get PDF
    In recent years there has been an extraordinary growth of the demand of Cloud Computing resources executed in Data Centers. Modern Data Centers are complex systems that need management. As distributed computing systems grow, and workloads benefit from such computing environments, the management of such systems increases in complexity. The complexity of resource usage and power consumption on cloud-based applications makes the understanding of application behavior through expert examination difficult. The difficulty increases when applications are seen as "black boxes", where only external monitoring can be retrieved. Furthermore, given the different amount of scenarios and applications, automation is required. To deal with such complexity, Machine Learning methods become crucial to facilitate tasks that can be automatically learned from data. Firstly, this thesis proposes an unsupervised learning technique to learn high level representations from workload traces. Such technique provides a fast methodology to characterize workloads as sequences of abstract phases. The learned phase representation is validated on a variety of datasets and used in an auto-scaling task where we show that it can be applied in a production environment, achieving better performance than other state-of-the-art techniques. Secondly, this thesis proposes a neural architecture, based on Sequence-to-Sequence models, that provides the expected resource usage of applications sharing hardware resources. The proposed technique provides resource managers the ability to predict resource usage over time as well as the completion time of the running applications. The technique provides lower error predicting usage when compared with other popular Machine Learning methods. Thirdly, this thesis proposes a technique for auto-tuning Big Data workloads from the available tunable parameters. The proposed technique gathers information from the logs of an application generating a feature descriptor that captures relevant information from the application to be tuned. Using this information we demonstrate that performance models can generalize up to a 34% better when compared with other state-of-the-art solutions. Moreover, the search time to find a suitable solution can be drastically reduced, with up to a 12x speedup and almost equal quality results as modern solutions. These results prove that modern learning algorithms, with the right feature information, provide powerful techniques to manage resource allocation for applications running in cloud environments. This thesis demonstrates that learning algorithms allow relevant optimizations in Data Center environments, where applications are externally monitored and careful resource management is paramount to efficiently use computing resources. We propose to demonstrate this thesis in three areas that orbit around resource management in server environmentsEls Centres de Dades (Data Centers) moderns són sistemes complexos que necessiten ser gestionats. A mesura que creixen els sistemes de computació distribuïda i les aplicacions es beneficien d’aquestes infraestructures, també n’augmenta la seva complexitat. La complexitat que implica gestionar recursos de còmput i d’energia en sistemes de computació al núvol fa difícil entendre el comportament de les aplicacions que s'executen de manera manual. Aquesta dificultat s’incrementa quan les aplicacions s'observen com a "caixes negres", on només es poden monitoritzar algunes mètriques de les caixes de manera externa. A més, degut a la gran varietat d’escenaris i aplicacions, és necessari automatitzar la gestió d'aquests recursos. Per afrontar-ne el repte, l'aprenentatge automàtic juga un paper cabdal que facilita aquestes tasques, que poden ser apreses automàticament en base a dades prèvies del sistema que es monitoritza. Aquesta tesi demostra que els algorismes d'aprenentatge poden aportar optimitzacions molt rellevants en la gestió de Centres de Dades, on les aplicacions són monitoritzades externament i la gestió dels recursos és de vital importància per a fer un ús eficient de la capacitat de còmput d'aquests sistemes. En primer lloc, aquesta tesi proposa emprar aprenentatge no supervisat per tal d’aprendre representacions d'alt nivell a partir de traces d'aplicacions. Aquesta tècnica ens proporciona una metodologia ràpida per a caracteritzar aplicacions vistes com a seqüències de fases abstractes. La representació apresa de fases és validada en diferents “datasets” i s'aplica a la gestió de tasques d'”auto-scaling”, on es conclou que pot ser aplicable en un medi de producció, aconseguint un millor rendiment que altres mètodes de vanguardia. En segon lloc, aquesta tesi proposa l'ús de xarxes neuronals, basades en arquitectures “Sequence-to-Sequence”, que proporcionen una estimació dels recursos usats per aplicacions que comparteixen recursos de hardware. La tècnica proposada facilita als gestors de recursos l’habilitat de predir l'ús de recursos a través del temps, així com també una estimació del temps de còmput de les aplicacions. Tanmateix, redueix l’error en l’estimació de recursos en comparació amb d’altres tècniques populars d'aprenentatge automàtic. Per acabar, aquesta tesi introdueix una tècnica per a fer “auto-tuning” dels “hyper-paràmetres” d'aplicacions de Big Data. Consisteix així en obtenir informació dels “logs” de les aplicacions, generant un vector de característiques que captura informació rellevant de les aplicacions que s'han de “tunejar”. Emprant doncs aquesta informació es valida que els ”Regresors” entrenats en la predicció del rendiment de les aplicacions són capaços de generalitzar fins a un 34% millor que d’altres “Regresors” de vanguàrdia. A més, el temps de cerca per a trobar una bona solució es pot reduir dràsticament, aconseguint un increment de millora de fins a 12 vegades més dels resultats de qualitat en contraposició a alternatives modernes. Aquests resultats posen de manifest que els algorismes moderns d'aprenentatge automàtic esdevenen tècniques molt potents per tal de gestionar l'assignació de recursos en aplicacions que s'executen al núvol.Arquitectura de computador
    corecore