21 research outputs found

    Scheduling in Mapreduce Clusters

    MapReduce is a framework proposed by Google for processing huge amounts of data in a distributed environment. The simplicity of its programming model and its built-in fault tolerance have made it very popular for Big Data processing. As MapReduce clusters grow in popularity, their scheduling becomes increasingly important. On the one hand, many MapReduce applications have high performance requirements, for example on response time and/or throughput. On the other hand, as MapReduce clusters grow in size, energy-efficient scheduling becomes a necessity. These scheduling challenges, however, have not been systematically studied. The objective of this dissertation is to provide MapReduce applications with low cost and energy consumption through the development of scheduling theory and algorithms, energy models, and energy-aware resource management. In particular, we will investigate energy-efficient scheduling in hybrid CPU-GPU MapReduce clusters. This research is expected to achieve a breakthrough in Big Data processing, particularly in providing green computing to Big Data applications such as social network analysis, medical care data mining, and financial fraud detection. The tools we propose to develop are expected to increase utilization and reduce energy consumption in MapReduce clusters. In this PhD dissertation, we propose to address the aforementioned challenges by investigating and developing 1) a match-making scheduling algorithm for improving the data locality of MapReduce applications, 2) a real-time scheduling algorithm for heterogeneous MapReduce clusters, and 3) an energy-efficient scheduler for hybrid CPU-GPU MapReduce clusters. Advisers: Ying Lu and David Swanso

    Task Scheduling in Big Data Platforms: A Systematic Literature Review

    Context: Hadoop, Spark, Storm, and Mesos are well-known frameworks in both the research and industrial communities that allow expressing and processing distributed computations on massive amounts of data. Multiple scheduling algorithms have been proposed to ensure that short interactive jobs, large batch jobs, and guaranteed-capacity production jobs running on these frameworks can deliver results quickly while maintaining high throughput. However, only a few works have examined the effectiveness of these algorithms. Objective: The Evidence-based Software Engineering (EBSE) paradigm and its core tool, the Systematic Literature Review (SLR), were introduced to the Software Engineering community in 2004 to help researchers systematically and objectively gather and aggregate research evidence about different topics. In this paper, we conduct an SLR of task scheduling algorithms that have been proposed for big data platforms. Method: We analyse the design decisions of different scheduling models proposed in the literature for Hadoop, Spark, Storm, and Mesos over the period between 2005 and 2016. We provide a research taxonomy for succinct classification of these scheduling models. We also compare the algorithms in terms of performance, resource utilization, and failure recovery mechanisms. Results: Our searches identify 586 studies from the highest-quality journals, conferences, and workshops in this field. This SLR reports on different types of scheduling models (dynamic, constrained, and adaptive) and the main motivations behind them (including data locality, workload balancing, resource utilization, and energy efficiency). A discussion of some open issues and future challenges pertaining to improving the current studies is provided.

    Running stream-like programs on heterogeneous multi-core systems

    All major semiconductor companies are now shipping multi-cores. Phones, PCs, laptops, and mobile internet devices will all require software that can make effective use of these cores. Writing high-performance parallel software is difficult, time-consuming, and error-prone, increasing both time-to-market and cost. Software outlives hardware; it typically takes longer to develop new software than hardware, and legacy software tends to survive for a long time, during which the number of cores per system will increase. Development and maintenance productivity will be improved if parallelism and technical details are managed by the machine, while the programmer reasons about the application as a whole. Parallel software should be written using domain-specific high-level languages or extensions. These languages reveal implicit parallelism, which would be obscured by a sequential language such as C. When memory allocation and program control are managed by the compiler, the program's structure and data layout can be safely and reliably modified by high-level compiler transformations. One important application domain contains so-called stream programs, which are structured as independent kernels interacting only through one-way channels, called streams. Stream programming is not applicable to all programs, but it arises naturally in audio and video encode and decode, 3D graphics, and digital signal processing. This representation enables high-level transformations, including kernel unrolling and kernel fusion. This thesis develops new compiler and run-time techniques for stream programming. The first part of the thesis is concerned with a statically scheduled stream compiler. It introduces a new static partitioning algorithm, which determines which kernels should be fused in order to balance the loads on the processors and interconnects. A good partitioning algorithm is crucial if the compiler is to produce efficient code. The algorithm also takes account of downstream compiler passes (specifically software pipelining and buffer allocation), and it models the compiler's ability to fuse kernels. The latter is important because the compiler may not be able to fuse arbitrary collections of kernels. This thesis also introduces a static queue sizing algorithm. This algorithm is important when memory is distributed, especially when local stores are small. The algorithm takes account of latencies and variations in computation time, and is constrained by the sizes of the local memories. The second part of this thesis is concerned with dynamic scheduling of stream programs. First, it investigates the performance of known online, non-preemptive, non-clairvoyant dynamic schedulers. Second, it proposes two dynamic schedulers for stream programs. The first is specifically for one-dimensional stream programs. The second is more general: it does not need to be told the stream graph, but it has slightly larger overhead. This thesis also introduces some support tools related to stream programming. StarssCheck is a debugging tool, based on Valgrind, for the StarSs task-parallel programming language. It generates a warning whenever the program's behaviour contradicts a pragma annotation. Such behaviour could otherwise lead to exceptions or race conditions. StreamIt to OmpSs is a tool to convert a streaming program in the StreamIt language into a dynamically scheduled task-based program using StarSs. Postprint (published version

    Performance Isolation in Multi-Tenant Applications

    The thesis presents methods to isolate different tenants sharing one application instance with regard to the performance they observe. To this end, request-based admission control is introduced. Furthermore, the thesis presents methods and novel metrics to evaluate the degree of isolation a system achieves. These insights are used to evaluate the developed isolation methods, resulting in recommendations of methods for various scenarios.