31 research outputs found
Priority-grouping method for parallel multi-scheduling in Grid
With the advent of multicore computers, the scheduling of Grid jobs can be made more effective if it is scaled to fully utilize the underlying hardware and parallelized to exploit multiple cores. The fact that sequential algorithms neither scale with multicore systems nor benefit from parallelism remains a major obstacle to scheduling in the Grid. As multicore systems become ever more pervasive, over-reliance on such systems for passive parallelism is not the best way to harness their multiprocessors for Grid scheduling; an explicit means of exploiting parallelism is required. The Group-based Parallel Multi-scheduler, introduced in this paper, aims to exploit multicore systems for Grid scheduling by splitting jobs and machines into paired groups and scheduling the jobs in those groups independently and in parallel. We implemented two job grouping methods, Execution Time Balanced (ETB) and Execution Time Sorted then Balanced (ETSB), and two machine grouping methods, Evenly Distributed (EvenDist) and Similar Together (SimTog). For each method, we varied the number of groups between 2, 4 and 8. We then executed the MinMin Grid scheduling algorithm independently within the groups. We demonstrated that by dividing jobs and machines into groups before scheduling, the computation time of the scheduling process improved drastically, on the order of 85% over the ordinary MinMin algorithm, when implemented on an HPC system. We also found that our balanced group-based approach achieved better results than our previous priority-based grouping approach.
Group-based parallel multi-scheduler for grid computing
With the advent of multicore computers, the scheduling of Grid jobs can be made more effective if it is scaled to fully utilize the underlying hardware and parallelized to exploit multiple cores. The fact that sequential algorithms neither scale with multicore systems nor benefit from parallelism remains a major obstacle to scheduling in the Grid. As multicore systems become ever more pervasive, over-reliance on such systems for passive parallelism is not the best way to harness their multiprocessors for Grid scheduling; an explicit means of exploiting parallelism is required. The Group-based Parallel Multi-scheduler, introduced in this paper, aims to exploit multicore systems for Grid scheduling by splitting jobs and machines into paired groups and scheduling the jobs in those groups independently and in parallel. We implemented two job grouping methods, Execution Time Balanced (ETB) and Execution Time Sorted then Balanced (ETSB), and two machine grouping methods, Evenly Distributed (EvenDist) and Similar Together (SimTog). For each method, we varied the number of groups between 2, 4 and 8. We then executed the MinMin Grid scheduling algorithm independently within the groups. We demonstrated that by dividing jobs and machines into groups before scheduling, the computation time of the scheduling process improved drastically, on the order of 85% over the ordinary MinMin algorithm, when implemented on an HPC system. We also found that our balanced group-based approach achieved better results than our previous priority-based grouping approach.
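The idea of running MinMin independently inside paired job and machine groups can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define ETB, ETSB, EvenDist or SimTog in detail, so the hypothetical `round_robin_groups` helper below is only a stand-in for both the job and the machine grouping, and `etc[j][m]` is an assumed table giving the estimated time to compute job `j` on machine `m`.

```python
from concurrent.futures import ThreadPoolExecutor


def min_min(etc, jobs, machines):
    """Plain MinMin restricted to the given job and machine ids.

    Returns (assignment, makespan), where assignment maps job -> machine.
    """
    ready = {m: 0.0 for m in machines}  # time at which each machine becomes free
    unscheduled = set(jobs)
    assignment = {}
    while unscheduled:
        # Smallest completion time over all (job, machine) pairs; this is the
        # job whose best machine finishes earliest, which is the MinMin choice.
        finish, job, machine = min(
            (ready[m] + etc[j][m], j, m) for j in unscheduled for m in machines
        )
        assignment[job] = machine
        ready[machine] = finish
        unscheduled.remove(job)
    return assignment, max(ready.values(), default=0.0)


def round_robin_groups(items, k):
    """Hypothetical grouping: deal items into k groups in turn (not ETB/ETSB)."""
    return [items[i::k] for i in range(k)]


def grouped_min_min(etc, jobs, machines, k):
    """Pair job group i with machine group i and schedule each pair independently."""
    pairs = list(zip(round_robin_groups(jobs, k), round_robin_groups(machines, k)))
    with ThreadPoolExecutor(max_workers=k) as pool:
        results = list(pool.map(lambda p: min_min(etc, p[0], p[1]), pairs))
    assignment = {}
    for group_assignment, _ in results:
        assignment.update(group_assignment)
    # The overall makespan is set by the slowest group.
    return assignment, max(makespan for _, makespan in results)
```

Calling `grouped_min_min(etc, jobs, machines, k)` requires `k` to be no larger than the number of machines, so that every group has at least one machine. The sketch uses threads only to show the structure. In CPython, CPU-bound work like this would need processes (for example `ProcessPoolExecutor` with a module-level worker function) to see a real speedup. Grouping shrinks the search each MinMin run performs, which is where the reduction in scheduling time comes from, though the resulting makespan can be worse than that of ungrouped MinMin.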
Energy-aware scheduling in heterogeneous computing systems
In the last decade, grid computing systems have emerged as useful providers of the computing power required for solving complex problems. The classic formulation of the scheduling problem in heterogeneous computing systems is NP-hard, so approximation techniques are required for solving real-world scenarios of this problem. This thesis tackles the problem of scheduling tasks in a heterogeneous computing environment in reduced execution times, considering the schedule length (makespan) and the total energy consumption as the optimization objectives. An efficient multithreaded local search algorithm for solving the multi-objective scheduling problem in heterogeneous computing systems, named ME-MLS, is presented. The proposed method follows a fully multi-objective approach, applying a Pareto-based dominance search that is executed in parallel by several threads. The experimental analysis demonstrates that the new multithreaded algorithm outperforms a set of fast and accurate two-phase deterministic heuristics based on the traditional MinMin. ME-MLS achieves significant improvements in both the makespan and the energy consumption objectives in reduced execution times for a large set of testbed instances, while exhibiting very good scalability. ME-MLS was evaluated on instances comprising up to 2048 tasks and 64 machines. In order to scale the dimension of the problem instances even further and tackle large-sized instances, the Graphics Processing Unit (GPU) architecture is considered. This line of future work has been initially explored with gPALS, a hybrid CPU/GPU local search algorithm for efficiently tackling a single-objective heterogeneous computing scheduling problem. gPALS shows very promising results, being able to tackle instances of up to 32768 tasks and 1024 machines in reasonable
execution times.
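The Pareto-based dominance test at the core of a two-objective search over makespan and energy can be sketched as follows. This is a minimal illustration, not the ME-MLS algorithm itself: the function names are hypothetical, each solution is assumed to be summarized as a `(makespan, energy)` tuple, and both objectives are minimized.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(solutions):
    """Keep the solutions that no other solution dominates (input order preserved)."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]
```

A local search built on this test would accept a neighbouring schedule only if the current non-dominated set does not dominate it. The quadratic filter above is adequate for small sets. Larger archives would need a more careful data structure.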
Memory-aware list scheduling for hybrid platforms
This report provides memory-aware heuristics to schedule task graphs onto heterogeneous resources, such as a dual-memory cluster equipped with multicores and a dedicated accelerator (FPGA or GPU). Each task has a different processing time on each resource. The optimization objective is to schedule the graph so as to minimize execution time, given the available memory for each resource type. In addition to ordering the tasks, we must also decide on which resource to execute them, given their computation requirements and the memory currently available on each resource. The major contributions of this report are twofold: (i) the derivation of an intricate integer linear program formulation for this scheduling problem; and (ii) the design of memory-aware heuristics, which outperform the reference heuristics HEFT and MinMin on a wide variety of problem instances. The absolute performance of these heuristics is assessed for small-size graphs, with up to 30 tasks, thanks to the linear program.
Scheduling Irregular Workloads on GPUs
This doctoral research aims at understanding the nature of the overhead for data-irregular GPU workloads, proposing a solution, and examining the consequences of the result. We propose a novel, retry-free GPU workload scheduler for irregular workloads. When used in a Breadth-First Search (BFS) algorithm, the proposed simple, monolithic concurrent queue scales to within 10% of ideal scalability on AMD’s Fiji GPU with 14,336 active threads. The dissertation presents an important finding: the retry overhead associated with Compare and Swap (CAS) operations is the principal reason why concurrent queues do not scale well as the number of clients increases in a massively multi-threaded environment.
Ansor: Generating High-Performance Tensor Programs for Deep Learning
High-performance tensor programs are crucial to guarantee efficient execution
of deep neural networks. However, obtaining performant tensor programs for
different operators on various hardware platforms is notoriously challenging.
Currently, deep learning systems rely on vendor-provided kernel libraries or
various search strategies to get performant tensor programs. These approaches
either require significant engineering effort to develop platform-specific
optimization code or fall short of finding high-performance programs due to
restricted search space and ineffective exploration strategy.
We present Ansor, a tensor program generation framework for deep learning
applications. Compared with existing search strategies, Ansor explores many
more optimization combinations by sampling programs from a hierarchical
representation of the search space. Ansor then fine-tunes the sampled programs
with evolutionary search and a learned cost model to identify the best
programs. Ansor can find high-performance programs that are outside the search
space of existing state-of-the-art approaches. In addition, Ansor utilizes a
task scheduler to simultaneously optimize multiple subgraphs in deep neural
networks. We show that Ansor improves the execution performance of deep neural
networks relative to the state-of-the-art on the Intel CPU, ARM CPU, and NVIDIA
GPU by up to , , and , respectively.
Comment: Published in OSDI 202
DLAS: An Exploration and Assessment of the Deep Learning Acceleration Stack
Deep Neural Networks (DNNs) are extremely computationally demanding, which
presents a large barrier to their deployment on resource-constrained devices.
Since such devices are where many emerging deep learning applications lie
(e.g., drones, vision-based medical technology), significant bodies of work
from both the machine learning and systems communities have attempted to
provide optimizations to accelerate DNNs. To help unify these two perspectives,
in this paper we combine machine learning and systems techniques within the
Deep Learning Acceleration Stack (DLAS), and demonstrate how these layers can
be tightly dependent on each other with an across-stack perturbation study. We
evaluate the impact on accuracy and inference time when varying different
parameters of DLAS across two datasets, seven popular DNN architectures, four
DNN compression techniques, three algorithmic primitives with sparse and dense
variants, untuned and auto-scheduled code generation, and four hardware
platforms. Our evaluation highlights how perturbations across DLAS parameters
can cause significant variation and across-stack interactions. The
highest-level observation from our evaluation is that model size, accuracy, and
inference time are not guaranteed to be correlated. Overall we make 13 key
observations, including that speedups provided by compression techniques are
very hardware dependent, and that compiler auto-tuning can significantly alter
what the best algorithm to use for a given configuration is. With DLAS, we aim
to provide a reference framework to aid machine learning and systems
practitioners in reasoning about the context in which their respective DNN
acceleration solutions exist. With our evaluation strongly motivating the
need for co-design, we believe that DLAS can be a valuable concept for
exploring the next generation of co-designed accelerated deep learning
solutions.
Saturn: An Optimized Data System for Large Model Deep Learning Workloads
Large language models such as GPT-3 & ChatGPT have transformed deep learning
(DL), powering applications that have captured the public's imagination. These
models are rapidly being adopted across domains for analytics on various
modalities, often by finetuning pre-trained base models. Such models need
multiple GPUs due to both their size and computational load, driving the
development of a bevy of "model parallelism" techniques & tools. Navigating
such parallelism choices, however, is a new burden for end users of DL such as
data scientists and domain scientists, who may lack the necessary systems
know-how. The need for model selection, which leads to many models to train due
to hyper-parameter tuning or layer-wise finetuning, compounds the situation
with two more burdens: resource apportioning and scheduling. In this work, we
tackle these three burdens for DL users in a unified manner by formalizing them
as a joint problem that we call SPASE: Select a Parallelism, Allocate
resources, and SchedulE. We propose a new information system architecture to
tackle the SPASE problem holistically, representing a key step toward enabling
wider adoption of large DL models. We devise an extensible template for
existing parallelism schemes and combine it with an automated empirical
profiler for runtime estimation. We then formulate SPASE as an MILP.
We find that direct use of an MILP-solver is significantly more effective
than several baseline heuristics. We optimize the system runtime further with
an introspective scheduling approach. We implement all these techniques into a
new data system we call Saturn. Experiments with benchmark DL workloads show
that Saturn achieves 39-49% lower model selection runtimes than typical current
DL practice.
Comment: Under submission at VLDB. Code available:
https://github.com/knagrecha/saturn. 12 pages + 3 pages references + 2 pages
appendix