DESIGN AND EVALUATION OF RESOURCE ALLOCATION AND JOB SCHEDULING ALGORITHMS ON COMPUTATIONAL GRIDS
Grid, an infrastructure for resource sharing, has shown its importance in
many scientific applications requiring tremendously high computational power. Grid
computing enables the sharing, selection, and aggregation of resources for solving
complex and large-scale scientific problems. Grid computing, whose resources are
distributed, heterogeneous and dynamic in nature, introduces a number of fascinating
issues in resource management. Grid scheduling is the key issue in a grid environment:
the system must meet the functional requirements of heterogeneous domains,
such as users, applications, and networks, which are sometimes conflicting in nature.
Moreover, the system must satisfy non-functional requirements such as reliability,
efficiency, performance, effective resource utilization, and scalability. Thus, the overall
aim of this research is to introduce new grid scheduling algorithms, for resource
allocation as well as for job scheduling, that enable highly efficient and effective
utilization of the resources in executing various applications.
The four prime aspects of this work are: first, a model of the grid scheduling
problem for a dynamic grid computing environment; second, the development of a new
web-based simulator (SyedWSim), which enables grid users to conduct a statistical
analysis of grid workload traces and provides a realistic basis for experimentation in
resource allocation and job scheduling algorithms on a grid; third, the proposal of a new
grid resource allocation method of optimal computational cost, evaluated on synthetic and
real workload traces against other allocation methods; and finally, the proposal of
new job scheduling algorithms of optimal performance with respect to parameters
such as waiting time, turnaround time, response time, bounded slowdown, completion
time and stretch time. The issue is not only to develop new algorithms, but also to
evaluate them on an experimental computational grid, using synthetic and real
workload traces, alongside existing job scheduling algorithms.
Experimental evaluation confirmed that the proposed grid scheduling algorithms
possess a high degree of optimality in performance, efficiency and scalability.
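The job scheduling metrics named above have standard definitions. As a minimal illustration (hypothetical code, not the thesis's algorithms), the sketch below computes waiting time, turnaround time and bounded slowdown for first-come-first-served execution on a single resource:

```python
from dataclasses import dataclass

@dataclass
class Job:
    arrival: float   # submission time
    runtime: float   # execution time once started

def fcfs_metrics(jobs, tau=10.0):
    """Run jobs first-come-first-served on one resource and report
    per-job metrics.  tau is the threshold used by the bounded-slowdown
    metric so that very short jobs do not dominate the average."""
    clock = 0.0
    metrics = []
    for j in sorted(jobs, key=lambda j: j.arrival):
        start = max(clock, j.arrival)          # wait for the resource
        finish = start + j.runtime
        wait = start - j.arrival
        turnaround = finish - j.arrival
        bsld = max(turnaround / max(j.runtime, tau), 1.0)
        metrics.append({"wait": wait, "turnaround": turnaround, "bsld": bsld})
        clock = finish
    return metrics

# Three jobs contending for one resource:
m = fcfs_metrics([Job(0, 30), Job(5, 10), Job(6, 5)])
```

Swapping the `sorted` key for a different ordering policy is enough to compare, say, shortest-job-first against FCFS on the same trace.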
Reducing data movement on large shared memory systems by exploiting computation dependencies
Shared memory systems are becoming increasingly complex as they typically integrate several storage devices. That brings different access latencies or bandwidth rates depending on the proximity between the cores where memory accesses are issued and the storage devices containing the requested data. In this context, techniques to manage and mitigate non-uniform memory access (NUMA) effects consist in migrating threads, memory pages or both, and are generally applied by the system software. We propose techniques at the runtime system level to further mitigate the impact of NUMA effects on parallel applications' performance. We leverage runtime system metadata expressed in terms of a task dependency graph, where nodes are pieces of serial code and edges are control or data dependencies between them, to efficiently reduce data transfers. Our approach, based on graph partitioning, adds negligible overhead and is able to provide performance improvements up to 1.52× and average improvements of 1.12× with respect to the best state-of-the-art approach when deployed on a 288-core shared-memory system. Our approach reduces the coherence traffic by 2.28× on average with respect to the state-of-the-art.
This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and by the European Union's Horizon 2020 research and innovation programme (grant agreements 671697 and 779877). I. Sánchez Barrera has been partially supported by the Spanish Ministry of Education, Culture and Sport under Formación del Profesorado Universitario fellowship number FPU15/03612. M. Moretó has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramón y Cajal fellowship number RYC-2016-21104.
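The paper's partitioner operates on real runtime metadata; as a toy illustration of the underlying idea only — splitting a task dependency graph so that heavily-communicating tasks land in the same NUMA region — a hypothetical greedy bipartition might look like this (the graph, weights and heuristic are all invented for the example):

```python
from collections import defaultdict

def greedy_bipartition(edges, n_tasks):
    """Split a task dependency graph into two balanced parts, trying to
    keep heavily-communicating tasks together.  edges: (src, dst, bytes).
    A toy stand-in for the graph partitioners used by NUMA-aware
    task schedulers."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = adj[u].get(v, 0) + w
        adj[v][u] = adj[v].get(u, 0) + w
    part = {}
    cap = (n_tasks + 1) // 2            # balance constraint
    counts = [0, 0]
    for t in range(n_tasks):            # tasks in creation order
        gain = [0, 0]
        for nb, w in adj[t].items():    # communication with placed tasks
            if nb in part:
                gain[part[nb]] += w
        side = 0 if gain[0] >= gain[1] else 1
        if counts[side] >= cap:         # that side is full: spill over
            side = 1 - side
        part[t] = side
        counts[side] += 1
    cut = sum(w for u, v, w in edges if part[u] != part[v])
    return part, cut
```

On a chain of two tightly-coupled task groups joined by one light edge, the cut contains only that light edge, i.e. only the cheap transfer crosses NUMA regions.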
Workload Schedulers - Genesis, Algorithms and Comparisons
In this article we provide brief descriptions of three classes of schedulers: Operating Systems Process Schedulers, Cluster Systems Jobs Schedulers, and Big Data Schedulers. We describe their evolution from early adoptions to modern implementations, considering both the use and features of algorithms. In summary, we discuss differences between all presented classes of schedulers and their chronological development. In conclusion, we highlight similarities in the focus of scheduling strategy design, applicable to both local and distributed systems.
Hierarchical Beamforming: Resource Allocation, Fairness and Flow Level Performance
We consider hierarchical beamforming in wireless networks. For a given
population of flows, we propose computationally efficient algorithms for fair
rate allocation including proportional fairness and max-min fairness. We next
propose closed-form formulas for flow-level performance, for both elastic (with
either proportional fairness or max-min fairness) and streaming traffic. We
further assess the performance of hierarchical beamforming using numerical
experiments. Since the proposed solutions have low complexity compared to
conventional beamforming, our work suggests that hierarchical beamforming is a
promising candidate for the implementation of beamforming in future cellular
networks.
Comment: 34 pages.
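Max-min fairness can be computed by progressive filling: raise all rates together and freeze the flows crossing each link as it saturates. The sketch below is a generic illustration of that procedure (not the paper's computationally efficient algorithms), with hypothetical link and flow names:

```python
def max_min_fair(capacities, links_of_flow):
    """Progressive filling: raise all unfrozen flow rates at the same
    pace until some link saturates, then freeze the flows crossing it.
    capacities: {link: capacity}; links_of_flow: {flow: [links used]}."""
    rate = {f: 0.0 for f in links_of_flow}
    frozen = set()
    remaining = dict(capacities)
    while len(frozen) < len(rate):
        # flows still growing on each link
        active = {l: [f for f in links_of_flow
                      if l in links_of_flow[f] and f not in frozen]
                  for l in remaining}
        # largest equal increment before some link saturates
        inc = min(remaining[l] / len(fs) for l, fs in active.items() if fs)
        for l, fs in active.items():
            remaining[l] -= inc * len(fs)
        for f in rate:
            if f not in frozen:
                rate[f] += inc
        for l, fs in active.items():    # freeze flows on saturated links
            if fs and remaining[l] < 1e-12:
                frozen.update(fs)
    return rate

# f1 and f2 share bottleneck link 'a'; f3 gets the slack on 'b'.
rates = max_min_fair({'a': 1.0, 'b': 2.0},
                     {'f1': ['a'], 'f2': ['a', 'b'], 'f3': ['b']})
```

Here f1 and f2 each receive 0.5 (the fair share of the saturated link 'a'), while f3 absorbs the remaining 1.5 of link 'b' — the defining property of a max-min fair allocation.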
Recent Advances in Graph Partitioning
We survey recent trends in practical algorithms for balanced graph
partitioning together with applications and future research directions
The Inter-cloud meta-scheduling
Inter-cloud is a recently emerging approach that expands cloud elasticity. By facilitating an adaptable setting, it aims at the realization of scalable resource provisioning that enables a diversity of cloud user requirements to be handled efficiently. This study's contribution is in the inter-cloud performance optimization of job executions using meta-scheduling concepts. This includes the development of the inter-cloud meta-scheduling (ICMS) framework, the ICMS optimal schemes and the SimIC toolkit. The ICMS model is an architectural strategy for managing and scheduling user services in virtualized, dynamically inter-linked clouds. This is achieved by the development of a model that includes a set of algorithms, namely the Service-Request, Service-Distribution, Service-Availability and Service-Allocation algorithms. These, along with resource management optimal schemes, offer the novel functionalities of the ICMS: the message exchanging implements the job distribution method, the VM deployment offers the VM management features, and the local resource management system details the management of the local cloud schedulers. The generated system offers great flexibility by facilitating a lightweight resource management methodology while at the same time handling the heterogeneity of different clouds through advanced service level agreement coordination. Experimental results are promising, as the proposed ICMS model enhances the performance of service distribution for a variety of criteria such as service execution times, makespan, turnaround times, utilization levels and energy consumption rates for various inter-cloud entities, e.g. users, hosts and VMs. For example, ICMS optimizes the performance of a non-meta-brokering inter-cloud by 3%, while ICMS with full optimal schemes achieves 9% optimization for the same configurations.
The whole experimental platform is implemented in the inter-cloud simulation toolkit (SimIC), a discrete-event simulation framework developed by the author.
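The ICMS algorithms themselves are not reproduced here; as a hypothetical stand-in for the service-distribution idea — dispatching each request to the inter-linked cloud with the earliest estimated completion time — a minimal sketch could be (job runtimes, cloud names and speed factors are invented):

```python
import heapq

def meta_schedule(jobs, clouds):
    """Dispatch each job to the cloud expected to finish it earliest.
    jobs: list of nominal runtimes; clouds: {name: speed_factor}.
    Returns a placement map and the resulting makespan."""
    # min-heap of (time_when_free, cloud_name)
    free_at = [(0.0, name) for name in sorted(clouds)]
    heapq.heapify(free_at)
    placement, makespan = {}, 0.0
    for i, runtime in enumerate(jobs):
        t, name = heapq.heappop(free_at)       # least-loaded cloud
        finish = t + runtime / clouds[name]    # faster cloud => shorter run
        placement[i] = name
        makespan = max(makespan, finish)
        heapq.heappush(free_at, (finish, name))
    return placement, makespan

# Two clouds, the second twice as fast; three service requests.
placement, makespan = meta_schedule([4.0, 4.0, 2.0], {'c1': 1.0, 'c2': 2.0})
```

A real meta-broker would fold in the criteria listed above (energy rates, SLA terms, utilization) rather than completion time alone; the greedy earliest-finish rule is only the skeleton.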
Geometry-Oblivious FMM for Compressing Dense SPD Matrices
We present GOFMM (geometry-oblivious FMM), a novel method that creates a
hierarchical low-rank approximation, "compression," of an arbitrary dense
symmetric positive definite (SPD) matrix. For many applications, GOFMM enables
an approximate matrix-vector multiplication in O(N log N) or even O(N) time,
where N is the matrix size. Compression requires O(N log N) storage and work.
In general, our scheme belongs to the family of hierarchical matrix
approximation methods. In particular, it generalizes the fast multipole method
(FMM) to a purely algebraic setting by only requiring the ability to sample
matrix entries. Neither geometric information (i.e., point coordinates) nor
knowledge of how the matrix entries have been generated is required, thus the
term "geometry-oblivious." Also, we introduce a shared-memory parallel scheme
for hierarchical matrix computations that reduces synchronization barriers. We
present results on the Intel Knights Landing and Haswell architectures, and on
the NVIDIA Pascal architecture for a variety of matrices.
Comment: 13 pages, accepted by SC'17.
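GOFMM's hierarchical compression is far more involved, but its access model — building a low-rank approximation of an SPD matrix purely by sampling entries, with no geometric information — can be illustrated with a rank-1 Nyström sketch (a hypothetical helper, exact only for rank-1 SPD matrices):

```python
def rank1_nystrom(entry, n, p):
    """Rank-1 Nystrom approximation K ~= c c^T / K[p][p], where
    c = K[:, p] is the sampled pivot column.  Only 2n - 1 entries of K
    are ever sampled, mirroring the entry-access-only model: neither
    point coordinates nor the generating kernel are needed."""
    col = [entry(i, p) for i in range(n)]   # sample pivot column
    d = entry(p, p)                          # sample pivot diagonal
    return [[col[i] * col[j] / d for j in range(n)] for i in range(n)]

# For a rank-1 SPD matrix K[i][j] = v[i] * v[j], the sketch is exact.
v = [1.0, 2.0, 3.0]
approx = rank1_nystrom(lambda i, j: v[i] * v[j], 3, 0)
```

Higher-rank compressions sample several pivot columns and invert the pivot block; the hierarchical scheme applies such approximations recursively to off-diagonal blocks.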
Exploiting data locality in cache-coherent NUMA systems
The end of Dennard scaling has caused a stagnation of the clock frequency in computers. To overcome this issue, in the last two decades vendors have been integrating larger numbers of processing elements in their systems: interconnecting many nodes, including multiple chips in each node and increasing the number of cores in each chip. The speed of main memory has not evolved at the same rate as processors; it is much slower, and there is a need to provide more total bandwidth to the processors, especially with the increase in the number of cores and chips.
While still keeping a shared address space, where all processors can access the whole memory, solutions have come by integrating more memories: by using newer technologies like high-bandwidth memory (HBM) and non-volatile memory (NVM), by giving groups of cores (sockets, for example) faster access to some subset of the DRAM, or by combining several of these solutions. This has caused some heterogeneity in the access speed to main memory, depending on the CPU requesting access to a memory address and the actual physical location of that address, causing non-uniform memory access (NUMA) behaviours.
Moreover, many of these systems are cache-coherent (ccNUMA), meaning that changes made to memory from one CPU must be visible to the other CPUs, transparently to the programmer.
These NUMA behaviours reduce the performance of applications and can pose a challenge to the programmers. To tackle this issue, this thesis proposes solutions, at the software and hardware levels, to improve the data locality in NUMA systems and, therefore, the performance of applications in these computer systems.
The first contribution shows how considering hardware prefetching simultaneously with thread and data placement in NUMA systems can find configurations with better performance than considering these aspects separately. The performance results combined with performance counters are then used to build a performance model to predict, both offline and online, the best configuration for new applications not in the model. The evaluation is done using two different high performance NUMA systems, and the performance counters collected in one machine are used to predict the best configurations in the other machine.
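The thesis's performance model is not reproduced here; as a toy sketch of the idea — predicting the best configuration for a new application from the nearest hardware-counter profile seen during training — something like the following could serve (counter values, configuration labels and the nearest-neighbour rule are all hypothetical):

```python
import math

def predict_config(train, counters):
    """Nearest-neighbour stand-in for a counter-based performance model:
    return the best-known configuration of the training application whose
    normalized counter profile is closest (Euclidean) to the new one."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    profile, config = min(train, key=lambda row: dist(row[0], counters))
    return config

# (miss_ratio, ipc_fraction) -> (prefetch setting, page placement policy)
train = [
    ((0.9, 0.1), ("prefetch_on", "interleave")),    # memory-bound profile
    ((0.2, 0.8), ("prefetch_off", "first_touch")),  # compute-bound profile
]
predict_config(train, (0.8, 0.2))  # -> ("prefetch_on", "interleave")
```

The cross-machine use described above corresponds to training `train` with counters from one system and querying with counters measured on another; an offline model would precompute it, an online one would update it as applications run.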
The second contribution builds on the idea that prefetching can have a strong effect in NUMA systems and proposes a
NUMA-aware hardware prefetching scheme. This scheme is generic and can be applied to multiple hardware prefetchers with a low hardware cost but giving very good results. The evaluation is done using a cycle-accurate architectural simulator and provides detailed results of the performance, the data transfer reduction and the energy costs.
Finally, the third and last contribution consists of scheduling algorithms for task-based programming models. These programming models help improve the programmability of applications on parallel systems and also provide useful information to the underlying runtime system. This information is used to build a task dependency graph (TDG), a directed acyclic graph that models the application, where the nodes are sequential pieces of code known as tasks and the edges are the data dependencies between the different tasks. The proposed scheduling algorithms use graph partitioning techniques and provide a scheduling of the tasks in the TDG that minimises the data transfers between the different NUMA regions of the system. The results have been evaluated on real ccNUMA systems with multiple NUMA regions.