182 research outputs found
Programming parallel dense matrix factorizations and inversion for new-generation NUMA architectures
We propose a methodology to address the programmability issues derived from the emergence of new-generation shared-memory NUMA architectures. For this purpose, we employ dense matrix factorizations and matrix inversion (DMFI) as a use case, and we target two modern architectures (AMD Rome and Huawei Kunpeng 920) that exhibit configurable NUMA topologies. Our methodology pursues performance portability across different NUMA configurations by proposing multi-domain implementations for DMFI plus a hybrid task- and loop-level parallelization that configures multi-threaded executions to fix core-to-data binding, exploiting locality at the expense of minor code modifications. In addition, we introduce a generalization of the multi-domain implementations for DMFI that offers support for virtually any NUMA topology in present and future architectures. Our experimentation on the two target architectures for three representative dense linear algebra operations validates the proposal, reveals insights on the necessity of adapting both the codes and their execution to improve data access locality, and reports performance across architectures and inter- and intra-socket NUMA configurations competitive with state-of-the-art message-passing implementations, maintaining the ease of development usually associated with shared-memory programming.This research was sponsored by project PID2019-107255GB of Ministerio de Ciencia, Innovación y Universidades; project S2018/TCS-4423 of Comunidad de Madrid; project 2017-SGR-1414 of the Generalitat de Catalunya and the Madrid Government under the Multiannual Agreement with UCM in the line Program to Stimulate Research for Young Doctors in the context of the V PRICIT, project PR65/19-22445. This project has also received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 955558. The JU receives support from the European Union’s Horizon 2020 research and innovation programme, and Spain, Germany, France, Italy, Poland, Switzerland, Norway. The work is also supported by grants PID2020-113656RB-C22 and PID2021-126576NB-I00 of MCIN/AEI/10.13039/501100011033 and by ERDF A way of making Europe.Peer ReviewedPostprint (published version
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016) Timisoara, Romania. February 8-11, 2016.The PhD Symposium was a very good opportunity for the young researchers to share information and knowledge, to
present their current research, and to discuss topics with other students in order to look for synergies and common research
topics. The idea was very successful and the assessment made by the PhD Student was very good. It also helped to
achieve one of the major goals of the NESUS Action: to establish an open European research network targeting sustainable
solutions for ultrascale computing aiming at cross fertilization among HPC, large scale distributed systems, and big
data management, training, contributing to glue disparate researchers working across different areas and provide a meeting
ground for researchers in these separate areas to exchange ideas, to identify synergies, and to pursue common activities in
research topics such as sustainable software solutions (applications and system software stack), data management, energy
efficiency, and resilience.European Cooperation in Science and Technology. COS
Improving the User Experience of the rCUDA Remote GPU Virtualization Framework
Graphics processing units (GPUs) are being increasingly embraced by the high-performance computing
community as an effective way to reduce execution time by accelerating parts of their applications. remote
CUDA (rCUDA) was recently introduced as a software solution to address the high acquisition costs and
energy consumption of GPUs that constrain further adoption of this technology. Specifically, rCUDA is a
middleware that allows a reduced number of GPUs to be transparently shared among the nodes in a cluster.
Although the initial prototype versions of rCUDA demonstrated its functionality, they also revealed concerns
with respect to usability, performance, and support for new CUDA features. In response, in this paper,
we present a new rCUDA version that (1) improves usability by including a new component that allows
an automatic transformation of any CUDA source code so that it conforms to the needs of the rCUDA
framework, (2) consistently features low overhead when using remote GPUs thanks to an improved new
communication architecture, and (3) supports multithreaded applications and CUDA libraries. As a result,
for any CUDA-compatible program, rCUDA now allows the use of remote GPUs within a cluster with low
overhead, so that a single application running in one node can use all GPUs available across the cluster,
thereby extending the single-node capability of CUDA. Copyright © 2014 John Wiley & Sons, Ltd.This work was funded by the Generalitat Valenciana under Grant PROMETEOII/2013/009 of the PROMETEO program phase II. The author from Argonne National Laboratory was supported by the US Department of Energy, Office of Science, under Contract No. DE-AC02-06CH11357. The authors are also grateful for the generous support provided by Mellanox Technologies.Reaño González, C.; Silla Jiménez, F.; Castello Gimeno, A.; Peña Monferrer, AJ.; Mayo Gual, R.; Quintana OrtÃ, ES.; Duato MarÃn, JF. (2015). Improving the User Experience of the rCUDA Remote GPU Virtualization Framework. Concurrency and Computation: Practice and Experience. 27(14):3746-3770. https://doi.org/10.1002/cpe.3409S374637702714NVIDIA NVIDIA industry cases http://www.nvidia.es/object/tesla-case-studiesFigueiredo, R., Dinda, P. A., & Fortes, J. (2005). Guest Editors’ Introduction: Resource Virtualization Renaissance. Computer, 38(5), 28-31. doi:10.1109/mc.2005.159Duato J Igual FD Mayo R Peña AJ Quintana-Ortà ES Silla F An efficient implementation of GPU virtualization in high performance clusters Euro-Par 2009 Workshops, ser. LNCS, 6043 Delft, Netherlands, 385 394Duato J Peña AJ Silla F Mayo R Quintana-Ortà ES Performance of CUDA virtualized remote GPUs in high performance clusters International Conference on Parallel Processing, Taipei, Taiwan 2011 365 374Duato J Peña AJ Silla F Fernández JC Mayo R Quintana-Ortà ES Enabling CUDA acceleration within virtual machines using rCUDA International Conference on High Performance Computing, Bangalore, India 2011 1 10Shi, L., Chen, H., Sun, J., & Li, K. (2012). vCUDA: GPU-Accelerated High-Performance Computing in Virtual Machines. IEEE Transactions on Computers, 61(6), 804-816. doi:10.1109/tc.2011.112Gupta V Gavrilovska A Schwan K Kharche H Tolia N Talwar V Ranganathan P GViM: GPU-accelerated virtual machines 3rd Workshop on System-Level Virtualization for High Performance Computing, Nuremberg, Germany 2009 17 24Giunta G Montella R Agrillo G Coviello G A GPGPU transparent virtualization component for high performance computing clouds Euro-Par 2010 - Parallel Processing, 6271 Ischia, Italy, 379 391Zillians VGPU http://www.zillians.com/vgpuLiang TY Chang YW GridCuda: a grid-enabled CUDA programming toolkit Proceedings of the 25th IEEE International Conference on Advanced Information Networking and Applications Workshops (WAINA), Biopolis, Singapore 2011 141 146Barak A Ben-Nun T Levy E Shiloh A Apackage for OpenCL based heterogeneous computing on clusters with many GPU devices Workshop on Parallel Programming and Applications on Accelerator Clusters, Heraklion, Crete, Greece 2010 1 7Xiao S Balaji P Zhu Q Thakur R Coghlan S Lin H Wen G Hong J Feng W-C VOCL: an optimized environment for transparent virtualization of graphics processing units Proceedings of InPar, San Jose, California, USA 2012 1 12Kim J Seo S Lee J Nah J Jo G Lee J SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters Proceedings of the 26th International Conference on Supercomputing, Venice, Italy 2012 341 352NVIDIA The NVIDIA CUDA Compiler Driver NVCC Version 5, NVIDIA 2012Quinlan D Panas T Liao C ROSE http://rosecompiler.org/Free Software Foundation, Inc. GCC, the GNU Compiler Collection http://gcc.gnu.org/LLVM Clang: a C language family frontend for LLVM http://clang.llvm.org/Martinez G Feng W Gardner M CU2CL: a CUDA-to-OpenCL Translator for Multi- and Many-core Architectures http://eprints.cs.vt.edu/archive/00001161/01/CU2CL.pdfLLVM The LLVM compiler infrastructure http://llvm.org/Reaño C Peña AJ Silla F Duato J Mayo R Quintana-Orti ES CU2rCU: towards the complete rCUDA remote GPU virtualization and sharing solution Proceedings of the 19th International Conference on High Performance Computing (HiPC), Pune, India 2012 1 10NVIDIA The NVIDIA GPU Computing SDK Version 4, NVIDIA 2011Sandia National Labs LAMMPS molecular dynamics simulator http://lammps.sandia.gov/Citrix Systems, Inc. Xen http://xen.org/Peña AJ Virtualization of accelerators in high performance clusters Ph.D. Thesis, 2013NVIDIA CUDA profiler user's guide version 5, NVIDIA 2012Igual, F. D., Chan, E., Quintana-OrtÃ, E. S., Quintana-OrtÃ, G., van de Geijn, R. A., & Van Zee, F. G. (2012). The FLAME approach: From dense linear algebra algorithms to high-performance multi-accelerator implementations. Journal of Parallel and Distributed Computing, 72(9), 1134-1143. doi:10.1016/j.jpdc.2011.10.014Slurm workload manager http://slurm.schedmd.co
Recommended from our members
Galois : a system for parallel execution of irregular algorithms
textA programming model which allows users to program with high productivity and which produces high performance executions has been a goal for decades. This dissertation makes progress towards this elusive goal by describing the design and implementation of the Galois system, a parallel programming model for shared-memory, multicore machines. Central to the design is the idea that scheduling of a program can be decoupled from the core computational operator and data structures. However, efficient programs often require application-specific scheduling to achieve best performance. To bridge this gap, an extensible and abstract scheduling policy language is proposed, which allows programmers to focus on selecting high-level scheduling policies while delegating the tedious task of implementing the policy to a scheduler synthesizer and runtime system. Implementations of deterministic and prioritized scheduling also are described. An evaluation of a well-studied benchmark suite reveals that factoring programs into operators, schedulers and data structures can produce significant performance improvements over unfactored approaches. Comparison of the Galois system with existing programming models for graph analytics shows significant performance improvements, often orders of magnitude more, due to (1) better support for the restrictive programming models of existing systems and (2) better support for more sophisticated algorithms and scheduling, which cannot be expressed in other systems.Computer Science
Towards resource-aware computing for task-based runtimes and parallel architectures
Current large scale systems show increasing power demands, to the point that it has become a huge strain on facilities and budgets. The increasing restrictions in terms of power consumption of High Performance Computing (HPC) systems and data centers have forced hardware vendors to include power capping capabilities in their commodity processors. Power capping opens up new opportunities for applications to directly manage their power behavior at user level. However, constraining power consumption causes the individual sockets of a parallel system to deliver different performance levels under the same power cap, even when they are equally designed, which is an effect caused by manufacturing variability. Modern chips suffer from heterogeneous power consumption due to manufacturing issues, a problem known as manufacturing or process variability. As a result, systems that do not consider such variability caused by manufacturing issues lead to performance degradations and wasted power. In order to avoid such negative impact, users and system administrators must actively counteract any manufacturing variability.
In this thesis we show that parallel systems benefit from taking into account the consequences of manufacturing variability, in terms of both performance and energy efficiency. In order to evaluate our work we have also implemented our own task-based version of the PARSEC benchmark suite. This allows to test our methodology using state-of-the-art parallelization techniques and real world workloads. We present two approaches to mitigate manufacturing variability, by power redistribution at runtime level and by power- and variability-aware job scheduling at system-wide level. A parallel runtime system can be used to effectively deal with this new kind of performance heterogeneity by compensating the uneven effects of power capping. In the context of a NUMA node composed of several multi core sockets, our system is able to optimize the energy and concurrency levels assigned to each socket to maximize performance. Applied transparently within the parallel runtime system, it does not require any programmer interaction like changing the application source code or manually reconfiguring the parallel system. We compare our novel runtime analysis with an offline approach and demonstrate that it can achieve equal performance at a fraction of the cost. The next approach presented in this theis, we show that it is possible to predict the impact of this variability on specific applications by using variability-aware power prediction models. Based on these power models, we propose two job scheduling policies that consider the effects of manufacturing variability for each application and that ensures that power
consumption stays under a system wide power budget. We evaluate our policies under different power budgets and traffic scenarios, consisting of both single- and multi-node parallel applications.Los sistemas modernos de gran escala muestran crecientes demandas de energÃa, hasta el punto de que se ha convertido en una gran presión para las instalaciones y los presupuestos. Las restricciones crecientes de consumo de energÃa de los sistemas de alto rendimiento (HPC) y los centros de datos han obligado a los proveedores de hardware a incluir capacidades de limitación de energÃa en sus procesadores. La limitación de energÃa abre nuevas oportunidades para que las aplicaciones administren directamente su comportamiento de energÃa a nivel de usuario. Sin embargo, la restricción en el consumo de energÃa de sockets individuales de un sistema paralelo resulta en diferentes niveles de rendimiento, por el mismo lÃmite de potencia, incluso cuando están diseñados por igual. Esto es un efecto causado durante el proceso de la fabricación. Los chips modernos sufren de un consumo de energÃa heterogéneo debido a problemas de fabricación, un problema conocido como variabilidad del proceso o fabricación. Como resultado, los sistemas que no consideran este tipo de variabilidad causada por problemas de fabricación conducen a degradaciones del rendimiento y desperdicio de energÃa. Para evitar dicho impacto negativo, los usuarios y administradores del sistema deben contrarrestar activamente cualquier variabilidad de fabricación. En esta tesis, demostramos que los sistemas paralelos se benefician de tener en cuenta las consecuencias de la variabilidad de la fabricación, tanto en términos de rendimiento como de eficiencia energética. Para evaluar nuestro trabajo, también hemos implementado nuestra propia versión del paquete de aplicaciones de prueba PARSEC, basada en tareas paralelos. Esto permite probar nuestra metodologÃa utilizando técnicas avanzadas de paralelización con cargas de trabajo del mundo real. Presentamos dos enfoques para mitigar la variabilidad de fabricación, mediante la redistribución de la energÃa a durante la ejecución de las aplicaciones y mediante la programación de trabajos a nivel de todo el sistema. Se puede utilizar un sistema runtime paralelo para tratar con eficacia este nuevo tipo de heterogeneidad de rendimiento, compensando los efectos desiguales de la limitación de potencia. En el contexto de un nodo NUMA compuesto de varios sockets y núcleos, nuestro sistema puede optimizar los niveles de energÃa y concurrencia asignados a cada socket para maximizar el rendimiento. Aplicado de manera transparente dentro del sistema runtime paralelo, no requiere ninguna interacción del programador como cambiar el código fuente de la aplicación o reconfigurar manualmente el sistema paralelo. Comparamos nuestro novedoso análisis de runtime con los resultados óptimos, obtenidos de una análisis manual exhaustiva, y demostramos que puede lograr el mismo rendimiento a una fracción del costo. El siguiente enfoque presentado en esta tesis, muestra que es posible predecir el impacto de la variabilidad de fabricación en aplicaciones especÃficas mediante el uso de modelos de predicción de potencia conscientes de la variabilidad. Basados ​​en estos modelos de predicción de energÃa, proponemos dos polÃticas de programación de trabajos que consideran los efectos de la variabilidad de fabricación para cada aplicación y que aseguran que el consumo se mantiene bajo un presupuesto de energÃa de todo el sistema. Evaluamos nuestras polÃticas con diferentes presupuestos de energÃa y escenarios de tráfico, que consisten en aplicaciones paralelas que corren en uno o varios nodos.Postprint (published version
Towards resource-aware computing for task-based runtimes and parallel architectures
Current large scale systems show increasing power demands, to the point that it has become a huge strain on facilities and budgets. The increasing restrictions in terms of power consumption of High Performance Computing (HPC) systems and data centers have forced hardware vendors to include power capping capabilities in their commodity processors. Power capping opens up new opportunities for applications to directly manage their power behavior at user level. However, constraining power consumption causes the individual sockets of a parallel system to deliver different performance levels under the same power cap, even when they are equally designed, which is an effect caused by manufacturing variability. Modern chips suffer from heterogeneous power consumption due to manufacturing issues, a problem known as manufacturing or process variability. As a result, systems that do not consider such variability caused by manufacturing issues lead to performance degradations and wasted power. In order to avoid such negative impact, users and system administrators must actively counteract any manufacturing variability.
In this thesis we show that parallel systems benefit from taking into account the consequences of manufacturing variability, in terms of both performance and energy efficiency. In order to evaluate our work we have also implemented our own task-based version of the PARSEC benchmark suite. This allows to test our methodology using state-of-the-art parallelization techniques and real world workloads. We present two approaches to mitigate manufacturing variability, by power redistribution at runtime level and by power- and variability-aware job scheduling at system-wide level. A parallel runtime system can be used to effectively deal with this new kind of performance heterogeneity by compensating the uneven effects of power capping. In the context of a NUMA node composed of several multi core sockets, our system is able to optimize the energy and concurrency levels assigned to each socket to maximize performance. Applied transparently within the parallel runtime system, it does not require any programmer interaction like changing the application source code or manually reconfiguring the parallel system. We compare our novel runtime analysis with an offline approach and demonstrate that it can achieve equal performance at a fraction of the cost. The next approach presented in this theis, we show that it is possible to predict the impact of this variability on specific applications by using variability-aware power prediction models. Based on these power models, we propose two job scheduling policies that consider the effects of manufacturing variability for each application and that ensures that power
consumption stays under a system wide power budget. We evaluate our policies under different power budgets and traffic scenarios, consisting of both single- and multi-node parallel applications.Los sistemas modernos de gran escala muestran crecientes demandas de energÃa, hasta el punto de que se ha convertido en una gran presión para las instalaciones y los presupuestos. Las restricciones crecientes de consumo de energÃa de los sistemas de alto rendimiento (HPC) y los centros de datos han obligado a los proveedores de hardware a incluir capacidades de limitación de energÃa en sus procesadores. La limitación de energÃa abre nuevas oportunidades para que las aplicaciones administren directamente su comportamiento de energÃa a nivel de usuario. Sin embargo, la restricción en el consumo de energÃa de sockets individuales de un sistema paralelo resulta en diferentes niveles de rendimiento, por el mismo lÃmite de potencia, incluso cuando están diseñados por igual. Esto es un efecto causado durante el proceso de la fabricación. Los chips modernos sufren de un consumo de energÃa heterogéneo debido a problemas de fabricación, un problema conocido como variabilidad del proceso o fabricación. Como resultado, los sistemas que no consideran este tipo de variabilidad causada por problemas de fabricación conducen a degradaciones del rendimiento y desperdicio de energÃa. Para evitar dicho impacto negativo, los usuarios y administradores del sistema deben contrarrestar activamente cualquier variabilidad de fabricación. En esta tesis, demostramos que los sistemas paralelos se benefician de tener en cuenta las consecuencias de la variabilidad de la fabricación, tanto en términos de rendimiento como de eficiencia energética. Para evaluar nuestro trabajo, también hemos implementado nuestra propia versión del paquete de aplicaciones de prueba PARSEC, basada en tareas paralelos. Esto permite probar nuestra metodologÃa utilizando técnicas avanzadas de paralelización con cargas de trabajo del mundo real. Presentamos dos enfoques para mitigar la variabilidad de fabricación, mediante la redistribución de la energÃa a durante la ejecución de las aplicaciones y mediante la programación de trabajos a nivel de todo el sistema. Se puede utilizar un sistema runtime paralelo para tratar con eficacia este nuevo tipo de heterogeneidad de rendimiento, compensando los efectos desiguales de la limitación de potencia. En el contexto de un nodo NUMA compuesto de varios sockets y núcleos, nuestro sistema puede optimizar los niveles de energÃa y concurrencia asignados a cada socket para maximizar el rendimiento. Aplicado de manera transparente dentro del sistema runtime paralelo, no requiere ninguna interacción del programador como cambiar el código fuente de la aplicación o reconfigurar manualmente el sistema paralelo. Comparamos nuestro novedoso análisis de runtime con los resultados óptimos, obtenidos de una análisis manual exhaustiva, y demostramos que puede lograr el mismo rendimiento a una fracción del costo. El siguiente enfoque presentado en esta tesis, muestra que es posible predecir el impacto de la variabilidad de fabricación en aplicaciones especÃficas mediante el uso de modelos de predicción de potencia conscientes de la variabilidad. Basados ​​en estos modelos de predicción de energÃa, proponemos dos polÃticas de programación de trabajos que consideran los efectos de la variabilidad de fabricación para cada aplicación y que aseguran que el consumo se mantiene bajo un presupuesto de energÃa de todo el sistema. Evaluamos nuestras polÃticas con diferentes presupuestos de energÃa y escenarios de tráfico, que consisten en aplicaciones paralelas que corren en uno o varios nodos
Exploiting asymmetric multi-core systems with flexible system software
Asymmetric multi-cores (AMCs) are a successful architectural solution for both mobile devices and supercomputers. These architectures combine different types of processing cores designed at different performance and power optimization points, thus exposing a performance-power trade-off. By maintaining two types of cores, AMCs are able to provide high performance under the facility power budget. However, there are significant challenges when using AMCs such as scheduling and load balancing.
This thesis initially explores the potential of AMCs when executing current HPC applications and searches for the most appropriate execution model. Specifically we evaluate several execution models on an Arm big.LITTLE AMC using the PARSEC benchmark suite that includes representative HPC applications. We compare schedulers at the user, OS and runtime system levels, using both static and dynamic options and multiple configurations, and assess the impact of these options on the well-known problem of balancing the load across AMCs. Our results demonstrate that scheduling is more effective when it takes place in the runtime system as it improves the user-level scheduling by 23%, while the heterogeneous-aware OS scheduling solution improves the user-level scheduling by 10%.
Following this outcome, this thesis focuses on increasing performance of AMC systems by improving scheduling in the runtime system level. Scheduling in the runtime system level is provided by the use of task-based parallel programming models. These programming models offer programming flexibility as they consist of an interface and a runtime system to manage the underlying resources and threads. In this thesis we improve scheduling with task-based programming models by providing three novel task schedulers for AMCs. These dynamic scheduling policies reduce total execution time either by detecting the longest or the critical path of the dynamic task dependency graph of the application. They use dynamic scheduling and information discoverable during execution, fact that makes them implementable and functional without the need of off-line profiling. In our evaluation we compare these scheduling approaches with an existing state-of the art heterogeneous scheduler and we track their improvement over a FIFO baseline scheduler. We show that the heterogeneous schedulers improve the baseline by up to 1.45x on a real 8-core AMC and up to 2.1x on a simulated 32-core AMC.
Another enhancement we provide in task-based programming models is the adaptability to fine grained parallelism. The increasing number of cores on modern CMPs is pushing research towards the use of fine grained workloads, which is an important challenge for task-based programming models. Our study makes the observation that task creation becomes a bottleneck when executing fine grained workloads with task-based programming models. As the number of cores increases, the time spent generating tasks is becoming more critical to the entire execution. To overcome this issue, we propose TaskGenX. TaskGenX minimizes task creation overheads and relies both on the runtime system and a dedicated hardware. On the runtime system side, TaskGenX decouples the task creation from the other runtime activities. It then transfers this part of the runtime to a specialized hardware. From our evaluation using 11 HPC workloads on both symmetric and AMC systems, we obtain performance improvements up to 15x, averaging to 3.1x over the baseline.
Finally, this thesis presents a showcase for a real-time CPU scheduler with the goal to increase the frames per second (FPS) of the game-play on mobile devices with AMC systems. We design and implement the RTS scheduler in the Android framework. RTS provides an efficient scheduling policy that takes into account the current temperature of the system to perform task migration. RTS solution increases the median FPS of the baseline mechanisms by up to 7.5% and at the same time it maintains temperature stable.Los procesadores multinúcleos asimétricos (AMC) son una solución arquitectónica exitosa para dispositivos móviles y supercomputadores. Estas arquitecturas combinan diferentes tipos de núcleos de procesamiento diseñados con diferentes propiedades de rendimiento y potencia. Al mantener dos o más tipos de núcleos, los AMCs pueden proporcionar un alto rendimiento con un consumo bajo de energÃa de las infraestructuras. Sin embargo, existen importantes desafÃos al usar los AMC, como la programación y el equilibrio de carga. Esta tesis explora inicialmente el potencial de los AMC al ejecutar aplicaciones actuales de Computacion de Alto Rendimiento (HPC) y busca el modelo de ejecución más apropiado para ellas. EspecÃficamente evaluamos varios modelos de ejecución en un procesador asimétrico Arm big.LITTLE utilizando las aplicaciones PARSEC que son aplicaciones representativas de HPC. En este trabajo se compara la programación en los niveles de usuario, sistema operativo y librerÃa y evaluamos el impacto de estas opciones en el conocido problema de equilibrar la carga entre los AMCs. Nuestros resultados demuestran que la programación es más efectiva cuando se lleva a cabo en el nivel del runtime, ya que mejora la programación del nivel de usuario en un 23%, mientras que la solución de programación del sistema operativo heterogéneo mejora la programación del nivel de usuario en un 10%. Siguiendo este resultado, esta tesis se centra en aumentar el rendimiento de los sistemas AMC mejorando la programación al nivel de librerÃa. La programación en este nivel se proporciona mediante el uso de Modelos de Programación Paralelos Basados en Tareas (MPBT). Estos modelos de programación ofrecen flexibilidad de programación, ya que consisten en una interfaz y un runtime para administrar los recursos e hilos subyacentes. En esta tesis, mejoramos la programación con MPBT al proporcionar tres nuevos planificadores de tareas para AMCs. Estos planificadores dinámicos reducen el tiempo total de ejecución ya sea detectando la camino más largo o el camino crÃtico del grafo de dependencia de tareas de la aplicación, que es generado dinámicamente. En nuestra evaluación, comparamos estos planificadores con un planificador heterogéneo existente y demonstramos su mejora sobre un planificador FIFO. Mostramos que los planificadores heterogéneos mejoran el planificador FIFO en hasta 1.45x en un AMC real de 8 núcleos y hasta 2.1x en un AMC simulado de 32 núcleos. Otra contribución en los MPBT es la adaptabilidad al paralelismo de grano fino. El creciente número de núcleos en los chip multinúcleos modernos está empujando la investigación hacia el uso de cargas de trabajo de grano fino, que es un desafÃo importante para los MPBT. Nuestro estudio observa que la creación de tareas bloquea la ejecución con cargas de trabajo de grano fino con MPBT. Cuando el número de núcleos aumenta, el tiempo empleado en generar tareas pasa a ser más crÃtico para toda la ejecución. Nuestra solución es TaskGenX, que minimiza los costes de creación de tareas y se basa en una extensión del runtime y en un hardware dedicado. En el runtime, TaskGenX desacopla la creación de tareas de las otras actividades del runtime, ejecutando esta actividad en un hardware especializado. Evaluamos 11 aplicaciones de HPC con TaskGenX en sistemas simétricos y AMC y obtenemos mejoras de rendimiento de hasta 15x, con un promedio de 3.1x sobre la implementación de referencia. Finalmente, esta tesis presenta un planificador de CPU con el objetivo de aumentar los fotogramas por segundo (FPS) para juegos en dispositivos móviles con sistemas AMC. Diseñamos e implementamos el planificador de Real-Time Scheduler (RTS) en Android. El RTS proporciona una polÃtica de programación eficiente que tiene en cuenta la temperatura actual del sistema para realizar la migración de tareas. La solución RTS aumenta la FPS mediana de los mecanismos de referenciaPostprint (published version
Performance-aware scheduling of parallel applications on non-dedicated clusters
This work presents a HPC framework that provides new strategies for resource management and job scheduling, based on executing different applications in shared compute nodes, maximizing platform utilization. The framework includes a scalable monitoring tool that is able to analyze the platform's compute node utilization. We also introduce an extension of CLARISSE, a middleware for data-staging coordination and control on large-scale HPC platforms that uses the information provided by the monitor in combination with application-level analysis to detect performance degradation in the running applications. This degradation, caused by the fact that the applications share the compute nodes and may compete for their resources, is avoided by means of dynamic application migration. A description of the architecture, as well as a practical evaluation of the proposal, shows significant performance improvements up to 20% in the makespan and 10% in energy consumption compared to a non-optimized execution.This work was partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under the grant TIN2016-79637-P "Towards Unification of HPC and Big Data Paradigms"; and the European Union's Horizon 2020 research and innovation program under Grant No. 801091, project "Exascale programming models for extreme data processing" (ASPIDE)
Exploiting asymmetric multi-core systems with flexible system software
Asymmetric multi-cores (AMCs) are a successful architectural solution for both mobile devices and supercomputers. These architectures combine different types of processing cores designed at different performance and power optimization points, thus exposing a performance-power trade-off. By maintaining two types of cores, AMCs are able to provide high performance under the facility power budget. However, there are significant challenges when using AMCs such as scheduling and load balancing.
This thesis initially explores the potential of AMCs when executing current HPC applications and searches for the most appropriate execution model. Specifically we evaluate several execution models on an Arm big.LITTLE AMC using the PARSEC benchmark suite that includes representative HPC applications. We compare schedulers at the user, OS and runtime system levels, using both static and dynamic options and multiple configurations, and assess the impact of these options on the well-known problem of balancing the load across AMCs. Our results demonstrate that scheduling is more effective when it takes place in the runtime system as it improves the user-level scheduling by 23%, while the heterogeneous-aware OS scheduling solution improves the user-level scheduling by 10%.
Following this outcome, this thesis focuses on increasing performance of AMC systems by improving scheduling in the runtime system level. Scheduling in the runtime system level is provided by the use of task-based parallel programming models. These programming models offer programming flexibility as they consist of an interface and a runtime system to manage the underlying resources and threads. In this thesis we improve scheduling with task-based programming models by providing three novel task schedulers for AMCs. These dynamic scheduling policies reduce total execution time either by detecting the longest or the critical path of the dynamic task dependency graph of the application. They use dynamic scheduling and information discoverable during execution, fact that makes them implementable and functional without the need of off-line profiling. In our evaluation we compare these scheduling approaches with an existing state-of the art heterogeneous scheduler and we track their improvement over a FIFO baseline scheduler. We show that the heterogeneous schedulers improve the baseline by up to 1.45x on a real 8-core AMC and up to 2.1x on a simulated 32-core AMC.
Another enhancement we provide in task-based programming models is the adaptability to fine grained parallelism. The increasing number of cores on modern CMPs is pushing research towards the use of fine grained workloads, which is an important challenge for task-based programming models. Our study makes the observation that task creation becomes a bottleneck when executing fine grained workloads with task-based programming models. As the number of cores increases, the time spent generating tasks is becoming more critical to the entire execution. To overcome this issue, we propose TaskGenX. TaskGenX minimizes task creation overheads and relies both on the runtime system and a dedicated hardware. On the runtime system side, TaskGenX decouples the task creation from the other runtime activities. It then transfers this part of the runtime to a specialized hardware. From our evaluation using 11 HPC workloads on both symmetric and AMC systems, we obtain performance improvements up to 15x, averaging to 3.1x over the baseline.
Finally, this thesis presents a showcase for a real-time CPU scheduler with the goal to increase the frames per second (FPS) of the game-play on mobile devices with AMC systems. We design and implement the RTS scheduler in the Android framework. RTS provides an efficient scheduling policy that takes into account the current temperature of the system to perform task migration. RTS solution increases the median FPS of the baseline mechanisms by up to 7.5% and at the same time it maintains temperature stable.Los procesadores multinúcleos asimétricos (AMC) son una solución arquitectónica exitosa para dispositivos móviles y supercomputadores. Estas arquitecturas combinan diferentes tipos de núcleos de procesamiento diseñados con diferentes propiedades de rendimiento y potencia. Al mantener dos o más tipos de núcleos, los AMCs pueden proporcionar un alto rendimiento con un consumo bajo de energÃa de las infraestructuras. Sin embargo, existen importantes desafÃos al usar los AMC, como la programación y el equilibrio de carga. Esta tesis explora inicialmente el potencial de los AMC al ejecutar aplicaciones actuales de Computacion de Alto Rendimiento (HPC) y busca el modelo de ejecución más apropiado para ellas. EspecÃficamente evaluamos varios modelos de ejecución en un procesador asimétrico Arm big.LITTLE utilizando las aplicaciones PARSEC que son aplicaciones representativas de HPC. En este trabajo se compara la programación en los niveles de usuario, sistema operativo y librerÃa y evaluamos el impacto de estas opciones en el conocido problema de equilibrar la carga entre los AMCs. Nuestros resultados demuestran que la programación es más efectiva cuando se lleva a cabo en el nivel del runtime, ya que mejora la programación del nivel de usuario en un 23%, mientras que la solución de programación del sistema operativo heterogéneo mejora la programación del nivel de usuario en un 10%. Siguiendo este resultado, esta tesis se centra en aumentar el rendimiento de los sistemas AMC mejorando la programación al nivel de librerÃa. La programación en este nivel se proporciona mediante el uso de Modelos de Programación Paralelos Basados en Tareas (MPBT). Estos modelos de programación ofrecen flexibilidad de programación, ya que consisten en una interfaz y un runtime para administrar los recursos e hilos subyacentes. En esta tesis, mejoramos la programación con MPBT al proporcionar tres nuevos planificadores de tareas para AMCs. Estos planificadores dinámicos reducen el tiempo total de ejecución ya sea detectando la camino más largo o el camino crÃtico del grafo de dependencia de tareas de la aplicación, que es generado dinámicamente. En nuestra evaluación, comparamos estos planificadores con un planificador heterogéneo existente y demonstramos su mejora sobre un planificador FIFO. Mostramos que los planificadores heterogéneos mejoran el planificador FIFO en hasta 1.45x en un AMC real de 8 núcleos y hasta 2.1x en un AMC simulado de 32 núcleos. Otra contribución en los MPBT es la adaptabilidad al paralelismo de grano fino. El creciente número de núcleos en los chip multinúcleos modernos está empujando la investigación hacia el uso de cargas de trabajo de grano fino, que es un desafÃo importante para los MPBT. Nuestro estudio observa que la creación de tareas bloquea la ejecución con cargas de trabajo de grano fino con MPBT. Cuando el número de núcleos aumenta, el tiempo empleado en generar tareas pasa a ser más crÃtico para toda la ejecución. Nuestra solución es TaskGenX, que minimiza los costes de creación de tareas y se basa en una extensión del runtime y en un hardware dedicado. En el runtime, TaskGenX desacopla la creación de tareas de las otras actividades del runtime, ejecutando esta actividad en un hardware especializado. Evaluamos 11 aplicaciones de HPC con TaskGenX en sistemas simétricos y AMC y obtenemos mejoras de rendimiento de hasta 15x, con un promedio de 3.1x sobre la implementación de referencia. Finalmente, esta tesis presenta un planificador de CPU con el objetivo de aumentar los fotogramas por segundo (FPS) para juegos en dispositivos móviles con sistemas AMC. Diseñamos e implementamos el planificador de Real-Time Scheduler (RTS) en Android. El RTS proporciona una polÃtica de programación eficiente que tiene en cuenta la temperatura actual del sistema para realizar la migración de tareas. La solución RTS aumenta la FPS mediana de los mecanismos de referenci
- …