    Coordinated Scheduling and Dynamic Performance Analysis in Multiprocessor Systems

    The performance of current shared-memory multiprocessor systems depends both on the efficient utilization of all the architectural elements in the system (processors, memory, etc.) and on the characteristics of the workload. The main goal of this Thesis is to improve the execution of workloads of parallel applications in shared-memory multiprocessor systems by using real performance information in processor scheduling.

It is common practice for users of multiprocessor systems to request a large number of processors, assuming that the more processors they request, the more they will be allocated and the higher the speedup their applications will achieve. However, this is usually not true. Parallel applications differ in their scalability, and their speedup also depends on parameters known only at run time, such as the input data set or the influence of other concurrently running applications.

This Thesis proposes that the system should not base its decisions solely on user requests: it should dynamically measure the performance the applications are achieving and make, or adjust, its decisions using that information. The performance of parallel applications can be measured by the system dynamically and automatically, without introducing a significant penalty in application execution time. Using this information, the processor allocation can be decided, or adjusted, making it far more robust to incorrect processor requests from users than policies that ignore this kind of information. In addition to considering performance, we propose to impose a target efficiency on parallel applications. The target efficiency determines whether an application is achieving acceptable performance and is used to adjust its processor allocation; it is a system parameter that can be tuned dynamically as a function of the system state (number of running applications, queued applications, etc.).

We also propose to coordinate the different levels involved in processor scheduling: the user-level run-time library, the processor scheduler (in the operating system), and the queueing system. The idea is to establish an interface between levels to exchange information, and to have each level consider this information when making its own decisions.

The evaluation of this Thesis follows a practical approach. We have designed and implemented a complete execution environment for parallel applications that follow the OpenMP programming model, introducing our proposals in the three scheduling levels mentioned above. Results show that the proposed ideas significantly improve system performance. In cases where both the applications and the system parameters had been previously tuned, our proposals introduce a worst-case slowdown of about 5% compared with the best result obtained by the other evaluated policies. In other cases, however, they improve system performance by up to 400%, again compared with the best result obtained by the other evaluated policies.

The main conclusions of this Thesis are the following:
- The performance of parallel applications can be measured at run time. The requirements for applying the measurement mechanism proposed in this Thesis are malleable applications and a shared-memory multiprocessor execution environment.
- The performance of parallel applications must be considered when deciding the processor allocation. The system must use this information to self-adjust its decisions. Moreover, the system must impose a target efficiency to ensure the efficient use of processors.
- The different scheduling levels must be coordinated to avoid interference between them.
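To make the target-efficiency idea above concrete, here is a minimal sketch, assuming a malleable application whose speedup can be sampled at run time; the function names and the simple shrink-only loop are illustrative assumptions, not the Thesis implementation.

```python
# Hypothetical sketch of target-efficiency-driven processor allocation.
# measured_speedup(p) stands in for the run-time measurement performed
# on malleable OpenMP applications; here it is a stub.

def allocate(requested, measured_speedup, target_efficiency=0.7, total_cpus=64):
    """Return an allocation no larger than 'requested' whose estimated
    efficiency (speedup / processors) meets the target efficiency."""
    allocation = min(requested, total_cpus)
    while allocation > 1:
        efficiency = measured_speedup(allocation) / allocation
        if efficiency >= target_efficiency:
            break           # the application uses its processors efficiently
        allocation -= 1     # shrink: the user over-requested processors
    return allocation

# Toy example: speedup saturates around 8 processors, so a request for
# 32 processors is trimmed to where efficiency reaches the 0.7 target.
speedup = lambda p: 0.9 * min(p, 8)
print(allocate(32, speedup))   # -> 10  (7.2 / 10 = 0.72 >= 0.7)
```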

    Modular workload format: Extending SWF for modular systems

    This paper presents the Modular Workload Format (MWF), a proposal for extending the widely accepted Standard Workload Format (SWF) for job scheduling evaluation. David Talby and Dror Feitelson proposed the SWF in 1999 as a synthesized way to describe a data center workload. Its simplicity, representing each job by a single line in a text file while including enough detail to make job scheduling evaluation quite accurate, was part of its success. Drawing on the experience of the intervening years and considering new system and workload characteristics, we propose an extension to support multiple steps in a single job, heterogeneous jobs, and relevant inputs not covered by the SWF, such as energy/power references. The goal of this contribution is to adapt the SWF to current trends in architectures and workloads. Moreover, we propose a simple approach for converting any existing SWF trace file into an MWF trace file, so that existing traces can be reused. This work is partially supported by the European Union's Horizon 2020 programme under grant agreement No. 754304 (DEEP-EST Project) and by the Spanish grant PID2019-107255GB-C21.
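Since both formats build on the SWF's one-job-per-line layout, a small reader makes the starting point concrete. The sketch below parses plain SWF records using the 18 standard SWF fields; the MWF extension fields are defined in the paper itself and are deliberately not reproduced here.

```python
# Minimal SWF reader: each non-comment line is one job with 18
# whitespace-separated fields; header/comment lines start with ';'.
# Field names follow the published SWF definition.

SWF_FIELDS = [
    "job_number", "submit_time", "wait_time", "run_time",
    "allocated_processors", "average_cpu_time", "used_memory",
    "requested_processors", "requested_time", "requested_memory",
    "status", "user_id", "group_id", "executable_number",
    "queue_number", "partition_number",
    "preceding_job_number", "think_time",
]

def read_swf(path):
    """Yield one dict per job record in an SWF trace file."""
    with open(path) as trace:
        for line in trace:
            line = line.strip()
            if not line or line.startswith(";"):
                continue  # skip header comments and blank lines
            values = line.split()
            yield dict(zip(SWF_FIELDS, (float(v) for v in values)))
```

Converting such a record to MWF would then amount to emitting these fields plus the step, heterogeneity, and energy/power extensions the paper defines.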

    Evaluating SLURM simulator with real-machine SLURM and vice versa

    Having a precise and fast job scheduler model that resembles the behavior of real-machine job scheduling software is extremely important in the field of job scheduling. The idea behind the SLURM simulator is to preserve the original code of the core SLURM functions while allowing for all the advantages of a simulator. Since 2011, the SLURM simulator has passed through several iterations of improvements in different research centers. In this work, we present our latest improvements to the SLURM simulator and perform the first-ever validation of the simulator against the real machine. In particular, we improved the simulator's performance by about 2.6 times, made the simulator deterministic across repeated runs with the same set-up, and improved its accuracy: its deviation from the real machine is lowered from the previous 12% to at most 1.7%. Finally, we illustrate with several use cases the value of the simulator for job scheduling researchers, SLURM system administrators, and SLURM developers.

    Dynamic load balancing of MPI+OpenMP applications

    The hybrid MPI+OpenMP programming model is useful for solving the load-balancing problems of parallel applications independently of the architecture. Typical approaches to balancing parallel applications that use two levels of parallelism, or only MPI, consist of including complex code that dynamically detects which data domains are more computationally intensive and either manually redistributes the allocated processors or manually redistributes data. This approach has two drawbacks: it is time-consuming and it requires an expert in application analysis. In this paper we present an automatic and dynamic approach for load balancing MPI+OpenMP applications. The system calculates the percentage of load imbalance and decides a processor distribution for the MPI processes that eliminates the computational load imbalance. Results show that this method can effectively balance applications without analyzing or modifying them, and that in cases where the application was already well balanced it does not incur significant overhead for the dynamic instrumentation and analysis performed. This work has been supported by the Spanish Ministry of Education under grant CICYT TIC2001-0995-C02-01, the ESPRIT project POP (IST-2001-33071), and the IBM CAS Program. The research described in this work has been developed using the resources of the European Center for Parallelism of Barcelona (CEPBA).
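The imbalance calculation at the heart of such an approach fits in a few lines. The sketch below is illustrative (hypothetical function names, not the paper's code): it computes the load-imbalance percentage from per-process computation times and redistributes a fixed pool of processors in proportion to each MPI process's load.

```python
# Hypothetical sketch: measure load imbalance across MPI processes and
# redistribute OpenMP threads proportionally to each process's load.
# Assumes total_threads >= number of MPI processes.

def imbalance_percentage(compute_times):
    """0% when perfectly balanced; grows as the slowest process dominates."""
    mean = sum(compute_times) / len(compute_times)
    return (max(compute_times) / mean - 1.0) * 100.0

def redistribute_threads(compute_times, total_threads):
    """Give each MPI process a thread count proportional to its load."""
    total = sum(compute_times)
    shares = [max(1, round(total_threads * t / total)) for t in compute_times]
    # Correct rounding drift so the pool size is respected exactly.
    while sum(shares) > total_threads:
        shares[shares.index(max(shares))] -= 1
    while sum(shares) < total_threads:
        shares[shares.index(min(shares))] += 1
    return shares

# Example: 4 MPI processes on 16 CPUs; process 2 is twice as loaded.
print(imbalance_percentage([10, 10, 20, 10]))       # 60.0
print(redistribute_threads([10, 10, 20, 10], 16))   # [4, 3, 6, 3]
```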

    Dynamic load balancing for hybrid applications

    DLB relies on the use of hybrid programming models and exploits the malleability of the second level of parallelism to redistribute computing power across processes.

    BSLD threshold driven parallel job scheduling for energy efficient HPC centers

    Recently, power awareness in the high-performance computing (HPC) community has increased significantly. While CPU power reduction for HPC applications using Dynamic Voltage Frequency Scaling (DVFS) has been explored thoroughly, CPU power management for large-scale parallel systems at the system level has been left unexplored. In this paper we propose a power-aware parallel job scheduler, assuming DVFS-enabled clusters. Traditional parallel job schedulers determine when a job will run; power-aware ones should also assign the CPU frequency at which it will run. We have introduced two adjustable thresholds in order to enable fine-grain control of the energy-performance trade-off. Since our power reduction approach is policy independent, it can be added to any parallel job scheduling policy. Furthermore, we have analyzed HPC system dimensioning: running an application at a lower frequency on more processors can be more energy efficient than running it at the highest CPU frequency on fewer processors. This paper investigates whether having more DVFS-enabled processors for the same load can lead to better energy efficiency and performance. Five workload logs from systems in production use, with up to 9,216 processors, are simulated to evaluate the proposed algorithm and the dimensioning problem. Our approach decreases CPU energy by 7%-18% on average, depending on the allowed job performance penalty. Applying the same frequency scaling algorithm on a 20% larger system, the CPU energy needed to execute the same load can be decreased by almost 30% while achieving the same or better job performance.
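BSLD (bounded slowdown) is the job performance metric the thresholds are defined on. The sketch below shows the standard BSLD formula and one plausible way a threshold could drive frequency selection; the linear run-time scaling with frequency is a deliberate simplification for illustration, not the paper's model.

```python
# Illustrative sketch of BSLD-threshold-driven frequency selection.
# Assumes run time scales inversely with CPU frequency, a simplification
# (real scaling depends on how memory-bound the job is).

def bsld(wait_time, run_time, bound=10.0):
    """Bounded slowdown: like slowdown, but very short jobs are bounded."""
    return max(1.0, (wait_time + run_time) / max(run_time, bound))

def pick_frequency(wait_time, run_time_nominal, frequencies, threshold):
    """Pick the lowest frequency whose predicted BSLD stays under the
    threshold; fall back to the highest frequency otherwise."""
    f_max = max(frequencies)
    for f in sorted(frequencies):            # try lowest (greenest) first
        predicted_run = run_time_nominal * f_max / f
        if bsld(wait_time, predicted_run) <= threshold:
            return f
    return f_max

# A 1-hour job that waited 10 minutes can afford the lowest gear here.
print(pick_frequency(600, 3600, [1.6, 1.8, 2.0, 2.3], threshold=2.0))
```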

    BSLD threshold driven power management policy for HPC centers

    In this paper, we propose a power-aware parallel job scheduler, assuming DVFS-enabled clusters. A CPU frequency assignment algorithm is integrated into the well-established EASY backfilling job scheduling policy. Running a job at a lower frequency reduces power dissipation and, accordingly, energy consumption; however, lower frequencies introduce a performance penalty. Our frequency assignment algorithm has two adjustable parameters that enable fine-grain control of the energy-performance trade-off. Furthermore, we have analyzed HPC system dimensioning: this paper investigates whether having more DVFS-enabled processors for the same load can lead to better energy efficiency and performance. Five workload traces from systems in production use, with up to 9,216 processors, are simulated to evaluate the proposed algorithm and the dimensioning problem. Our approach decreases CPU energy by 7%-18% on average, depending on the allowed job performance penalty. Using power-aware job scheduling on a 20% larger system, the CPU energy needed to execute the same load can be decreased by almost 30% while achieving the same or better job performance.
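The energy-performance trade-off both of these policies manage can be illustrated with the usual first-order DVFS model, sketched below; the beta-based run-time scaling and the V²f power scaling are textbook approximations with made-up numbers, not measurements from the paper.

```python
# First-order DVFS model (textbook approximations, invented numbers).
# beta in [0, 1] captures CPU-boundedness: beta = 1 means run time scales
# fully with frequency; beta = 0 means frequency does not affect run time.

def scaled_time(t_nominal, f, f_nominal, beta=0.5):
    """Run time at frequency f for a job that is partly memory-bound."""
    return t_nominal * (beta * (f_nominal / f - 1.0) + 1.0)

def scaled_power(p_nominal, f, v, f_nominal, v_nominal):
    """Dynamic CPU power, roughly proportional to V^2 * f."""
    return p_nominal * (v / v_nominal) ** 2 * (f / f_nominal)

t0, p0, f0, v0 = 3600.0, 200.0, 2.3, 1.2    # nominal: 1 h job at 200 W
f1, v1 = 1.8, 1.0                           # a lower DVFS gear
t1 = scaled_time(t0, f1, f0)                # ~4100 s: about 14% slower
p1 = scaled_power(p0, f1, v1, f0, v0)       # ~109 W
print(t0 * p0, t1 * p1)                     # 720 kJ vs ~446 kJ of CPU energy
```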

    Explicit uncore frequency scaling for energy optimisation policies with EAR in Intel architectures

    EAR is an energy management framework which offers three main services: energy accounting, energy control, and energy optimisation. The latter is provided through the EAR runtime library (EARL), a dynamic, transparent, and lightweight runtime library for energy optimisation and control. It implements energy optimisation policies that select the optimal CPU frequency based on runtime application characteristics and policy settings. Given that EARL defines a policy API and a plugin mechanism, different policies can be easily evaluated. In this paper we propose and evaluate the use of explicit Uncore Frequency Scaling (explicit UFS) in Intel architectures to increase the energy-saving opportunities in cases where the hardware cannot select the optimal frequency for the Integrated Memory Controller (IMC). We extended the min_energy_to_solution policy to select both the CPU and IMC frequencies, and we executed and evaluated it with several kernels and six real applications. Results showed an average energy saving of 9% with an average time penalty of 3%. In some use cases, the impact of explicit UFS compared with HW UFS was up to 8% of extra energy savings. This work has been funded by the BSC-Lenovo collaboration agreement.
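The extension from one frequency knob to two can be sketched generically. The code below is not the EAR/EARL API; it is only an illustration of a minimum-energy search over (CPU, IMC) frequency pairs, assuming a predict(cpu, imc) model that returns estimated time and power is available.

```python
# Generic sketch of a two-knob minimum-energy search (not the EAR API).
# predict(cpu, imc) stands in for EARL-style models estimating run time
# and power for a candidate (CPU frequency, uncore frequency) pair.

from itertools import product

def min_energy_config(cpu_freqs, imc_freqs, predict, max_slowdown=1.10):
    """Pick the (cpu, imc) pair with the lowest predicted energy, capping
    the allowed time penalty versus the fastest configuration."""
    t_ref, _ = predict(max(cpu_freqs), max(imc_freqs))
    best, best_energy = None, float("inf")
    for cpu, imc in product(cpu_freqs, imc_freqs):
        t, p = predict(cpu, imc)
        if t <= max_slowdown * t_ref and t * p < best_energy:
            best, best_energy = (cpu, imc), t * p
    return best

# Toy model: lowering the IMC frequency barely hurts a compute-bound job,
# so the search keeps the CPU fast and drops the uncore frequency.
toy = lambda cpu, imc: (100.0 * (2.4 / cpu) * (1.0 + 0.02 * (2.4 - imc)),
                        150.0 * cpu / 2.4 + 20.0 * imc / 2.4)
print(min_energy_config([1.8, 2.1, 2.4], [1.2, 1.8, 2.4], toy))  # (2.4, 1.2)
```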

    Optimizing job performance under a given power constraint in HPC centers

    The never-ending striving for performance has resulted in a tremendous increase in the power consumption of HPC centers. Power budgeting has become very important for several reasons, such as reliability, operating costs, and the limited power draw supported by the existing infrastructure. In this paper we propose a power-budget-guided job scheduling policy that maximizes overall job performance for a given power budget. We show that by using DVFS under a power constraint, performance can be significantly improved, as DVFS allows more jobs to run simultaneously, leading to shorter wait times. The aggressiveness of the frequency scaling applied to a job depends on the instantaneous power consumption and on the job's predicted performance. Our policy has been evaluated on four workload traces from systems in production use with up to 4,008 processors. The results show that our policy achieves up to two times better performance compared to power budgeting without DVFS. Moreover, it leads to 23% lower CPU energy consumption on average. Furthermore, we have investigated how much job performance and energy efficiency can be improved under our policy and the same power budget by increasing the number of DVFS-enabled processors.
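A minimal sketch of the dispatch-time decision follows (illustrative names and a crude linear power model, not the paper's algorithm): when starting a job at nominal frequency would exceed the center's power budget, lower DVFS gears are tried before making the job wait.

```python
# Hypothetical sketch of power-budget-guided dispatch: fit a job under
# the remaining power headroom by lowering its CPU frequency.

def frequency_under_budget(job_power_nominal, current_draw, budget,
                           gears=(2.3, 2.0, 1.8, 1.6), f_nominal=2.3):
    """Return the highest gear that fits the remaining budget, or None
    if the job cannot start even at the lowest frequency."""
    headroom = budget - current_draw
    for f in gears:  # highest frequency first, to favor performance
        # Crude model: CPU power assumed to scale linearly with frequency.
        if job_power_nominal * f / f_nominal <= headroom:
            return f
    return None

# A 200 W job with only 150 W of headroom left starts at the lowest gear
# (139 W at 1.6 GHz) instead of waiting in the queue.
print(frequency_under_budget(200.0, current_draw=950.0, budget=1100.0))
```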