
    Experiences on the characterization of parallel applications in embedded systems with Extrae/Paraver

    Cutting-edge functionalities in embedded systems require parallel architectures to meet their performance requirements. This introduces a new layer into the software stacks of embedded systems: the parallel programming model. Unfortunately, the tools used to analyze embedded systems fall short of characterizing the performance of parallel applications at the parallel programming model level and of correlating it with information about non-functional requirements such as real-time behavior, energy, and memory usage. HPC tools such as Extrae are designed with that level of abstraction in mind, but their main focus is on performance evaluation. Overall, providing insightful information about the performance of parallel embedded applications at the parallel programming model level, and relating it to the non-functional requirements, is of paramount importance to fully exploit the performance capabilities of parallel embedded architectures. This paper contributes to the state of the art of analysis tools for embedded systems by: (1) analyzing the particular constraints of embedded systems compared to HPC systems (e.g., static setting, restricted memory, limited drivers) to support HPC analysis tools; (2) porting Extrae, a powerful tracing tool from the HPC domain, to the GR740 platform, a SoC used in the space domain; and (3) augmenting Extrae with new features needed to correlate the parallel execution with the following non-functional requirements: energy, temperature, and memory usage. Finally, the paper demonstrates the usefulness of Extrae for characterizing OpenMP applications and their non-functional requirements, evaluating different aspects of the applications running on the GR740. This work has been partially funded by the HP4S (High Performance Parallel Payload Processing for Space) project under the ESA-ESTEC ITI contract № 4000124124/18/NL/CRS. Peer reviewed. Postprint (author's final draft).

    Automating the application data placement in hybrid memory systems

    Multi-tiered memory systems, such as those based on Intel® Xeon Phi™ processors, are equipped with several memory tiers with different characteristics including, among others, capacity, access latency, bandwidth, energy consumption, and volatility. The proper distribution of the application data objects into the available memory layers is key to shortening the time-to-solution, but the way developers and end-users determine the most appropriate memory tier in which to place the application data objects has not been properly addressed to date. In this paper we present a novel methodology to build an extensible framework that automatically identifies the application's most relevant memory objects and places them into the Intel Xeon Phi fast on-package memory. Our proposal works on top of in-production binaries by first exploring the application behavior and then substituting the dynamic memory allocations. This makes the proposal valuable even for end-users who cannot modify the application source code. We demonstrate the value of a framework based on our methodology for several relevant HPC applications using different allocation strategies to help end-users improve performance with minimal intervention. The results of our evaluation reveal that our proposal is able to identify the key objects to be promoted into fast on-package memory in order to optimize performance, even surpassing hardware-based solutions. This work has been performed in the Intel-BSC Exascale Lab. Antonio J. Peña is cofinanced by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva fellowship number IJCI-2015-23266. We would like to thank Intel's DCG HEAT team for allowing us to access their computational resources. We also want to acknowledge this team, especially Larry Meadows and Jason Sewall, as well as Pardo Keppel, for the productive discussions. We thank Raphaël Léger for allowing us to access the MAXW-DGTD application and its input. Peer reviewed. Postprint (author's final draft).

    Performance analysis of parallel Python applications

    Python is progressively consolidating itself within the HPC community thanks to its simple syntax, large standard library, and powerful third-party libraries for scientific computing, which are especially attractive to domain scientists. Although Python lowers the bar for accessing parallel computing, using the capacities of HPC systems efficiently remains a challenging task. Yet, at the moment, only a few supporting tools exist, and they provide merely basic information in the form of summarized profile data. In this paper, we present our efforts in developing event-based tracing support for Python within the performance monitor Extrae to provide detailed information and enable in-depth performance analysis. We present concepts to record the complete communication behavior as well as to capture the entry and exit of Python functions to provide the corresponding application context. We evaluate our implementation in Extrae by analyzing the well-established electronic structure simulation package GPAW and demonstrate that the recorded traces provide information equivalent to that of traditional C or Fortran applications and therefore offer the same in-depth analysis capabilities for Python as well. Peer reviewed. Postprint (published version).
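    The idea of capturing function entry and exit as timestamped events can be illustrated with a minimal sketch. It uses the standard-library hook `sys.setprofile`; it is not Extrae's implementation, and the names `trace_calls`, `inner`, and `outer` are hypothetical, chosen only for this example.

```python
import sys
import time


def trace_calls(func, *args, **kwargs):
    """Run func and collect (timestamp_ns, event, function_name) tuples.

    Only Python-level "call" and "return" events are kept; events for
    C functions ("c_call", "c_return", ...) are ignored in this sketch.
    """
    events = []

    def profiler(frame, event, arg):
        if event in ("call", "return"):
            events.append((time.perf_counter_ns(), event, frame.f_code.co_name))

    sys.setprofile(profiler)      # install the hook for the current thread
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(None)      # always remove the hook again
    return result, events


# Hypothetical workload used only to exercise the hook.
def inner():
    return 1


def outer():
    return inner() + 1


result, events = trace_calls(outer)
for timestamp, event, name in events:
    print(timestamp, event, name)
```

    A real tracer would write such events to a trace buffer instead of a Python list and would also record communication calls; the sketch only shows the entry/exit part.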

    Mobile Agents in Wireless Sensor Networks to Improve the Coordination of Emergency Response Teams

    In this thesis, we propose an automated real-time monitoring system for victims in Mass Casualty Incidents (MCIs). New networking solutions such as Delay and Disruption Tolerant Networks (DTNs) and Wireless Sensor Networks (WSNs) offer a wide array of opportunities in hostile environments where access to communications is either non-existent or broken. Thanks to their flexibility and almost effortless field deployment, WSNs can easily reach unexplored or unfriendly areas, monitor their surroundings, and send back useful information. DTNs, for their part, can substitute for a slow, high-latency infrastructure-based network without needing to deploy one, simply by using nearby resources opportunistically. In this study, we use these two technologies to create a hybrid architecture to help in the triage of victims in emergency scenarios. The usual lack of a communications infrastructure and the scarcity of resources make this kind of scenario a perfect place to get the most out of DTNs and WSNs. We first present an overview of the architecture, how the two technologies work together, and how they exchange data using Mobile Agent (MA) technologies. Then, we take the explicit itinerary construction from traditional MAs and apply it to MAs running on wireless sensor nodes, with their computing and battery constraints. Third, we extend this itinerary structure to support WSN clusters: fully autonomous networks with their own monitoring services, whose only limitation is that they cannot be larger than 32 nodes. These contributions are supported by tests and results that demonstrate their feasibility and usefulness. Finally, we present two theoretical approaches, one to access and retrieve remote services in large WSNs, and another to provide access control for the nodes used in this kind of network.

    Analyzing the efficiency of hybrid codes

    Hybrid parallelization may be the only path for most codes to use HPC systems on a very large scale. Even at a small scale, with an increasing number of cores per node, combining MPI with a shared-memory, thread-based library makes it possible to reduce the application's network requirements. Despite the benefits of a hybrid approach, it is not easy to achieve an efficient hybrid execution. This is not only because of the added complexity of combining two different programming models, but also because in many cases the code was initially designed with just one level of parallelization and later extended to a hybrid mode. This paper presents our model to diagnose the efficiency of hybrid applications, distinguishing the contribution of each parallel programming paradigm. The flexibility of the proposed methodology allows us to use it for different paradigms and scenarios, such as comparing the MPI+OpenMP and MPI+CUDA versions of the same code. This work has been partially developed under the scope of POP CoE, which has received funding from the European Union's Horizon 2020 research and innovation programme (under grant agreements No. 676553 and 824080), and with the support of the Comision Interministerial de Ciencia y Tecnología (CICYT) under contract No. PID2019-107255GB-C22. We also want to acknowledge the ChEESE CoE and the EDANYA group from Universidad de Málaga (www.uma.es/edanya), which granted us permission to report on the Tsunami-HySEA analysis. Peer reviewed. Postprint (author's final draft).
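    The abstract does not spell out the hybrid model itself. As background only, the sketch below computes the widely used single-level multiplicative metrics (load balance, communication efficiency, and their product, parallel efficiency) from per-process useful-computation times. The function name `efficiency_metrics` and the input numbers are hypothetical, and the per-paradigm split that the paper contributes is not reproduced here.

```python
def efficiency_metrics(useful_times, runtime):
    """Single-level efficiency metrics from per-process useful compute time.

    load_balance             = average(useful) / max(useful)
    communication_efficiency = max(useful) / runtime
    parallel_efficiency      = load_balance * communication_efficiency
                             = average(useful) / runtime
    """
    if not useful_times or runtime <= 0:
        raise ValueError("need at least one process and a positive runtime")
    average_useful = sum(useful_times) / len(useful_times)
    max_useful = max(useful_times)
    load_balance = average_useful / max_useful
    communication_efficiency = max_useful / runtime
    return {
        "load_balance": load_balance,
        "communication_efficiency": communication_efficiency,
        "parallel_efficiency": load_balance * communication_efficiency,
    }


# Hypothetical measurements: four processes, times in seconds.
metrics = efficiency_metrics([8.0, 6.0, 10.0, 8.0], runtime=12.5)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

    Because the factors multiply, a low parallel efficiency can be attributed to either imbalance or time lost outside useful computation; a hybrid model applies the same kind of attribution separately to each programming paradigm.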

    The secrets of the accelerators unveiled: tracing heterogeneous executions through OMPT

    Heterogeneous systems are an important trend in the future of supercomputers, yet they can be hard to program, and developers still lack powerful tools to understand how well their accelerated codes perform and how to improve them. With different types of hardware accelerators available, each with its own low-level programming API, there is not yet a clear consensus on a standard way to retrieve information about an accelerator's performance. To improve this scenario, OMPT is a novel performance monitoring interface that is being considered for integration into the OpenMP standard. OMPT allows analysis tools to monitor the execution of parallel OpenMP applications by providing detailed information about the activity of the runtime through a standard API. For accelerated devices, OMPT also facilitates the exchange of performance information between the runtime and the analysis tool. We implement the part of the OMPT specification that refers to the use of accelerators both in the Nanos++ parallel runtime system and in the Extrae tracing framework, obtaining detailed performance information about the execution of the tasks issued to the accelerated devices for later analysis. Our work extends previous efforts in the field to expose detailed information from the OpenMP and OmpSs runtimes regarding the activity and performance of task-based parallel applications. In this paper, we focus on the evaluation of FPGA devices, studying the performance of two common kernels in scientific algorithms: matrix multiplication and Cholesky decomposition. Furthermore, this development is seamlessly applicable to the analysis of GPGPU accelerators and Intel® Xeon Phi™ co-processors operating under the OmpSs programming model. This work was partially supported by the European Union H2020 program through the AXIOM project (grant ICT-01-2014 GA 645496) and the Mont-Blanc 2 project; by the Ministerio de Economía y Competitividad, under contracts Computación de Altas Prestaciones VII (TIN2015-65316-P); by the Departament d'Innovació, Universitats i Empresa de la Generalitat de Catalunya, under projects MPEXPAR: Models de Programació i Entorns d'Execució Paral·lels (2014-SGR-1051) and 2009-SGR-980; by the BSC-CNS Severo Ochoa program (SEV-2011-00067); by the Intel-BSC Exascale Laboratory project; and by the OMPT Working Group. Peer reviewed. Postprint (published version).
