
    Adaptive Parallelism for Coupled, Multithreaded Message-Passing Programs

    Hybrid parallel programming models that combine message passing (MP) and shared-memory multithreading (MT) are becoming more popular, especially for applications requiring higher degrees of parallelism and scalability. Consequently, coupled parallel programs, those built by integrating independently developed and optimized software libraries into a single application, increasingly comprise message-passing libraries with differing preferred degrees of threading, resulting in thread-level heterogeneity. Retroactively matching threading levels between independently developed and maintained libraries is difficult, and the challenge is exacerbated because contemporary middleware services provide only static scheduling policies that span entire program executions, necessitating suboptimal, over-subscribed or under-subscribed, configurations. In coupled applications, a poorly configured component can lead to poor overall application performance, suboptimal resource utilization, and increased time-to-solution, so it is critical that each library execute in a manner consistent with its design and tuning for a particular system architecture and workload. There is therefore a need for techniques that address dynamic, conflicting configurations in coupled multithreaded message-passing (MT-MP) programs. Our thesis is that we can achieve significant performance improvements over static under-subscribed approaches through reconfigurable execution environments that consider compute-phase parallelization strategies along with both hardware and software characteristics. In this work, we present new ways to structure, execute, and analyze coupled MT-MP programs. Our study begins with an examination of contemporary approaches used to accommodate thread-level heterogeneity in coupled MT-MP programs, identifying potential inefficiencies in how these programs are structured and executed in the high-performance computing domain. We then present and evaluate a novel approach for accommodating thread-level heterogeneity. Our approach enables full utilization of all available compute resources throughout an application's execution by providing programmable facilities, with modest overheads, to dynamically reconfigure runtime environments for compute phases with differing threading factors and affinities. Our performance results show that for a majority of the tested scientific workloads our approach and corresponding open-source reference implementation render speedups greater than 50% over the static under-subscribed baseline. Motivated by our examination of reconfigurable execution environments and their memory overhead, we also study the memory attribution problem: the inability to predict or evaluate during runtime where the available memory is used across the software stack comprising the application, reusable software libraries, and supporting runtime infrastructure. Specifically, dynamic adaptation requires runtime intervention, which by its nature introduces additional runtime and memory overhead. To better understand the latter, we propose and evaluate a new way to quantify component-level memory usage from unmodified binaries dynamically linked to a message-passing communication library. Our experimental results show that our approach and corresponding implementation accurately measure memory resource usage as a function of time, scale, communication workload, and software or hardware system architecture, clearly distinguishing between application and communication-library usage at a per-process level.
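    The phase-by-phase reconfiguration described above can be pictured with a minimal hybrid MPI+OpenMP sketch: each compute phase sets its own threading factor before running, with an MPI barrier marking the phase boundary. The phase names and threading factors below are hypothetical, and the thesis' programmable runtime facilities (including affinity handling) go well beyond this illustration.

```cpp
// Illustrative sketch only: per-phase thread reconfiguration in a hybrid
// MPI+OpenMP program. Phases and threading factors are hypothetical.
#include <mpi.h>
#include <omp.h>
#include <cstdio>

static void compute_phase(const char* name, int nthreads) {
    omp_set_num_threads(nthreads);  // reconfigure the threading factor for this phase
    #pragma omp parallel
    {
        // ... phase work; thread affinities would also be (re)set here ...
        #pragma omp master
        std::printf("%s: running with %d threads\n", name, omp_get_num_threads());
    }
}

int main(int argc, char** argv) {
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    compute_phase("library A (thread-hungry)", omp_get_num_procs());
    MPI_Barrier(MPI_COMM_WORLD);  // phase boundary
    compute_phase("library B (MPI-everywhere)", 1);

    MPI_Finalize();
    return 0;
}
```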

    SLA Translation in Multi-Layered Service Oriented Architectures: Status and Challenges


    Adaptive heterogeneous parallelism for semi-empirical lattice dynamics in computational materials science

    With the variability in performance of the multitude of parallel environments available today, the conceptual overhead created by the need to anticipate runtime information to make design-time decisions has become overwhelming. Performance-critical applications and libraries carry implicit assumptions based on incidental metrics that are not portable to emerging computational platforms or even to alternative contemporary architectures. Furthermore, the significance of runtime concerns such as makespan, energy efficiency and fault tolerance depends on the situational context. This thesis presents a case study in the application of both Mattson's prescriptive pattern-oriented approach and the more principled structured parallelism formalism to the computational simulation of inelastic neutron scattering spectra on hybrid CPU/GPU platforms. The original ad hoc implementation as well as new pattern-based and structured implementations are evaluated for relative performance and scalability. Two new structural abstractions are introduced to facilitate adaptation by lazy optimisation and runtime feedback. A deferred-choice abstraction represents a unified space of alternative structural program variants, allowing static adaptation through model-specific exhaustive calibration with regard to the extrafunctional concerns of runtime, average instantaneous power and total energy usage. Instrumented queues serve as a mechanism for structural composition and provide a representation of extrafunctional state that allows the realisation of a market-based decentralised coordination heuristic for competitive resource allocation and of the Lyapunov drift algorithm for cooperative scheduling.
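    As an illustration of the deferred-choice idea, the sketch below holds several structural variants of the same computation and resolves the choice by exhaustive calibration at run time. The class and method names are hypothetical, not the thesis' actual abstraction, and a fuller version would calibrate against power and total energy as well as runtime.

```cpp
// Illustrative "deferred choice" between structural program variants,
// resolved by one-off exhaustive calibration at run time.
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct Variant {
    std::string name;
    std::function<void()> run;  // one structural parallelization of the phase
};

class DeferredChoice {
    std::vector<Variant> variants_;
    std::size_t best_ = 0;
public:
    explicit DeferredChoice(std::vector<Variant> vs) : variants_(std::move(vs)) {}

    // Time each variant once and remember the fastest.
    void calibrate() {
        double best_ms = 1e300;
        for (std::size_t i = 0; i < variants_.size(); ++i) {
            auto t0 = std::chrono::steady_clock::now();
            variants_[i].run();
            auto t1 = std::chrono::steady_clock::now();
            double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
            if (ms < best_ms) { best_ms = ms; best_ = i; }
        }
    }

    void operator()() const { variants_[best_].run(); }  // run the chosen variant
};

int main() {
    DeferredChoice phase({{"rows",  []{ /* row-parallel variant  */ }},
                          {"tiles", []{ /* tile-parallel variant */ }}});
    phase.calibrate();  // exhaustive calibration, performed once
    phase();            // subsequent calls run the selected variant
}
```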

    Hardware counter-based performance analysis, modelling, and improvement through thread migration in NUMA systems

    Recent years have seen an important evolution in the computational resources available in science and engineering. Currently, most high-performance systems include several multicore processors and use a NUMA (Non-Uniform Memory Access) memory architecture. In this context, data locality becomes a highly important issue for the performance of parallel codes. It is foreseeable that the complexity of SMP (Symmetric Multiprocessing) NUMA systems will keep increasing over the coming years, in both the number of cores and the complexity of the memory hierarchy, including its various cache levels. This implies that memory access latency will depend, increasingly, on the proximity or affinity of the different threads to the memory modules where their data reside. Improving the performance and scalability of parallel codes on multicore architectures can be quite complex, and memory management becomes more complicated still, especially from the point of view of a programmer who wishes to obtain the best performance. The problem worsens in the usual case where several processes execute simultaneously. Automatically migrating executing threads among the cores and processors, depending on their behaviour, may improve the performance of parallel programs. Furthermore, it may simplify their development, since the programmer avoids having to explicitly manage locality. Modern microprocessors include registers, usually known as hardware counters (HCs), that give useful information at a low cost, although they are not commonly used owing to a lack of tools for easily obtaining their data. In modern processors, these HCs make it possible to obtain the memory access latency of cache miss resolutions, and even the memory address that triggered the event, which opens the door to new performance improvement techniques based on this information. This work presents a procedure to easily and automatically obtain data about the execution of shared-memory parallel codes on SMP multicore and NUMA systems and to model their performance, at runtime, using the hardware counters of modern processors alongside additional information such as the memory access latencies observed by the different threads. The model is then used to improve the efficiency of the execution of those parallel codes automatically and transparently to the user.
    Nowadays, most computing systems are multicore and even multiprocessor. In these systems, the behaviour of each thread's accesses to the different memory nodes is one of the aspects that most significantly affects the performance of any code, a fact that becomes ever more relevant as the so-called memory wall grows. In this work, this issue was addressed from two points of view. From the point of view of a parallel application programmer, tools and models were developed to characterize the behaviour of codes and to assist their implementation. From the point of view of a parallel application user, a migration tool was developed that automatically selects and adapts, during execution, the placement of threads in the system to improve performance. All of these tools make use of runtime performance data obtained from the hardware counters (HCs) present in Intel processors. Compared with software profilers, HCs provide, at low overhead, detailed and rich performance information concerning the functional units, the caches, CPU accesses to main memory, and so on. A further advantage is that they require no modification of the source code. However, the types and meanings of the hardware counters vary from one architecture to another owing to differences in hardware organisation, it can be difficult to correlate the low-level performance metrics with the original source code, and the limited number of registers available to store the counters can force users to perform multiple measurement runs to collect all the desired performance metrics. Specifically, this work used Precise Event Based Sampling (PEBS) on modern Intel processors and the Event Address Registers (EARs) on Itanium 2 processors. The Itanium 2 processor offers a set of registers, the EARs, that record the instruction and data addresses of cache misses, and the instruction and data addresses of TLB misses [25]. When used to capture cache misses, the EARs allow the detection of latencies greater than 4 cycles. Since floating-point accesses always cause a miss (floating-point data are always stored in the L2D), any such access can potentially be detected. The EARs support statistical sampling by configuring a performance counter to count the occurrences of a given event. PEBS uses an interrupt mechanism together with the HCs to store a set of information about the architectural state of the processor. This information reflects the architectural state for the instruction executed after the instruction that caused the event. Along with this information, which includes the state of all the registers, Sandy Bridge processors provide a memory latency measurement facility, a means to characterize the average load latency to the different levels of the memory hierarchy. Latency is measured from the issue of the instruction until the data become globally observable, that is, when they reach the processor. Besides the latency, PEBS reports the origin of the data and the memory level the data were read from. Unlike the EARs, PEBS also allows measuring the latency of integer or store operations.
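    To make the PEBS discussion concrete, the following sketch shows how load-latency sampling of this kind is typically configured on Linux through perf_event_open. The raw event encoding (0x1cd, a common MEM_TRANS_RETIRED.LOAD_LATENCY encoding) and the ldlat threshold passed in config1 are Intel model-specific assumptions; consult the SDM or libpfm4 for a given CPU. The ring-buffer decoding that a real tool performs is omitted.

```cpp
// Illustrative PEBS-style load-latency sampling via Linux perf_event_open.
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

static int open_event(perf_event_attr* attr) {
    return static_cast<int>(syscall(SYS_perf_event_open, attr,
                                    0 /*self*/, -1 /*any cpu*/,
                                    -1 /*no group*/, 0));
}

int main() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x1cd;          // assumed load-latency PEBS event (model-specific)
    attr.config1 = 8;             // only sample loads slower than 8 cycles (ldlat)
    attr.sample_period = 10000;   // one sample per ~10k qualifying loads
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR |
                       PERF_SAMPLE_WEIGHT | PERF_SAMPLE_DATA_SRC;
    attr.precise_ip = 2;          // request precise (PEBS) sampling
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = open_event(&attr);
    if (fd < 0) { std::perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    // ... run the code under study; samples (instruction address, data
    // address, latency, data source) accumulate in a ring buffer that a
    // full tool would mmap and decode ...
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    close(fd);
    return 0;
}
```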

    Optimization of high-throughput real-time processes in physics reconstruction

    The present thesis has been developed in collaboration between the Universidad de Sevilla and the European Organization for Nuclear Research, CERN. The LHCb detector is one of the four big detectors situated at the Large Hadron Collider, LHC. In LHCb, particles are collided at high energies in order to understand the difference between matter and antimatter. Due to the massive quantity of data generated by the detector, it is necessary to filter data in real time, guided by the current knowledge collected in the Standard Model of particle physics. The filter, also known as the High Level Trigger, processes a throughput of 40 Tb/s of data and performs a selection of approximately 1000:1, reducing the throughput to roughly 40 Gb/s of output, which is stored for later analysis. The High Level Trigger is subdivided into two stages: High Level Trigger 1 (HLT1) and High Level Trigger 2 (HLT2). HLT1 runs in real time and yields a data reduction of approximately 30:1. It consists of a series of software processes that reconstruct what happened in the particle collisions. The HLT1 reconstruction only analyzes the trajectories of the particles produced in a collision, solving a problem known as track reconstruction, to determine whether the collision data are kept or discarded. In contrast, HLT2 is a finer process, which requires more time to execute and reconstructs all the subdetectors composing LHCb. Towards 2020, the LHCb detector and all the components of the data acquisition system will be upgraded, including the servers that process HLT1 and HLT2. At the same time, the LHC accelerator will also be upgraded, increasing the amount of data generated in every bunch crossing by roughly 5 times. Due to the accelerator and detector upgrades, the amount of data the HLT as a whole will have to process is expected to grow by a factor of about 40. Projections of the current software's scalability to 2020 underestimated the resources required to face this increase in throughput. As a consequence, studies of all the algorithms composing HLT1 and HLT2, together with code modernizations, were carried out in order to improve performance and make the foreseen hardware resources of the upgrade capable of processing the expected amount of data. In this thesis, several algorithms of the LHCb reconstruction are explored. The track reconstruction problem is analyzed in depth, and new algorithms are proposed for its solution. Since the analyzed problems are massively parallel, these algorithms are implemented in languages specialized for modern graphics cards (GPUs), given their inherently parallel architecture. Two track reconstruction algorithms are designed in this work. Furthermore, four decoding algorithms and a clustering algorithm, problems also found in HLT1, have been designed and implemented, as well as a parallel Kalman filter algorithm that can be used in both HLT stages. The developed algorithms satisfy the requirements of the LHCb collaboration for the LHCb upgrade. In order to execute the algorithms efficiently on GPUs, a software framework specialized for GPUs has been developed, which allows the parallel execution of reconstruction sequences on GPUs. Combining the developed algorithms with the framework, an execution sequence is completed that lays the foundations of a GPU HLT1. During the research carried out in this thesis, the aforementioned developments and a small team of collaborators coordinated by the author led to the completion of a full HLT1 sequence running on GPUs. The performance obtained on GPUs, a product of this thesis, makes it possible to execute a reconstruction sequence in real time under the upgraded LHCb conditions foreseen for 2020. This constitutes the first High Level Trigger executed entirely on GPUs for any LHC experiment. Finally, several possible configurations for including graphics cards in the LHCb data acquisition system are detailed.
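    For intuition about the Kalman filter building block mentioned above, here is a generic scalar (one-dimensional) predict-and-update step of the kind a track fit applies at every detector hit. It is a textbook sketch under simplifying assumptions, not the LHCb implementation: a real fit propagates a state vector with covariance matrices and fits many tracks in parallel on the GPU.

```cpp
// Illustrative scalar Kalman filter step: predict to the next detector
// plane (trivial constant model with process noise Q), then update with a
// measurement m of variance R.
#include <cstdio>

struct KalmanState {
    double x;  // estimated track parameter (e.g. local position)
    double P;  // variance of the estimate
};

KalmanState kalman_step(KalmanState s, double m, double Q, double R) {
    s.P += Q;                    // predict: uncertainty grows
    double K = s.P / (s.P + R);  // Kalman gain
    s.x += K * (m - s.x);        // update: blend prediction with measurement
    s.P *= (1.0 - K);            // uncertainty shrinks after the update
    return s;
}

int main() {
    KalmanState s{0.0, 1.0};
    const double hits[] = {1.02, 0.98, 1.05, 1.01};  // hypothetical hits
    for (double m : hits) {
        s = kalman_step(s, m, /*Q=*/0.01, /*R=*/0.1);
        std::printf("x=%.4f P=%.4f\n", s.x, s.P);
    }
}
```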

    Helmholtz Portfolio Theme Large-Scale Data Management and Analysis (LSDMA)

    The Helmholtz Association funded the "Large-Scale Data Management and Analysis" portfolio theme from 2012 to 2016. Four Helmholtz centres, six universities and another research institution in Germany joined forces to enable data-intensive science by optimising data life cycles in selected scientific communities. In our Data Life Cycle Labs, data experts performed joint R&D together with the scientific communities. The Data Services Integration Team focused on generic solutions applied by several communities.

    Handling Information and its Propagation to Engineer Complex Embedded Systems

    In today's data-driven technology, it is easy to assume that information is at the tips of our fingers, ready to be exploited. Research methodologies and tools are often built on top of this assumption. However, this illusion of abundance often breaks when attempting to transfer existing techniques to industrial applications. For instance, research has produced various methodologies to optimize the resource usage of large complex systems, such as the avionics of the Airbus A380. These approaches require the knowledge of certain metrics such as execution times, memory consumption, communication delays, etc. The design of these complex systems, however, employs a mix of expertise from different fields (likely with limited knowledge in software engineering), which may leave the system's characteristic data incomplete or missing. Moreover, the unavailability of relevant information makes it difficult to properly describe the system, predict its behavior, and improve its performance. We fall back on probabilistic models and machine learning techniques to address this lack of relevant information. Probability theory, especially, has great potential to describe partially observable systems. Our objective is to provide approaches and solutions that produce the relevant information. This enables a proper description of complex systems to ease integration, and allows the use of existing optimization techniques. Our first step is to tackle one of the difficulties encountered during system integration: ensuring the proper timing behavior of critical systems. Due to technology scaling and the growing reliance on multi-core architectures, the overhead of software running on different cores and sharing memory space is no longer negligible. To this end, we extend the real-time systems toolkit with a static probabilistic timing analysis technique that accurately estimates the execution time of software while accounting for shared-memory contention. The model is then incorporated into a simulator for scheduling multi-processor real-time systems.
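    The core operation behind such a static probabilistic timing analysis can be sketched briefly: if two code blocks execute in sequence and their execution times are modelled as independent discrete distributions, the distribution of the total is their convolution. The latency values and probabilities below are hypothetical placeholders; a real analysis derives them per instruction and cache state.

```cpp
// Illustrative convolution of discrete execution-time distributions, the
// basic step of a static probabilistic timing analysis.
#include <cstdio>
#include <map>

using PMF = std::map<int, double>;  // latency (cycles) -> probability

// Convolve two independent execution-time distributions.
PMF convolve(const PMF& a, const PMF& b) {
    PMF out;
    for (auto [ta, pa] : a)
        for (auto [tb, pb] : b)
            out[ta + tb] += pa * pb;
    return out;
}

int main() {
    PMF A{{10, 0.9}, {100, 0.1}};  // block A: 90% cache hit, 10% miss
    PMF B{{20, 0.7}, {50, 0.3}};   // block B: with/without memory contention
    PMF total = convolve(A, B);
    for (auto [t, p] : total)
        std::printf("%3d cycles with probability %.2f\n", t, p);
    // An exceedance threshold on this distribution yields a probabilistic WCET.
}
```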