16 research outputs found

    A Sparse SCF algorithm and its parallel implementation: Application to DFTB

    Full text link
    We present an algorithm and its parallel implementation for solving a self consistent problem as encountered in Hartree Fock or Density Functional Theory. The algorithm takes advantage of the sparsity of matrices through the use of local molecular orbitals. The implementation allows to exploit efficiently modern symmetric multiprocessing (SMP) computer architectures. As a first application, the algorithm is used within the density functional based tight binding method, for which most of the computational time is spent in the linear algebra routines (diagonalization of the Fock/Kohn-Sham matrix). We show that with this algorithm (i) single point calculations on very large systems (millions of atoms) can be performed on large SMP machines (ii) calculations involving intermediate size systems (1~000--100~000 atoms) are also strongly accelerated and can run efficiently on standard servers (iii) the error on the total energy due to the use of a cut-off in the molecular orbital coefficients can be controlled such that it remains smaller than the SCF convergence criterion.Comment: 13 pages, 11 figure

    Exploiting data locality in cache-coherent NUMA systems

    Get PDF
    The end of Dennard scaling has caused a stagnation of the clock frequency in computers.To overcome this issue, in the last two decades vendors have been integrating larger numbers of processing elements in the systems, interconnecting many nodes, including multiple chips in the nodes and increasing the number of cores in each chip. The speed of main memory has not evolved at the same rate as processors, it is much slower and there is a need to provide more total bandwidth to the processors, especially with the increase in the number of cores and chips. Still keeping a shared address space, where all processors can access the whole memory, solutions have come by integrating more memories: by using newer technologies like high-bandwidth memories (HBM) and non-volatile memories (NVM), by giving groups cores (like sockets, for example) faster access to some subset of the DRAM, or by combining many of these solutions. This has caused some heterogeneity in the access speed to main memory, depending on the CPU requesting access to a memory address and the actual physical location of that address, causing non-uniform memory access (NUMA) behaviours. Moreover, many of these systems are cache-coherent (ccNUMA), meaning that changes in the memory done from one CPU must be visible by the other CPUs and transparent for the programmer. These NUMA behaviours reduce the performance of applications and can pose a challenge to the programmers. To tackle this issue, this thesis proposes solutions, at the software and hardware levels, to improve the data locality in NUMA systems and, therefore, the performance of applications in these computer systems. The first contribution shows how considering hardware prefetching simultaneously with thread and data placement in NUMA systems can find configurations with better performance than considering these aspects separately. The performance results combined with performance counters are then used to build a performance model to predict, both offline and online, the best configuration for new applications not in the model. The evaluation is done using two different high performance NUMA systems, and the performance counters collected in one machine are used to predict the best configurations in the other machine. The second contribution builds on the idea that prefetching can have a strong effect in NUMA systems and proposes a NUMA-aware hardware prefetching scheme. This scheme is generic and can be applied to multiple hardware prefetchers with a low hardware cost but giving very good results. The evaluation is done using a cycle-accurate architectural simulator and provides detailed results of the performance, the data transfer reduction and the energy costs. Finally, the third and last contribution consists in scheduling algorithms for task-based programming models. These programming models help improve the programmability of applications in parallel systems and also provide useful information to the underlying runtime system. This information is used to build a task dependency graph (TDG), a directed acyclic graph that models the application where the nodes are sequential pieces of code known as tasks and the edges are the data dependencies between the different tasks. The proposed scheduling algorithms use graph partitioning techniques and provide a scheduling for the tasks in the TDG that minimises the data transfers between the different NUMA regions of the system. The results have been evaluated in real ccNUMA systems with multiple NUMA regions.La fi de la llei de Dennard ha provocat un estancament de la freqüència de rellotge dels computadors. Amb l'objectiu de superar aquest fet, durant les darreres dues dècades els fabricants han integrat més quantitat d'unitats de còmput als sistemes mitjançant la interconnexió de nodes diferents, la inclusió de múltiples xips als nodes i l'increment de nuclis de processador a cada xip. La rapidesa de la memòria principal no ha evolucionat amb el mateix factor que els processadors; és molt més lenta i hi ha la necessitat de proporcionar més ample de banda als processadors, especialment amb l'increment del nombre de nuclis i xips. Tot mantenint un adreçament compartit en el qual tots els processadors poden accedir a la memòria sencera, les solucions han estat al voltant de la integració de més memòries: amb tecnologies modernes com HBM (high-bandwidth memories) i NVM (non-volatile memories), fent que grups de nuclis (com sòcols sencers) tinguin accés més ràpid a una part de la DRAM o amb la combinació de solucions. Això ha provocat una heterogeneïtat en la velocitat d'accés a la memòria principal, en funció del nucli que sol·licita l'accés a una adreça en particular i la seva localització física, fet que provoca uns comportaments no uniformes en l'accés a la memòria (non-uniform memory access, NUMA). A més, sovint tenen memòries cau coherents (cache-coherent NUMA, ccNUMA), que implica que qualsevol canvi fet a la memòria des d'un nucli d'un processador ha de ser visible la resta de manera transparent. Aquests comportaments redueixen el rendiment de les aplicacions i suposen un repte. Per abordar el problema, a la tesi s'hi proposen solucions, a nivell de programari i maquinari, que milloren la localitat de dades als sistemes NUMA i, en conseqüència, el rendiment de les aplicacions en aquests sistemes. La primera contribució mostra que, quan es tenen en compte alhora la precàrrega d'adreces de memòria amb maquinari (hardware prefetching) i les decisions d'ubicació dels fils d'execució i les dades als sistemes NUMA, es poden trobar millors configuracions que quan es condieren per separat. Una combinació dels resultats de rendiment i dels comptadors disponibles al sistema s'utilitza per construir un model de rendiment per fer la predicció, tant per avançat com també en temps d'execució, de la millor configuració per aplicacions que no es troben al model. L'avaluació es du a terme a dos sistemes NUMA d'alt rendiment, i els comptadors mesurats en un sistema s'usen per predir les millors configuracions a l'altre sistema. La segona contribució es basa en la idea que el prefetching pot tenir un efecte considerable als sistemes NUMA i proposa un esquema de precàrrega a nivell de maquinari que té en compte els efectes NUMA. L'esquema és genèric i es pot aplicar als algorismes de precàrrega existents amb un cost de maquinari molt baix però amb molt bons resultats. S'avalua amb un simulador arquitectural acurat a nivell de cicle i proporciona resultats detallats del rendiment, la reducció de les comunicacions de dades i els costos energètics. La tercera i darrera contribució consisteix en algorismes de planificació per models de programació basats en tasques. Aquests simplifiquen la programabilitat de les aplicacions paral·leles i proveeixen informació molt útil al sistema en temps d'execució (runtime system) que en controla el funcionament. Amb aquesta informació es construeix un graf de dependències entre tasques (task dependency graph, TDG), un graf dirigit i acíclic que modela l'aplicació i en el qual els nodes són fragments de codi seqüencial (o tasques) i els arcs són les dependències de dades entre les tasques. Els algorismes de planificació proposats fan servir tècniques de particionat de grafs i proporcionen una planificació de les tasques del TDG que minimitza la comunicació de dades entre les diferents regions NUMA del sistema. Els resultats han estat avaluats en sistemes ccNUMA reals amb múltiples regions NUMA.El final de la ley de Dennard ha provocado un estancamiento de la frecuencia de reloj de los computadores. Con el objetivo de superar este problema, durante las últimas dos décadas los fabricantes han integrado más unidades de cómputo en los sistemas mediante la interconexión de nodos diferentes, la inclusión de múltiples chips en los nodos y el incremento de núcleos de procesador en cada chip. La rapidez de la memoria principal no ha evolucionado con el mismo factor que los procesadores; es mucho más lenta y hay la necesidad de proporcionar más ancho de banda a los procesadores, especialmente con el incremento del número de núcleos y chips. Aun manteniendo un sistema de direccionamiento compartido en el que todos los procesadores pueden acceder al conjunto de la memoria, las soluciones han oscilado alrededor de la integración de más memorias: usando tecnologías modernas como las memorias de alto ancho de banda (highbandwidth memories, HBM) y memorias no volátiles (non-volatile memories, NVM), haciendo que grupos de núcleos (como zócalos completos) tengan acceso más veloz a un subconjunto de la DRAM, o con la combinación de soluciones. Esto ha provocado una heterogeneidad en la velocidad de acceso a la memoria principal, en función del núcleo que solicita el acceso a una dirección de memoria en particular y la ubicación física de esta dirección, lo que provoca unos comportamientos no uniformes en el acceso a la memoria (non-uniform memory access, NUMA). Además, muchos de estos sistemas tienen memorias caché coherentes (cache-coherent NUMA, ccNUMA), lo que implica que cualquier cambio hecho en la memoria desde un núcleo de un procesador debe ser visible por el resto de procesadores de forma transparente para los programadores. Estos comportamientos NUMA reducen el rendimiento de las aplicaciones y pueden suponer un reto para los programadores. Para abordar dicho problema, en esta tesis se proponen soluciones, a nivel de software y hardware, que mejoran la localidad de datos en los sistemas NUMA y, en consecuencia, el rendimiento de las aplicaciones en estos sistemas informáticos. La primera contribución muestra que, cuando se tienen en cuenta a la vez la precarga de direcciones de memoria mediante hardware (o hardware prefetching ) y las decisiones de la ubicación de los hilos de ejecución y los datos en los sistemas NUMA, se pueden hallar mejores configuraciones que cuando se consideran ambos aspectos por separado. Con una combinación de los resultados de rendimiento y de los contadores disponibles en el sistema se construye un modelo de rendimiento, tanto por avanzado como en en tiempo de ejecución, de la mejor configuración para aplicaciones que no están incluidas en el modelo. La evaluación se realiza en dos sistemas NUMA de alto rendimiento, y los contadores medidos en uno de los sistemas se usan para predecir las mejores configuraciones en el otro sistema. La segunda contribución se basa en la idea de que el prefetching puede tener un efecto considerable en los sistemas NUMA y propone un esquema de precarga a nivel hardware que tiene en cuenta los efectos NUMA. Este esquema es genérico y se puede aplicar a diferentes algoritmos de precarga existentes con un coste de hardware muy bajo pero que proporciona muy buenos resultados. Dichos resultados se obtienen y evalúan mediante un simulador arquitectural preciso a nivel de ciclo y proporciona resultados detallados del rendimiento, la reducción de las comunicaciones de datos y los costes energéticos. Finalmente, la tercera y última contribución consiste en algoritmos de planificación para modelos de programación basados en tareas. Estos modelos simplifican la programabilidad de las aplicaciones paralelas y proveen información muy útil al sistema en tiempo de ejecución (runtime system) que controla su funcionamiento. Esta información se utiliza para construir un grafo de dependencias entre tareas (task dependency graph, TDG), un grafo dirigido y acíclico que modela la aplicación y en el ue los nodos son fragmentos de código secuencial, conocidos como tareas, y los arcos son las dependencias de datos entre las distintas tareas. Los algoritmos de planificación que se proponen usan técnicas e particionado de grafos y proporcionan una planificación de las tareas del TDG que minimiza la comunicación de datos entre las distintas regiones NUMA del sistema. Los resultados se han evaluado en sistemas ccNUMA reales con múltiples regiones NUMA.Postprint (published version

    Center for Programming Models for Scalable Parallel Computing - Towards Enhancing OpenMP for Manycore and Heterogeneous Nodes

    Full text link

    Single system image: A survey

    Get PDF
    Single system image is a computing paradigm where a number of distributed computing resources are aggregated and presented via an interface that maintains the illusion of interaction with a single system. This approach encompasses decades of research using a broad variety of techniques at varying levels of abstraction, from custom hardware and distributed hypervisors to specialized operating system kernels and user-level tools. Existing classification schemes for SSI technologies are reviewed, and an updated classification scheme is proposed. A survey of implementation techniques is provided along with relevant examples. Notable deployments are examined and insights gained from hands-on experience are summarized. Issues affecting the adoption of kernel-level SSI are identified and discussed in the context of technology adoption literature

    Optimizations and Cost Models for multi-core architectures: an approach based on parallel paradigms

    Get PDF
    The trend in modern microprocessor architectures is clear: multi-core chips are here to stay, and researchers expect multiprocessors with 128 to 1024 cores on a chip in some years. Yet the software community is slowly taking the path towards parallel programming: while some works target multi-cores, these are usually inherited from the previous tools for SMP architectures, and rarely exploit specific characteristics of multi-cores. But most important, current tools have no facilities to guarantee performance or portability among architectures. Our research group was one of the first to propose the structured parallel programming approach to solve the problem of performance portability and predictability. This has been successfully demonstrated years ago for distributed and shared memory multiprocessors, and we strongly believe that the same should be applied to multi-core architectures. The main problem with performance portability is that optimizations are effective only under specific conditions, making them dependent on both the specific program and the target architecture. For this reason in current parallel programming (in general, but especially with multi-cores) optimizations usually follows a try-and-decide approach: each one must be implemented and tested on the specific parallel program to understand its benefits. If we want to make a step forward and really achieve some form of performance portability, we require some kind of prediction of the expected performance of a program. The concept of performance modeling is quite old in the world of parallel programming; yet, in the last years, this kind of research saw small improvements: cost models to describe multi-cores are missing, mainly because of the increasing complexity of microarchitectures and the poor knowledge of specific implementation details of current processors. In the first part of this thesis we prove that the way of performance modeling is still feasible, by studying the Tilera TilePro64. The high number of cores on-chip in this processor (64) required the use of several innovative solutions, such as a complex interconnection network and the use of multiple memory interfaces per chip. For these features the TilePro64 can be considered an insight of what to expect in future multi-core processors. The availability of a cycle-accurate simulator and an extensive documentation allowed us to model the architecture, and in particular its memory subsystem, at the accuracy level required to compare optimizations In the second part, focused on optimizations, we cover one of the most important issue of multi-core architectures: the memory subsystem. In this area multi-core strongly differs in their structure w.r.t off-chip parallel architectures, both SMP and NUMA, thus opening new opportunities. In detail, we investigate the problem of data distribution over the memory controllers in several commercial multi-cores, and the efficient use of the cache coherency mechanisms offered by the TilePro64 processor. Finally, by using the performance model, we study different implementations, derived from the previous optimizations, of a simple test-case application. We are able to predict the best version using only profiled data from a sequential execution. The accuracy of the model has been verified by experimentally comparing the implementations on the real architecture, giving results within 1 − 2% of accuracy

    Performance Models for Electronic Structure Methods on Modern Computer Architectures

    Get PDF
    Electronic structure codes are computationally intensive scientic applications used to probe and elucidate chemical processes at an atomic level. Maximizing the performance of these applications on any given hardware platform is vital in order to facilitate larger and more accurate computations. An important part of this endeavor is the development of protocols for measuring performance, and models to describe that performance as a function of system architecture. This thesis makes contributions in both areas, with a focus on shared memory parallel computer architectures and the Gaussian electronic structure code. Shared memory parallel computer systems are increasingly important as hardware man- ufacturers are unable to extract performance improvements by increasing clock frequencies. Instead the emphasis is on using multi-core processors to provide higher performance. These processor chips generally have complex cache hierarchies, and may be coupled together in multi-socket systems which exhibit highly non-uniform memory access (NUMA) characteristics. This work seeks to understand how cache characteristics and memory/thread placement affects the performance of electronic structure codes, and to develop performance models that can be used to describe and predict code performance by accounting for these effects. A protocol for performing memory and thread placement experiments on NUMA systems is presented and its implementation under both the Solaris and Linux operating systems is discussed. A placement distribution model is proposed and subsequently used to guide both memory/thread placement experiments and as an aid in the analysis of results obtained from experiments. In order to describe single threaded performance as a function of cache blocking a simple linear performance model is investigated for use when computing the electron repulsion integrals that lie at the heart of virtually all electronic structure methods. A parametric cache variation study is performed. This is achieved by combining parameters obtained for the linear performance model on existing hardware, with instruction and cache miss counts obtained by simulation, and predictions are made of performance as a function of cache architecture. Extension of the linear performance model to describe multi-threaded performance on complex NUMA architectures is discussed and investigated experimentally. Use of dynamic page migration to improve locality is also considered. Finally the use of large scale electronic structure calculations is demonstrated in a series of calculations aiming to study the charge distribution for a single positive ion solvated within a shell of water molecules of increasing size

    Performance Analysis of Complex Shared Memory Systems

    Get PDF
    Systems for high performance computing are getting increasingly complex. On the one hand, the number of processors is increasing. On the other hand, the individual processors are getting more and more powerful. In recent years, the latter is to a large extent achieved by increasing the number of cores per processor. Unfortunately, scientific applications often fail to fully utilize the available computational performance. Therefore, performance analysis tools that help to localize and fix performance problems are indispensable. Large scale systems for high performance computing typically consist of multiple compute nodes that are connected via network. Performance analysis tools that analyze performance problems that arise from using multiple nodes are readily available. However, the increasing number of cores per processor that can be observed within the last decade represents a major change in the node architecture. Therefore, this work concentrates on the analysis of the node performance. The goal of this thesis is to improve the understanding of the achieved application performance on existing hardware. It can be observed that the scaling of parallel applications on multi-core processors differs significantly from the scaling on multiple processors. Therefore, the properties of shared resources in contemporary multi-core processors as well as remote accesses in multi-processor systems are investigated and their respective impact on the application performance is analyzed. As a first step, a comprehensive suite of highly optimized micro-benchmarks is developed. These benchmarks are able to determine the performance of memory accesses depending on the location and coherence state of the data. They are used to perform an in-depth analysis of the characteristics of memory accesses in contemporary multi-processor systems, which identifies potential bottlenecks. However, in order to localize performance problems, it also has to be determined to which extend the application performance is limited by certain resources. Therefore, a methodology to derive metrics for the utilization of individual components in the memory hierarchy as well as waiting times caused by memory accesses is developed in the second step. The approach is based on hardware performance counters, which record the number of certain hardware events. The developed micro-benchmarks are used to selectively stress individual components, which can be used to identify the events that provide a reasonable assessment for the utilization of the respective component and the amount of time that is spent waiting for memory accesses to complete. Finally, the knowledge gained from this process is used to implement a visualization of memory related performance issues in existing performance analysis tools. The results of the micro-benchmarks reveal that the increasing number of cores per processor and the usage of multiple processors per node leads to complex systems with vastly different performance characteristics of memory accesses depending on the location of the accessed data. Furthermore, it can be observed that the aggregated throughput of shared resources in multi-core processors does not necessarily scale linearly with the number of cores that access them concurrently, which limits the scalability of parallel applications. It is shown that the proposed methodology for the identification of meaningful hardware performance counters yields useful metrics for the localization of memory related performance limitations

    Comparison of MPI benchmark programs on an SGI altix ccNUMA shared memory machine

    No full text
    The results produced by five different MPI benchmark programs on an SGI Altix 3700 are analyzed and compared. There are significant differences in the results for some MPI operations. We investigate the reasons for these discrepancies, which are due to differences in the measurement techniques, implementation details and default configurations of the different benchmarks. The variation in results on the Altix are generally much greater than on a distributed memory machine, due primarily to the ccNUMA architecture and the importance of cache effects, as well as some implementation details of the SGI MPI libraries
    corecore