4 research outputs found

    Hardware counter-based performance analysis, modelling, and improvement through thread migration in NUMA systems

    [EN]Recent years have seen an important evolution in the computational resources available to science and engineering. Currently, most high-performance systems include several multicore processors and use a NUMA (Non-Uniform Memory Access) memory architecture. In this context, data locality becomes a critical issue for the performance of parallel codes. It is foreseeable that the complexity of SMP (Symmetric Multiprocessing) NUMA systems will keep growing over the next years: both the number of cores and the complexity of the memory hierarchy, including its various cache levels, will increase, which implies that memory access latency will depend increasingly on the proximity, or affinity, of each thread to the memory modules where its data reside. Improving the performance and scalability of parallel codes on multicore architectures can be quite complex. Memory management in parallel codes will therefore become more complicated, especially from the point of view of a programmer who wishes to obtain the best performance. Moreover, the problem worsens in the usual case where different processes execute simultaneously. Automatically migrating running threads among cores and processors, depending on their behaviour, may improve the performance of parallel programs. It may also simplify their development, since the programmer avoids managing locality explicitly. Modern microprocessors include registers, usually known as hardware counters (HCs), that provide useful information at low cost. HCs are not commonly used owing to a lack of tools to obtain their data easily. In modern processors, these HCs make it possible to obtain the memory access latency incurred while resolving cache misses, and even the memory address that triggered the event. This opens the door to the development of new performance-improvement techniques based on this information. 
A procedure is presented to easily and automatically obtain data about the execution of a shared-memory parallel code on SMP multicore and NUMA systems, and to model that execution using the hardware counters of modern processors together with additional information, such as the memory access latencies observed by the different threads. This procedure is applied at runtime to model the performance of a parallel program, and the resulting information is used to improve the efficiency of the execution of these parallel codes automatically and transparently to the user.[GL]Nowadays, most computing systems are multicore or even multiprocessor. In these systems, the behaviour of each thread's memory accesses to the different memory nodes is one of the aspects that most significantly affects the performance of any code. This fact becomes ever more relevant as the so-called "memory wall" grows. In this work, the issue has been addressed from two points of view. From the point of view of a parallel application programmer, tools and models were developed to characterise the behaviour of codes and to help improve them. From the point of view of a parallel application user, a migration tool was developed that automatically selects and adapts, during execution, the placement of threads in the system to improve performance. All these tools make use of runtime performance data obtained from the Hardware Counters (HCs) present in Intel processors. Compared with software profilers, HCs provide, at low overhead, detailed and rich performance information about the functional units, the caches, main-memory accesses by the CPU, and so on. Another advantage of using them is that no modification of the source code is required. However, the types and meanings of the hardware counters vary from one architecture to another because of differences in hardware organisation. 
Furthermore, it can be difficult to correlate low-level performance metrics with the original source code. The limited number of registers available to store the counters often forces users to perform multiple measurements to collect all the desired performance metrics. Specifically, this work uses Precise Event Based Sampling (PEBS) on modern Intel processors and the Event Address Registers (EARs) on Itanium 2 processors. The Itanium 2 processor offers a set of registers, the EARs, which record the instruction and data addresses of cache misses and the instruction and data addresses of TLB misses [25]. When used to capture cache misses, the EARs allow the detection of latencies greater than 4 cycles. Since floating-point accesses always cause a miss (floating-point data are always stored in the L2D), any such access can potentially be detected. The EARs support statistical sampling by configuring a performance counter to count the occurrences of a given event. PEBS uses an interrupt mechanism together with the HCs to store a set of information about the architectural state of the processor. This information reflects the architectural state after the instruction that caused the event. Along with this information, which includes the state of all the registers, Sandy Bridge processors provide a memory-latency measurement facility. It is a means of characterising the average load latency to the different levels of the memory hierarchy. The latency is measured from the issue of the instruction until the data become globally observable, that is, when they reach the processor. Besides the latency, PEBS makes it possible to identify the data source and the memory level from which the data were read. 
Unlike the EARs, PEBS also makes it possible to measure the latency of integer operations and data stores.

    Extended collectives library for Unified Parallel C

    [Abstract] Current multicore processors mitigate single-core processor problems (e.g., the power, memory and instruction-level parallelism walls), but they have raised the programmability wall. In this scenario, the use of a suitable parallel programming model is key to facilitating the paradigm shift from sequential application development while maximizing the productivity of code developers. Here the PGAS (Partitioned Global Address Space) paradigm represents a relevant research advance for multicore systems, as its memory model, which offers a shared memory view while providing private memory to take advantage of data locality, mimics the memory structure of these architectures. Unified Parallel C (UPC), a PGAS-based extension of ANSI C, has been attracting the attention of developers in recent years. Nevertheless, the focus on improving the performance of current UPC compilers/runtimes has relegated the goal of providing higher programmability, and the available constructs have not always guaranteed good performance. Therefore, this Thesis focuses on making original contributions to the state of the art of UPC programmability by means of two main tasks: (1) presenting an analytical and empirical study of the features of the language, and (2) providing new functionalities that favor programmability without hampering performance. Thus, the main contribution of this Thesis is the development of a library of extended collective functions, which complements and improves the existing UPC standard library with programmable constructs based on efficient algorithms. A UPC MapReduce framework (UPC-MR) has also been implemented to support this highly scalable computing model for UPC applications. 
Finally, the analysis and development of relevant kernels and applications (e.g., a large parallel particle simulation based on Brownian dynamics) confirm the usability of these libraries, leading to the conclusion that UPC can provide high performance and scalability, especially in environments with a large number of threads, at a competitive development cost.

    UPCBLAS: a numerical library for Unified Parallel C with architecture-aware optimizations

    [Abstract] The popularity of Partitioned Global Address Space (PGAS) languages has increased in recent years thanks to their high programmability and to the performance achieved through an efficient exploitation of data locality, especially on hierarchical architectures such as multicore clusters. This PhD Thesis describes UPCBLAS, a parallel library for numerical computation in the PGAS Unified Parallel C (UPC) language. The routines are built on top of sequential BLAS and SparseBLAS functions and exploit the particularities of the PGAS paradigm, taking data locality into account in order to achieve good performance. However, the growing complexity of computer system hierarchies, due to the increase in the number of cores per processor, in the levels of cache (some of them shared) and in the number of processors per node, as well as high-speed interconnects, demands new optimization techniques and libraries that take advantage of their features. For this reason, this Thesis also presents Servet, a suite of benchmarks focused on detecting a set of parameters with high influence on the overall performance of multicore systems. UPCBLAS routines use the hardware parameters provided by Servet to implement optimization techniques that improve their performance. The performance of the library has been experimentally evaluated on several multicore supercomputers and compared to message-passing-based parallel numerical libraries, demonstrating good scalability and efficiency. UPCBLAS has also been used to develop more complex numerical codes, demonstrating that it is a good alternative to MPI-based libraries for increasing the productivity of numerical application developers.

    Performance counter-based strategies to improve data locality on multiprocessor systems: reordering and page migration techniques

    In this dissertation we study Precise Event-Based Sampling (PEBS) techniques to improve the performance of applications on a NUMA, Itanium 2-based system. We demonstrate that low-cost PEBS profiling can support strategies that improve the performance of an important group of computational and scientific codes at runtime. In addition, the accurate information provided by the Event Address Registers (EARs) of the Intel Itanium architecture helps foster the development of new data allocation strategies. Along this line, we have also developed a series of dynamic page migration strategies based on PEBS. Specifically, two problems are addressed: how to improve the performance of locality optimisation techniques for irregular codes at runtime, focusing on the Sparse Matrix-Vector product kernel, and how to develop strategies for dynamic page migration. To summarise, the main contributions of this dissertation are: 1. A study of the different factors that affect performance, as well as of data and thread allocation policies, on the FinisTerrae supercomputer, the target platform on which this thesis relies. 2. The implementation of a performance model for FinisTerrae. 3. The development of hardware counter-based strategies that assist reordering techniques for irregular codes in order to reduce their cost and improve their behaviour. 4. The development of novel hardware counter-guided dynamic page migration algorithms that take advantage of the new features provided by PEBS. As a software contribution, we present a user-level page-migration framework to monitor, sample and control an application at runtime.