36 research outputs found

    Bandwidth-Aware Dynamic Prefetch Configuration for IBM POWER8

    Full text link
    © 2020 IEEE. Personal use of this material is permitted. Permissíon from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertisíng or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.[EN] Advanced hardware prefetch engines are being integrated in current high-performance processors. Prefetching can boost the performance of most applications, however, the induced bandwidth consumption can lead the system to a high contention for main memory bandwidth, which is a scarce resource in current multicores. In such a case, the system performance can be severely damaged. This article characterizes the applications' behavior in an IBM POWER8 machine, which presents many prefetch settings, varying the bandwidth contention. The study reveals that the best prefetch setting for each application depends on the main memory bandwidth availability, that is, it depends on the co-running applications. Based on this study, we propose Bandwidth-Aware Prefetch Configuration (BAPC) a scalable adaptive prefetching algorithm that improves the performance of multi-program workloads. BAPC increases the performance of the applications in a 12, 15, and 16 percent of 6-, 8-, and 10-application workloads over the IBM POWER8 default configuration. In addition, BAPC reduces bandwidth consumption in 39, 42, and 45 percent, respectively.This work was supported in part by Ministerio de Ciencia, Innovacion y Universidades and the European ERDF under Grant RTI2018-098156-B-C51, Generalitat Valenciana under Grant AICO/2019/317, and Universitat Politenica de Valencia (PAID-06-18) under Grant SP20180140.Navarro, C.; Feliu-Pérez, J.; Petit Martí, SV.; Gómez Requena, ME.; Sahuquillo Borrás, J. (2020). Bandwidth-Aware Dynamic Prefetch Configuration for IBM POWER8. IEEE Transactions on Parallel and Distributed Systems. 31(8):1970-1982. https://doi.org/10.1109/TPDS.2020.2982392S1970198231

    DeepP: Deep Learning Multi-Program Prefetch Configuration for the IBM POWER 8

    Full text link
    [EN] Current multi-core processors implement sophisticated hardware prefetchers, that can be configured by application (PID),to improve the system performance. When running multiple applications, each application can present different prefetch requirements,hence different configurations can be used. Setting the optimal prefetch configuration for each application is a complex task since itdoes not only depend on the application characteristics but also on the interference at the shared memory resources (e.g. memorybandwidth). In his paper, we proposeDeepP, a deep learning approach for the IBM POWER8 that identifies at run-time the bestprefetch configuration for each application in a workload. To this end, the neural network predicts the performance of each applicationunder the studied prefetch configurations by using a set of performance events. The prediction accuracy of the network is improvedthanks to a dynamic training methodology that allows learning the impact of dynamic changes of the prefetch configuration onperformance. At run-time, the devised network infers the best prefetch configuration for each application and adjusts it dynamically.Experimental results show that the proposed approach improves performance, on average, by 5,8%, 6,7%, and 15,8% compared tothe default prefetch configuration across different 6-, 8-, and 10-application workloads, respectively.This work was supported in part by Ministerio de Ciencia, Innovacion y Universidades and the European ERDF under Grant RTI2018-098156-B-C51, in part by Generalitat Valenciana under Grant AICO/2021/266, and in part by Ayudas Contratos predoctorales UPV -subprograma 1 (PAID-01-20). The work of Josue Feliu was supported by a Juan de la Cierva Formacion Contract under Grant FJC2018-036021-I.Lurbe-Sempere, M.; Feliu-Pérez, J.; Petit Martí, SV.; Gómez Requena, ME.; Sahuquillo Borrás, J. (2022). DeepP: Deep Learning Multi-Program Prefetch Configuration for the IBM POWER 8. IEEE Transactions on Computers. 71(10):2646-2658. https://doi.org/10.1109/TC.2021.313999726462658711

    PRISM: an intelligent adaptation of prefetch and SMT levels

    Get PDF
    Current microprocessors include hardware to optimize some specifics workloads. In general, these hardware knobs are set on a default configuration on the booting process of the machine. This default behavior cannot be beneficial for all types of workloads and they are not controlled by anyone but the end user, who needs to know what configuration is the best one for the workload running. Some of these knobs are: (1) the Simultaneous MultiThreading level, which specifies the number of threads that can run simultaneously on a physical CPU, and (2) the data prefetch engine, that manages the prefetches on memory. Parallel programming models are here to stay, and one programming model that succeed in allowing programmers to easily parallelize applications is Open Multi Processing (OMP). Also, the architecture of microprocessors is getting more complex that end users cannot afford to optimize their workloads for all the architectural details. These architectural knobs can help to increase performance but it is needed an automatic and adaptive system managing them. In this work we propose an independent library for OpenMP runtimes to increase performance up to 220% (14.7% on average) while reducing dynamic power consumption up to 13% (2% on average) on a real POWER8 processor

    Heterogeneous system and application communication modeling

    Get PDF
    With the end of Dennard scaling, high-performance computing increasingly relies on heterogeneous systems with specialized hardware to improve application performance. This trend has driven up the complexity of high-performance software development, as developers must manage multiple programming systems and develop system-tuned code to utilize specialized hardware. In addition, it has exacerbated existing challenges of data placement as the specialized hardware often has local memories to fuel its computational demands. In addition to using appropriate software resources to target application computation at the best hardware for the job, application developers now must manage data movement and placement within their application, which also must be specifically tuned to the target system. Instead of relying on the application developer to have specialized knowledge of system characteristics and specialized expertise in multiple programming systems, this work proposes a heterogeneous system communication library that automatically chooses data location and data movement for high-performance application development and execution on heterogeneous systems. This work presents the foundational components of that library: a systematic approach for characterization of system communication links and application communication demands

    Performance Analysis of IBM POWER8 Processors

    Get PDF
    Práce se zabývá systémem IBM Power8 v porovnání s dnes běžně používanými řešeními s procesory Intel Xeon. Výkonnost je vyhodnocována nejen na úrovni celého systému, ale také na úrovni jednotlivých vláken a jader a paměti. Různé metriky jsou demonstrovány na typických optimalizovaných algoritmech. Testovaný stroj Power8 disponuje extrémně rychlou pamětí poskytující rychlost až 145 GB/s mezi pamětí a procesorem, které se dnešní procesory Intel nevyrovnají. Výpočetní síla je pouze srovnatelná (Násobení matic) nebo slabší (N-body simulace, dělení, složitější algoritmy) v porovnání s aktuálním Intel Haswell-EP. Procesor IBM Power8 je dnes schopný konkurovat procesorům Intel a bude zajímavé sledovat následující generaci Power9 a jeho výkonnost v porovnání s aktuálními a budoucími procesory Intel.This paper describes the IBM Power8 system in comparison to the Intel Xeon processors, widely used in today’s solutions. The performance is not evaluated only on the whole system level but also on the level of threads, cores and a memory. Different metrics are demonstrated on typical optimized algorithms. The benchmarked Power8 processor provides extremely fast memory providing sustainable bandwidth up to 145 GB/s between main memory and processor, which Intel is unable to compete. Computation power is comparable (Matrix multiplication) or worse (N-body simulation, division, more complex algorithms) in comparison with current Intel Haswell-EP. The IBM Power8 is able to compete Intel processors these days and it will be interesting to observe the future generation of Power9 and its performance in comparison to current and future Intel processors.

    Heterogeneous system and application communication modeling

    Get PDF
    With the end of Dennard scaling, high-performance computing increasingly relies on heterogeneous systems with specialized hardware to improve application performance. This trend has driven up the complexity of high-performance software development, as developers must manage multiple programming systems and develop system-tuned code to utilize specialized hardware. In addition, it has exacerbated existing challenges of data placement as the specialized hardware often has local memories to fuel its computational demands. In addition to using appropriate software resources to target application computation at the best hardware for the job, application developers now must manage data movement and placement within their application, which also must be specifically tuned to the target system. Instead of relying on the application developer to have specialized knowledge of system characteristics and specialized expertise in multiple programming systems, this work proposes a heterogeneous system communication library that automatically chooses data location and data movement for high-performance application development and execution on heterogeneous systems. This work presents the foundational components of that library: a systematic approach for characterization of system communication links and application communication demands

    Adaptive Prefetching and Cache Partitioning for Multicore Processors

    Get PDF
    El acceso a la memoria principal en los procesadores actuales supone un importante cuello de botella para las prestaciones, dado que los diferentes núcleos compiten por el limitado ancho de banda de memoria, agravando la brecha entre las prestaciones del procesador y las de la memoria principal. Distintas técnicas atacan este problema, siendo las más relevantes el uso de jerarquías de caché multinivel y la prebúsqueda. Las cachés jerárquicas aprovechan la localidad temporal y espacial que en general presentan los programas en el acceso a los datos, para mitigar las enormes latencias de acceso a memoria principal. Para limitar el número de accesos a la memoria DRAM, fuera del chip, los procesadores actuales cuentan con grandes cachés de último nivel (LLC). Para mejorar su utilización y reducir costes, estas cachés suelen compartirse entre todos los núcleos del procesador. Este enfoque mejora significativamente el rendimiento de la mayoría de las aplicaciones en comparación con el uso de cachés privados más pequeños. Compartir la caché, sin embargo, presenta una problema importante: la interferencia entre aplicaciones. La prebúsqueda, por otro lado, trae bloques de datos a las cachés antes de que el procesador los solicite, ocultando la latencia de memoria principal. Desafortunadamente, dado que la prebúsqueda es una técnica especulativa, si no tiene éxito puede contaminar la caché con bloques que no se usarán. Además, las prebúsquedas interfieren con los accesos a memoria normales, tanto los del núcleo que emite las prebúsquedas como los de los demás. Esta tesis se centra en reducir la interferencia entre aplicaciones, tanto en las caché compartidas como en el acceso a la memoria principal. Para reducir la interferencia entre aplicaciones en el acceso a la memoria principal, el mecanismo propuesto en esta disertación regula la agresividad de cada prebuscador, activando o desactivando selectivamente algunos de ellos, dependiendo de su rendimiento individual y de los requisitos de ancho de banda de memoria principal de los otros núcleos. Con respecto a la interferencia en cachés compartidos, esta tesis propone dos técnicas de particionado para la LLC, las cuales otorgan más espacio de caché a las aplicaciones que progresan más lentamente debido a la interferencia entre aplicaciones. La primera propuesta de particionado de caché requiere hardware específico no disponible en procesadores comerciales, por lo que se ha evaluado utilizando un entorno de simulación. La segunda propuesta de particionado de caché presenta una familia de políticas que superan las limitaciones en el número de particiones y en el número de vías de caché disponibles mediante la agrupación de aplicaciones en clústeres y la superposición de particiones de caché, por lo que varias aplicaciones comparten las mismas vías. Dado que se ha implementado utilizando los mecanismos para el particionado de la LLC que presentan algunos procesadores Intel modernos, esta propuesta ha sido evaluada en una máquina real. Los resultados experimentales muestran que el mecanismo de prebúsqueda selectiva propuesto en esta tesis reduce el número de solicitudes de memoria principal en un 20%, cosa que se traduce en mejoras en la equidad del sistema, el rendimiento y el consumo de energía. Por otro lado, con respecto a los esquemas de partición propuestos, en comparación con un sistema sin particiones, ambas propuestas reducen la iniquidad del sistema en un promedio de más del 25%, independientemente de la cantidad de aplicaciones en ejecución, y esta reducción en la injusticia no afecta negativamente al rendimiento.Accessing main memory represents a major performance bottleneck in current processors, since the different cores compete among them for the limited offchip bandwidth, aggravating even more the so called memory wall. Several techniques have been applied to deal with the core-memory performance gap, with the most preeminent ones being prefetching and hierarchical caching. Hierarchical caches leverage the temporal and spacial locality of the accessed data, mitigating the huge main memory access latencies. To limit the number of accesses to the off-chip DRAM memory, current processors feature large Last Level Caches. These caches are shared between all the cores to improve the utilization of the cache space and reduce cost. This approach significantly improves the performance of most applications compared to using smaller private caches. Cache sharing, however, presents an important shortcoming: the interference between applications. Prefetching, on the other hand, brings data blocks to the caches before they are requested, hiding the main memory latency. Unfortunately, since prefetching is a speculative technique, inaccurate prefetches may pollute the cache with blocks that will not be used. In addition, the prefetches interfere with the regular memory requests, both the ones from the application running on the core that issued the prefetches and the others. This thesis focuses on reducing the inter-application interference, both in the shared cache and in the access to the main memory. To reduce the interapplication interference in the access to main memory, the proposed approach regulates the aggressiveness of each core prefetcher, and selectively activates or deactivates some of them, depending on their individual performance and the main memory bandwidth requirements of the other cores. With respect to interference in shared caches, this thesis proposes two LLC partitioning techniques that give more cache space to the applications that have their progress diminished due inter-application interferences. The first cache partitioning proposal requires dedicated hardware not available in commercial processors, so it has been evaluated using a simulation framework. The second proposal dealing with cache partitioning presents a family of partitioning policies that overcome the limitations in the number of partitions and the number of available ways by grouping applications and overlapping cache partitions, so multiple applications share the same ways. Since it has been implemented using the cache partitioning features of modern Intel processors it has been evaluated in a real machine. Experimental results show that the proposed selective prefetching mechanism reduces the number of main memory requests by 20%, which translates to improvements in unfairness, performance, and energy consumption. On the other hand, regarding the proposed partitioning schemes, compared to a system with no partitioning, both reduce unfairness more than 25% on average, regardless of the number of applications running in the multicore, and this reduction in unfairness does not negatively affect the performance.L'accés a la memòria principal en els processadors actuals suposa un important coll d'ampolla per a les prestacions, ja que els diferents nuclis competeixen pel limitat ample de banda de memòria, agreujant la bretxa entre les prestacions del processador i les de la memòria principal. Diferents tècniques ataquen aquest problema, sent les més rellevants l'ús de jerarquies de memòria cau multinivell i la prebusca. Les memòries cau jeràrquiques aprofiten la localitat temporal i espacial que en general presenten els programes en l'accés a les dades per mitigar les enormes latències d'accés a memòria principal. Per limitar el nombre d'accessos a la memòria DRAM, fora del xip, els processadors actuals compten amb grans caus d'últim nivell (LLC). Per millorar la seva utilització i reduir costos, aquestes memòries cau solen compartir-se entre tots els nuclis del processador. Aquest enfocament millora significativament el rendiment de la majoria de les aplicacions en comparació amb l'ús de caus privades més menudes. Compartir la memòria cau, no obstant, presenta una problema important: la interferencia entre aplicacions. La prebusca, per altra banda, porta blocs de dades a les memòries cau abans que el processador els sol·licite, ocultant la latència de memòria principal. Desafortunadament, donat que la prebusca és una técnica especulativa, si no té èxit pot contaminar la memòria cau amb blocs que no fan falta. A més, les prebusques interfereixen amb els accessos normals a memòria, tant els del nucli que emet les prebusques com els dels altres. Aquesta tesi es centra en reduir la interferència entre aplicacions, tant en les cau compartides com en l'accés a la memòria principal. Per reduir la interferència entre aplicacions en l'accés a la memòria principal, el mecanismo proposat en aquesta dissertació regula l'agressivitat de cada prebuscador, activant o desactivant selectivament alguns d'ells, en funció del seu rendiment individual i dels requisits d'ample de banda de memòria principal dels altres nuclis. Pel que fa a la interferència en caus compartides, aquesta tesi proposa dues tècniques de particionat per a la LLC, les quals atorguen més espai de memòria cau a les aplicacions que progressen més lentament a causa de la interferència entre aplicacions. La primera proposta per al particionat de memòria cau requereix hardware específic no disponible en processadors comercials, per la qual cosa s'ha avaluat utilitzant un entorn de simulació. La segona proposta de particionat per a memòries cau presenta una família de polítiques que superen les limitacions en el nombre de particions i en el nombre de vies de memòria cau disponibles mitjan¿ cant l'agrupació d'aplicacions en clústers i la superposició de particions de memòria cau, de manera que diverses aplicacions comparteixen les mateixes vies. Atès que s'ha implementat utilitzant els mecanismes per al particionat de la LLC que ofereixen alguns processadors Intel moderns, aquesta proposta s'ha avaluat en una màquina real. Els resultats experimentals mostren que el mecanisme de prebusca selectiva proposat en aquesta tesi redueix el nombre de sol·licituds a la memòria principal en un 20%, cosa que es tradueix en millores en l'equitat del sistema, el rendiment i el consum d'energia. Per altra banda, pel que fa als esquemes de particiónat proposats, en comparació amb un sistema sense particions, ambdues propostes redueixen la iniquitat del sistema en més d'un 25% de mitjana, independentment de la quantitat d'aplicacions en execució, i aquesta reducció en la iniquitat no afecta negativament el rendiment.Selfa Oliver, V. (2018). Adaptive Prefetching and Cache Partitioning for Multicore Processors [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/112423TESI

    A research-oriented course on Advanced Multicore Architecture: Contents and active learning methodologies

    Full text link
    [EN] The fast evolution of multicore processors makes it difficult for professors to offer computer architecture courses with updated contents. To deal with this shortcoming that could discourage students, the most appropriate solution is a research-oriented course based on current microprocessor industry trends. Additionally, we also seek to improve the students' skills by applying active learning methodologies, where teachers act as guiders and resource providers while students take the responsibility for their learning. In this paper, we present the Advanced Multicore Architecture (AMA) course, which follows a research-oriented approach to introduce students in architectural breakthroughs and uses active learning methodologies to enable students to develop practical research skills such as critical analysis of research papers or communication abilities. To this end five main activities are used: (i) lectures dealing with key theoretical concepts, (ii) paper review & discussion, (iii) research-oriented practical exercises, (iv) lab sessions with a state-of-the-art multicore simulator, and (v) paper presentation. An important part of all these activities is driven by active learning methodologies. Special emphasis is put on the practical side by allocating 40% of the time to labs and exercises. This work also includes an assessment study that analyzes both the course contents and the used methodology (both of them compared to other courses).This work was supported in part by the Spanish Ministerio de Economia y Competitividad (MINECO) and by Plan E funds under Grant TIN2014-62246-EXP and Grant TIN2015-66972-C5-1-R, and by Generalitat Valenciana under grant AICO/2016/059. Authors also would like to thank Onur Mutlu for making available online his valuable teaching material.Petit Martí, SV.; Sahuquillo Borrás, J.; Gómez Requena, ME.; Selfa-Oliver, V. (2017). A research-oriented course on Advanced Multicore Architecture: Contents and active learning methodologies. Journal of Parallel and Distributed Computing. 105:63-72. https://doi.org/10.1016/j.jpdc.2017.01.011S637210

    Energy-efficient data prefetch buffering for low-end embedded processors

    Get PDF
    An energy-efficient architecture should jointly optimize energy consumption and throughput, as captured by the Energy-Delay-Square Product (ED2P) metric. This paper introduces a prefetch data buffer micro-architecture, which achieves that goal with the aid of software-inserted control words to govern the prefetch process. The proposed architecture is aimed at low-end embedded processors, which, so as to reduce energy consumption, lack a cache-based memory hierarchy. By identifying after compilation which data should be prefetched and modifying the object code, the rate of prefetch misses is reduced. And by pre-computing memory addresses using auxiliary software after compilation and modifying the object code, address computation by hardware at run time is avoided, reducing pipeline stalls and, thus, improving throughput. Additionally in the case of branches, by prefetching two data items at any one time, alternative instruction outcomes are anticipated. The paper contains results from running a range of well-known and representative benchmarks on the proposed architecture. There was an improvement of 6−20% compared to an unbuffered architecture in execution times when tested over those seven benchmarks. Furthermore, the average ED2P for the buffered architecture when normalized against the same architecture without buffering was found to vary between 54% and 90% according to benchmarking, though there is a cost in code size increase. That is to say, for the benchmarks tested there was a net energy efficiency improvement of between 10% and 46% in comparison with the equivalent unbuffered architecture with a lower area overhead

    Improving the efficiency of multicore systems through software and hardware cooperation

    Get PDF
    Increasing processors' clock frequency has traditionally been one of the largest drivers of performance improvements for computing systems. In the first half of the 2000s, however, it became clear that continuing to increase frequency was not a viable solution anymore. Power consumption and power density became prohibitively costly, and processor manufacturers moved to multicore designs. This new paradigm introduced multiple challenges not present in single-threaded processors. Applications running on multicore systems share different resources such as the cache hierarchy and the memory bus. Resource sharing occurs at much finer degree when cores support multithreading as well. In this case, applications share the processor¿s pipeline too. Running multiple applications on the same processor allows for better utilization of its resources¿which otherwise may just lie idle if an application does not use them. But sharing resources may create interferences between applications running on the system. While the degree of these interferences depends on the nature of the applications, it is typically desirable to reduce them in order to improve efficiency. Most currently available processors expose a set of sensors and actuators that software can use to monitor and control resource sharing among the applications running on a system. But it is typically up to end users to analyze their workloads of interest and to manually use the actuators provided by the processor. Because of this, in many cases the different mechanisms for controlling resource sharing are simply left unused. In this thesis we present different techniques that rely on software/hardware interaction to monitor and improve application interference¿and thus improve system efficiency. First we conduct a quantitative study showing the benefits of hardware/software cooperation on system efficiency. Then we narrow our focus on a given hardware knob: data prefetching. Specifically we develop and evaluate several adaptive solutions for improving the efficiency of hardware data prefetching on multicore systems. The impact of the solutions presented in this thesis, however, goes beyond the particular case of data prefetching. They serve as illustrative examples for developing software/hardware cooperation schemes that enable the efficient sharing of resources in multicore systems.L'increment de la freqüència dels processadors ha estat tradicionalment un dels majors responsables de la millora de rendiment dels sistemes de computació. Tanmateix, a la primera meitat del segle XXI es va fer evident que continuar incrementant la freqüència ja no era una solució viable. El consum de potència i la densitat de potència van esdevenir massa costosos, i els dissenyadors de processadors van adoptar dissenys "multicore". Aquest nou paradigma va introduir molts reptes que no eren presents als processadors "single-threaded". Les aplicacions que s'executen a processadors multicore comparteixen diferent recursos tal i com la jerarquia de "cache" i el bus de memòria. En processadors que suporten "multi-threading" encara comparteixen més recursos: en aquest cas les aplicacions també comparteixen els recursos del "pipeline". Executar diverses aplicacions en un processador permet una millor utilització dels seus recursos, que d'altra forma podrien no tenir cap utilitat si l'aplicació en execució no els utilitzés. Compartir recursos, però, pot crear interferències entre les aplicacions executant-se al sistema. Encara que el nivell d'aquestes interferències depèn de les aplicacions que s'executen conjuntament, normalment és desitjable reduir-les per tal de millorar la eficiència. Molts dels processadors actuals exposen un conjunt sensors i actuadors que el software pot utilitzar per supervisar i controlar la compartició de recursos entre les diferents aplicacions executant-se al sistema. En general és responsabilitat dels usuaris analitzar les aplicacions del seu interès i després configurar els actuadors de forma manual. Això suposa una dificultat afegida i per aquest motiu, en molts casos els diferents mecanismes per controlar com es comparteixen els recursos senzillament no es fan servir. En aquesta tesi, presentem diferents tècniques basades en la interacció del software i el hardware per supervisar i reduir la interferència entre aplicacions, i d'aquesta forma millorar la eficiència del sistema. Primer es presenta un estudi quantitatiu que mostra els beneficis de la cooperació entre software i hardware en la eficiència del sistema. Després el focus es centra en un actuador en concret: "data prefetching". En concret, desenvolupem i avaluem diferents solucions adaptatives per millorar la eficiència de hardware data prefetching a sistemes multicore. L'impacte de les solucions presentades a aquesta tesi, però, no es limiten a aquest cas concret. Al contrari, serveixen com exemples il·lustratius per desenvolupar tècniques de cooperació software i hardware que permetin compartir els recursos en sistemes multicore de forma eficient. La compartició de recursos en un processador és un factor crucial que afecta significativament a la seva eficiència. Però, altres nivells d'un sistema de computació també comparteixen recursos. En grans instal·lacions de computació com els "datacenters", les aplicacions també poden compartir altres recursos com la xarxa o l'emmagatzemament. Com a cas d'estudi considerem el disseny d'un sistema d'un sistema de comptabilitat d'energia basat en la cooperació entre el software i el hardware per a grans instal·lacions de computació. En aquest context, explorem diverses alternatives per als sensors i actuadors que es requereixen, així com també analitzem els diferents aspectes claus en el disseny d'un sistema d'aquestes característiques
    corecore