    Mechanisms to improve the efficiency of hardware data prefetchers

    A well-known performance bottleneck in computer architecture is the so-called memory wall. This term refers to the huge disparity between on-chip and off-chip access latencies. Historically, the operating frequency of processors has increased at a steady pace, while most advances in memory technology have been in density, not speed. Nowadays, the trend of ever-increasing processor operating frequencies has been replaced by an increasing number of CPU cores per chip. This will continue to exacerbate the memory wall problem, as several cores now have to compete for off-chip data access. As multi-core systems pack more and more cores, the access latency observed by each core is expected to continue to increase. Although the causes of the memory wall have changed, it is, and will continue to be in the near future, a very significant challenge in computer architecture design. Prefetching has been an important technique to mitigate the effect of the memory wall. With prefetching, data or instructions that are expected to be used in the near future are speculatively moved up in the memory hierarchy, where the access latency is smaller. This dissertation focuses on hardware data prefetching at the last cache level before memory (last-level cache, LLC). Prefetching at the LLC usually offers the best performance increase, as this is where the disparity between hit and miss latencies is the largest. Hardware prefetchers operate by examining the miss address stream generated by the cache and identifying patterns and correlations between the misses. Most prefetchers divide the global miss stream into several sub-streams, according to some pre-specified criteria. This process is known as localization. The benefits of localization are well established: it increases the accuracy of the predictions and helps filter out spurious, non-predictable misses.
However, localization has one important drawback: since the misses are classified into different sub-streams, important chronological information is lost. A consequence of this is that most localizing prefetchers issue prefetches in an untimely manner, fetching data too far in advance. This behavior promotes data pollution in the cache. The first part of this thesis proposes a new class of prefetchers based on the novel concept of Stream Chaining. With Stream Chaining, the prefetcher tries to reconstruct the chronological information lost in the process of localization, while at the same time keeping its benefits. We describe two novel Stream Chaining prefetching algorithms based on two state-of-the-art localizing prefetchers: PC/DC and C/DC. We show how both prefetchers issue prefetches in a more timely manner than their non-chaining counterparts, increasing performance by as much as 55% (10% on average) on a suite of sequential benchmarks, while consuming roughly the same amount of memory bandwidth. In order to hide the effects of the memory wall, hardware prefetchers are usually configured to aggressively prefetch as much data as possible. However, a highly aggressive prefetcher can have negative effects on performance. Factors such as prefetching accuracy, cache pollution and memory bandwidth consumption have to be taken into account. This is especially important in the context of multi-core systems, where typically each core has its own prefetching engine and there is high competition for accessing memory. Several prefetch throttling and filtering mechanisms have been proposed to maximize the effect of prefetching in multi-core systems. The general strategy behind these heuristics is to promote prefetches that are more likely to be used and cause less interference. Traditionally, these methods operate at the source level, i.e., directly in the prefetch engine they are assigned to control.
In multi-core systems, all prefetches are aggregated in a FIFO-like data structure called the Prefetch Request Queue (PRQ), where they wait to be dispatched to memory. The second part of this thesis shows that a traditional FIFO PRQ does not promote timely prefetching behavior and usually undermines part of the performance benefits achieved by throttling heuristics. We propose a novel approach to prefetch aggressiveness control in multi-cores that performs throttling at the PRQ (i.e., global) level, using global knowledge of the metrics of all prefetchers and information about the global state of the PRQ. To do this, we introduce the Resizable Prefetching Heap (RPH), a data structure modeled after a binary heap that promotes timely dispatch of prefetches as well as fairness in the distribution of prefetching bandwidth. The RPH is designed as a drop-in replacement for traditional FIFO PRQs. We compare our proposal against a state-of-the-art source-level throttling algorithm (HPAC) in an 8-core system. Unlike previous research, we evaluate both multiprogrammed and multithreaded (parallel) workloads, using a modern prefetching algorithm (C/DC). Our experimental results show that RPH-based throttling increases the throttling performance benefits obtained by HPAC by as much as 148% (53.8% on average) in multiprogrammed workloads and as much as 237% (22.5% on average) in parallel benchmarks, while consuming roughly the same amount of memory bandwidth. When comparing the speedup over fixed-degree prefetching, RPH increased the average speedup of HPAC from 7.1% to 10.9% in multiprogrammed workloads, and from 5.1% to 7.9% in parallel benchmarks.
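
The localization step described in this abstract can be illustrated with a minimal sketch. It assumes misses arrive as (PC, address) pairs, localizes them by PC, and applies a simplified delta-correlation lookup to each sub-stream. The function names, the prefetch degree, and the sample addresses are all hypothetical; this is not the thesis's PC/DC or Stream Chaining implementation.

```python
from collections import defaultdict


def localize_by_pc(miss_stream):
    """Split a global miss stream of (pc, address) pairs into per-PC sub-streams."""
    streams = defaultdict(list)
    for pc, addr in miss_stream:
        streams[pc].append(addr)
    return dict(streams)


def predict_next(addresses, degree=2):
    """Simplified delta correlation over one address stream.

    Finds the most recent earlier occurrence of the last two deltas and
    replays up to `degree` of the deltas that followed it.
    """
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    if len(deltas) < 3:
        return []  # not enough history to correlate
    key = (deltas[-2], deltas[-1])
    # Walk backwards over earlier delta pairs, newest first.
    for i in range(len(deltas) - 2, 0, -1):
        if (deltas[i - 1], deltas[i]) == key:
            predictions = []
            addr = addresses[-1]
            for delta in deltas[i + 1:i + 1 + degree]:
                addr += delta
                predictions.append(addr)
            return predictions
    return []


# Hypothetical interleaved miss stream: PC 0x400 follows a +1, +2 pattern,
# while PC 0x500 touches unrelated addresses.
misses = [(0x400, 100), (0x500, 5000), (0x400, 101), (0x400, 103),
          (0x500, 9000), (0x400, 104), (0x400, 106), (0x500, 20),
          (0x400, 107)]

streams = localize_by_pc(misses)
print(predict_next(streams[0x400]))              # [109, 110]
print(predict_next([addr for _, addr in misses]))  # []
```

In this toy input the localized stream is predictable while the global stream is not, which is the benefit the abstract attributes to localization. The sketch also shows the drawback: once the misses are grouped per PC, their relative order across sub-streams is no longer visible.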

    Accurate and complexity-effective spatial pattern prediction

    Recent research suggests that there are large variations in a cache's spatial usage, both within and across programs. Unfortunately, conventional caches typically employ fixed cache line sizes to balance the exploitation of spatial and temporal locality, and to avoid prohibitive cache fill bandwidth demands. The resulting inability of conventional caches to exploit spatial variations leads to sub-optimal performance and unnecessary cache power dissipation. This paper describes the Spatial Pattern Predictor (SPP), a cost-effective hardware mechanism that accurately predicts reference patterns within a spatial group (i.e., a contiguous region of data in memory) at runtime. The key observation enabling an accurate, yet low-cost, SPP design is that spatial patterns correlate well with instruction addresses and data reference offsets within a cache line. We require only a small amount of predictor memory to store the predicted patterns. Simulation results for a 64-Kbyte 2-way set-associative L1 data cache with 64-byte lines show that: (1) a 256-entry tagless direct-mapped SPP can achieve, on average, a prediction coverage of 95%, over-predicting the patterns by only 8%, (2) assuming a 70nm process technology, the SPP helps reduce leakage energy in the base cache by 41% on average, incurring less than 1% performance degradation, and (3) prefetching spatial groups of up to 512 bytes using SPP improves execution time by 33% on average and up to a factor of two.
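
A tagless, direct-mapped pattern table of the kind mentioned above might be sketched as follows. The XOR index hash, the 8-block spatial group, and the class and method names are assumptions made for illustration; the abstract does not give the paper's exact indexing function or pattern encoding.

```python
class SpatialPatternPredictor:
    """Toy tagless, direct-mapped table of spatial patterns (bit vectors)."""

    def __init__(self, entries=256, blocks_per_group=8):
        self.entries = entries
        self.blocks_per_group = blocks_per_group
        self.table = [0] * entries  # one bit vector per entry, no tags

    def _index(self, pc, offset):
        # Hypothetical hash combining the instruction address and the
        # data reference offset, as the abstract's key observation suggests.
        return (pc ^ offset) % self.entries

    def train(self, pc, offset, touched_blocks):
        """Record which blocks of the spatial group were actually referenced."""
        mask = 0
        for block in touched_blocks:
            if not 0 <= block < self.blocks_per_group:
                raise ValueError("block index outside the spatial group")
            mask |= 1 << block
        self.table[self._index(pc, offset)] = mask

    def predict(self, pc, offset):
        """Return the block indices predicted to be referenced."""
        mask = self.table[self._index(pc, offset)]
        return [b for b in range(self.blocks_per_group) if (mask >> b) & 1]


spp = SpatialPatternPredictor()
spp.train(pc=0x4010, offset=3, touched_blocks=[0, 1, 5])
print(spp.predict(pc=0x4010, offset=3))  # [0, 1, 5]
```

Because the table is tagless, two different (PC, offset) pairs that hash to the same entry share one pattern. That is the kind of trade-off that keeps the predictor small, at the cost of occasional mispredictions.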

    Die-Stacked DRAM Caches for Servers: Hit Ratio, Latency, or Bandwidth? Have It All with Footprint Cache

    Recent research advocates using large die-stacked DRAM caches to break the memory bandwidth wall. Existing DRAM cache designs fall into one of two categories: block-based and page-based. The former organize data in conventional blocks (e.g., 64B), ensuring low off-chip bandwidth utilization, but co-locate tags and data in the stacked DRAM, incurring high lookup latency. Furthermore, such designs suffer from low hit ratios due to poor temporal locality. In contrast, page-based caches, which manage data at larger granularity (e.g., 4KB pages), allow for reduced tag array overhead and fast lookup, and leverage high spatial locality at the cost of moving large amounts of data on and off the chip. This paper introduces Footprint Cache, an efficient die-stacked DRAM cache design for server processors. Footprint Cache allocates data at the granularity of pages, but identifies and fetches only those blocks within a page that will be touched during the page's residency in the cache (i.e., the page's footprint). In doing so, Footprint Cache eliminates the excessive off-chip traffic associated with page-based designs, while preserving their high hit ratio, small tag array overhead, and low lookup latency. Cycle-accurate simulation results of a 16-core server with up to 512MB Footprint Cache indicate a 57% performance improvement over a baseline chip without a die-stacked cache. Compared to a state-of-the-art block-based design, our design improves performance by 13% while reducing dynamic energy of stacked DRAM by 24%.
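
The idea of allocating whole pages but fetching only a predicted footprint can be sketched roughly as below. The `trigger` key used to look up a previous footprint, the dictionary-based bookkeeping, and the class name are assumptions made for illustration; the abstract does not describe how the real design predicts footprints.

```python
BLOCKS_PER_PAGE = 64  # a 4 KB page divided into 64 B blocks


class FootprintCacheSketch:
    """Toy model: page-granularity allocation, block-granularity fetch."""

    def __init__(self):
        self.resident = {}  # page -> {"valid", "touched", "trigger"}
        self.history = {}   # trigger -> blocks touched during a past residency
        self.blocks_fetched = 0

    def access(self, page, block, trigger):
        entry = self.resident.get(page)
        if entry is None:
            # Page miss: fetch the predicted footprint plus the demanded block.
            valid = set(self.history.get(trigger, ())) | {block}
            self.blocks_fetched += len(valid)
            entry = {"valid": valid, "touched": set(), "trigger": trigger}
            self.resident[page] = entry
        elif block not in entry["valid"]:
            # Page present but block not predicted: fetch it on demand.
            entry["valid"].add(block)
            self.blocks_fetched += 1
        entry["touched"].add(block)

    def evict(self, page):
        # Remember the observed footprint for future allocations.
        entry = self.resident.pop(page)
        self.history[entry["trigger"]] = set(entry["touched"])


cache = FootprintCacheSketch()
for blk in (0, 1, 9):
    cache.access(page=7, block=blk, trigger="T")  # learning pass: 3 demand fetches
cache.evict(7)
for blk in (0, 1, 9):
    cache.access(page=8, block=blk, trigger="T")  # footprint fetched up front
print(cache.blocks_fetched)  # 6, versus 2 * BLOCKS_PER_PAGE for full-page fetches
```

In this toy run, the second page brings in its three blocks with a single allocation and the later accesses hit, while only 3 of the 64 blocks cross the off-chip interface.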

    Improving the efficiency of multicore systems through software and hardware cooperation

    Increasing processors' clock frequency has traditionally been one of the largest drivers of performance improvements for computing systems. In the first half of the 2000s, however, it became clear that continuing to increase frequency was no longer a viable solution. Power consumption and power density became prohibitively costly, and processor manufacturers moved to multicore designs. This new paradigm introduced multiple challenges not present in single-threaded processors. Applications running on multicore systems share different resources such as the cache hierarchy and the memory bus. Resource sharing occurs at a much finer granularity when cores support multithreading as well. In this case, applications share the processor's pipeline too. Running multiple applications on the same processor allows for better utilization of its resources, which otherwise may just lie idle if an application does not use them. But sharing resources may create interference between applications running on the system. While the degree of this interference depends on the nature of the applications, it is typically desirable to reduce it in order to improve efficiency. Most currently available processors expose a set of sensors and actuators that software can use to monitor and control resource sharing among the applications running on a system. But it is typically up to end users to analyze their workloads of interest and to manually use the actuators provided by the processor. Because of this, in many cases the different mechanisms for controlling resource sharing are simply left unused. In this thesis we present different techniques that rely on software/hardware interaction to monitor and reduce application interference, and thus improve system efficiency. First we conduct a quantitative study showing the benefits of hardware/software cooperation on system efficiency. Then we narrow our focus to a given hardware knob: data prefetching. 
Specifically, we develop and evaluate several adaptive solutions for improving the efficiency of hardware data prefetching on multicore systems. The impact of the solutions presented in this thesis, however, goes beyond the particular case of data prefetching. They serve as illustrative examples for developing software/hardware cooperation schemes that enable the efficient sharing of resources in multicore systems. Resource sharing within a processor is a crucial factor that significantly affects its efficiency. However, other levels of a computing system also share resources. In large computing facilities such as datacenters, applications may also share other resources such as the network or storage. As a case study, we consider the design of an energy accounting system for large computing facilities, based on cooperation between software and hardware. In this context, we explore several alternatives for the required sensors and actuators, and analyze the key aspects in the design of such a system.

    A novel access pattern-based multi-core memory architecture

    High-Performance Computing (HPC) applications increasingly run on heterogeneous multi-core platforms. The main reasons for the growing popularity of these architectures are their low power consumption and their throughput-oriented nature. However, this throughput requires that data also be supplied to the multi-core system at high throughput. This makes efficient management of on-chip and off-chip memory data transfers necessary, which is a significant challenge. Complex regular and irregular memory data transfer patterns are becoming dominant in a range of application domains, including scientific computing and image and signal processing. Data accesses can be arranged in independent patterns that an efficient memory management scheme can exploit. Software-based approaches using general-purpose caches and on-chip memories are beneficial to some extent. However, efficient data management for throughput-oriented devices could be improved by providing hardware mechanisms that exploit knowledge of access patterns in memory management and in the scheduling of accesses for a heterogeneous multi-core architecture. The focus of this thesis is to present architectural explorations for a novel access pattern-based multi-core memory architecture. The research covers four main aspects of the memory system: i) a uni-core memory system for regular data patterns, ii) a multi-core memory system for regular data patterns, iii) a uni-core memory system for irregular data patterns, and iv) a multi-core memory system for irregular data patterns.

    Modeling the power consumption of computing systems and applications through machine learning techniques

    The number of computing systems has been continuously increasing in recent years. The popularity of data centers has turned them into some of the most power-demanding facilities. The use of data centers is divided between high performance computing (HPC) and Internet services, or Clouds. Computing speed is crucial in HPC environments, while on Cloud systems it may vary according to their service-level agreements. Some data centers even offer hybrid environments; all of them are energy hungry. The present work is a study of power models for computing systems. These models allow a better understanding of the energy consumption of computers, and can be used as a first step towards better monitoring and management policies for such systems, either to enhance their energy savings or to account for the energy in order to charge end-users. Energy management and control policies are subject to many limitations. Most energy-aware scheduling algorithms use restricted power models which have a number of open problems. Previous work on power modeling of computing systems proposed the use of system information to monitor the power consumption of applications. However, these models are either too specific to a given kind of application or lack accuracy. This report presents techniques to enhance the accuracy of power models by tackling issues ranging from the acquisition of measurements to the definition of a generic workload that enables the creation of a generic model, i.e., a model that can be used for heterogeneous workloads. To achieve such models, the use of machine learning techniques is proposed. Machine learning models are architecture adaptive and are used as the core of this research. More specifically, this work evaluates the use of artificial neural networks (ANN) and linear regression (LR) as machine learning techniques to perform non-linear statistical modeling. Such models are created through a data-driven approach, enabling adaptation of their parameters based on the information collected while running synthetic workloads. The use of machine learning techniques aims to achieve highly accurate application- and system-level estimators. The proposed methodology is architecture independent and can be easily reproduced in new environments. The results show that the use of artificial neural networks enables the creation of highly accurate estimators. However, it cannot be applied at the process level due to modeling constraints. For that case, predefined models can be calibrated to achieve fair results. The use of process-level models enables the estimation of virtual machines' power consumption, which can be used for Cloud provisioning.
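
As a rough illustration of the data-driven calibration idea, the sketch below fits a one-variable linear power model by ordinary least squares. The single CPU-utilization feature, the function names, and the sample measurements are hypothetical; the thesis's models use richer inputs and also evaluate neural networks.

```python
def fit_linear_power_model(utilization, power_watts):
    """Least-squares fit of power = idle + slope * utilization."""
    n = len(utilization)
    if n < 2 or n != len(power_watts):
        raise ValueError("need at least two paired samples")
    mean_u = sum(utilization) / n
    mean_p = sum(power_watts) / n
    sxx = sum((u - mean_u) ** 2 for u in utilization)
    if sxx == 0:
        raise ValueError("utilization samples must not all be equal")
    sxy = sum((u - mean_u) * (p - mean_p)
              for u, p in zip(utilization, power_watts))
    slope = sxy / sxx
    idle = mean_p - slope * mean_u
    return idle, slope


def estimate_power(model, utilization):
    """Apply a fitted (idle, slope) model to a utilization value."""
    idle, slope = model
    return idle + slope * utilization


# Made-up samples from a synthetic workload: (CPU utilization, watts).
model = fit_linear_power_model([0.0, 0.5, 1.0], [50.0, 75.0, 100.0])
print(model)                       # (50.0, 50.0)
print(estimate_power(model, 0.2))  # 60.0
```

A model like this is only as general as the workload used to calibrate it, which is why the abstract stresses the definition of a generic workload.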

    FPGA-based high-performance neural network acceleration

    In the last ten years, Artificial Intelligence through Deep Neural Networks (DNNs) has penetrated virtually every aspect of science, technology, and business. Advances are rapid with thousands of papers being published annually. Many types of DNNs have been and continue to be developed -- in this thesis, we address Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Graph Neural Networks (GNNs) -- each with a different set of target applications and implementation challenges. The overall problem for all of these Neural Networks (NNs) is that their target applications generally pose stringent constraints on latency and throughput, but also have strict accuracy requirements. Much research has therefore gone into all aspects of improving NN quality and performance: algorithms, code optimization, acceleration with GPUs, and acceleration with hardware, both dedicated ASICs and off-the-shelf FPGAs. In this thesis, we concentrate on the last of these approaches. There have been many previous efforts in creating hardware to accelerate NNs. The problem designers face is that optimal NN models typically have significant irregularities, making them hardware unfriendly. One commonly used approach is to train NN models to follow regular computation and data patterns. This approach, however, can hurt the models' accuracy or lead to models with non-negligible redundancies. This dissertation takes a different approach. Instead of regularizing the model, we create architectures friendly to irregular models. Our thesis is that high-accuracy and high-performance NN inference and training can be achieved by creating a series of novel irregularity-aware architectures for Field-Programmable Gate Arrays (FPGAs). In four different studies on four different NN types, we find that this approach results in speedups of 2.1x to 3255x compared with carefully selected prior art; for inference, there is no change in accuracy. 
The bulk of this dissertation revolves around these studies, the various workload balancing techniques, and the resulting NN acceleration architectures. In particular, we propose four different architectures to handle, respectively, data-structure-level, operation-level, bit-level, and model-level irregularities. At the data structure level, we propose AWB-GCN, which uses runtime workload rebalancing to handle Sparse Matrix Multiplication (SpMM) on extremely sparse and unbalanced input. With GNN inference as a case study, AWB-GCN achieves over 90% system efficiency, guarantees efficient off-chip memory access, and provides considerable speedups over CPUs (3255x), GPUs (80x), and a prior ASIC accelerator (5.1x). At the operation level, we propose O3BNN-R, which can detect redundant operations and prune them at run time. This works even for operations that are highly data-dependent and unpredictable. With Binarized NNs (BNNs) as a case study, O3BNN-R can prune over 30% of the operations, without any accuracy loss, yielding speedups over state-of-the-art implementations on CPUs (1122x), GPUs (2.3x), and FPGAs (2.1x). At the bit level, we propose CQNN. CQNN embeds a Coarse-Grained Reconfigurable Architecture (CGRA) which can be programmed at runtime to support NN functions with various data-width requirements. Results show that CQNN can deliver microsecond-level Quantized NN (QNN) inference. At the model level, we propose FPDeep, especially for training. In order to address model-level irregularity, FPDeep uses novel model partitioning schemes to balance workload and storage among nodes. By using a hybrid of model and layer parallelism to train DNNs, FPDeep avoids the large gap that commonly occurs between training and testing accuracy due to improper convergence to sharp minimizers (caused by large training batches). Results show that FPDeep provides scalable, fast, and accurate training and leads to 6.6x higher energy efficiency than GPUs.
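
The workload imbalance that motivates rebalancing in sparse matrix multiplication can be seen in a small software model. The sketch compares a static round-robin assignment of matrix rows to processing elements (PEs) with a greedy assignment that balances non-zero counts. The function names, the PE count, and the per-row non-zero counts are hypothetical, and the greedy heuristic is a stand-in for illustration, not AWB-GCN's hardware rebalancing scheme.

```python
import heapq


def round_robin_loads(nnz_per_row, num_pes):
    """Static assignment: row i goes to PE i mod num_pes."""
    loads = [0] * num_pes
    for row, nnz in enumerate(nnz_per_row):
        loads[row % num_pes] += nnz
    return loads


def greedy_balanced_loads(nnz_per_row, num_pes):
    """Assign the heaviest rows first, each to the least-loaded PE so far."""
    heap = [(0, pe) for pe in range(num_pes)]
    heapq.heapify(heap)
    for nnz in sorted(nnz_per_row, reverse=True):
        load, pe = heapq.heappop(heap)
        heapq.heappush(heap, (load + nnz, pe))
    return [load for load, _ in sorted(heap, key=lambda item: item[1])]


def utilization(loads):
    """Average PE load divided by the peak load (1.0 means perfectly balanced)."""
    peak = max(loads)
    return sum(loads) / (len(loads) * peak) if peak else 1.0


# Made-up non-zero counts for an unbalanced sparse matrix with 8 rows.
nnz = [40, 1, 1, 1, 30, 1, 1, 1]
static = round_robin_loads(nnz, num_pes=4)
balanced = greedy_balanced_loads(nnz, num_pes=4)
print(static, round(utilization(static), 3))      # [70, 2, 2, 2] 0.271
print(balanced, round(utilization(balanced), 3))  # [40, 30, 3, 3] 0.475
```

Even the balanced assignment stays well below full utilization here because whole rows are indivisible in this sketch. Finer-grained or runtime redistribution would be needed to approach the efficiency figures quoted in the abstract.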

    Proceedings of the 7th International Conference on PGAS Programming Models

