
    A Study of Energy and Locality Effects using Space-filling Curves

    The cost of energy is becoming an increasingly important driver of the operating cost of HPC systems, adding yet another facet to the challenge of producing efficient code. In this paper, we investigate the energy implications of trading computation for locality using Hilbert and Morton space-filling curves with dense matrix-matrix multiplication. The advantage of these curves is that they exhibit an inherent tiling effect without requiring architecture-specific tuning. By accessing the matrices in the order determined by the space-filling curves, we can trade computation for locality. The index computation overhead of the Morton curve is found to be balanced against its locality and energy efficiency, while the overhead of the Hilbert curve outweighs its improvements on our test system. Comment: Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW).
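
    As a concrete illustration of how such a curve induces tiling, below is a minimal sketch (not the paper's code) of Morton-order indexing: interleaving the bits of (row, col) yields a 1-D index whose traversal visits the matrix in a Z-shaped, cache-friendly order.

    # Minimal Morton-order sketch: bit-interleave (row, col) into one index.
    def morton_encode(row: int, col: int, bits: int = 16) -> int:
        """Interleave the bits of row and col into a single Morton index."""
        index = 0
        for b in range(bits):
            index |= ((row >> b) & 1) << (2 * b + 1)
            index |= ((col >> b) & 1) << (2 * b)
        return index

    # Traversing a 4x4 matrix in Morton order: elements adjacent on the
    # curve are also close in 2-D, which is the inherent tiling effect.
    order = sorted(((r, c) for r in range(4) for c in range(4)),
                   key=lambda rc: morton_encode(*rc))
    print(order)  # [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), ...]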

    GDP: using dataflow properties to accurately estimate interference-free performance at runtime

    Multi-core memory systems commonly share resources between processors. Resource sharing improves utilization at the cost of increased inter-application interference which may lead to priority inversion, missed deadlines and unpredictable interactive performance. A key component to effectively manage multi-core resources is performance accounting which aims to accurately estimate interference-free application performance. Previously proposed accounting systems are either invasive or transparent. Invasive accounting systems can be accurate, but slow down latency-sensitive processes. Transparent accounting systems do not affect performance, but tend to provide less accurate performance estimates. We propose a novel class of performance accounting systems that achieve both performance-transparency and superior accuracy. We call the approach dataflow accounting, and the key idea is to track dynamic dataflow properties and use these to estimate interference-free performance. Our main contribution is Graph-based Dynamic Performance (GDP) accounting. GDP dynamically builds a dataflow graph of load requests and periods where the processor commits instructions. This graph concisely represents the relationship between memory loads and forward progress in program execution. More specifically, GDP estimates interference-free stall cycles by multiplying the critical path length of the dataflow graph by the estimated interference-free memory latency. GDP is very accurate, with mean IPC estimation errors of 3.4% and 9.8% for our 4- and 8-core processors, respectively. When GDP is used in a cache partitioning policy, we observe average system throughput improvements of 11.9% and 20.8% compared to partitioning using the state-of-the-art Application Slowdown Model.
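
    The core estimate can be sketched as follows (names and numbers are illustrative, not the paper's code): interference-free stall cycles are approximated by the critical path length of the load dataflow graph multiplied by the estimated interference-free memory latency.

    # Longest dependence chain through a DAG of load requests.
    def critical_path_length(graph: dict[int, list[int]]) -> int:
        memo: dict[int, int] = {}
        def depth(node: int) -> int:
            if node not in memo:
                memo[node] = 1 + max((depth(s) for s in graph.get(node, [])),
                                     default=0)
            return memo[node]
        return max(depth(n) for n in graph)

    # Toy dataflow graph: load 0 feeds loads 1 and 2; load 2 feeds load 3.
    loads = {0: [1, 2], 1: [], 2: [3], 3: []}
    interference_free_latency = 200  # cycles; an assumed estimate
    stall_estimate = critical_path_length(loads) * interference_free_latency
    print(stall_estimate)  # 3 * 200 = 600 cycles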

    HSM: a hybrid slowdown model for multitasking GPUs

    Graphics Processing Units (GPUs) are increasingly used in the cloud to accelerate compute-heavy tasks. However, GPU-compute applications stress the GPU architecture in different ways, leading to suboptimal resource utilization when a single GPU is used to run a single application. One solution is to use the GPU in a multitasking fashion to improve utilization. Unfortunately, multitasking leads to destructive interference between co-running applications which causes fairness issues and Quality-of-Service (QoS) violations. We propose the Hybrid Slowdown Model (HSM) to dynamically and accurately predict application slowdown due to interference. HSM overcomes the low accuracy of prior white-box models, and the training and implementation overheads of pure black-box models, with a hybrid approach. More specifically, the white-box component of HSM builds upon the fundamental insight that effective bandwidth utilization is proportional to DRAM row buffer hit rate, and the black-box component of HSM uses linear regression to relate row buffer hit rate to performance. HSM accurately predicts application slowdown with an average error of 6.8%, a significant improvement over the current state-of-the-art. In addition, we use HSM to guide various resource management schemes in multitasking GPUs: HSM-Fair significantly improves fairness (by 1.59x on average) compared to even partitioning, whereas HSM-QoS improves system throughput (by 18.9% on average) compared to proportional SM partitioning while maintaining the QoS target for the high-priority application in challenging mixed memory/compute-bound multi-program workloads.
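
    The hybrid structure can be sketched as follows (the interfaces and coefficients are hypothetical, standing in for the offline-trained regression): the white-box part supplies the DRAM row buffer hit rate, and a black-box linear model maps that hit rate to predicted isolated-mode performance.

    # Black-box component: linear model fitted offline, IPC ~ a * hit_rate + b.
    def predict_isolated_ipc(row_buffer_hit_rate: float,
                             a: float = 0.9, b: float = 0.2) -> float:
        return a * row_buffer_hit_rate + b

    # Slowdown = isolated performance / observed shared-mode performance.
    def predict_slowdown(shared_ipc: float, row_buffer_hit_rate: float) -> float:
        return predict_isolated_ipc(row_buffer_hit_rate) / shared_ipc

    print(predict_slowdown(shared_ipc=0.45, row_buffer_hit_rate=0.6))  # ~1.64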

    Get Out of the Valley: Power-Efficient Address Mapping for GPUs

    GPU memory systems adopt a multi-dimensional hardware structure to provide the bandwidth necessary to support 100s to 1000s of concurrent threads. On the software side, GPU-compute workloads also use multi-dimensional structures to organize the threads. We observe that these structures can combine unfavorably and create significant resource imbalance in the memory subsystem, causing low performance and poor power-efficiency. The key issue is that it is highly application-dependent which memory address bits exhibit high variability. To solve this problem, we first provide an entropy analysis approach tailored for the highly concurrent memory request behavior in GPU-compute workloads. Our window-based entropy metric captures the information content of each address bit of the memory requests that are likely to co-exist in the memory system at runtime. Using this metric, we find that GPU-compute workloads exhibit entropy valleys distributed throughout the lower order address bits. This indicates that efficient GPU address mapping schemes need to harvest entropy from broad address-bit ranges and concentrate the entropy into the bits used for channel and bank selection in the memory subsystem. This insight leads us to propose the Page Address Entropy (PAE) mapping scheme, which concentrates the entropy of the row, channel and bank bits of the input address into the bank and channel bits of the output address. PAE maps straightforwardly to hardware and can be implemented with a tree of XOR gates. PAE improves performance by 1.31x and power-efficiency by 1.25x compared to state-of-the-art permutation-based address mapping.
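
    The two ingredients can be illustrated with a small sketch (illustrative only, not the paper's implementation): per-bit entropy computed over a window of addresses, and an XOR fold that moves entropy from a high-variability row bit into a low-entropy bank/channel bit.

    import math

    # Shannon entropy of one address bit over a window of requests.
    def bit_entropy(addresses: list[int], bit: int) -> float:
        p = sum((a >> bit) & 1 for a in addresses) / len(addresses)
        if p in (0.0, 1.0):
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    # XOR one high-entropy source bit into a low-entropy destination bit.
    def xor_fold(address: int, src_bit: int, dst_bit: int) -> int:
        return address ^ (((address >> src_bit) & 1) << dst_bit)

    window = [0x1A40, 0x2A80, 0x3AC0, 0x4B00]  # hypothetical request window
    print([round(bit_entropy(window, b), 2) for b in range(8)])
    print(hex(xor_fold(0x1A40, src_bit=12, dst_bit=6)))  # 0x1a00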

    FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

    Research has shown that convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 μs latency on the MNIST dataset with 95.8% accuracy, and 21906 image classifications per second with 283 μs latency on the CIFAR-10 and SVHN datasets with 80.1% and 94.9% accuracy, respectively. To the best of our knowledge, ours are the fastest classification rates reported to date on these benchmarks. Comment: To appear in the 25th International Symposium on Field-Programmable Gate Arrays, February 2017.
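
    The arithmetic that makes binarized layers so cheap in hardware is the XNOR-popcount trick; a hedged sketch follows (the bit-packing convention is ours, not FINN's API): with weights and activations in {-1, +1} encoded as single bits (1 for +1, 0 for -1), a dot product reduces to an XNOR followed by a popcount.

    # Dot product of two n-element {-1, +1} vectors packed as bits.
    def binary_dot(x_bits: int, w_bits: int, n: int) -> int:
        xnor = ~(x_bits ^ w_bits) & ((1 << n) - 1)  # 1 where signs agree
        matches = bin(xnor).count("1")
        return 2 * matches - n  # agreements minus disagreements

    # x = [+1, -1, +1, +1] packs to 0b1101 (bit i = element i),
    # w = [+1, +1, -1, +1] packs to 0b1011.
    print(binary_dot(0b1101, 0b1011, 4))  # (+1) + (-1) + (-1) + (+1) = 0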

    Modeling emerging memory-divergent GPU applications

    Analytical performance models yield valuable architectural insight without incurring the excessive runtime overheads of simulation. In this work, we study contemporary GPU applications and find that the key performance-related behavior of such applications is distinct from that of traditional GPU applications. The key issue is that these GPU applications are memory-intensive and have poor spatial locality, which implies that the loads of different threads commonly access different cache blocks. Such memory-divergent applications quickly exhaust the number of misses the L1 cache can process concurrently, and thereby cripple the GPU's ability to use Memory-Level Parallelism (MLP) and Thread-Level Parallelism (TLP) to hide memory latencies. Our Memory Divergence Model (MDM) is able to accurately represent this behavior and thereby reduces average performance prediction error by 14x compared to the state-of-the-art GPUMech approach across our memory-divergent applications.
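
    The effect can be approximated with a back-of-the-envelope model (illustrative numbers, not MDM itself): once the cache misses issued by a warp exceed the L1 MSHR capacity, misses are serviced in batches, so memory time scales with the number of batches rather than being hidden by MLP.

    import math

    # Batches of at most `mshrs` outstanding misses, serviced back to back.
    def divergent_memory_cycles(misses_per_warp: int, mshrs: int,
                                mem_latency: int = 300) -> int:
        batches = math.ceil(misses_per_warp / mshrs)
        return batches * mem_latency

    print(divergent_memory_cycles(misses_per_warp=32, mshrs=8))  # 4 * 300 = 1200
    print(divergent_memory_cycles(misses_per_warp=4, mshrs=8))   # overlapped: 300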

    The Resurrection of Hydrocarbons: An Empirical Analysis of Changes in Investor Sentiment for Oil and Gas Companies in Developed Markets Following Russia's Invasion of Ukraine in 2022

    Following Russia's invasion of Ukraine on 24 February 2022, investor sentiment for oil and gas companies in developed markets has changed significantly. The findings of this thesis show that investors pay relatively more for oil and gas companies after the outbreak of the war than before it. The shift in investor sentiment is primarily driven by the way the war has changed demand for non-Russian gas in the short and medium term. The findings indicate that investors now anticipate a slower energy transition and expect larger future cash flows for oil and gas companies in developed markets. On the other hand, our analyses of relative idiosyncratic risk suggest that investors are pricing in higher transition risk after the outbreak of the war, possibly due to increased climate-policy uncertainty. Furthermore, we find that the effect of the Ukraine war on the excess returns of oil and gas companies is larger in non-European developed markets than in Europe. These differences indicate that investors are pricing in a relatively faster transition to renewable energy sources in Europe. This may be attributable to REPowerEU, which has guaranteed enormous investments in renewable energy sources to safeguard Europe's energy security. In addition, substantial investments have been made in LNG infrastructure to increase imports of LNG from non-European companies in the medium term.

    Managing Shared Resources in Chip Multiprocessor Memory Systems

    Chip Multiprocessors (CMPs) have become the architecture of choice for high-performance general-purpose processors. CMPs often share memory system units between processes. This may result in independent processes competing for memory bandwidth. Such competition can cause destructive interference which reduces performance predictability, decreases operating system scheduler efficiency and complicates billing for cloud computing providers. In this thesis, we reduce the effects of these problems by managing miss bandwidth. We use dynamic interference feedback to choose the number of Miss Information/Status Holding Registers (MSHRs) available in the last-level private cache of each processor. Furthermore, we provide two different allocation approaches that use this mechanism to improve system performance. The first approach uses simple measurements to decide miss bandwidth allocations and performance feedback to determine if the allocations are beneficial. The second approach selects its allocations based on a miss bandwidth performance model. This model leverages a novel interference measurement scheme called the Dynamic Interference Estimation Framework (DIEF). DIEF provides accurate estimates of the average memory latency a process would experience with exclusive access to all hardware-managed shared resources. We also investigate the effects of managing memory bandwidth to increase memory bus utilization. Here, we choose prefetches to efficiently utilize the complex DRAM structure of banks, rows and columns. This policy makes prefetches cheaper than demand accesses and increases the performance of processes with predictable access patterns. In addition, efficient prefetch scheduling reduces the degree to which prefetches interfere with the demand accesses of other processes.
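
    The first approach's feedback loop might look like the following simplified sketch (the interface and policy details are hypothetical): periodically shrink the MSHR allocation of a core's private cache, and keep the reduction only if measured performance does not regress.

    # Halve miss bandwidth while harmless; back off when performance drops.
    def tune_mshrs(current: int, baseline_perf: float, new_perf: float,
                   min_mshrs: int = 1, max_mshrs: int = 16) -> int:
        if new_perf >= baseline_perf:
            return max(min_mshrs, current // 2)  # reduction was benign; try less
        return min(max_mshrs, current * 2)       # interference hurt; restore

    alloc = 16
    for baseline, measured in [(1.00, 1.01), (1.00, 1.02), (1.00, 0.92)]:
        alloc = tune_mshrs(alloc, baseline, measured)
        print(alloc)  # 8, 4, 8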