
    A Study of Energy and Locality Effects using Space-filling Curves

    The cost of energy is becoming an increasingly important driver of the operating cost of HPC systems, adding yet another facet to the challenge of producing efficient code. In this paper, we investigate the energy implications of trading computation for locality using Hilbert and Morton space-filling curves with dense matrix-matrix multiplication. The advantage of these curves is that they exhibit an inherent tiling effect without requiring architecture-specific tuning. By accessing the matrices in the order determined by the space-filling curves, we can trade computation for locality. The index computation overhead of the Morton curve is found to be balanced against its locality and energy efficiency, while the overhead of the Hilbert curve outweighs its improvements on our test system. Comment: Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW).
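
    As a concrete illustration of how such a curve induces tiling, below is a minimal sketch (not the paper's code) of Morton-order indexing: interleaving the bits of (row, col) yields a 1-D index whose traversal visits the matrix in a Z-shaped, cache-friendly order.

    # Minimal Morton-order sketch: bit-interleave (row, col) into one index.
    def morton_encode(row: int, col: int, bits: int = 16) -> int:
        """Interleave the bits of row and col into a single Morton index."""
        index = 0
        for b in range(bits):
            index |= ((row >> b) & 1) << (2 * b + 1)
            index |= ((col >> b) & 1) << (2 * b)
        return index

    # Traversing a 4x4 matrix in Morton order: elements adjacent on the
    # curve are also close in 2-D, which is the inherent tiling effect.
    order = sorted(((r, c) for r in range(4) for c in range(4)),
                   key=lambda rc: morton_encode(*rc))
    print(order)  # [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), ...]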

    GDP: using dataflow properties to accurately estimate interference-free performance at runtime

    Multi-core memory systems commonly share resources between processors. Resource sharing improves utilization at the cost of increased inter-application interference which may lead to priority inversion, missed deadlines and unpredictable interactive performance. A key component to effectively manage multi-core resources is performance accounting which aims to accurately estimate interference-free application performance. Previously proposed accounting systems are either invasive or transparent. Invasive accounting systems can be accurate, but slow down latency-sensitive processes. Transparent accounting systems do not affect performance, but tend to provide less accurate performance estimates. We propose a novel class of performance accounting systems that achieve both performance-transparency and superior accuracy. We call the approach dataflow accounting, and the key idea is to track dynamic dataflow properties and use these to estimate interference-free performance. Our main contribution is Graph-based Dynamic Performance (GDP) accounting. GDP dynamically builds a dataflow graph of load requests and periods where the processor commits instructions. This graph concisely represents the relationship between memory loads and forward progress in program execution. More specifically, GDP estimates interference-free stall cycles by multiplying the critical path length of the dataflow graph by the estimated interference-free memory latency. GDP is very accurate, with mean IPC estimation errors of 3.4% and 9.8% for our 4- and 8-core processors, respectively. When GDP is used in a cache partitioning policy, we observe average system throughput improvements of 11.9% and 20.8% compared to partitioning using the state-of-the-art Application Slowdown Model.
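
    The core estimate can be sketched as follows (names and numbers are illustrative, not the paper's code): interference-free stall cycles are approximated by the critical path length of the load dataflow graph multiplied by the estimated interference-free memory latency.

    # Longest dependence chain through a DAG of load requests.
    def critical_path_length(graph: dict[int, list[int]]) -> int:
        memo: dict[int, int] = {}
        def depth(node: int) -> int:
            if node not in memo:
                memo[node] = 1 + max((depth(s) for s in graph.get(node, [])),
                                     default=0)
            return memo[node]
        return max(depth(n) for n in graph)

    # Toy dataflow graph: load 0 feeds loads 1 and 2; load 2 feeds load 3.
    loads = {0: [1, 2], 1: [], 2: [3], 3: []}
    interference_free_latency = 200  # cycles; an assumed estimate
    stall_estimate = critical_path_length(loads) * interference_free_latency
    print(stall_estimate)  # 3 * 200 = 600 cycles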

    HSM: a hybrid slowdown model for multitasking GPUs

    Graphics Processing Units (GPUs) are increasingly used in the cloud to accelerate compute-heavy tasks. However, GPU-compute applications stress the GPU architecture in different ways, leading to suboptimal resource utilization when a single GPU is used to run a single application. One solution is to use the GPU in a multitasking fashion to improve utilization. Unfortunately, multitasking leads to destructive interference between co-running applications which causes fairness issues and Quality-of-Service (QoS) violations. We propose the Hybrid Slowdown Model (HSM) to dynamically and accurately predict application slowdown due to interference. HSM overcomes the low accuracy of prior white-box models, and the training and implementation overheads of pure black-box models, with a hybrid approach. More specifically, the white-box component of HSM builds upon the fundamental insight that effective bandwidth utilization is proportional to DRAM row buffer hit rate, and the black-box component of HSM uses linear regression to relate row buffer hit rate to performance. HSM accurately predicts application slowdown with an average error of 6.8%, a significant improvement over the current state-of-the-art. In addition, we use HSM to guide various resource management schemes in multitasking GPUs: HSM-Fair significantly improves fairness (by 1.59x on average) compared to even partitioning, whereas HSM-QoS improves system throughput (by 18.9% on average) compared to proportional SM partitioning while maintaining the QoS target for the high-priority application in challenging mixed memory/compute-bound multi-program workloads.
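
    The hybrid structure can be sketched as follows (the interfaces and coefficients are hypothetical, standing in for the offline-trained regression): the white-box part supplies the DRAM row buffer hit rate, and a black-box linear model maps that hit rate to predicted isolated-mode performance.

    # Black-box component: linear model fitted offline, IPC ~ a * hit_rate + b.
    def predict_isolated_ipc(row_buffer_hit_rate: float,
                             a: float = 0.9, b: float = 0.2) -> float:
        return a * row_buffer_hit_rate + b

    # Slowdown = isolated performance / observed shared-mode performance.
    def predict_slowdown(shared_ipc: float, row_buffer_hit_rate: float) -> float:
        return predict_isolated_ipc(row_buffer_hit_rate) / shared_ipc

    print(predict_slowdown(shared_ipc=0.45, row_buffer_hit_rate=0.6))  # ~1.64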

    Get Out of the Valley: Power-Efficient Address Mapping for GPUs

    GPU memory systems adopt a multi-dimensional hardware structure to provide the bandwidth necessary to support 100s to 1000s of concurrent threads. On the software side, GPU-compute workloads also use multi-dimensional structures to organize the threads. We observe that these structures can combine unfavorably and create significant resource imbalance in the memory subsystem, causing low performance and poor power-efficiency. The key issue is that it is highly application-dependent which memory address bits exhibit high variability. To solve this problem, we first provide an entropy analysis approach tailored for the highly concurrent memory request behavior in GPU-compute workloads. Our window-based entropy metric captures the information content of each address bit of the memory requests that are likely to co-exist in the memory system at runtime. Using this metric, we find that GPU-compute workloads exhibit entropy valleys distributed throughout the lower order address bits. This indicates that efficient GPU address mapping schemes need to harvest entropy from broad address-bit ranges and concentrate the entropy into the bits used for channel and bank selection in the memory subsystem. This insight leads us to propose the Page Address Entropy (PAE) mapping scheme, which concentrates the entropy of the row, channel and bank bits of the input address into the bank and channel bits of the output address. PAE maps straightforwardly to hardware and can be implemented with a tree of XOR gates. PAE improves performance by 1.31x and power-efficiency by 1.25x compared to state-of-the-art permutation-based address mapping.
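
    The two ingredients can be illustrated with a small sketch (illustrative only, not the paper's implementation): per-bit entropy computed over a window of addresses, and an XOR fold that moves entropy from a high-variability row bit into a low-entropy bank/channel bit.

    import math

    # Shannon entropy of one address bit over a window of requests.
    def bit_entropy(addresses: list[int], bit: int) -> float:
        p = sum((a >> bit) & 1 for a in addresses) / len(addresses)
        if p in (0.0, 1.0):
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    # XOR one high-entropy source bit into a low-entropy destination bit.
    def xor_fold(address: int, src_bit: int, dst_bit: int) -> int:
        return address ^ (((address >> src_bit) & 1) << dst_bit)

    window = [0x1A40, 0x2A80, 0x3AC0, 0x4B00]  # hypothetical request window
    print([round(bit_entropy(window, b), 2) for b in range(8)])
    print(hex(xor_fold(0x1A40, src_bit=12, dst_bit=6)))  # 0x1a00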

    FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

    Research has shown that convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 μs latency on the MNIST dataset with 95.8% accuracy, and 21906 image classifications per second with 283 μs latency on the CIFAR-10 and SVHN datasets with 80.1% and 94.9% accuracy, respectively. To the best of our knowledge, ours are the fastest classification rates reported to date on these benchmarks. Comment: To appear in the 25th International Symposium on Field-Programmable Gate Arrays, February 2017.
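
    The arithmetic that makes binarized layers so cheap in hardware is the XNOR-popcount trick; a hedged sketch follows (the bit-packing convention is ours, not FINN's API): with weights and activations in {-1, +1} encoded as single bits (1 for +1, 0 for -1), a dot product reduces to an XNOR followed by a popcount.

    # Dot product of two n-element {-1, +1} vectors packed as bits.
    def binary_dot(x_bits: int, w_bits: int, n: int) -> int:
        xnor = ~(x_bits ^ w_bits) & ((1 << n) - 1)  # 1 where signs agree
        matches = bin(xnor).count("1")
        return 2 * matches - n  # agreements minus disagreements

    # x = [+1, -1, +1, +1] packs to 0b1101 (bit i = element i),
    # w = [+1, +1, -1, +1] packs to 0b1011.
    print(binary_dot(0b1101, 0b1011, 4))  # (+1) + (-1) + (-1) + (+1) = 0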

    Modeling emerging memory-divergent GPU applications

    Analytical performance models yield valuable architectural insight without incurring the excessive runtime overheads of simulation. In this work, we study contemporary GPU applications and find that the key performance-related behavior of such applications is distinct from that of traditional GPU applications. The key issue is that these GPU applications are memory-intensive and have poor spatial locality, which implies that the loads of different threads commonly access different cache blocks. Such memory-divergent applications quickly exhaust the number of misses the L1 cache can process concurrently, and thereby cripple the GPU's ability to use Memory-Level Parallelism (MLP) and Thread-Level Parallelism (TLP) to hide memory latencies. Our Memory Divergence Model (MDM) is able to accurately represent this behavior and thereby reduces average performance prediction error by 14x compared to the state-of-the-art GPUMech approach across our memory-divergent applications.
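
    The effect can be approximated with a back-of-the-envelope model (illustrative numbers, not MDM itself): once the cache misses issued by a warp exceed the L1 MSHR capacity, misses are serviced in batches, so memory time scales with the number of batches rather than being hidden by MLP.

    import math

    # Batches of at most `mshrs` outstanding misses, serviced back to back.
    def divergent_memory_cycles(misses_per_warp: int, mshrs: int,
                                mem_latency: int = 300) -> int:
        batches = math.ceil(misses_per_warp / mshrs)
        return batches * mem_latency

    print(divergent_memory_cycles(misses_per_warp=32, mshrs=8))  # 4 * 300 = 1200
    print(divergent_memory_cycles(misses_per_warp=4, mshrs=8))   # overlapped: 300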

    The Resurrection of Hydrocarbons: An Empirical Analysis of Changes in Investor Sentiment for Oil and Gas Companies in Developed Markets Following Russia's Invasion of Ukraine in 2022

    Following Russia's invasion of Ukraine on 24 February 2022, investor sentiment for oil and gas companies in developed markets has changed significantly. The findings of this thesis show that investors pay relatively more for oil and gas companies after the outbreak of the war than before it. The shift in investor sentiment is primarily driven by the way the war has changed demand for non-Russian gas in the short and medium term. The findings indicate that investors now anticipate a slower energy transition and expect larger future cash flows for oil and gas companies in developed markets. On the other hand, our analyses of relative idiosyncratic risk suggest that investors are pricing in higher transition risk after the outbreak of the war, possibly due to increased climate-policy uncertainty. Furthermore, we find that the effect of the Ukraine war on the excess returns of oil and gas companies is larger in non-European developed markets than in Europe. These differences indicate that investors are pricing in a relatively faster transition to renewable energy sources in Europe. This may be attributable to REPowerEU, which has guaranteed enormous investments in renewable energy sources to safeguard Europe's energy security. In addition, substantial investments have been made in LNG infrastructure to increase imports of LNG from non-European companies in the medium term.

    Managing Shared Resources in Chip Multiprocessor Memory Systems

    Chip Multiprocessors (CMPs) have become the architecture of choice for high-performance general-purpose processors. CMPs often share memory system units between processes. This may result in independent processes competing for memory bandwidth. Such competition can cause destructive interference which reduces performance predictability, decreases operating system scheduler efficiency and complicates billing for cloud computing providers. In this thesis, we reduce the effects of these problems by managing miss bandwidth. We use dynamic interference feedback to choose the number of Miss Information/Status Holding Registers (MSHRs) available in the last-level private cache of each processor. Furthermore, we provide two different allocation approaches that use this mechanism to improve system performance. The first approach uses simple measurements to decide miss bandwidth allocations and performance feedback to determine if the allocations are beneficial. The second approach selects its allocations based on a miss bandwidth performance model. This model leverages a novel interference measurement scheme called the Dynamic Interference Estimation Framework (DIEF). DIEF provides accurate estimates of the average memory latency a process would experience with exclusive access to all hardware-managed shared resources. We also investigate the effects of managing memory bandwidth to increase memory bus utilization. Here, we choose prefetches to efficiently utilize the complex DRAM structure of banks, rows and columns. This policy makes prefetches cheaper than demand accesses and increases the performance of processes with predictable access patterns. In addition, efficient prefetch scheduling reduces the degree to which prefetches interfere with the demand accesses of other processes.
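
    The first approach's feedback loop might look like the following simplified sketch (the interface and policy details are hypothetical): periodically shrink the MSHR allocation of a core's private cache, and keep the reduction only if measured performance does not regress.

    # Halve miss bandwidth while harmless; back off when performance drops.
    def tune_mshrs(current: int, baseline_perf: float, new_perf: float,
                   min_mshrs: int = 1, max_mshrs: int = 16) -> int:
        if new_perf >= baseline_perf:
            return max(min_mshrs, current // 2)  # reduction was benign; try less
        return min(max_mshrs, current * 2)       # interference hurt; restore

    alloc = 16
    for baseline, measured in [(1.00, 1.01), (1.00, 1.02), (1.00, 0.92)]:
        alloc = tune_mshrs(alloc, baseline, measured)
        print(alloc)  # 8, 4, 8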