
    Locality-Adaptive Parallel Hash Joins Using Hardware Transactional Memory

    Previous work [1] has claimed that the best-performing implementation of in-memory hash joins is based on (radix-)partitioning of the build-side input. Indeed, despite the overhead of partitioning, the benefits from increased cache locality and synchronization-free parallelism in the build phase outweigh the costs when the input data is randomly ordered. However, many datasets already exhibit significant spatial locality (i.e., non-randomness) due to the way data items enter the database: through periodic ETL or trickle-loaded in the form of transactions. In such cases, the first benefit of partitioning, increased locality, is largely irrelevant. In this paper, we demonstrate how hardware transactional memory (HTM) can render the other benefit, freedom from synchronization, irrelevant as well. Specifically, using careful analysis and engineering, we develop an adaptive hash join implementation that outperforms parallel radix-partitioned hash joins as well as sort-merge joins on data with high spatial locality. In addition, we show how, through lightweight (less than 1% overhead) runtime monitoring of the transaction abort rate, our implementation can detect inputs with low spatial locality and dynamically fall back to radix-partitioning of the build-side input. The result is a hash join implementation that is more than 3 times faster than the state-of-the-art on high-locality data and never more than 1% slower.
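    A minimal sketch of the core mechanism, assuming GCC's RTM intrinsics from <immintrin.h> (compiled with -mrtm on a TSX-capable CPU): each build-side insert runs as a hardware transaction with a spinlock fallback, and cheap counters approximate the runtime abort-rate monitor that triggers the fall-back to radix partitioning. All names are invented for illustration; this is not the paper's implementation.

```cpp
// Sketch: HTM-protected hash-table build with abort-rate monitoring.
// Illustrative only; assumes GCC RTM intrinsics and a TSX-capable CPU.
#include <immintrin.h>
#include <atomic>
#include <cstdint>
#include <utility>
#include <vector>

struct HtmHashTable {
    std::vector<std::vector<std::pair<uint64_t, uint64_t>>> buckets;
    std::atomic<bool> locked{false};               // fallback spinlock
    std::atomic<uint64_t> attempts{0}, aborts{0};  // monitoring counters

    explicit HtmHashTable(size_t n) : buckets(n) {}

    void insert(uint64_t key, uint64_t payload) {
        auto& bucket = buckets[key % buckets.size()];
        attempts.fetch_add(1, std::memory_order_relaxed);
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            // Standard lock-elision pattern: abort if the fallback lock is
            // held, so transactional and locked inserts never overlap.
            if (locked.load(std::memory_order_relaxed)) _xabort(0xff);
            bucket.emplace_back(key, payload);
            _xend();
        } else {
            aborts.fetch_add(1, std::memory_order_relaxed);
            while (locked.exchange(true, std::memory_order_acquire)) {}
            bucket.emplace_back(key, payload);
            locked.store(false, std::memory_order_release);
        }
    }

    // A high abort rate indicates conflicting (low-locality) inserts; the
    // caller would then switch to a radix-partitioned build instead.
    bool should_fall_back(double threshold = 0.05) const {
        uint64_t a = attempts.load(), x = aborts.load();
        return a > 1000 && double(x) / double(a) > threshold;
    }
};
```

    On high-locality input, concurrent inserts touch mostly disjoint cache lines, so transactions rarely conflict and the fallback lock is almost never taken; a rising abort rate is exactly the low-locality signal the paper monitors.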

    Improving the Performance and Energy Efficiency of GPGPU Computing through Adaptive Cache and Memory Management Techniques

    As the performance and energy-efficiency requirements of GPGPUs have risen, GPGPU memory management has evolved to meet them by employing hardware caches and heterogeneous memory. These techniques can improve GPGPUs by providing lower memory latency and higher bandwidth. However, they do not always guarantee improved performance and energy efficiency, due to small cache sizes and the heterogeneity of memory nodes. While prior works have proposed various techniques to address this issue, relatively little work has been done to investigate holistic support for memory management. In this dissertation, we analyze performance pathologies and propose several techniques to improve GPGPU memory management. First, we investigate the effectiveness of advanced cache indexing (ACI) for high-performance and energy-efficient GPGPU computing. Specifically, we discuss the designs of various static and adaptive cache-indexing schemes and present an implementation for GPGPUs. We then quantify and analyze the effectiveness of the ACI schemes using a cycle-accurate GPGPU simulator. Our quantitative evaluation shows that ACI schemes achieve significant performance and energy-efficiency gains over the conventional baseline indexing scheme. We also analyze the performance sensitivity of ACI to key architectural parameters (i.e., capacity, associativity, and ICN bandwidth) and to the cache-indexing latency, and demonstrate that ACI continues to achieve high performance in various settings. Second, we propose IACM, integrated adaptive cache management for high-performance and energy-efficient GPGPU computing. Based on our analysis of GPGPU performance pathologies, IACM integrates state-of-the-art adaptive cache-management techniques (i.e., cache indexing, bypassing, and warp limiting) in a unified architectural framework to eliminate these pathologies. Our quantitative evaluation demonstrates that IACM significantly improves the performance and energy efficiency of various GPGPU workloads over the baseline architecture (by 98.1% and 61.9% on average, respectively) and achieves considerably higher performance than the state-of-the-art technique (361.4% at maximum and 7.7% on average). Furthermore, IACM delivers significant performance and energy-efficiency gains over the baseline GPGPU architecture even when it is enhanced with advanced architectural technologies (e.g., higher capacity and associativity). Third, we propose bandwidth- and latency-aware page placement (BLPP) for GPGPUs with heterogeneous memory. BLPP analyzes the characteristics of an application and determines the optimal page-allocation ratio between GPU and CPU memory. Based on this ratio, BLPP dynamically allocates pages across the heterogeneous memory nodes. Our experimental results show that BLPP considerably outperforms the baseline and the state-of-the-art technique (by 13.4% and 16.7%, respectively) and performs similarly to the static-best version (1.2% difference), which requires extensive offline profiling.
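    To make the first part concrete, here is a minimal sketch of one well-known static ACI scheme, bitwise XOR indexing, next to conventional modulo indexing. The cache geometry is an assumption for illustration; the dissertation's actual schemes and parameters may differ.

```cpp
// Sketch: conventional vs. XOR-based cache set indexing.
// Geometry below is an assumption, not the dissertation's configuration.
#include <cstdint>

constexpr unsigned kOffsetBits = 7;   // assume 128 B cache lines
constexpr unsigned kSetBits    = 6;   // assume 64 sets

// Conventional indexing: low-order bits directly above the line offset.
inline uint32_t conventional_index(uint64_t addr) {
    return (addr >> kOffsetBits) & ((1u << kSetBits) - 1);
}

// XOR indexing: fold the next group of tag bits onto the set bits, which
// spreads power-of-two strides (common in GPGPU kernels) across all sets.
inline uint32_t xor_index(uint64_t addr) {
    uint64_t block = addr >> kOffsetBits;
    uint32_t lo = block & ((1u << kSetBits) - 1);
    uint32_t hi = (block >> kSetBits) & ((1u << kSetBits) - 1);
    return lo ^ hi;
}
```

    The point of such schemes is that strided GPGPU access patterns no longer map many blocks onto a few sets, reducing conflict misses in the small per-core caches at negligible indexing cost.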

    Prioritized Garbage Collection: Explicit GC Support for Software Caches

    Programmers routinely trade space for time to increase performance, often in the form of caching or memoization. In managed languages like Java or JavaScript, however, this space-time tradeoff is complex. Using more space translates into higher garbage collection costs, especially at the limit of available memory. Existing runtime systems provide limited support for space-sensitive algorithms, forcing programmers into difficult and often brittle choices about provisioning. This paper presents prioritized garbage collection, a cooperative programming language and runtime solution to this problem. Prioritized GC provides an interface similar to soft references, called priority references, which identify objects that the collector can reclaim eagerly if necessary. The key difference is an API for defining the policy that governs when priority references are cleared and in what order. Application code specifies a priority value for each reference and a target memory bound. The collector reclaims references, lowest priority first, until the total memory footprint of the cache fits within the bound. We use this API to implement a space-aware least-recently-used (LRU) cache, called a Sache, that is a drop-in replacement for existing caches, such as Google's Guava library. The garbage collector automatically grows and shrinks the Sache in response to available memory and workload with minimal provisioning information from the programmer. Using a Sache, it is almost impossible for an application to experience a memory leak, memory pressure, or an out-of-memory crash caused by software caching.
    Comment: to appear in OOPSLA 201
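    A minimal sketch of the clearing policy that priority references expose, modeled here as an ordinary user-level container rather than as collector support; the class and its interface are invented for illustration.

```cpp
// Sketch: priority-ordered reclamation against a target memory bound.
// In the paper this policy runs inside the garbage collector; here it is
// modeled as a plain container with invented names.
#include <cstddef>
#include <map>
#include <string>
#include <utility>

class PriorityCache {
    // priority -> (key, approximate size in bytes); the multimap keeps
    // entries sorted so reclamation walks from the lowest priority upward.
    std::multimap<long, std::pair<std::string, size_t>> entries_;
    size_t footprint_ = 0, bound_;
public:
    explicit PriorityCache(size_t bound) : bound_(bound) {}

    void put(const std::string& key, size_t size, long priority) {
        entries_.emplace(priority, std::make_pair(key, size));
        footprint_ += size;
        reclaim();  // enforce the bound eagerly
    }

    // Drop lowest-priority entries until the footprint fits the bound,
    // mirroring how the prioritized collector clears priority references.
    void reclaim() {
        while (footprint_ > bound_ && !entries_.empty()) {
            auto lowest = entries_.begin();
            footprint_ -= lowest->second.second;
            entries_.erase(lowest);
        }
    }
};
```

    An LRU cache such as the Sache falls out of this interface by using the last-access timestamp as the priority, so the least-recently-used entry is always reclaimed first.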

    HeteroCore GPU to exploit TLP-resource diversity


    Adaptive Delivery in Caching Networks

    The problem of content delivery in caching networks is investigated for scenarios where multiple users request identical files. Redundant user demands are likely when the file-popularity distribution is highly non-uniform or when user demands are positively correlated. An adaptive method is proposed for the delivery of redundant demands in caching networks. Based on the redundancy pattern in the current demand vector, the proposed method decides between transmitting uncoded messages or the coded messages of [1]. Moreover, a lower bound on the delivery rate of redundant requests is derived from a cut-set bound argument. The performance of the adaptive method is investigated through numerical examples of the delivery rate for several specific demand vectors, as well as the average delivery rate of a caching network with correlated requests. The adaptive method is shown to considerably reduce the gap between the non-adaptive delivery rate and the lower bound; in some specific cases, this gap shrinks by almost 50% for the average rate.
    Comment: 8 pages, 8 figures. Submitted to IEEE Transactions on Communications in 2015. A short version of this article was published as an IEEE Communications Letter with DOI: 10.1109/LCOMM.2016.255814
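    A simplified rate-level sketch of the adaptive decision under the standard symmetric setup (K users, N files, caches holding M files, d distinct files requested): compare the coded delivery rate of [1] against an uncoded multicast that transmits the uncached part of each distinct file once, and pick the smaller. The uncoded model here is an assumption for illustration, not the paper's exact scheme.

```cpp
// Sketch: adaptive choice between coded and uncoded delivery as demand
// redundancy varies. Simplified rate model; not the paper's exact scheme.
#include <cstdio>

// Symmetric coded-caching rate of [1]: K users, N files, M files per cache.
double coded_rate(int K, int N, double M) {
    double t = K * M / N;                  // average replication factor
    return K * (1.0 - M / N) / (1.0 + t);  // rate in units of one file
}

// Uncoded multicast: send the uncached part of each distinct file once.
double uncoded_rate(int d, int N, double M) {
    return d * (1.0 - M / N);
}

int main() {
    const int K = 10, N = 20;
    const double M = 4.0;
    for (int d = 1; d <= K; ++d) {         // d = number of distinct demands
        double rc = coded_rate(K, N, M);
        double ru = uncoded_rate(d, N, M);
        std::printf("d=%2d  coded=%.2f  uncoded=%.2f  -> %s\n",
                    d, rc, ru, ru < rc ? "uncoded" : "coded");
    }
    return 0;
}
```

    With numbers like these, uncoded delivery wins when only a few distinct files are requested and the coded messages of [1] win as demands become distinct, which is the regime distinction the adaptive method exploits.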

    Impact of parameter variations on circuits and microarchitecture

    Parameter variations, which are increasing along with advances in process technologies, affect both timing and power. Variability must be considered at both the circuit and microarchitectural design levels to keep pace with performance scaling and to keep power consumption within reasonable limits. This article presents an overview of the main sources of variability and surveys variation-tolerant circuit and microarchitectural approaches.