66 research outputs found
Improving the Performance and Energy Efficiency of GPGPU Computing through Adaptive Cache and Memory Management Techniques
As the performance and energy-efficiency requirements of GPGPUs have risen, GPGPU memory management techniques have evolved to meet them by employing hardware caches and utilizing heterogeneous memory. These techniques can improve GPGPUs by providing lower memory latency and higher bandwidth. However, they do not always guarantee improved performance and energy efficiency due to small cache sizes and the heterogeneity of the memory nodes. While prior works have proposed various techniques to address this issue, relatively little work has been done to investigate holistic support for memory management techniques.
In this dissertation, we analyze performance pathologies and propose various techniques to improve memory management. First, we investigate the effectiveness of advanced cache indexing (ACI) for high-performance and energy-efficient GPGPU computing. Specifically, we discuss the designs of various static and adaptive cache indexing schemes and present implementations for GPGPUs. We then quantify and analyze the effectiveness of the ACI schemes based on a cycle-accurate GPGPU simulator. Our quantitative evaluation shows that ACI schemes achieve significant performance and energy-efficiency gains over the baseline conventional indexing scheme. We also analyze the performance sensitivity of ACI to key architectural parameters (i.e., capacity, associativity, and ICN bandwidth) and to the cache indexing latency, and demonstrate that ACI continues to achieve high performance in various settings.
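A common static ACI scheme is XOR-based indexing, which folds higher address bits into the set index so that power-of-two strided accesses spread across sets instead of colliding. A minimal sketch contrasting it with conventional modulo indexing (set counts and line sizes are illustrative, not taken from the dissertation):

```python
def conventional_index(addr, num_sets=32, line_size=128):
    # conventional indexing: low-order block-address bits select the set
    return (addr // line_size) % num_sets

def xor_index(addr, num_sets=32, line_size=128):
    # XOR-based indexing: fold higher tag bits into the index bits so
    # power-of-two strided accesses spread across sets instead of colliding
    block = addr // line_size
    return (block ^ (block // num_sets)) % num_sets
```

For example, addresses strided by `num_sets * line_size` all map to set 0 under conventional indexing but land in distinct sets under XOR indexing.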
Second, we propose IACM, an integrated adaptive cache management scheme for high-performance and energy-efficient GPGPU computing. Based on the performance pathology analysis of GPGPUs, we integrate state-of-the-art adaptive cache management techniques (i.e., cache indexing, bypassing, and warp limiting) in a unified architectural framework to eliminate performance pathologies. Our quantitative evaluation demonstrates that IACM significantly improves the performance and energy efficiency of various GPGPU workloads over the baseline architecture (i.e., 98.1% and 61.9% on average, respectively) and achieves considerably higher performance than the state-of-the-art technique (i.e., 361.4% at maximum and 7.7% on average). Furthermore, IACM delivers significant performance and energy-efficiency gains over the baseline GPGPU architecture even when enhanced with advanced architectural technologies (e.g., higher capacity and associativity).
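The unified framework coordinates three per-access decisions: which set an address indexes into, whether a line should bypass the cache, and how many warps may run concurrently. A hypothetical sketch of such a decision flow (class name, interfaces, and thresholds are illustrative, not the dissertation's actual design):

```python
# Illustrative sketch of an integrated cache-management controller;
# all thresholds and interfaces are assumptions, not IACM's real ones.
class IACMController:
    def __init__(self, num_sets=32, max_warps=48):
        self.num_sets = num_sets
        self.max_warps = max_warps
        self.active_warps = max_warps

    def cache_set(self, block_addr):
        # adaptive (here: XOR-based) cache indexing
        return (block_addr ^ (block_addr // self.num_sets)) % self.num_sets

    def should_bypass(self, reuse_distance, capacity_lines):
        # bypass lines whose predicted reuse distance exceeds cache capacity
        return reuse_distance > capacity_lines

    def adjust_warps(self, miss_rate):
        # throttle concurrent warps when thrashing is detected,
        # and restore parallelism when the cache behaves well
        if miss_rate > 0.5 and self.active_warps > 8:
            self.active_warps -= 4
        elif miss_rate < 0.2 and self.active_warps < self.max_warps:
            self.active_warps += 4
```

Keeping the three mechanisms in one controller lets each decision see the others' state, which is the point of integrating them rather than applying each in isolation.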
Third, we propose bandwidth- and latency-aware page placement (BLPP) for GPGPUs with heterogeneous memory. BLPP analyzes the characteristics of an application and determines the optimal page allocation ratio between the GPU and CPU memory. Based on this ratio, BLPP dynamically allocates pages across the heterogeneous memory nodes. Our experimental results show that BLPP considerably outperforms the baseline and the state-of-the-art technique (by 13.4% and 16.7%, respectively) and performs similarly to the static-best version (i.e., 1.2% difference), which requires extensive offline profiling.
High-Quality Fault-Resiliency in Fat-Tree Networks (Extended Abstract)
Coupling regular topologies with optimized routing algorithms is key in
pushing the performance of interconnection networks of HPC systems. In this
paper we present Dmodc, a fast deterministic routing algorithm for Parallel
Generalized Fat-Trees (PGFTs) which minimizes congestion risk even under
massive topology degradation caused by equipment failure. It applies a
modulo-based computation of forwarding tables among switches closer to the
destination, using only knowledge of subtrees for pre-modulo division. Dmodc
allows complete rerouting of topologies with tens of thousands of nodes in less
than a second, which greatly helps centralized fabric management react to
faults with high-quality routing tables and no impact on running applications
in current and future very large-scale HPC clusters. We compare Dmodc against
routing algorithms available in the InfiniBand control software (OpenSM) first
for routing execution time to show feasibility at scale, and then for
congestion risk under degradation to demonstrate robustness. The latter
comparison is done using static analysis of routing tables under random
permutation (RP), shift permutation (SP) and all-to-all (A2A) traffic patterns.
Results show that, under heavy degradation, Dmodc's A2A and RP congestion risks
are similar to those of the most stable algorithms compared, and its SP
congestion risk remains near-optimal up to 1% random degradation.
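The modulo-based computation can be illustrated for a single switch: destinations inside the switch's own subtree are routed downward, while all others are spread over the upward ports by a division by the subtree size (the "pre-modulo division") followed by a modulo over the number of up ports. A simplified sketch (parameters are illustrative, not Dmodc's full PGFT formulation):

```python
def next_port(dest_id, subtree_base, subtree_size, num_down, num_up):
    # destinations inside this switch's subtree route downward
    if subtree_base <= dest_id < subtree_base + subtree_size:
        child_size = subtree_size // num_down
        return ('down', (dest_id - subtree_base) // child_size)
    # pre-modulo division by subtree size, then modulo over the up ports
    return ('up', (dest_id // subtree_size) % num_up)
```

Because each switch's choice depends only on the destination index and local subtree knowledge, forwarding tables can be recomputed independently and in parallel, which is what makes full rerouting of very large fabrics feasible in under a second.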
Silicon circuits for chip-to-chip communications in multi-socket server board interconnects
Multi-socket server boards (MSBs) exploit the interconnection of multiple processor chips to form powerful cache-coherent systems, with the interconnect technology comprising a key element in boosting processing performance. Here, we present an overview of the current electrical interconnects for MSBs, outlining the main challenges currently faced. We propose the use of silicon photonics (SiPho) to advance interconnect throughput, socket connectivity, and energy efficiency in MSB layouts, enabling a flat-topology wavelength division multiplexing (WDM)-based point-to-point (p2p) optical MSB interconnect scheme. We demonstrate WDM SiPho transceivers (TxRxs) co-assembled with their electronic circuits for up to 50 Gb/s line rate and 400 Gb/s aggregate data transmission, and SiPho arrayed waveguide grating routers that can offer collision-less time-of-flight connectivity for up to 16 nodes. The capacity can scale to 2.8 Tb/s for an eight-socket MSB when the line rate scales to 50 Gb/s, yielding up to 69% energy reduction compared with the QuickPath Interconnect and highlighting the feasibility of single-hop p2p interconnects in MSB systems with >4 sockets.
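The collision-less connectivity of an arrayed waveguide grating router follows from its cyclic wavelength-routing property: input port i on wavelength index w exits output port (i + w) mod N. A sketch under that standard AWGR model (port numbering is illustrative):

```python
def awgr_output(in_port, wavelength, n_ports):
    # cyclic AWGR routing: each (input, wavelength) pair
    # maps to a unique output port
    return (in_port + wavelength) % n_ports
```

Every input can reach every output by choosing a wavelength, and on any single wavelength no two inputs collide at the same output, which is why a passive AWGR can provide single-hop p2p connectivity without arbitration.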
Workload Characterization for Exascale Computing Networks
Exascale computing is the next step in high-performance computing, provided by systems composed of millions of interconnected processing cores. In order to guide the design and implementation of such systems, multiple workload characterization studies and system performance evaluations are required.
This paper provides a workload characterization study in the context of the European project ExaNeSt, which focuses, among other goals, on developing the network technology required to implement future exascale systems. In this work, we characterize different ExaNeSt applications from the computer network perspective by analyzing the distribution of messages, the dynamic bandwidth consumption, and the spatial communication patterns among cores.
The analysis highlights three main observations: (i) message sizes are, in general, below 50 kB; (ii) communication patterns are usually bursty; and (iii) spatial communication among cores concentrates in hot spots for most applications. Taking these observations into account, one can conclude that, in order to relieve congested network links, an exascale network must be designed to briefly support higher-than-average bandwidths in the vicinity of key network cores.
This work was supported by the ExaNeSt project, funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 671553, and by the Spanish Ministerio de Economía y Competitividad (MINECO) and Plan E funds under Grant TIN2015-66972-C5-1-R.
Duro-Gómez, J.; Petit Martí, SV.; Sahuquillo Borrás, J.; Gómez Requena, ME. (2018). Workload Characterization for Exascale Computing Networks. IEEE Computer Society. 383-389. https://doi.org/10.1109/HPCS.2018.00069
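The spatial-hotspot observation can be reproduced from a core-to-core traffic matrix by flagging source-destination pairs whose volume far exceeds the mean flow. A minimal sketch (the threshold rule is illustrative, not the paper's methodology):

```python
def find_hotspots(traffic, threshold_factor=2.0):
    # flag source-destination pairs whose volume exceeds
    # threshold_factor times the mean of the nonzero flows
    flows = [(s, d, v) for s, row in enumerate(traffic)
             for d, v in enumerate(row) if v > 0]
    mean = sum(v for _, _, v in flows) / len(flows)
    return [(s, d) for s, d, v in flows if v > threshold_factor * mean]
```

Applied to a matrix where one pair carries ten times the typical volume, only that pair is reported, matching the kind of concentration the characterization observes.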
Node-Type-Based Load-Balancing Routing for Parallel Generalized Fat-Trees
High-Performance Computing (HPC) clusters are made up of a variety of node
types (usually compute, I/O, service, and GPGPU nodes), and applications do not
use nodes of different types in the same way. The resulting communication
patterns reflect the organization of groups of nodes, and current optimal routing algorithms
for all-to-all patterns will not always maximize performance for group-specific
communications. Since application communication patterns are rarely available
beforehand, we choose to rely on node types as a good guess for node usage. We
provide a description of node-type heterogeneity and analyse the performance
degradation caused by unfavorable placement of nodes of the same type. We provide
an extension to routing algorithms for Parallel Generalized Fat-Tree topologies
(PGFTs) which balances load amongst groups of nodes of the same type. We show
how it removes these performance issues by comparing results in a variety of
situations against corresponding classical algorithms.
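The balancing idea can be sketched as keeping per-node-type load counters on each upward port and routing a destination via the port least loaded for its type (the data structures and tie-breaking here are illustrative, not the paper's algorithm):

```python
def pick_up_port(load_per_port, node_type, num_ports):
    # choose the upward port with the least accumulated load
    # for this destination's node type, then record the new load
    best = min(range(num_ports),
               key=lambda p: load_per_port[p].get(node_type, 0))
    load_per_port[best][node_type] = \
        load_per_port[best].get(node_type, 0) + 1
    return best
```

Balancing each node type separately keeps, say, I/O-node routes from piling onto the same links, even when an all-to-all-optimal algorithm would consider the aggregate load balanced.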
- …