Search CORE

14 research outputs found

Data Traffic Reduction by Exploiting Data Criticality With A Compressor in GPU

Author: Byoun Jae Guen
Publication venue
Publication date: 18/01/2019
Field of study

Graphics Processing Units (GPUs) have been predominantly accepted for various general purpose applications due to a massive degree of parallelism. The demand for large-scale GPUs processing an enormous volume of data with high throughput has been rising rapidly. However, the performance of the massive parallelism workloads usually suffer from multiple constraints such as memory bandwidth, high memory latency, and power/energy cost. Also a bandwidth efficient network design is challenging in large-scale GPUs. In this research, we focus on mitigating network bottlenecks by effectively reducing the size of packets transferring through an interconnect network so that the overall system performance improves. The unused fraction of each L1 data cache block across a variety of benchmark suits is initially investigated to see inefficient cache usage. Then, categorizing memory access patterns into several types we introduce essential micro-architectural enhancements to support filtering out unnecessary words in packets throughout the reply path. A compression scheme (Dual Pattern Compression) adequate for packet compression is exploited to effectively reduce the size of reply packets. We demonstrate that our scheme effectively improves system performance. Our approach yields 39% IPC improvement across heterogeneous computing and text processing benchmarks over the baseline cooperating with DPC. Comparing this work with DPC, we achieved 5% IPC improvement for the overall benchmark suits and 20% IPC increase for favorable workloads to this scheme

Texas A&M Repository

Packet Compression in GPU Architectures

Author: Devpura Priyank
Publication venue
Publication date: 21/08/2017
Field of study

Graphical processing unit (GPU) can support multiple operations in parallel by executing it on multiple thread unit known as warp i.e. multiple threads running the same instruction. Each time miss happens at private cache of Streaming Multiprocessor (SM), the request is migrated over the network to shared L2 cache and then later down to Memory Controller (MC) for supplying memory block. The interconnect delay becomes a bottleneck due to a large number of requests from different SM and multiple replies from the MCs. The compression technique can be used to mitigate the performance bottleneck caused by a large volume of data. In this work, I apply various compression algorithms and propose a new compression scheme, Data Segment Matching (DSM). I apply approximation to the floating-point elements to improve compressibility and develop a prediction model to identify number of approximation bits. I focus on compression techniques to resolve this bottleneck. The evaluations using a cycle accurate simulator show that this scheme improves Instructions per Cycle (IPC) by 12% on an average across various benchmarks with compressibility 50% in integer type benchmarks and 35% in floating-point type benchmarks when the proposed scheme is applied to packet compression in the interconnection network

Texas A&M Repository

Bit Based Approximation for Approx-NoC: A Data Approximation Framework for Network-On-Chip Architectures

Author: Siravara Divya Ravi
Publication venue
Publication date: 16/10/2019
Field of study

The dawn of the big data era has led to the inception of new and creative compute paradigms that utilize heterogeneity, specialization, processor-in-memory and approximation due to the high demand for memory bandwidth and power. Relaxing the constraints of applications has led to approximate computing being put forth as a feasible solution for high performance computation. The latest fad such as machine learning, video/image processing, data analytics, neural networks and other data intensive applications have heightened the possibility of using approximate computing as a feasible solution as these applications allow imprecise output within a specific error range. This work presents Bit Based Approx-NoC, a hardware data approximation framework with a lightweight bit-based approximation technique for high performance NoCs. Bit-Based Approx-NoC facilitates approximate matching of data patterns, within a controllable error range, to compress them thereby reducing the data movement across the chip. The proposed work exploits the entropy between data words in order to increase their inherent compressibility. Evaluations in this work show on average 5% latency reduction and 14% throughput improvement compared to the state of the art NoC compression mechanisms

Texas A&M Repository

Recommended from our members

Network-on-Chip Synchronization

Author: Buckler Mark
Publication venue: ScholarWorks@UMass Amherst
Publication date: 07/11/2014
Field of study

Technology scaling has enabled the number of cores within a System on Chip (SoC) to increase significantly. Globally Asynchronous Locally Synchronous (GALS) systems using Dynamic Voltage and Frequency Scaling (DVFS) operate each of these cores on distinct and dynamic clock domains. The main communication method between these cores is increasingly more likely to be a Network-on-Chip (NoC). Typically, the interfaces between these clock domains experience multi-cycle synchronization latencies due to their use of “brute-force” synchronizers. This dissertation aims to improve the performance of NoCs and thereby SoCs as a whole by reducing this synchronization latency. First, a survey of NoC improvement techniques is presented. One such improvement technique: a multi-layer NoC, has been successfully simulated. Given how one of the most commonly used techniques is DVFS, a thorough analysis and simulation of brute-force synchronizer circuits in both current and future process technologies is presented. Unfortunately, a multi-cycle latency is unavoidable when using brute-force synchronizers, so predictive synchronizers which require only a single cycle of latency have been proposed. To demonstrate the impact of these predictive synchronizer circuits at a high level, multi-core system simulations incorporating these circuits have been completed. Multiple forms of GALS NoC configurations have been simulated, including multi-synchronous, NoC-synchronous, and single-synchronizer. Speedup on the SPLASH benchmark suite was measured to directly quantify the performance benefit of predictive synchronizers in a full system. Additionally, Mean Time Between Failures (MTBF) has been calculated for each NoC synchronizer configuration to determine the reliability benefit possible when using predictive synchronizers

ScholarWorks@UMass Amherst

Spatial Locality Speculation to Reduce Energy in Chip-Multiprocessor Networks-on-Chip

Author: Gratz Paul V.
Grot Boris
Jimenez Daniel A.
Kim Hyungjun
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 02/05/2014
Field of study

As processor chips become increasingly parallel, an efficient communication substrate is critical for meeting performance and energy targets. In this work, we target the root cause of network energy consumption through techniques that reduce link and router-level switching activity. We specifically focus on memory subsystem traffic, as it comprises the bulk of NoC load in a CMP. By transmitting only the flits that contain words predicted useful using a novel spatial locality predictor, our scheme seeks to reduce network activity. We aim to further lower NoC energy through microarchitectural mechanisms that inhibit datapath switching activity for unused words in individual flits. Using simulation-based performance studies and detailed energy models based on synthesized router designs and different link wire types, we show that 1) the prediction mechanism achieves very high accuracy, with an average rate of false-unused prediction of just 2.5 percent; 2) the combined NoC energy savings enabled by the predictor and microarchitectural support is 36 percent, on average, and up to 57 percent in the best case; and 3) there is no system performance penalty as a result of this technique

Infoscience - École polytechnique fédérale de Lausanne

CiteSeerX

GreenCool: An Energy-Efficient Liquid Cooling Design Technique for 3-D MPSoCs Via Channel Width Modulation

Author: Atienza David
Coskun Ayse K.
Meng Jie
Sabry Mohamed M.
Sridhar Arvind
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 13/05/2013
Field of study

Liquid cooling using interlayer microchannels has appeared as a viable and scalable packaging technology for 3-D multiprocessor system-on-chips (MPSoCs). Microchannel-based liquid cooling, however, can substantially increase the on-chip thermal gradients, which are undesirable for reliability, performance, and cooling efficiency. In this paper, we present GreenCool, an optimal design methodology for liquid-cooled 3-D MPSoCs. GreenCool simultaneously minimizes the cooling energy for a given system while maintaining thermal gradients and peak temperatures under safe limits. This is accomplished by tuning the heat transfer characteristics of the microchannels using channel width modulation. Channel width modulation is compatible with the current process technologies and incurs minimal additional fabrication costs. Through an extensive set of experiments, we show that channel width modulation is capable of complementing and enhancing the benefits of temperature-aware floorplanning. We also experiment with a 16-core 3-D system with stacked dynamic random-access memory, for which GreenCool improves energy efficiency by up to 53% with respect to no channel modulation

Infoscience - École polytechnique fédérale de Lausanne

Architectural Support for High-Performance, Power-Efficient and Secure Multiprocessor Systems

Author: An Baik Song
Publication venue
Publication date
Field of study

High performance systems have been widely adopted in many fields and the demand for better performance is constantly increasing. And the need of powerful yet flexible systems is also increasing to meet varying application requirements from diverse domains. Also, power efficiency in high performance computing has been one of the major issues to be resolved. The power density of core components becomes significantly higher, and the fraction of power supply in total management cost is dominant. Providing dependability is also a main concern in large-scale systems since more hardware resources can be abused by attackers. Therefore, designing high-performance, power-efficient and secure systems is crucial to provide adequate performance as well as reliability to users. Adhering to using traditional design methodologies for large-scale computing systems has a limit to meet the demand under restricted resource budgets. Interconnecting a large number of uniprocessor chips to build parallel processing systems is not an efficient solution in terms of performance and power. Chip multiprocessor (CMP) integrates multiple processing cores and caches on a chip and is thought of as a good alternative to previous design trends. In this dissertation, we deal with various design issues of high performance multiprocessor systems based on CMP to achieve both performance and power efficiency while maintaining security. First, we propose a fast and secure off-chip interconnects through minimizing network overheads and providing an efficient security mechanism. Second, we propose architectural support for fast and efficient memory protection in CMP systems, making the best use of the characteristics in CMP environments and multi-threaded workloads. Third, we propose a new router design for network-on-chip (NoC) based on a new memory technique. We introduce hybrid input buffers that use both SRAM and STT-MRAM for better performance as well as power efficiency. Simulation results show that the proposed schemes improve the performance of off-chip networks through reducing the message size by 54% on average. Also, the schemes diminish the overheads of bounds checking operations, thus enhancing the overall performance by 11% on average. Adopting hybrid buffers in NoC routers contributes to increasing the network throughput up to 21%

Texas A&M Repository

Efficient Interconnection Network Design for Heterogeneous Architectures

Author: Huang Jiayi
Publication venue
Publication date: 02/02/2021
Field of study

The onset of big data and deep learning applications, mixed with conventional general-purpose programs, have driven computer architecture to embrace heterogeneity with specialization. With the ever-increasing interconnected chip components, future architectures are required to operate under a stricter power budget and process emerging big data applications efficiently. Interconnection network as the communication backbone thus is facing the grand challenges of limited power envelope, data movement and performance scaling. This dissertation provides interconnect solutions that are specialized to application requirements towards power-/energy-efficient and high-performance computing for heterogeneous architectures. This dissertation examines the challenges of network-on-chip router power-gating techniques for general-purpose workloads to save static power. A voting approach is proposed as an adaptive power-gating policy that considers both local and global traffic status through router voting. In addition, low-latency routing algorithms are designed to guarantee performance in irregular power-gating networks. This holistic solution not only saves power but also avoids performance overhead. This research also introduces emerging computation paradigms to interconnects for big data applications to mitigate the pressure of data movement. Approximate network-on-chip is proposed to achieve high-throughput communication by means of lossy compression. Then, near-data processing is combined with in-network computing to further improve performance while reducing data movement. The two schemes are general to play as plug-ins for different network topologies and routing algorithms. To tackle the challenging computational requirements of deep learning workloads, this dissertation investigates the compelling opportunities of communication algorithm-architecture co-design to accelerate distributed deep learning. MultiTree allreduce algorithm is proposed to bond with message scheduling with network topology to achieve faster and contention-free communication. In addition, the interconnect hardware and flow control are also specialized to exploit deep learning communication characteristics and fulfill the algorithm needs, thereby effectively improving the performance and scalability. By considering application and algorithm characteristics, this research shows that interconnection network can be tailored accordingly to improve the power-/energy-efficiency and performance to satisfy heterogeneous computation and communication requirements

Texas A&M Repository

Modeling and optimization of high-performance many-core systems for energy-efficient and reliable computing

Author: Meng Jie
Publication venue: Boston University
Publication date: 01/01/2013
Field of study

Thesis (Ph.D.)--Boston UniversityMany-core systems, ranging from small-scale many-core processors to large-scale high performance computing (HPC) data centers, have become the main trend in computing system design owing to their potential to deliver higher throughput per watt. However, power densities and temperatures increase following the growth in the performance capacity, and bring major challenges in energy efficiency, cooling costs, and reliability. These challenges require a joint assessment of performance, power, and temperature tradeoffs as well as the design of runtime optimization techniques that monitor and manage the interplay among them. This thesis proposes novel modeling and runtime management techniques that evaluate and optimize the performance, energy, and reliability of many-core systems. We first address the energy and thermal challenges in 3D-stacked many-core processors. 3D processors with stacked DRAM have the potential to dramatically improve performance owing to lower memory access latency and higher bandwidth. However, the performance increase may cause 3D systems to exceed the power budgets or create thermal hot spots. In order to provide an accurate analysis and enable the design of efficient management policies, this thesis introduces a simulation framework to jointly analyze performance, power, and temperature for 3D systems. We then propose a runtime optimization policy that maximizes the system performance by characterizing the application behavior and predicting the operating points that satisfy the power and thermal constraints. Our policy reduces the energy-delay product (EDP) by up to 61.9% compared to existing strategies. Performance, cooling energy, and reliability are also critical aspects in HPC data centers. In addition to causing reliability degradation, high temperatures increase the required cooling energy. Communication cost, on the other hand, has a significant impact on system performance in HPC data centers. This thesis proposes a topology-aware technique that maximizes system reliability by selecting between workload clustering and balancing. Our policy improves the system reliability by up to 123.3% compared to existing temperature balancing approaches. We also introduce a job allocation methodology to simultaneously optimize the communication cost and the cooling energy in a data center. Our policy reduces the cooling cost by 40% compared to cooling-aware and performance-aware policies, while achieving comparable performance to performance-aware policy

Boston University Institutional Repository (OpenBU)

圧縮とマルチキャストを用いた低遅延オンチップネットワーク

Author: He Yuan
和远
Publication venue: 工学系研究科先端学際工学専攻
Publication date: 24/03/2014
Field of study

学位の種別:課程博士University of Tokyo(東京大学