Workshops on Extreme Scale Design Automation (ESDA) Challenges and Opportunities for 2025 and Beyond
Integrated circuits and electronic systems, as well as design technologies,
are evolving at a great rate -- both quantitatively and qualitatively. Major
developments include new interconnects and switching devices with atomic-scale
uncertainty, the depth and scale of on-chip integration, electronic
system-level integration, the increasing significance of software, as well as
more effective means of design entry, compilation, algorithmic optimization,
numerical simulation, pre- and post-silicon design validation, and chip test.
Application targets and key markets are also shifting substantially from
desktop CPUs to mobile platforms to an Internet-of-Things infrastructure. In
light of these changes in electronic design contexts and given EDA's
significant dependence on such context, the EDA community must adapt to these
changes and focus on the opportunities for research and commercial success. The
CCC workshop series on Extreme-Scale Design Automation, organized with the
support of ACM SIGDA, studied challenges faced by the EDA community as well as
new and exciting opportunities currently available. This document represents a
summary of the findings from these meetings. Comment: A Computing Community Consortium (CCC) workshop report, 32 pages.
Synergistic CPU-FPGA Acceleration of Sparse Linear Algebra
This paper describes REAP, a software-hardware approach that enables high
performance sparse linear algebra computations on a cooperative CPU-FPGA
platform. REAP carefully separates the task of organizing the matrix elements
from the computation phase. It uses the CPU to provide a first-pass
re-organization of the matrix elements, allowing the FPGA to focus on the
computation. We introduce a new intermediate representation that allows the CPU
to communicate the sparse data and the scheduling decisions to the FPGA. The
computation is optimized on the FPGA for effective resource utilization with
pipelining. REAP improves the performance of Sparse General Matrix
Multiplication (SpGEMM) by 3.2x and Sparse Cholesky Factorization by 1.85x
compared to widely used CPU sparse libraries. Comment: 12 pages.
Overcoming Limitations of GPGPU-Computing in Scientific Applications
The performance of discrete general purpose graphics processing units
(GPGPUs) has been improving at a rapid pace. The PCIe interconnect that
controls the communication of data between the system host memory and the GPU
has not improved as quickly, leaving a gap in performance due to GPU downtime
while waiting for PCIe data transfer. In this article, we explore two
alternatives to the limited PCIe bandwidth, NVIDIA NVLink interconnect, and
zero-copy algorithms for shared memory Heterogeneous System Architecture (HSA)
devices. The OpenCL SHOC benchmark suite is used to measure the performance of
each device on various scientific application kernels. Comment: 9 pages, 10 figures.
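The size of the gap this article targets can be seen with back-of-envelope arithmetic; the bandwidth figures below are rough nominal values chosen for illustration (roughly PCIe 3.0 x16 versus a mid-range GDDR memory system), not measurements from the article.

```python
# How long does moving a working set over PCIe take, relative to sweeping the
# same data in on-board GPU memory? Nominal, assumed bandwidths:
GIB = 1024**3
pcie_bw = 16e9    # bytes/s over PCIe 3.0 x16 (nominal)
gpu_bw = 500e9    # bytes/s of on-board GDDR bandwidth (nominal)

data = 4 * GIB    # move a 4 GiB working set

t_pcie = data / pcie_bw
t_gpu = data / gpu_bw
print(f"PCIe copy: {t_pcie:.3f}s, on-device sweep: {t_gpu:.3f}s, "
      f"ratio: {t_pcie / t_gpu:.0f}x")
```

Under these assumed numbers the host-to-device copy costs roughly 30x the on-device traversal, which is the downtime that NVLink and zero-copy HSA designs try to hide or eliminate.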
A Performance Comparison Using HPC Benchmarks: Windows HPC Server 2008 and Red Hat Enterprise Linux 5
This document was developed with support from the National Science Foundation (NSF) under Grant No. 0910812 to Indiana University for "FutureGrid: An Experimental, High-Performance Grid Test-bed." Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

A collection of performance benchmarks has been run on an IBM System X iDataPlex cluster using two different operating systems. Windows HPC Server 2008 (WinHPC) and Red Hat Enterprise Linux v5.4 (RHEL5) are compared using the SPEC MPI2007 v1.1, High Performance Computing Challenge (HPCC), and National Science Foundation (NSF) acceptance test benchmark suites. Overall, we find the performance of WinHPC and RHEL5 to be equivalent, but significant performance differences exist when analyzing specific applications. We focus on presenting the results from the application benchmarks and include the results of the HPCC microbenchmark for completeness.
Wanted: Floating-Point Add Round-off Error instruction
We propose a new instruction (FPADDRE) that computes the round-off error in
floating-point addition. We explain how this instruction benefits
high-precision arithmetic operations in applications where double precision is
not sufficient. Performance estimates on Intel Haswell, Intel Skylake, and AMD
Steamroller processors, as well as Intel Knights Corner co-processor,
demonstrate that such an instruction would improve the latency of double-double
addition by up to 55% and increase double-double addition throughput by up to
103%, with smaller, but non-negligible benefits for double-double
multiplication. The new instruction delivers up to 2x speedups on three
benchmarks that use high-precision floating-point arithmetic: double-double
matrix-matrix multiplication, compensated dot product, and polynomial
evaluation via the compensated Horner scheme.
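The quantity FPADDRE would return in a single instruction is what the classic TwoSum algorithm recovers today in six floating-point operations; the sketch below shows the software sequence the proposed instruction is meant to replace.

```python
# TwoSum: recover the exact round-off error of a floating-point addition.

def two_sum(a, b):
    """Return (s, e) with s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    bb = s - a                     # the part of b actually absorbed into s
    e = (a - (s - bb)) + (b - bb)  # what was lost to rounding
    return s, e

s, e = two_sum(1.0, 1e-16)
print(s, e)   # the tiny addend, lost in s, survives in the error term e
```

Double-double arithmetic builds every operation out of sequences like this, which is why a hardware round-off-error instruction translates directly into the latency and throughput gains the paper estimates.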
A Microbenchmark Characterization of the Emu Chick
The Emu Chick is a prototype system designed around the concept of migratory
memory-side processing. Rather than transferring large amounts of data across
power-hungry, high-latency interconnects, the Emu Chick moves lightweight
thread contexts to near-memory cores before the beginning of each memory read.
The current prototype hardware uses FPGAs to implement cache-less "Gossamer"
cores for doing computational work and a stationary core to run basic operating
system functions and migrate threads between nodes. In this multi-node
characterization of the Emu Chick, we extend an earlier single-node
investigation (Hein, et al. AsHES 2018) of the memory bandwidth
characteristics of the system through benchmarks like STREAM, pointer chasing,
and sparse matrix-vector multiplication. We compare the Emu Chick hardware to
architectural simulation and an Intel Xeon-based platform. Our results
demonstrate that for many basic operations the Emu Chick can use available
memory bandwidth more efficiently than a more traditional, cache-based
architecture although bandwidth usage suffers for computationally intensive
workloads like SpMV. Moreover, the Emu Chick provides stable, predictable
performance with up to 65% of the peak bandwidth utilization on a random-access
pointer chasing benchmark with weak locality.
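A pointer-chasing kernel like the one used here makes every load's address depend on the previous load, so the traversal exposes memory latency rather than bandwidth. The sketch below is only the shape of such a benchmark; the paper's version runs on Emu hardware, not in Python.

```python
# Build a random single-cycle permutation, then chase it with dependent loads.
import random

def make_chain(n, seed=0):
    """chain[i] gives the next index to visit; one cycle covering all n slots."""
    idx = list(range(1, n))
    random.Random(seed).shuffle(idx)
    chain = [0] * n
    cur = 0
    for nxt in idx:
        chain[cur] = nxt
        cur = nxt
    chain[cur] = 0          # close the cycle back to the start
    return chain

def chase(chain, steps):
    """Serialized, dependent loads: each step must wait for the previous one."""
    p = 0
    for _ in range(steps):
        p = chain[p]
    return p

chain = make_chain(1 << 10)
assert chase(chain, 1 << 10) == 0   # a full cycle returns to the start
```

Because the next address is unknowable until the current load completes, no cache or prefetcher can hide the latency, which is what makes the workload a good fit for the Emu Chick's migrate-the-thread-to-the-data model.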
Tolerating Soft Errors in Processor Cores Using CLEAR (Cross-Layer Exploration for Architecting Resilience)
We present CLEAR (Cross-Layer Exploration for Architecting Resilience), a
first-of-its-kind framework that overcomes a major challenge in the design of
digital systems resilient to reliability failures: achieving desired
resilience targets at minimal costs (energy, power, execution time, area) by
combining resilience techniques across various layers of the system stack
(circuit, logic, architecture, software, algorithm). This is also referred to
as cross-layer resilience. In this paper, we focus on radiation-induced soft
errors in processor cores. We address both single-event upsets (SEUs) and
single-event multiple upsets (SEMUs) in terrestrial environments. Our framework
automatically and systematically explores the large space of comprehensive
resilience techniques and their combinations across various layers of the
system stack (586 cross-layer combinations in this paper), derives
cost-effective solutions that achieve resilience targets at minimal costs, and
provides guidelines for the design of new resilience techniques. Our results
demonstrate that a carefully optimized combination of circuit-level hardening,
logic-level parity checking, and micro-architectural recovery provides a highly
cost-effective soft error resilience solution for general-purpose processor
cores. For example, a 50x improvement in silent data corruption rate is
achieved at only 2.1% energy cost for an out-of-order core (6.1% for an
in-order core) with no speed impact. However, (application-aware) selective
circuit-level hardening alone, guided by a thorough analysis of the effects of
soft errors on application benchmarks, provides a cost-effective soft error
resilience solution as well (with ~1% additional energy cost for a 50x
improvement in silent data corruption rate). Comment: Unedited version of a paper published in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
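One of the layers CLEAR combines is logic-level parity checking; the idea in miniature is that a parity bit stored alongside each word flips the recomputed check whenever a single-event upset flips one bit. This is a toy sketch of the principle, not the paper's circuit-level implementation.

```python
# Even parity over a word detects any single-bit flip.

def parity(word):
    """Even parity (XOR of all bits) of a non-negative integer word."""
    p = 0
    while word:
        p ^= word & 1
        word >>= 1
    return p

stored = 0b1011_0010
check = parity(stored)            # parity recorded at write time

upset = stored ^ (1 << 5)         # a single-event upset flips one bit
print(parity(upset) != check)     # True: the flip is detected on read
```

Parity detects but cannot correct the error, which is why the paper pairs it with micro-architectural recovery: detection triggers a rollback rather than a repair.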
Memory-Based High-Level Synthesis Optimizations Security Exploration on the Power Side-Channel
High-level synthesis (HLS) allows hardware designers to think algorithmically and not worry about low-level, cycle-by-cycle details. This provides the ability to quickly explore the architectural design space and tradeoffs between resource utilization and performance. Unfortunately, security evaluation is not a standard part of the HLS design flow. In this article, we aim to understand the effects of memory-based HLS optimizations on power side-channel leakage. We use Xilinx Vivado HLS to develop different cryptographic cores, implement them on a Spartan-6 FPGA, and collect power traces. We evaluate the designs with respect to resource utilization, performance, and information leakage through power consumption. We have two important observations and contributions. First, the choice of resource optimization directive results in different levels of side-channel vulnerabilities. Second, the partitioning optimization directive can greatly compromise the hardware cryptographic system through power side-channel leakage due to the deployment of memory control logic. We describe an evaluation procedure for power side-channel leakage and use it to make best-effort recommendations about how to design more secure architectures in the cryptographic domain.
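Power side-channel evaluations like this one correlate measured traces against a data-dependent model; the standard first step is a Hamming-weight leakage model over an S-box output, sketched below. The S-box values are the first eight entries of the public AES specification; the key guess and the correlation step against real traces are omitted assumptions of this illustration.

```python
# Hamming-weight leakage model: predicted "power" for each plaintext byte
# under one key guess, using the first 8 entries of the AES S-box.

SBOX = [0x63, 0x7C, 0x77, 0x7B, 0xF2, 0x6B, 0x6F, 0xC5]

def hamming_weight(x):
    """Number of set bits: a common proxy for data-dependent power draw."""
    return bin(x).count("1")

key_guess = 0x05   # hypothetical key byte, chosen so p ^ key_guess stays < 8
leak = [hamming_weight(SBOX[p ^ key_guess]) for p in range(8)]
print(leak)        # model values to correlate against measured power traces
```

An attacker ranks key guesses by how well such a model correlates with captured traces; the article's point is that HLS memory directives change how strongly the real power draw tracks this kind of model.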
Performance Characterization of Multi-threaded Graph Processing Applications on Intel Many-Integrated-Core Architecture
Intel Xeon Phi many-integrated-core (MIC) architectures usher in a new era of
terascale integration. Among emerging killer applications, parallel graph
processing has been a critical technique to analyze connected data. In this
paper, we empirically evaluate various computing platforms including an Intel
Xeon E5 CPU, a Nvidia Geforce GTX1070 GPU and an Xeon Phi 7210 processor
codenamed Knights Landing (KNL) in the domain of parallel graph processing. We
show that the KNL gains encouraging performance when processing graphs, so that
it can become a promising solution to accelerating multi-threaded graph
applications. We further characterize the impact of KNL architectural
enhancements on the performance of a state-of-the-art graph framework. We have
four key observations: (1) Different graph applications require distinctive
numbers of threads to reach peak performance, and for the same application,
different datasets need different numbers of threads to achieve the best
performance. (2) Only a few graph applications benefit from the high-bandwidth
MCDRAM, while others favor the low-latency DDR4 DRAM. (3) The vector processing
units executing AVX512 SIMD instructions on KNL are underutilized when running
the state-of-the-art graph framework. (4) The sub-NUMA cache clustering mode,
which offers the lowest local memory access latency, hurts the performance of
graph benchmarks that lack NUMA awareness. Finally, we suggest future work,
including system auto-tuning tools and graph framework optimizations, to fully
exploit the potential of KNL for parallel graph processing. Comment: published as L. Jiang, L. Chen and J. Qiu, "Performance
Characterization of Multi-threaded Graph Processing Applications on
Many-Integrated-Core Architecture," 2018 IEEE International Symposium on
Performance Analysis of Systems and Software (ISPASS), Belfast, United
Kingdom, 2018, pp. 199-20
Load-Varying LINPACK: A Benchmark for Evaluating Energy Efficiency in High-End Computing
For decades, performance has driven the high-end computing (HEC) community. However, as highlighted in recent exascale studies that chart a path from petascale to exascale computing, power consumption is fast becoming the major design constraint in HEC. Consequently, the HEC community needs to address this issue in future petascale and exascale computing systems.
Current scientific benchmarks, such as LINPACK and SPEChpc, only evaluate HEC systems when running at full throttle, i.e., 100% workload, resulting in a focus on performance and ignoring the issues of power and energy consumption. In contrast, efforts like SPECpower evaluate the energy efficiency of a compute server at varying workloads. This is analogous to evaluating the energy efficiency (i.e., fuel efficiency) of an automobile at varying speeds (e.g., miles per gallon highway versus city). SPECpower, however, only evaluates the energy efficiency of a single compute server rather than a HEC system; furthermore, it is based on SPEC's Java Business Benchmarks (SPECjbb) rather than a scientific benchmark. Given the absence of a load-varying scientific benchmark to evaluate the energy efficiency of HEC systems at different workloads, we propose the load-varying LINPACK (LV-LINPACK) benchmark. In this paper, we identify application parameters that affect performance and provide a methodology to vary the workload of LINPACK, thus enabling a more rigorous study of energy efficiency in supercomputers, or more generally, HEC.
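The benchmark's metric in miniature is energy efficiency (FLOPS per watt) computed at several workload levels, analogous to fuel efficiency at varying speeds. The power model below is a made-up illustration (an idle floor plus load-proportional draw), not measured data from the paper.

```python
# Efficiency at varying load under an assumed linear power model.

idle_w, extra_w = 100.0, 200.0   # hypothetical: 100 W idle, 300 W at full load
peak_flops = 1e12                # hypothetical 1 TFLOPS at 100% workload

def gflops_per_watt(load):
    """Energy efficiency at a fractional workload level (0 < load <= 1)."""
    return peak_flops * load / (idle_w + extra_w * load) / 1e9

for load in (0.25, 0.50, 0.75, 1.00):
    print(f"{load:4.0%} load: {gflops_per_watt(load):.2f} GFLOPS/W")
```

The idle floor makes partial loads less efficient than full throttle, and that curve is exactly what a 100%-workload-only benchmark like standard LINPACK cannot expose.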