Workshops on Extreme Scale Design Automation (ESDA) Challenges and Opportunities for 2025 and Beyond
Integrated circuits and electronic systems, as well as design technologies,
are evolving at a great rate -- both quantitatively and qualitatively. Major
developments include new interconnects and switching devices with atomic-scale
uncertainty, the depth and scale of on-chip integration, electronic
system-level integration, the increasing significance of software, as well as
more effective means of design entry, compilation, algorithmic optimization,
numerical simulation, pre- and post-silicon design validation, and chip test.
Application targets and key markets are also shifting substantially from
desktop CPUs to mobile platforms to an Internet-of-Things infrastructure. In
light of these changes in electronic design contexts and given EDA's
significant dependence on such context, the EDA community must adapt to these
changes and focus on the opportunities for research and commercial success. The
CCC workshop series on Extreme-Scale Design Automation, organized with the
support of ACM SIGDA, studied challenges faced by the EDA community as well as
new and exciting opportunities currently available. This document represents a
summary of the findings from these meetings. Comment: A Computing Community Consortium (CCC) workshop report, 32 pages.
Synergistic CPU-FPGA Acceleration of Sparse Linear Algebra
This paper describes REAP, a software-hardware approach that enables high
performance sparse linear algebra computations on a cooperative CPU-FPGA
platform. REAP carefully separates the task of organizing the matrix elements
from the computation phase. It uses the CPU to provide a first-pass
re-organization of the matrix elements, allowing the FPGA to focus on the
computation. We introduce a new intermediate representation that allows the CPU
to communicate the sparse data and the scheduling decisions to the FPGA. The
computation is optimized on the FPGA for effective resource utilization with
pipelining. REAP improves the performance of Sparse General Matrix
Multiplication (SpGEMM) by 3.2x and Sparse Cholesky Factorization by 1.85x
compared to widely used CPU sparse libraries. Comment: 12 pages.
Overcoming Limitations of GPGPU-Computing in Scientific Applications
The performance of discrete general purpose graphics processing units
(GPGPUs) has been improving at a rapid pace. The PCIe interconnect that
controls the communication of data between the system host memory and the GPU
has not improved as quickly, leaving a gap in performance due to GPU downtime
while waiting for PCIe data transfer. In this article, we explore two
alternatives to the limited PCIe bandwidth, NVIDIA NVLink interconnect, and
zero-copy algorithms for shared memory Heterogeneous System Architecture (HSA)
devices. The OpenCL SHOC benchmark suite is used to measure the performance of
each device on various scientific application kernels. Comment: 9 pages, 10 figures.
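The size of the gap this article targets can be seen with back-of-envelope arithmetic; the bandwidth figures below are rough nominal values chosen for illustration (roughly PCIe 3.0 x16 versus a mid-range GDDR memory system), not measurements from the article.

```python
# How long does moving a working set over PCIe take, relative to sweeping the
# same data in on-board GPU memory? Nominal, assumed bandwidths:
GIB = 1024**3
pcie_bw = 16e9    # bytes/s over PCIe 3.0 x16 (nominal)
gpu_bw = 500e9    # bytes/s of on-board GDDR bandwidth (nominal)

data = 4 * GIB    # move a 4 GiB working set

t_pcie = data / pcie_bw
t_gpu = data / gpu_bw
print(f"PCIe copy: {t_pcie:.3f}s, on-device sweep: {t_gpu:.3f}s, "
      f"ratio: {t_pcie / t_gpu:.0f}x")
```

Under these assumed numbers the host-to-device copy costs roughly 30x the on-device traversal, which is the downtime that NVLink and zero-copy HSA designs try to hide or eliminate.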
A Performance Comparison Using HPC Benchmarks: Windows HPC Server 2008 and Red Hat Enterprise Linux 5
This document was developed with support from the National Science Foundation (NSF) under Grant No. 0910812 to Indiana University for "FutureGrid: An Experimental, High-Performance Grid Test-bed." Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

A collection of performance benchmarks has been run on an IBM System X iDataPlex cluster using two different operating systems. Windows HPC Server 2008 (WinHPC) and Red Hat Enterprise Linux v5.4 (RHEL5) are compared using the SPEC MPI2007 v1.1, High Performance Computing Challenge (HPCC), and National Science Foundation (NSF) acceptance test benchmark suites. Overall, we find the performance of WinHPC and RHEL5 to be equivalent, but significant performance differences exist when analyzing specific applications. We focus on presenting the results from the application benchmarks and include the results of the HPCC microbenchmark for completeness.
Wanted: Floating-Point Add Round-off Error instruction
We propose a new instruction (FPADDRE) that computes the round-off error in
floating-point addition. We explain how this instruction benefits
high-precision arithmetic operations in applications where double precision is
not sufficient. Performance estimates on Intel Haswell, Intel Skylake, and AMD
Steamroller processors, as well as Intel Knights Corner co-processor,
demonstrate that such an instruction would improve the latency of double-double
addition by up to 55% and increase double-double addition throughput by up to
103%, with smaller, but non-negligible benefits for double-double
multiplication. The new instruction delivers up to 2x speedups on three
benchmarks that use high-precision floating-point arithmetic: double-double
matrix-matrix multiplication, compensated dot product, and polynomial
evaluation via the compensated Horner scheme.
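The quantity FPADDRE would return in a single instruction is what the classic TwoSum algorithm recovers today in six floating-point operations; the sketch below shows the software sequence the proposed instruction is meant to replace.

```python
# TwoSum: recover the exact round-off error of a floating-point addition.

def two_sum(a, b):
    """Return (s, e) with s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    bb = s - a                     # the part of b actually absorbed into s
    e = (a - (s - bb)) + (b - bb)  # what was lost to rounding
    return s, e

s, e = two_sum(1.0, 1e-16)
print(s, e)   # the tiny addend, lost in s, survives in the error term e
```

Double-double arithmetic builds every operation out of sequences like this, which is why a hardware round-off-error instruction translates directly into the latency and throughput gains the paper estimates.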
A Microbenchmark Characterization of the Emu Chick
The Emu Chick is a prototype system designed around the concept of migratory
memory-side processing. Rather than transferring large amounts of data across
power-hungry, high-latency interconnects, the Emu Chick moves lightweight
thread contexts to near-memory cores before the beginning of each memory read.
The current prototype hardware uses FPGAs to implement cache-less "Gossamer"
cores for doing computational work and a stationary core to run basic operating
system functions and migrate threads between nodes. In this multi-node
characterization of the Emu Chick, we extend an earlier single-node
investigation (Hein, et al. AsHES 2018) of the memory bandwidth
characteristics of the system through benchmarks like STREAM, pointer chasing,
and sparse matrix-vector multiplication. We compare the Emu Chick hardware to
architectural simulation and an Intel Xeon-based platform. Our results
demonstrate that for many basic operations the Emu Chick can use available
memory bandwidth more efficiently than a more traditional, cache-based
architecture although bandwidth usage suffers for computationally intensive
workloads like SpMV. Moreover, the Emu Chick provides stable, predictable
performance with up to 65% of the peak bandwidth utilization on a random-access
pointer chasing benchmark with weak locality.
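A pointer-chasing kernel like the one used here makes every load's address depend on the previous load, so the traversal exposes memory latency rather than bandwidth. The sketch below is only the shape of such a benchmark; the paper's version runs on Emu hardware, not in Python.

```python
# Build a random single-cycle permutation, then chase it with dependent loads.
import random

def make_chain(n, seed=0):
    """chain[i] gives the next index to visit; one cycle covering all n slots."""
    idx = list(range(1, n))
    random.Random(seed).shuffle(idx)
    chain = [0] * n
    cur = 0
    for nxt in idx:
        chain[cur] = nxt
        cur = nxt
    chain[cur] = 0          # close the cycle back to the start
    return chain

def chase(chain, steps):
    """Serialized, dependent loads: each step must wait for the previous one."""
    p = 0
    for _ in range(steps):
        p = chain[p]
    return p

chain = make_chain(1 << 10)
assert chase(chain, 1 << 10) == 0   # a full cycle returns to the start
```

Because the next address is unknowable until the current load completes, no cache or prefetcher can hide the latency, which is what makes the workload a good fit for the Emu Chick's migrate-the-thread-to-the-data model.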
Tolerating Soft Errors in Processor Cores Using CLEAR (Cross-Layer Exploration for Architecting Resilience)
We present CLEAR (Cross-Layer Exploration for Architecting Resilience), a
first-of-its-kind framework that overcomes a major challenge in the design of
digital systems resilient to reliability failures: achieving desired
resilience targets at minimal costs (energy, power, execution time, area) by
combining resilience techniques across various layers of the system stack
(circuit, logic, architecture, software, algorithm). This is also referred to
as cross-layer resilience. In this paper, we focus on radiation-induced soft
errors in processor cores. We address both single-event upsets (SEUs) and
single-event multiple upsets (SEMUs) in terrestrial environments. Our framework
automatically and systematically explores the large space of comprehensive
resilience techniques and their combinations across various layers of the
system stack (586 cross-layer combinations in this paper), derives
cost-effective solutions that achieve resilience targets at minimal costs, and
provides guidelines for the design of new resilience techniques. Our results
demonstrate that a carefully optimized combination of circuit-level hardening,
logic-level parity checking, and micro-architectural recovery provides a highly
cost-effective soft error resilience solution for general-purpose processor
cores. For example, a 50x improvement in silent data corruption rate is
achieved at only 2.1% energy cost for an out-of-order core (6.1% for an
in-order core) with no speed impact. However, (application-aware) selective
circuit-level hardening alone, guided by a thorough analysis of the effects of
soft errors on application benchmarks, provides a cost-effective soft error
resilience solution as well (with ~1% additional energy cost for a 50x
improvement in silent data corruption rate). Comment: Unedited version of a paper published in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
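One of the layers CLEAR combines is logic-level parity checking; the idea in miniature is that a parity bit stored alongside each word flips the recomputed check whenever a single-event upset flips one bit. This is a toy sketch of the principle, not the paper's circuit-level implementation.

```python
# Even parity over a word detects any single-bit flip.

def parity(word):
    """Even parity (XOR of all bits) of a non-negative integer word."""
    p = 0
    while word:
        p ^= word & 1
        word >>= 1
    return p

stored = 0b1011_0010
check = parity(stored)            # parity recorded at write time

upset = stored ^ (1 << 5)         # a single-event upset flips one bit
print(parity(upset) != check)     # True: the flip is detected on read
```

Parity detects but cannot correct the error, which is why the paper pairs it with micro-architectural recovery: detection triggers a rollback rather than a repair.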
Memory-Based High-Level Synthesis Optimizations Security Exploration on the Power Side-Channel
High-level synthesis (HLS) allows hardware designers to think algorithmically and not worry about low-level, cycle-by-cycle details. This provides the ability to quickly explore the architectural design space and tradeoffs between resource utilization and performance. Unfortunately, security evaluation is not a standard part of the HLS design flow. In this article, we aim to understand the effects of memory-based HLS optimizations on power side-channel leakage. We use Xilinx Vivado HLS to develop different cryptographic cores, implement them on a Spartan-6 FPGA, and collect power traces. We evaluate the designs with respect to resource utilization, performance, and information leakage through power consumption. We have two important observations and contributions. First, the choice of resource optimization directive results in different levels of side-channel vulnerabilities. Second, the partitioning optimization directive can greatly compromise the hardware cryptographic system through power side-channel leakage due to the deployment of memory control logic. We describe an evaluation procedure for power side-channel leakage and use it to make best-effort recommendations about how to design more secure architectures in the cryptographic domain.
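Power side-channel evaluations like this one correlate measured traces against a data-dependent model; the standard first step is a Hamming-weight leakage model over an S-box output, sketched below. The S-box values are the first eight entries of the public AES specification; the key guess and the correlation step against real traces are omitted assumptions of this illustration.

```python
# Hamming-weight leakage model: predicted "power" for each plaintext byte
# under one key guess, using the first 8 entries of the AES S-box.

SBOX = [0x63, 0x7C, 0x77, 0x7B, 0xF2, 0x6B, 0x6F, 0xC5]

def hamming_weight(x):
    """Number of set bits: a common proxy for data-dependent power draw."""
    return bin(x).count("1")

key_guess = 0x05   # hypothetical key byte, chosen so p ^ key_guess stays < 8
leak = [hamming_weight(SBOX[p ^ key_guess]) for p in range(8)]
print(leak)        # model values to correlate against measured power traces
```

An attacker ranks key guesses by how well such a model correlates with captured traces; the article's point is that HLS memory directives change how strongly the real power draw tracks this kind of model.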
Performance Characterization of Multi-threaded Graph Processing Applications on Intel Many-Integrated-Core Architecture
Intel Xeon Phi many-integrated-core (MIC) architectures usher in a new era of
terascale integration. Among emerging killer applications, parallel graph
processing has been a critical technique to analyze connected data. In this
paper, we empirically evaluate various computing platforms including an Intel
Xeon E5 CPU, a Nvidia Geforce GTX1070 GPU and an Xeon Phi 7210 processor
codenamed Knights Landing (KNL) in the domain of parallel graph processing. We
show that the KNL gains encouraging performance when processing graphs, so that
it can become a promising solution to accelerating multi-threaded graph
applications. We further characterize the impact of KNL architectural
enhancements on the performance of a state-of-the-art graph framework. We have
four key observations: (1) Different graph applications require distinctive
numbers of threads to reach peak performance, and for the same application,
different datasets need different numbers of threads to achieve the best
performance. (2) Only a few graph applications benefit from the high-bandwidth
MCDRAM, while others favor the low-latency DDR4 DRAM. (3) The vector processing
units executing AVX512 SIMD instructions on KNL are underutilized when running
the state-of-the-art graph framework. (4) The sub-NUMA cache clustering mode,
which offers the lowest local memory access latency, hurts the performance of
graph benchmarks that lack NUMA awareness. Finally, we suggest future work,
including system auto-tuning tools and graph framework optimizations, to fully
exploit the potential of KNL for parallel graph processing. Comment: published as L. Jiang, L. Chen and J. Qiu, "Performance
Characterization of Multi-threaded Graph Processing Applications on
Many-Integrated-Core Architecture," 2018 IEEE International Symposium on
Performance Analysis of Systems and Software (ISPASS), Belfast, United
Kingdom, 2018, pp. 199-20
Load-Varying LINPACK: A Benchmark for Evaluating Energy Efficiency in High-End Computing
For decades, performance has driven the high-end computing (HEC) community. However, as highlighted in recent exascale studies that chart a path from petascale to exascale computing, power consumption is fast becoming the major design constraint in HEC. Consequently, the HEC community needs to address this issue in future petascale and exascale computing systems.
Current scientific benchmarks, such as LINPACK and SPEChpc, only evaluate HEC systems when running at full throttle, i.e., 100% workload, resulting in a focus on performance and ignoring the issues of power and energy consumption. In contrast, efforts like SPECpower evaluate the energy efficiency of a compute server at varying workloads. This is analogous to evaluating the energy efficiency (i.e., fuel efficiency) of an automobile at varying speeds (e.g., miles per gallon highway versus city). SPECpower, however, only evaluates the energy efficiency of a single compute server rather than a HEC system; furthermore, it is based on SPEC's Java Business Benchmarks (SPECjbb) rather than a scientific benchmark. Given the absence of a load-varying scientific benchmark to evaluate the energy efficiency of HEC systems at different workloads, we propose the load-varying LINPACK (LV-LINPACK) benchmark. In this paper, we identify application parameters that affect performance and provide a methodology to vary the workload of LINPACK, thus enabling a more rigorous study of energy efficiency in supercomputers, or more generally, HEC.
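The benchmark's metric in miniature is energy efficiency (FLOPS per watt) computed at several workload levels, analogous to fuel efficiency at varying speeds. The power model below is a made-up illustration (an idle floor plus load-proportional draw), not measured data from the paper.

```python
# Efficiency at varying load under an assumed linear power model.

idle_w, extra_w = 100.0, 200.0   # hypothetical: 100 W idle, 300 W at full load
peak_flops = 1e12                # hypothetical 1 TFLOPS at 100% workload

def gflops_per_watt(load):
    """Energy efficiency at a fractional workload level (0 < load <= 1)."""
    return peak_flops * load / (idle_w + extra_w * load) / 1e9

for load in (0.25, 0.50, 0.75, 1.00):
    print(f"{load:4.0%} load: {gflops_per_watt(load):.2f} GFLOPS/W")
```

The idle floor makes partial loads less efficient than full throttle, and that curve is exactly what a 100%-workload-only benchmark like standard LINPACK cannot expose.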