Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store architectures currently
employed in large-scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increased data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolving interface
contention or increasing parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS.
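One of the transformation classes the abstract alludes to, increasing on-chip data reuse through buffering, can be illustrated in software terms. The following is a hypothetical Python sketch (not the authors' HLS code): a 3-point stencil computed over a stream using a two-element shift register, so each input element is read from "memory" exactly once; in HLS this pattern maps the buffer to on-chip registers or BRAM.

```python
def stencil_streaming(stream):
    """3-point moving average over a data stream using a two-element
    shift register, so each input is read exactly once -- the data-reuse
    transformation that maps the buffer to on-chip storage in HLS."""
    out = []
    buf = [0.0, 0.0]  # shift register holding the last two inputs
    for n, x in enumerate(stream):
        if n >= 2:
            out.append((buf[0] + buf[1] + x) / 3.0)
        buf = [buf[1], x]  # shift in the new sample
    return out
```

Without the buffer, each output would re-read two earlier inputs, tripling the interface traffic; the shift register is what lets the loop pipeline with one input read per iteration.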
Diffusive MIMO Molecular Communications: Channel Estimation, Equalization and Detection
In diffusion-based communication, as for molecular systems, the achievable
data rate is low due to the stochastic nature of diffusion, which exhibits
severe inter-symbol interference (ISI). Multiple-Input Multiple-Output (MIMO)
multiplexing improves the data rate at the expense of inter-link
interference (ILI). This paper investigates training-based channel estimation
schemes for diffusive MIMO (D-MIMO) systems and corresponding equalization
methods. Maximum likelihood and least-squares estimators of mean channel are
derived, and the training sequence is designed to minimize the mean square
error (MSE). Numerical validations in terms of MSE are compared with the
Cramér-Rao bound derived herein. Equalization is based on a decision-feedback
equalizer (DFE) structure, as this is effective in mitigating diffusive ISI/ILI.
Zero-forcing, minimum-MSE, and least-squares criteria are paired with the DFE,
and their performance is evaluated in terms of bit error probability. Since
D-MIMO systems are severely affected by ILI because of the short
inter-transmitter distance, D-MIMO time interleaving is exploited as a
countermeasure to mitigate the ILI, with remarkable performance improvements.
The feasibility of a block-type communication scheme including training and
data equalization is explored for D-MIMO, and system-level performance is
numerically derived.
Comment: Accepted paper at IEEE Transactions on Communications
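The decision-feedback principle behind the DFE can be shown with a minimal scalar sketch (illustrative Python, not the paper's D-MIMO equalizer): BPSK detection over a noiseless two-tap ISI channel, where the post-cursor interference of the previous decision is subtracted before slicing.

```python
def dfe_detect(received, h0, h1):
    """Decision-feedback detection of BPSK (+/-1) symbols over a two-tap
    ISI channel r[n] = h0*s[n] + h1*s[n-1]: cancel the interference of the
    previous decision, then slice the result to the nearest symbol."""
    decisions = []
    prev = 0.0  # no symbol precedes the first one
    for r in received:
        z = r - h1 * prev              # subtract post-cursor ISI
        d = 1.0 if z >= 0 else -1.0    # hard decision (slicer)
        decisions.append(d)
        prev = d                       # feed the decision back
    return decisions
```

As long as decisions are correct, the ISI term is removed exactly; a wrong decision propagates, which is the familiar error-propagation caveat of DFE structures.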
A Study of Energy and Locality Effects using Space-filling Curves
The cost of energy is becoming an increasingly important driver for the
operating cost of HPC systems, adding yet another facet to the challenge of
producing efficient code. In this paper, we investigate the energy implications
of trading computation for locality using Hilbert and Morton space-filling
curves with dense matrix-matrix multiplication. The advantage of these curves
is that they exhibit an inherent tiling effect without requiring specific
architecture tuning. By accessing the matrices in the order determined by the
space-filling curves, we can trade computation for locality. The index
computation overhead of the Morton curve is found to be balanced against its
locality and energy efficiency, while the overhead of the Hilbert curve
outweighs its improvements on our test system.
Comment: Proceedings of the 2014 IEEE International Parallel & Distributed
Processing Symposium Workshops (IPDPSW)
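The Morton (Z-order) traversal discussed above is driven by an index obtained by interleaving the bits of the row and column indices; a minimal Python sketch (illustrative, not the authors' implementation):

```python
def morton_index(i, j, bits=16):
    """Z-order index of element (i, j): interleave the bits of the two
    indices, with row bits in the odd positions."""
    z = 0
    for b in range(bits):
        z |= ((i >> b) & 1) << (2 * b + 1)
        z |= ((j >> b) & 1) << (2 * b)
    return z

def morton_order(n):
    """Order in which an n x n matrix (n a power of two) is visited
    when traversed along the Morton curve."""
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda ij: morton_index(*ij))
```

Because nearby (i, j) pairs share high-order bits, consecutive Morton indices stay within small sub-blocks of the matrix, which is the inherent tiling effect the abstract refers to; the per-access bit interleaving is the index-computation overhead being traded against it.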
Generating optimized Fourier interpolation routines for density functional theory using SPIRAL
© 2015 IEEE. Upsampling of a multi-dimensional data set is an operation with wide application in image processing and in quantum mechanical calculations using density functional theory. For small upsampling factors, as seen in the quantum chemistry code ONETEP, a time-shift-based implementation, which shifts samples by a fraction of the original grid spacing to fill in the intermediate values using a frequency-domain Fourier property, can be a good choice. Readily available, highly optimized multi-dimensional FFT implementations are leveraged at the expense of extra passes through the entire working set. In this paper we present an optimized variant of the time-shift-based upsampling. Since ONETEP handles threading, we address the memory hierarchy and SIMD vectorization, and focus on problem dimensions relevant for ONETEP. We present a formalization of this operation within the SPIRAL framework and demonstrate auto-generated and auto-tuned interpolation libraries. We compare the performance of our generated code against the previous best implementations using highly optimized FFT libraries (FFTW and MKL). We demonstrate speed-ups averaging 3x in isolation and of up to 15% within ONETEP.
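Fourier interpolation in its simplest form (zero-padding the spectrum, a close relative of the time-shift variant discussed above) can be sketched as follows. This is an illustrative pure-Python model with naive O(N²) DFTs, not the SPIRAL-generated code; it assumes a real input of even length.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (for illustration only)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Naive inverse DFT with 1/N normalization."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def fourier_upsample(x, L):
    """Upsample a real, even-length signal x by integer factor L by
    zero-padding its spectrum between the positive and negative bins."""
    N = len(x)
    X = dft(x)
    M = L * N
    Y = [0j] * M
    half = N // 2
    for k in range(half):              # positive frequencies
        Y[k] = X[k]
    Y[half] = X[half] / 2              # split the Nyquist bin between
    Y[M - half] = X[half] / 2          # the two spectral halves
    for k in range(half + 1, N):       # negative frequencies
        Y[M - (N - k)] = X[k]
    y = idft(Y)
    return [L * v.real for v in y]     # rescale to preserve amplitude
```

For a bandlimited signal this interpolation is exact at the new grid points, which is why the approach is attractive whenever fast FFT libraries are available.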
Evaluating application vulnerability to soft errors in multi-level cache hierarchy
As cache capacity increases dramatically with new processors, soft errors originating in the cache have become a major reliability concern for high-performance processors. This paper presents an application-specific soft-error vulnerability analysis in order to understand an application's responses to soft errors from different levels of caches. Based on a high-performance processor simulator called Graphite, we have implemented a fault-injection framework that can selectively inject bit flips into different levels of caches. We simulated a wide range of relevant bit-error patterns and measured the applications' vulnerabilities to bit errors. Our experimental results show the varying vulnerabilities of applications to bit errors from different levels of caches; the results also indicate the probabilities of different behaviors from the applications.
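The core mechanism of such a fault-injection framework, flipping selected bits in simulated cache lines, can be sketched as follows (hypothetical Python, not the Graphite-based tool):

```python
import random

def flip_bit(line: bytearray, bit: int) -> None:
    """Flip a single bit in a cache line, modeling a single-event upset."""
    line[bit // 8] ^= 1 << (bit % 8)

def inject_faults(cache_lines, n_faults, seed=0):
    """Inject n_faults single-bit flips at random positions across the
    given cache lines (each line a bytearray), reproducibly via seed."""
    rng = random.Random(seed)
    for _ in range(n_faults):
        idx = rng.randrange(len(cache_lines))
        bit = rng.randrange(len(cache_lines[idx]) * 8)
        flip_bit(cache_lines[idx], bit)
```

A real framework would additionally tag which cache level each line belongs to and log when a corrupted line is read, evicted, or overwritten, since only consumed corruptions affect application behavior.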
Performance Prediction of Nonbinary Forward Error Correction in Optical Transmission Experiments
In this paper, we compare different metrics to predict the error rate of
optical systems based on nonbinary forward error correction (FEC). It is shown
that the correct metric to predict the performance of coded modulation based on
nonbinary FEC is the mutual information. The accuracy of the prediction is
verified in a detailed example with multiple constellation formats and FEC
overheads, in both simulations and optical transmission experiments over a
recirculating loop. It is shown that the employed FEC codes must be universal
if performance prediction based on thresholds is used. A tutorial introduction
to the computation of the threshold from optical transmission measurements is
also given.
Comment: Submitted to IEEE/OSA Journal of Lightwave Technology
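Mutual information, the metric the paper identifies as the correct predictor, can be computed from the joint input-output distribution of a channel. A minimal sketch for the simple discrete case (illustrative; the paper's estimator works from soft demapper outputs, which this does not model):

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint pmf given as nested lists p[x][y]."""
    px = [sum(row) for row in joint]                     # input marginal
    py = [sum(joint[x][y] for x in range(len(joint)))    # output marginal
          for y in range(len(joint[0]))]
    mi = 0.0
    for x, row in enumerate(joint):
        for y, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (px[x] * py[y]))
    return mi
```

For a binary symmetric channel with crossover probability 0.1 and uniform input, this yields 1 - H(0.1) ≈ 0.531 bits, matching the textbook closed form.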