Decimal Floating-point Fused Multiply Add with Redundant Number Systems
The IEEE standard for decimal floating-point arithmetic was officially released in 2008. The new decimal floating-point (DFP) format and arithmetic can be applied to remedy the conversion error caused by representing decimal floating-point numbers in binary floating-point format, and to improve the performance of decimal processing in commercial and financial applications. Many architectures and algorithms for individual DFP arithmetic functions (e.g., addition, multiplication, division, and square root) have been proposed and investigated. However, because decimal numbers are represented less efficiently on binary devices, the area consumption and performance of DFP arithmetic units are not comparable with those of their binary counterparts.
IBM introduced a binary fused multiply-add (FMA) instruction in its POWER series of processors in order to improve the performance of floating-point computations and to reduce the complexity of hardware design in reduced instruction set computing (RISC) systems. Such an instruction has also proven suitable for efficiently implementing not only stand-alone addition and multiplication, but also division, square root, and other transcendental functions. Additionally, unconventional number systems, including alternative digit sets and encodings, have shown advantages in performance and area efficiency in many computer arithmetic applications.
In this research, by analyzing typical binary floating-point FMA designs and the design strategy of unconventional number systems, a high-performance decimal floating-point fused multiply-add (DFMA) unit with redundant internal encodings was proposed. First, the fixed-point components inside the DFMA (i.e., addition and multiplication) were studied as the basis of the FMA architecture. Specific number systems were also applied to improve the basic decimal fixed-point arithmetic. The superiority of redundant number systems in stand-alone decimal fixed-point addition and multiplication was confirmed by synthesis results. Afterwards, a new DFMA architecture that exploits the redundant internal operands was proposed. Overall, the specific number system improved not only the efficiency of the fixed-point addition and multiplication inside the FMA, but also the architecture and algorithms of the FMA itself.
Division, square root, reciprocal, reciprocal square root, and many other functions that exploit Newton's method or similar iterative schemes can benefit from the proposed DFMA architecture. With only a few on-chip memory devices (e.g., look-up tables), or even only software routines, these functions can be implemented on the basis of the hardwired FMA function. Therefore, the proposed DFMA can be implemented on chip as a single key component to reduce hardware cost. Additionally, our research on decimal arithmetic with unconventional number systems opens the way to performing other high-performance decimal arithmetic (e.g., stand-alone division and square root) on top of basic binary devices (i.e., AND gates, OR gates, and binary full adders). The proposed techniques are also expected to be helpful in other non-binary applications.
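As a sketch of why an FMA primitive is the natural building block for these functions: a Newton-Raphson reciprocal refinement needs exactly two FMAs per iteration. The Python illustration below is a minimal sketch; the `fma` helper only approximates a hardware fused multiply-add, which would round once rather than twice.

```python
def fma(a, b, c):
    # Stand-in for a hardware fused multiply-add; a real FMA rounds only once.
    return a * b + c

def reciprocal(d, x0, iters=5):
    """Newton-Raphson refinement of 1/d: x <- x + x*(1 - d*x), two FMAs per step.
    Convergence is quadratic, so each step roughly doubles the correct digits."""
    x = x0
    for _ in range(iters):
        e = fma(-d, x, 1.0)  # residual e = 1 - d*x
        x = fma(x, e, x)     # refine x = x + x*e
    return x

print(reciprocal(7.0, 0.1))  # converges to ~0.142857142857...
```

In hardware, the seed `x0` would come from a small look-up table, which is why the abstract notes that only a few on-chip memory devices are needed on top of the hardwired FMA.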
Reproducibility, accuracy and performance of the Feltor code and library on parallel computer architectures
Feltor is a modular and free scientific software package. It allows
developing platform independent code that runs on a variety of parallel
computer architectures ranging from laptop CPUs to multi-GPU distributed memory
systems. Feltor consists of both a numerical library and a collection of
application codes built on top of the library. Its main targets are two- and
three-dimensional drift- and gyro-fluid simulations with discontinuous Galerkin
methods as the main numerical discretization technique. We observe that
numerical simulations of a recently developed gyro-fluid model produce
non-deterministic results in parallel computations. First, we show how we
restore accuracy and bitwise reproducibility algorithmically and
programmatically. In particular, we adopt an implementation of the exactly
rounded dot product based on long accumulators, which avoids accuracy losses
especially in parallel applications. However, reproducibility and accuracy
alone fail to indicate correct simulation behaviour. In fact, in the physical
model slightly different initial conditions lead to vastly different end
states. This behaviour translates to its numerical representation. Pointwise
convergence, even in principle, becomes impossible for long simulation times.
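The effect of a long-accumulator dot product can be sketched in Python by accumulating the products in exact rational arithmetic and rounding only once at the end. Exact rationals stand in here for the fixed-point long accumulator used in practice; the names are illustrative, not Feltor's API.

```python
from fractions import Fraction

def exact_dot(xs, ys):
    """Exactly rounded dot product: accumulate the products without error,
    round once at the end. Result is independent of summation order, which is
    what restores bitwise reproducibility in parallel runs."""
    acc = Fraction(0)
    for x, y in zip(xs, ys):
        acc += Fraction(x) * Fraction(y)  # every float is an exact rational
    return float(acc)  # single correctly rounded conversion

# exact where naive float accumulation loses the small terms entirely
print(exact_dot([1e100, 1.0, -1e100, 1.0], [1.0, 1.0, 1.0, 1.0]))  # 2.0
```

A naive left-to-right float sum of the same products returns 1.0, because adding 1.0 to 1e100 is a no-op in double precision.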
In a second part, we explore important performance tuning considerations. We
identify latency and memory bandwidth as the main performance indicators of our
routines. Based on these, we propose a parallel performance model that predicts
the execution time of algorithms implemented in Feltor and test our model on a
selection of parallel hardware architectures. We are able to predict the
execution time with a relative error of less than 25% for problem sizes between
0.1 and 1000 MB. Finally, we find that the product of latency and bandwidth
gives a minimum array size per compute node to achieve a scaling efficiency
above 50% (both strong and weak).
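A latency-bandwidth performance model of the kind described here can be written down directly. The constants below are hypothetical illustration values, not Feltor's fitted parameters.

```python
def predicted_time(bytes_moved, latency_s, bandwidth_bps):
    """Latency-bandwidth model: T = t_lat + V / B for a memory-bound routine
    that moves V bytes on a device with launch latency t_lat and bandwidth B."""
    return latency_s + bytes_moved / bandwidth_bps

def min_bytes_for_50pct_efficiency(latency_s, bandwidth_bps):
    """Scaling efficiency reaches 50% when transfer time equals latency,
    i.e. at the array size V = t_lat * B per compute node."""
    return latency_s * bandwidth_bps

# hypothetical device: 5 us launch latency, 500 GB/s memory bandwidth
print(predicted_time(100e6, 5e-6, 500e9))           # ~0.000205 s for 100 MB
print(min_bytes_for_50pct_efficiency(5e-6, 500e9))  # ~2.5 MB per node
```

Small problems are latency-dominated and large ones bandwidth-dominated, which is why the latency-bandwidth product sets the minimum array size per node quoted above.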
Quantum Monte Carlo with very large multideterminant wavefunctions
An algorithm to compute efficiently the first two derivatives of (very) large
multideterminant wavefunctions for quantum Monte Carlo calculations is
presented. The calculation of determinants and their derivatives is performed
using the Sherman-Morrison formula for updating the inverse Slater matrix. An
improved implementation based on the reduction of the number of column
substitutions and on a very efficient implementation of the calculation of the
scalar products involved is presented. It is emphasized that multideterminant
expansions in general contain a large number of identical spin-specific
determinants: for typical configuration interaction-type wavefunctions, the
number of unique spin-specific determinants with a non-negligible weight in the
expansion is of order the square root of the total number of determinants. We
show that a careful implementation of the calculation of the contributions that
depend on the total number of determinants can make this step negligible enough
that, in practice, the algorithm scales as the number of unique spin-specific
determinants over a wide range of total numbers of determinants (here,
up to about one million), thus greatly reducing the total
computational cost. Finally, a new truncation scheme for the multideterminant
expansion is proposed so that larger expansions can be considered without
increasing the computational time. The algorithm is illustrated with
all-electron Fixed-Node Diffusion Monte Carlo calculations of the total energy
of the chlorine atom. Calculations using a trial wavefunction including about
750 000 determinants, with a computational increase of a factor of 400 compared to a single-determinant calculation, are shown to be feasible.
Comment: 9 pages, 3 figures
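The Sherman-Morrison column update at the heart of such algorithms can be sketched as follows. This is a generic NumPy illustration, not the authors' code; the function and variable names are made up.

```python
import numpy as np

def update_inverse_column(Ainv, new_col, old_col, j):
    """Sherman-Morrison update of A^{-1} when column j of A is replaced.
    Writing A' = A + u e_j^T with u = new_col - old_col gives
        A'^{-1} = A^{-1} - (A^{-1} u) (row j of A^{-1}) / (1 + (A^{-1} u)[j]),
    an O(n^2) update instead of an O(n^3) re-inversion per substitution."""
    u = new_col - old_col
    w = Ainv @ u
    return Ainv - np.outer(w, Ainv[j]) / (1.0 + w[j])
```

The scalar `1 + w[j]` is also the determinant ratio det(A')/det(A), so the same update yields the new determinant value cheaply; reducing the number of such column substitutions is exactly the optimization the abstract describes.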
Ara: A 1 GHz+ Scalable and Energy-Efficient RISC-V Vector Processor with Multi-Precision Floating Point Support in 22 nm FD-SOI
In this paper, we present Ara, a 64-bit vector processor based on the version
0.5 draft of RISC-V's vector extension, implemented in GlobalFoundries 22FDX
FD-SOI technology. Ara's microarchitecture is scalable, as it is composed of a
set of identical lanes, each containing part of the processor's vector register
file and functional units. It achieves up to 97% FPU utilization when running a
256 x 256 double-precision matrix multiplication on sixteen lanes. Ara runs at
more than 1 GHz in the typical corner (TT/0.80 V/25 °C), achieving a performance
of up to 33 DP-GFLOPS. In terms of energy efficiency, Ara achieves up to 41
DP-GFLOPS/W under the same conditions, which is slightly superior to similar
vector processors found in the literature. An analysis of several vectorizable
linear algebra computation kernels for a range of different matrix and vector
sizes gives insight into performance limitations and bottlenecks for vector
processors and outlines directions to maintain high energy efficiency even for
small matrix sizes where the vector architecture achieves suboptimal
utilization of the available FPUs.
Comment: 13 pages. Accepted for publication in IEEE Transactions on Very Large Scale Integration (VLSI) Systems
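As a back-of-envelope check on these figures: assuming each lane retires one double-precision FMA (two flops) per cycle, which is an assumption since the abstract does not state the per-lane throughput, sixteen lanes at 1 GHz give a 32 DP-GFLOPS peak.

```python
def peak_dp_gflops(lanes, freq_ghz, flops_per_lane_per_cycle=2):
    """Peak throughput if every lane retires one DP FMA (= 2 flops) per cycle;
    the per-lane throughput is an assumption, not stated in the abstract."""
    return lanes * flops_per_lane_per_cycle * freq_ghz

print(peak_dp_gflops(16, 1.0))  # 32.0 DP-GFLOPS peak at exactly 1 GHz
```

Under this assumption, 33 DP-GFLOPS at roughly 97% FPU utilization is consistent with a clock slightly above 1 GHz, matching the reported operating point.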
Lattice QCD with Domain Decomposition on Intel Xeon Phi Co-Processors
The gap between the cost of moving data and the cost of computing continues
to grow, making it ever harder to design iterative solvers on extreme-scale
architectures. This problem can be alleviated by alternative algorithms that
reduce the amount of data movement. We investigate this in the context of
Lattice Quantum Chromodynamics and implement such an alternative solver
algorithm, based on domain decomposition, on Intel Xeon Phi co-processor (KNC)
clusters. We demonstrate close-to-linear on-chip scaling to all 60 cores of the
KNC. With a mix of single- and half-precision the domain-decomposition method
sustains 400-500 Gflop/s per chip. Compared to an optimized KNC implementation
of a standard solver [1], our full multi-node domain-decomposition solver
strong-scales to more nodes and reduces the time-to-solution by a factor of 5.
Comment: 12 pages, 7 figures, presented at Supercomputing 2014, November 16-21, 2014, New Orleans, Louisiana, USA; speaker Simon Heybrock. SC '14 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 69-80, IEEE Press, Piscataway, NJ, USA (c)201
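The domain-decomposition idea, doing more local work in order to move less data, can be sketched with a toy non-overlapping block-Jacobi solver. This is a generic Python/NumPy illustration on a 1D Laplacian, not the paper's Schwarz-preconditioned lattice Dirac solver; all names and sizes are made up.

```python
import numpy as np

def poisson_1d(n):
    """Tridiagonal 1D Laplacian with Dirichlet boundaries (a stand-in test
    matrix, far simpler than the lattice Dirac operator in the paper)."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def block_jacobi_solve(A, b, nblocks, iters=500):
    """Non-overlapping domain decomposition: every subdomain solves its own
    diagonal block against the current global residual, with no communication
    inside the inner loop -- trading extra (cheap) local iterations for less
    (expensive) data movement."""
    n = len(b)
    bounds = np.linspace(0, n, nblocks + 1).astype(int)
    x = np.zeros(n)
    for _ in range(iters):
        r = b - A @ x  # the only step needing global data exchange
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            x[lo:hi] += np.linalg.solve(A[lo:hi, lo:hi], r[lo:hi])  # local solve
    return x
```

The trade-off is visible even in this toy: each iteration is communication-light, and the method pays with more iterations than a global solver would need.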
Vector coprocessor sharing techniques for multicores: performance and energy gains
Vector Processors (VPs) created the breakthroughs needed for the emergence of computational science many years ago. All commercial computing architectures on the market today contain some form of vector or SIMD processing.
Many high-performance and embedded applications, often dealing with streams of data, cannot efficiently utilize dedicated vector processors, for various reasons: a limited percentage of sustained vector code due to substantial flow control; inherently small parallelism or the frequent involvement of operating-system tasks; vector lengths that vary across applications or within a single application; and data dependencies within short instruction sequences, a problem further exacerbated without loop unrolling or other compiler optimizations. Additionally, existing rigid SIMD architectures cannot efficiently tolerate dynamic application environments with many cores that may require runtime adjustment of assigned vector resources in order to operate at desired energy/performance levels.
To simultaneously alleviate these drawbacks of rigid lane-based VP architectures, while also freeing on-chip real estate for other design choices, the first part of this research proposes three architectural contexts for the implementation of a shared vector coprocessor in multicore processors. Sharing an expensive resource among multiple cores increases the efficiency of the functional units and the overall system throughput. The second part of the dissertation covers the evaluation and characterization of the three proposed shared vector architectures from the performance and power perspectives on an FPGA (Field-Programmable Gate Array) prototype. The third part of this work introduces performance and power estimation models based on observations deduced from the experimental results. The results show the opportunity to adaptively adjust the number of vector lanes assigned to individual cores or processing threads in order to minimize various energy-performance metrics on modern vector-capable multicore processors that run applications with dynamic workloads. Therefore, the fourth part of this research focuses on the development of a fine-to-coarse-grain power management technique and a relevant adaptive hardware/software infrastructure which dynamically adjusts the assigned VP resources (the number of vector lanes) in order to minimize energy consumption for applications with dynamic workloads. Finally, to remove the inherent limitations imposed by FPGA technologies, the fifth part of this work implements an ASIC (Application-Specific Integrated Circuit) version of the shared VP for precise performance-energy studies of high-performance vector processing in multicore environments.
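The lane-allocation decision described here can be sketched as a small optimization. Every constant and the cost model below are hypothetical illustration values, not measurements from the dissertation: delay follows an Amdahl-style law in the lane count, power grows linearly with lanes, and the metric is the energy-delay product (EDP).

```python
def best_lane_count(candidates, serial=0.1, p_static=0.2, p_lane=0.1, work=1024.0):
    """Pick the vector-lane allocation minimizing the energy-delay product.
    Toy model (all constants hypothetical): delay = work*(s + (1-s)/L),
    power = p_static + p_lane*L, EDP = power * delay^2."""
    def edp(lanes):
        delay = work * (serial + (1.0 - serial) / lanes)
        power = p_static + p_lane * lanes
        return power * delay * delay
    return min(candidates, key=edp)

# more lanes is not always better once per-lane power meets Amdahl's law
print(best_lane_count([1, 2, 4, 8, 16, 32]))  # 16 under these toy constants
```

The interesting property, which motivates runtime adjustment, is that the optimum moves with the workload: a more serial application or a higher per-lane power cost shifts the best allocation toward fewer lanes.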