Search CORE

1,376 research outputs found

Transformations of High-Level Synthesis Codes for High-Performance Computing

Author: Besta Maciej
Hoefler Torsten
Licht Johannes de Fine
Meierhans Simon
Publication venue
Publication date: 29/10/2019
Field of study

Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increases data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers, to tap into the performance potential offered by specialized hardware architectures using HLS

arXiv.org e-Print Archive

Repository for Publications and Research Data

Timing Measurement Platform for Arbitrary Black-Box Circuits Based on Transition Probability

Author: Cheung PYK
Wong JSJ
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/12/2013
Field of study

Crossref

Spiral - Imperial College Digital Repository

Theoretical and algorithmic approaches to field-programmable gate array partitioning

Author: Plaut Barbara Catherine
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/12/1999
Field of study

Many practical problems dealing with the design of Very Large Scale Integrated (VLSI) circuits can be modeled as graphs in which vertices represent components of the circuit and edges represent a relationship between these components. When expressed as graphs, these problems can then often be solved using graph theoretic methods. Unfortunately, many such problems are NP-complete, hence no practical exact solutions are known to exist. In this dissertation, we study NP-complete problems taken from the realm of partitioning for Field-Programmable Gate Arrays (FPGAs). We adopt a two-pronged approach of theory and practice, developing practical heuristics driven by theoretical study. The theoretical approach is motivated by well-quasi-order (WQO) theory, which can be used to show, among other things, that when some hard problems have fixed parameters, polynomial-time solutions exist. This is of significance in the area of FPGA partitioning, in which practical problems are often characterized by fixed parameter instances. WQO techniques are not generally practical, however, and we develop new methods to solve several important problems in VLSI that are not even amenable to WQO techniques. We begin with a representative partitioning problem, Min Degree Graph Partition (MDGP), the fixed-parameter version of which is closed under the immersion order. \Ve show that the obstruction set ( set of immersion minimal elements) for this problem is computable; we prove both upper and lower bounds on the obstruction set size; and we completely characterize all fixed-parameter MDGP simple tree obstructions. WQO theory tells us only that fixed-parameter MDGP is solvable in (high-degree) polynomial time. We attack the problem using what we refer to as kd-candidate subsets, culminating in linear-time decision and search algorithms. The kd-candidate subset method also paves the way for an efficient heuristic for the FPGA Minimization problem. We then broaden our scope to incorporate delay minimization into FPGA partitioning. We develop, analyze and test a novel method called critical path compression, inspired in part by compiler optimization techniques. We then look at a variety of generalizations of MDGP. Some of these problems are not immersion closed; others are not even defined in a way that WQO theory applies. However, almost all of them are efficiently solvable via the kd-candidate subset approach. Interspersed in these results are many refinements of what is known about the complexity of these problems. We also discuss a few other solution strategies, and present many open problems

University of Tennessee, Knoxville: Trace

Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis

Author: Aggarwal A.
Anderson E.
Anderson M.
D'Alberto P.
Del M.
Demmel J.
Jia-Wei H.
Kumar V. B. Y.
Lavin C.
Lin C. Y.
Moss D. J.
Solomonik E.
Vanhoucke V.
Wu E.
Zhou H.
Zhuo L.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/02/2020
Field of study

Data movement is the dominating factor affecting performance and energy in modern computing systems. Consequently, many algorithms have been developed to minimize the number of I/O operations for common computing patterns. Matrix multiplication is no exception, and lower bounds have been proven and implemented both for shared and distributed memory systems. Reconfigurable hardware platforms are a lucrative target for I/O minimizing algorithms, as they offer full control of memory accesses to the programmer. While bounds developed in the context of fixed architectures still apply to these platforms, the spatially distributed nature of their computational and memory resources requires a decentralized approach to optimize algorithms for maximum hardware utilization. We present a model to optimize matrix multiplication for FPGA platforms, simultaneously targeting maximum performance and minimum off-chip data movement, within constraints set by the hardware. We map the model to a concrete architecture using a high-level synthesis tool, maintaining a high level of abstraction, allowing us to support arbitrary data types, and enables maintainability and portability across FPGA devices. Kernels generated from our architecture are shown to offer competitive performance in practice, scaling with both compute and memory resources. We offer our design as an open source project to encourage the open development of linear algebra and I/O minimizing algorithms on reconfigurable hardware platforms

arXiv.org e-Print Archive

Repository for Publications and Research Data

Crossref

Multiobjective Simulation Optimization Using Enhanced Evolutionary Algorithm Approaches

Author: Eskandari Hamidreza
Publication venue: 'Information Bulletin on Variable Stars (IBVS)'
Publication date: 01/01/2006
Field of study

In today\u27s competitive business environment, a firm\u27s ability to make the correct, critical decisions can be translated into a great competitive advantage. Most of these critical real-world decisions involve the optimization not only of multiple objectives simultaneously, but also conflicting objectives, where improving one objective may degrade the performance of one or more of the other objectives. Traditional approaches for solving multiobjective optimization problems typically try to scalarize the multiple objectives into a single objective. This transforms the original multiple optimization problem formulation into a single objective optimization problem with a single solution. However, the drawbacks to these traditional approaches have motivated researchers and practitioners to seek alternative techniques that yield a set of Pareto optimal solutions rather than only a single solution. The problem becomes much more complicated in stochastic environments when the objectives take on uncertain (or noisy ) values due to random influences within the system being optimized, which is the case in real-world environments. Moreover, in stochastic environments, a solution approach should be sufficiently robust and/or capable of handling the uncertainty of the objective values. This makes the development of effective solution techniques that generate Pareto optimal solutions within these problem environments even more challenging than in their deterministic counterparts. Furthermore, many real-world problems involve complicated, black-box objective functions making a large number of solution evaluations computationally- and/or financially-prohibitive. This is often the case when complex computer simulation models are used to repeatedly evaluate possible solutions in search of the best solution (or set of solutions). Therefore, multiobjective optimization approaches capable of rapidly finding a diverse set of Pareto optimal solutions would be greatly beneficial. This research proposes two new multiobjective evolutionary algorithms (MOEAs), called fast Pareto genetic algorithm (FPGA) and stochastic Pareto genetic algorithm (SPGA), for optimization problems with multiple deterministic objectives and stochastic objectives, respectively. New search operators are introduced and employed to enhance the algorithms\u27 performance in terms of converging fast to the true Pareto optimal frontier while maintaining a diverse set of nondominated solutions along the Pareto optimal front. New concepts of solution dominance are defined for better discrimination among competing solutions in stochastic environments. SPGA uses a solution ranking strategy based on these new concepts. Computational results for a suite of published test problems indicate that both FPGA and SPGA are promising approaches. The results show that both FPGA and SPGA outperform the improved nondominated sorting genetic algorithm (NSGA-II), widely-considered benchmark in the MOEA research community, in terms of fast convergence to the true Pareto optimal frontier and diversity among the solutions along the front. The results also show that FPGA and SPGA require far fewer solution evaluations than NSGA-II, which is crucial in computationally-expensive simulation modeling applications

University of Central Florida (UCF): STARS (Showcase of Text, Archives, Research & Scholarship)

Compressive Sensing Using Iterative Hard Thresholding with Low Precision Data Representation: Theory and Applications

Author: Alistarh Dan
Gürel Nezihe Merve
Kara Kaan
Lemmin Thomas
Püschel Markus
Smith Tyler
Stojanov Alen
Zhang Ce
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2020
Field of study

Modern scientific instruments produce vast amounts of data, which can overwhelm the processing ability of computer systems. Lossy compression of data is an intriguing solution, but comes with its own drawbacks, such as potential signal loss, and the need for careful optimization of the compression ratio. In this work, we focus on a setting where this problem is especially acute: compressive sensing frameworks for interferometry and medical imaging. We ask the following question: can the precision of the data representation be lowered for all inputs, with recovery guarantees and practical performance? Our first contribution is a theoretical analysis of the normalized Iterative Hard Thresholding (IHT) algorithm when all input data, meaning both the measurement matrix and the observation vector are quantized aggressively. We present a variant of low precision normalized {IHT} that, under mild conditions, can still provide recovery guarantees. The second contribution is the application of our quantization framework to radio astronomy and magnetic resonance imaging. We show that lowering the precision of the data can significantly accelerate image recovery. We evaluate our approach on telescope data and samples of brain images using CPU and FPGA implementations achieving up to a 9x speed-up with negligible loss of recovery quality.Comment: 19 pages, 5 figures, 1 table, in IEEE Transactions on Signal Processin

arXiv.org e-Print Archive

Crossref

IST Austria: PubRep (Institute of Science and Technology)