Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS.
Theoretical and algorithmic approaches to field-programmable gate array partitioning
Many practical problems dealing with the design of Very Large Scale Integrated (VLSI) circuits can be modeled as graphs in which vertices represent components of the circuit and edges represent a relationship between these components. When expressed as graphs, these problems can then often be solved using graph theoretic methods. Unfortunately, many such problems are NP-complete, hence no practical exact solutions are known to exist.
In this dissertation, we study NP-complete problems taken from the realm of partitioning for Field-Programmable Gate Arrays (FPGAs). We adopt a two-pronged approach of theory and practice, developing practical heuristics driven by theoretical study.
The theoretical approach is motivated by well-quasi-order (WQO) theory, which can be used to show, among other things, that when some hard problems have fixed parameters, polynomial-time solutions exist. This is of significance in the area of FPGA partitioning, in which practical problems are often characterized by fixed parameter instances. WQO techniques are not generally practical, however, and we develop new methods to solve several important problems in VLSI that are not even amenable to WQO techniques.
We begin with a representative partitioning problem, Min Degree Graph Partition (MDGP), the fixed-parameter version of which is closed under the immersion order. We show that the obstruction set (set of immersion minimal elements) for this problem is computable; we prove both upper and lower bounds on the obstruction set size; and we completely characterize all fixed-parameter MDGP simple tree obstructions.
WQO theory tells us only that fixed-parameter MDGP is solvable in (high-degree) polynomial time. We attack the problem using what we refer to as kd-candidate subsets, culminating in linear-time decision and search algorithms. The kd-candidate subset method also paves the way for an efficient heuristic for the FPGA Minimization problem.
We then broaden our scope to incorporate delay minimization into FPGA partitioning. We develop, analyze and test a novel method called critical path compression, inspired in part by compiler optimization techniques. We then look at a variety of generalizations of MDGP. Some of these problems are not immersion closed; others are not even defined in a way that WQO theory applies. However, almost all of them are efficiently solvable via the kd-candidate subset approach.
Interspersed in these results are many refinements of what is known about the complexity of these problems. We also discuss a few other solution strategies, and present many open problems.
Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis
Data movement is the dominating factor affecting performance and energy in
modern computing systems. Consequently, many algorithms have been developed to
minimize the number of I/O operations for common computing patterns. Matrix
multiplication is no exception, and lower bounds have been proven and
implemented both for shared and distributed memory systems. Reconfigurable
hardware platforms are a lucrative target for I/O minimizing algorithms, as
they offer full control of memory accesses to the programmer. While bounds
developed in the context of fixed architectures still apply to these platforms,
the spatially distributed nature of their computational and memory resources
requires a decentralized approach to optimize algorithms for maximum hardware
utilization. We present a model to optimize matrix multiplication for FPGA
platforms, simultaneously targeting maximum performance and minimum off-chip
data movement, within constraints set by the hardware. We map the model to a
concrete architecture using a high-level synthesis tool, maintaining a high
level of abstraction, allowing us to support arbitrary data types, and enabling
maintainability and portability across FPGA devices. Kernels generated from our
architecture are shown to offer competitive performance in practice, scaling
with both compute and memory resources. We offer our design as an open source
project to encourage the open development of linear algebra and I/O minimizing
algorithms on reconfigurable hardware platforms.
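The idea of keeping a block of the output in fast memory while streaming the inputs past it can be illustrated in software. The following is a minimal sketch of generic tiled matrix multiplication, not the architecture described in the abstract; the function name `matmul_tiled` and the `tile` parameter are hypothetical and chosen only for illustration.

```python
def matmul_tiled(A, B, tile):
    """Multiply A (n x k) by B (k x m) one output tile at a time.

    Each tile of C is accumulated completely before moving on, so it is
    written back only once. This is the data-reuse pattern that
    I/O-minimizing schedules rely on (illustrative sketch only).
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile row offset in C
        for j0 in range(0, m, tile):      # tile column offset in C
            for p in range(k):            # stream over the shared dimension
                a_col = [A[i][p] for i in range(i0, min(i0 + tile, n))]
                b_row = B[p][j0:j0 + tile]
                # Rank-1 update of the resident C tile.
                for di, a in enumerate(a_col):
                    row = C[i0 + di]
                    for dj, b in enumerate(b_row):
                        row[j0 + dj] += a * b
    return C
```

For each output tile, only one column slice of A and one row slice of B are loaded per step of the shared dimension, so larger tiles reduce the number of input loads per output element.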
Multiobjective Simulation Optimization Using Enhanced Evolutionary Algorithm Approaches
In today's competitive business environment, a firm's ability to make the correct, critical decisions can be translated into a great competitive advantage. Most of these critical real-world decisions involve the optimization not only of multiple objectives simultaneously, but also conflicting objectives, where improving one objective may degrade the performance of one or more of the other objectives. Traditional approaches for solving multiobjective optimization problems typically try to scalarize the multiple objectives into a single objective. This transforms the original multiple optimization problem formulation into a single objective optimization problem with a single solution. However, the drawbacks to these traditional approaches have motivated researchers and practitioners to seek alternative techniques that yield a set of Pareto optimal solutions rather than only a single solution. The problem becomes much more complicated in stochastic environments when the objectives take on uncertain (or noisy) values due to random influences within the system being optimized, which is the case in real-world environments. Moreover, in stochastic environments, a solution approach should be sufficiently robust and/or capable of handling the uncertainty of the objective values. This makes the development of effective solution techniques that generate Pareto optimal solutions within these problem environments even more challenging than in their deterministic counterparts. Furthermore, many real-world problems involve complicated, black-box objective functions making a large number of solution evaluations computationally- and/or financially-prohibitive. This is often the case when complex computer simulation models are used to repeatedly evaluate possible solutions in search of the best solution (or set of solutions). Therefore, multiobjective optimization approaches capable of rapidly finding a diverse set of Pareto optimal solutions would be greatly beneficial.
This research proposes two new multiobjective evolutionary algorithms (MOEAs), called fast Pareto genetic algorithm (FPGA) and stochastic Pareto genetic algorithm (SPGA), for optimization problems with multiple deterministic objectives and stochastic objectives, respectively. New search operators are introduced and employed to enhance the algorithms' performance in terms of converging fast to the true Pareto optimal frontier while maintaining a diverse set of nondominated solutions along the Pareto optimal front. New concepts of solution dominance are defined for better discrimination among competing solutions in stochastic environments. SPGA uses a solution ranking strategy based on these new concepts. Computational results for a suite of published test problems indicate that both FPGA and SPGA are promising approaches. The results show that both FPGA and SPGA outperform the improved nondominated sorting genetic algorithm (NSGA-II), a widely considered benchmark in the MOEA research community, in terms of fast convergence to the true Pareto optimal frontier and diversity among the solutions along the front. The results also show that FPGA and SPGA require far fewer solution evaluations than NSGA-II, which is crucial in computationally-expensive simulation modeling applications.
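For readers unfamiliar with the terminology, the standard deterministic notion of Pareto dominance can be sketched as follows. This is the textbook definition for minimization problems, not the new stochastic dominance concepts the abstract attributes to SPGA; the function names are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def nondominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Return the points that no other point dominates (the Pareto front
    of the given finite set), preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

A point never dominates itself under this definition, so the filter needs no special case for comparing a point against itself.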
Compressive Sensing Using Iterative Hard Thresholding with Low Precision Data Representation: Theory and Applications
Modern scientific instruments produce vast amounts of data, which can
overwhelm the processing ability of computer systems. Lossy compression of data
is an intriguing solution, but comes with its own drawbacks, such as potential
signal loss, and the need for careful optimization of the compression ratio. In
this work, we focus on a setting where this problem is especially acute:
compressive sensing frameworks for interferometry and medical imaging. We ask
the following question: can the precision of the data representation be lowered
for all inputs, with recovery guarantees and practical performance? Our first
contribution is a theoretical analysis of the normalized Iterative Hard
Thresholding (IHT) algorithm when all input data, meaning both the measurement
matrix and the observation vector, are quantized aggressively. We present a
variant of low precision normalized IHT that, under mild conditions, can
still provide recovery guarantees. The second contribution is the application
of our quantization framework to radio astronomy and magnetic resonance
imaging. We show that lowering the precision of the data can significantly
accelerate image recovery. We evaluate our approach on telescope data and
samples of brain images using CPU and FPGA implementations achieving up to a 9x
speed-up with negligible loss of recovery quality.
Comment: 19 pages, 5 figures, 1 table, in IEEE Transactions on Signal Processing
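As a rough illustration of the building blocks named in the abstract, the sketch below shows one plain IHT update with a fixed step size and a simple uniform quantizer. It is an assumption-laden simplification: the abstract concerns the normalized IHT variant with its own quantization scheme, which is not reproduced here, and the names `hard_threshold`, `quantize`, `iht_step`, `mu`, and `scale` are hypothetical.

```python
def hard_threshold(x, s):
    """Keep the s largest-magnitude entries of x and zero the rest."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:max(s, 0)])
    return [x[i] if i in keep else 0.0 for i in range(len(x))]


def quantize(x, bits, scale):
    """Uniform symmetric quantizer on [-scale, scale] (assumes bits >= 2)."""
    levels = 2 ** (bits - 1) - 1
    out = []
    for v in x:
        clipped = max(-scale, min(scale, v))
        out.append(round(clipped / scale * levels) * scale / levels)
    return out


def iht_step(A, y, x, s, mu):
    """One update x <- H_s(x + mu * A^T (y - A x)) with fixed step mu."""
    residual = [yi - sum(a * xj for a, xj in zip(row, x))
                for row, yi in zip(A, y)]
    grad = [sum(A[i][j] * residual[i] for i in range(len(A)))
            for j in range(len(x))]
    return hard_threshold([xj + mu * gj for xj, gj in zip(x, grad)], s)
```

In a low-precision setting, `quantize` would be applied to the rows of `A` and to `y` before iterating `iht_step`; how aggressively they can be quantized while still recovering the signal is the question the abstract addresses.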