866 research outputs found
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS
Tensor Computation: A New Framework for High-Dimensional Problems in EDA
Many critical EDA problems suffer from the curse of dimensionality, i.e. the
very fast-scaling computational burden produced by large number of parameters
and/or unknown variables. This phenomenon may be caused by multiple spatial or
temporal factors (e.g. 3-D field solvers discretizations and multi-rate circuit
simulation), nonlinearity of devices and circuits, large number of design or
optimization parameters (e.g. full-chip routing/placement and circuit sizing),
or extensive process variations (e.g. variability/reliability analysis and
design for manufacturability). The computational challenges generated by such
high dimensional problems are generally hard to handle efficiently with
traditional EDA core algorithms that are based on matrix and vector
computation. This paper presents "tensor computation" as an alternative general
framework for the development of efficient EDA algorithms and tools. A tensor
is a high-dimensional generalization of a matrix and a vector, and is a natural
choice for both storing and solving efficiently high-dimensional EDA problems.
This paper gives a basic tutorial on tensors, demonstrates some recent examples
of EDA applications (e.g., nonlinear circuit modeling and high-dimensional
uncertainty quantification), and suggests further open EDA problems where the
use of tensor computation could be of advantage.Comment: 14 figures. Accepted by IEEE Trans. CAD of Integrated Circuits and
System
Tools and Algorithms for the Construction and Analysis of Systems
This open access two-volume set constitutes the proceedings of the 27th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS 2021, which was held during March 27 – April 1, 2021, as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2021. The conference was planned to take place in Luxembourg and changed to an online format due to the COVID-19 pandemic. The total of 41 full papers presented in the proceedings was carefully reviewed and selected from 141 submissions. The volume also contains 7 tool papers; 6 Tool Demo papers, 9 SV-Comp Competition Papers. The papers are organized in topical sections as follows: Part I: Game Theory; SMT Verification; Probabilities; Timed Systems; Neural Networks; Analysis of Network Communication. Part II: Verification Techniques (not SMT); Case Studies; Proof Generation/Validation; Tool Papers; Tool Demo Papers; SV-Comp Tool Competition Papers
High performance graph analysis on parallel architectures
PhD ThesisOver the last decade pharmacology has been developing computational
methods to enhance drug development and testing. A computational
method called network pharmacology uses graph analysis
tools to determine protein target sets that can lead on better targeted
drugs for diseases as Cancer. One promising area of network-based
pharmacology is the detection of protein groups that can produce
better e ects if they are targeted together by drugs. However, the
e cient prediction of such protein combinations is still a bottleneck
in the area of computational biology.
The computational burden of the algorithms used by such protein
prediction strategies to characterise the importance of such proteins
consists an additional challenge for the eld of network pharmacology.
Such computationally expensive graph algorithms as the all pairs
shortest path (APSP) computation can a ect the overall drug discovery
process as needed network analysis results cannot be given on
time. An ideal solution for these highly intensive computations could
be the use of super-computing. However, graph algorithms have datadriven
computation dictated by the structure of the graph and this
can lead to low compute capacity utilisation with execution times
dominated by memory latency.
Therefore, this thesis seeks optimised solutions for the real-world
graph problems of critical node detection and e ectiveness characterisation
emerged from the collaboration with a pioneer company in the
eld of network pharmacology as part of a Knowledge Transfer Partnership
(KTP) / Secondment (KTS). In particular, we examine how
genetic algorithms could bene t the prediction of protein complexes
where their removal could produce a more e ective 'druggable' impact.
Furthermore, we investigate how the problem of all pairs shortest
path (APSP) computation can be bene ted by the use of emerging
parallel hardware architectures as GPU- and FPGA- desktop-based
accelerators.
In particular, we address the problem of critical node detection with
the development of a heuristic search method. It is based on a genetic
algorithm that computes optimised node combinations where their removal
causes greater impact than common impact analysis strategies.
Furthermore, we design a general pattern for parallel network analysis
on multi-core architectures that considers graph's embedded properties.
It is a divide and conquer approach that decomposes a graph
into smaller subgraphs based on its strongly connected components
and computes the all pairs shortest paths concurrently on GPU. Furthermore,
we use linear algebra to design an APSP approach based
on the BFS algorithm. We use algebraic expressions to transform the
problem of path computation to multiple independent matrix-vector
multiplications that are executed concurrently on FPGA. Finally, we
analyse how the optimised solutions of perturbation analysis and parallel
graph processing provided in this thesis will impact the drug
discovery process.This research was part of a Knowledge Transfer Partnership (KTP)
and Knowledge Transfer Secondment (KTS) between e-therapeutics
PLC and Newcastle University. It was supported as a collaborative
project by e-therapeutics PLC and Technology Strategy boar
Advances in Functional Decomposition: Theory and Applications
Functional decomposition aims at finding efficient representations for Boolean functions. It is used in many applications, including multi-level logic synthesis, formal verification, and testing.
This dissertation presents novel heuristic algorithms for functional decomposition. These algorithms take advantage of suitable representations of the Boolean functions in order to be efficient.
The first two algorithms compute simple-disjoint and disjoint-support decompositions. They are based on representing the target function by a Reduced Ordered Binary Decision Diagram (BDD). Unlike other BDD-based algorithms, the presented ones can deal with larger target functions and produce more decompositions without requiring expensive manipulations of the representation, particularly BDD reordering.
The third algorithm also finds disjoint-support decompositions, but it is based on a technique which integrates circuit graph analysis and BDD-based decomposition. The combination of the two approaches results in an algorithm which is more robust than a purely BDD-based one, and that improves both the quality of the results and the running time.
The fourth algorithm uses circuit graph analysis to obtain non-disjoint decompositions. We show that the problem of computing non-disjoint decompositions can be reduced to the problem of computing multiple-vertex dominators. We also prove that multiple-vertex dominators can be found in polynomial time. This result is important because there is no known polynomial time algorithm for computing all non-disjoint decompositions of a Boolean function.
The fifth algorithm provides an efficient means to decompose a function at the circuit graph level, by using information derived from a BDD representation. This is done without the expensive circuit re-synthesis normally associated with BDD-based decomposition approaches.
Finally we present two publications that resulted from the many detours we have taken along the winding path of our research
Logic Synthesis for Established and Emerging Computing
Logic synthesis is an enabling technology to realize integrated computing systems, and it entails solving computationally intractable problems through a plurality of heuristic techniques. A recent push toward further formalization of synthesis problems has shown to be very useful toward both attempting to solve some logic problems exactly--which is computationally possible for instances of limited size today--as well as creating new and more powerful heuristics based on problem decomposition. Moreover, technological advances including nanodevices, optical computing, and quantum and quantum cellular computing require new and specific synthesis flows to assess feasibility and scalability. This review highlights recent progress in logic synthesis and optimization, describing models, data structures, and algorithms, with specific emphasis on both design quality and emerging technologies. Example applications and results of novel techniques to established and emerging technologies are reported
Recommended from our members
Hybrid Analog-Digital Co-Processing for Scientific Computation
In the past 10 years computer architecture research has moved to more heterogeneity and less adherence to conventional abstractions. Scientists and engineers hold an unshakable belief that computing holds keys to unlocking humanity's Grand Challenges. Acting on that belief they have looked deeper into computer architecture to find specialized support for their applications. Likewise, computer architects have looked deeper into circuits and devices in search of untapped performance and efficiency. The lines between computer architecture layers---applications, algorithms, architectures, microarchitectures, circuits and devices---have blurred. Against this backdrop, a menagerie of computer architectures are on the horizon, ones that forgo basic assumptions about computer hardware, and require new thinking of how such hardware supports problems and algorithms.
This thesis is about revisiting hybrid analog-digital computing in support of diverse modern workloads. Hybrid computing had extensive applications in early computing history, and has been revisited for small-scale applications in embedded systems. But architectural support for using hybrid computing in modern workloads, at scale and with high accuracy solutions, has been lacking.
I demonstrate solving a variety of scientific computing problems, including stochastic ODEs, partial differential equations, linear algebra, and nonlinear systems of equations, as case studies in hybrid computing. I solve these problems on a system of multiple prototype analog accelerator chips built by a team at Columbia University. On that team I made contributions toward programming the chips, building the digital interface, and validating the chips' functionality. The analog accelerator chip is intended for use in conjunction with a conventional digital host computer.
The appeal and motivation for using an analog accelerator is efficiency and performance, but it comes with limitations in accuracy and problem sizes that we have to work around.
The first problem is how to do problems in this unconventional computation model. Scientific computing phrases problems as differential equations and algebraic equations. Differential equations are a continuous view of the world, while algebraic equations are a discrete one. Prior work in analog computing mostly focused on differential equations; algebraic equations played a minor role in prior work in analog computing. The secret to using the analog accelerator to support modern workloads on conventional computers is that these two viewpoints are interchangeable. The algebraic equations that underlie most workloads can be solved as differential equations,
and differential equations are naturally solvable in the analog accelerator chip. A hybrid analog-digital computer architecture can focus on solving linear and nonlinear algebra problems to support many workloads.
The second problem is how to get accurate solutions using hybrid analog-digital computing. The reason that the analog computation model gives less accurate solutions is it gives up representing numbers as digital binary numbers, and instead uses the full range of analog voltage and current to represent real numbers. Prior work has established that encoding data in analog signals gives an energy efficiency advantage as long as the analog data precision is limited. While the analog accelerator alone may be useful for energy-constrained applications where inputs and outputs are imprecise, we are more interested in using analog in conjunction with digital for precise solutions. This thesis gives novel insight that the trick to do so is to solve nonlinear problems where low-precision guesses are useful for conventional digital algorithms.
The third problem is how to solve large problems using hybrid analog-digital computing. The reason the analog computation model can't handle large problems is it gives up step-by-step discrete-time operation, instead allowing variables to evolve smoothly in continuous time. To make that happen the analog accelerator works by chaining hardware for mathematical operations end-to-end. During computation analog data flows through the hardware with no overheads in control logic and memory accesses. The downside is then the needed hardware size grows alongside problem sizes. While scientific computing researchers have for a long time split large problems into smaller subproblems to fit in digital computer constraints, this thesis is a first attempt to consider these divide-and-conquer algorithms as an essential tool in using the analog model of computation.
As we enter the post-Moore’s law era of computing, unconventional architectures will offer specialized models of computation that uniquely support specific problem types. Two prominent examples are deep neural networks and quantum computers. Recent trends in computer science research show these unconventional architectures will soon have broad adoption. In this thesis I show another specialized, unconventional architecture is to use analog accelerators to solve problems in scientific computing. Computer architecture researchers will discover other important models of computation in the future. This thesis is an example of the discovery process, implementation, and evaluation of how an unconventional architecture supports specialized workloads
Extracting Data-Level Parallelism in High-Level Synthesis for Reconfigurable Architectures
High-Level Synthesis (HLS) tools are a set of algorithms that allow programmers to obtain implementable Hardware Description Language (HDL) code from specifications written high-level, sequential languages such as C, C++, or Java. HLS has allowed programmers to code in their preferred language while still obtaining all the benefits hardware acceleration has to offer without them needing to be intimately familiar with the hardware platform of the accelerator. In this work we summarize and expand upon several of our approaches to improve the automatic memory banking capabilities of HLS tools targeting reconfigurable architectures, namely Field-Programmable Gate Arrays or FPGA\u27s. We explored several approaches to automatically find the optimal partition factor and a usable banking scheme for stencil kernels including a tessellation based approach using multiple families of hyperplanes to do the partitioning which was able to find a better banking factor than current state-of-the-art methods and a graph theory methodology that allowed us to mathematically prove the optimality of our banking solutions. For non-stencil kernels we relaxed some of the conditions in our graph-based model to propose a best-effort solution to arbitrarily reduce memory access conflicts (simultaneous accesses to the same memory bank). We also proposed a non-linear transformation using prime factorization to convert a small subset of non-stencil kernels into stencil memory accesses, allowing us to use all previous work in memory partition to them. Our approaches were able to obtain better results than commercial tools and state-of-the-art algorithms in terms of reduced resource utilization and increased frequency of operation. We were also able to obtain better partition factors for some stencil kernels and usable baking schemes for non-stencil kernels with better performance than any applicable existing algorithm
Tools and Algorithms for the Construction and Analysis of Systems
This open access book constitutes the proceedings of the 28th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS 2022, which was held during April 2-7, 2022, in Munich, Germany, as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022. The 46 full papers and 4 short papers presented in this volume were carefully reviewed and selected from 159 submissions. The proceedings also contain 16 tool papers of the affiliated competition SV-Comp and 1 paper consisting of the competition report. TACAS is a forum for researchers, developers, and users interested in rigorously based tools and algorithms for the construction and analysis of systems. The conference aims to bridge the gaps between different communities with this common interest and to support them in their quest to improve the utility, reliability, exibility, and efficiency of tools and algorithms for building computer-controlled systems
- …