1,718 research outputs found
Using ACL2 to Verify Loop Pipelining in Behavioral Synthesis
Behavioral synthesis involves compiling an Electronic System-Level (ESL)
design into its Register-Transfer Level (RTL) implementation. Loop pipelining
is one of the most critical and complex transformations employed in behavioral
synthesis. Certifying the loop pipelining algorithm is challenging because
there is a huge semantic gap between the input sequential design and the output
pipelined implementation making it infeasible to verify their equivalence with
automated sequential equivalence checking techniques. We discuss our ongoing
effort using ACL2 to certify loop pipelining transformation. The completion of
the proof is work in progress. However, some of the insights developed so far
may already be of value to the ACL2 community. In particular, we discuss the
key invariant we formalized, which is very different from that used in most
pipeline proofs. We discuss the needs for this invariant, its formalization in
ACL2, and our envisioned proof using the invariant. We also discuss some
trade-offs, challenges, and insights developed in course of the project.Comment: In Proceedings ACL2 2014, arXiv:1406.123
Survey on Combinatorial Register Allocation and Instruction Scheduling
Register allocation (mapping variables to processor registers or memory) and
instruction scheduling (reordering instructions to increase instruction-level
parallelism) are essential tasks for generating efficient assembly code in a
compiler. In the last three decades, combinatorial optimization has emerged as
an alternative to traditional, heuristic algorithms for these two tasks.
Combinatorial optimization approaches can deliver optimal solutions according
to a model, can precisely capture trade-offs between conflicting decisions, and
are more flexible at the expense of increased compilation time.
This paper provides an exhaustive literature review and a classification of
combinatorial optimization approaches to register allocation and instruction
scheduling, with a focus on the techniques that are most applied in this
context: integer programming, constraint programming, partitioned Boolean
quadratic programming, and enumeration. Researchers in compilers and
combinatorial optimization can benefit from identifying developments, trends,
and challenges in the area; compiler practitioners may discern opportunities
and grasp the potential benefit of applying combinatorial optimization
On-Chip Transparent Wire Pipelining (invited paper)
Wire pipelining has been proposed as a viable mean to break the discrepancy between decreasing gate delays and increasing wire delays in deep-submicron technologies. Far from being a straightforwardly applicable technique, this methodology requires a number of design modifications in order to insert it seamlessly in the current design flow. In this paper we briefly survey the methods presented by other researchers in the field and then we thoroughly analyze the solutions we recently proposed, ranging from system-level wire pipelining to physical design aspects
Modulo scheduling with integrated register spilling for clustered VLIW architectures
Clustering is a technique to decentralize the design of future wide issue VLIW cores and enable them to meet the technology constraints in terms of cycle time, area and power dissipation. In a clustered design, registers and functional units are grouped in clusters so that new instructions are needed to move data between them. New aggressive instruction scheduling techniques are required to minimize the negative effect of resource clustering and delays in moving data around. In this paper we present a novel software pipelining technique that performs instruction scheduling with reduced register requirements, register allocation, register spilling and inter-cluster communication in a single step. The algorithm uses limited backtracking to reconsider previously taken decisions. This backtracking provides the algorithm with additional possibilities for obtaining high throughput schedules with low spill code requirements for clustered architectures. We show that the proposed approach outperforms previously proposed techniques and that it is very scalable independently of the number of clusters, the number of communication buses and communication latency. The paper also includes an exploration of some parameters in the design of future clustered VLIW cores.Peer ReviewedPostprint (published version
Combining dynamic and static scheduling in high-level synthesis
Field Programmable Gate Arrays (FPGAs) are starting to become mainstream devices for custom computing, particularly deployed in data centres. However, using these FPGA devices requires familiarity with digital design at a low abstraction level. In order to enable software engineers without a hardware background to design custom hardware, high-level synthesis (HLS) tools automatically transform a high-level program, for example in C/C++, into a low-level hardware description.
A central task in HLS is scheduling: the allocation of operations to clock cycles. The classic approach to scheduling is static, in which each operation is mapped to a clock cycle at compile time, but recent years have seen the emergence of dynamic scheduling, in which an operationâs clock cycle is only determined at run-time. Both approaches have their merits: static scheduling can lead to simpler circuitry and more resource sharing, while dynamic scheduling can lead to faster hardware when the computation has a non-trivial control flow.
This thesis proposes a scheduling approach that combines the best of both worlds. My idea is to use existing program analysis techniques in software designs, such as probabilistic analysis and formal verification, to optimize the HLS hardware. First, this thesis proposes a tool named DASS that uses a heuristic-based approach to identify the code regions in the input program that are amenable to static scheduling and synthesises them into statically scheduled components, also known as static islands, leaving the top-level hardware dynamically scheduled. Second, this thesis addresses a problem of this approach: that the analysis of static islands and their dynamically scheduled surroundings are separate, where one treats the other as black boxes. We apply static analysis including dependence analysis between static islands and their dynamically scheduled surroundings to optimize the offsets of static islands for high performance. We also apply probabilistic analysis to estimate the performance of the dynamically scheduled part and use this information to optimize the static islands for high area efficiency. Finally, this thesis addresses the problem of conservatism in using sequential control flow designs which can limit the throughput of the hardware. We show this challenge can be solved by formally proving that certain control flows can be safely parallelised for high performance. This thesis demonstrates how to use automated formal verification to find out-of-order loop pipelining solutions and multi-threading solutions from a sequential program.Open Acces
Indexed Labels for Loop Iteration Dependent Costs
We present an extension to the labelling approach, a technique for lifting
resource consumption information from compiled to source code. This approach,
which is at the core of the annotating compiler from a large fragment of C to
8051 assembly of the CerCo project, looses preciseness when differences arise
as to the cost of the same portion of code, whether due to code transformation
such as loop optimisations or advanced architecture features (e.g. cache). We
propose to address this weakness by formally indexing cost labels with the
iterations of the containing loops they occur in. These indexes can be
transformed during the compilation, and when lifted back to source code they
produce dependent costs.
The proposed changes have been implemented in CerCo's untrusted prototype
compiler from a large fragment of C to 8051 assembly.Comment: In Proceedings QAPL 2013, arXiv:1306.241
Enhancing Predicate Pairing with Abstraction for Relational Verification
Relational verification is a technique that aims at proving properties that
relate two different program fragments, or two different program runs. It has
been shown that constrained Horn clauses (CHCs) can effectively be used for
relational verification by applying a CHC transformation, called predicate
pairing, which allows the CHC solver to infer relations among arguments of
different predicates. In this paper we study how the effects of the predicate
pairing transformation can be enhanced by using various abstract domains based
on linear arithmetic (i.e., the domain of convex polyhedra and some of its
subdomains) during the transformation. After presenting an algorithm for
predicate pairing with abstraction, we report on the experiments we have
performed on over a hundred relational verification problems by using various
abstract domains. The experiments have been performed by using the VeriMAP
transformation and verification system, together with the Parma Polyhedra
Library (PPL) and the Z3 solver for CHCs.Comment: Pre-proceedings paper presented at the 27th International Symposium
on Logic-Based Program Synthesis and Transformation (LOPSTR 2017), Namur,
Belgium, 10-12 October 2017 (arXiv:1708.07854
FeladatfĂŒggĆ felĂ©pĂtĂ©sƱ többprocesszoros cĂ©lrendszerek szintĂ©zis algoritmusainak kutatĂĄsa = Research of synthesis algorithms for special-purpose multiprocessing systems with task-dependent architecture
Ăj mĂłdszert Ă©s egy keretrendszert fejlesztettĂŒnk ki olyan speciĂĄlis többprocesszoros struktĂșra tervezĂ©sĂ©re, amely lehetĆvĂ© teszi a pipeline mƱködtetĂ©st akkor is, ha a feladat-leĂrĂĄsban nincs hatĂ©konyan kihasznĂĄlhatĂł pĂĄrhuzamossĂĄg. A szintĂ©zis egy magas szintƱ nyelven (C, Java, stb.) adott feladatleĂrĂĄsbĂłl indul ki. EzutĂĄn dekompozĂciĂłs algoritmus megfelelĆ szegmenseket kĂ©pez a program alapjĂĄn. A szegmensek kĂvĂĄnt szĂĄma, a szegmenseket megvalĂłsĂtĂł processzorok fĆbb tulajdonsĂĄgai Ă©s a becsĂŒlt kommunikĂĄciĂłs idĆigĂ©nyek megadhatĂłk bemeneti paramĂ©terekkĂ©nt. KedvezĆ pipeline felĂ©pĂtĂ©s cĂ©ljĂĄbĂłl a pipeline adatfolyamok magas szintƱ szintĂ©zisĂ©nek (HLS) mĂłdszertanĂĄt alkalmaztuk. Ezek az eszközök az ĂŒtemezĂ©s Ă©s az allokĂĄciĂł rĂ©vĂ©n kĂsĂ©rlik meg az optimalizĂĄlĂĄst a szegmensekbĆl kĂ©pzett adatfolyam grĂĄfon. EzĂ©rt a kiadĂłdĂł többprocesszoros felĂ©pĂtĂ©s nem egy uniformizĂĄlt processzor-rĂĄcs, hanem a megoldandĂł feladatra formĂĄlt struktĂșra, Ăgy feladatfĂŒggĆnek nevezhetĆ. A mĂłdszer modularitĂĄsa lehetĆvĂ© teszi a dekompozĂciĂłs algoritmusnak Ă©s a HLS eszköznek a cserĂ©jĂ©t, mĂłdosĂtĂĄsĂĄt az alkalmazĂĄsi igĂ©nyektĆl fĂŒggĆen. A mĂłdszer kiĂ©rtĂ©kelĂ©se cĂ©ljĂĄbĂłl olyan HLS eszközt alkalmaztunk, amely a kĂvĂĄnt pipeline ĂșjraindĂtĂĄsi periĂłdust bemeneti adatkĂ©nt tudja kezelni, Ă©s processzorok között egy optimalizĂĄlt idĆosztĂĄsos, arbitrĂĄciĂł-mentes sĂnrendszert hoz lĂ©tre. Ebben a struktĂșrĂĄban a kommunikĂĄciĂł szervezĂ©sĂ©hez nincs szĂŒksĂ©g kĂŒlön szoftver tĂĄmogatĂĄsra, ha a processzorok kĂ©pesek közvetlen adatforgalomra. | A new method and a framework tool has been developed for designing a special multiprocessing structure for making the pipeline function possible as a special parallel processing, even if there is no efficiently exploitable parallelism in the task description. The synthesis starts from a task description written in a high level language (C, Java, etc). A decomposing algorithm generates proper segments of this program. The desired number of the segments, the main properties of the processor set implementing the segments and the estimated communication time-demand can be given as input parameters. For constructing a pipeline structure, the high-level synthesis (HLS) methodology of pipelined datapaths is applied. These tools attempt to optimize by scheduling and allocating the dataflow graph generated from the segments Thus, the resulted structure is not a uniform processor grid, but it is shaped depending on the task, i.e. it can be called task-dependent. The modularity of the method permits the decomposition algorithm and the HLS tool to be replaced by other ones depending on the requirements of the application. For evaluating the method, a specific HLS tool is applied, which can accept the desired pipeline restart time as input parameter, and generates an optimized time shared simple arbitration-free bus system between the processing units. Therefore, there is no need for extra efforts to organize the communication, if the processing units can transfer data directly
Loop splitting for efficient pipelining in high-level synthesis
Loop pipelining is widely adopted as a key optimization method in high-level synthesis (HLS). However, when complex memory dependencies appear in a loop, commercial HLS tools are still not able to maximize pipeline performance. In this paper, we leverage parametric polyhedral analysis to reason about memory dependence patterns that are uncertain (i.e., parameterised by an undetermined variable) and/or nonuniform (i.e., varying between loop iterations). We develop an automated source-to-source code transformation to split the loop into pieces, which are then synthesised by Vivado HLS as the hardware generation back-end. Our technique allows generated loops to run with a minimal interval, automatically inserting statically-determined parametric pipeline breaks at those iterations violating dependencies. Our experiments on seven representative benchmarks show that, compared to default loop pipelining, our parametric loop splitting improves pipeline performance by 4:3 in terms of clock cycles per iteration. The optimized pipelines consume 2:0 as many LUTs, 1:8 as many registers, and 1:1 as many DSP blocks. Hence the area-time product is improved by nearly a factor of 2
- âŠ