435 research outputs found

    C++ coding principles for high-level synthesis

    Abstract. High-level synthesis (HLS) raises the abstraction level of digital integrated circuit design from the traditional register transfer level (RTL) to a behavioural system description level. This methodology offers great advantages, such as increased designer productivity. The adoption of HLS, however, has been slowed by poorly generated RTL code, which often results in lower quality than traditional hand-written RTL. This thesis aims to solve this problem by finding the best programming practices for hardware-oriented C++. A digital downconverter and decimator are designed and implemented with Catapult HLS as a case study, in which different coding practices are tried and the best ones are generalized and presented. The quality of results of this case study is compared against a hand-written RTL design of the same intellectual property block created by other designers. A few examples also demonstrate that small changes in the source code can have a major effect on the generated RTL. Understanding how the HLS tool analyses the source code and executes operations in parallel is found to greatly improve the quality of results in the generated hardware. Also, with a clear target architecture in mind, it is simple to verify the hardware in Catapult analysis views such as the schedule and schematic views. By optimizing the source code, it is possible to generate hardware of similar quality to that of the traditional RTL flow. In this case, the area of the HLS design is about 19% smaller than the RTL design with the same throughput, slightly lower latency, and roughly the same power consumption. (Finnish title: C++ ohjelmointikäytännöt korkean tason synteesiin.)
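    As an illustration of the kind of hardware-oriented C++ the abstract refers to, the following is a minimal sketch of a decimating FIR filter, not code from the thesis. It uses plain C++ rather than any Catapult-specific types, and the names `kTaps`, `kDecimation`, `DecimatingFir`, and `decimate` are hypothetical. The loop bounds are compile-time constants and the state is a fixed-size array, which is the style that typically lets an HLS tool unroll loops and map the delay line to registers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical parameters, chosen for illustration only.
constexpr std::size_t kTaps = 4;        // number of filter taps
constexpr std::size_t kDecimation = 2;  // one output per kDecimation inputs

// Decimating FIR filter with fixed loop bounds and fixed-size state.
class DecimatingFir {
public:
    explicit DecimatingFir(const std::array<int32_t, kTaps>& coeffs)
        : coeffs_(coeffs) {}

    // Consumes one sample; returns true and writes `out` when an output is due.
    bool push(int32_t sample, int32_t& out) {
        assert(phase_ < kDecimation);  // invariant of the phase counter

        // Shift register: constant bounds, so the loop can be fully unrolled.
        for (std::size_t i = kTaps - 1; i > 0; --i) {
            delay_[i] = delay_[i - 1];
        }
        delay_[0] = sample;

        // Only every kDecimation-th sample produces an output.
        phase_ = (phase_ + 1) % kDecimation;
        if (phase_ != 0) {
            return false;
        }

        // Multiply-accumulate over all taps.
        int32_t acc = 0;
        for (std::size_t i = 0; i < kTaps; ++i) {
            acc += coeffs_[i] * delay_[i];
        }
        out = acc;
        return true;
    }

private:
    std::array<int32_t, kTaps> coeffs_;
    std::array<int32_t, kTaps> delay_{};  // zero-initialized delay line
    std::size_t phase_ = 0;
};

// Software-side helper that runs the filter over a whole input vector.
std::vector<int32_t> decimate(const std::vector<int32_t>& in,
                              const std::array<int32_t, kTaps>& coeffs) {
    DecimatingFir fir(coeffs);
    std::vector<int32_t> result;
    for (int32_t sample : in) {
        int32_t y = 0;
        if (fir.push(sample, y)) {
            result.push_back(y);
        }
    }
    return result;
}
```

    Skipping the multiply-accumulate on discarded phases is one example of a small source change that could alter the generated hardware; whether it helps in a given tool and design is something the schedule view would have to confirm.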

    Design of testbed and emulation tools

    The research summarized here concerned the design of testbed and emulation tools that can help project, with reasonable accuracy, the expected performance of highly concurrent computing systems on large, complete applications. Such tools are intended for eventual use by those exploring new concurrent system architectures and organizations, either as users or as designers of such systems. While a range of alternatives was considered, a software-based set of hierarchical tools was chosen to provide maximum flexibility, to ease migration to new computers as technology improves, and to take advantage of the inherent reliability and availability of commercially available computing systems.

    Design of data buffers in field programmable gate arrays

    The design of data buffers for field programmable gate array (FPGA) projects is considered. A new method of buffer design is proposed, based on representing the synchronous dataflow graph in three-dimensional space, optimizing that representation, and describing the result in VHDL. The method yields optimized buffers based either on RAM or on a register pipeline. The derived pipeline buffer can be mapped onto the shift register primitive of the FPGA. The method is built into the experimental SDFCAD framework, which is intended for pipelined datapath synthesis.
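    The two buffer styles named in the abstract can be contrasted with a small behavioural model. This is a sketch in C++ rather than the VHDL the paper targets, it is not the SDFCAD method, and the names `kDepth`, `ShiftRegisterDelay`, and `RamDelay` are hypothetical. Both classes delay a sample stream by `kDepth` samples: the first moves every element on each sample, as a register pipeline does, while the second leaves data in place and advances a single address, as a RAM-based buffer does.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kDepth = 3;  // hypothetical delay, in samples

// Register-pipeline style: every stored value moves one stage per sample.
class ShiftRegisterDelay {
public:
    int push(int sample) {
        const int out = regs_[kDepth - 1];  // oldest value leaves the pipeline
        for (std::size_t i = kDepth - 1; i > 0; --i) {
            regs_[i] = regs_[i - 1];
        }
        regs_[0] = sample;
        return out;
    }

private:
    std::array<int, kDepth> regs_{};  // zero-initialized stages
};

// RAM style: values stay in place and one address pointer advances.
class RamDelay {
public:
    int push(int sample) {
        assert(addr_ < kDepth);        // address always stays in range
        const int out = mem_[addr_];   // read the oldest sample
        mem_[addr_] = sample;          // overwrite it with the newest
        addr_ = (addr_ + 1) % kDepth;  // wrap around at the end
        return out;
    }

private:
    std::array<int, kDepth> mem_{};  // zero-initialized memory
    std::size_t addr_ = 0;
};
```

    Both models produce the same output stream, so the choice between them is one of implementation cost rather than behaviour.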

    Compiling for an Heterogeneous Vector Image Processor

    We present a new compilation strategy, implemented at a small cost, to optimize image applications developed on top of a high-level image processing library for a heterogeneous processor with a vector image processing accelerator. The library provides the semantics of the image computations. The pipelined structure of the accelerator allows whole expressions of dozens of elementary image instructions to be computed at once, but it is constrained in that intermediate image values cannot be extracted. We adapted standard compilation techniques to perform this task automatically. Our strategy is implemented in PIPS, a source-to-source compiler, which greatly reduces the development cost because standard phases are reused and parameterized for the target. Experiments were run on the hardware functional simulator. We compile 1217 cases, from elementary tests to full applications. All but a few are optimal, and most of the exceptions are within a single accelerator call of optimality. Our contributions include: 1) a general low-cost compilation strategy for image processing applications, based on the semantics provided by library calls, which improves locality by an order of magnitude; 2) a specific heuristic to minimize execution time on the target vector accelerator; 3) numerous experiments that show the effectiveness of our strategy.

    ACOTES project: Advanced compiler technologies for embedded streaming

    Streaming applications are built of data-driven computational components that consume and produce unbounded data streams. Streaming-oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task: the computation must be carefully partitioned and mapped to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions whose goal is to improve the programmer's productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another, more recent approach is the one developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of the Advanced Compiler Technologies that we developed to support Embedded Streaming.

    AGAMOS: A graph-based approach to modulo scheduling for clustered microarchitectures

    This paper presents AGAMOS, a technique to modulo schedule loops on clustered microarchitectures. The proposed scheme uses a multilevel graph partitioning strategy to distribute the workload among clusters and reduces the number of intercluster communications at the same time. Partitioning is guided by approximate schedules (i.e., pseudoschedules), which take into account all of the constraints that influence the final schedule. To further reduce the number of intercluster communications, heuristics for instruction replication are included. The proposed scheme is evaluated using the SPECfp95 programs. The described scheme outperforms a state-of-the-art scheduler for all programs and different cluster configurations. For some configurations, the speedup obtained when using this new scheme is greater than 40 percent, and for selected programs, performance can be more than doubled.

    A methodology pruning the search space of six compiler transformations by addressing them together as one problem and by exploiting the hardware architecture details

    Today's compilers have a plethora of optimizing transformations to choose from, and the choice, order, and parameters of those transformations have a large impact on performance. Choosing the correct order and parameters has been a long-standing problem in compilation research and remains unsolved: optimizing each sub-problem separately gives a different schedule/binary for each, and these schedules cannot coexist, because refining one degrades another. Researchers try to solve this problem with iterative compilation techniques, but the search space is so big that it cannot be searched even with modern supercomputers. Moreover, compiler transformations do not take hardware architecture details and data reuse into account in an efficient way. In this paper, a new iterative compilation methodology is presented that reduces the search space of six compiler transformations by addressing the above problems; the search space is reduced by many orders of magnitude, so an efficient solution can now be found. The transformations are the following: loop tiling (including the number of levels of tiling), loop unrolling, register allocation, scalar replacement, loop interchange, and data array layouts. The search space is reduced (a) by addressing the aforementioned transformations together as one problem rather than separately, and (b) by taking into account the custom hardware architecture details (e.g., cache size and associativity) and algorithm characteristics (e.g., data reuse). The proposed methodology has been evaluated against iterative compilation and the gcc/icc compilers, on both embedded and general-purpose processors; it achieves significant performance gains at a compilation time that is lower by many orders of magnitude.
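    To make three of the listed transformations concrete, the following sketch applies loop tiling, loop interchange, and scalar replacement by hand to a square matrix multiplication. It is a textbook illustration, not the methodology of the paper; the function names and the `tile` parameter are hypothetical, and the tile size would in practice be chosen from the cache parameters the abstract mentions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // row-major n x n matrix

// Reference version: plain i-j-k loop nest.
Matrix matmul_naive(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// Tiled version: one level of tiling on all three loops, with the inner
// loops interchanged to i-k-j so that b and c are walked along rows.
Matrix matmul_tiled(const Matrix& a, const Matrix& b, std::size_t n,
                    std::size_t tile) {
    assert(tile > 0 && a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0);
    for (std::size_t ii = 0; ii < n; ii += tile)
        for (std::size_t kk = 0; kk < n; kk += tile)
            for (std::size_t jj = 0; jj < n; jj += tile)
                for (std::size_t i = ii; i < std::min(ii + tile, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + tile, n); ++k) {
                        // Scalar replacement: a[i][k] is loaded once and
                        // reused across the whole inner j loop.
                        const double aik = a[i * n + k];
                        for (std::size_t j = jj; j < std::min(jj + tile, n); ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
    return c;
}
```

    Even in this small case the tile size, the loop order, and the choice of which operand to keep in a scalar interact, which is the kind of coupling the abstract gives as the reason for treating the transformations as one problem.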