8 research outputs found

    Review of System Design Frameworks

    Get PDF
    In the last decade, the enormous development of the semiconductor industry with ever-increasing complexities of digital embedded systems and strong market competition with fast time-to-market and low design cost demands have imposed serious difficulty to a conventional design method. Therefore, there emerges a new design flow named model-based system design, which is based on high-level abstraction models, heavy design automation, and extensive component reuse to increase productivity and satisfy the market pressure. This thesis presents reviews of ten high level academic system design frameworks and tools that have been proposed and implemented recently to support the model based design flow, namely System-on-Chip Environment (SCE), Embedded System Environment (ESE), Metropolis, Daedalus, SystemCoDesigner (SCD), xPilot, GAUT, No-Instruction-Set Computer (NISC), Formal System Design (ForSyDe), and Ptolemy II. These tools are then compared to each other in various aspects comprising objective, technique, implementation and capability. Following that, three design flow frameworks, including ESE, Daedalus, and SystemCoDesigner, are experimented for their real usage, performance and practicality. The frameworks and tools implementing the model-based design flow all show promising results. Modelling tools (ForSyDe, and Ptolemy II) can sufficiently capture a wide range of complicated modern systems, while high-level synthesis tools (xPilot, GAUT, and NISC) produce better design qualities in terms of area, power, and cost in comparison to traditional works. Study cases of design flow frameworks (SCE, ESE, Metropolis, Daedalus, and SCD) show the model-based method significantly reduces developing time as well as facilitates the system design process. However, most of these tools and frameworks are being incomplete, and still under the experimental stage. There still be a lot of works needed until the method can be put into practice

    High-level automation of custom hardware design for high-performance computing

    Get PDF
    This dissertation focuses on efficient generation of custom processors from high-level language descriptions. Our work exploits compiler-based optimizations and transformations in tandem with high-level synthesis (HLS) to build high-performance custom processors. The goal is to offer a common multiplatform high-abstraction programming interface for heterogeneous compute systems where the benefits of custom reconfigurable (or fixed) processors can be exploited by the application developers. The research presented in this dissertation supports the following thesis: In an increasingly heterogeneous compute environment it is important to leverage the compute capabilities of each heterogeneous processor efficiently. In the case of FPGA and ASIC accelerators this can be achieved through HLS-based flows that (i) extract parallelism at coarser than basic block granularities, (ii) leverage common high-level parallel programming languages, and (iii) employ high-level source-to-source transformations to generate high-throughput custom processors. First, we propose a novel HLS flow that extracts instruction level parallelism beyond the boundary of basic blocks from C code. Subsequently, we describe FCUDA, an HLS-based framework for mapping fine-grained and coarse-grained parallelism from parallel CUDA kernels onto spatial parallelism. FCUDA provides a common programming model for acceleration on heterogeneous devices (i.e. GPUs and FPGAs). Moreover, the FCUDA framework balances multilevel granularity parallelism synthesis using efficient techniques that leverage fast and accurate estimation models (i.e. do not rely on lengthy physical implementation tools). Finally, we describe an advanced source-to-source transformation framework for throughput-driven parallelism synthesis (TDPS), which appropriately restructures CUDA kernel code to maximize throughput on FPGA devices. We have integrated the TDPS framework into the FCUDA flow to enable automatic performance porting of CUDA kernels designed for the GPU architecture onto the FPGA architecture

    Performance and area evaluations of processor-based benchmarks on FPGA devices

    Get PDF
    The computing system on SoCs is being long-term research since the FPGA technology has emerged due to its personality of re-programmable fabric, reconfigurable computing, and fast development time to market. During the last decade, uni-processor in a SoC is no longer to deal with the high growing market for complex applications such as Mobile Phones audio and video encoding, image and network processing. Due to the number of transistors on a silicon wafer is increasing, the recent FPGAs or embedded systems are advancing toward multi-processor-based design to meet tremendous performance and benefit this kind of systems are possible. Therefore, is an upcoming age of the MPSoC. In addition, most of the embedded processors are soft-cores, because they are flexible and reconfigurable for specific software functions and easy to build homogenous multi-processor systems for parallel programming. Moreover, behavioural synthesis tools are becoming a lot more powerful and enable to create datapath of logic units from high-level algorithms such as C to HDL and available for partitioning a HW/SW concurrent methodology. A range of embedded processors is able to implement on a FPGA-based prototyping to integrate the CPUs on a programmable device. This research is, firstly represent different types of computer architectures in modern embedded processors that are followed in different type of software applications (eg. Multi-threading Operations or Complex Functions) on FPGA-based SoCs; and secondly investigate their capability by executing a wide-range of multimedia software codes (Integer-algometric only) in different models of the processor-systems (uni-processor or multi-processor or Co-design), and finally compare those results in terms of the benchmarks and resource utilizations within FPGAs. All the examined programs were written in standard C and executed in a variety numbers of soft-core processors or hardware units to obtain the execution times. However, the number of processors and their customizable configuration or hardware datapath being generated are limited by a target FPGA resource, and designers need to understand the FPGA-based tradeoffs that have been considered - Speed versus Area. For this experimental purpose, I defined benchmarks into DLP / HLS catalogues, which are "data" and "function" intensive respectively. The programs of DLP will be executed in LEON3 MP and LE1 CMP multi-processor systems and the programs of HLS in the LegUp Co-design system on target FPGAs. In preliminary, the performance of the soft-core processors will be examined by executing all the benchmarks. The whole story of this thesis work centres on the issue of the execute times or the speed-up and area breakdown on FPGA devices in terms of different programs

    Automated generation of custom processor core from C code

    Get PDF
    We present a method for construction of application-specific processor cores from a given C code. Our approach consists of three phases. We start by quantifying the properties of the C code in terms of operation types, available parallelism, and other metrics. We then create an initial data path to exploit the available parallelism. We then apply designer-guided constraints to an interactive data path refinement algorithm that attempts to reduce the number of the most expensive components while meeting the constraints. Our experimental results show that our technique scales very well with the size of the C code. We demonstrate the efficiency of our technique on wide range of applications, from standard academic benchmarks to industrial size examples like the MP3 decoder. Each processor core was constructed and refined in under a minute, allowing the designer to explore several different configurations in much less time than needed for manual design. We compared our selection algorithm to the manual selection in terms of cost/performance and showed that our optimization technique achieves better cost/performance trade-off. We also synthesized our designs with programmable controller and, on average, the refined core have only 23% latency overhead, twice as many block RAMs and 36% fewer slices compared to the respective manual designs

    MURAC: A unified machine model for heterogeneous computers

    Get PDF
    Includes bibliographical referencesHeterogeneous computing enables the performance and energy advantages of multiple distinct processing architectures to be efficiently exploited within a single machine. These systems are capable of delivering large performance increases by matching the applications to architectures that are most suited to them. The Multiple Runtime-reconfigurable Architecture Computer (MURAC) model has been proposed to tackle the problems commonly found in the design and usage of these machines. This model presents a system-level approach that creates a clear separation of concerns between the system implementer and the application developer. The three key concepts that make up the MURAC model are a unified machine model, a unified instruction stream and a unified memory space. A simple programming model built upon these abstractions provides a consistent interface for interacting with the underlying machine to the user application. This programming model simplifies application partitioning between hardware and software and allows the easy integration of different execution models within the single control ow of a mixed-architecture application. The theoretical and practical trade-offs of the proposed model have been explored through the design of several systems. An instruction-accurate system simulator has been developed that supports the simulated execution of mixed-architecture applications. An embedded System-on-Chip implementation has been used to measure the overhead in hardware resources required to support the model, which was found to be minimal. An implementation of the model within an operating system on a tightly-coupled reconfigurable processor platform has been created. This implementation is used to extend the software scheduler to allow for the full support of mixed-architecture applications in a multitasking environment. Different scheduling strategies have been tested using this scheduler for mixed-architecture applications. The design and implementation of these systems has shown that a unified abstraction model for heterogeneous computers provides important usability benefits to system and application designers. These benefits are achieved through a consistent view of the multiple different architectures to the operating system and user applications. This allows them to focus on achieving their performance and efficiency goals by gaining the benefits of different execution models during runtime without the complex implementation details of the system-level synchronisation and coordination

    The Kiel Esterel processor: a multi-threaded reactive processor

    Get PDF
    Many embedded systems belong to the class of reactive systems, which continuously react to inputs from the environment by generating corresponding outputs. The programming of reactive systems typically requires the use of non-standard control flow constructs, such as concurrency or exception handling. Most programming languages do not support these constructs at all, or their use induces non-deterministic program behavior. To address these difficulties, the synchronous language Esterel has been developed to express reactive control flow patterns in a concise manner, with a clear semantics that imposes deterministic program behavior under all circumstances. There are different options to synthesize an Esterel program into a concrete system, e.g., software, hardware, and HW/SW co-design implementations. However, these classical synthesis approaches suffer from the limitations of traditional processors, with their instruction set architectures geared towards the sequential von-Neumann execution model, or they are very inflexible if HW synthesis is involved. Recently, another alternative for synthesizing Esterel has emerged, the reactive processing approach. Here the Esterel program is running on a processor that has been developed specifically for reactive systems. However, the main challenge when designing a reactive architecture is the handling of control. This thesis presents the Kiel Esterel Processor (KEP). In the KEP, the multi-threaded reactive architecture is responsible for managing the control flow of all threads. The KEP Instruction Set Architecture is complete in that it allows a direct mapping of all Esterel statements onto KEP assembler. It supports Esterel’s concurrency operator || in a very precise, direct and efficient way. It also supports full Esterel preemptions, i.e., the delayed and immediate strong/weak abortion and suspension. All other Esterel kernel statements, e.g., the Esterel exception, delay, and signal emission, etc., are also implemented directly and semantically accurate by the KEP. As the experimental comparison with a 32-bit commercial RISC processor indicates, the KEP has advantages in terms of memory use, execution speed, and energy consumption. Another advantage is the predictability of its timing behavior at the program level
    corecore