Multi-processor system-level synthesis for multiple applications on platform FPGA
Multiprocessor systems-on-chip (MPSoC) are being developed in increasing numbers to support the high number of applications running on modern embedded systems. Designing and programming such systems proves to be a major challenge. Most of the current design methodologies rely on creating the design by hand, and are therefore error-prone and time-consuming. This also limits the number of design points that can be explored. While some efforts have been made to automate the flow and raise the abstraction level, these are still limited to single-application designs. In this paper, we present a design methodology to generate and program MPSoC designs in a systematic and automated way for multiple applications. The architecture is automatically inferred from the application specifications, and customized for it. The flow is ideal for fast design space exploration (DSE) in MPSoC systems. We present results of a case study to compute the buffer-throughput trade-offs in real-life applications, H263 and JPEG decoders. The generation of the entire project takes about 100 ms, and the whole DSE was completed in 45 minutes, including the FPGA mapping and synthesis.
Chapter 4 DATAFLOW ANALYSIS FOR REAL-TIME EMBEDDED MULTIPROCESSOR SYSTEM DESIGN
Dataflow analysis techniques are key to reducing the number of design iterations and shortening the design time of real-time embedded network-based multiprocessor systems that process data streams. With these analysis techniques the worst-case end-to-end temporal behavior of hard real-time applications can be derived from a dataflow model in which computation, communication and arbitration are modeled. For soft real-time applications these static dataflow analysis techniques are combined with simulation of the dataflow model to test statistical assertions about their temporal behavior. The simulation results in combination with properties of the dataflow model are used to derive the sensitivity of design parameters and to estimate parameters like the capacity of data buffers.
Keywords: real-time, dataflow analysis, multiprocessor system, predictable design, system-on-chip
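As a rough illustration of how a dataflow model relates buffer capacity to temporal behavior, the sketch below simulates a two-actor producer/consumer pipeline with a bounded buffer and reports the average time between consumer completions. It is a simplified assumption-based model, not the analysis described in the chapter: the actor execution times, the function name `steady_state_period`, and the rule that a buffer slot is claimed when the producer starts and released when the consumer finishes are all chosen here for illustration.

```python
def steady_state_period(t_prod, t_cons, capacity, firings=20):
    """Average time between consumer completions in a two-actor pipeline.

    Assumed (hypothetical) model: the producer claims a buffer slot when it
    starts firing, and the slot is released when the consumer finishes the
    firing that reads the corresponding token.
    """
    if capacity < 1 or firings < 2:
        raise ValueError("capacity must be >= 1 and firings >= 2")
    p_end, c_end = [], []
    for k in range(firings):
        # The producer is sequential: it waits for its own previous firing.
        p_start = p_end[k - 1] if k > 0 else 0
        # Back-pressure: firing k needs the slot freed by consumer firing k - capacity.
        if k >= capacity:
            p_start = max(p_start, c_end[k - capacity])
        p_end.append(p_start + t_prod)
        # The consumer waits for its previous firing and for token k to arrive.
        c_start = max(c_end[k - 1] if k > 0 else 0, p_end[k])
        c_end.append(c_start + t_cons)
    half = firings // 2
    # Average over the second half of the run to skip the start-up transient.
    return (c_end[-1] - c_end[half - 1]) / (firings - half)


if __name__ == "__main__":
    for capacity in (1, 2, 3):
        print(capacity, steady_state_period(t_prod=2, t_cons=3, capacity=capacity))
```

With these assumed execution times, a single-slot buffer serializes the two actors, while two slots already let the slower actor determine the period; adding a third slot brings no further gain.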
Practical Instruction Set Design and Compiler Retargetability Using Static Resource Models
The design of application (-domain) specific instruction-set processors (ASIPs), optimized for code size, has traditionally been accompanied by the necessity to program assembly, at least for the performance-critical parts of the application. The highly encoded instruction sets simply lack the orthogonal structure present in e.g. VLIW processors, which allows efficient compilation. This lack of efficient compilation tools has also severely hampered the design space exploration of code-size efficient instruction sets, and correspondingly, their tuning to the application domain. In [13] a practical method is demonstrated to model a broad class of highly encoded instruction sets in terms of virtual resources easily interpreted by classic resource-constrained schedulers (such as the popular list-scheduling algorithm), thereby allowing efficient compilation with well-understood compilation tools. In this paper we will demonstrate the suitability of this model to also enable instruction set design (-space exploration) with a simple, well-understood and proven method long used in the High-Level Synthesis (HLS) of ASICs. A small case study proves the practical applicability of the method.
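The abstract refers to classic resource-constrained list scheduling. A minimal sketch of that general algorithm, assuming unit-latency operations, might look as follows. The operation names, the resource names `alu` and `mul`, and the function `list_schedule` are hypothetical and are not taken from the paper; in the paper's setting, some resources would be virtual ones that encode instruction-set restrictions.

```python
from collections import defaultdict


def list_schedule(ops, deps, resource_of, capacity):
    """Resource-constrained list scheduling for unit-latency operations.

    ops:         operation names in priority order (earlier = higher priority)
    deps:        dict mapping an operation to the set of its predecessors
    resource_of: dict mapping an operation to the resource it occupies
    capacity:    dict mapping a resource to the units available per cycle
    Returns a dict mapping each operation to its start cycle.
    """
    start = {}
    remaining = list(ops)
    cycle = 0
    while remaining:
        used = defaultdict(int)
        scheduled_now = []
        for op in remaining:
            # Ready only if every predecessor started in an earlier cycle.
            ready = all(p in start and start[p] < cycle for p in deps.get(op, ()))
            res = resource_of[op]
            if ready and used[res] < capacity[res]:
                used[res] += 1
                start[op] = cycle
                scheduled_now.append(op)
        if not scheduled_now:
            # With unit latency, an empty cycle means no progress is possible.
            raise ValueError("cyclic dependencies or a resource with zero capacity")
        remaining = [op for op in remaining if op not in start]
        cycle += 1
    return start


if __name__ == "__main__":
    ops = ["a", "b", "c", "d"]
    deps = {"c": {"a"}, "d": {"a", "b"}}
    resource_of = {"a": "alu", "b": "alu", "c": "mul", "d": "alu"}
    print(list_schedule(ops, deps, resource_of, {"alu": 1, "mul": 1}))
```

Rerunning the same operation list with a different `capacity` table is one simple way to compare candidate resource configurations, in the spirit of the design space exploration the abstract mentions.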
Limited address range architecture for reducing code size in embedded processors
In embedded systems a processor core must be designed with low power consumption, low cost and small silicon area in mind, since program code often resides in on-chip ROM. To obtain small code size, the instruction set can restrict the amount of instruction-level parallelism, and the encoding cost can also be reduced by restricting access to register files. However, communication among register files has to be supported by hardware (e.g. buses and wires) and by compilers. In this paper, we propose a new type of architecture that limits the encoding range to a subset of registers in a register file on the one hand, and keeps the overlap among different ranges on the other hand, in order to support communication between all the functional units. We also propose the annotated conflict graph approach for modeling the range constraints in this architecture, which can be applied in combination with any scheduler. However, to overcome the phase coupling between address range assignment and scheduling in code generation, in this paper the address range constraints are transformed and integrated with the existing timing, resource and register file constraints. Constraint analysis techniques [9] are adapted to prune the search spaces based on those constraints. Results show that we can reduce code size by up to 24.58% by applying our technique.
Skeleton-based automatic parallelization of image processing algorithms for GPUs
Graphics Processing Units (GPUs) are becoming increasingly important in high performance computing. To maintain high quality solutions, programmers have to efficiently parallelize and map their algorithms. This task is far from trivial, leading to the necessity to automate this process. In this paper, we present a technique to automatically parallelize and map sequential code on a GPU, without the need for code-annotations. This technique is based on skeletonization and is targeted at image processing algorithms. Skeletonization separates the structure of a parallel computation from the algorithm’s functionality, enabling efficient implementations without requiring architecture knowledge from the programmer. We define a number of skeleton classes, each enabling GPU specific parallelization techniques and optimizations, including automatic thread creation, on-chip memory usage and memory coalescing. Recently, similar skeletonization techniques have been applied to GPUs. Our work uses domain specific skeletons and a finer-grained classification of algorithms. Comparing skeleton-based parallelization to existing GPU code generators in general, we potentially achieve a higher hardware efficiency by enabling algorithm restructuring through skeletons. In a set of benchmarks, we show that the presented skeleton-based approach generates highly optimized code, achieving high data throughput. Additionally, we show that the automatically generated code performs close or equal to manually mapped and optimized code. We conclude that skeleton-based parallelization for GPUs is promising, but we do believe that future research must focus on the identification of a finer-grained and complete classification.
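The idea of separating a computation's structure from its functionality can be sketched in a few lines. The example below is a sequential Python illustration only, not GPU code and not the skeleton classes defined in the paper: the names `pixel_to_pixel` and `neighbourhood_to_pixel` and the border-clamping rule are assumptions made here. Each skeleton fixes the access pattern, and the caller supplies only the per-pixel function.

```python
def pixel_to_pixel(image, kernel):
    """Skeleton: each output pixel depends only on the input pixel at the same position."""
    return [[kernel(pixel) for pixel in row] for row in image]


def neighbourhood_to_pixel(image, kernel, radius=1):
    """Skeleton: each output pixel depends on a square window around its position.

    Coordinates outside the image are clamped to the nearest border pixel
    (an assumption of this sketch).
    """
    height, width = len(image), len(image[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[min(max(y + dy, 0), height - 1)][min(max(x + dx, 0), width - 1)]
                for dy in range(-radius, radius + 1)
                for dx in range(-radius, radius + 1)
            ]
            row.append(kernel(window))
        out.append(row)
    return out


if __name__ == "__main__":
    img = [[0, 200], [130, 100]]
    # Functionality is supplied separately from the access structure.
    print(pixel_to_pixel(img, lambda p: 255 if p >= 128 else 0))
    print(neighbourhood_to_pixel(img, lambda w: sum(w) // len(w)))
```

Because the access pattern is known from the chosen skeleton, a code generator could in principle pick a matching thread layout and memory strategy for it, which is the kind of optimization the abstract attributes to skeleton classes.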
- …