1,567 research outputs found

    Heterogeneous processor pipeline for a product cipher application

    Full text link
    Processing data received as a stream is a task commonly performed by modern embedded devices, in a wide range of applications such as multimedia (encoding/decoding/ playing media), networking (switching and routing), digital security, scientific data processing, etc. Such processing normally tends to be calculation intensive and therefore requiring significant processing power. Therefore, hardware acceleration methods to increase the performance of such applications constitute an important area of study. In this paper, we present an evaluation of one such method to process streaming data, namely multi-processor pipeline architecture. The hardware is based on a Multiple-Processor System on Chip (MPSoC), using a data encryption algorithm as a case study. The algorithm is partitioned on a coarse grained level and mapped on to an MPSoC with five processor cores in a pipeline, using specifically configured Xtensa LX3 cores. The system is then selectively optimized by strengthening and pruning the resources of each processor core. The optimized system is evaluated and compared against an optimal single-processor System on Chip (SoC) for the same application. The multiple-processor pipeline system for data encryption algorithms used was observed to provide significant speed ups, up to 4.45 times that of the single-processor system, which is close to the ideal speed up from a five-stage pipeline

    FPGA Implementation of Data Flow Graphs for Digital Signal Processing Applications

    Get PDF
    A rapid growth in digital signal processing applications has increased the requirement for high-speed digital systems. Multiprocessor systems are the best choice for these applications. A prior sequence of operations should be applied to the operations that described the nature of these applications before hardware implementation is produced. These operations should be scheduled and hardware allocated. This paper proposes a new scheduling technique for digital signal processing (DSP) applications has been represented by data flow graphs (DFGs). In addition, hardware allocation is implemented in the form of embedded system. A proposed scheduling technique also achieves the optimal scheduling of a DFG at design time. The optimality criteria considered in this algorithm are the maximum throughput within the available hardware resources. The maximum throughput is achieved by arranging the DFG nodes according to their inter-related data dependencies. Then, two nodes can be clustered into one compound task to reduce the overall execution time by minimizing the number of tasks to be executed that minimizing the number of cycles to execute them. Then each task is presented in form of instruction to be executed in the hardware system. A hardware system is composed of one or multiple homogenous pipelined processing elements and it is designed to meet the maximum-rate schedule.  Two implementations are proposed of the system architecture according to the number of the processing elements, namely:  the serial system and the parallel system. The serial system comprises one processing element where all tasks are processed sequentially, whilst the parallel system has four processing elements to execute tasks concurrently. These systems consist mainly of seven units: central shared memory, state table, multiway function unit buffer, execution array, processing element/s, instruction buffer and the address generation unit. The hardware components were built on an FPGA chip using Verilog HDL. In synthesis results, the parallel system has better system performance by 25.5% than the serial system. While the serial system requires smaller area size, which described by the number of slice registers and the number of the slice lookup tables (LUTs) than the parallel one. The relationship between the number of instructions that are executed in both systems, and the system area and the system performance that presented by system frequency, are studied. By increasing memories size in both systems, the system performance isn’t affected as in a serial system, and it is slightly decreased as the parallel system by 1.5% to 4.5%. In terms of the systems area, both serial system area and parallel system area are increased and in some cases are doubled. The proposed scheduling technique is shown to outperform the retaining technique, which we have chosen to compare with.  The serial system has better performance by 19.3% higher system frequency than a retiming technique. And the parallel system also outperforms the retaining technique by 51.2% higher system frequency in synthesis results

    Designing a scalable dynamic load -balancing algorithm for pipelined single program multiple data applications on a non-dedicated heterogeneous network of workstations

    Get PDF
    Dynamic load balancing strategies have been shown to be the most critical part of an efficient implementation of various applications on large distributed computing systems. The need for dynamic load balancing strategies increases when the underlying hardware is a non-dedicated heterogeneous network of workstations (HNOW). This research focuses on the single program multiple data (SPMD) programming model as it has been extensively used in parallel programming for its simplicity and scalability in terms of computational power and memory size.;This dissertation formally defines and addresses the problem of designing a scalable dynamic load-balancing algorithm for pipelined SPMD applications on non-dedicated HNOW. During this process, the HNOW parameters, SPMD application characteristics, and load-balancing performance parameters are identified.;The dissertation presents a taxonomy that categorizes general load balancing algorithms and a methodology that facilitates creating new algorithms that can harness the HNOW computing power and still preserve the scalability of the SPMD application.;The dissertation devises a new algorithm, DLAH (Dynamic Load-balancing Algorithm for HNOW). DLAH is based on a modified diffusion technique, which incorporates the HNOW parameters. Analytical performance bound for the worst-case scenario of the diffusion technique has been derived.;The dissertation develops and utilizes an HNOW simulation model to conduct extensive simulations. These simulations were used to validate DLAH and compare its performance to related dynamic algorithms. The simulations results show that DLAH algorithm is scalable and performs well for both homogeneous and heterogeneous networks. Detailed sensitivity analysis was conducted to study the effects of key parameters on performance

    Modeling and Mapping of Optimized Schedules for Embedded Signal Processing Systems

    Get PDF
    The demand for Digital Signal Processing (DSP) in embedded systems has been increasing rapidly due to the proliferation of multimedia- and communication-intensive devices such as pervasive tablets and smart phones. Efficient implementation of embedded DSP systems requires integration of diverse hardware and software components, as well as dynamic workload distribution across heterogeneous computational resources. The former implies increased complexity of application modeling and analysis, but also brings enhanced potential for achieving improved energy consumption, cost or performance. The latter results from the increased use of dynamic behavior in embedded DSP applications. Furthermore, parallel programming is highly relevant in many embedded DSP areas due to the development and use of Multiprocessor System-On-Chip (MPSoC) technology. The need for efficient cooperation among different devices supporting diverse parallel embedded computations motivates high-level modeling that expresses dynamic signal processing behaviors and supports efficient task scheduling and hardware mapping. Starting with dynamic modeling, this thesis develops a systematic design methodology that supports functional simulation and hardware mapping of dynamic reconfiguration based on Parameterized Synchronous Dataflow (PSDF) graphs. By building on the DIF (Dataflow Interchange Format), which is a design language and associated software package for developing and experimenting with dataflow-based design techniques for signal processing systems, we have developed a novel tool for functional simulation of PSDF specifications. This simulation tool allows designers to model applications in PSDF and simulate their functionality, including use of the dynamic parameter reconfiguration capabilities offered by PSDF. With the help of this simulation tool, our design methodology helps to map PSDF specifications into efficient implementations on field programmable gate arrays (FPGAs). Furthermore, valid schedules can be derived from the PSDF models at runtime to adapt hardware configurations based on changing data characteristics or operational requirements. Under certain conditions, efficient quasi-static schedules can be applied to reduce overhead and enhance predictability in the scheduling process. Motivated by the fact that scheduling is critical to performance and to efficient use of dynamic reconfiguration, we have focused on a methodology for schedule design, which complements the emphasis on automated schedule construction in the existing literature on dataflow-based design and implementation. In particular, we have proposed a dataflow-based schedule design framework called the dataflow schedule graph (DSG), which provides a graphical framework for schedule construction based on dataflow semantics, and can also be used as an intermediate representation target for automated schedule generation. Our approach to applying the DSG in this thesis emphasizes schedule construction as a design process rather than an outcome of the synthesis process. Our approach employs dataflow graphs for representing both application models and schedules that are derived from them. By providing a dataflow-integrated framework for unambiguously representing, analyzing, manipulating, and interchanging schedules, the DSG facilitates effective codesign of dataflow-based application models and schedules for execution of these models. As multicore processors are deployed in an increasing variety of embedded image processing systems, effective utilization of resources such as multiprocessor systemon-chip (MPSoC) devices, and effective handling of implementation concerns such as memory management and I/O become critical to developing efficient embedded implementations. However, the diversity and complexity of applications and architectures in embedded image processing systems make the mapping of applications onto MPSoCs difficult. We help to address this challenge through a structured design methodology that is built upon the DSG modeling framework. We refer to this methodology as the DEIPS methodology (DSG-based design and implementation of Embedded Image Processing Systems). The DEIPS methodology provides a unified framework for joint consideration of DSG structures and the application graphs from which they are derived, which allows designers to integrate considerations of parallelization and resource constraints together with the application modeling process. We demonstrate the DEIPS methodology through cases studies on practical embedded image processing systems

    Efficient Neural Network Implementations on Parallel Embedded Platforms Applied to Real-Time Torque-Vectoring Optimization Using Predictions for Multi-Motor Electric Vehicles

    Get PDF
    The combination of machine learning and heterogeneous embedded platforms enables new potential for developing sophisticated control concepts which are applicable to the field of vehicle dynamics and ADAS. This interdisciplinary work provides enabler solutions -ultimately implementing fast predictions using neural networks (NNs) on field programmable gate arrays (FPGAs) and graphical processing units (GPUs)- while applying them to a challenging application: Torque Vectoring on a multi-electric-motor vehicle for enhanced vehicle dynamics. The foundation motivating this work is provided by discussing multiple domains of the technological context as well as the constraints related to the automotive field, which contrast with the attractiveness of exploiting the capabilities of new embedded platforms to apply advanced control algorithms for complex control problems. In this particular case we target enhanced vehicle dynamics on a multi-motor electric vehicle benefiting from the greater degrees of freedom and controllability offered by such powertrains. Considering the constraints of the application and the implications of the selected multivariable optimization challenge, we propose a NN to provide batch predictions for real-time optimization. This leads to the major contribution of this work: efficient NN implementations on two intrinsically parallel embedded platforms, a GPU and a FPGA, following an analysis of theoretical and practical implications of their different operating paradigms, in order to efficiently harness their computing potential while gaining insight into their peculiarities. The achieved results exceed the expectations and additionally provide a representative illustration of the strengths and weaknesses of each kind of platform. Consequently, having shown the applicability of the proposed solutions, this work contributes valuable enablers also for further developments following similar fundamental principles.Some of the results presented in this work are related to activities within the 3Ccar project, which has received funding from ECSEL Joint Undertaking under grant agreement No. 662192. This Joint Undertaking received support from the European Union’s Horizon 2020 research and innovation programme and Germany, Austria, Czech Republic, Romania, Belgium, United Kingdom, France, Netherlands, Latvia, Finland, Spain, Italy, Lithuania. This work was also partly supported by the project ENABLES3, which received funding from ECSEL Joint Undertaking under grant agreement No. 692455-2

    An OpenCL software compilation framework targeting an SoC-FPGA VLIW chip multiprocessor

    Get PDF
    Modern systems-on-chip augment their baseline CPU with coprocessors and accelerators to increase overall computational capability and power efficiency, and thus have evolved into heterogeneous multi-core systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This paper discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a highly configurable VLIW Chip Multiprocessor architecture known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on a number of hardware configurations of the LE1 CMP. The presented OpenCL framework fully automates the compilation flow and supports work-item coalescing which better maps onto the ILP processor cores of the LE1 architecture. This paper discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework by running 12 industry-standard OpenCL benchmarks drawn from the AMD SDK and the Rodinia suites. The benchmarks are executed on 40 LE1 configurations with 10 implemented on an SoC-FPGA and the remaining on a cycle-accurate simulator. Across 12 OpenCL benchmarks results demonstrate near-linear wall-clock performance improvement of 1.8x (using 2 dual-issue cores), up to 5.2x (using 8 dual-issue cores) and on one case, super-linear improvement of 8.4x (FixOffset kernel, 8 dual-issue cores). The number of OpenCL benchmarks evaluated makes this study one of the most complete in the literature

    Investigate and classify various types of computer architecture

    Get PDF
    Issued as Final report, Project no. G-36-60

    A case study for NoC based homogeneous MPSoC architectures

    Get PDF
    The many-core design paradigm requires flexible and modular hardware and software components to provide the required scalability to next-generation on-chip multiprocessor architectures. A multidisciplinary approach is necessary to consider all the interactions between the different components of the design. In this paper, a complete design methodology that tackles at once the aspects of system level modeling, hardware architecture, and programming model has been successfully used for the implementation of a multiprocessor network-on-chip (NoC)-based system, the NoCRay graphic accelerator. The design, based on 16 processors, after prototyping with field-programmable gate array (FPGA), has been laid out in 90-nm technology. Post-layout results show very low power, area, as well as 500 MHz of clock frequency. Results show that an array of small and simple processors outperform a single high-end general purpose processo
    • …
    corecore