
    Modeling and Mapping of Optimized Schedules for Embedded Signal Processing Systems

    The demand for Digital Signal Processing (DSP) in embedded systems has been increasing rapidly due to the proliferation of multimedia- and communication-intensive devices such as tablets and smartphones. Efficient implementation of embedded DSP systems requires integration of diverse hardware and software components, as well as dynamic workload distribution across heterogeneous computational resources. The former implies increased complexity of application modeling and analysis, but also brings enhanced potential for improving energy consumption, cost, or performance. The latter results from the increased use of dynamic behavior in embedded DSP applications. Furthermore, parallel programming is highly relevant in many embedded DSP areas due to the development and use of Multiprocessor System-on-Chip (MPSoC) technology. The need for efficient cooperation among different devices supporting diverse parallel embedded computations motivates high-level modeling that expresses dynamic signal processing behaviors and supports efficient task scheduling and hardware mapping.

Starting with dynamic modeling, this thesis develops a systematic design methodology that supports functional simulation and hardware mapping of dynamic reconfiguration based on Parameterized Synchronous Dataflow (PSDF) graphs. Building on the DIF (Dataflow Interchange Format), a design language and associated software package for developing and experimenting with dataflow-based design techniques for signal processing systems, we have developed a novel tool for functional simulation of PSDF specifications. This simulation tool allows designers to model applications in PSDF and simulate their functionality, including use of the dynamic parameter reconfiguration capabilities offered by PSDF. With the help of this simulation tool, our design methodology helps to map PSDF specifications into efficient implementations on field-programmable gate arrays (FPGAs). Furthermore, valid schedules can be derived from the PSDF models at runtime to adapt hardware configurations based on changing data characteristics or operational requirements. Under certain conditions, efficient quasi-static schedules can be applied to reduce overhead and enhance predictability in the scheduling process.

Motivated by the fact that scheduling is critical both to performance and to efficient use of dynamic reconfiguration, we have focused on a methodology for schedule design, which complements the emphasis on automated schedule construction in the existing literature on dataflow-based design and implementation. In particular, we have proposed a dataflow-based schedule design framework called the dataflow schedule graph (DSG), which provides a graphical framework for schedule construction based on dataflow semantics and can also be used as an intermediate representation target for automated schedule generation. Our approach to applying the DSG in this thesis emphasizes schedule construction as a design process rather than an outcome of the synthesis process, and employs dataflow graphs for representing both application models and the schedules derived from them. By providing a dataflow-integrated framework for unambiguously representing, analyzing, manipulating, and interchanging schedules, the DSG facilitates effective codesign of dataflow-based application models and schedules for execution of these models.
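To make the parameterized-dataflow idea concrete, here is a minimal functional-simulation sketch in Python: an actor's token consumption rate depends on a parameter that may be reconfigured between graph iterations, as PSDF permits. The `Decimator` actor, its averaging behavior, and all names here are our own illustration, not the API of the DIF package.

```python
# Minimal PSDF-flavored sketch: rates are fixed within an iteration
# (enabling quasi-static scheduling) and reconfigured only between them.
from collections import deque

class Decimator:
    """Consumes `factor` tokens per firing and produces one (a parameterized rate)."""
    def __init__(self, factor):
        self.factor = factor

    def fire(self, fifo):
        if len(fifo) < self.factor:
            return None                       # not enough tokens to fire
        window = [fifo.popleft() for _ in range(self.factor)]
        return sum(window) / self.factor      # emit the window average

def run_iteration(samples, actor):
    """One graph iteration: fire the actor until its input FIFO runs dry."""
    fifo, out = deque(samples), []
    while (tok := actor.fire(fifo)) is not None:
        out.append(tok)
    return out

dec = Decimator(factor=4)
print(run_iteration(range(16), dec))   # 4 output tokens

# Reconfigure between iterations, mimicking PSDF's subsystem re-initialization:
dec.factor = 8                         # adapt to changing data characteristics
print(run_iteration(range(16), dec))   # now 2 output tokens
```

Because the rate stays fixed while an iteration runs, a schedule computed once per parameter setting remains valid for the whole iteration, which is the essence of the quasi-static scheduling the abstract mentions.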
As multicore processors are deployed in an increasing variety of embedded image processing systems, effective utilization of resources such as multiprocessor system-on-chip (MPSoC) devices, and effective handling of implementation concerns such as memory management and I/O, become critical to developing efficient embedded implementations. However, the diversity and complexity of applications and architectures in embedded image processing systems make the mapping of applications onto MPSoCs difficult. We help to address this challenge through a structured design methodology that is built upon the DSG modeling framework. We refer to this methodology as the DEIPS methodology (DSG-based design and implementation of Embedded Image Processing Systems). The DEIPS methodology provides a unified framework for joint consideration of DSG structures and the application graphs from which they are derived, which allows designers to integrate considerations of parallelization and resource constraints together with the application modeling process. We demonstrate the DEIPS methodology through case studies on practical embedded image processing systems.
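As a rough illustration of the parallelization trade-off such a methodology asks designers to weigh, the sketch below greedily balances dataflow actors across cores by estimated load. It is our own simplification with invented actor names and costs, not the DEIPS tooling; real mappings must also respect memory and I/O constraints.

```python
# Longest-processing-time-first mapping of actors to cores (a common greedy
# load-balancing heuristic); costs are illustrative execution-time estimates.
import heapq

def map_actors(actor_costs, num_cores):
    cores = [(0.0, c, []) for c in range(num_cores)]      # (load, core id, actors)
    heapq.heapify(cores)
    for actor, cost in sorted(actor_costs.items(), key=lambda kv: -kv[1]):
        load, cid, assigned = heapq.heappop(cores)        # least-loaded core
        heapq.heappush(cores, (load + cost, cid, assigned + [actor]))
    return sorted(cores, key=lambda t: t[1])

costs = {"capture": 3.0, "filter": 7.0, "segment": 5.0, "classify": 6.0}
for load, cid, actors in map_actors(costs, num_cores=2):
    print(f"core {cid}: {actors} (load {load})")
```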

    SPICE²: A Spatial, Parallel Architecture for Accelerating the SPICE Circuit Simulator

    Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8X) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control, and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase and sparse dataflow parallelism in the Sparse Matrix-Solve phase, and compose the complete design in streaming fashion. We name our parallel architecture SPICE² (Spatial Processors Interconnected for Concurrent Execution). We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes, and exploits the parallelism available in the SPICE circuit simulator. The design is optimized with an auto-tuner that can scale it to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms.

We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5X (1.4–23X) across a range of non-linear device models, and Matrix-Solve by 2.4X (0.6–13X) across various benchmark matrices, while delivering a mean combined speedup of 2.8X (0.2–11X) for the two together when comparing a Xilinx Virtex-6 LX760 (40nm) with an Intel Core i7 965 (45nm). With our high-level framework, we can also accelerate single-precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, the IBM Cell, and Sun Niagara 2 architectures. We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (e.g., multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit, and optimize spatial parallelism for an important class of problems that are challenging to parallelize.
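The three-phase structure is easiest to see in a toy Newton-Raphson solve of a one-node circuit (a 1 V source driving a diode through 1 kΩ). The sketch below is our simplification with illustrative device constants: real SPICE stamps many devices into a large sparse matrix and LU-factors it, but the Model-Evaluation / Matrix-Solve / Iteration Control split is the same one the thesis parallelizes.

```python
import numpy as np

IS, VT, R, VSRC = 1e-14, 0.025, 1e3, 1.0   # diode saturation current, thermal
                                           # voltage, series resistor, source

def model_evaluation(v):
    """Phase 1: evaluate the nonlinear device model (data-parallel in SPICE²)."""
    i_d = IS * (np.exp(v / VT) - 1.0)      # diode current
    g_d = IS * np.exp(v / VT) / VT         # linearized conductance
    return i_d, g_d

def matrix_solve(v):
    """Phase 2: assemble and solve the (here 1x1) linearized system."""
    i_d, g_d = model_evaluation(v)
    G = 1.0 / R + g_d                      # Jacobian entry
    rhs = VSRC / R - i_d + g_d * v         # Norton-companion right-hand side
    return rhs / G                         # real SPICE: sparse LU solve

v = 0.5
for it in range(100):                      # Phase 3: iteration control
    v_new = matrix_solve(v)
    if abs(v_new - v) < 1e-9:              # convergence test
        break
    v = v_new
print(f"converged after {it} iterations: V = {v:.4f} V")
```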

    NIKEL: Electronics and data acquisition for kilopixels kinetic inductance camera

    A prototype of digital frequency-multiplexing electronics allowing real-time monitoring of microwave kinetic inductance detector (MKID) arrays for mm-wave astronomy has been developed. Thanks to frequency multiplexing, it can simultaneously monitor 400 pixels over a 500 MHz bandwidth and requires only two coaxial cables for instrumenting such a large array. The chosen solution and the performance achieved are presented in this paper.
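A back-of-the-envelope sketch of the multiplexing arithmetic (the sample rate, tone placement, and window length are our assumptions, not the paper's design): 400 probe tones spaced 500 MHz / 400 = 1.25 MHz apart share one line, and each pixel is recovered by digitally mixing that line down by its own tone frequency.

```python
import numpy as np

BW, N_PIX = 500e6, 400
FS, N = 2e9, 1600                          # assumed digitizer rate and window
spacing = BW / N_PIX                       # 1.25 MHz between probe tones
tones = 100e6 + spacing * np.arange(N_PIX) # 100.00 .. 598.75 MHz

t = np.arange(N) / FS                      # 0.8 us window = 1 / spacing
comb = np.cos(2 * np.pi * tones[:, None] * t).sum(axis=0)  # one shared cable

# Recover pixel k by mixing with its own tone and averaging (low-pass filter):
k = 42
iq = comb * np.exp(-2j * np.pi * tones[k] * t)
print(abs(iq.mean()))                      # ~0.5: amplitude of pixel k's tone
```

Choosing the window as an integer number of tone-spacing periods makes the other 399 tones average to zero; real readout firmware performs this channelization continuously in FPGA filter banks.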

    Implementation and Evaluation of Deep Neural Networks in Commercially Available Processing in Memory Hardware

    Deep Neural Networks (DNNs), and specifically Convolutional Neural Networks (CNNs), are often associated with a large number of data-parallel computations. Therefore, data-centric computing paradigms, such as Processing in Memory (PIM), are being widely explored for CNN acceleration. A recent PIM architecture, developed and commercialized by the UPMEM company, has demonstrated an impressive performance boost over traditional CPU-based systems for a wide range of data-parallel applications. However, the domain of CNN acceleration is yet to be explored on this PIM platform. In this work, successful implementations of CNNs on the UPMEM PIM system are presented. Furthermore, multiple operation-mapping schemes with different optimization goals are explored. Based on data obtained from the physical implementation of the CNNs on the UPMEM system, key takeaways for future implementations and further UPMEM improvements are presented. Finally, to compare UPMEM's performance with that of other PIMs, a model is proposed that produces estimated performance results for a PIM given its architectural parameters. The creation and usage of the model are covered in this work.
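Since the abstract describes the performance model only at a high level, here is a minimal roofline-style sketch of the idea: estimate a convolution layer's runtime from a few architectural parameters, taking the larger of the compute-bound and memory-bound times. All numbers are placeholders, not measured UPMEM specifications.

```python
from dataclasses import dataclass

@dataclass
class PIMConfig:
    num_units: int          # parallel in-memory processors (e.g., DPUs)
    ops_per_sec: float      # sustained multiply-accumulates per second per unit
    bytes_per_sec: float    # local memory bandwidth per unit

def conv_layer_time(cfg, h, w, c_in, c_out, k):
    """Lower-bound layer time = max(compute-bound, memory-bound)."""
    macs = h * w * c_in * c_out * k * k                        # total MACs
    bytes_moved = 4 * (h * w * c_in                            # input fmap
                       + c_in * c_out * k * k                  # weights
                       + h * w * c_out)                        # output fmap
    t_compute = macs / (cfg.num_units * cfg.ops_per_sec)
    t_memory = bytes_moved / (cfg.num_units * cfg.bytes_per_sec)
    return max(t_compute, t_memory)

cfg = PIMConfig(num_units=2048, ops_per_sec=5e7, bytes_per_sec=6e8)
print(f"{conv_layer_time(cfg, 56, 56, 64, 64, 3) * 1e3:.2f} ms")  # ~1.13 ms
```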

    Accelerating legacy applications with spatial computing devices

    Heterogeneous computing is the major driving factor in designing new energy-efficient high-performance computing systems. Despite the broad adoption of GPUs and other specialized architectures, interest in spatial architectures like field-programmable gate arrays (FPGAs) has grown. While the combination of high performance, low power consumption, and high adaptability constitutes an advantage, these devices still suffer from a weak software ecosystem, which forces application developers to use tools requiring deep knowledge of the underlying system, often leaving legacy code (e.g., Fortran applications) unsupported. Recognizing this, we describe a methodology for porting legacy Fortran code to modern FPGA architectures, with the goal of preserving performance/power ratios. Framed as an experience report, the study uses an industrial computational fluid dynamics application to demonstrate that our methodology produces synthesizable OpenCL codes targeting Intel Arria 10 and Stratix 10 devices. Although the performance gain does not go far beyond that of the original CPU code (we obtained relative speedups of 0.59x and 0.63x, respectively, for a single optimized main kernel, while on the Stratix 10 alone we achieved 2.56x by replicating the main optimized kernel four times), our results are encouraging enough to chart a path for further investigation. This paper also reports the major critical issues encountered in porting Fortran code to FPGA architectures.
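The Stratix 10 numbers are roughly consistent with linear scaling of the replicated kernel, a relationship worth making explicit (this linear model is our simplification, not the authors' analysis):

```python
def replicated_speedup(single_kernel_speedup, replicas, efficiency=1.0):
    """Speedup if each replica adds full throughput; efficiency < 1 models overhead."""
    return single_kernel_speedup * replicas * efficiency

print(replicated_speedup(0.63, 4))   # 2.52, close to the reported 2.56x
```

In other words, a kernel that is slower than the CPU on its own can still win once the fabric holds enough copies of it, provided memory bandwidth keeps pace with the replicas.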

    Definition of avionics concepts for a heavy lift cargo vehicle, appendix A

    The major objective of the study task was to define a cost-effective, multiuser simulation, test, and demonstration facility to support the development of avionics systems for future space vehicles. This volume provides the results of the main simulation processor selection study and describes some proof-of-concept demonstrations for the avionics test bed facility.