368 research outputs found

    A Multithreaded Soft Processor for SoPC Area Reduction

    Full text link
    The growth in size and performance of Field Programmable Gate Arrays (FPGAs) has compelled System-on-a-Programmable-Chip (SoPC) designers to use soft proces-sors for controlling systems with large numbers of intellec-tual property (IP) blocks. Soft processors control IP blocks, which are accessed by the processor either as peripheral de-vices or/and by using custom instructions (CIs). In large systems, chip multiprocessors (CMPs) are used to execute many programs concurrently. When these programs require the use of the same IP blocks which are accessed as periph-eral devices, they may have to stall waiting for their turn. In the case of CIs, the FPGA logic blocks that implement the CIs may have to be replicated for each processor. In both of these cases FPGA area is wasted, either by idle soft processors or the replication of CI logic blocks. This paper presents a multithreaded (MT) soft processor for area reduction in SoPC implementations. An MT proces-sor allows multiple programs to access the same IP without the need for the logic replication or the replication of whole processors. We first designed a single-threaded processor that is instruction-set compatible to Altera’s Nios II soft processor. Our processor is approximately the same size as the Nios II Economy version, with equivalent performance. We augmented our processor to have 4-way interleaved mul-tithreading capabilities. This paper compares the area us-age and performance of the MT processor versus two CMP systems, using Altera’s and our single-threaded processors, separately. Our results show that we can achieve an area savings of about 45 % for the processor itself, in addition to the area savings due to not replicating CI logic blocks. 1

    The "MIND" Scalable PIM Architecture

    Get PDF
    MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore with multiple memory/processor nodes on each chip and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures in that it incorporates mechanisms for efficient support of a global parallel execution model based on the semantics of message-driven multithreaded split-transaction processing. MIND is designed to operate either in conjunction with other conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real time execution, and active power management. This paper describes the major elements and operational methods of the MIND architecture

    Vector processor virtualization: distributed memory hierarchy and simultaneous multithreading

    Get PDF
    Taking advantage of DLP (Data-Level Parallelism) is indispensable in most data streaming and multimedia applications. Several architectures have been proposed to improve both the performance and energy consumption for such applications. Superscalar and VLIW (Very Long Instruction Word) processors, along with SIMD (Single-Instruction Multiple-Data) and vector processor (VP) accelerators, are among the available options for designers to accomplish their desired requirements. On the other hand, these choices turn out to be large resource and energy consumers, while also not being always used efficiently due to data dependencies among instructions and limited portion of vectorizable code in single applications that deploy them. This dissertation proposes an innovative architecture for a multithreaded VP which separates the path for performing data shuffle and memory-indexed accesses from the data path for executing other vector instructions that access the memory. This separation speeds up the most common memory access operations by avoiding extra delays and unnecessary stalls. In this multilane-based VP design, each vector lane uses its own private memory to avoid any stalls during memory access instructions. More importantly, the proposed VP has an innovative multithreaded architecture which makes it highly suitable for concurrent sharing in multicore environments. To this end, the VP which is developed in VHDL and prototyped on an FPGA (Field-Programmable Gate Array), serves as a coprocessor for one or more scalar cores in various system architectures presented in the dissertation. In the first system architecture, the VP is allocated exclusively to a single scalar core. Benchmarking shows that the VP can achieve very high performance. The inclusion of distributed data shuffle engines across vector lanes has a spectacular impact on the execution time, primarily for applications like FFT (Fast-Fourier Transform) that require large amounts of data shuffling. In the second system architecture, a VP virtualization technique is presented which, when applied, enables the multithreaded VP to simultaneously execute many threads of various vector lengths. The threads compete simultaneously for the VP resources having as a goal an improved aggregate VP utilization. This approach yields high VP utilization even under low utilization for the individual threads. A vector register file (VRF) virtualization technique dynamically allocates physical vector registers to running threads. The technique is implemented for a multi-core processor embedded in an FPGA. Under the dynamic creation of threads, benchmarking demonstrates large VP speedups and drastic energy savings when compared to the first system architecture. In the last system architecture, further improvements focus on VP virtualization relying exclusively on hardware. Moreover, a pipelined data shuffle network replaces the non-pipelined shuffle engines. The VP can then take advantage of identical instruction flows that may be present in different vector applications by running in a fused instruction mode that increases its utilization. A power dissipation model is introduced as well as two optimization policies towards minimizing the consumed energy, or the product of the energy and runtime for a given application. Benchmarking shows the positive impact of these optimizations

    Ærø: A Platform Architecture for Mixed-Criticality Airborne Systems

    Get PDF

    Power models, energy models and libraries for energy-efficient concurrent data structures and algorithms

    Get PDF
    EXCESS deliverable D2.3. More information at http://www.excess-project.eu/This deliverable reports the results of the power models, energy models and librariesfor energy-efficient concurrent data structures and algorithms as available by projectmonth 30 of Work Package 2 (WP2). It reports i) the latest results of Task 2.2-2.4 onproviding programming abstractions and libraries for developing energy-efficient datastructures and algorithms and ii) the improved results of Task 2.1 on investigating andmodeling the trade-off between energy and performance of concurrent data structuresand algorithms. The work has been conducted on two main EXCESS platforms: Intelplatforms with recent Intel multicore CPUs and Movidius Myriad platforms

    Modeling and Mapping of Optimized Schedules for Embedded Signal Processing Systems

    Get PDF
    The demand for Digital Signal Processing (DSP) in embedded systems has been increasing rapidly due to the proliferation of multimedia- and communication-intensive devices such as pervasive tablets and smart phones. Efficient implementation of embedded DSP systems requires integration of diverse hardware and software components, as well as dynamic workload distribution across heterogeneous computational resources. The former implies increased complexity of application modeling and analysis, but also brings enhanced potential for achieving improved energy consumption, cost or performance. The latter results from the increased use of dynamic behavior in embedded DSP applications. Furthermore, parallel programming is highly relevant in many embedded DSP areas due to the development and use of Multiprocessor System-On-Chip (MPSoC) technology. The need for efficient cooperation among different devices supporting diverse parallel embedded computations motivates high-level modeling that expresses dynamic signal processing behaviors and supports efficient task scheduling and hardware mapping. Starting with dynamic modeling, this thesis develops a systematic design methodology that supports functional simulation and hardware mapping of dynamic reconfiguration based on Parameterized Synchronous Dataflow (PSDF) graphs. By building on the DIF (Dataflow Interchange Format), which is a design language and associated software package for developing and experimenting with dataflow-based design techniques for signal processing systems, we have developed a novel tool for functional simulation of PSDF specifications. This simulation tool allows designers to model applications in PSDF and simulate their functionality, including use of the dynamic parameter reconfiguration capabilities offered by PSDF. With the help of this simulation tool, our design methodology helps to map PSDF specifications into efficient implementations on field programmable gate arrays (FPGAs). Furthermore, valid schedules can be derived from the PSDF models at runtime to adapt hardware configurations based on changing data characteristics or operational requirements. Under certain conditions, efficient quasi-static schedules can be applied to reduce overhead and enhance predictability in the scheduling process. Motivated by the fact that scheduling is critical to performance and to efficient use of dynamic reconfiguration, we have focused on a methodology for schedule design, which complements the emphasis on automated schedule construction in the existing literature on dataflow-based design and implementation. In particular, we have proposed a dataflow-based schedule design framework called the dataflow schedule graph (DSG), which provides a graphical framework for schedule construction based on dataflow semantics, and can also be used as an intermediate representation target for automated schedule generation. Our approach to applying the DSG in this thesis emphasizes schedule construction as a design process rather than an outcome of the synthesis process. Our approach employs dataflow graphs for representing both application models and schedules that are derived from them. By providing a dataflow-integrated framework for unambiguously representing, analyzing, manipulating, and interchanging schedules, the DSG facilitates effective codesign of dataflow-based application models and schedules for execution of these models. As multicore processors are deployed in an increasing variety of embedded image processing systems, effective utilization of resources such as multiprocessor systemon-chip (MPSoC) devices, and effective handling of implementation concerns such as memory management and I/O become critical to developing efficient embedded implementations. However, the diversity and complexity of applications and architectures in embedded image processing systems make the mapping of applications onto MPSoCs difficult. We help to address this challenge through a structured design methodology that is built upon the DSG modeling framework. We refer to this methodology as the DEIPS methodology (DSG-based design and implementation of Embedded Image Processing Systems). The DEIPS methodology provides a unified framework for joint consideration of DSG structures and the application graphs from which they are derived, which allows designers to integrate considerations of parallelization and resource constraints together with the application modeling process. We demonstrate the DEIPS methodology through cases studies on practical embedded image processing systems
    • …
    corecore