56 research outputs found

    Design and resource management of reconfigurable multiprocessors for data-parallel applications

    Get PDF
    FPGA (Field-Programmable Gate Array)-based custom reconfigurable computing machines have established themselves as low-cost and low-risk alternatives to ASIC (Application-Specific Integrated Circuit) implementations and general-purpose microprocessors in accelerating a wide range of computation-intensive applications. Most often they are Application Specific Programmable Circuiits (ASPCs), which are developer programmable instead of user programmable. The major disadvantages of ASPCs are minimal programmability, and significant time and energy overheads caused by required hardware reconfiguration when the problem size outnumbers the available reconfigurable resources; these problems are expected to become more serious with increases in the FPGA chip size. On the other hand, dominant high-performance computing systems, such as PC clusters and SMPs (Symmetric Multiprocessors), suffer from high communication latencies and/or scalability problems. This research introduces low-cost, user-programmable and reconfigurable MultiProcessor-on-a-Programmable-Chip (MPoPC) systems for high-performance, low-cost computing. It also proposes a relevant resource management framework that deals with performance, power consumption and energy issues. These semi-customized systems reduce significantly runtime device reconfiguration by employing userprogrammable processing elements that are reusable for different tasks in large, complex applications. For the sake of illustration, two different types of MPoPCs with hardware FPUs (floating-point units) are designed and implemented for credible performance evaluation and modeling: the coarse-grain MIMD (Multiple-Instruction, Multiple-Data) CG-MPoPC machine based on a processor IP (Intellectual Property) core and the mixed-mode (MIMD, SIMD or M-SIMD) variant-grain HERA (HEterogeneous Reconfigurable Architecture) machine. In addition to alleviating the above difficulties, MPoPCs can offer several performance and energy advantages to our data-parallel applications when compared to ASPCs; they are simpler and more scalable, and have less verification time and cost. Various common computation-intensive benchmark algorithms, such as matrix-matrix multiplication (MMM) and LU factorization, are studied and their parallel solutions are shown for the two MPoPCs. The performance is evaluated with large sparse real-world matrices primarily from power engineering. We expect even further performance gains on MPoPCs in the near future by employing ever improving FPGAs. The innovative nature of this work has the potential to guide research in this arising field of high-performance, low-cost reconfigurable computing. The largest advantage of reconfigurable logic lies in its large degree of hardware customization and reconfiguration which allows reusing the resources to match the computation and communication needs of applications. Therefore, a major effort in the presented design methodology for mixed-mode MPoPCs, like HERA, is devoted to effective resource management. A two-phase approach is applied. A mixed-mode weighted Task Flow Graph (w-TFG) is first constructed for any given application, where tasks are classified according to their most appropriate computing mode (e.g., SIMD or MIMD). At compile time, an architecture is customized and synthesized for the TFG using an Integer Linear Programming (ILP) formulation and a parameterized hardware component library. Various run-time scheduling schemes with different performanceenergy objectives are proposed. A system-level energy model for HERA, which is based on low-level implementation data and run-time statistics, is proposed to guide performance-energy trade-off decisions. A parallel power flow analysis technique based on Newton\u27s method is proposed and employed to verify the methodology

    Implementation of Data-Driven Applications on Two-Level Reconfigurable Hardware

    Get PDF
    RÉSUMÉ Les architectures reconfigurables Ă  large grain sont devenues un sujet important de recherche en raison de leur haut potentiel pour accĂ©lĂ©rer une large gamme d’applications. Ces architectures utilisent la nature parallĂšle de l’architecture matĂ©rielle pour accĂ©lĂ©rer les calculs. Les architectures reconfigurables Ă  large grain sont en mesure de combler les lacunes existantes entre le FPGA (architecture reconfigurable Ă  grain fin) et le processeur. Elles contrastent gĂ©nĂ©ralement avec les Application Specific Integrated Circuits (ASIC) en ce qui concerne la performance (moins bonnes) et la flexibilitĂ© (meilleures). La programmation d’architectures reconfigurables est un dĂ©fi qui date depuis longtemps et pose plusieurs problĂšmes. Les programmeurs doivent ĂȘtre avisĂ©s des caractĂ©ristiques du matĂ©riel sur lequel ils travaillent et connaĂźtre des langages de description matĂ©riels tels que VHDL et Verilog au lieu de langages de programmation sĂ©quentielle. L’implĂ©mentation d’un algorithme sur FPGA s’avĂšre plus difficile que de le faire sur des CPU ou des GPU. Les implĂ©mentations Ă  base de processeurs ont dĂ©jĂ  leur chemin de donnĂ©es prĂ© synthĂ©tisĂ© et ont besoin uniquement d’un programme pour le contrĂŽler. Par contre, dans un FPGA, le dĂ©veloppeur doit crĂ©er autant le chemin de donnĂ©es que le contrĂŽleur. Cependant, concevoir une nouvelle architecture pour exploiter efficacement les millions de cellules logiques et les milliers de ressources arithmĂ©tiques dĂ©diĂ©es qui sont disponibles dans une FPGA est une tĂąche difficile qui requiert beaucoup de temps. Seulement les spĂ©cialistes dans le design de circuits peuvent le faire. Ce projet est fondĂ© sur un tissu de calcul gĂ©nĂ©rique contrĂŽlĂ© par les donnĂ©es qui a Ă©tĂ© proposĂ© par le professeur J.P David et a dĂ©jĂ  Ă©tĂ© implĂ©mentĂ© par un Ă©tudiant Ă  la maĂźtrise M. Allard. Cette architecture est principalement formĂ©e de trois composants: l’unitĂ© arithmĂ©tique et logique partagĂ©e (Shared Arithmetic Logic Unit –SALU-), la machine Ă  Ă©tat pour le jeton des donnĂ©es (Token State Machine –TSM-) et la banque de FIFO (FIFO Bank –FB-). Cette architecture est semblable aux architectures reconfigurables Ă  large grain (Coarse-Grained Reconfigurable Architecture-CGRAs-), mais contrĂŽlĂ©e par les donnĂ©es.----------ABSTRACT Coarse-grained reconfigurable computing architectures have become an important research topic because of their high potential to accelerate a wide range of applications. These architectures apply the concurrent nature of hardware architecture to accelerate computations. Substantially, coarse-grained reconfigurable computing architectures can fill up existing gaps between FPGAs and processor. They typically contrast with Application Specific Integrated Circuits (ASICs) in connection with performance and flexibility. Programming reconfigurable computing architectures is a long-standing challenge, and it is yet extremely inconvenient. Programmers must be aware of hardware features and also it is assumed that they have a good knowledge of hardware description languages such as VHDL and Verilog, instead of the sequential programming paradigm. Implementing an algorithm on FPGA is intrinsically more difficult than programming a processor or a GPU. Processor-based implementations “only” require a program to control their pre-synthesized data path, while an FPGA requires that a designer creates a new data path and a new controller for each application. Nevertheless, conceiving an architecture that best exploits the millions of logic cells and the thousands of dedicated arithmetic resources available in an FPGA is a time-consuming challenge that only talented experts in circuit design can handle. This project is founded on the generic data-driven compute fabric proposed by Prof. J.P. David and implemented by M. Allard, a previous master student. This architecture is composed of three main individual components: the Shared Arithmetic Logic Unit (SALU), the Token State Machine (TSM) and the FIFO Bank (FB). The architecture is somewhat similar to Coarse-Grained Reconfigurable Architectures (CGRAs), but it is data-driven. Indeed, in that architecture, register banks are replaced by FBs and the controllers are TSMs. The operations start as soon as the operands are available in the FIFOs that contain the operands. Data travel from FBs to FBs through the SALU, as programmed in the configuration memory of the TSMs. Final results return in FIFOs

    Automated design of domain-specific custom instructions

    Get PDF

    Run-time management for future MPSoC platforms

    Get PDF
    In recent years, we are witnessing the dawning of the Multi-Processor Systemon- Chip (MPSoC) era. In essence, this era is triggered by the need to handle more complex applications, while reducing overall cost of embedded (handheld) devices. This cost will mainly be determined by the cost of the hardware platform and the cost of designing applications for that platform. The cost of a hardware platform will partly depend on its production volume. In turn, this means that ??exible, (easily) programmable multi-purpose platforms will exhibit a lower cost. A multi-purpose platform not only requires ??exibility, but should also combine a high performance with a low power consumption. To this end, MPSoC devices integrate computer architectural properties of various computing domains. Just like large-scale parallel and distributed systems, they contain multiple heterogeneous processing elements interconnected by a scalable, network-like structure. This helps in achieving scalable high performance. As in most mobile or portable embedded systems, there is a need for low-power operation and real-time behavior. The cost of designing applications is equally important. Indeed, the actual value of future MPSoC devices is not contained within the embedded multiprocessor IC, but in their capability to provide the user of the device with an amount of services or experiences. So from an application viewpoint, MPSoCs are designed to ef??ciently process multimedia content in applications like video players, video conferencing, 3D gaming, augmented reality, etc. Such applications typically require a lot of processing power and a signi??cant amount of memory. To keep up with ever evolving user needs and with new application standards appearing at a fast pace, MPSoC platforms need to be be easily programmable. Application scalability, i.e. the ability to use just enough platform resources according to the user requirements and with respect to the device capabilities is also an important factor. Hence scalability, ??exibility, real-time behavior, a high performance, a low power consumption and, ??nally, programmability are key components in realizing the success of MPSoC platforms. The run-time manager is logically located between the application layer en the platform layer. It has a crucial role in realizing these MPSoC requirements. As it abstracts the platform hardware, it improves platform programmability. By deciding on resource assignment at run-time and based on the performance requirements of the user, the needs of the application and the capabilities of the platform, it contributes to ??exibility, scalability and to low power operation. As it has an arbiter function between different applications, it enables real-time behavior. This thesis details the key components of such an MPSoC run-time manager and provides a proof-of-concept implementation. These key components include application quality management algorithms linked to MPSoC resource management mechanisms and policies, adapted to the provided MPSoC platform services. First, we describe the role, the responsibilities and the boundary conditions of an MPSoC run-time manager in a generic way. This includes a de??nition of the multiprocessor run-time management design space, a description of the run-time manager design trade-offs and a brief discussion on how these trade-offs affect the key MPSoC requirements. This design space de??nition and the trade-offs are illustrated based on ongoing research and on existing commercial and academic multiprocessor run-time management solutions. Consequently, we introduce a fast and ef??cient resource allocation heuristic that considers FPGA fabric properties such as fragmentation. In addition, this thesis introduces a novel task assignment algorithm for handling soft IP cores denoted as hierarchical con??guration. Hierarchical con??guration managed by the run-time manager enables easier application design and increases the run-time spatial mapping freedom. In turn, this improves the performance of the resource assignment algorithm. Furthermore, we introduce run-time task migration components. We detail a new run-time task migration policy closely coupled to the run-time resource assignment algorithm. In addition to detailing a design-environment supported mechanism that enables moving tasks between an ISP and ??ne-grained recon??gurable hardware, we also propose two novel task migration mechanisms tailored to the Network-on-Chip environment. Finally, we propose a novel mechanism for task migration initiation, based on reusing debug registers in modern embedded microprocessors. We propose a reactive on-chip communication management mechanism. We show that by exploiting an injection rate control mechanism it is possible to provide a communication management system capable of providing a soft (reactive) QoS in a NoC. We introduce a novel, platform independent run-time algorithm to perform quality management, i.e. to select an application quality operating point at run-time based on the user requirements and the available platform resources, as reported by the resource manager. This contribution also proposes a novel way to manage the interaction between the quality manager and the resource manager. In order to have a the realistic, reproducible and ??exible run-time manager testbench with respect to applications with multiple quality levels and implementation tradev offs, we have created an input data generation tool denoted Pareto Surfaces For Free (PSFF). The the PSFF tool is, to the best of our knowledge, the ??rst tool that generates multiple realistic application operating points either based on pro??ling information of a real-life application or based on a designer-controlled random generator. Finally, we provide a proof-of-concept demonstrator that combines these concepts and shows how these mechanisms and policies can operate for real-life situations. In addition, we show that the proposed solutions can be integrated into existing platform operating systems

    Memory hierarchy and data communication in heterogeneous reconfigurable SoCs

    Get PDF
    The miniaturization race in the hardware industry aiming at continuous increasing of transistor density on a die does not bring respective application performance improvements any more. One of the most promising alternatives is to exploit a heterogeneous nature of common applications in hardware. Supported by reconfigurable computation, which has already proved its efficiency in accelerating data intensive applications, this concept promises a breakthrough in contemporary technology development. Memory organization in such heterogeneous reconfigurable architectures becomes very critical. Two primary aspects introduce a sophisticated trade-off. On the one hand, a memory subsystem should provide well organized distributed data structure and guarantee the required data bandwidth. On the other hand, it should hide the heterogeneous hardware structure from the end-user, in order to support feasible high-level programmability of the system. This thesis work explores the heterogeneous reconfigurable hardware architectures and presents possible solutions to cope the problem of memory organization and data structure. By the example of the MORPHEUS heterogeneous platform, the discussion follows the complete design cycle, starting from decision making and justification, until hardware realization. Particular emphasis is made on the methods to support high system performance, meet application requirements, and provide a user-friendly programmer interface. As a result, the research introduces a complete heterogeneous platform enhanced with a hierarchical memory organization, which copes with its task by means of separating computation from communication, providing reconfigurable engines with computation and configuration data, and unification of heterogeneous computational devices using local storage buffers. It is distinguished from the related solutions by distributed data-flow organization, specifically engineered mechanisms to operate with data on local domains, particular communication infrastructure based on Network-on-Chip, and thorough methods to prevent computation and communication stalls. In addition, a novel advanced technique to accelerate memory access was developed and implemented

    Template-based embedded reconfigurable computing

    Get PDF
    XIV+212hlm.;24c

    Automatic Design of Efficient Application-centric Architectures.

    Full text link
    As the market for embedded devices continues to grow, the demand for high performance, low cost, and low power computation grows as well. Many embedded applications perform computationally intensive tasks such as processing streaming video or audio, wireless communication, or speech recognition and must be implemented within tight power budgets. Typically, general purpose processors are not able to meet these performance and power requirements. Custom hardware in the form of loop accelerators are often used to execute the compute-intensive portions of these applications because they can achieve significantly higher levels of performance and power efficiency. Automated hardware synthesis from high level specifications is a key technology used in designing these accelerators, because the resulting hardware is correct by construction, easing verification and greatly decreasing time-to-market in the quickly evolving embedded domain. In this dissertation, a compiler-directed approach is used to design a loop accelerator from a C specification and a throughput requirement. The compiler analyzes the loop and generates a virtual architecture containing sufficient resources to sustain the required throughput. Next, a software pipelining scheduler maps the operations in the loop to the virtual architecture. Finally, the accelerator datapath is derived from the resulting schedule. In this dissertation, synthesis of different types of loop accelerators is investigated. First, the system for synthesizing single loop accelerators is detailed. In particular, a scheduler is presented that is aware of the effects of its decisions on the resulting hardware, and attempts to minimize hardware cost. Second, synthesis of multifunction loop accelerators, or accelerators capable of executing multiple loops, is presented. Such accelerators exploit coarse-grained hardware sharing across loops in order to reduce overall cost. Finally, synthesis of post-programmable accelerators is presented, allowing changes to be made to the software after an accelerator has been created. The tradeoffs between the flexibility, cost, and energy efficiency of these different types of accelerators are investigated. Automatically synthesized loop accelerators are capable of achieving order-of-magnitude gains in performance, area efficiency, and power efficiency over processors, and programmable accelerators allow software changes while maintaining highly efficient levels of computation.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/61644/1/fank_1.pd

    Control Software for Reconfigurable Coprocessors

    Get PDF
    On-line data processing at the ATLAS general purpose particle detector, which is currently under construction at Geneva, generates demands on computing power that are difficult to satisfy with commodity CPU-based computers. One of the most demanding applications is the recognition of particle tracks that originate from B-quark decays. However, this and many others applications can benefit from parallel execution on field programmable gate arrays (FPGA). After the demonstration of accelerated track recognition with big FPGA-based custom computers, the development of FPGA based coprocessors started in the late 1990's. Applications of FPGA coprocessors are usually partitioned between the host and the tightly coupled coprocessor. The objective of the research that I present in this thesis was the development of software that mediates to applications the access to FPGA coprocessors. I used a software process based on iterative prototyping to cope with the expected changing requirements. Also, I used a strict bottom-up design to create classes that model devices on the coprocessors. Using these low-level classes, I developed tools which were used for bootstrapping, debugging, and firmware update of the coprocessors during their development and maintenance. Measurements show that the software overhead introduced by object-oriented programming and software layering is small. The software-support for six different coprocessors was partitioned into corresponding independent packages, which reuse a set of packages that provide common and basic functions. The steady evolution and use of the software during more than four years shows that the software is maintainable, adaptable, and usable

    High-performance and hardware-aware computing: proceedings of the second International Workshop on New Frontiers in High-performance and Hardware-aware Computing (HipHaC\u2711), San Antonio, Texas, USA, February 2011 ; (in conjunction with HPCA-17)

    Get PDF
    High-performance system architectures are increasingly exploiting heterogeneity. The HipHaC workshop aims at combining new aspects of parallel, heterogeneous, and reconfigurable microprocessor technologies with concepts of high-performance computing and, particularly, numerical solution methods. Compute- and memory-intensive applications can only benefit from the full hardware potential if all features on all levels are taken into account in a holistic approach
    • 

    corecore