14 research outputs found

    The Potential for a GPU-Like Overlay Architecture for FPGAs

    Get PDF
    We propose a soft processor programming model and architecture inspired by graphics processing units (GPUs) that are well-matched to the strengths of FPGAs, namely, highly parallel and pipelinable computation. In particular, our soft processor architecture exploits multithreading, vector operations, and predication to supply a floating-point pipeline of 64 stages via hardware support for up to 256 concurrent thread contexts. The key new contributions of our architecture are mechanisms for managing threads and register files that maximize data-level and instruction-level parallelism while overcoming the challenges of port limitations of FPGA block memories as well as memory and pipeline latency. Through simulation of a system that (i) is programmable via NVIDIA's high-level Cg language, (ii) supports AMD's CTM r5xx GPU ISA, and (iii) is realizable on an XtremeData XD1000 FPGA-based accelerator system, we demonstrate the potential for such a system to achieve 100% utilization of a deeply pipelined floating-point datapath

    Block Processor: A Resource-distributed Architecture

    Get PDF
    Abstract-We present the architecture of Block Processor, task-level coprocessor, to execute vectorizable computing task migrated from main processor via command bus. The Block Processor is designed around 32 high-MVL block registers, which can be direct operands of vector instruction and be local cache of the Block Processor. The corresponding unique conflictsolving mechanism scales with the various implementations and easily supports chaining by adding extra execution states. The architecture distributes the block registers, ALUs and control logic. We implement the Block Processor which maps efficiently into the FPGA since the FPGA also distributes its inner resource. Each block register requires two FPGA Block RAM to be 2-read-1-write-port, 1024-depth and 32-bit-width. With the enhanced chaining and decoupling, it might hinder the latency of vector memory instructions and then sustain the computing abilities. With the little resource occupied, 1024-point radix-2 DIF FFT costs 11348 cycles on one Block Processor

    Just In Time Assembly (JITA) - A Run Time Interpretation Approach for Achieving Productivity of Creating Custom Accelerators in FPGAs

    Get PDF
    The reconfigurable computing community has yet to be successful in allowing programmers to access FPGAs through traditional software development flows. Existing barriers that prevent programmers from using FPGAs include: 1) knowledge of hardware programming models, 2) the need to work within the vendor specific CAD tools and hardware synthesis. This thesis presents a series of published papers that explore different aspects of a new approach being developed to remove the barriers and enable programmers to compile accelerators on next generation reconfigurable manycore architectures. The approach is entitled Just In Time Assembly (JITA) of hardware accelerators. The approach has been defined to allow hardware accelerators to be built and run through software compilation and run time interpretation outside of CAD tools and without requiring each new accelerator to be synthesized. The approach advocates the use of libraries of pre-synthesized components that can be referenced through symbolic links in a similar fashion to dynamically linked software libraries. Synthesis still must occur but is moved out of the application programmers software flow and into the initial coding process that occurs when programming patterns that define a Domain Specific Language (DSL) are first coded. Programmers see no difference between creating software or hardware functionality when using the DSL. A new run time interpreter is introduced to assemble the individual pre-synthesized hardware accelerators that comprise the accelerator functionality within a configurable tile array of partially reconfigurable slots at run time. Quantitative results are presented that compares utilization, performance, and productivity of the approach to what would be achieved by full custom accelerators created through traditional CAD flows using hardware programming models and passing through synthesis

    Revisiting the high-performance reconfigurable computing for future datacenters

    Get PDF
    Modern datacenters are reinforcing the computational power and energy efficiency by assimilating field programmable gate arrays (FPGAs). The sustainability of this large-scale integration depends on enabling multi-tenant FPGAs. This requisite amplifies the importance of communication architecture and virtualization method with the required features in order to meet the high-end objective. Consequently, in the last decade, academia and industry proposed several virtualization techniques and hardware architectures for addressing resource management, scheduling, adoptability, segregation, scalability, performance-overhead, availability, programmability, time-to-market, security, and mainly, multitenancy. This paper provides an extensive survey covering three important aspects-discussion on non-standard terms used in existing literature, network-on-chip evaluation choices as a mean to explore the communication architecture, and virtualization methods under latest classification. The purpose is to emphasize the importance of choosing appropriate communication architecture, virtualization technique and standard language to evolve the multi-tenant FPGAs in datacenters. None of the previous surveys encapsulated these aspects in one writing. Open problems are indicated for scientific community as well

    Application Specific Customization and Scalability of Soft Multiprocessors

    Full text link

    A Hybrid Partially Reconfigurable Overlay Supporting Just-In-Time Assembly of Custom Accelerators on FPGAs

    Get PDF
    The state of the art in design and development flows for FPGAs are not sufficiently mature to allow programmers to implement their applications through traditional software development flows. The stipulation of synthesis as well as the requirement of background knowledge on the FPGAs\u27 low-level physical hardware structure are major challenges that prevent programmers from using FPGAs. The reconfigurable computing community is seeking solutions to raise the level of design abstraction at which programmers must operate, and move the synthesis process out of the programmers\u27 path through the use of overlays. A recent approach, Just-In-Time Assembly (JITA), was proposed that enables hardware accelerators to be assembled at runtime, all from within a traditional software compilation flow. The JITA approach presents a promising path to constructing hardware designs on FPGAs using pre-synthesized parallel programming patterns, but suffers from two major limitations. First, all variant programming patterns must be pre-synthesized. Second, conditional operations are not supported. In this thesis, I present a new reconfigurable overlay, URUK, that overcomes the two limitations imposed by the JITA approach. Similar to the original JITA approach, the proposed URUK overlay allows hardware accelerators to be constructed on FPGAs through software compilation flows. To this basic capability, URUK adds additional support to enable the assembly of presynthesized fine-grained computational operators to be assembled within the FPGA. This thesis provides analysis of URUK from three different perspectives; utilization, performance, and productivity. The analysis includes comparisons against High-Level Synthesis (HLS) and the state of the art approach to creating static overlays. The tradeoffs conclude that URUK can achieve approximately equivalent performance for algebra operations compared to HLS custom accelerators, which are designed with simple experience on FPGAs. Further, URUK shows a high degree of flexibility for runtime placement and routing of the primitive operations. The analysis shows how this flexibility can be leveraged to reduce communication overhead among tiles, compared to traditional static overlays. The results also show URUK can enable software programmers without any hardware skills to create hardware accelerators at productivity levels consistent with software development and compilation

    Instruction fusion and vector processor virtualization for higher throughput simultaneous multithreaded processors

    Get PDF
    The utilization wall, caused by the breakdown of threshold voltage scaling, hinders performance gains for new generation microprocessors. To alleviate its impact, an instruction fusion technique is first proposed for multiscalar and many-core processors. With instruction fusion, similar copies of an instruction to be run on multiple pipelines or cores are merged into a single copy for simultaneous execution. Instruction fusion applied to vector code enables the processor to idle early pipeline stages and instruction caches at various times during program implementation with minimum performance degradation, while reducing the program size and the required instruction memory bandwidth. Instruction fusion is applied to a MIPS-based dual-core that resembles an ideal multiscalar of degree two. Benchmarking using an FPGA prototype shows a 6-11% reduction in dynamic power dissipation as well as a 17-45% decrease in code size with frequent performance improvements due to higher instruction cache hit rates. The second part of this dissertation deals with vector processors (VPs) which are commonly assigned exclusively to a single thread/core, and are not often performance and energy efficient due to mismatches with the vector needs of individual applications. An easy-to-implement VP virtualization technology is presented to improve the VP in terms of utilization and energy efficiency. The proposed VP virtualization technology, when applied, improves aggregate VP utilization by enabling simultaneous execution of multiple threads of similar or disparate vector lengths on a multithreaded VP. With a vector register file (VRF) virtualization technique invented to dynamically allocate physical vector registers to threads, the virtualization approach improves programmer productivity by providing at run time a distinct physical register name space to each competing thread, thus eliminating the need to solve register name conflicts statically. The virtualization technique is applied to a multithreaded VP prototyped on an FPGA; it supports VP sharing as well as power gating for better energy efficiency. A throughput-driven scheduler is proposed to optimize the virtualized VP’s utilization in dynamic environments where diverse threads are created randomly. Simulations of various low utilization benchmarks show that, with the proposed scheduler and power gating, the virtualized VP yields a larger than 3-fold speedup while the reduction in the total energy consumption approaches 40% compared to the same VP running in the single-threaded mode. The third part of this dissertation focuses on combining the two aforementioned technologies to create an improved VP prototype that is fully virtualized to support thread fusion and dynamic lane-based power-gating (PG). The VP is capable of dynamically triggering thread fusion according to the availability of similar threads in the task queue. Once thread fusion is triggered, every vector instruction issued to the virtualized VP is interpreted as two similar instructions working in two independent virtual spaces, thus doubling the vector instruction issue rate. Based on an accurate power model of the VP prototype, two different policies are proposed to dynamically choose the optimal number of active VP lanes. With the combined effort of VP lane-based PG and thread fusion, compared to a conventional VP without the two proposed capabilities, benchmarking shows that the new prototype yields up to 33.8% energy reduction in addition to 40% runtime improvement, or up to 62.7% reduction in the product of energy and runtime
    corecore