1,003 research outputs found
The IPS fidelity scale as a guideline to implement Supported Employment
info:eu-repo/semantics/publishe
Template-based embedded reconfigurable computing
XIV+212hlm.;24c
An FPGA implementation of an investigative many-core processor, Fynbos : in support of a Fortran autoparallelising software pipeline
Includes bibliographical references.In light of the power, memory, ILP, and utilisation walls facing the computing industry, this work examines the hypothetical many-core approach to finding greater compute performance and efficiency. In order to achieve greater efficiency in an environment in which Moore’s law continues but TDP has been capped, a means of deriving performance from dark and dim silicon is needed. The many-core hypothesis is one approach to exploiting these available transistors efficiently. As understood in this work, it involves trading in hardware control complexity for hundreds to thousands of parallel simple processing elements, and operating at a clock speed sufficiently low as to allow the efficiency gains of near threshold voltage operation. Performance is there- fore dependant on exploiting a new degree of fine-grained parallelism such as is currently only found in GPGPUs, but in a manner that is not as restrictive in application domain range. While removing the complex control hardware of traditional CPUs provides space for more arithmetic hardware, a basic level of control is still required. For a number of reasons this work chooses to replace this control largely with static scheduling. This pushes the burden of control primarily to the software and specifically the compiler, rather not to the programmer or to an application specific means of control simplification. An existing legacy tool chain capable of autoparallelising sequential Fortran code to the degree of parallelism necessary for many-core exists. This work implements a many-core architecture to match it. Prototyping the design on an FPGA, it is possible to examine the real world performance of the compiler-architecture system to a greater degree than simulation only would allow. Comparing theoretical peak performance and real performance in a case study application, the system is found to be more efficient than any other reviewed, but to also significantly under perform relative to current competing architectures. This failing is apportioned to taking the need for simple hardware too far, and an inability to implement static scheduling mitigating tactics due to lack of support for such in the compiler
FPGA-SPICE: A Simulation-based Power Estimation Framework for FPGAs
Mainstream Field Programmable Gate Array (FPGA) power estimation tools are based on probabilistic activity estimation and analytical power models. The power consumption of the programmable resources of FPGAs is highly sensitive to their configurations. Due to their highly flexible nature, the configurations of FPGAs routing multiplexers or Look Up Tables (LUTs) are really different from a design to another but current analytical power models cannot accurately capture the associated power differences. In this paper, we introduce a simulation-based power estimation framework for FPGAs, called FPGA-SPICE, which supports any FPGA architecture that can be described with an architectural description language. Our power estimation engine automatically generates accurate SPICE netlists according to the FPGA configurations and enables precise power analysis of FPGA architectures. SPICE testbenches can be generated at different level of complexity, denoted as full-chip-level, grid-level and component-level testbenches. Full-chip-level testbenches dump the netlists associated with the complete FPGA fabric. To reduce simulation time, FPGA-SPICE can split the full-chip-level testbenches into grid-level testbenches, each of which consisting of a complete logic block netlist, or component-level testbenches, which consider individual circuit elements, i.e., multiplexers, LUTs, flip-flops, etc., separately. We show that the grid/component-level approach can achieve 14 x speed-up with a moderate 14% accuracy loss, compared to the full-chip level. We also use FPGA-SPICE to study the power characteristics of a commercial FPGA architecture at different technology nodes. Experimental results show that the global routing architecture consumes 50% of the total power, the local routing architecture claims for 40% of the total power, and the remaining 10% comes from the LUTs and flip-flops
FPGA implementation of a frame delay
The objective of this thesis is to investigate the applicability of Field Programmable Gate Arrays (FPGAs) for frame delay implementation. FPGAs are programmable devices that can be directly configured by the end user without the use of an integrated circuit fabrication facility. They offer the designer the benefits of custom hardware, eliminating high development costs and manufacturing time. Frame delays are easier to realize using R/W memory where data is written into the memory and read out for each frame. FPGAs are used in a Quartus II environment as it is easy to perform frame delay implementation using schematic entry procedure. Since FPGAs use look-up tables as configurable logic blocks, they are considered as an appropriate choice for frame delay based designs
FPGA-SPICE: A Simulation-Based Architecture Evaluation Framework for FPGAs
In this paper, we developed a simulation-based architecture evaluation framework for field-programmable gate arrays (FPGAs), called FPGA-SPICE, which enables automatic layout-level estimation and electrical simulations of FPGA architectures. FPGA-SPICE can automatically generate Verilog and SPICE netlists based on realistic FPGA configurations and a high-level eTtensible Markup Language-based FPGA architectural description language. The outputted Verilog netlists can be used to generate layouts of full FPGA fabrics through a semicustom design flow. SPICE simulation decks can be generated at three levels of complexity, namely, full-chip-level, grid-level, and component-level, providing different tradeoff between accuracy and simulation time. In order to enable such level of analysis, we presented two SPICE netlist partitioning techniques: loads extraction and parasitic net activity estimation. Electrical simulations showed that averaged over the selected benchmarks, the grid-/component-level approach can achieve 6.1x/7.5x execution speed-up with 9.9%/8.3% accuracy loss, respectively, compared to the full-chip level simulation. FPGA-SPICE was showcased through three different case studies: 1) an area breakdown analysis for static random access memory-based FPGAs, showing that configuration memories are a dominant factor; 2) a power breakdown comparison to analytical models, analyzing the source of accuracy loss; and 3) a robustness evaluation against process corners, studying their impact on energy consumption of full FPGA fabrics
Pattern-Based FPGA Logic Block and Clustering Algorithm
In classical FPGA, LUTs and DFFs are pre-packed into BLEs and then BLEs are grouped into logic blocks. We propose a novel logic block architecture with fast combinational paths between LUTs, called pattern-based logic blocks. A new clustering algorithm is developed to release the potential of pattern-based logic blocks. Experimental results show that the novel architecture and the associated clustering algorithm lead to a 14% performance gain and a 8% wirelength reduction with a 3% area overhead compared to conventional architecture in large control-instensive benchmarks
Recommended from our members
Investigation into the wafer-scale integration of fine-grain parallel processing computer systems
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.This thesis investigates the potential of wafer-scale integration (WSI) for the implementation of low-cost fine-grain parallel processing computer systems. As WSI is a relatively new subject, there was little work on which to base investigations. Indeed, most WSI architectures existed only as untried and sometimes vague proposals. Accordingly, the research strategy approached this problem by identifying a representative WSI structure and architecture on which to base investigations. An analysis of architectural proposals identified associative memory to be general purpose parallel processing component used in a wide range of WSI architectures. Furthermore, this analysis provided a set of WSI-level design requirements to evaluate the sustainability of different architectures as research vehicles. The WSI-ASP (WASP) device, which has a large associative memory as its main component is shown to meet these requirements and hence was chosen as the research vehicle. Consequently, this thesis addresses WSI potential through an in-depth investigation into the feasibility of implementing a large associative memory for the WASP device that meets the demanding technological constraints of WSI. Overall, the thesis concludes that WSI offers significant potential for the implementation of low-cost fine-grain parallel processing computer systems. However, due to the dual constraints of thermal management and the area required for the power distribution network, power density is a major design constraint in WSI. Indeed, it is shown that WSI power densities need to be an order of magnitude lower than VLSI power densities. The thesis demonstrates that for associative memories at least, VLSI designs are unsuited to implementation in WSI. Rather, it is shown that WSI circuits must be closely matched to the operational environment to assure suitable power densities. These circuits are significantly larger than their VLSI equivalents. Nonetheless, the thesis demonstrates that by concentrating on the most power intensive circuits, it is possible to achieve acceptable power densities with only a modest increase in area overheads.SER
Studies on Core-Based Testing of System-on-Chips Using Functional Bus and Network-on-Chip Interconnects
The tests of a complex system such as a microprocessor-based system-onchip
(SoC) or a network-on-chip (NoC) are difficult and expensive. In this thesis,
we propose three core-based test methods that reuse the existing functional
interconnects-a flat bus, hierarchical buses of multiprocessor SoC's (MPSoC),
and a N oC-in order to avoid the silicon area cost of a dedicated test access mechanism
(TAM). However, the use of functional interconnects as functional TAM's
introduces several new problems.
During tests, the interconnects-including the bus arbitrator, the bus bridges,
and the NoC routers-operate in the functional mode to transport the test stimuli
and responses, while the core under tests (CUT) operate in the test mode. Second,
the test data is transported to the CUT through the functional bus, and not
directly to the test port. Therefore, special core test wrappers that can provide
the necessary control signals required by the different functional interconnect are
proposed. We developed two types of wrappers, one buffer-based wrapper for the
bus-based systems and another pair of complementary wrappers for the NoCbased
systems.
Using the core test wrappers, we propose test scheduling schemes for the three
functionally different types of interconnects. The test scheduling scheme for a flat
bus is developed based on an efficient packet scheduling scheme that minimizes
both the buffer sizes and the test time under a power constraint. The schedulingscheme is then extended to take advantage of the hierarchical bus architecture of
the MPSoC systems. The third test scheduling scheme based on the bandwidth
sharing is developed specifically for the NoC-based systems. The test scheduling
is performed under the objective of co-optimizing the wrapper area cost and the
resulting test application time using the two complementary NoC wrappers.
For each of the proposed methodology for the three types of SoC architec ..
ture, we conducted a thorough experimental evaluation in order to verify their
effectiveness compared to other methods
- …