10 research outputs found

    A Micro Power Hardware Fabric for Embedded Computing

    Field Programmable Gate Arrays (FPGAs) mitigate many of the problems encountered in the development of ASICs by offering flexibility, faster time-to-market, and amortized NRE costs, among other benefits. While FPGAs are increasingly used for complex computational applications such as signal and image processing, networking, and cryptology, they are far from ideal for these tasks because of their relatively high power consumption and silicon usage overheads compared to a direct ASIC implementation. A reconfigurable device that exhibits ASIC-like power characteristics and FPGA-like costs and tool support is desirable to fill this void. In this research, a parameterized, reconfigurable fabric model named the domain-specific fabric (DSF) is developed that exhibits ASIC-like power characteristics for Digital Signal Processing (DSP) style applications. Using this model, the impact of varying different design parameters on power and performance has been studied. Optimization techniques such as local search and simulated annealing are used to determine the appropriate interconnect for a specific set of applications. A design space exploration tool has been developed to automate the generation of a tailored architectural instance of the fabric. The fabric has been synthesized in a 160 nm cell-based ASIC fabrication process from OKI and a 130 nm process from IBM. A detailed power-performance analysis has been completed using signal and image processing benchmarks from the MediaBench benchmark suite and elsewhere, with comparisons to other hardware and software implementations. The optimized fabric implemented in the 130 nm process yields energy within 3X of a direct ASIC implementation, 330X better than a Virtex-II Pro FPGA, and 2016X better than an Intel XScale processor.

    A Physical Implementation with Custom Low Power Extensions of a Reconfigurable Hardware Fabric

    The primary focus of this thesis is the physical implementation of the SuperCISC Reconfigurable Hardware Fabric (RHF). The SuperCISC RHF provides a fast time-to-market solution that approximates the benefits of an ASIC (Application Specific Integrated Circuit) while retaining the design flow of an embedded software system. The fabric, which consists of computational ALU stripes and configurable multiplexer-based interconnect stripes, has been implemented in the IBM 0.13um CMOS process using Cadence SoC Encounter. Because the entire hardware fabric uses a combinational flow, glitching power consumption is a potential problem inherent to the fabric. A CMOS thyristor-based programmable delay element has been designed in the IBM 0.13um CMOS process to minimize the glitch power consumed in the hardware fabric. The delay element was characterized for use in the IBM standard cell library so that standard cell ASIC designs requiring this capability, such as the SuperCISC fabric, can be synthesized. The thesis also introduces a power-gated memory solution, which can be used to increase the size of an EEPROM memory for use in SoC-style applications. A macromodel of the EEPROM has been used to model its erase, program, and read characteristics. This memory is designed for use in the fabric for storing encryption keys, etc.

    Static Timing Analysis Based Transformations of Super-Complex Instruction Set Hardware Functions

    Application-specific hardware implementations are an increasingly popular way of reducing execution time and power consumption in embedded systems. Such hardware typically requires a small fraction of the execution time and power that the equivalent software code would require. Modern electronic design automation (EDA) tools can apply a variety of transformations to hardware blocks to achieve additional performance and power savings. A number of these transformations require a tool with knowledge of the design's timing characteristics. This thesis describes a static timing analyzer and two design automation tools based on timing analysis. The static timing analyzer estimates the worst-case timing characteristics of a hardware data flow graph. These hardware data flow graphs are intermediate representations generated within a C to VHDL hardware acceleration compiler. Two EDA tools that use static timing analysis were then developed. An automated pipelining tool increases the throughput of large blocks of combinational logic generated by the hardware acceleration compiler. A second tool attempts to mitigate power consumption resulting from extraneous combinational switching. By inserting special signal buffers with preselected propagation delays, known as delay elements, combinational functional units can be kept inactive until their inputs have stabilized. The hardware descriptions generated by both tools were synthesized, simulated, and power profiled using existing commercial EDA tools. The results show that pipelining leads to an average performance increase of 3.3x, while delay elements saved between 25% and 33% of the power consumption when tested on a set of signal and image processing benchmarks.
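
The worst-case timing estimate described in this abstract can be illustrated with a longest-path computation over a data flow graph. The following is a minimal sketch, not the thesis's analyzer: the node names, delay values, and the functions `arrival_times` and `input_stable_times` are hypothetical and chosen only for illustration. It also shows how the time at which a unit's inputs settle, which is what a delay element would need to wait for, falls out of the same analysis.

```python
from collections import defaultdict, deque


def arrival_times(delays, edges):
    """Worst-case output arrival time of every node in a combinational DAG.

    delays: dict mapping node -> propagation delay (hypothetical units)
    edges:  list of (src, dst) data dependencies
    """
    preds = defaultdict(list)
    succs = defaultdict(list)
    indegree = {node: 0 for node in delays}
    for src, dst in edges:
        preds[dst].append(src)
        succs[src].append(dst)
        indegree[dst] += 1

    # Visit nodes in topological order (Kahn's algorithm).
    ready = deque(node for node, deg in indegree.items() if deg == 0)
    arrival = {}
    while ready:
        node = ready.popleft()
        latest_input = max((arrival[p] for p in preds[node]), default=0)
        arrival[node] = latest_input + delays[node]
        for succ in succs[node]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)

    if len(arrival) != len(delays):
        raise ValueError("graph contains a cycle; not a combinational DAG")
    return arrival


def input_stable_times(delays, edges):
    """Time at which all inputs of each node have settled (0 for sources)."""
    arrival = arrival_times(delays, edges)
    stable = {node: 0 for node in delays}
    for src, dst in edges:
        stable[dst] = max(stable[dst], arrival[src])
    return stable


# Hypothetical graph: (mul -> add -> mux) and (sub -> mux).
delays = {"mul": 4, "add": 2, "sub": 2, "mux": 1}
edges = [("mul", "add"), ("add", "mux"), ("sub", "mux")]

arrival = arrival_times(delays, edges)
critical_path_delay = max(arrival.values())
stable = input_stable_times(delays, edges)
```

In this toy graph the `mux` sees its `sub` input early and its `add` input late, so it could switch spuriously in between; holding it inactive until `stable["mux"]` is the idea behind the delay elements described above.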

    Design space exploration for low-power reconfigurable fabrics

    Field Programmable Gate Array (FPGA)-like programmability and Computer Aided Design (CAD), with Application Specific Integrated Circuit (ASIC)-like power characteristics for Digital Signal Processing (DSP) style applications. Using this model, architectural design space decisions are explored in order to define an energy-efficient fabric. The impact on energy and performance of varying parameters such as data width and interconnection flexibility has been studied. Multiplexer cardinality usage has also been studied by mapping some of the signal processing applications onto the fabric. The results point to the use of power-optimized 32-bit computational elements interconnected by low-cardinality multiplexers such as 4:1 multiplexers. I

    THE VLIW-SUPERCISC COMPILER: EXPLOITING PARALLELISM FROM C-BASED APPLICATIONS

    A common approach to decreasing embedded application execution time is creating a homogeneous parallel processor architecture. The parallelism of any such architecture is limited to the number of instructions that can be scheduled in the same cycle. This number of instructions scheduled in a cycle, or instruction-level parallelism (ILP), is limited by the ability to extract parallelism from the application. Other techniques attempt to improve performance with hardware acceleration. Often, segments of computationally intensive code are extracted and custom hardware is created to replace the software execution. This technique requires many resources and still does not address the segments of code outside of the computationally intensive kernel. To solve this problem, hardware acceleration of computationally intensive code segments, combined with acceleration of the entire application using very long instruction word (VLIW) techniques, is proposed. (1) A compilation flow that targets a 4-wide VLIW processor architecture is presented. This system was used to investigate the available speed-up of VLIW architectures. The architecture was modified to combine the VLIW processor with the capability to execute application-specific customized instructions. To create the custom instruction hardware, a control and data flow graph (CDFG) framework was created to support compiler transformations and hardware generation. In order to remove control flow from segments of code selected for hardware generation, (2) the technique of hardware predication was developed. Hardware predication allows if-then and if-then-else control flow constructs to be transformed into strict data flow through the use of multiplexors. From the transformed CDFGs, (3) a VHDL generation pass was created that translates the compiler data structures into synthesizable VHDL.
    The resulting architecture contains the VLIW processor and tightly coupled application-specific hardware. This architecture was analyzed for performance changes compared to the initial VLIW architecture and a traditional processor. Lastly, (4) the architecture was analyzed for power and energy savings. A post static timing pass was added to the compilation flow for the insertion of hardware to delay early switching of operations. By measuring only the execution of the hardware function and comparing the performance to the equivalent code executed in software, a performance multiplier of up to 322 times is seen when synthesized onto an Altera Stratix II ES2S180F1508C4 FPGA. The average performance increase was 63 times. For the entire application, the speedup reached nearly 30X and was on average 12X better than a single processor implementation. The power and energy required by the VLIW processor core and the hardware functions for the computational kernels after 160nm OKI standard cell ASIC synthesis show a maximum power savings of 417 times that of execution on the processor, with an average of 133 times savings in power consumption. Because energy is the product of power and execution time, the reduced execution time and the savings in power have a multiplicative effect on the energy savings. The energy improvement is therefore several orders of magnitude for the hardware functions; the savings range from over 1,000X to approximately 60,000X.
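
The hardware predication step described in this abstract can be sketched in software. The example below is an illustrative assumption, not the compiler's actual transformation: the functions `mux`, `abs_diff_branching`, `abs_diff_predicated`, and `clamp_predicated` are hypothetical. Both branch results are computed unconditionally, as parallel hardware units would do, and a 2:1 multiplexer selects the one that is kept.

```python
def mux(select, when_true, when_false):
    """Software model of a 2:1 multiplexer: the select line picks one input."""
    return when_true if select else when_false


def abs_diff_branching(a, b):
    """Original control flow: an if-then-else construct."""
    if a > b:
        return a - b
    else:
        return b - a


def abs_diff_predicated(a, b):
    """Predicated form: pure data flow, no branches."""
    cond = a > b          # comparator output drives the mux select line
    then_val = a - b      # 'then' side is always computed
    else_val = b - a      # 'else' side is always computed
    return mux(cond, then_val, else_val)


def clamp_predicated(x, limit):
    """If-then with no else: the unchanged value feeds the other mux input.

    Equivalent to:  if x > limit: x = limit
    """
    cond = x > limit
    return mux(cond, limit, x)
```

The two forms of `abs_diff` return the same value for every input; the predicated one trades extra computation for the absence of control flow, which is what allows the code segment to be implemented as combinational hardware.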

    A Hybrid Hardware/Software Architecture That Combines a 4-wide Very Long Instruction Word Software Processor (VLIW) with Application-specific Super-complex Instruction Set Hardware Functions

    Application-driven processor designs are becoming increasingly feasible. Today, advances in field-programmable gate array (FPGA) technology are opening the doors to fast and highly feasible hardware/software co-designed architectures. Over 100,000 FPGA logic array blocks and nearly 100 ASIC multiply-accumulate cores combine with extensible CPU cores to foster the design of configurable, application-driven hybrid processors. This thesis proposes a hardware/software co-designed architecture targeted to an FPGA. The architecture is a very long instruction word (VLIW) processor coupled with super-complex instruction set (SuperCISC) hardware co-processors. Results of the VLIW/SuperCISC show performance speedups over a single-issue processor of 9x to 332x, and entire application speedups from 4x to 127x. Contributions of this research include a 4-way VLIW designed from the ground up, a zero-overhead implementation of a hardware/software interface, evaluation of the scalability of shared data stores, examples of application-specific hardware accelerants, a SystemC simulator, and an evaluation of shared memory configurations.

    OPTIMIZATION OF MAPPING ONTO A FLEXIBLE LOW-POWER ELECTRONIC FABRIC ARCHITECTURE

    A combinatorial problem that arises from a novel electronic fabric architecture designed for low-power devices such as cellular phones and palm computers is presented. We consider the problem of efficiently mapping a given data flow graph onto a particular implementation of the fabric architecture. We formulate mixed integer linear programs (MILP) and design a sliding partial MILP heuristic for this problem. We highlight the modeling and algorithmic aspects that are necessary to make the MILP formulation competitive. The sliding partial MILP heuristic is developed to generate mappings faster and to find mappings for benchmark instances that cannot be solved by the MILP formulation. We also present a method to tune software parameters using ideas from software testing and machine learning. The method is based on the key observation that for many classes of instances, the software shows improved performance if a few critical parameters have good values, although which parameters are critical depends on the class of instances. Our method attempts to find good parameter values using a relatively small number of optimization trials.

    A Low-Energy Reconfigurable Fabric for the SuperCISC Architecture
