11 research outputs found

    SIMD pipelined processor implemented on a FPGA

    The goal of this thesis was to create a processor in VHDL that could be used for educational purposes and as a stepping stone toward a reconfigurable system for digital signal processing or image processing applications. To do this, a subset of MIPS instructions was chosen to demonstrate functionality within a five-stage pipelined processor (instruction fetch, instruction decode, execution, memory, and write back) in simulation and in synthesis. A hazard controller was implemented to handle data forwarding and stalling. The basic MIPS architecture was extended with single-cycle multiplication and single-cycle SIMD instructions. The architecture contains parameters that make it easy to modify the SIMD units depending on the needs of the processor. The SIMD architecture was designed with distributed memory in which every memory unit receives the same address. This simplifies the address logic, so the processor does not need a complex addressing mode. The memory can be pictured as a grid accessed by rows and columns. The SIMD instructions were chosen so that binary operations can implement future morphological operations and so that multiply and add operations can implement MACs for convolution and filtering in future image processing applications. The processor was verified on a Xilinx University Program (XUP) board containing a Xilinx Virtex II Pro XC2VP30 FPGA in the FF896 package. At most eight units can be instantiated in the FPGA on the XUP board, which uses the entire FPGA slice area. This allows the processor to complete eight sets of 32-bit data operations per cycle when the SIMD pipeline is full. The design was shown to operate at a maximum speed of 100 MHz while using all of the FPGA's area. The processor was verified in both simulation and synthesis.
The new soft-core 32-bit SIMD processor extends existing soft-core processors by providing a reconfigurable SIMD pipeline that operates on multiple inputs concurrently, with 32-bit operands and single-cycle throughput.
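
The shared-address memory scheme described in this abstract can be pictured with a small software model. The sketch below is only an illustration under stated assumptions, not the thesis's VHDL: the class `SimdArray`, its methods, and the 32-bit wrap-around of the accumulator are hypothetical names and choices introduced here. Each SIMD unit owns one memory bank (a column), and a single address selects the same row in every bank.

```python
from typing import List


class SimdArray:
    """Toy model: one private memory bank per SIMD unit, one shared address."""

    def __init__(self, num_units: int, depth: int) -> None:
        # banks[u][addr]: unit u is a column, addr selects the row.
        self.banks: List[List[int]] = [[0] * depth for _ in range(num_units)]
        # One accumulator per unit (assumed here, used for the MAC example).
        self.acc: List[int] = [0] * num_units

    def load_row(self, addr: int) -> List[int]:
        # Every bank receives the same address, so one load yields a full row.
        return [bank[addr] for bank in self.banks]

    def mac(self, addr_a: int, addr_b: int) -> None:
        # Lockstep multiply-accumulate: acc[u] += a[u] * b[u], wrapped to 32 bits.
        a = self.load_row(addr_a)
        b = self.load_row(addr_b)
        self.acc = [(s + x * y) & 0xFFFFFFFF for s, x, y in zip(self.acc, a, b)]


def demo_simd() -> List[int]:
    simd = SimdArray(num_units=8, depth=4)
    for unit, bank in enumerate(simd.banks):
        bank[0] = unit + 1  # row 0 holds 1..8, one value per unit
        bank[1] = 2         # row 1 holds the constant 2 in every unit
    simd.mac(0, 1)
    simd.mac(0, 1)
    return simd.acc
```

Because no unit computes its own address, the only per-unit state in this model is the bank contents and the accumulator, which mirrors the simplification of the address logic that the abstract describes.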

    Joint Use of Processor and System-on-Chip Tools (Prosessori- ja system-on-chip-työkalujen yhteiskäyttö)

    Transport-triggered architecture (TTA) processors provide an efficient middle ground for creating intellectual property (IP) components for system-on-chip (SoC) designs. With TTAs, the design effort is greatly reduced compared to an ASIC approach, and a more economical and efficient implementation is possible than with a general purpose processor. This thesis examines ways to accelerate the design flow when using TTA processors in SoC designs. The proposed flows combine the TTA-based Co-design Environment (TCE) tool set and the Kactus2 IP-XACT design environment. The IP-XACT standard and the Kactus2 tool make it easy to integrate and configure IP components from multiple vendors, whereas the TCE tools provide a fast and efficient path from C to VHDL. The thesis presents three use cases for TTA: as a ready-made fixed accelerator, as a general purpose processor, and as a tailored application-specific processor. Moreover, management of instance-specific data in IP-XACT is discussed. For each use case, the design flow is presented step by step, a case example is given, and the design time spent on each step is evaluated. The flows contain between 15 and 18 steps and use between 8 and 12 different program tools from the studied tool sets. Provided that the C source code and an IP-XACT library are available, an engineer without a hardware background can implement an FPGA-based multiprocessor product in less than 4 hours. Based on the results, further development suggestions for the TCE tools and Kactus2 are made.

    Static Timing Analysis Based Transformations of Super-Complex Instruction Set Hardware Functions

    Application-specific hardware implementations are an increasingly popular way of reducing execution time and power consumption in embedded systems. Such hardware typically requires a small fraction of the execution time and power that the equivalent software code would require. Modern electronic design automation (EDA) tools can be used to apply a variety of transformations to hardware blocks in an effort to achieve additional performance and power savings. A number of such transformations require a tool with knowledge of the design's timing characteristics. This thesis describes a static timing analyzer and two timing-analysis-based design automation tools. The static timing analyzer estimates the worst-case timing characteristics of a hardware data flow graph. These hardware data flow graphs are intermediate representations generated within a C to VHDL hardware acceleration compiler. Two EDA tools that use static timing analysis were then developed. An automated pipelining tool increases the throughput of large blocks of combinational logic generated by the hardware acceleration compiler. A second tool attempts to mitigate power consumption resulting from extraneous combinational switching. By inserting special signal buffers, known as delay elements, with preselected propagation delays, combinational functional units can be kept inactive until their inputs have stabilized. The hardware descriptions generated by both tools were synthesized, simulated, and power profiled using existing commercial EDA tools. The results show that pipelining leads to an average performance increase of 3.3x, while delay elements saved between 25% and 33% of the power consumption when tested on a set of signal and image processing benchmarks.
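
The worst-case timing estimate described in this abstract can be illustrated as a longest-path computation over a data flow graph. This is a minimal sketch under stated assumptions, not the thesis's analyzer: the function names, the node names, and the per-node delay values are hypothetical, and each node is assumed to have a single fixed propagation delay.

```python
from graphlib import TopologicalSorter
from typing import Dict, List


def worst_case_arrival(
    delays: Dict[str, float], preds: Dict[str, List[str]]
) -> Dict[str, float]:
    """Latest time at which each node's output is stable.

    delays: assumed propagation delay of every node in the graph.
    preds:  for each node, the nodes that drive its inputs.
    """
    graph = {node: preds.get(node, []) for node in delays}
    arrival: Dict[str, float] = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(graph).static_order():
        latest_input = max((arrival[p] for p in graph[node]), default=0.0)
        arrival[node] = latest_input + delays[node]
    return arrival


def critical_path_delay(
    delays: Dict[str, float], preds: Dict[str, List[str]]
) -> float:
    # The slowest output bounds the clock period of the combinational block.
    return max(worst_case_arrival(delays, preds).values())


# Hypothetical four-node graph: add = mul(a, b) + a
example_delays = {"a": 1.0, "b": 2.0, "mul": 4.0, "add": 1.5}
example_preds = {"mul": ["a", "b"], "add": ["mul", "a"]}
```

In a model like this, the arrival times are what a pipelining pass would use to decide where to cut the logic into stages, and the gap between the earliest and latest input of a node indicates how long a delay element would need to hold that node inactive.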

    Computing SpMV on FPGAs

    There are hundreds of papers on accelerating sparse matrix vector multiplication (SpMV), but only a handful target FPGAs. Some claim that FPGAs inherently perform worse than CPUs and GPUs. FPGAs do perform worse for some applications, such as matrix-matrix multiplication and matrix-vector multiplication, where CPUs and GPUs have too much memory bandwidth and too much floating point computation power for FPGAs to compete. However, the low ratio of computations to memory operations and the irregular memory access of SpMV trip up both CPUs and GPUs. We see this as a leveling of the playing field for FPGAs. Our implementation focuses on three pillars: matrix traversal, multiply-accumulator design, and matrix compression. First, most SpMV implementations traverse the matrix in row-major order, but we mix column and row traversal. Second, to accommodate the new traversal, the multiply-accumulator stores many intermediate y values. Third, we compress the matrix to increase the transfer rate of the matrix from RAM to the FPGA. Together these pillars enable our SpMV implementation to perform competitively with CPUs and GPUs.
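
For reference, the conventional row-major traversal that this abstract contrasts against can be sketched as follows. This is not the authors' mixed row and column traversal, which the abstract does not specify. The compressed sparse row (CSR) layout and the function name `spmv_csr` are assumptions chosen for illustration, since the abstract does not name a storage format.

```python
from typing import List, Sequence


def spmv_csr(
    values: Sequence[float],
    col_idx: Sequence[int],
    row_ptr: Sequence[int],
    x: Sequence[float],
) -> List[float]:
    """Compute y = A @ x for a matrix A stored in CSR form, row by row."""
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        # Nonzeros of this row occupy values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is read at a data-dependent index: the irregular access.
            y[row] += values[k] * x[col_idx[k]]
    return y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
example_values = [1.0, 2.0, 3.0, 4.0, 5.0]
example_col_idx = [0, 2, 1, 0, 2]
example_row_ptr = [0, 2, 3, 5]
example_x = [1.0, 2.0, 3.0]
```

Each nonzero contributes one multiply and one add but requires loading a value, a column index, and an entry of `x`, which is the low computation-to-memory ratio the abstract refers to. In this row-major form only one running sum is live at a time, whereas a traversal that also moves along columns would need to keep many partial `y` values, as the abstract notes.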

    DESIGN AUTOMATION FOR LOW POWER RFID TAGS

    Radio Frequency Identification (RFID) tags are small, wireless devices capable of automated item identification, used in a variety of applications including supply chain management, asset management, automatic toll collection (EZ Pass), etc. However, the design of these types of custom systems using the traditional methods can take months for a hardware engineer to develop and debug. In this dissertation, an automated, low-power flow for the design of RFID tags has been developed, implemented and validated. This dissertation presents the RFID Compiler, which permits high-level design entry using a simple description of the desired primitives and their behavior in ANSI-C. The compiler has different back-ends capable of targeting microprocessor-based or custom hardware-based tags. For the hardware-based tag, the back-end automatically converts the user-supplied behavior in C to low power synthesizable VHDL optimized for RFID applications. The compiler also integrates a fast, high-level power macromodeling flow, which can be used to generate power estimates within 15% accuracy of industry CAD tools and to optimize the primitives and / or the behaviors, compared to conventional practices. Using the RFID Compiler, the user can develop the entire design in a matter of days or weeks. The compiler has been used to implement standards such as ANSI, ISO 18000-7, 18000-6C and 18185-7. The automatically generated tag designs were validated by targeting microprocessors such as the AD Chips EISC and FPGAs such as Xilinx Spartan 3. The corresponding ASIC implementation is comparable to the conventionally designed commercial tags in terms of the energy and area. Thus, the RFID Compiler permits the design of power efficient, custom RFID tags by a wider audience with a dramatically reduced design cycle

    OPTIMIZATION OF MAPPING ONTO A FLEXIBLE LOW-POWER ELECTRONIC FABRIC ARCHITECTURE

    A combinatorial problem that arises from a novel electronic fabric architecture designed for low-power devices such as cellular phones and palm computers is presented. We consider the problem of efficiently mapping a given data flow graph onto a particular implementation of the fabric architecture. We formulate mixed integer linear programs (MILP) and design a sliding partial MILP heuristic for this problem. We highlight the modeling and algorithmic aspects that are necessary to make the MILP formulation competitive. The sliding partial MILP heuristic is developed to generate mappings faster and to find mappings for benchmark instances that cannot be solved by the MILP formulation. We also present a method to tune software parameters using ideas from software testing and machine learning. The method is based on the key observation that for many classes of instances, the software shows improved performance if a few critical parameters have good values, although which parameters are critical depends on the class of instances. Our method attempts to find good parameter values using a relatively small number of optimization trials.

    THE VLIW-SUPERCISC COMPILER: EXPLOITING PARALLELISM FROM C-BASED APPLICATIONS

    A common approach to decreasing embedded application execution time is creating a homogeneous parallel processor architecture. The parallelism of any such architecture is limited to the number of instructions that can be scheduled in the same cycle. This number of instructions scheduled in a cycle, or instruction-level parallelism (ILP), is limited by the ability to extract parallelism from the application. Other techniques attempt to improve performance with hardware acceleration. Often, segments of computationally intensive code are extracted and custom hardware is created to replace the software execution. This technique requires many resources and still does not address the segments of code outside of the computationally intensive kernel. To solve this problem, hardware acceleration for computationally intensive segments of code is proposed in addition to accelerating the entire application with very long instruction word (VLIW) techniques. (1) A compilation flow that targets a 4-wide VLIW processor architecture is presented. This system was used to investigate the available speed-up of VLIW architectures. The architecture was modified to combine the VLIW processor with the capability to execute application-specific customized instructions. To create the custom instruction hardware, a control and data flow graph (CDFG) framework was created, which also serves as a basis for compiler transformations and hardware generation. In order to remove control flow from segments of code selected for hardware generation, (2) the technique of hardware predication was developed. Hardware predication allows if-then and if-then-else control flow constructs to be transformed into strict data flow through the use of multiplexors. From the transformed CDFGs, (3) a VHDL generation pass was created that translates the compiler data structures into synthesizable VHDL.
The resulting architecture contains the VLIW processor and tightly coupled application-specific hardware. This architecture was analyzed for performance changes compared to the initial VLIW architecture and a traditional processor. Lastly, (4) the architecture was analyzed for power and energy savings. A post static timing pass was added to the compilation flow for the insertion of hardware to delay early switching of operations. By measuring only the execution of the hardware function and comparing the performance to the equivalent code executed in software, a performance multiplier of up to 322 times is seen when synthesized onto an Altera Stratix II ES2S180F1508C4 FPGA. The average performance increase was 63 times. For the entire application, the speedup reached nearly 30X and was on average 12X better than a single processor implementation. The power and energy required by the VLIW processor core and the hardware functions for the computational kernels after 160nm OKI standard cell ASIC synthesis show a maximum power savings of 417 times that of execution on the processor, with an average of 133 times savings in power consumption. Because energy is the product of power and execution time, the reduced execution time and the savings in power have a multiplicative effect on energy. The energy improvement is therefore several orders of magnitude for the hardware functions, with savings ranging from over 1,000X to approximately 60,000X.
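
The hardware predication step described in this abstract, in which if-then-else control flow becomes data flow through multiplexors, can be pictured with a small software analogy. This is a hedged sketch, not the compiler's actual transformation: the functions `mux`, `clamp_branching`, and `clamp_predicated` are hypothetical examples introduced here.

```python
def mux(select: bool, when_true: int, when_false: int) -> int:
    """Software stand-in for a 2-to-1 multiplexor."""
    return when_true if select else when_false


def clamp_branching(x: int, limit: int) -> int:
    # Original form: control flow decides which assignment runs.
    if x > limit:
        y = limit
    else:
        y = x + 1
    return y


def clamp_predicated(x: int, limit: int) -> int:
    # Predicated form: the condition and both arms are plain data flow
    # nodes that are always evaluated; a multiplexor selects the result.
    cond = x > limit
    then_val = limit
    else_val = x + 1
    return mux(cond, then_val, else_val)
```

In the predicated form there is no branch left to schedule, so the whole segment can be treated as one block of combinational logic. The trade-off is that both arms are always computed, which is acceptable in hardware where the two arms exist as separate circuits.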

    Dynamic task scheduling and binding for many-core systems through stream rewriting

    This thesis proposes a novel model of computation, called stream rewriting, for the specification and implementation of highly concurrent applications with a large number of dynamic tasks. The active tasks of an application and their dependencies are encoded as a token stream, which is iteratively modified at runtime by a set of rewriting rules through repeated search and replace. To estimate the performance and scalability of stream rewriting, a large number of experiments were carried out on many-core systems, and the task management was implemented in both software and hardware.