9 research outputs found

    A Probabilistic Approach for the System-Level Design of Multi-ASIP Platforms

    Get PDF

    Automatic Design of Efficient Application-centric Architectures.

    Full text link
    As the market for embedded devices continues to grow, the demand for high performance, low cost, and low power computation grows as well. Many embedded applications perform computationally intensive tasks such as processing streaming video or audio, wireless communication, or speech recognition and must be implemented within tight power budgets. Typically, general purpose processors are not able to meet these performance and power requirements. Custom hardware in the form of loop accelerators are often used to execute the compute-intensive portions of these applications because they can achieve significantly higher levels of performance and power efficiency. Automated hardware synthesis from high level specifications is a key technology used in designing these accelerators, because the resulting hardware is correct by construction, easing verification and greatly decreasing time-to-market in the quickly evolving embedded domain. In this dissertation, a compiler-directed approach is used to design a loop accelerator from a C specification and a throughput requirement. The compiler analyzes the loop and generates a virtual architecture containing sufficient resources to sustain the required throughput. Next, a software pipelining scheduler maps the operations in the loop to the virtual architecture. Finally, the accelerator datapath is derived from the resulting schedule. In this dissertation, synthesis of different types of loop accelerators is investigated. First, the system for synthesizing single loop accelerators is detailed. In particular, a scheduler is presented that is aware of the effects of its decisions on the resulting hardware, and attempts to minimize hardware cost. Second, synthesis of multifunction loop accelerators, or accelerators capable of executing multiple loops, is presented. Such accelerators exploit coarse-grained hardware sharing across loops in order to reduce overall cost. Finally, synthesis of post-programmable accelerators is presented, allowing changes to be made to the software after an accelerator has been created. The tradeoffs between the flexibility, cost, and energy efficiency of these different types of accelerators are investigated. Automatically synthesized loop accelerators are capable of achieving order-of-magnitude gains in performance, area efficiency, and power efficiency over processors, and programmable accelerators allow software changes while maintaining highly efficient levels of computation.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/61644/1/fank_1.pd

    Rapid Prototyping and Exploration Environment for Generating C-to-Hardware-Compilers

    Get PDF
    There is today an ever-increasing demand for more computational power coupled with a desire to minimize energy requirements. Hardware accelerators currently appear to be the best solution to this problem. While general purpose computation with GPUs seem to be very successful in this area, they perform adequately only in those cases where the data access patterns and utilized algorithms fit the underlying architecture. ASICs on the other hand can yield even better results in terms of performance and energy consumption, but are very inflexible, as they are manufactured with an application specific circuitry. Field Programmable Gate Arrays (FPGAs) represent a combination of approaches: With their application specific hardware they provide high computational power while requiring, for many applications, less energy than a CPU or a GPU. On the other hand they are far more flexible than an ASIC due to their reconfigurability. The only remaining problem is the programming of the FPGAs, as they are far more difficult to program compared to regular software. To allow common software developers, who have at best very limited knowledge in hardware design, to make use of these devices, tools were developed that take a regular high level language and generate hardware from it. Among such tools, C-to-HDL compilers are a particularly wide-spread approach. These compilers attempt to translate common C code into a hardware description language from which a datapath is generated. Most of these compilers have many restrictions for the input and differ in their underlying generated micro architecture, their scheduling method, their applied optimizations, their execution model and even their target hardware. Thus, a comparison of a certain aspect alone, like their implemented scheduling method or their generated micro architecture, is almost impossible, as they differ in so many other aspects. This work provides a survey of the existing C-to-HDL compilers and presents a new approach to evaluating and exploring different micro architectures for dynamic scheduling used by such compilers. From a mathematically formulated rule set the Triad compiler generates a backend for the Scale compiler framework, which then implements a hardware generation backend with described dynamic scheduling. While more than a factor of four slower than hardware from highly optimized compilers, this environment allows easy comparison and exploration of different rule sets and the micro architecture for the dynamically scheduled datapaths generated from them. For demonstration purposes a rule set modeling the COCOMA token flow model from the COMRADE 2.0 compiler was implemented. Multiple variants of it were explored: Savings of up to 11% of the required hardware resources were possible

    Customising compilers for customisable processors

    Get PDF
    The automatic generation of instruction set extensions to provide application-specific acceleration for embedded processors has been a productive area of research in recent years. There have been incremental improvements in the quality of the algorithms that discover and select which instructions to add to a processor. The use of automatic algorithms, however, result in instructions which are radically different from those found in conventional, human-designed, RISC or CISC ISAs. This has resulted in a gap between the hardware’s capabilities and the compiler’s ability to exploit them. This thesis proposes and investigates the use of a high-level compiler pass that uses graph-subgraph isomorphism checking to exploit these complex instructions. Operating in a separate pass permits techniques to be applied that are uniquely suited for mapping complex instructions, but unsuitable for conventional instruction selection. The existing, mature, compiler back-end can then handle the remainder of the compilation. With this method, the high-level pass was able to use 1965 different automatically produced instructions to obtain an initial average speed-up of 1.11x over 179 benchmarks evaluated on a hardware-verified cycle-accurate simulator. This result was improved following an investigation of how the produced instructions were being used by the compiler. It was established that the models the automatic tools were using to develop instructions did not take account of how well the compiler could realistically use them. Adding additional parameters to the search heuristic to account for compiler issues increased the speed-up from 1.11x to 1.24x. An alternative approach using a re-designed hardware interface was also investigated and this achieved a speed-up of 1.26x while reducing hardware and compiler complexity. A complementary, high-level, method of exploiting dual memory banks was created to increase memory bandwidth to accommodate the increased data-processing bandwidth provided by extension instructions. Finally, the compiler was considered for use in a non-conventional role where rather than generating code it is used to apply source-level transformations prior to the generation of extension instructions and thus affect the shape of the instructions that are generated

    Siirtoliipaisuarkkitehtuurin muuttuvanmittaisten käskyjen pakkaus

    Get PDF
    The Static Random-Access Memory (SRAM) modules used for embedded microprocessor devices consume a large portion of the whole system’s power. The memory module consumes static power on keeping awake and dynamic power on memory accesses. The power dissipation of the instruction memory can be limited by using code compression methods, which reduce the memory size. The compression may require the use of variable length instruction formats in the processor. The power-efficient design of variable length instruction fetch and decode units is challenging for static multiple-issue processors, because such architectures have simple hardware to begin with, as they aim for very low power consumption on embedded platforms. The power saved by using these compression approaches, which necessitate more complex logic, is easily lost on inefficient processor design. This thesis proposes an implementation for instruction template-based compression, its decompression and two instruction fetch design alternatives for variable length instruction encoding on Transport Triggered Architecture (TTA), a static multiple-issue exposed data path architecture. Both of the new fetch and decode units are integrated into the TTA-based Co-design Environment (TCE), which is a toolset for rapid designing and prototyping of processors based on TTA. The hardware description of the fetch units is verified on a register transfer level and benchmarked using the CHStone test suite. Furthermore, the fetch units are synthesized on a 40 nm standard cell Application Specific Integrated Circuit (ASIC) technology library for area, performance and power consumption measurements. The power cost of the variable length instruction support is compared to the power savings from memory reduction, which is evaluated using HP Labs’ CACTI tool. The compression approach reaches an average program size reduction of 44% at best with a set of test programs, and the total power consumption of the system is reduced. The thesis shows that the proposed variable length fetch designs are sufficiently low-power oriented for TTA processors to benefit from the code compression

    A hardware-software codesign framework for cellular computing

    Get PDF
    Until recently, the ever-increasing demand of computing power has been met on one hand by increasing the operating frequency of processors and on the other hand by designing architectures capable of exploiting parallelism at the instruction level through hardware mechanisms such as super-scalar execution. However, both these approaches seem to have reached a plateau, mainly due to issues related to design complexity and cost-effectiveness. To face the stabilization of performance of single-threaded processors, the current trend in processor design seems to favor a switch to coarser-grain parallelization, typically at the thread level. In other words, high computational power is achieved not only by a single, very fast and very complex processor, but through the parallel operation of several processors, each executing a different thread. Extrapolating this trend to take into account the vast amount of on-chip hardware resources that will be available in the next few decades (either through further shrinkage of silicon fabrication processes or by the introduction of molecular-scale devices), together with the predicted features of such devices (e.g., the impossibility of global synchronization or higher failure rates), it seems reasonable to foretell that current design techniques will not be able to cope with the requirements of next-generation electronic devices and that novel design tools and programming methods will have to be devised. A tempting source of inspiration to solve the problems implied by a massively parallel organization and inherently error-prone substrates is biology. In fact, living beings possess characteristics, such as robustness to damage and self-organization, which were shown in previous research as interesting to be implemented in hardware. For instance, it was possible to realize relatively simple systems, such as a self-repairing watch. Overall, these bio-inspired approaches seem very promising but their interest for a wider audience is problematic because their heavily hardware-oriented designs lack some of the flexibility achievable with a general purpose processor. In the context of this thesis, we will introduce a processor-grade processing element at the heart of a bio-inspired hardware system. This processor, based on a single-instruction, features some key properties that allow it to maintain the versatility required by the implementation of bio-inspired mechanisms and to realize general computation. We will also demonstrate that the flexibility of such a processor enables it to be evolved so it can be tailored to different types of applications. In the second half of this thesis, we will analyze how the implementation of a large number of these processors can be used on a hardware platform to explore various bio-inspired mechanisms. Based on an extensible platform of many FPGAs, configured as a networked structure of processors, the hardware part of this computing framework is backed by an open library of software components that provides primitives for efficient inter-processor communication and distributed computation. We will show that this dual software–hardware approach allows a very quick exploration of different ways to solve computational problems using bio-inspired techniques. In addition, we also show that the flexibility of our approach allows it to exploit replication as a solution to issues that concern standard embedded applications
    corecore