5 research outputs found

    A software-hardware hybrid steering mechanism for clustered microarchitectures

    Full text link

    A simple, verified validator for software pipelining

    Get PDF
    International audienceSoftware pipelining is a loop optimization that overlaps the execution of several iterations of a loop to expose more instruction-level parallelism. It can result in first-class performances characteristics, but at the cost of significant obfuscation of the code, making this optimization difficult to test and debug. In this paper, we present a translation validation algorithm that uses symbolic evaluation to detect semantics discrepancies between a loop and its pipelined version. Our algorithm can be implemented simply and efficiently, is provably sound, and appears to be complete with respect to most modulo scheduling algorithms. A conclusion of this case study is that it is possible and effective to use symbolic evaluation to reason about loop transformations

    Improving multithreading performance for clustered VLIW architectures.

    Get PDF
    Very Long Instruction Word (VLIW) processors are very popular in embedded and mobile computing domain. Use of VLIW processors range from Digital Signal Processors (DSPs) found in a plethora of communication and multimedia devices to Graphics Processing Units (GPUs) used in gaming and high performance computing devices. The advantage of VLIWs is their low complexity and low power design which enable high performance at a low cost. Scalability of VLIWs is limited by the scalability of register file ports. It is not viable to have a VLIW processor with a single large register file because of area and power consumption implications of the register file. Clustered VLIW solve the register file scalability issue by partitioning the register file into multiple clusters and a set of functional units that are attached to register file of that cluster. Using a clustered approach, higher issue width can be achieved while keeping the cost of register file within reasonable limits. Several commercial VLIW processors have been designed using the clustered VLIW model. VLIW processors can be used to run a larger set of applications. Many of these applications have a good Lnstruction Level Parallelism (ILP) which can be efficiently utilized. However, several applications, specially the ones that are control code dominated do not exibit good ILP and the processor is underutilized. Cache misses is another major source of resource underutiliztion. Multithreading is a popular technique to improve processor utilization. Interleaved MultiThreading (IMT) hides cache miss latencies by scheduling a different thread each cycle but cannot hide unused instructions slots. Simultaneous MultiThread (SMT) can also remove ILP under-utilization by issuing multiple threads to fill the empty instruction slots. However, SMT has a higher implementation cost than IMT. The thesis presents Cluster-level Simultaneous MultiThreading (CSMT) that supports a limited form of SMT where VLIW instructions from different threads are merged at a cluster-level granularity. This lowers the hardware implementation cost to a level comparable to the cheap IMT technique. The more complex SMT combines VLIW instructions at the individual operation-level granularity which is quite expensive especially in for a mobile solution. We refer to SMT at operation-level as OpSMT to reduce ambiguity. While previous studies restricted OpSMT on a VLIW to 2 threads, CSMT has a better scalability and upto 8 threads can be supported at a reasonable cost. The thesis proposes several other techniques to further improve CSMT performance. In particular, Cluster renaming remaps the clusters used by instructions of different threads to reduce resource conflicts. Cluster renaming is quite effective in reducing the issue-slots under-utilization and significantly improves CSMT performance.The thesis also proposes: a hybrid between IMT and CSMT which increases the number of supported threads, heterogeneous instruction merging where some instructions are combined using SMT and CSMT rest, and finally, split-issue, a technique that allows to launch partially an instruction making it easier to be combined with others

    Customizing the Computation Capabilities of Microprocessors.

    Full text link
    Designers of microprocessor-based systems must constantly improve performance and increase computational efficiency in their designs to create value. To this end, it is increasingly common to see computation accelerators in general-purpose processor designs. Computation accelerators collapse portions of an application's dataflow graph, reducing the critical path of computations, easing the burden on processor resources, and reducing energy consumption in systems. There are many problems associated with adding accelerators to microprocessors, though. Design of accelerators, architectural integration, and software support all present major challenges. This dissertation tackles these challenges in the context of accelerators targeting acyclic and cyclic patterns of computation. First, a technique to identify critical computation subgraphs within an application set is presented. This technique is hardware-cognizant and effectively generates a set of instruction set extensions given a domain of target applications. Next, several general-purpose accelerator structures are quantitatively designed using critical subgraph analysis for a broad application set. The next challenge is architectural integration of accelerators. Traditionally, software invokes accelerators by statically encoding new instructions into the application binary. This is incredibly costly, though, requiring many portions of hardware and software to be redesigned. This dissertation develops strategies to utilize accelerators, without changing the instruction set. In the proposed approach, the microarchitecture translates applications at run-time, replacing computation subgraphs with microcode to utilize accelerators. We explore the tradeoffs in performing difficult aspects of the translation at compile-time, while retaining run-time replacement. This culminates in a simple microarchitectural interface that supports a plug-and-play model for integrating accelerators into a pre-designed microprocessor. Software support is the last challenge in dealing with computation accelerators. The primary issue is difficulty in generating high-quality code utilizing accelerators. Hand-written assembly code is standard in industry, and if compiler support does exist, simple greedy algorithms are common. In this work, we investigate more thorough techniques for compiling for computation accelerators. Where greedy heuristics only explore one possible solution, the techniques in this dissertation explore the entire design space, when possible. Intelligent pruning methods ensure that compilation is both tractable and scalable.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/57633/2/ntclark_1.pd
    corecore