45 research outputs found

    Cooperative Data and Computation Partitioning for Decentralized Architectures.

    Full text link
    Scalability of future wide-issue processor designs is severely hampered by the use of centralized resources such as register files, memories and interconnect networks. While the use of centralized resources eases both hardware design and compiler code generation efforts, they can become performance bottlenecks as access latencies increase with larger designs. The natural solution to this problem is to adapt the architecture to use smaller, decentralized resources. Decentralized architectures use smaller, faster components and exploit distributed instruction-level parallelism across the resources. A multicluster architecture is an example of such a decentralized processor, where subsets of smaller register files, functional units, and memories are grouped together in a tightly coupled unit, forming a cluster. These clusters can then be replicated and connected together to form a scalable, high-performance architecture. The main difficulty with decentralized architectures resides in compiler code generation. In a centralized Very Long Instruction Word (VLIW) processor, the compiler must statically schedule each operation to both a functional unit and a time slot for execution. In contrast, for a decentralized multicluster VLIW, the compiler must consider the additional effects of cluster assignment, recognizing that communication between clusters will result in a delay penalty. In addition, if the multicluster processor also has partitioned data memories, the compiler has the additional task of assigning data objects to their respective memories. Each decision, of cluster, functional unit, memory, and time slot, are highly interrelated and can have dramatic effects on the best choice for every other decision. This dissertation addresses the issues of extracting and exploiting inherent parallelism across decentralized resources through compiler analysis and code generation techniques. First, a static analysis technique to partition data objects is presented, which maps data objects to decentralized scratchpad memories. Second, an alternative profile-guided technique for memory partitioning is presented which can effectively map data access operations onto distributed caches. Finally, a detailed, resource-aware partitioning algorithm is presented which can effectively split computation operations of an application across a set of processing elements. These partitioners work in tandem to create a high-performance partition assignment of both memory and computation operations for decentralized multicluster or multicore processors.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/57649/2/mchu_1.pd

    Hardware schemes for early register release

    Get PDF
    Register files are becoming one of the critical components of current out-of-order processors in terms of delay and power consumption, since their potential to exploit instruction-level parallelism is quite related to the size and number of ports of the register file. In conventional register renaming schemes, register releasing is conservatively done only after the instruction that redefines the same register is committed. Instead, we propose a scheme that releases registers as soon as the processor knows that there will be no further use of them. We present two early releasing hardware implementations with different performance/complexity trade-offs. Detailed cycle-level simulations show either a significant speedup for a given register file size, or a reduction in register file size for a given performance level.Peer ReviewedPostprint (published version

    Compiler-Driven Reconfiguration of Multiprocessors

    Get PDF
    Hussmann M, Thies M, Kastens U, Purnaprajna M, Porrmann M, Rückert U. Compiler-Driven Reconfiguration of Multiprocessors. In: Proceedings of the Workshop on Application Specific Processors (WASP) 2007. 2007.Multiprocessors enable parallel execution of a single large application to achieve a performance improvement. An application is split at instruction, data or task level (based on the granularity), such that the overhead of partitioning is minimal. Parallelization for multiprocessors is mostly restricted to a fixed granularity. Reconfiguration enables architectural variations to allow multiple granularities of operation within a multiprocessor. This adaptability optimizes resource utilization over a fixed organization. Here, a unified hardware-software approach to design a reconfigurable multiprocessor system called QuadroCore is presented. In our holistic methodology, compiler-driven reconfiguration selects from a fixed set of modes. Each mode relies on matching program analysis to exploit the architecture efficiently. For instance, a multiprocessor may adapt to different parallelization paradigms. The compiler can determine the best execution mode for each piece of code by analyzing the parallelism in a program. A fast, singlecycle, run-time reconfiguration between these predetermined modes is enabled by executing special instructions which switch coarse-grained components like instruction decoders, ALUs and register banks. Performance is evaluated in terms of execution cycles and achieved clock frequency. First results indicate suitability especially in audio and video processing applications

    Automatic Design of Efficient Application-centric Architectures.

    Full text link
    As the market for embedded devices continues to grow, the demand for high performance, low cost, and low power computation grows as well. Many embedded applications perform computationally intensive tasks such as processing streaming video or audio, wireless communication, or speech recognition and must be implemented within tight power budgets. Typically, general purpose processors are not able to meet these performance and power requirements. Custom hardware in the form of loop accelerators are often used to execute the compute-intensive portions of these applications because they can achieve significantly higher levels of performance and power efficiency. Automated hardware synthesis from high level specifications is a key technology used in designing these accelerators, because the resulting hardware is correct by construction, easing verification and greatly decreasing time-to-market in the quickly evolving embedded domain. In this dissertation, a compiler-directed approach is used to design a loop accelerator from a C specification and a throughput requirement. The compiler analyzes the loop and generates a virtual architecture containing sufficient resources to sustain the required throughput. Next, a software pipelining scheduler maps the operations in the loop to the virtual architecture. Finally, the accelerator datapath is derived from the resulting schedule. In this dissertation, synthesis of different types of loop accelerators is investigated. First, the system for synthesizing single loop accelerators is detailed. In particular, a scheduler is presented that is aware of the effects of its decisions on the resulting hardware, and attempts to minimize hardware cost. Second, synthesis of multifunction loop accelerators, or accelerators capable of executing multiple loops, is presented. Such accelerators exploit coarse-grained hardware sharing across loops in order to reduce overall cost. Finally, synthesis of post-programmable accelerators is presented, allowing changes to be made to the software after an accelerator has been created. The tradeoffs between the flexibility, cost, and energy efficiency of these different types of accelerators are investigated. Automatically synthesized loop accelerators are capable of achieving order-of-magnitude gains in performance, area efficiency, and power efficiency over processors, and programmable accelerators allow software changes while maintaining highly efficient levels of computation.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/61644/1/fank_1.pd

    Hybrid Caching for Chip Multiprocessors Using Compiler-Based Data Classification

    Get PDF
    The high performance delivered by modern computer system keeps scaling with an increasingnumber of processors connected using distributed network on-chip. As a result, memory accesslatency, largely dominated by remote data cache access and inter-processor communication, is becoming a critical performance bottleneck. To release this problem, it is necessary to localize data access as much as possible while keep efficient on-chip cache memory utilization. Achieving this however, is application dependent and needs a keen insight into the memory access characteristics of the applications. This thesis demonstrates how using fairly simple thus inexpensive compiler analysis memory accesses can be classified into private data access and shared data access. In addition, we introduce a third classification named probably private access and demonstrate the impact of this category compared to traditional private and shared memory classification. The memory access classification information from the compiler analysis is then provided to the runtime system through a modified memory allocator and page table to facilitate a hybrid private-shared caching technique. The hybrid cache mechanism is aware of different data access classification and adopts appropriate placement and search policies accordingly to improve performance. Our analysis demonstrates that many applications have a significant amount of both private and shared data and that compiler analysis can identify the private data effectively for many applications. Experimentsresults show that the implemented hybrid caching scheme achieves 4.03% performance improvement over state of the art NUCA-base caching

    AGAMOS: A graph-based approach to modulo scheduling for clustered microarchitectures

    Get PDF
    This paper presents AGAMOS, a technique to modulo schedule loops on clustered microarchitectures. The proposed scheme uses a multilevel graph partitioning strategy to distribute the workload among clusters and reduces the number of intercluster communications at the same time. Partitioning is guided by approximate schedules (i.e., pseudoschedules), which take into account all of the constraints that influence the final schedule. To further reduce the number of intercluster communications, heuristics for instruction replication are included. The proposed scheme is evaluated using the SPECfp95 programs. The described scheme outperforms a state-of-the-art scheduler for all programs and different cluster configurations. For some configurations, the speedup obtained when using this new scheme is greater than 40 percent, and for selected programs, performance can be more than doubled.Peer ReviewedPostprint (published version

    CARS: A New Code Generation Framework for Clustered ILP Processors

    Get PDF
    Clustered ILP processors are characterized by a large number of non-centralized on-chip resources grouped into clusters. Traditional code generation schemes for these processors consist of multiple phases for cluster assignment, register allocation and instruction scheduling. Most of these approaches need additional re-scheduling phases because they often do not impose finite resource constraints in all phases of code generation. These phase-ordered solutions have several drawbacks, resulting in the generation of poor performance code. Moreover, the iterative/back-tracking algorithms used in some of these schemes have large running times. In this report we present CARS, a code generation framework for Clustered ILP processors, which combines the cluster assignment, register allocation, and instruction scheduling phases into a single code generation phase, thereby eliminating the problems associated with phase-ordered solutions. The CARS algorithm explicitly takes into account all the resource constraints at each cluster scheduling step to reduce spilling and to avoid iterative re-scheduling steps. We also present a new on-the-fly register allocation scheme developed for CARS. We describe an implementation of the proposed code generation framework and the results of a performance evaluation study using the SPEC95/2000 and MediaBench benchmarks. (Also cross-referenced as UMIACS-TR-2000-55

    Limits of a decoupled out-of-order superscalar architecture

    Get PDF

    On-the-fly tracing for data-centric computing : parallelization, workflow and applications

    Get PDF
    As data-centric computing becomes the trend in science and engineering, more and more hardware systems, as well as middleware frameworks, are emerging to handle the intensive computations associated with big data. At the programming level, it is crucial to have corresponding programming paradigms for dealing with big data. Although MapReduce is now a known programming model for data-centric computing where parallelization is completely replaced by partitioning the computing task through data, not all programs particularly those using statistical computing and data mining algorithms with interdependence can be re-factorized in such a fashion. On the other hand, many traditional automatic parallelization methods put an emphasis on formalism and may not achieve optimal performance with the given limited computing resources. In this work we propose a cross-platform programming paradigm, called on-the-fly data tracing , to provide source-to-source transformation where the same framework also provides the functionality of workflow optimization on larger applications. Using a big-data approximation computations related to large-scale data input are identified in the code and workflow and a simplified core dependence graph is built based on the computational load taking in to account big data. The code can then be partitioned into sections for efficient parallelization; and at the workflow level, optimization can be performed by adjusting the scheduling for big-data considerations, including the I/O performance of the machine. Regarding each unit in both source code and workflow as a model, this framework enables model-based parallel programming that matches the available computing resources. The techniques used in model-based parallel programming as well as the design of the software framework for both parallelization and workflow optimization as well as its implementations with multiple programming languages are presented in the dissertation. Then, the following experiments are performed to validate the framework: i) the benchmarking of parallelization speed-up using typical examples in data analysis and machine learning (e.g. naive Bayes, k-means) and ii) three real-world applications in data-centric computing with the framework are also described to illustrate the efficiency: pattern detection from hurricane and storm surge simulations, road traffic flow prediction and text mining from social media data. In the applications, it illustrates how to build scalable workflows with the framework along with performance enhancements

    Clustered VLIW architecture based on queue register files

    Get PDF
    Institute for Computing Systems ArchitectureInstruction-level parallelism (ILP) is a set of hardware and software techniques that allow parallel execution of machine operations. Superscalar architectures rely most heavily upon hardware schemes to identify parallelism among operations. Although successful in terms of performance, the hardware complexity involved might limit the scalability of this model. VLIW architectures use a different approach to exploit ILP. In this case all data dependence analyses and scheduling of operations are performed at compile time, resulting in a simpler hardware organization. This allows the inclusion of a larger number of functional units (FUs) into a single chip. IN spite of this relative simplification, the scalability of VLIW architectures can be constrained by the size and number of ports of the register file. VLIW machines often use software pipelining techniques to improve the execution of loop structures, which can increase the register pressure. Furthermore, the access time of a register file can be compromised by the number of ports, causing a negative impact on the machine cycle time. For these reasons we understand that the benefits of having parallel FUs, which have motivated the investigation of alternative machine designs. This thesis presents a scalar VLIW architecture comprising clusters of FUs and private register files. Register files organised as queue structures are used as a mechanism for inter-cluster communication, allowing the enforcement of fixed latency in the process. This scheme presents better possibilities in terms of scalability as the size of the individual register files is not determined by the total number of FUs, suggesting that the silicon area may grow only linearly with respect to the total number of FUs. However, the effectiveness of such an organization depends on the efficiency of the code partitioning strategy. We have developed an algorithm for a clustered VLIW architecture integrating both software pipelining and code partitioning in a a single procedure. Experimental results show it may allow performance levels close to an unclustered machine without communication restraints. Finally, we have developed silicon area and cycle time models to quantify the scalability of performance and cost for this class of architecture
    corecore