Improving multithreading performance for clustered VLIW architectures.
Very Long Instruction Word (VLIW) processors are very popular in the embedded and mobile computing domain. Their use ranges from Digital Signal Processors (DSPs), found in a plethora of communication and multimedia devices, to Graphics Processing Units (GPUs) used in gaming and high-performance computing devices. The advantage of VLIWs is their low-complexity, low-power design, which enables high performance at a low cost. The scalability of VLIWs, however, is limited by the scalability of register file ports: a VLIW processor with a single large register file is not viable because of the area and power consumption implications of such a register file.
Clustered VLIWs solve the register file scalability issue by partitioning the register file into multiple clusters, each pairing a smaller register file with the set of functional units attached to it. Using a clustered approach, a higher issue width can be achieved while keeping the cost of the register file within reasonable limits. Several commercial VLIW processors have been designed using the clustered VLIW model.
VLIW processors can be used to run a large set of applications. Many of these applications have good Instruction Level Parallelism (ILP) which can be efficiently exploited. However, several applications, especially control-code-dominated ones, do not exhibit good ILP, leaving the processor underutilized. Cache misses are another major source of resource underutilization. Multithreading is a popular technique to improve processor utilization. Interleaved MultiThreading (IMT) hides cache miss latencies by scheduling a different thread each cycle, but it cannot fill unused instruction slots. Simultaneous MultiThreading (SMT) can also remove ILP underutilization by issuing multiple threads to fill the empty instruction slots; however, SMT has a higher implementation cost than IMT. The thesis presents Cluster-level Simultaneous MultiThreading (CSMT), which supports a limited form of SMT where VLIW instructions from different threads are merged at cluster-level granularity. This lowers the hardware implementation cost to a level comparable to the cheap IMT technique. The more complex form of SMT combines VLIW instructions at individual operation-level granularity, which is quite expensive, especially for a mobile solution; we refer to operation-level SMT as OpSMT to reduce ambiguity. While previous studies restricted OpSMT on a VLIW to 2 threads, CSMT scales better, and up to 8 threads can be supported at a reasonable cost.
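The cluster-level merging idea above can be sketched in a few lines. This is an illustrative model, not the thesis's hardware design: a VLIW instruction is represented as a list of per-cluster operation bundles, with None marking a cluster the instruction leaves idle, and two instructions from different threads merge only when their occupied clusters are disjoint.

```python
# Hypothetical sketch of cluster-level instruction merging (CSMT).
# An instruction is a list of per-cluster bundles; None = idle cluster.

def can_merge(instr_a, instr_b):
    """Two instructions merge only if no cluster is claimed by both."""
    return all(a is None or b is None for a, b in zip(instr_a, instr_b))

def merge(instr_a, instr_b):
    """Combine two instructions cluster by cluster (assumes can_merge)."""
    return [a if a is not None else b for a, b in zip(instr_a, instr_b)]

# Thread 0 uses clusters 0-1; thread 1 uses clusters 2-3, so the two
# VLIW instructions can issue together in a single cycle.
t0 = [["add", "mul"], ["ld"], None, None]
t1 = [None, None, ["sub"], ["st"]]
if can_merge(t0, t1):
    issued = merge(t0, t1)  # all four clusters busy this cycle
```

Checking occupancy per cluster rather than per operation slot is what keeps the merge logic cheap relative to OpSMT, which would have to match individual slots.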
The thesis proposes several other techniques to further improve CSMT performance. In particular, cluster renaming remaps the clusters used by instructions of different threads to reduce resource conflicts. Cluster renaming is quite effective in reducing issue-slot underutilization and significantly improves CSMT performance. The thesis also proposes: a hybrid between IMT and CSMT which increases the number of supported threads; heterogeneous instruction merging, where some instructions are combined using OpSMT and the rest using CSMT; and finally, split-issue, a technique that allows an instruction to be partially issued, making it easier to combine with others.
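Cluster renaming can also be sketched on the same toy representation. This is an assumed illustration (the rotation-by-offset scheme and all names are ours, not necessarily the thesis's mechanism): each thread's cluster assignment is rotated by a per-thread offset, so two threads that were both compiled to favor cluster 0 stop colliding.

```python
# Hypothetical sketch of cluster renaming: rotate a thread's per-cluster
# bundles by a per-thread offset to reduce resource conflicts.

def rename_clusters(instr, offset, n_clusters):
    """Return the instruction with each bundle moved to (c + offset) mod n."""
    out = [None] * n_clusters
    for c, bundle in enumerate(instr):
        if bundle is not None:
            out[(c + offset) % n_clusters] = bundle
    return out

# Both threads were compiled to use clusters 0-1; renaming thread 1
# onto clusters 2-3 removes the conflict so CSMT can merge them.
t0 = [["add"], ["ld"], None, None]
t1 = [["sub"], ["st"], None, None]
t1_renamed = rename_clusters(t1, 2, 4)  # now uses clusters 2 and 3
```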
Development of Energy Models for Design Space Exploration of Embedded Many-Core Systems
This paper introduces a methodology to develop energy models for the design
space exploration of embedded many-core systems. The design process of such
systems can benefit from sophisticated models. Software and hardware can be
specifically optimized based on comprehensive knowledge about application
scenario and hardware behavior. The contribution of our work is an automated
framework to estimate the energy consumption at an arbitrary abstraction level
without the need to provide further information about the system. We validated
our framework with the configurable many-core system CoreVA-MPSoC. Compared to
a simulation of the CoreVA-MPSoC on gate level in a 28nm FD-SOI standard cell
technology, our framework shows an average estimation error of about 4%.
Comment: Presented at HIP3ES, 201
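A common shape for such energy models is a weighted sum of architectural event counts, with per-event costs calibrated against a gate-level reference. The sketch below is purely illustrative (the event names and picojoule costs are assumptions, not CoreVA-MPSoC data); it shows how an estimate and its relative error against a reference measurement would be computed.

```python
# Illustrative linear energy model: energy = sum(count_e * cost_e).
# Per-event costs are assumed values, as if calibrated from gate-level
# simulation of the target technology.

COSTS_PJ = {"instr": 2.1, "mem": 15.0, "noc": 30.0}  # assumed pJ/event

def estimate_energy_pj(event_counts):
    """Estimated energy in picojoules from per-event counts."""
    return sum(COSTS_PJ[e] * n for e, n in event_counts.items())

def estimation_error(estimate, reference):
    """Relative error against a reference (e.g. gate-level) measurement."""
    return abs(estimate - reference) / reference

profile = {"instr": 1_000_000, "mem": 50_000, "noc": 10_000}
energy = estimate_energy_pj(profile)  # roughly 3.15 uJ for this profile
```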
Compiler Techniques for Loosely-Coupled Multi-Cluster Architectures
No abstract available
Cooperative Data and Computation Partitioning for Decentralized Architectures.
Scalability of future wide-issue processor designs is severely hampered by the
use of centralized resources such as register files, memories and interconnect
networks. While the use of centralized resources eases both hardware design and
compiler code generation efforts, they can become performance bottlenecks as
access latencies increase with larger designs. The natural solution to this
problem is to adapt the architecture to use smaller, decentralized resources.
Decentralized architectures use smaller, faster components and exploit
distributed instruction-level parallelism across the resources. A multicluster
architecture is an example of such a decentralized processor, where subsets of
smaller register files, functional units, and memories are grouped together in a
tightly coupled unit, forming a cluster. These clusters can then be replicated
and connected together to form a scalable, high-performance architecture.
The main difficulty with decentralized architectures resides in compiler code
generation. In a centralized Very Long Instruction Word (VLIW) processor, the
compiler must statically schedule each operation to both a functional unit and a
time slot for execution. In contrast, for a decentralized multicluster VLIW,
the compiler must consider the additional effects of cluster assignment,
recognizing that communication between clusters will result in a delay penalty.
In addition, if the multicluster processor also has partitioned data memories,
the compiler has the additional task of assigning data objects to their
respective memories. The decisions of cluster, functional unit, memory, and
time slot are highly interrelated, and each can have dramatic effects on the
best choice for every other decision.
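The coupling between cluster assignment and scheduling can be made concrete with a small sketch. This is an assumed illustration, not the dissertation's algorithm: an operation whose operand lives in another cluster must absorb an inter-cluster transfer delay, so the best cluster depends on where each operand's producer was placed.

```python
# Hypothetical sketch: earliest issue cycle of an operation given the
# cluster it is assigned to. The 1-cycle inter-cluster transfer penalty
# is an assumed parameter, not a figure from the dissertation.

XFER_DELAY = 1  # assumed cycles to copy a value between clusters

def earliest_cycle(op_cluster, operands):
    """operands: list of (ready_cycle, producer_cluster) pairs."""
    cycle = 0
    for ready, cluster in operands:
        penalty = XFER_DELAY if cluster != op_cluster else 0
        cycle = max(cycle, ready + penalty)
    return cycle

# One operand is produced on cluster 0 (ready at cycle 3), the other on
# cluster 1 (ready at cycle 2): placing the op on cluster 0 issues at
# cycle 3, on cluster 1 only at cycle 4.
ops = [(3, 0), (2, 1)]
best = min(range(2), key=lambda c: earliest_cycle(c, ops))
```

Even in this two-operand toy, the best cluster choice flips if the operands' ready cycles change, which is exactly the interdependence the abstract describes.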
This dissertation addresses the issues of extracting and exploiting inherent
parallelism across decentralized resources through compiler analysis and code
generation techniques. First, a static analysis technique to partition data
objects is presented, which maps data objects to decentralized scratchpad
memories. Second, an alternative profile-guided technique for memory
partitioning is presented which can effectively map data access operations onto
distributed caches. Finally, a detailed, resource-aware partitioning algorithm
is presented which can effectively split computation operations of an
application across a set of processing elements. These partitioners work in
tandem to create a high-performance partition assignment of both memory and
computation operations for decentralized multicluster or multicore processors.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/57649/2/mchu_1.pd