Cooperative Data and Computation Partitioning for Decentralized Architectures.
Scalability of future wide-issue processor designs is severely hampered by the
use of centralized resources such as register files, memories and interconnect
networks. While centralized resources ease both hardware design and compiler
code generation, they can become performance bottlenecks as
access latencies increase with larger designs. The natural solution to this
problem is to adapt the architecture to use smaller, decentralized resources.
Decentralized architectures use smaller, faster components and exploit
distributed instruction-level parallelism across the resources. A multicluster
architecture is an example of such a decentralized processor, where subsets of
smaller register files, functional units, and memories are grouped together in a
tightly coupled unit, forming a cluster. These clusters can then be replicated
and connected together to form a scalable, high-performance architecture.
The main difficulty with decentralized architectures resides in compiler code
generation. In a centralized Very Long Instruction Word (VLIW) processor, the
compiler must statically schedule each operation to both a functional unit and a
time slot for execution. In contrast, for a decentralized multicluster VLIW,
the compiler must consider the additional effects of cluster assignment,
recognizing that communication between clusters will result in a delay penalty.
In addition, if the multicluster processor also has partitioned data memories,
the compiler has the additional task of assigning data objects to their
respective memories. The decisions of cluster, functional unit, memory, and
time slot are highly interrelated, and each can have a dramatic effect on the
best choice for every other.
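As a rough illustration of this interplay, the following sketch greedily assigns operations to clusters while charging a delay for operands produced on a different cluster. All names and the one-cycle penalty are illustrative assumptions, not the dissertation's actual algorithm.

```python
# Hypothetical sketch: greedy cluster assignment for a multicluster VLIW.
# Each operation goes to the cluster minimizing an estimated cost, where
# reading an operand produced on another cluster adds a copy penalty.

COMM_PENALTY = 1  # assumed cycles to move a value between clusters

def assign_clusters(ops, num_clusters=2):
    """ops: list of (op_name, [names of producer ops it reads])."""
    placement = {}
    load = [0] * num_clusters  # rough per-cluster issue pressure
    for name, srcs in ops:
        best, best_cost = 0, float("inf")
        for c in range(num_clusters):
            # Charge a penalty for each source living on another cluster.
            comm = sum(COMM_PENALTY for s in srcs if placement.get(s, c) != c)
            cost = load[c] + comm
            if cost < best_cost:
                best, best_cost = c, cost
        placement[name] = best
        load[best] += 1
    return placement

ops = [("a", []), ("b", []), ("c", ["a"]), ("d", ["a", "b"])]
print(assign_clusters(ops))  # {'a': 0, 'b': 1, 'c': 0, 'd': 1}
```

Note how "d", which reads from both clusters, is pulled toward whichever placement minimizes total load plus communication, illustrating why cluster, unit, and slot decisions cannot be made in isolation.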
This dissertation addresses the issues of extracting and exploiting inherent
parallelism across decentralized resources through compiler analysis and code
generation techniques. First, a static analysis technique to partition data
objects is presented, which maps data objects to decentralized scratchpad
memories. Second, an alternative profile-guided technique for memory
partitioning is presented which can effectively map data access operations onto
distributed caches. Finally, a detailed, resource-aware partitioning algorithm
is presented which can effectively split computation operations of an
application across a set of processing elements. These partitioners work in
tandem to create a high-performance partition assignment of both memory and
computation operations for decentralized multicluster or multicore processors.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/57649/2/mchu_1.pd
Software-assisted cache mechanisms for embedded systems
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Includes bibliographical references (leaves 120-135).
Embedded systems are increasingly using on-chip caches as part of their on-chip memory system. This thesis presents cache mechanisms that improve cache performance and improve data availability, leading to more predictable cache behavior. The first mechanism is an intelligent cache replacement policy that exploits information about dead data and very frequently used data. It is analyzed theoretically to show that the number of misses under intelligent replacement is guaranteed to be no more than the number of misses under traditional LRU replacement. Hardware and software-assisted mechanisms to implement intelligent cache replacement are presented and evaluated. The second mechanism is cache partitioning, which exploits disjoint access sequences that do not overlap in the memory space. A theoretical result proves that transforming an access sequence into a concatenation of disjoint access sequences is guaranteed to improve the cache hit rate. Partitioning mechanisms inspired by the concept of disjoint sequences are designed and evaluated. To evaluate these mechanisms, a profit-based analysis, annotation, and simulation framework has been implemented: it takes a compiled benchmark program and a set of program inputs and evaluates the various cache mechanisms to provide a range of possible performance-improvement scenarios, measuring cache miss rates and Instructions Per Clock (IPC).
The results show that the proposed cache mechanisms hold promise for improving cache performance and predictability with a modest increase in silicon area.
by Prabhat Jain. Ph.D.
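The intelligent-replacement idea above can be illustrated with a toy fully associative cache that prefers evicting blocks marked dead before falling back to LRU order. The `Cache` class, the trace, and the caller-supplied dead hints are all assumptions for illustration; in the thesis, dead-data information comes from hardware and software-assisted analysis.

```python
# Hypothetical sketch: LRU cache with optional dead-block hints.
# With hints enabled, a block known to be dead is evicted first,
# which can only reduce misses relative to plain LRU.

from collections import OrderedDict

class Cache:
    def __init__(self, capacity, use_dead_hints):
        self.capacity = capacity
        self.use_dead_hints = use_dead_hints
        self.blocks = OrderedDict()  # addr -> dead flag, kept in LRU order
        self.misses = 0

    def access(self, addr, dead_after=False):
        if addr in self.blocks:
            self.blocks.move_to_end(addr)  # hit: refresh LRU position
        else:
            self.misses += 1
            if len(self.blocks) >= self.capacity:
                victim = None
                if self.use_dead_hints:
                    # Prefer a block already marked dead.
                    victim = next((a for a, d in self.blocks.items() if d), None)
                if victim is None:
                    victim = next(iter(self.blocks))  # plain LRU victim
                del self.blocks[victim]
        self.blocks[addr] = dead_after

trace = [("a", False), ("b", True), ("c", False), ("a", False)]
for hints in (False, True):
    cache = Cache(2, hints)
    for addr, dead in trace:
        cache.access(addr, dead_after=dead)
    print(hints, cache.misses)  # False 4, then True 3
```

On this trace, plain LRU evicts "a" to make room for "c" and then misses on the reuse of "a", while the hint-aware policy evicts the dead block "b" instead, matching the thesis's claim that intelligent replacement incurs no more misses than LRU.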